### Description

PLEASE NOTE THAT THIS COURSE HAS BEEN POSTPONED TO THE AUTUMN!

The course is intended for doctoral candidates in animal science, evolutionary biology and ecology, veterinary science and human genetics, as well as more senior researchers working in these fields.

The course will focus on quantitative (mathematical-statistical) models used to make predictions from genomic data. The aim is to predict phenotypes, for example the yield or performance of (young) animals and plants in agriculture, or disease risk in humans.

In agriculture, recording phenotypes can be costly and time-consuming (e.g. feed intake of animals or brewing quality of barley), and it is desirable to replace such recording with a prediction based on genomic information from a DNA sample. Simple versions of genomic prediction have been used earlier in cases where a single gene is known to be associated with a particular phenotype, e.g. a disease or a lethal defect. Instead of scoring a phenotype, it is easier to score a genotype and make early decisions (selection, nutrition, medication, criminal investigation) on the basis of the genotype.

It is a challenge to use genomic information for prediction in polygenic or quantitative traits, which are affected by many genes of small effect. The course centers on advanced quantitative methods that use genomic data, with a focus on plant and animal breeding.

Genomic selection was introduced in the famous paper by Meuwissen, Hayes and Goddard (2001, Genetics). The paper presented three ideas that were new to animal breeding and were later adopted by plant breeders and even in human genetics:

- building a prediction model from genetic markers by simply using all available (genome-wide) markers, without narrowing the set down to markers around a putative QTL (a locus with a major effect on the genetic variation)

- treating the prediction as the sum of the tiny allele effects of the marker genotypes, which can be considered an estimate of the genetic (breeding) value, assuming that the marker density is high enough (say, a 50K SNP panel) to track all QTL

- using cross-validation (well known in machine learning) to check the properties of the model immediately
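As a minimal sketch of these ideas, the Python snippet below fits ridge-regression effects for all markers at once, predicts each individual as the sum of its marker effects, and checks predictive ability by k-fold cross-validation. The data are simulated, and the function names (`ridge_marker_effects`, `predict`, `cross_validate`), the sample sizes and the shrinkage parameter `lam` are illustrative assumptions, not taken from the paper or the course material.

```python
import numpy as np

def ridge_marker_effects(X, y, lam):
    """Ridge estimates of all marker effects; lam is an assumed shrinkage parameter."""
    mu = y.mean()
    col_means = X.mean(axis=0)
    Xc = X - col_means  # center genotypes on the training means
    n = Xc.shape[0]
    # Dual form: an n x n system, cheaper when markers outnumber records.
    alpha = np.linalg.solve(Xc @ Xc.T + lam * np.eye(n), y - mu)
    beta = Xc.T @ alpha
    return mu, col_means, beta

def predict(X, mu, col_means, beta):
    """Prediction = mean + sum of marker genotypes times their estimated effects."""
    return mu + (X - col_means) @ beta

def cross_validate(X, y, lam, k=5, seed=0):
    """Mean correlation between predicted and observed phenotypes over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    cors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        mu, cm, beta = ridge_marker_effects(X[train_idx], y[train_idx], lam)
        y_hat = predict(X[test_idx], mu, cm, beta)
        cors.append(np.corrcoef(y_hat, y[test_idx])[0, 1])
    return float(np.mean(cors))

# Simulated toy data (assumption): 400 individuals, 500 independent SNPs coded 0/1/2.
rng = np.random.default_rng(1)
n, p = 400, 500
effect_sd = 0.05
freq = rng.uniform(0.1, 0.9, size=p)
X = rng.binomial(2, freq, size=(n, p)).astype(float)
g = X @ rng.normal(0.0, effect_sd, size=p)   # true genetic values
y = g + rng.normal(0.0, g.std(), size=n)     # heritability of about 0.5

# Ratio of residual to marker-effect variance; known here only because the data are simulated.
lam = g.var() / effect_sd**2
r_cv = cross_validate(X, y, lam)
print(f"cross-validated predictive correlation: {r_cv:.2f}")
```

In real data the shrinkage parameter is not known and would have to be estimated, for example from variance components or by tuning it within the training folds.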

The course can be completed at any time during doctoral studies.

Tentative programme

Day 1: Background on genomic prediction and genomic selection in animals and plants; simple approaches using GWAS results and introduction to mixed models for whole-genome prediction.

Day 2: Tackling the "large p, small n" problem using random (shrinkage) effects and cross-validation. Building the G-matrix and the GBLUP model.

Day 3: Details on adjustments and scaling of G-matrices, interpretation of relationships and inbreeding in the G-matrix, comparison and combination of G and A, and the single-step GBLUP model. Journal club / literature review by students. General introduction to Bayesian statistics.

Day 4: Bayesian shrinkage models: BayesA and LASSO and their hyperparameters; Bayesian variable selection models and their hyperparameters. Background on implementation of Bayesian methods using MCMC, MCMC post-analysis and convergence assessment. GBLUP and kernel methods to capture epistasis and specific combining ability.

Day 5: Estimation of variance components and genomic heritability from genomic models, multi-trait models and genomic feature models. Practical analysis details: repeated and weighted records. Presentations by students on exercise results.
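To give a flavour of the G-matrix work in the programme, the sketch below builds a genomic relationship matrix from SNP allele counts using one common construction (often referred to as VanRaden's first method): genotypes are centered by twice the allele frequency and the cross-product is scaled by 2Σp(1−p). The simulated genotypes and the function name `g_matrix` are illustrative assumptions; the course may use other scalings or software.

```python
import numpy as np

def g_matrix(M):
    """Genomic relationship matrix from an n x p matrix of allele counts (0/1/2)."""
    p_hat = M.mean(axis=0) / 2.0                  # allele frequencies estimated from the sample
    Z = M - 2.0 * p_hat                           # center each marker by twice its frequency
    scale = 2.0 * np.sum(p_hat * (1.0 - p_hat))   # expected marker variance summed over loci
    return Z @ Z.T / scale

# Simulated genotypes (assumption): 50 unrelated individuals, 2000 independent SNPs.
rng = np.random.default_rng(0)
freq = rng.uniform(0.1, 0.9, size=2000)
M = rng.binomial(2, freq, size=(50, 2000)).astype(float)

G = g_matrix(M)
# Diagonal elements minus 1 are commonly read as genomic inbreeding coefficients;
# off-diagonals are genomic relationships relative to the current sample as base.
print("mean diagonal:", round(float(np.mean(np.diag(G))), 3))
```

Because the allele frequencies are estimated from the same sample, every row of G sums to zero, which is one reason the scaling and adjustment of G receive separate attention.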

The daily classes run from 9:00 to 16:00, with coffee breaks and a lunch hour.

3 ECTS – with an option for 5 ECTS if a student returns homework and a short report on the results after the course.

At the end of the course, the student should be able to:

- describe the common uses of genomic prediction in animal and plant breeding

- analyse and discuss the statistical problems arising with large sets of predictors and common ways to handle these problems

- structure and explain the strengths and weaknesses of various statistical and computational tools for building prediction models from high-dimensional data

- apply software tools for mixed models, ridge regression, LASSO, and Bayesian and machine learning methods

- perform cross-validation studies and assess the predictive ability of models by prediction correlation and accuracy

- explain and evaluate the consequences of the data and population factors affecting predictive ability

- apply prediction tools to an empirical data set

According to the taught course content.

Pass/Fail

Participants should have a background in linear models (regression, multiple regression), preferably also in mixed models (random effects, variance components), and in basic mathematical statistics (joint, marginal and conditional distributions) and linear (matrix) algebra. Good skills in R programming are required.

Teacher: Dr Luc Janss (born 1965) received his PhD in animal breeding from Wageningen Agricultural University in 1997. He is currently working as a Senior Researcher at the Center for Quantitative Genetics and Genomics of Aarhus University (Denmark). His research is in statistical, computational and quantitative genetics of complex traits (including longitudinal and binary traits and social effects), presently focusing on the utilization of genomic information and Bayesian statistical tools. His research covers animal, plant and human genetics. He has developed several widely applied software packages. His teaching and supervision activities are extensive, and he has also given many international advanced courses in his area of expertise.

Asko Mäki-Tanila, Professor (emeritus), Animal Breeding Science, Department of Agricultural Sciences, University of Helsinki