Basic information

The main goals of the course are to (i) prepare you for further studies in machine learning, and (ii) introduce you to methods and tools that are commonly used when solving machine learning problems in practice. You will gain theoretical knowledge and understanding of machine learning, as well as practical skills.

LECTURES: Attending the lectures (Wednesdays at 12-14 and Fridays at 10-12) is not compulsory, but everyone is of course encouraged to be present. We will discuss the course topics, and I will do my best to explain difficult details in a clear and understandable manner. To get the most out of the lectures, I suggest that you take a look at the course textbook in advance. See the Timetable section below for a list of topics. If you have some particularly difficult questions, feel free to send me an email before the lecture so that I have time to think of an intelligent answer.

PROBLEM SESSIONS: Taking part in one of the problem session groups each week is effectively compulsory: you will get points for solving exercises, and these contribute to your final grade for the course. The problem sessions take place on Wed 16-18, Wed 18-20, Thu 12-14, Thu 16-18, and Fri 12-14. You should attend the exercise group that you have registered for via Oodi. However, if you temporarily cannot attend your session, you may join another group in the same week.

The problem sessions work as follows: you prepare your answers in advance and, at the beginning of the session, mark the problems for which you are ready to present your solution to the class. Then, in small groups, you discuss your solutions and together try to converge on a reasonable answer. Finally, each group presents its answer to the others. The problem sessions are hosted by Moritz Lange and Henri Suominen.

EXERCISE SET 0 (PREREQUISITE KNOWLEDGE): You must complete Exercise Set 0 to pass the course. Please submit your answers as a PDF file via Moodle by 5 November 2019. The purpose of Exercise Set 0 is to check your prerequisite knowledge and indicate areas where you may need to do some self-study during the course.

Further details about practicalities are discussed in the first lecture.

GRADING: There is a maximum of 100 points in total for the course: 40 from exercises and 60 from the exam. To pass the course, you must complete Exercise Set 0 and score at least 50% on both the exercises (not counting Exercise Set 0) and the exam. You must pass Exercise Set 0 and reach the exercise threshold before taking the examination. To get full exercise points, you must have completed at least 80% of the problems (not counting Exercise Set 0). The exercise points E used in grading are given by E = MIN(40, FLOOR(X*40/(0.8*144))), where X is the total number of points from exercise sets 1-6 (max. 144); you must have E >= 20 to participate in the examination. You are allowed to bring a "cheat sheet" to the exam: one two-sided handwritten A4-sized paper on which you can write any information whatsoever. The final grade is determined by the following intervals:
50-59: grade 1
60-69: grade 2
70-79: grade 3
80-89: grade 4
90-100: grade 5
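For concreteness, the exercise-point formula above can be written out as a small Python function (a sketch; the function and parameter names are illustrative, not part of any course code):

```python
import math

def exercise_points(x, max_total=144, cap=40, full_fraction=0.8):
    """Exercise points E = min(40, floor(x * 40 / (0.8 * 144))),
    where x is the total number of points from exercise sets 1-6."""
    return min(cap, math.floor(x * cap / (full_fraction * max_total)))

# Completing 80% of the problems (0.8 * 144 = 115.2 points) already
# yields the full 40 exercise points; E >= 20 is needed to take the exam.
```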

Please use the course email alias for all course-related emails!




Kai Puolamäki

Published 4.12.2019 at 9:24

We will have two machine learning guest lectures in Exactum B123 on Wednesday 11 December at 12:15-14. The tentative programme includes two 15-20 minute presentations, followed by a discussion where you can ask the speakers questions. This year's topic is machine learning in insurance. The speakers and topics are:

Janne Kaippio: Boosted Decision Trees applied to Non-life insurance business problems
Abstract: Insurance companies model their customers' future expected cash flows (CLTV) to estimate each customer's economic value added for the company. CLTV models can be quite complex and contain several different mathematical sub-models. The presentation covers one of these sub-models, the product retention model, and how LocalTapiola applies boosted decision trees to it. The presentation also touches on model validation, model error, and model stability.
About the speaker: Janne Kaippio is the Chief Actuary for the LocalTapiola non-life companies. He is also a board member of the LocalTapiola life company. His current interests include how machine learning techniques can be applied to solve non-life insurance business problems.

Matti Heikkonen: Forecasting disability pensions
Abstract: Pension funds use models as tools to support decision making, e.g. by forecasting rare events and risks. The presentation focuses on disability pensions and vocational rehabilitation and the related practical challenges. In addition to pure modelling, I will also talk about some of the problems inherent in real-world datasets.
About the speaker: Matti Heikkonen earned his doctorate in the quantitative methods of economics in 2018. For the past one and a half years he has worked as a senior data scientist at Ilmarinen, developing models and providing statistical analyses.



The tentative schedule for the lectures is as follows:

Lecture 1: Introduction to the course, practicalities / logistics, simple examples
Lecture 2: "Ingredients of machine learning", the idea of generalisation error [Ch. 1]
Lecture 3: Linear models + evaluation [Ch. 2]
Lecture 4: Linear models + evaluation [Ch. 2 and Sec. 5.1]
Lecture 5: Classification, probabilistic methods in general [Ch. 3 and Secs. 6.1-6.2]
Lecture 6: Classification, Gaussian classifiers and naive Bayes [Ch. 3]
Lecture 7: Classification, k-NN and decision trees [Ch. 8]
Lecture 8: Support vector machines [Ch. 9]
Lecture 9: Clustering [Secs. 10.1 and 10.3]
Lecture 10: Principal component analysis / dimensionality reduction [Secs. 10.1-10.2]
Lecture 11: Special guest stars present practical applications of machine learning
Lecture 12: Resampling & ensemble methods, exam, recap topics [Secs. 3.1.2, 5.2, 8.2]

It is a good idea to read the relevant textbook chapters before the lectures. The bracketed references above indicate the main textbook sections for each planned lecture.

You can find the fixed exam dates below and in Oodi. Remember that you must register for the examinations by the given deadlines!

Wed 30.10.2019, 12:15-14:00
Fri 1.11.2019, 10:15-12:00
Wed 6.11.2019, 12:15-14:00
Fri 8.11.2019, 10:15-12:00
Wed 13.11.2019, 12:15-14:00
Fri 15.11.2019, 10:15-12:00
Wed 20.11.2019, 12:15-14:00
Fri 22.11.2019, 10:15-12:00
Wed 27.11.2019, 12:15-14:00
Fri 29.11.2019, 10:15-12:00
Wed 11.12.2019, 12:15-14:00
Fri 13.12.2019, 10:15-12:00

Other teaching

06.11. - 11.12.2019, Wed 16:15-18:00
Moritz Lange
Language of instruction: English
07.11. - 12.12.2019, Thu 12:15-14:00
Henri Suominen
Language of instruction: English
07.11. - 12.12.2019, Thu 16:15-18:00
Henri Suominen
Language of instruction: English
08.11. - 29.11.2019 and 13.12.2019, Fri 12:15-14:00
Moritz Lange
Language of instruction: English
06.11. - 11.12.2019, Wed 18:15-20:00
Moritz Lange
Language of instruction: English


The course textbook, "An Introduction to Statistical Learning" by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, is available online. We will cover the entire book except Chapter 7.

The course textbook has a companion book, "The Elements of Statistical Learning" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, which goes deeper into the theory but lacks the examples in R. It is also available online and can be used as additional reading material.

Lecture slides will appear below before the lectures. The contents will be quite similar to last year's edition, with minor modifications, so if you do not see the slides here yet you can also take a look at the 2018 course website.

Some of the lectures are made using R presentations. You can print them or create a PDF by printing the HTML file from a web browser. Some browsers seem to have issues (weird-looking output), but at least Firefox seems to work reasonably well.



Exercise Set 0: Prerequisite Knowledge

Submit your answers to Exercise Set 0: Prerequisite Knowledge via Moodle by 5 November 2019 at the latest.

Email submissions of Exercise Sets from 1 onwards

As discussed in the first lecture, solutions can be returned via email only for verifiable and valid reasons. A "valid" reason here is one that would entitle you to be absent from work, e.g., illness. Non-valid reasons include work, travel, and other studies.

You must send your solutions to the course email box (not to personal addresses!) BEFORE the exercise group. Include your answers to the exercises as a PDF attachment and give the following information in the body of the email: your (i) name and (ii) student number, (iii) the numbers of the problems you have solved, and (iv) the reason for not attending the exercise group. Submissions not adhering to these guidelines will not be counted. Note that you only need to complete 80% of the problems to get full exercise points, i.e., you can get the maximum grade even if you miss one week's exercises.

Exercise Set 1

Due 6-8 November 2019. Mark the problems which you have completed at the exercise sessions and be prepared to present the problems which you have marked. You do not need to submit the solutions via Moodle etc.

Exercise Set 2

Due 13-15 November 2019. Mark the problems which you have completed at the exercise sessions and be prepared to present the problems which you have marked. You do not need to submit the solutions via Moodle etc.


Exercise Set 3

Due 20-22 November 2019. Mark the problems which you have completed at the exercise sessions and be prepared to present the problems which you have marked. You do not need to submit the solutions via Moodle etc.

Exercise Set 4

Due 27-29 November 2019. Mark the problems which you have completed at the exercise sessions and be prepared to present the problems which you have marked. You do not need to submit the solutions via Moodle etc.

Exercise Set 5

Due 4-5 December 2019. Mark the problems which you have completed at the exercise sessions and be prepared to present the problems which you have marked. You do not need to submit the solutions via Moodle etc. Notice that there will be no exercise group on Friday 6 December due to Independence Day.

Exercise Set 6

Due 11-13 December 2019. Mark the problems which you have completed at the exercise sessions and be prepared to present the problems which you have marked. You do not need to submit the solutions via Moodle etc. This is the last exercise set.

Project for separate exam takers

If you wish to take a separate exam without having taken part in the exercise sessions, you must complete a project work. (Exercise points remain valid in the January and April 2020 exams, however.) Please contact the lecturer before starting to work on the project, unless you have already received an email from the lecturer about this.

You need to EITHER complete 50% of the problems in the problem sessions OR complete this project work before we will grade your examination. The project, however, requires substantially more effort than the problem sessions. You should therefore choose the project instead of the problem sessions only if you really cannot participate in the problem sessions.

You can find the instructions for the project in Moodle.


Data Science Master's programme

Data Science Methods module

The course is available to students from other degree programmes

Prerequisites in terms of knowledge

Basics of probability and statistics (including multivariate probability, the Bayes formula, and maximum likelihood estimators) and intermediate-level linear algebra (including multivariate calculus). Good programming skills in some language and the ability to quickly acquire the basics of a new environment (R or python/numpy/scipy). Some knowledge of data science and artificial intelligence is useful but not required.

Prerequisites for students in the Data Science programme, in terms of courses


Prerequisites for other students in terms of courses

Introduction to statistics (including multivariate probability, Bayes formula, and maximum likelihood estimators). Linear algebra and matrices I-II (including multivariate calculus). TKT10002 Introduction to Programming and TKT10003 Advanced Course in Programming (i.e., good programming skills in some language and the ability to quickly acquire the basics of a new environment (R or python/numpy/scipy)).

Recommended preceding courses

DATA11001 Introduction to Data Science and DATA15001 Introduction to Artificial Intelligence

Courses in the Machine Learning module

  • Defines and is able to explain basic concepts in machine learning (e.g. training data, feature, model selection, loss function, training error, test error, overfitting)
  • Recognises various machine learning problems and methods suitable for them: supervised vs unsupervised learning, discriminative vs generative learning paradigm, symbolic vs numeric data
  • Knows the basics of a programming environment (such as R or python/numpy/scipy) suitable for machine learning applications
  • Is able to implement at least one distance-based, one linear, and one generative classification method, and apply these to solving simple classification problems
  • Is able to implement and apply linear regression to solve simple regression problems
  • Explains the assumptions behind the machine learning methods presented in the course
  • Implements testing and cross-validation methods, and is able to apply them to evaluate the performance of machine learning methods and to perform model selection
  • Comprehends the most important clustering formalisms (distance measures, k-means clustering, hierarchical clustering)
  • Explains the idea of the k-means clustering algorithm and is able to implement it
  • Is able to implement a method for hierarchical clustering and can interpret its results
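As a flavour of the k-means outcome listed above, here is a minimal pure-Python sketch of the algorithm (an illustration only, not the course's reference implementation; the course exercises use R or python/numpy/scipy):

```python
import math

def kmeans(points, k, iters=10):
    """Minimal k-means on a list of coordinate tuples; returns (centroids, labels)."""
    centroids = list(points[:k])  # naive initialisation: use the first k points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid (Euclidean distance).
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels
```

Real implementations use better initialisation (e.g. k-means++) and stop when the assignments no longer change, but the two alternating steps are the whole idea.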

First semester (Autumn)

Typically 2nd period

  • statistical learning, models and data, evaluating performance, overfitting, bias-variance tradeoff
  • linear regression
  • classification: logistic regression, linear and quadratic discriminant analysis, naive Bayes, nearest neighbour classifier, decision trees, support vector machine
  • clustering (flat and hierarchical); k-means, agglomerative clustering
  • resampling methods (cross-validation, bootstrap), ensemble methods (bagging, random forests)
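To give a taste of the resampling item above, k-fold cross-validation is at heart just an index-splitting scheme (a hypothetical sketch; libraries such as scikit-learn provide this ready-made):

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation
    over a dataset of n examples."""
    folds = [list(range(i, n, k)) for i in range(k)]  # round-robin fold assignment
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test
```

Each example appears in exactly one test fold, so averaging the test error over the k splits uses every data point for evaluation exactly once.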

Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani: An Introduction to Statistical Learning with Applications in R, Springer, 2013.

Parts of the textbook that are required are specified on the course web page.

The course will involve weekly exercises that include both programming and other kinds of problems ("pen and paper").

Assessment and grading is based on completed exercises and a course exam. Possible other criteria will be specified on the course web page.

  • Contact teaching
  • Possible attendance requirements are specified each year on the course web page
  • Completion is based on exercises and one or more exams. Possible other methods of completion will be announced on the course web page.