Messages
Schedule
Materials
The course textbook, "An Introduction to Statistical Learning" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, is available online. We will cover the entire book except Chapter 7.
The course textbook has a companion book, "The Elements of Statistical Learning" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, which goes into the theory in more depth and lacks the examples in R. The book is also available online and can be used as additional reading material.
Lecture slides will be posted below before the lectures. The contents will be very similar to last year's edition, with minor modifications, so if you do not see the slides here yet, you can also take a look at the 2018 course web site.
Some of the lectures are made using R presentations. You can print them or create a PDF by printing the HTML file from a web browser. Some browsers seem to have issues (weird-looking output), but at least Firefox seems to work reasonably well.
Other
Exercises
Exercise Set 0: Prerequisite Knowledge
Submit your answers to Exercise Set 0: Prerequisite Knowledge via Moodle by 5 November 2019 at the latest.
Email submissions of Exercise Sets from 1 onwards
As discussed in the first lecture, solutions can be returned via email only for verifiable and valid reasons. A "valid" reason here is one that would entitle you to be absent from work, e.g., illness. Non-valid reasons include work, travel, and other studies.
You must send your solutions to the course email box ml2019@helsinki.fi (not to personal addresses!) BEFORE the exercise group. Please include your answers to the exercises as a PDF attachment and give the following information in the body of the email: (i) your name, (ii) your student number, (iii) the numbers of the problems you have solved, and (iv) the reason for not attending the exercise group. Submissions not adhering to these guidelines will not be counted. Notice that you need to complete 80% of the problems to get full exercise points, i.e., you can get the maximum grade even if you miss one week's exercises.
Exercise Set 1
Due 6-8 November 2019. Mark the problems which you have completed at the exercise sessions and be prepared to present the problems which you have marked. You do not need to submit the solutions via Moodle etc.
Exercise Set 2
Due 13-15 November 2019. Mark the problems which you have completed at the exercise sessions and be prepared to present the problems which you have marked. You do not need to submit the solutions via Moodle etc.
Exercise Set 3
Due 20-22 November 2019. Mark the problems which you have completed at the exercise sessions and be prepared to present the problems which you have marked. You do not need to submit the solutions via Moodle etc.
Exercise Set 4
Due 27-29 November 2019. Mark the problems which you have completed at the exercise sessions and be prepared to present the problems which you have marked. You do not need to submit the solutions via Moodle etc.
Exercise Set 5
Due 4-5 December 2019. Mark the problems which you have completed at the exercise sessions and be prepared to present the problems which you have marked. You do not need to submit the solutions via Moodle etc. Notice that there will be no exercise group on Friday 6 December because of Independence Day.
Exercise Set 6
Due 11-13 December 2019. Mark the problems which you have completed at the exercise sessions and be prepared to present the problems which you have marked. You do not need to submit the solutions via Moodle etc. This is the last exercise set.
Project for separate exam takers
If you wish to take a separate exam without having taken part in the exercise sessions, you must complete a project. (Exercise points are still valid in the January and April 2020 exams, however.) Please contact the lecturer before starting to work on the project, unless you have already received an email from the lecturer about this.
You need to complete EITHER 50% of the problems in the problem sessions OR this project before we will grade your exam. The project will, however, require substantially more effort than the problem sessions. You should therefore choose the project instead of the problem sessions only if you really cannot participate in the problem sessions.
You can find the instructions for the project in Moodle.
Description
Data Science Master's programme
Data Science Methods module
The course is available to students from other degree programmes
Prerequisites in terms of knowledge
Basics of probability calculus and statistics (including multivariate probability, Bayes formula, and maximum likelihood estimators) and intermediate level linear algebra (including multivariate calculus). Good programming skills in some language and the ability to quickly acquire the basics of a new environment (R or python/numpy/scipy). Some knowledge of data science and artificial intelligence is useful but not required.
Prerequisites for students in the Data Science programme, in terms of courses
None
Prerequisites for other students in terms of courses
Introduction to statistics (including multivariate probability, Bayes formula, and maximum likelihood estimators). Linear algebra and matrices I-II (including multivariate calculus). TKT10002 Introduction to Programming and TKT10003 Advanced Course in Programming (i.e., good programming skills in some language and the ability to quickly acquire the basics of a new environment (R or python/numpy/scipy)).
Recommended preceding courses
DATA11001 Introduction to Data Science and DATA15001 Introduction to Artificial Intelligence
Courses in the Machine Learning module
- Defines and is able to explain basic concepts in machine learning (e.g. training data, feature, model selection, loss function, training error, test error, overfitting)
- Recognises various machine learning problems and methods suitable for them: supervised vs unsupervised learning, discriminative vs generative learning paradigm, symbolic vs numeric data
- Knows the basics of a programming environment (such as R or python/numpy/scipy) suitable for machine learning applications
- Is able to implement at least one distance-based, one linear, and one generative classification method, and apply these to solving simple classification problems
- Is able to implement and apply linear regression to solve simple regression problems
- Explains the assumptions behind the machine learning methods presented in the course
- Implements testing and cross-validation methods, and is able to apply them to evaluate the performance of machine learning methods and to perform model selection
- Comprehends the most important clustering formalisms (distance measures, k-means clustering, hierarchical clustering)
- Explains the idea of the k-means clustering algorithm and is able to implement it
- Is able to implement a method for hierarchical clustering and can interpret its results
First semester (Autumn)
Typically 2nd period
- statistical learning, models and data, evaluating performance, overfitting, bias-variance tradeoff
- linear regression
- classification: logistic regression, linear and quadratic discriminant analysis, naive Bayes, nearest neighbour classifier, decision trees, support vector machine
- clustering (flat and hierarchical); k-means, agglomerative clustering
- resampling methods (cross-validation, bootstrap), ensemble methods (bagging, random forests)
Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani: An Introduction to Statistical Learning with Applications in R, Springer, 2013.
Parts of the textbook that are required are specified on the course web page.
The course will involve weekly exercises that include both programming and other kinds of problems ("pen and paper").
Assessment and grading are based on completed exercises and a course exam. Possible other criteria will be specified on the course web page.
- Contact teaching
- Possible attendance requirements are specified each year on the course web page
- Completion is based on exercises and one or more exams. Possible other methods of completion will be announced on the course web page.