8.10.2018 at 09:00 - 14.12.2018 at 23:59


The Master's Programme in Data Science is responsible for the course.

The course belongs to the Data Science Methods / Basic Studies in Data Science module.

The course is available to students from other degree programmes.

Prerequisites in terms of knowledge

Good programming skills, preferably in Python

Prerequisites for students in the Data Science programme, in terms of courses


Prerequisites for other students in terms of courses

Programming course

Recommended preceding courses


Data Science Project

After the course, the student:

  • Knows different infrastructures and systems for large-scale data science processing
  • Can compare various infrastructures and their suitability for a particular problem
  • Can select the appropriate tools and environments for a particular problem
  • Can justify the system design choices behind existing data science infrastructures
  • Is able to implement or extend components for processing infrastructures

Recommended time/stage of studies for completion: first year of master's studies in data science

Term/teaching period when the course will be offered: autumn term, Period II

In this course we will study different distributed data processing infrastructures, such as MapReduce, Spark, Petuum, and GraphLab. We will cover their basic design and operation and discuss their differences and suitability for various types of data science problems. Through reading, class discussions, and practical exercises, you will get an overview of the various systems, gain experience in their use, and learn about their designs.
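As a rough illustration of the kind of processing model these infrastructures build on, the sketch below imitates the map, shuffle, and reduce phases of a MapReduce-style word count on a single machine. It is not course material and does not use any of the systems named above; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases sequentially (assumption: everything fits in memory)."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["spark and mapreduce", "spark and graphlab"]
    print(word_count(docs))
```

In a real distributed system the map and reduce calls would run in parallel on different machines and the shuffle would move data over the network; here everything runs in one process purely to show the data flow.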

The course literature consists of research articles and other online material, which will be provided during the course.

The lectures cover material from research articles. Students are expected to read the articles before each lecture so that they can participate in class discussions.

Exercises in the course will mainly focus on using the various distributed data processing infrastructures in practice and applying them to concrete data science problems. Weekly exercise sessions will be held for discussing the problems and for Q&A.

Grading scale 0-5

The grade will be a combination of the course exam, mandatory course exercises, and additional exercises given during the course. Most of the weight of the grade comes from the practical and written exercises on the data processing infrastructures covered in the course.

The course will consist of lectures, written exercises, programming exercises, and possibly other forms of teaching.

Activity during the course, possibly including mandatory attendance, will be required to pass the course.

The course can also be completed through a separate exam, based on self-study and possible additional exercises.