The exam material is the same as that covered in the Fall 2019 course and is available from the course page. It is listed under the sections "Material" and "Assignments"; that is, it covers the slides, the datacenter book, and all of the articles we covered.
The Master's Programme in Data Science is responsible for the course.
The course belongs to the Data Science Methods / Basic Studies in Data Science module.
The course is available to students from other degree programmes.
Prerequisites in terms of knowledge
Good programming skills, preferably in Python
Prerequisites for students in the Data Science programme, in terms of courses
Prerequisites for other students in terms of courses
Recommended preceding courses
Data Science Project
After the course, the student:
- Knows different infrastructures and systems for large-scale data science processing
- Can compare various infrastructures and their suitability for a particular problem
- Can select the appropriate tools and environments for a particular problem
- Can justify the system design choices behind existing data science infrastructures
- Is able to implement or extend components for processing infrastructures
Recommended time/stage of studies for completion: first year of data science MS studies
Term/teaching period when the course will be offered: autumn term, Period II
In this course we will study different distributed data processing infrastructures, such as MapReduce, Spark, Petuum, and GraphLab. We will cover their basic design and operation and discuss their differences and suitability for various types of data science problems. Through reading, class discussions, and practical exercises, you will get an overview of the various systems, gain experience in their use, and learn about their designs.
The course literature consists of research articles and other online material, which will be provided during the course.
During the lectures we will cover material from research articles, and students are expected to have read the articles before each lecture so that they can participate in class discussions.
Exercises in the course will mainly focus on using the various distributed data processing infrastructures in practice and applying them to concrete data science problems. There will be weekly exercise sessions for discussing the problems and for Q&A.
Grading scale 0-5
The grade will be a combination of the course exam, mandatory course exercises, and additional exercises given during the course. Most of the weight of the grade comes from the practical and written exercises on the data processing infrastructures covered in the course.
General exams last 3 hours and 30 minutes. A renewal exam (marked with "(U)") is the first general exam after the course and also serves as a renewal of the course exam(s). In a renewal exam, the points a student has earned during the course are taken into account. Exams marked with "(HT)" are open only to students who have completed the obligatory projects or other exercises included in those courses. Exams marked with "(HT/U)" are renewal exams for students who have completed the obligatory projects during the course. General exams might cover a different area than the lectured course. Check the course web page and contact the responsible teacher if in doubt.
The course will consist of lectures, written exercises, programming exercises, and possibly other forms of teaching.
Activity during the course, possibly including mandatory attendance, will be required to pass the course.
The course can also be taken as a separate exam via self-study and possible additional exercises.