Course announcements are posted on Twitter with the hashtag #UnivHelsinkiCS_DDI17. You can join the Slack team for the course using your helsinki.fi email address.
Here you can find course slides and other materials. All exercise and assignment descriptions are only in Moodle. The book "Datacenter as a Computer" linked below is for background reference and general reading.
All course tasks are published only in Moodle. Submissions are also accepted exclusively through Moodle; submissions by email are not accepted.
Passing the course requires completing the following assignments:
1. Write 5 essays about the given scientific articles (weight 1/8 in total)
2. Complete 3 small projects with different data processing infrastructures (weight 1/4 each)
3. Write a summary report comparing the different systems you have seen and used in the course (weight 1/8)
All of the above are mandatory for passing the course. Each assignment is evaluated individually with the given weight, and the weighted total of points determines the final grade. Half of the total points are required to pass, and 5/6 of the total points earns the grade 5.
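The weighting described above can be sketched as a short calculation. This is only an illustration, not an official grade calculator: it assumes that each assignment is scored as a fraction between 0 and 1 of its available points and that the essay weight of 1/8 covers all five essays together. The names WEIGHTS, total_score, passes, and earns_grade_5 are made up for this example. The boundaries for grades 1 to 4 are not stated here, so the sketch does not compute them.

```python
from fractions import Fraction

# Weights from the assignment list: essays 1/8 (assumed to cover all five
# essays together), three projects at 1/4 each, summary report 1/8.
WEIGHTS = {
    "essays": Fraction(1, 8),
    "project_1": Fraction(1, 4),
    "project_2": Fraction(1, 4),
    "project_3": Fraction(1, 4),
    "summary_report": Fraction(1, 8),
}

PASS_THRESHOLD = Fraction(1, 2)      # half of the points are required to pass
GRADE_5_THRESHOLD = Fraction(5, 6)   # 5/6 of the points earns the grade 5


def total_score(scores):
    """Return the weighted total, given each assignment's score as a
    fraction (0 to 1) of that assignment's available points."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        # Every assignment is mandatory, so a missing one is an error here.
        raise ValueError(f"missing mandatory assignments: {sorted(missing)}")
    return sum(WEIGHTS[name] * Fraction(scores[name]) for name in WEIGHTS)


def passes(scores):
    """True if the weighted total reaches half of the points."""
    return total_score(scores) >= PASS_THRESHOLD


def earns_grade_5(scores):
    """True if the weighted total reaches 5/6 of the points."""
    return total_score(scores) >= GRADE_5_THRESHOLD


# Hypothetical student: the scores below are invented for illustration.
example = {
    "essays": Fraction(3, 4),
    "project_1": Fraction(1, 1),
    "project_2": Fraction(1, 2),
    "project_3": Fraction(3, 4),
    "summary_report": Fraction(1, 1),
}
print(total_score(example))    # 25/32
print(passes(example))         # True
print(earns_grade_5(example))  # False
```

In this invented example the weighted total is 3/32 + 18/32 + 4/32 = 25/32, which is above the passing limit of 1/2 but below the 5/6 needed for the grade 5.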
The Master's Programme in Data Science is responsible for the course.
The course belongs to the Data Science Methods / Basic Studies in Data Science module.
The course is available to students from other degree programmes.
Prerequisites are the same as entry requirements for the Data Science MS programme, specifically programming skills.
Data Science Project
After the course, the student:
- Knows different infrastructures and systems for large-scale data science processing
- Can compare various infrastructures and their suitability for a particular problem
- Can select the appropriate tools and environments for a particular problem
- Can justify the system design choices behind existing data science infrastructures
- Is able to implement or extend components for processing infrastructures
Recommended time/stage of studies for completion: first year of data science MS studies
Term/teaching period when the course will be offered: autumn term, Period II
In this course we will study different distributed data processing infrastructures, such as MapReduce, Spark, Petuum, and GraphLab. We will cover their basic design and operation and discuss their differences and suitability for various types of data science problems. Through reading, class discussions, and practical exercises, you will get an overview of the various systems, gain experience in their use, and learn about their designs.
Literature is based on research articles and other online material and will be provided during the course.
During the lectures we will cover material from research articles. Students are expected to read the articles before each lecture so that they can participate in class discussions.
Exercises in the course will mainly focus on using the various distributed data processing infrastructures in practice and applying them to concrete data science problems. There will be weekly exercise sessions for discussions around the problems and Q&A sessions.
Grading scale 0-5
The grade will be a combination of the course exam, mandatory course exercises, and additional exercises given during the course. Most of the weight of the grade comes from the practical and written exercises on the data processing infrastructures covered in the course.
The course will consist of lectures, written exercises, programming exercises, and possibly other forms of teaching.
Activity during the course, possibly including mandatory attendance, will be required to pass the course.
The course can also be taken as a separate exam via self-study and possible additional exercises.