Messages

Jussi Kangasharju

Published, 29.10.2019 at 13:50

The Distributed Data Infrastructures course Moodle is now open.

Timetable

Here is the course’s teaching schedule. Check the description for possible other schedules.

Date            Time            Location
Tue 29.10.2019  12:15 - 14:00
Thu 31.10.2019  12:15 - 14:00
Tue 5.11.2019   12:15 - 14:00
Thu 7.11.2019   12:15 - 14:00
Tue 12.11.2019  12:15 - 14:00
Thu 14.11.2019  12:15 - 14:00
Tue 19.11.2019  12:15 - 14:00
Thu 21.11.2019  12:15 - 14:00
Tue 26.11.2019  12:15 - 14:00
Thu 28.11.2019  12:15 - 14:00
Tue 3.12.2019   12:15 - 14:00
Thu 5.12.2019   12:15 - 14:00
Tue 10.12.2019  12:15 - 14:00
Thu 12.12.2019  12:15 - 14:00

Material

The course material consists mostly of the scientific articles we will read during the course. The list is available on Moodle and will also be posted here later. Some parts of the course relate to the book "Datacenter as a Computer", which is linked below.

Tasks

Article essays

In the course we cover the following articles. The actual links to the assignments are in Moodle.

MapReduce: Read the article "MapReduce: Simplified Data Processing on Large Clusters" by J. Dean and S. Ghemawat and write a short essay of about 500 words about it. In your essay, focus on the motivation for developing MapReduce and the main design components of the system.

In addition to purely summarizing the article, contrast it with the material we have seen from the "Datacenter as a Computer" book (presented in the first lecture week), discuss why it was natural for MapReduce to emerge at Google, and consider how it relates to other large Internet companies.

Spark: Read the article "Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing" by M. Zaharia et al., and write an essay of about 500 words about the article. In your summary, focus on the main motivation behind Spark, the key design elements and its performance compared to other competing systems.

In addition to purely summarizing the article, compare Spark with the other systems we have seen in the course and discuss their respective pros and cons.

HDFS: Read the article "The Hadoop Distributed File System" by K. Shvachko et al., and write an essay of roughly 500 words about it. In the summary, focus on the motivation, design, and main attributes of HDFS.

In addition to summarizing the article, discuss its suitability to the kinds of workloads and processing systems we have seen so far in the course, and consider its wider applicability.

Pregel: Read the article "Pregel: a system for large-scale graph processing" by G. Malewicz et al., and write an essay of around 500 words about it. In the summary, focus on the main motivations and system design aspects of Pregel.

In addition to purely summarizing the article, put it into context with the rest of the systems we have seen in the course and discuss their respective pros and cons.

PowerGraph: Read the article "PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs" by J. Gonzalez et al. and write an essay of around 500 words about it. In the summary, focus on the main motivations and system design aspects of PowerGraph.

In addition to purely summarizing the article, put it into context with the rest of the systems we have seen in the course and discuss their respective pros and cons.

TensorFlow: Read the article "TensorFlow: a system for large-scale machine learning" by M. Abadi et al. and write an essay of around 500 words about the article. Summarize the main contributions of the system, its design and the key pros and cons in terms of design tradeoffs.

In addition to purely summarizing the article, put it into context with the rest of the systems we have seen in the course and discuss their respective pros and cons.

Petuum: Read the article "Petuum: A New Platform for Distributed Machine Learning on Big Data" by E. P. Xing et al., and write an essay of around 500 words about the article. Summarize the main contributions of the system, its design and the key pros and cons in terms of design tradeoffs.

In addition to purely summarizing the article, put it into context with the rest of the systems we have seen in the course and discuss their respective pros and cons.

MillWheel: Read the article "MillWheel: fault-tolerant stream processing at internet scale" by T. Akidau et al., and write an essay of around 500 words about it. In the summary, focus on the main motivations and system design aspects of MillWheel.

In addition to purely summarizing the article, put it into context with the rest of the systems we have seen in the course and discuss their respective pros and cons.

Azure Data Lake Store: Read the article "Azure Data Lake Store: A Hyperscale Distributed File Service for Big Data Analytics" by R. Ramakrishnan et al., and write an essay of around 500 words about it. In the summary, focus on the main motivations and system design aspects of Azure Data Lake Store.

In addition to purely summarizing the article, put it into context with the rest of the systems we have seen in the course and discuss their respective pros and cons.

The Tail at Scale: Read the article "The Tail at Scale" by J. Dean and L. A. Barroso, and write an essay of around 300-350 words about it. In your essay, focus on how the lessons from large-scale Internet service deployments (as discussed in the article) can be applied to distributed data processing systems. What are the similarities and differences between these two environments?

Conduct of the course

There is no exam for the course. In order to pass the course, you have to complete the following exercises.

You have to write 10 essays based on scientific articles. You will get 1 point per essay for a summary of the article and you can get 1 additional point for providing additional insights and putting the article better into the context of the course.

For each article we will have a discussion in class and you can get 1 point for participation in the class discussion.

We will have 2 practical projects on the topics of the course and each project is worth 9 points.

Half of the maximum points are needed to pass. Returning all essays and both projects is mandatory to be eligible to pass the course. All assignment deadlines are strict and no extensions will be given.
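
The point arithmetic above can be sketched in a few lines of Python. This is only an illustration, not an official grading tool: it assumes that there is exactly one class discussion per article, that all points are simply summed, and that "half of the maximum points" means at least half of that sum. The function names `max_points` and `passes` are hypothetical.

```python
# Point values as stated in the course description.
NUM_ESSAYS = 10
ESSAY_SUMMARY_POINTS = 1   # for summarizing the article
ESSAY_INSIGHT_POINTS = 1   # for additional insights and course context
DISCUSSION_POINTS = 1      # assumption: one discussion per article
NUM_PROJECTS = 2
PROJECT_POINTS = 9


def max_points() -> int:
    """Return the maximum number of points under the assumptions above."""
    essays = NUM_ESSAYS * (ESSAY_SUMMARY_POINTS + ESSAY_INSIGHT_POINTS)
    discussions = NUM_ESSAYS * DISCUSSION_POINTS
    projects = NUM_PROJECTS * PROJECT_POINTS
    return essays + discussions + projects


def passes(points: float, essays_returned: int, projects_returned: int) -> bool:
    """Check both conditions: everything returned and at least half the points."""
    all_returned = (
        essays_returned == NUM_ESSAYS and projects_returned == NUM_PROJECTS
    )
    return all_returned and points >= max_points() / 2


if __name__ == "__main__":
    print("Maximum points:", max_points())
    print("Points needed to pass:", max_points() / 2)
```

Under these assumptions the maximum is 20 essay points, 10 discussion points, and 18 project points, and a student with enough points still does not pass if any essay or project is missing.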

Description

The Master's Programme in Data Science is responsible for the course.

The course belongs to the Data Science Methods / Basic Studies in Data Science module.

The course is available to students from other degree programmes.

Prerequisites in terms of knowledge

Good programming skills, preferably in Python

Prerequisites for students in the Data Science programme, in terms of courses

None

Prerequisites for other students in terms of courses

Programming course

Recommended preceding courses

None

Data Science Project

After the course, the student:

  • Knows different infrastructures and systems for large-scale data science processing
  • Can compare various infrastructures and their suitability for a particular problem
  • Can select the appropriate tools and environments for a particular problem
  • Can justify the system design choices behind existing data science infrastructures
  • Is able to implement or extend components for processing infrastructures

Recommended time/stage of studies for completion: first year of data science MS studies

Term/teaching period when the course will be offered: autumn term, Period II

In this course we will study different distributed data processing infrastructures, such as MapReduce, Spark, Petuum, and GraphLab. We will cover their basic design and operation and discuss their differences and suitability for various types of data science problems. Through reading, class discussions, and practical exercises, you will get an overview of the various systems, gain experience in their use, and learn about their designs.

The literature consists of research articles and other online material and will be provided during the course.

During the lectures we will cover material from research articles. Students are expected to have read the articles before the lecture so that they can participate in class discussions.

Exercises in the course will mainly focus on using the various distributed data processing infrastructures in practice and applying them to concrete data science problems. There will be weekly exercise sessions for discussions around the problems and Q&A sessions.

Grading scale 0-5

The grade will be a combination of the course exam, mandatory course exercises, and additional exercises given during the course. Most of the weight of the grade comes from the practical exercises and written exercises around the data processing infrastructures covered in the course.

The course will consist of lectures, written exercises, programming exercises, and possibly other forms of teaching.

Activity during the course, including possibly mandatory attendance, will be required to pass the course.

The course can also be taken as a separate exam via self-study and possible additional exercises.