Course learning objectives

We are in the era of “big data”. Data sets grow quickly because they are increasingly gathered by cheap and numerous information-sensing mobile devices, remote sensing, software logs, cameras, microphones, and wireless sensor networks. Most big data environments go beyond relational databases and traditional data warehouse platforms. The increasing focus on big data is shaping new algorithms and techniques. This course covers selected algorithms and systems for big data management, including data sketch algorithms, the Hadoop MapReduce framework, and query languages for XML and graph data.

Messages

Jiaheng Lu

Published, 21.8.2017 at 14:37

Welcome to this big data course!

Please enrol on the Moodle page (https://moodle.helsinki.fi/course/view.php?id=25255), where you can find materials and information about this course.

Timetable

Here is the course’s teaching schedule. Check the description for possible other schedules.

Date             Time            Location
Tue 31.10.2017   14:15 - 16:00
Thu 2.11.2017    10:15 - 12:00
Tue 7.11.2017    14:15 - 16:00
Thu 9.11.2017    10:15 - 12:00
Tue 14.11.2017   14:15 - 16:00
Thu 16.11.2017   10:15 - 12:00
Tue 21.11.2017   14:15 - 16:00
Thu 23.11.2017   10:15 - 12:00
Tue 28.11.2017   14:15 - 16:00
Thu 30.11.2017   10:15 - 12:00
Tue 5.12.2017    14:15 - 16:00
Thu 7.12.2017    10:15 - 12:00
Tue 12.12.2017   14:15 - 16:00
Thu 14.12.2017   10:15 - 12:00

Description

The Master's Programme in Data Science is responsible for the course.

The course belongs to the CSM14000 - Software Systems study track module.

The course is available to students from other degree programmes.

Prerequisite courses are an introductory course in programming (Concepts of Programming) and a course in mathematics (Math for CS: Discrete Math). Knowledge of relational databases is recommended but not compulsory. Java is the recommended language for Hadoop programming, but other programming languages can be used with "Hadoop Streaming" to implement the "map" and "reduce" parts of the user's program.
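
To illustrate what implementing the "map" and "reduce" parts in another language can look like, here is a minimal word-count sketch in Python. It is not taken from the course material: the function names `map_lines` and `reduce_lines` are hypothetical, and the sketch assumes the usual Hadoop Streaming convention of tab-separated key/value text records, with the mapper output sorted by key before it reaches the reducer.

```python
from itertools import groupby


def map_lines(lines):
    """Map step: emit one 'word<TAB>1' record for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_records):
    """Reduce step: sum the counts of consecutive records that share a key.

    Assumes the records are already sorted by key, so that equal keys
    arrive next to each other.
    """
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local simulation of map -> sort -> reduce on two input lines.
mapped = sorted(map_lines(["Big data big", "data sketch"]))
for record in reduce_lines(mapped):
    print(record)
```

In an actual Hadoop Streaming job, each function would typically be wrapped in its own script that reads records from standard input and prints results to standard output, and the two scripts would be passed to the job through the `-mapper` and `-reducer` options.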

  • Transaction management and query optimisation
  • Big data framework
  • Distributed data framework

This course discusses three topics related to big data management: multiple data models (relational, XML, JSON and graph) and their operations, data sketch algorithms, and Hadoop MapReduce programming.

At the end of this course, the student will:

  • Have a sound understanding of the big data challenge
  • Understand various types of data models, including relational, XML, JSON and graph, and their operations and query languages
  • Understand data sketch techniques for handling streaming data, including Bloom filter, Count-Min, Count-Sketch and FM Sketch
  • Have gained hands-on experience with Hadoop MapReduce programming

This course was unfortunately scheduled by mistake to period I and has now been moved to period II!

  • Introduction to big data management
  • Data models: relational, XML and graph data models
  • Data sketches: Bloom filter, Count-Min, Count-Sketch, FM Sketch
  • MapReduce framework and Hadoop MapReduce programming
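
As a small illustration of one of the data sketches listed above, the following Python sketch shows the basic idea of a Count-Min sketch: a fixed-size table of counters that estimates how often an item has appeared in a stream. It is an illustrative sketch under stated assumptions, not the course's implementation: the class name, the default `width` and `depth`, and the use of salted BLAKE2 hashes in place of a pairwise-independent hash family are all choices made here for simplicity.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter using depth x width integer counters."""

    def __init__(self, width=256, depth=4):
        # width and depth are arbitrary illustrative defaults.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, derived by salting BLAKE2 with the row
        # number (an assumption standing in for a pairwise-independent family).
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        """Record `count` occurrences of `item` in every row."""
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        """Return the minimum counter over all rows.

        With non-negative updates this never underestimates the true count;
        hash collisions can only push the estimate upwards.
        """
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


sketch = CountMinSketch()
for word in ["a", "b", "a", "c", "a", "b"]:
    sketch.add(word)
print(sketch.estimate("a"))  # at least 3, the true count of "a"
```

The memory used depends only on `width` and `depth`, not on the number of distinct items in the stream, which is what makes this kind of summary attractive for streaming data.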

List of associated papers:

(1) Cheikh Kacfah Emani, Nadine Cullot, Christophe Nicolle: Understandable Big Data: A survey. Computer Science Review 17: 70-81 (2015)

(2) H. V. Jagadish: Big Data and Science: Myths and Reality. Big Data Research 2(2): 49-52 (2015)

(3) Stéphane Marchand-Maillet, Birgit Hofreiter: Big Data Management and Analysis for Business Informatics - A Survey. Enterprise Modelling and Information Systems Architectures 9(1): 90-105 (2014)

(4) Ruogu Fang, Samira Pouyanfar, Yimin Yang, Shu-Ching Chen, S. S. Iyengar: Computational Health Informatics in the Big Data Age: A Survey. ACM Comput. Surv. 49(1): 12 (2016)

(5) Renzo Angles, Claudio Gutiérrez: Survey of graph database models. ACM Comput. Surv. 40(1) (2008)

(6) Graham Cormode, S. Muthukrishnan: An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1): 58-75 (2005)

(7) Moses Charikar, Kevin C. Chen, Martin Farach-Colton: Finding Frequent Items in Data Streams. ICALP 2002: 693-703

(8) Philippe Flajolet, G. Nigel Martin: Probabilistic Counting Algorithms for Data Base Applications. J. Comput. Syst. Sci. 31(2): 182-209 (1985)

The course consists of lectures, three exercises, two study groups and an exam.

Lectures: Attending lectures is not obligatory, but it is useful. Lecture notes covering the key facts will be posted on the course webpage, but the lectures will include additional examples and explanations. The answers to the questions in the self-assessment forms will be explained in the lectures.

Exercises: Students should solve the problems at home and be prepared to present their solutions at the exercise session. Students are required to solve ALL problems. The exercises include a hands-on exercise on Hadoop MapReduce programming.

Study groups: The students read some material in advance and then discuss the material in groups during the meeting.

Grading is based on the sum of the points from the exercises (max. 50 points) and the exam (max. 50 points). A total of 50 points is required to pass and gives the lowest grade, 1; 90 points or more gives the highest grade, 5.

Course exam: The exam covers the lectures (including the self-assessment questions), the exercises and the study groups. The exam lasts 2.5 hours. No notes or other material are allowed in the exam.

Renewal exam: The renewal exam requires participation in the course and can be taken only by students who have submitted answers to all three exercises.

Separate exams: The separate exams do not require course participation and the grade is based on the exam score and the hands-on exercise score. Students need to submit the answer for the hands-on exercise before the separate exam.

There will be lectures, study groups, and exercises.

Answers to all three exercises must be submitted in order to take the course exam.

A separate exam can be taken with independent study, but one should complete hands-on exercises before the exam.