Big data course learning goals

We are in the era of “big data”. Data sets are growing rapidly because they are increasingly gathered by cheap and numerous information-sensing mobile devices, remote sensing, software logs, cameras, microphones, and wireless sensor networks. Most big data environments go beyond relational databases and traditional data warehouse platforms. The increasing focus on big data is shaping new algorithms and techniques. This course discusses selected algorithms and systems for big data management, including the Hadoop MapReduce framework, MongoDB databases, and data sketch algorithms.

Please enrol on the course Moodle page. Most of the course information will be released in Moodle:

https://moodle.helsinki.fi/course/view.php?id=29529

Messages

Jiaheng Lu

Published, 15.8.2018 at 12:32

Hi,

Welcome to the Big Data Management course!

Please register on the Moodle page, where you will find complete information about this course, including slides, exercises, and reading materials.

Moodle page:

https://moodle.helsinki.fi/course/view.php?id=29529

We use the Piazza forum for interaction in this course, so please enrol in Piazza as well.

https://piazza.com/helsinki.fi/fall2018/data14002

Our first lecture will be held on Tue 4.9.2018, 10:15 - 12:00, in Exactum, D122. Please join us.

Best regards

Jiaheng, on behalf of the teaching team

Interaction

We use the Piazza forum for interaction in this course. Please enrol there. Thanks!

Timetable

Here is the course’s teaching schedule. Check the description for possible other schedules.

Date              Time             Location
Tue 4.9.2018      10:15 - 12:00
Mon 10.9.2018     14:15 - 16:00
Tue 11.9.2018     10:15 - 12:00
Mon 17.9.2018     14:15 - 16:00
Tue 18.9.2018     10:15 - 12:00
Mon 24.9.2018     14:15 - 16:00
Tue 25.9.2018     10:15 - 12:00
Mon 1.10.2018     14:15 - 16:00
Tue 2.10.2018     10:15 - 12:00
Mon 8.10.2018     14:15 - 16:00
Tue 9.10.2018     10:15 - 12:00
Mon 15.10.2018    14:15 - 16:00
Tue 16.10.2018    10:15 - 12:00

Other teaching

Description

The Master's Programme in Data Science is responsible for the course.

The course belongs to the CSM14000 - Software Systems study track module.

The course is available to students from other degree programmes.

Prerequisites in terms of knowledge

Good programming skills, preferably in Java. Familiarity with basic data models, such as the relational data model and semi-structured data models (e.g., JSON, XML). Knowledge of relational databases is recommended but not compulsory.

Prerequisites for students in the Data Science programme, in terms of courses

None

Prerequisites for other students in terms of courses

TKT10002 Introduction to Programming and TKT10003 Advanced Course in Programming (for good programming skills)

Recommended preceding courses

None

  • Transaction management and query optimisation
  • Big data framework
  • Distributed data framework

This course covers three topics in big data management: multiple data models (relational, XML, JSON and graph) and their operations, data sketch algorithms, and Hadoop MapReduce programming.
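As a small illustration of how the same information looks under two of these data models, the following Python sketch converts flat relational rows into nested JSON documents. The table layout, names, and course codes ("C1", "C2") are hypothetical and serve only as an example; they are not taken from the course material.

```python
import json

# Relational model: flat rows in two tables, linked by a key (student_id).
# All values below are made up for illustration.
students = [(1, "Aino"), (2, "Mikko")]                 # (student_id, name)
enrolments = [(1, "C1"), (1, "C2"), (2, "C1")]         # (student_id, course)


def to_documents(students, enrolments):
    """Join the two tables and nest each student's courses in one document."""
    documents = []
    for student_id, name in students:
        courses = [course for sid, course in enrolments if sid == student_id]
        documents.append({"id": student_id, "name": name, "courses": courses})
    return documents


# Semi-structured model: one self-contained JSON document per student.
documents = to_documents(students, enrolments)
print(json.dumps(documents, indent=2))
```

In the relational form, answering "which courses does student 1 take?" requires a join over two tables; in the document form, the answer is stored inside the student's own record.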

At the end of this course, the student will be able to:

  • Explain the challenges posed by big data
  • Understand various types of data models, including relational, XML, JSON and graph, and their operations and query languages
  • Understand data sketch techniques for handling streaming data, including Bloom filter, Count-Min, Count-Sketch and FM Sketch
  • Program with Hadoop MapReduce, based on hands-on experience
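To give a flavour of the data sketch techniques listed above, here is a minimal Count-Min sketch in Python. It is a sketch under stated assumptions, not the course's reference implementation: the class name, the default `width` and `depth`, and the use of salted SHA-256 as a stand-in for the pairwise-independent hash functions of the original algorithm are all choices made for this example.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter for a stream, backed by a depth x width table."""

    def __init__(self, width=256, depth=4):
        # width and depth are illustrative defaults, not tuned values.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Assumption: one hash function per row, obtained by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum over rows
        # is the tightest available upper bound on the true count.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


sketch = CountMinSketch()
stream = ["a"] * 100 + ["b"] * 10 + ["c"]
for element in stream:
    sketch.add(element)

print(sketch.estimate("a"), sketch.estimate("b"), sketch.estimate("c"))
```

The estimate never falls below the true count and never exceeds the total number of items added, which is why the sketch is useful when the stream is too large to store an exact counter per item.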

Recommended time/stage of studies for completion: autumn of the first or second year of Master's studies

Term/teaching period when the course will be offered: autumn term, second period. The course is offered every year.

  • Introduction to big data management
  • Data models: relational, XML and graph data models
  • Data sketches: Bloom filter, Count-Min, Count-Sketch, FM Sketch
  • MapReduce framework and Hadoop MapReduce programming
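The MapReduce programming model in the last item can be illustrated with the classic word-count example. The Python code below is a single-process simulation of the map, shuffle, and reduce phases; it does not use the Hadoop API, and the function names and input documents are assumptions made for this illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Run the three phases in sequence; a real cluster would run mappers
    # and reducers in parallel on different machines.
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


documents = ["big data is big", "data sketches summarise big data"]
counts = word_count(documents)
print(counts)
```

In Hadoop, the programmer writes only the mapper and the reducer; grouping by key, distributing work across machines, and recovering from failures are handled by the framework.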

List of associated papers:

(1) Cheikh Kacfah Emani, Nadine Cullot, Christophe Nicolle: Understandable Big Data: A survey. Computer Science Review 17: 70-81 (2015)

(2) H. V. Jagadish: Big Data and Science: Myths and Reality. Big Data Research 2(2): 49-52 (2015)

(3) Stéphane Marchand-Maillet, Birgit Hofreiter: Big Data Management and Analysis for Business Informatics - A Survey. Enterprise Modelling and Information Systems Architectures 9(1): 90-105 (2014)

(4) Ruogu Fang, Samira Pouyanfar, Yimin Yang, Shu-Ching Chen, S. S. Iyengar: Computational Health Informatics in the Big Data Age: A Survey. ACM Comput. Surv. 49(1): 12 (2016)

(5) Renzo Angles, Claudio Gutiérrez: Survey of graph database models. ACM Comput. Surv. 40(1) (2008)

(6) Graham Cormode, S. Muthukrishnan: An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1): 58-75 (2005)

(7) Moses Charikar, Kevin C. Chen, Martin Farach-Colton: Finding Frequent Items in Data Streams. ICALP 2002: 693-703

(8) Philippe Flajolet, G. Nigel Martin: Probabilistic Counting Algorithms for Data Base Applications. J. Comput. Syst. Sci. 31(2): 182-209 (1985)

The course consists of lectures, three exercises, two study groups and an exam.

Lectures: Attending lectures is not obligatory, but it is useful. Lecture notes covering key facts will be posted on the course webpage, but the lectures will include additional examples and explanations. The answers to the questions in the self-assessment forms will be explained in the lectures.

Exercises: Students should solve the problems at home and be prepared to present their solutions at the exercise session. Students are required to solve ALL problems. The exercises include a hands-on exercise on Hadoop MapReduce programming.

Study groups: Students read some material in advance and then discuss it in groups during the meeting.

Grading is based on the sum of the points from the exercises (max. 50 points) and the exam (max. 50 points). A total of 50 points is required to pass and gives the lowest passing grade, 1; 90 points or more gives the highest grade, 5.

Course exam: The exam covers the lectures (including self-assessment questions), the exercises and the study groups. The exam lasts 2.5 hours. No notes or other materials are allowed in the exam.

Renewal exam: The renewal exam requires participation in the course and can be taken only by students who have submitted answers for all three exercises.

Separate exams: The separate exams do not require course participation; the grade is based on the exam score and the hands-on exercise score. Students need to submit their answer for the hands-on exercise before the separate exam.

There will be lectures, study groups, and exercises.

Submission of all three exercises is required to attend the course exam.

A separate exam can be taken after independent study, but the hands-on exercise must be completed before the exam.