We use the Piazza forum for course discussion; please enrol yourself. Thanks!
The Master's Programme in Data Science is responsible for the course.
The course belongs to the CSM14000 - Software Systems study track module.
The course is available to students from other degree programmes.
Prerequisites in terms of knowledge
Good programming skills, preferably in Java. Familiarity with basic data models such as the relational data model and semi-structured data models (e.g., JSON, XML). Knowledge of relational databases is recommended but not compulsory.
Prerequisites for students in the Data Science programme, in terms of courses
Prerequisites for other students in terms of courses
TKT10002 Introduction to Programming and TKT10003 Advanced Course in Programming (for good programming skills)
Recommended preceding courses
- Transaction management and query optimisation
- Big data framework
- Distributed data framework
This course covers three topics in big data management: multiple data models (relational, XML, JSON and graph) and their operations, data sketch algorithms, and Hadoop MapReduce programming.
After completing this course, the student will:
- Have a sound understanding of the big data challenge
- Understand various types of data models, including relational, XML, JSON and graph, and their operations and query languages
- Understand data sketch techniques for handling streaming data, including Bloom filter, Count-Min, Count-Sketch and FM Sketch
- Have hands-on experience with Hadoop MapReduce programming
Recommended time/stage of studies for completion: autumn of the first or second year of Master's studies.
Term/teaching period when the course will be offered: autumn term, second period. The course is offered every year.
- Introduction to big data management
- Data models: relational, XML and graph data models
- Data sketches: Bloom filter, Count-Min, Count-Sketch, FM Sketch
- MapReduce framework and Hadoop MapReduce programming
List of associated papers:
(1) Cheikh Kacfah Emani, Nadine Cullot, Christophe Nicolle: Understandable Big Data: A survey. Computer Science Review 17: 70-81 (2015)
(2) H. V. Jagadish: Big Data and Science: Myths and Reality. Big Data Research 2(2): 49-52 (2015)
(3) Stéphane Marchand-Maillet, Birgit Hofreiter: Big Data Management and Analysis for Business Informatics - A Survey. Enterprise Modelling and Information Systems Architectures 9(1): 90-105 (2014)
(4) Ruogu Fang, Samira Pouyanfar, Yimin Yang, Shu-Ching Chen, S. S. Iyengar: Computational Health Informatics in the Big Data Age: A Survey. ACM Comput. Surv. 49(1): 12 (2016)
(5) Renzo Angles, Claudio Gutiérrez: Survey of graph database models. ACM Comput. Surv. 40(1) (2008)
(6) Graham Cormode, S. Muthukrishnan: An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1): 58-75 (2005)
(7) Moses Charikar, Kevin C. Chen, Martin Farach-Colton: Finding Frequent Items in Data Streams. ICALP 2002: 693-703
(8) Philippe Flajolet, G. Nigel Martin: Probabilistic Counting Algorithms for Data Base Applications. J. Comput. Syst. Sci. 31(2): 182-209 (1985)
The course consists of lectures, three exercises, two study groups and an exam.
Lectures: Attending lectures is not obligatory, but it is useful. Lecture notes covering the key facts will be posted on the course webpage, but the lectures will include additional examples and explanations. The answers to the questions in the self-assessment forms will be explained in the lectures.
Exercises: Students should solve the problems at home and be prepared to present their solutions at the exercise session. Students are required to solve ALL problems. The exercises include a hands-on exercise on Hadoop MapReduce programming.
Study groups: Students read material in advance and then discuss it in groups during the meeting.
The grading is based on the sum of the points from the exercises (max. 50 points) and the exam (max. 50 points). At least 50 points are required to pass, which gives the lowest grade, 1; 90 points or more give the highest grade, 5.
Course exam: The exam covers the lectures (including the self-assessment questions), the exercises and the study groups. The exam lasts 2.5 hours. No notes or other material are allowed in the exam.
Renewal exam: The renewal exam requires participation in the course and can be taken only by students who have submitted answers to all three exercises.
Separate exams: The separate exams do not require course participation, and the grade is based on the exam score and the hands-on exercise score. Students need to submit their answer to the hands-on exercise before the separate exam.
There will be lectures, study groups, and exercises.
Submitting all three exercises is required to take the course exam.
A separate exam can be taken on the basis of independent study, but the hands-on exercise must be completed before the exam.