Instruction

Name Cr Method of study Time Location Organiser
Seminar on Big Data Management 5 Cr Seminar 13.1.2020 - 27.4.2020
Seminar on Big Data Management 5 Cr Seminar 14.1.2019 - 29.4.2019
Seminar on Big Data Management 5 Cr Seminar 15.1.2018 - 30.4.2018

Target group

The Master's Programme in Computer Science is responsible for the course.

  • CSM14000 - Software Systems study track
  • Data Management course package CSM14300
  • Module in Data Management CSM24300

The course is available to students from other degree programmes (for example, students in the Master's Programme in Data Science).

Prerequisites

Basic knowledge of relational databases, or equivalent knowledge.

Learning outcomes

Students are expected to

(1) Have a sound understanding of big data challenges
(2) Conduct research on one of the topics related to big data management
(3) Perform a literature review on big data management
(4) Know how to read, write, and review a technical paper
(5) Know how to present a paper

Timing

The recommended time for completion is the spring of the first or second year of the Master's programme.

The seminar is held in the spring term and is offered every year.

Contents

We are in the era of “big data”. Data sets are growing rapidly because they are increasingly gathered by cheap and numerous information-sensing mobile devices, remote sensing, software logs, cameras, microphones, and wireless sensor networks. Most big data environments go beyond relational databases and traditional data warehouse platforms. The increasing focus on collecting and analyzing big data is shaping new platforms and techniques. This seminar will mainly discuss new research papers in different subfields of big data management, including data querying, exploration, sampling, sharing, cleansing, benchmarking, and applications.

Activities and teaching methods in support of learning

Teacher's lectures and students' presentations and reports.

Study materials

Big data survey (Volume, Velocity, Variety and Value)

(1) Cheikh Kacfah Emani, Nadine Cullot, Christophe Nicolle: Understandable Big Data: A survey. Computer Science Review 17: 70-81 (2015)
(2) H. V. Jagadish: Big Data and Science: Myths and Reality. Big Data Research 2(2): 49-52 (2015)

Hadoop and Spark platforms (Volume, Velocity, Variety)

(1) Juwei Shi, Jia Zou, Jiaheng Lu, Zhao Cao, Shiqiang Li, Chen Wang: MRTuner: A Toolkit to Enable Holistic Optimization for MapReduce Jobs. PVLDB 7(13): 1319-1330 (2014)
(2) Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, Fatma Özcan: Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics. PVLDB 8(13): 2110-2121 (2015)

Cloud data management (Volume, Velocity)

(1) Adam Silberstein, Russell Sears, Wenchao Zhou, Brian F. Cooper: A batch of PNUTS: experiences connecting cloud batch and serving systems. SIGMOD Conference 2011: 1101-1112
(2) Daniel J. Abadi: Data Management in the Cloud: Limitations and Opportunities. IEEE Data Eng. Bull. 32(1): 3-12 (2009)

Data sampling (Volume, Velocity)

(1) Ying Yan, Liang Jeff Chen, Zheng Zhang: Error-bounded Sampling for Analytics on Big Sparse Data. PVLDB 7(13): 1508-1519 (2014)
(2) S. Acharya, P. B. Gibbons, V. Poosala: Congressional samples for approximate answering of group-by queries. SIGMOD Conference 2000

Graph data management (Volume, Variety)

(1) Yu Liu, Jiaheng Lu, Hua Yang, Xiaokui Xiao, Zhewei Wei: Towards Maximum Independent Sets on Massive Graphs. PVLDB 8(13): 2122-2133 (2015)
(2) Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Jiwon Seo, Jongsoo Park, M. Amber Hassaan, Shubho Sengupta, Zhaoming Yin, Pradeep Dubey: Navigating the maze of graph analytics frameworks using massive graph datasets. SIGMOD Conference 2014: 979-990
(3) Philippe Cudré-Mauroux, Sameh Elnikety: Graph Data Management Systems for New Application Domains. PVLDB 4(12): 1510-1511 (2011)

Data exploration (Volume, Variety)

(1) Marcello Buoncristiano, Giansalvatore Mecca, Elisa Quintarelli, Manuel Roveri, Donatello Santoro, Letizia Tanca: Database Challenges for Exploratory Computing. SIGMOD Record 44(2): 17-22 (2015)
(2) Stratos Idreos, Olga Papaemmanouil, Surajit Chaudhuri: Overview of Data Exploration Techniques. SIGMOD Conference 2015: 277-281

Approximate string processing (Variety)

(1) Jiaheng Lu, Chunbin Lin, Wei Wang, Chen Li, Haiyong Wang: String similarity measures and joins with synonyms. SIGMOD Conference 2013: 373-384
(2) Chen Li, Jiaheng Lu, Yiming Lu: Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE 2008: 257-266

Data cleansing (Volume, Variety and Value)

(1) Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, Yin Ye: KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing. SIGMOD Conference 2015: 1247-1261
(2) Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin: BigDansing: A System for Big Data Cleansing. SIGMOD Conference 2015: 1215-1230

Knowledge base (Volume, Variety and Value)

(1) Omkar Deshpande, Digvijay S. Lamba, Michel Tourn, Sanjib Das, Sri Subramaniam, Anand Rajaraman, Venky Harinarayan, AnHai Doan: Building, maintaining, and using knowledge bases: a report from the trenches. SIGMOD Conference 2013: 1209-1220
(2) Albert Weichselbraun, Stefan Gindl, Arno Scharl: Enriching semantic knowledge bases for opinion mining in big data applications. Knowl.-Based Syst. 69: 78-85 (2014)
(3) Maria Pershina, Mohamed Yakout, Kaushik Chakrabarti: Holistic entity matching across knowledge graphs. Big Data 2015: 1585-1590

Big data benchmark (Volume, Velocity, Variety)

(1) Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, Russell Sears: Benchmarking cloud serving systems with YCSB. SoCC 2010: 143-154
(2) Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker: A comparison of approaches to large-scale data analysis. SIGMOD Conference 2009: 165-178

Big data applications (Volume, Velocity, Variety and Value)

(1) Paul Suganthan G. C., Chong Sun, Krishna Gayatri K., Haojun Zhang, Frank Yang, Narasimhan Rampalli, Shishir Prasad, Esteban Arcaute, Ganesh Krishnan, Rohit Deep, Vijay Raghavendra, AnHai Doan: Why Big Data Industrial Systems Need Rules and What We Can Do About It. SIGMOD Conference 2015: 265-276
(2) Javier Andréu Pérez, Carmen C. Y. Poon, Robert D. Merrifield, Stephen T. C. Wong, Guang-Zhong Yang: Big Data for Health. IEEE J. Biomedical and Health Informatics 19(4): 1193-1208 (2015)
(3) Jae-Gil Lee, Minseo Kang: Geospatial Big Data: Challenges and Opportunities. Big Data Research 2(2): 74-81 (2015)
(4) Taruna Seth, Vipin Chaudhary: Big Data in Finance. Big Data - Algorithms, Analytics, and Applications 2015: 329-356
(5) Kesheng Wu, E. Wes Bethel, Ming Gu, David Leinweber, Oliver Rübel: A big data approach to analyzing market volatility. Algorithmic Finance 2(3-4): 241-267 (2013)

Assessment practices and criteria

Students complete this seminar by actively participating in its work: the work methods include studying scientific sources, writing reports and giving presentations, reading the reports of other participants and evaluating them.

Recommended optional studies

To continue with a Master's thesis in computer science related to the topic of the seminar.

Academic writing courses

Completion methods

The grading is based on each student's own written work (1/3), oral presentation (1/3), and commentary as an opponent on the presentations and reports of others, together with general activity (1/3). To pass the seminar, a student must pass each of these components. Active attendance at seminar meetings is obligatory. Absence from at most two meetings is accepted, but absences will affect grading.