Welcome to the Big Data Seminar 2020!

We are in the era of “big data”. Data sets grow fast in size because they are increasingly being gathered by cheap and numerous information-sensing mobile devices, remote sensing, software logs, cameras, microphones, and wireless sensor networks. Most big data environments go beyond relational databases and traditional data warehouse platforms. The increasing focus on collecting and analyzing big data is shaping new platforms and techniques.

This seminar will mainly discuss new research papers in different subfields of big data management, including data querying, exploration, sampling, sharing, cleansing, big data benchmarking and applications, and blockchain data management.

Note that due to the coronavirus outbreak, all campus meetings for this seminar have been canceled as of 16.3.2020. The seminar will continue online. All seminar materials are available on the Moodle page.

Timetable

Here is the course’s teaching schedule. Check the description for possible other schedules.

Date            Time
Mon 13.1.2020   12:15 - 14:00
Mon 20.1.2020   12:15 - 14:00
Mon 27.1.2020   12:15 - 14:00
Mon 3.2.2020    12:15 - 14:00
Mon 10.2.2020   12:15 - 14:00
Mon 17.2.2020   12:15 - 14:00
Mon 24.2.2020   12:15 - 14:00
Mon 9.3.2020    12:15 - 14:00
Mon 16.3.2020   12:15 - 14:00
Mon 23.3.2020   12:15 - 14:00
Mon 30.3.2020   12:15 - 14:00
Mon 6.4.2020    12:15 - 14:00
Mon 20.4.2020   12:15 - 14:00
Mon 27.4.2020   12:15 - 14:00

Description

The Master's Programme in Computer Science is responsible for the course

  • CSM14000 - Software Systems study track
  • Data Management course package CSM14300
  • Module in Data Management CSM24300

The course is available to students from other degree programmes (for example, students in the Master's Programme in Data Science).

Basic knowledge of relational databases, or equivalent knowledge, is required.

To continue with a Master's thesis in computer science related to the topic of the seminar.

Academic writing courses

Students are expected to

(1) Have a decent understanding of the challenges of big data
(2) Conduct research on one of the topics related to big data management
(3) Perform a literature review on big data management
(4) Know how to read, write, and review a technical paper
(5) Know how to present a paper

The recommended time for completion is the spring of the first or second year of the Master's programme.

The seminar is held in the spring term and is offered every year.

Big data survey (Volume, Velocity, Variety and Value)

(1) Cheikh Kacfah Emani, Nadine Cullot, Christophe Nicolle: Understandable Big Data: A survey. Computer Science Review 17: 70-81 (2015)
(2) H. V. Jagadish: Big Data and Science: Myths and Reality. Big Data Research 2(2): 49-52 (2015)

Hadoop and Spark platforms (Volume, Velocity, Variety)

(1) Juwei Shi, Jia Zou, Jiaheng Lu, Zhao Cao, Shiqiang Li, Chen Wang: MRTuner: A Toolkit to Enable Holistic Optimization for MapReduce Jobs. PVLDB 7(13): 1319-1330 (2014)
(2) Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, Fatma Özcan: Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics. PVLDB 8(13): 2110-2121 (2015)

Cloud data management (Volume, Velocity)

(1) Adam Silberstein, Russell Sears, Wenchao Zhou, Brian F. Cooper: A batch of PNUTS: experiences connecting cloud batch and serving systems. SIGMOD Conference 2011: 1101-1112
(2) Daniel J. Abadi: Data Management in the Cloud: Limitations and Opportunities. IEEE Data Eng. Bull. 32(1): 3-12 (2009)

Data sampling (Volume, Velocity)

(1) Ying Yan, Liang Jeff Chen, Zheng Zhang: Error-bounded Sampling for Analytics on Big Sparse Data. PVLDB 7(13): 1508-1519 (2014)
(2) S. Acharya, P. B. Gibbons, V. Poosala: Congressional Samples for Approximate Answering of Group-by Queries. SIGMOD Conference 2000

Graph data management (Volume, Variety)

(1) Yu Liu, Jiaheng Lu, Hua Yang, Xiaokui Xiao, Zhewei Wei: Towards Maximum Independent Sets on Massive Graphs. PVLDB 8(13): 2122-2133 (2015)
(2) Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Jiwon Seo, Jongsoo Park, M. Amber Hassaan, Shubho Sengupta, Zhaoming Yin, Pradeep Dubey: Navigating the maze of graph analytics frameworks using massive graph datasets. SIGMOD Conference 2014: 979-990
(3) Philippe Cudré-Mauroux, Sameh Elnikety: Graph Data Management Systems for New Application Domains. PVLDB 4(12): 1510-1511 (2011)

Data exploration (Volume, Variety)

(1) Marcello Buoncristiano, Giansalvatore Mecca, Elisa Quintarelli, Manuel Roveri, Donatello Santoro, Letizia Tanca: Database Challenges for Exploratory Computing. SIGMOD Record 44(2): 17-22 (2015)
(2) Stratos Idreos, Olga Papaemmanouil, Surajit Chaudhuri: Overview of Data Exploration Techniques. SIGMOD Conference 2015: 277-281

Approximate string processing (Variety)

(1) Jiaheng Lu, Chunbin Lin, Wei Wang, Chen Li, Haiyong Wang: String similarity measures and joins with synonyms. SIGMOD Conference 2013: 373-384
(2) Chen Li, Jiaheng Lu, Yiming Lu: Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE 2008: 257-266

Data cleansing (Volume, Variety and Value)

(1) Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, Yin Ye: KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing. SIGMOD Conference 2015: 1247-1261
(2) Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin: BigDansing: A System for Big Data Cleansing. SIGMOD Conference 2015: 1215-1230

Knowledge base (Volume, Variety and Value)

(1) Omkar Deshpande, Digvijay S. Lamba, Michel Tourn, Sanjib Das, Sri Subramaniam, Anand Rajaraman, Venky Harinarayan, AnHai Doan: Building, maintaining, and using knowledge bases: a report from the trenches. SIGMOD Conference 2013: 1209-1220
(2) Albert Weichselbraun, Stefan Gindl, Arno Scharl: Enriching semantic knowledge bases for opinion mining in big data applications. Knowl.-Based Syst. 69: 78-85 (2014)
(3) Maria Pershina, Mohamed Yakout, Kaushik Chakrabarti: Holistic entity matching across knowledge graphs. Big Data 2015: 1585-1590

Big data benchmark (Volume, Velocity, Variety)

(1) Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, Russell Sears: Benchmarking cloud serving systems with YCSB. SoCC 2010: 143-154
(2) Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker: A comparison of approaches to large-scale data analysis. SIGMOD Conference 2009: 165-178

Big data applications (Volume, Velocity, Variety and Value)

(1) Paul Suganthan G. C., Chong Sun, Krishna Gayatri K., Haojun Zhang, Frank Yang, Narasimhan Rampalli, Shishir Prasad, Esteban Arcaute, Ganesh Krishnan, Rohit Deep, Vijay Raghavendra, AnHai Doan: Why Big Data Industrial Systems Need Rules and What We Can Do About It. SIGMOD Conference 2015: 265-276
(2) Javier Andréu Pérez, Carmen C. Y. Poon, Robert D. Merrifield, Stephen T. C. Wong, Guang-Zhong Yang: Big Data for Health. IEEE J. Biomedical and Health Informatics 19(4): 1193-1208 (2015)
(3) Jae-Gil Lee, Minseo Kang: Geospatial Big Data: Challenges and Opportunities. Big Data Research 2(2): 74-81 (2015)
(4) Taruna Seth, Vipin Chaudhary: Big Data in Finance. Big Data - Algorithms, Analytics, and Applications 2015: 329-356
(5) Kesheng Wu, E. Wes Bethel, Ming Gu, David Leinweber, Oliver Rübel: A big data approach to analyzing market volatility. Algorithmic Finance 2(3-4): 241-267 (2013)

The seminar consists of the teacher's lectures and students' presentations and reports.

Students complete this seminar by actively participating in its work: the work methods include studying scientific sources, writing reports and giving presentations, reading the reports of other participants and evaluating them.

Grading is based on each student's own written work (1/3), oral presentation (1/3), and commentary as an opponent on the presentations and reports of others, together with general activity (1/3). To pass the seminar, a student must pass each of these components. Active attendance at seminar meetings is obligatory. At most two absences are accepted, and they will affect grading.