Instruction
Name | Cr | Method of study | Time | Location | Organiser |
---|---|---|---|---|---|
Seminar on Big Data Management | 5 Cr | Seminar | 13.1.2020 - 27.4.2020 | | |
Seminar on Big Data Management | 5 Cr | Seminar | 14.1.2019 - 29.4.2019 | | |
Seminar on Big Data Management | 5 Cr | Seminar | 15.1.2018 - 30.4.2018 | | |
Target group
Master's Programme in Computer Science is responsible for the course
- CSM14000 - Software Systems study track
- Data Management course package CSM14300
- Module in Data Management CSM24300
The course is available to students from other degree programmes (for example, students in the Master's Programme in Data Science).
Prerequisites
Basic knowledge of relational databases, or equivalent knowledge.
Learning outcomes
Students are expected to
(1) Have a solid understanding of the challenges of big data
(2) Conduct research on one of the topics related to big data management
(3) Perform a literature review on big data management
(4) Know how to read/write/review a technical paper
(5) Know how to present a paper
Timing
The recommended time for completion is the spring of the first or second year of the Master's programme.
The seminar is held in the spring term and is offered every year.
Contents
We are in the era of “big data”. Data sets are growing rapidly because they are increasingly gathered by cheap and numerous information-sensing mobile devices, remote sensing, software logs, cameras, microphones, and wireless sensor networks. Most big data environments go beyond relational databases and traditional data warehouse platforms. The increasing focus on collecting and analyzing big data is shaping new platforms and techniques. This seminar will mainly discuss recent research papers in different subfields of big data management, including data querying, exploration, sampling, sharing, cleansing, benchmarking, and applications.
Activities and teaching methods in support of learning
Teacher's lectures and students' presentations and reports.
Study materials
Big data survey (Volume, Velocity, Variety and Value)
(1) Cheikh Kacfah Emani, Nadine Cullot, Christophe Nicolle: Understandable Big Data: A survey. Computer Science Review 17: 70-81 (2015)
(2) H. V. Jagadish: Big Data and Science: Myths and Reality. Big Data Research 2(2): 49-52 (2015)
Hadoop and Spark platforms (Volume, Velocity, Variety)
(1) Juwei Shi, Jia Zou, Jiaheng Lu, Zhao Cao, Shiqiang Li, Chen Wang: MRTuner: A Toolkit to Enable Holistic Optimization for MapReduce Jobs. PVLDB 7(13): 1319-1330 (2014)
(2) Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, Fatma Özcan: Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics. PVLDB 8(13): 2110-2121 (2015)
Cloud data management (Volume, Velocity)
(1) Adam Silberstein, Russell Sears, Wenchao Zhou, Brian F. Cooper: A batch of PNUTS: experiences connecting cloud batch and serving systems. SIGMOD Conference 2011: 1101-1112
(2) Daniel J. Abadi: Data Management in the Cloud: Limitations and Opportunities. IEEE Data Eng. Bull. 32(1): 3-12 (2009)
Data sampling (Volume, Velocity)
(1) Ying Yan, Liang Jeff Chen, Zheng Zhang: Error-bounded Sampling for Analytics on Big Sparse Data. PVLDB 7(13): 1508-1519 (2014)
(2) S. Acharya, P. B. Gibbons, V. Poosala: Congressional samples for approximate answering of group-by queries. SIGMOD Conference 2000
Graph data management (Volume, Variety)
(1) Yu Liu, Jiaheng Lu, Hua Yang, Xiaokui Xiao, Zhewei Wei: Towards Maximum Independent Sets on Massive Graphs. PVLDB 8(13): 2122-2133 (2015)
(2) Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Jiwon Seo, Jongsoo Park, M. Amber Hassaan, Shubho Sengupta, Zhaoming Yin, Pradeep Dubey: Navigating the maze of graph analytics frameworks using massive graph datasets. SIGMOD Conference 2014: 979-990
(3) Philippe Cudré-Mauroux, Sameh Elnikety: Graph Data Management Systems for New Application Domains. PVLDB 4(12): 1510-1511 (2011)
Data exploration (Volume, Variety)
(1) Marcello Buoncristiano, Giansalvatore Mecca, Elisa Quintarelli, Manuel Roveri, Donatello Santoro, Letizia Tanca: Database Challenges for Exploratory Computing. SIGMOD Record 44(2): 17-22 (2015)
(2) Stratos Idreos, Olga Papaemmanouil, Surajit Chaudhuri: Overview of Data Exploration Techniques. SIGMOD Conference 2015: 277-281
Approximate string processing (Variety)
(1) Jiaheng Lu, Chunbin Lin, Wei Wang, Chen Li, Haiyong Wang: String similarity measures and joins with synonyms. SIGMOD Conference 2013: 373-384
(2) Chen Li, Jiaheng Lu, Yiming Lu: Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE 2008: 257-266
Data cleansing (Volume, Variety and Value)
(1) Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, Yin Ye: KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing. SIGMOD Conference 2015: 1247-1261
(2) Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin: BigDansing: A System for Big Data Cleansing. SIGMOD Conference 2015: 1215-1230
Knowledge base (Volume, Variety and Value)
(1) Omkar Deshpande, Digvijay S. Lamba, Michel Tourn, Sanjib Das, Sri Subramaniam, Anand Rajaraman, Venky Harinarayan, AnHai Doan: Building, maintaining, and using knowledge bases: a report from the trenches. SIGMOD Conference 2013: 1209-1220
(2) Albert Weichselbraun, Stefan Gindl, Arno Scharl: Enriching semantic knowledge bases for opinion mining in big data applications. Knowl.-Based Syst. 69: 78-85 (2014)
(3) Maria Pershina, Mohamed Yakout, Kaushik Chakrabarti: Holistic entity matching across knowledge graphs. Big Data 2015: 1585-1590
Big data benchmark (Volume, Velocity, Variety)
(1) Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, Russell Sears: Benchmarking cloud serving systems with YCSB. SoCC 2010: 143-154
(2) Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker: A comparison of approaches to large-scale data analysis. SIGMOD Conference 2009: 165-178
Big data applications (Volume, Velocity, Variety and Value)
(1) Paul Suganthan G. C., Chong Sun, Krishna Gayatri K., Haojun Zhang, Frank Yang, Narasimhan Rampalli, Shishir Prasad, Esteban Arcaute, Ganesh Krishnan, Rohit Deep, Vijay Raghavendra, AnHai Doan: Why Big Data Industrial Systems Need Rules and What We Can Do About It. SIGMOD Conference 2015: 265-276
(2) Javier Andréu Pérez, Carmen C. Y. Poon, Robert D. Merrifield, Stephen T. C. Wong, Guang-Zhong Yang: Big Data for Health. IEEE J. Biomedical and Health Informatics 19(4): 1193-1208 (2015)
(3) Jae-Gil Lee, Minseo Kang: Geospatial Big Data: Challenges and Opportunities. Big Data Research 2(2): 74-81 (2015)
(4) Taruna Seth, Vipin Chaudhary: Big Data in Finance. Big Data - Algorithms, Analytics, and Applications 2015: 329-356
(5) Kesheng Wu, E. Wes Bethel, Ming Gu, David Leinweber, Oliver Rübel: A big data approach to analyzing market volatility. Algorithmic Finance 2(3-4): 241-267 (2013)
Assessment practices and criteria
Students complete this seminar by actively participating in its work: the work methods include studying scientific sources, writing reports and giving presentations, reading the reports of other participants and evaluating them.
Recommended optional studies
To continue with a Master's thesis in computer science related to the topic of the seminar.
Academic writing courses