Open Data Science (DataCamp, R, RStudio, GitHub)

Open Data - Open Science - Data Science

Our era of data - larger than ever and complex like chaos - requires several skills from statisticians and other data scientists.

We must discover the patterns hidden behind numbers in matrices and arrays. We are not afraid of coding, recoding, programming, or modelling. We want to visualize, analyze, interpret, understand, and communicate. These are the core themes of Open Data Science (Open Data - Open Science - Data Science). And this course is THE course for learning these skills.

General learning objective:
After completing this course you will understand the principles and advantages of using open research tools with open data and understand the possibilities of reproducible research. You will know how to use R, RStudio, RMarkdown, and GitHub for these tasks and also know how to learn more of these open software tools. You will also know how to apply certain statistical methods of data science, that is, data-driven statistics.

11.12.2016 at 09:00 - 25.1.2017 at 23:59


On this course, we use the learning platform for MOOCs (Massive Open Online Courses) of the University of Helsinki.

The first IODS course is now finished, but IODS will return - in AUTUMN of 2017!

Stay tuned for more information!


There is about one week for completing each exercise (DataCamp exercises that are automatically graded and RStudio exercises that are graded using peer-review).

Deadlines are strict: late submissions cannot be accepted.

Instead of lectures, the classes are free-form workshops where the students work together and the teachers give advice when needed.

Thu 19.1.2017, 08:00 - 10:00
Thu 26.1.2017, 08:00 - 10:00
Thu 2.2.2017, 08:00 - 10:00
Thu 9.2.2017, 08:00 - 10:00
Thu 16.2.2017, 08:00 - 10:00
Thu 23.2.2017, 08:00 - 10:00
Thu 2.3.2017, 08:00 - 10:00


Welcome to the course!


1 Tools and methods for open and reproducible research
   R, RStudio, RMarkdown, GitHub
2 Regression and model validation
3 Logistic regression
4 Clustering and classification
   Discriminant analysis (DA)
   K-means clustering (KMC)
5 Dimensionality reduction techniques
   Principal component analysis (PCA)
   Multiple correspondence analysis (MCA)
6 Final assignment

Conduct of the course

Each week exercises are completed using DataCamp, RStudio, and GitHub. The course grade consists of

1) Points from DataCamp exercises (weekly)
2) Points from RStudio exercises (weekly)
3) Points from Final assignment.

DataCamp exercises are completed and automatically evaluated on the DataCamp learning platform. RStudio exercises are completed on your own computer, moved onto the web (GitHub), and then submitted and peer-reviewed in the weekly Workshop (the name of the peer-review tool of the MOOC platform).

More details on the MOOC platform.

Course overview

Primarily targeted at Doctoral students of the (Computational) Social Sciences and (Digital) Humanities, but Master's students are also welcome, and the course is suitable even for Bachelor's studies (give it a try!). Should be quite relevant stuff for anyone! No prerequisites. Will be a MOOC (Massive Open Online Course) in the future.

Introduction to Open Data Science

See the video (02:28) at

The name of this course refers to THREE BIG THEMES: 1) Open Data, 2) Open Science, and 3) Data Science. These themes are summarized briefly as follows:

1) Open Data

There are more and more open data sets available. Utilizing and sharing data is an essential skill for researchers in all fields. During this course we use open data sets from different sources and learn to prepare them for different analyses. You will explore, analyse and interpret data from real-world applications.

2) Open Science

Science strives to be open. Repeating or reproducing results is a common aim in any branch of science, but it is not always easy or simple. Sharing data is not enough for reproducibility. What is also needed is the use of openly available software tools and methods, as well as sharing your code and results. You will learn these skills during this course, using state-of-the-art tools.

3) Data Science

Data Science is the name for the data-driven world of Statistics. Nowadays, finding or collecting data is not a problem. Instead, the challenges are in extracting knowledge and discovering the patterns behind the data. This requires skills in coding, programming, and modelling, as well as visualizing and analysing. You will face all these topics on this course.

We are quite excited about this course! So come along! Together we’ll guide you through these themes.

Welcome to the course! :)

Teacher of the course:

Kimmo Vehkalahti, Univ.Lecturer, Adj.Prof. of Appl.Stats, D.Soc.Sci (Statistics)
Fellow of the Teachers' Academy
Centre for Research Methods

Assistant teachers:

Emma Kämäräinen, Tuomo Nieminen, Petteri Mäntymaa (students of Statistics/Data Science)

Kimmo Vehkalahti