Name                       Credits  Form of study   Time                   Location  Organiser
Data Clinic                5 Cr     Course          1.11.2019 - 17.4.2020
Corpus Clinic (LDA-H506)   5 Cr     Online course   9.11.2018 - 26.4.2019
Corpus clinic              5 Cr     Exercise group  9.11.2017 - 16.4.2018


The course belongs to the MA Programme Linguistic Diversity in the Digital Age

  • study track: language technology
  • modules: Studies in Language Technology (LDA-T3100), Essentials in Language Technology (LDA-TA500), Comprehensive specialization in Language Technology (LDA-TB500)

This is an optional course.

The course is available to students from other study tracks and degree programmes.

Prior studies or prior knowledge

  • The basic course Corpus linguistics and statistical methods (or similar knowledge and skills) is recommended as background.
  • For students working on speech corpora, the course Introduction to speech analysis (or similar knowledge and skills) is recommended either before or during the Corpus clinic.
  • This course is primarily aimed at students in humanities and social sciences. No programming skills or complicated mathematics are required.


After successfully completing the course

  • You understand why data management skills are necessary and useful in language research.
  • You know how to write a data management plan for a text or speech corpus.
  • You have a realistic idea of the stages that are required in order to prepare and analyze your corpus.
  • You feel brave enough to use some of the tools that are available for annotating, processing and analyzing your language material, e.g., automatic taggers, speech recognition tools and the R statistical environment, if these are required.
  • You have identified variables in the corpus that are relevant for your research question.
  • You have managed to produce a dataset that will help you solve your research question, or you know how to produce one manually or automatically.
  • You have obtained the practical skills and knowledge needed to complete the analysis of your data.
  • You have already generated passages of text that you can use to describe your methods in publications, e.g., in your thesis.
  • You know how to publish and share your corpus and/or your data.


Students are advised to take this course in year 2 (master level) or year 1 (doctoral level). The aim of the course is to provide you with the necessary tools and support for handling, processing and analyzing your language data. The course is useful only once you have decided on a topic for your Master’s thesis or doctoral dissertation and know what kind of language material or which corpus you will be working on.

The coursework is distributed over three periods, starting in period 2 and ending in period 4.


At the beginning of period 2 in the autumn, the group will have a kick-off meeting either online or in class. With online support from your peers and from the teacher, you will then complete learning tasks, small tutorials and hands-on assignments covering various tools and issues in language data management. For instance, you will get familiar with automatic tools that are available for parsing text or for transcribing speech. You will then focus on how to process your own data into a suitable format for further analysis and how to import your data into a spreadsheet program or a statistical environment such as R. By the end of period 2, you will be required to write a brief data management plan and to review and discuss the plans of several other participants.

During period 3, the participants will work on their theses independently and update their data management plans as required. Every participant will have one or two individual “clinic” appointments with the teacher and/or with an expert in the methods used in their thesis. The aim of these meetings is to help each student carry out the analysis.

During period 4, a final group meeting will be organized. In this meeting, the participants will briefly report on their progress regarding data management and analysis.

Activities and teaching methods that support learning

The course and all the assignments are completed online, but the support meetings can be organized either online or face to face. The online learning environment is available throughout the course for getting advice and for sharing tips and tricks with the group.

This course aims to provide peer support, useful tips and specific guidance in technical and practical issues that may turn up during language data handling and analysis. The course will not replace the supervisors of your thesis.


A list of recommended literature and other material will be provided during the course.

Assessment methods and criteria

The grade (0-5) is based on the quality of the data management plan and the other course assignments, and on active participation in course activities. Further details on the grading principles will be provided at the beginning of the course.

Recommended optional studies

Depending on the topic area of your thesis or dissertation, courses in statistical methods, speech signal processing, natural language processing, and/or programming may be useful.


Online course with assignments and tutorials, group meetings and at least one individual appointment.