Mietta Lennes, CC BY 4.0

Learn to prepare and analyze your language data

This course will help you in various stages of managing, annotating and analyzing your language material, e.g., text or speech.

This course is intended to help you with the most concrete and practical problems you may face when you start working on your MA thesis or PhD project: how to design your study, where to find suitable tools and how to use them, how to annotate your material (in case you need to), how to convert and read your data into a statistical program, how to begin your analysis, how to make sure you can publish and re-use your data and results, etc.

The course is mainly focussed on handling research material and data that contain language in some form. You may enrol for this course if you already know which language material, or what kind of language material, you will be using for your Master's Thesis or PhD dissertation. For instance, you may be using text documents or audio and video recordings of spoken language. Your data may come from existing language corpora, from an archive, from the Internet, or you may have collected it from some other source. Your research topic need not be linguistic, however. Students of all fields are welcome!

NB: In the study year 2019-20, the course is jointly organized by FIN-CLARIN and HELDIG. After successfully passing the first part of the course, each participant will have an opportunity to meet with an expert from the network of Data Advisors provided by HELDIG and FIN-CLARIN.

Please note that the number of participants is restricted. If too many students enrol, priority will be given first to students in the LingDA Master's Program at the University of Helsinki and second to those students who have attended the first meeting (either online or in person).

If space allows, it is possible to participate in the course from another Finnish university. You can join the meetings online. After passing the course, students from outside the University of Helsinki will receive a written certificate that they can present at their own institution in order to receive the credits. Before enrolling, you should check with your local supervisor whether this course can be included in your degree.


All students need to enrol for this course:
- If you are a student at the University of Helsinki, please enrol via WebOodi as usual. After that, you can go to the course area in Moodle. Before the course starts, you should fill in the enrolment survey that is available on Moodle. NB: ENROLMENT PERIOD HAS BEEN EXTENDED ON WEBOODI UNTIL 15.11.2019!
- If you wish to participate from outside the University of Helsinki, you cannot enrol via the WebOodi link on this page. However, you can go directly to the course area in Moodle and fill in the enrolment survey.

The Moodle area is now open for students at https://moodle.helsinki.fi/mod/feedback/view.php?id=1709275.

How to access Moodle:
- To log in from Finnish universities outside Helsinki, please use the HAKA login link.
- To log in from universities outside Finland, please try the eduGAIN login link and the user account provided by your home university. If you are unable to log in, please contact the teacher of the course so that we can arrange access for you.

24.9.2019 at 09:00 - 15.11.2019 at 23:59


Two face-to-face meetings have been pre-scheduled: one at the beginning of the course, and another at the end. It is preferable to participate in these meetings on site, but if this is impossible for you, you will be given the option to participate online. During the spring term, each participant will have at least one personal support meeting with the teacher.

We can set up additional meetings if required. Normally, these will be arranged online.

Fri 1.11.2019
10:15 - 11:45
Fri 17.4.2020
10:15 - 12:45


All materials will be provided via the Moodle course area.


The course belongs to the MA Programme Linguistic Diversity in the Digital Age

  • study track: language technology
  • modules: Studies in Language Technology (LDA-T3100), Essentials in Language Technology (LDA-TA500), Comprehensive specialization in Language Technology (LDA-TB500)

This is an optional course.

The course is available to students from other study tracks and degree programmes.

  • The basic course Corpus linguistics and statistical methods (or similar knowledge and skills) is recommended as background.
  • For students working on speech corpora, the course Introduction to speech analysis (or similar knowledge and skills) is recommended either before or during the Corpus clinic.
  • This course is primarily aimed at students in humanities and social sciences. No programming skills or complicated mathematics are required.

Depending on the topic area of your thesis or dissertation, courses in statistical methods, speech signal processing, natural language processing, and/or programming may be useful.

After successfully completing the course

  • You understand why data management skills are necessary and useful in language research.
  • You know how to write a data management plan for a text or speech corpus.
  • You have a realistic idea of the stages that are required in order to prepare and analyze your corpus.
  • You feel brave enough to use some of the tools that are available for annotating, processing and analyzing your language material, e.g., automatic taggers, speech recognition tools and the R statistical environment, if these are required.
  • You have identified variables in the corpus that are relevant for your research question.
  • You have managed to produce a dataset that will help you solve your research question, or you know how to produce one manually or automatically.
  • You have obtained the practical skills and knowledge needed to complete the analysis of your data.
  • You have already generated passages of text that you can use to describe your methods in publications, e.g., in your thesis.
  • You know how to publish and share your corpus and/or your data.

Students are advised to take this course in year 2 (master level) or year 1 (doctoral level). The aim of the course is to provide you with the necessary tools and support for handling, processing and analyzing your language data. The course is useful only after you have decided on a topic for your Master’s thesis or for your doctoral dissertation, and as soon as you know what kind of language material or which corpus you will be working on.

The coursework is distributed over three periods, starting in period 2 and ending in period 4.

At the beginning of period 2 in the autumn, the group will have a kick-off meeting either online or in class. With online support from your peers and from the teacher, you will then complete learning tasks, small tutorials and hands-on assignments covering various tools and issues in language data management. For instance, you will get familiar with automatic tools that are available for parsing text or for transcribing speech. You will then focus on how to process your own data into a suitable format for further analysis and how to import your data into a spreadsheet program or a statistical environment such as R. By the end of period 2, you will be required to write a brief data management plan and to review and discuss the plans of several other participants.
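As a rough illustration of what "processing your data into a suitable format" can mean in practice, the following minimal Python sketch turns a short text into a token frequency table in CSV form. It is not part of the course material: the function names, the sample sentence and the very naive whitespace tokenization are all assumptions made for this example only. A real corpus would normally call for a proper tokenizer or tagger.

```python
import csv
import io
from collections import Counter


def tokens_to_rows(text):
    """Count lowercased, punctuation-stripped tokens (hypothetical helper).

    Tokenization here is a naive whitespace split, used only for illustration.
    """
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    counts = Counter(t for t in tokens if t)
    # Sort by descending frequency, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def rows_to_csv(rows):
    """Serialize (token, frequency) pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["token", "frequency"])
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample text standing in for your own language material.
sample = "The cat sat. The dog sat too."
table = rows_to_csv(tokens_to_rows(sample))
print(table)
```

If the resulting text were saved to a file, it could be opened in a spreadsheet program or read into R, for instance with `read.csv`.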

During period 3, the participants will work on their theses independently and update their data management plans as required. Every participant will have at least one but no more than two individual “clinic” appointments with the teacher and/or with an expert in the methods used in the participant's thesis. The aim of these meetings is to help each student carry out the analysis.

During period 4, a final group meeting will be organized. In this meeting, the participants will briefly report on their progress regarding data management and analysis.

A list of recommended literature and other material will be provided during the course.

The course and all the assignments are completed online, but the support meetings can be organized either online or face to face. The online learning environment is available throughout the course for getting advice and for sharing tips and tricks with the group.

This course aims to provide peer support, useful tips and specific guidance in technical and practical issues that may turn up during language data handling and analysis. The course will not replace the supervisors of your thesis.

The grade 0-5 is based on the quality of the data management plan, the other course assignments and on active participation in course activities. Further details on the grading principles will be provided at the beginning of the course.

Online course with assignments and tutorials, group meetings and at least one individual appointment.