Mietta Lennes, CC BY 4.0

Learn to prepare and analyze your language data

This course will take you through the steps of managing, annotating and analyzing your language material, e.g., text or speech.

NB: Registration deadline extended to 23rd November 2018!

This course is intended to help you with the most concrete and practical problems you may face when you start working on your MA thesis or PhD project: how to design your study, where to find suitable tools and how to use them, how to annotate your material (if you need to), how to convert your data and read it into a statistical program, how to begin your analysis, how to make sure you can publish and re-use your data and results, etc.

The course is mainly focussed on handling research material and data that contain language in some form. You may enrol for this course if you already know what kind of language material you will be using for your Master's thesis or PhD dissertation. For instance, you may be using text documents or audio and video recordings of spoken language. Your data may come from existing language corpora, from an archive, from the Internet, or you may have collected it from some other source. Your research topic need not be linguistic, however. Students of all fields are welcome!

NB: In the academic year 2018-19, the course is jointly organized by FIN-CLARIN and HELDIG. After successfully passing the first part of the course, each participant will have an opportunity to meet with one of the supervising experts from the network provided by HELDIG and FIN-CLARIN.

Please note that the number of participants is restricted. If too many students enrol, first priority will be given to students in the LingDA Master's Program at the University of Helsinki, and second priority to students who have attended the first meeting (either online or in person).

If space allows, it is possible to participate in the course from another Finnish university. You can join the meetings online. After passing the course, students from outside the University of Helsinki will receive a written certificate that they can present at their own institution in order to receive the credits. Before enrolling, you should check with your local supervisor whether this course can be included in your degree.


All students need to enrol for this course:
- If you are a student at the University of Helsinki, please enrol via WebOodi as usual. After that, you can go to the course area in Moodle. Before the course starts, you should fill in the enrolment survey that is available on Moodle.
- If you wish to participate from outside the University of Helsinki, you cannot enrol via the WebOodi link on this page. However, you can go directly to the course area in Moodle and fill in the enrolment survey.

The Moodle area is now open for students via this link: https://moodle.helsinki.fi/course/view.php?id=29462

How to access Moodle:
- To log in from Finnish universities outside Helsinki, please use the HAKA login link.
- To log in from universities outside Finland, please try the eduGAIN login link with the user account provided by your home university. If you are unable to log in, please contact the teacher of the course so that we can try to get you in.

9.10.2018 at 09:00 - 23.11.2018 at 23:59


Two face-to-face meetings have been pre-scheduled: one at the beginning of the course, and another at the end. It is preferable to participate in these meetings on site, but if this is impossible for you, you will be given the option to participate online. During the spring term, each participant will have at least one personal support meeting with the teacher.

We can set up additional meetings if required. Normally, these will be arranged online.

Provisional syllabus of the course:

Introduction to the Corpus Clinic
* Why are we here? Getting to know each other
* The Language Bank of Finland
* Open science: basic concepts

Stage 1. Composing the initial version of your data management plan (November-December)
* Where to find material? Metadata services and catalogues
* Elaborating your research questions
* Defining the variables you need to include in your dataset
* Do you need to annotate your language material?
* Sharing your language corpus or other sets of data
* Citing data and making your data citable
* Legal issues: Copyright and personal data
* Submitting the initial version of your data management plan

Stage 2. Gathering, pre-processing and cleaning your material (December)
* Obtaining a dataset from Korp or other sources
* What other means can you use to build a dataset out of your language material?
* Text editors and multimedia viewers
* File conversion (text, audio, video)
* Tips on file naming
* Moving and storing your files

Stage 3a. Parsing, annotating and searching text material (December-January)
* Online parsers and taggers
* Other text annotation tools
* What you need to know about technical annotation formats
* Text visualization tools
* Using regular expressions and grep

Stage 3b. Using, annotating and searching speech material (December-January)
* Transcribing and annotating speech with ELAN and Praat
* Aligning an existing transcript with the media file
* Searching an annotated speech corpus
* Collecting data from an annotated speech sample

Stage 4. Wrangling your data (February)
* Importing your tabular data into a spreadsheet program or a statistical environment (MS Excel; RStudio)
* Obtaining summaries and some initial visualizations
* Understanding and cleaning your data

Stage 5. Discovering the methods and tools for completing your own analysis (February-April)
* 1-2 personal meetings with the course tutor and/or one of the expert supervisors

Stage 6. Updating your DMP and submitting the final version (April)

Fri 9.11.2018, 10:00 - 13:00
Fri 26.4.2019, 10:00 - 13:00


All materials will be provided via the Moodle course area.

Some additional courses can be recommended in order to support your thesis project. Please see the suggestions below or consider taking another course that suits your own background and goals!


The course belongs to the MA Programme Linguistic Diversity in the Digital Age

  • study track: language technology
  • modules: Studies in Language Technology (LDA-T3100), Essentials in Language Technology (LDA-TA500), Comprehensive specialization in Language Technology (LDA-TB500)

This is an optional course.

The course is available to students from other study tracks and degree programmes.

  • The basic course Corpus linguistics and statistical methods (or similar knowledge and skills) is recommended as background.
  • For students working on speech corpora, the course Introduction to speech analysis (or similar knowledge and skills) is recommended either before or during the Corpus clinic.
  • This course is primarily aimed at students in humanities and social sciences. No programming skills or complicated mathematics are required.

Depending on the topic area of your thesis or dissertation, courses in statistical methods, speech signal processing, natural language processing, and/or programming may be useful.

After successfully completing the course

  • You understand why data management skills are necessary and useful in language research.
  • You know how to write a data management plan for a text or speech corpus.
  • You have a realistic idea of the stages that are required in order to prepare and analyze your corpus.
  • You feel brave enough to use some of the tools that are available for annotating, processing and analyzing your language material, e.g., automatic taggers, speech recognition tools and the R statistical environment, if these are required.
  • You have identified variables in the corpus that are relevant for your research question.
  • You have managed to produce a dataset that will help you answer your research question, or you know how to produce one manually or automatically.
  • You have obtained the practical skills and knowledge needed to complete the analysis of your data.
  • You have already generated passages of text that you can use to describe your methods in publications, e.g., in your thesis.
  • You know how to publish and share your corpus and/or your data.

Students are advised to take this course in year 2 (master's level) or year 1 (doctoral level). The aim of the course is to provide you with the necessary tools and support for handling, processing and analyzing your language data. The course is useful only once you have decided on a topic for your Master’s thesis or doctoral dissertation and know what kind of language material or which corpus you will be working on.

The coursework is distributed over three periods, starting in period 2 and ending in period 4.

At the beginning of period 2 in the autumn, the group will have a kick-off meeting either online or in class. With online support from your peers and from the teacher, you will then complete learning tasks, small tutorials and hands-on assignments covering various tools and issues in language data management. For instance, you will get familiar with automatic tools that are available for parsing text or for transcribing speech. You will then focus on how to process your own data into a suitable format for further analysis and how to import your data into a spreadsheet program or a statistical environment such as R. By the end of period 2, you will be required to write a brief data management plan and to review and discuss the plans of several other participants.

During period 3, the participants will work on their theses independently and update their data management plans as required. Every participant will have one or two individual “clinic” appointments with the teacher and/or with an expert in the methods used in their thesis. The aim of these meetings is to help each student carry out the analysis.

During period 4, a final group meeting will be organized. In this meeting, the participants will briefly report on their progress regarding data management and analysis.

A list of recommended literature and other material will be provided during the course.

The course and all the assignments are completed online, but the support meetings can be organized either online or face to face. The online learning environment is available throughout the course for getting advice and for sharing tips and tricks with the group.

This course aims to provide peer support, useful tips and specific guidance in technical and practical issues that may turn up during language data handling and analysis. The course will not replace the supervisors of your thesis.

The grade (0-5) is based on the quality of the data management plan and the other course assignments, and on active participation in course activities. Further details on the grading principles will be provided at the beginning of the course.

Online course with assignments and tutorials, group meetings and at least one individual appointment.