Linguistics is becoming ever more data-driven, with increasing use of data from corpora and experiments. This course will teach you the basic computational skills that you need to engage in large-scale data collection and analysis. Concretely, you will learn to use the programming language Python and its associated open-source library, the Natural Language Toolkit (NLTK).
The course is based on the NLTK book and assumes no prior programming experience. In the first part of the course, you will learn the basics of Python and explore how it can be used to access large corpora directly, without going through a user interface. In the second part, we cover selected advanced topics such as automated tagging, information extraction, parsing, computational semantics, data management, and statistical analysis, and see how they can be used to find linguistic structures in a corpus, collect frequency data, and perform statistical analysis. This part of the course will be adapted to the interests of the students. As much as possible, we will work directly with data from the students' own research.
This is a practical, hands-on course: to get the maximum benefit, you need to attend class every week and solve programming exercises. Students will work on two projects during the course: a smaller written assignment and a project that they will present orally at the end of the course. The latter project should ideally connect with the student's own research interests and could give rise to a separate publication.