What?
The content and the learning goals of the course are described in broad terms on the main course page. More specifically, we will consider the various steps involved in NLP systems, from sentence splitting, tokenization and tagging to named-entity recognition, dependency parsing, role labeling, information extraction and more. Experiments in NLP will be central to the course, as many of you will carry out such experiments as part of your master's thesis project. In particular, we will consider how experiments should be set up and evaluated, various machine learning algorithms, and what makes linguistic material special when it comes to machine learning experiments.
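To give a first impression of the earliest steps mentioned above, here is a minimal sketch of sentence splitting and tokenization using only Python's standard library. The function names and the regular expressions are illustrative assumptions, not the tools used in the course; real toolkits handle abbreviations, numbers and many other cases that these naive rules get wrong.

```python
import re

def split_sentences(text):
    """Naive sentence splitter (illustrative only):
    break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Naive tokenizer (illustrative only):
    a token is a run of word characters or a single punctuation mark."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "NLP is fun. We will run experiments! Are you ready?"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]

for sent, toks in zip(sentences, tokens):
    print(sent, "->", toks)
```

A sentence such as "Dr. Smith arrived." would be split incorrectly after "Dr." by this sketch, which is one reason these steps are harder than they first appear.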
Background
Natural Language Processing is an interdisciplinary field building on insights from various fields, including
- Language and Linguistics
- Computer Science in general and programming in particular
- Probability theory and statistics (and mathematics)
- Machine Learning and "Data Science"
Students who come to this class have different backgrounds. Some are familiar with some of the fields, others with other fields. This can be a challenge. This semester we will try to address it as follows. Some of what can be considered background material will be covered in the lectures, but not all. In addition to the regular lectures and the group/lab sessions, early in the semester we will give some tutorials on background topics that we expect are familiar to some of you but not to all, e.g. probability. (The tutorials will be on Mondays 2.15-4 pm and show up as group 2 in the schedule.) What we cover in class (lectures + tutorials) will be adapted to what can be assumed of first-year master's students in Informatics: Language and Computation, since this course is mandatory for these students. If you have a different background, you may have to read up on some topics on your own.
Here is some more on assumed background and recommendations on what to read.
Language and linguistics
You have to be familiar with some core concepts of linguistics, like "parts of speech" and "sentence structure". If you have not taken any courses in linguistics or NLP/Computational Linguistics you should consult some of the following.
- Chapter 3, "Linguistic Essentials", pp. 81-115, in Manning and Schütze: Foundations of Statistical Natural Language Processing. This is the best overview of what will be assumed in the course. Unfortunately, the book is not online, but you can find it in the library.
- We recommend that you acquire Jurafsky and Martin, Speech and Language Processing, in any case. Sections 3.1 and 12.1-12.3 introduce some of the key concepts of morphology and syntax.
- We also recommend that you read sections 8.1-8.3 in the NLTK book: Natural Language Processing with Python, by Bird, Klein and Loper.
Programming in Python
The course will not be a heavy programming course, but you have to be able to write programs to solve simple tasks. Moreover, many of the tools we will use are Python modules. We assume that you know how to program and that you are able to teach yourself Python if you are not already familiar with it. Sources for learning Python include
- The official Python documentation, in particular the Tutorial and Library. (We recommend Python 3.x.x this semester.)
- Think Python: How to Think Like a Computer Scientist is an easy introduction to Python.
- The first four chapters of the NLTK book also give an introduction to Python.
The Natural Language Toolkit (NLTK)
This toolkit is used in several bachelor courses. We will also use parts of chapter 6 and chapter 7 of the NLTK book in this course. You are advised to familiarize yourself with the first three chapters of the book as soon as possible, in particular chapter 1 and chapter 2, sec. 2.1-2.2.
Probability theory and Statistics
Since we don't presuppose any background in probability theory and statistics, we will give a crash course in the form of one or two tutorials on the basic concepts. Do you need a book on statistics? We will cover all the concepts on the slides, so a book is not strictly required. But it could be useful to have more explanations and examples than we have time to cover in class.
- If you already own a book on statistics, that will probably suffice, e.g. the STK1000 book, Moore and McCabe, Introduction to the Practice of Statistics.
- I like Gonick and Smith, The Cartoon Guide to Statistics. It is mostly drawings, with not too many words, but it covers the essentials.
- Statistics in a Nutshell by Sarah Boslaugh covers what we need in not too many pages, and in roughly the same order as we will present the material.
- There are several free books and courses on statistics on the internet. You may look around. I have taken a first look at OpenIntro Statistics. So far it looks promising. It covers what we need and is not too technical. The main challenge is that it also covers a lot more.
- In earlier semesters some students recommended Khan Academy.
What first?
Question: If I lack some of this background, in which order should I attack it?
- If you lack experience with Python and NLTK, that is most urgent. We are going to use it from the first week.
- Then, if you don't have knowledge of linguistics, that's next on your agenda.
- If you already know Python, NLTK and some linguistics, say you have taken INF2820 Computational Linguistics, it is time for probability and statistics. As mentioned, we will have some tutorials, but it is wise to get a head start and use the first weeks of the semester.