Lecture 2: Words, tokenization, tagged text
This lecture will consider:
- some basic linguistic concepts related to words
- the processes of tokenization and normalization
- tagged text
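As a rough illustration of the three topics above, the following minimal sketch tokenizes a sentence with a regular expression, normalizes it by lowercasing, and represents tagged text as (token, tag) pairs. The function names, the regular expression, and the hand-assigned tags are assumptions made for this example; they are not taken from the lecture or the readings, and real tokenizers and taggers are considerably more elaborate.

```python
import re

def tokenize(text):
    # Hypothetical rule: a token is a word (optionally with internal
    # apostrophes or hyphens) or a single punctuation character.
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", text)

def normalize(tokens):
    # One simple normalization step: case folding.
    return [t.lower() for t in tokens]

def format_tagged(pairs):
    # Render (token, tag) pairs in the common "word/TAG" notation.
    return " ".join(f"{word}/{tag}" for word, tag in pairs)

tokens = tokenize("The cat didn't sit.")
normalized = normalize(tokens)

# Tags assigned by hand for illustration only (not produced by a tagger).
tags = ["DET", "NOUN", "VERB", "VERB", "PUNCT"]
tagged = list(zip(normalized, tags))

print(tokens)
print(format_tagged(tagged))
```

Keeping tokenization and normalization as separate steps makes it easy to inspect the original tokens before any information, such as capitalization, is discarded.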
Presentation
Recordings
Mandatory reading
Jurafsky and Martin, Speech and Language Processing, 3rd ed. (edition of 30 Dec. 2020!)
- Ch. 2 Regular expressions, etc.
- Sec. 2.0
- Sec. 2.2 Words
- Sec. 2.3 Corpora
- Sec. 2.4 Normalization, except 2.4.3 and the technical details of 2.4.1
- Ch. 8 Sequence Labelling ...
- Sec. 8.1 and 8.2
NLTK book
- Ch. 3, sec. 6 Normalizing Text
- Ch. 3, sec. 8 Segmentation
- Ch. 5, sec. 1 Using a tagger
- Ch. 5, sec. 2 Tagged corpora
Wikipedia
Recommended reading
Wikipedia
Lecture 3, Thursday 2 Sept. 12.15 in SED
Presentation
Recordings
Unfortunately, the recording equipment in SED did not work, but the following recordings from 2020 cover roughly the same material:
- First part
- Second part
(These open in PowerPoint because there were problems with the recording equipment last year as well.)
Readings
OpenIntro, 3rd ed. (in the 4th ed., add one to the chapter numbers)
- Ch. 2, "Probability", sec. 2.1-2.4
- Ch. 3, "Distributions of random variables":
- Sec. 3.3.1 Bernoulli distribution
- Sec. 3.4.1 Binomial distribution
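For orientation before reading the two sections on distributions, here is a minimal sketch of the Bernoulli and binomial probability mass functions, written directly from their standard definitions. The function names are chosen for this example only and do not come from the textbook.

```python
from math import comb

def bernoulli_pmf(k, p):
    # P(X = k) for a single trial with success probability p, where k is 0 or 1.
    if k not in (0, 1):
        return 0.0
    return p if k == 1 else 1.0 - p

def binomial_pmf(k, n, p):
    # P(exactly k successes in n independent trials, each with success probability p).
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)

# Probability of exactly 2 heads in 4 tosses of a fair coin.
two_heads = binomial_pmf(2, 4, 0.5)
print(two_heads)
```

With n = 1 the binomial distribution reduces to the Bernoulli distribution, which is a quick way to check that the two functions agree.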