Last Modified: January 21, 2017
CS 295: Statistical NLP: Winter 2017
Homework 2: Language Modeling
Sameer Singh
http://sameersingh.org/courses/statnlp/wi17/
One of the fundamental tasks in natural language processing is probabilistic modeling of language, i.e. how can we differentiate between a random sequence of words and something we might consider an English sentence?
Such language models are used in many applications, such as handwriting recognition, speech recognition, machine
translation, and text generation. In this second programming assignment, you will perform language modeling of
different kinds of text. The submissions are due by midnight on February 9, 2017.
1 Task: Language Modeling of Different Datasets
Your task is to analyze the similarities and differences in different domains using your language model.
1.1 Data
The data archive (available on Canvas) contains corpora from three different domains, with a train, test, dev, and readme file for each. The domains are summarized below, but feel free to uncompress and examine the files themselves for more details (this will be quite helpful for your analysis).
Brown Corpus: The objective of this corpus is to serve as the standard corpus representing present-day (i.e. 1979) edited American English. More details are available at http://www.hit.uib.no/icame/brown/bcm.html.
Gutenberg Corpus: This corpus contains a selection of text from public domain works by authors including Jane Austen and William Shakespeare (see the readme file for the full list). More details about Project Gutenberg are available at http://gutenberg.net/.
Reuters Corpus: Collection of financial news articles that appeared on the Reuters newswire in 1987. The corpus is hosted on the UCI ML repository at https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection.
1.2 Source Code
I have released some initial source code, available at https://github.com/sameersingh/uci-statnlp/tree/master/hw2. The interface and a simple implementation of a language model are available in lm.py, which you can extend to implement your models. In generator.py, I provide a generic sentence sampler for a language model. The file data.py contains the main function, which reads in all the train, test, and dev files from the archive, trains all the unigram models, and computes the perplexity of all the models on each other's data. The README file provides a little more detail. Of course, feel free to ignore the code if you do not find it useful.
2 What to Submit?
Prepare and submit a single write-up (PDF, maximum 5 pages) and relevant source code (compressed in a single zip or tar.gz file; we will not be compiling or executing it, nor will we be evaluating the quality of the code) to Canvas. The write-up and code should contain the following.
2.1 Implement a Language Model (20 points)
The primary task of the homework is to implement a non-trivial language model. You are free to pick the type of model, such as discriminative/neural or generative. If you decide to implement an n-gram language model, it should at least use the previous two words, i.e. a trigram model (with appropriate filtering). Use appropriate smoothing to ensure your language model outputs a non-zero and valid probability distribution for out-of-vocabulary words as well. In order to make evaluation and analysis efficient, it might be worthwhile to implement serialization of the model to disk, perhaps using pickle.
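To make this concrete, here is a minimal sketch of the counting and smoothing logic for a trigram model with add-alpha smoothing, along with pickle serialization. It is not part of the released code and does not follow the lm.py interface; all names are illustrative.

    import pickle
    from collections import defaultdict

    class Trigram:
        """Sketch of a trigram LM with add-alpha (Laplace-style) smoothing."""

        def __init__(self, alpha=0.1):
            self.alpha = alpha           # smoothing strength (tune on dev)
            self.tri = defaultdict(int)  # counts of (w1, w2, w3)
            self.bi = defaultdict(int)   # counts of the (w1, w2) context
            self.vocab = {"END_OF_SENTENCE", "UNK"}

        def fit_sentence(self, sentence):
            self.vocab.update(sentence)
            words = ["*", "*"] + sentence + ["END_OF_SENTENCE"]
            for w1, w2, w3 in zip(words, words[1:], words[2:]):
                self.tri[(w1, w2, w3)] += 1
                self.bi[(w1, w2)] += 1

        def prob(self, w3, w1, w2):
            # add-alpha keeps every probability non-zero; unseen words map
            # to UNK, so the distribution stays valid over the vocabulary
            if w3 not in self.vocab:
                w3 = "UNK"
            V = len(self.vocab)
            return (self.tri[(w1, w2, w3)] + self.alpha) / (self.bi[(w1, w2)] + self.alpha * V)

    if __name__ == "__main__":
        model = Trigram()
        model.fit_sentence("the market closed lower".split())
        with open("demo.pkl", "wb") as f:  # serialize the trained model
            pickle.dump(model, f)
        with open("demo.pkl", "rb") as f:  # and load it back
            model = pickle.load(f)
        print(model.prob("closed", "the", "market"))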
In the write-up, define and describe the language model in detail (saying “trigram + Laplace smoothing” is not
sufficient). Include any implementation details you think are important (for example, if you implemented your
own sampler, or an efficient smoothing strategy). Also describe what the hyper-parameters of your model are and
how you set them (you should use the dev split of the data if you are going to tune it).
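For instance, tuning could look like the following sketch, which selects the smoothing strength by dev-set perplexity. It assumes the hypothetical Trigram class sketched above; train_sentences and dev_sentences are illustrative placeholders for the tokenized splits.

    import math

    def perplexity(model, corpus):
        """Perplexity of a corpus (a list of token lists) under the sketch above."""
        total, n = 0.0, 0
        for sentence in corpus:
            words = ["*", "*"] + sentence + ["END_OF_SENTENCE"]
            for w1, w2, w3 in zip(words, words[1:], words[2:]):
                total += math.log(model.prob(w3, w1, w2))
                n += 1
        return math.exp(-total / n)

    # keep the smoothing strength with the lowest dev-set perplexity
    best_alpha, best_ppl = None, float("inf")
    for alpha in [0.001, 0.01, 0.1, 1.0]:
        model = Trigram(alpha=alpha)
        for sentence in train_sentences:
            model.fit_sentence(sentence)
        ppl = perplexity(model, dev_sentences)
        if ppl < best_ppl:
            best_alpha, best_ppl = alpha, ppl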
2.2 Analysis on In-Domain Text (40 points)
Here, you will train a model for each of the domains, and analyze each one only on text from its own domain.
Empirical Evaluation: Compute the perplexity of the test set for each of the three domains (the provided code will do this for you), and compare it to the unigram model. If it is easy to include a baseline version of your model, for example one leaving out some features or using only bigrams, please do so, but this is not required. Provide further empirical analysis of the performance of your model, such as its performance as hyper-parameters and/or the amount of training data are varied, or an additional evaluation metric.
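For reference, the perplexity of a test corpus w_1, ..., w_N under a model p, with N counting all tokens (including end-of-sentence markers), is

    \[ \mathrm{PPL} = \exp\Big( -\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_1, \dots, w_{i-1}) \Big) \]

Lower is better, and a single zero-probability token makes the perplexity infinite, which is why smoothing matters.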
Qualitative: Show examples of sampled sentences to highlight what your models represent for each domain. It might be quite illustrative to start with the same prefix and show the different sentences each model results in. You may also hand-select, or construct, sentences for each domain, and show how the usage of certain words/phrases is scored by all of your models (the function lm.logprob_sentence() might be useful for this).
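For instance, scoring one constructed sentence under all three trained models might look like the following sketch. It assumes each model exposes logprob_sentence() as in lm.py, and that models is a dict from domain names to trained models (an illustrative placeholder).

    # score the same sentence under each domain's model
    sentence = "the price of the shares fell sharply".split()
    for domain, model in models.items():
        print("%-10s %.2f" % (domain, model.logprob_sentence(sentence)))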
2.3 Analysis on Out-of-Domain Text (40 points)
In this part, you will evaluate each of your models on text from a domain different from the one it was trained on. For example, you will analyze how a model trained on the Brown corpus performs on the Gutenberg text.
Empirical Evaluation: Include the perplexity of all three of your models on all three domains (a 3 × 3 matrix, as computed in data.py). Compare these to the unigram models, and to your baselines if any, and discuss the results (e.g. if the unigram model outperforms one of your models, why might that happen?). Include additional graphs/plots/tables to support your analysis.
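A hypothetical sketch of assembling that matrix, mirroring what data.py computes, follows; models and test_corpora are illustrative placeholders, and each model is assumed to expose a perplexity(corpus) method as in lm.py.

    # rows: training domain; columns: evaluation domain
    domains = ["brown", "reuters", "gutenberg"]
    print("train\\test  " + "  ".join("%10s" % d for d in domains))
    for train in domains:
        row = [models[train].perplexity(test_corpora[test]) for test in domains]
        print("%-11s " % train + "  ".join("%10.1f" % p for p in row))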
Qualitative Analysis: Provide an analysis of the above results. Why do you think certain models/domains
generalize better to other domains? What might it say about the language used in the domains and their
similarity? Provide graphs, tables, charts, examples, or other summary evidence to support any claims you
make (you can reuse the same tools as the qualitative analysis in § 2.2, or introduce new ones).
2.4 Extra Credit: Additional corpus (20 points)
Identify a corpus on your own that is substantially different from the included ones, and provide a similar analysis as above on this data. Upload this corpus to Canvas with your submission, and mention any license restrictions in your write-up (I might want to use it for a future offering of the course). You will be allowed an additional page in your report if you include such an analysis.
3 Statement of Collaboration
It is mandatory to include a Statement of Collaboration in each submission, with respect to the guidelines below.
Include the names of everyone involved in the discussions (especially in-person ones), and what was discussed.
All students are required to follow the academic honesty guidelines posted on the course website. For
programming assignments, in particular, I encourage the students to organize (perhaps using Piazza) to discuss
the task descriptions, requirements, bugs in my code, and the relevant technical content before they start working
on it. However, you should not discuss the specific solutions, and, as a guiding principle, you are not allowed to
take anything written or drawn away from these discussions (i.e. no photographs of the blackboard, written notes,
referring to Piazza, etc.). Especially after you have started working on the assignment, try to restrict the discussion
to Piazza as much as possible, so that there is no doubt as to the extent of your collaboration.
Since we do not have a leaderboard for this assignment, you are free to discuss the numbers you are getting with others; again, I encourage you to use Piazza to post your results and compare them.
Acknowledgements
This homework is adapted from one by Prof. Yejin Choi of the University of Washington. Thanks, Yejin!