CS 295: Statistical NLP Winter 2017
In the write-up, define and describe the language model in detail (saying “trigram+Laplace smoothing” is not
sufficient). Include any implementation details you think are important (for example, if you implemented your
own sampler, or an efficient smoothing strategy). Also describe what the hyper-parameters of your model are and
how you set them (you should use the dev split of the data if you are going to tune them).
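As a rough illustration of tuning a smoothing hyper-parameter on the dev split, the sketch below fits a bigram model with add-k smoothing and picks k by dev-set perplexity. It is not the provided course code: the class AddKBigram, the function tune_k, the boundary and unknown-word tokens, the grid of k values, and the toy sentences are all assumptions made for this example.

```python
import math
from collections import Counter

# Hypothetical special tokens; the provided code may use different ones.
START, END, UNK = "<s>", "</s>", "<unk>"


class AddKBigram:
    """Bigram model with add-k smoothing; k is its only hyper-parameter."""

    def __init__(self, k=1.0):
        self.k = k
        self.bigrams = Counter()   # counts of (previous word, word)
        self.contexts = Counter()  # counts of previous word
        self.vocab = {END, UNK}    # words the model can predict

    def fit(self, sentences):
        for sent in sentences:
            tokens = [START] + list(sent) + [END]
            self.vocab.update(tokens[1:])  # START is never predicted
            for prev, word in zip(tokens, tokens[1:]):
                self.bigrams[(prev, word)] += 1
                self.contexts[prev] += 1
        return self

    def cond_logprob(self, word, prev):
        """log2 P(word | prev), mapping unseen words to UNK."""
        if word not in self.vocab:
            word = UNK
        num = self.bigrams[(prev, word)] + self.k
        den = self.contexts[prev] + self.k * len(self.vocab)
        return math.log2(num / den)

    def perplexity(self, sentences):
        total_logprob, total_tokens = 0.0, 0
        for sent in sentences:
            tokens = [START] + list(sent) + [END]
            for prev, word in zip(tokens, tokens[1:]):
                total_logprob += self.cond_logprob(word, prev)
                total_tokens += 1
        return 2 ** (-total_logprob / total_tokens)


def tune_k(train, dev, grid=(0.001, 0.01, 0.1, 1.0)):
    """Fit one model per k on train; return the k with lowest dev perplexity."""
    scores = {k: AddKBigram(k).fit(train).perplexity(dev) for k in grid}
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Made-up toy sentences standing in for the train and dev splits.
train = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
dev = [["the", "cat", "ran"], ["a", "dog", "sat"]]

best_k, scores = tune_k(train, dev)
for k, ppl in scores.items():
    print(f"k={k}: dev perplexity {ppl:.2f}")
print("selected k:", best_k)
```

The test split is never touched during the search, so the perplexity reported on it afterwards is not biased by the choice of k.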
2.2 Analysis on In-Domain Text (40 points)
Here, you will train a model for each of the domains, and analyze each one only on text from its own domain.
◦ Empirical Evaluation: Compute the perplexity of the test set for each of the three domains (the provided
code does this for you), and compare it to that of the unigram model. If it is easy to include a baseline version
of your model, for example leaving out some features or using only bigrams, please do so, but this is not
required. Provide further empirical analysis of the performance of your model, such as how performance changes
as hyper-parameters and/or the amount of training data are varied, or by implementing an additional metric.
◦ Qualitative: Show examples of sampled sentences to highlight what your models represent for each domain.
It can be quite illustrative to start with the same prefix and show the different sentences each model
produces. You may also hand-select, or construct, sentences for each domain, and show how the usage of certain
words/phrases is scored by all of your models (the function lm.logprob_sentence() might be useful for this).
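One way to produce prefix-conditioned samples is sketched below, assuming a plain unsmoothed bigram model built from counts. The helpers train_bigram_counts and sample_sentence, the boundary tokens, and the toy corpora are hypothetical and do not reflect the interface of the provided code.

```python
import random
from collections import Counter, defaultdict

# Hypothetical boundary tokens; the provided code may use different ones.
START, END = "<s>", "</s>"


def train_bigram_counts(sentences):
    """Map each previous word to a Counter of the words that followed it."""
    counts = defaultdict(Counter)
    for sent in sentences:
        tokens = [START] + list(sent) + [END]
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return counts


def sample_sentence(counts, prefix=(), max_len=20, rng=None):
    """Extend prefix one word at a time, sampling in proportion to bigram counts."""
    rng = rng or random.Random()
    sentence = list(prefix)
    prev = sentence[-1] if sentence else START
    while len(sentence) < max_len:
        followers = counts.get(prev)
        if not followers:
            break  # context never seen in training; an unsmoothed model cannot continue
        words = list(followers)
        weights = [followers[w] for w in words]
        word = rng.choices(words, weights=weights, k=1)[0]
        if word == END:
            break
        sentence.append(word)
        prev = word
    return sentence


# Made-up toy corpora standing in for two of the domains.
corpora = {
    "domain_a": [["the", "market", "fell", "sharply"], ["the", "bank", "raised", "rates"]],
    "domain_b": [["the", "knight", "rode", "away"], ["the", "queen", "smiled"]],
}

for name, sentences in corpora.items():
    counts = train_bigram_counts(sentences)
    sample = sample_sentence(counts, prefix=["the"], rng=random.Random(0))
    print(name, "->", " ".join(sample))
```

Passing a seeded random.Random to each call makes the samples reproducible, which helps when the same prefix is compared across domains in a report.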
2.3 Analysis on Out-of-Domain Text (40 points)
In this part, you have to evaluate each of your models on text from a domain different from the one it was trained
on. For example, you will be analyzing how a model trained on the Brown corpus performs on the Gutenberg text.
◦ Empirical Evaluation: Include the perplexity of all three of your models on all three domains (a 3 × 3 matrix,
as computed in data.py). Compare these to the unigram models, and your baselines if any, and discuss
the results (e.g. if unigram outperforms one of your models, why might that happen?). Include additional
graphs/plots/tables to support your analysis.
◦ Qualitative Analysis: Provide an analysis of the above results. Why do you think certain models/domains
generalize better to other domains? What might it say about the language used in the domains and their
similarity? Provide graphs, tables, charts, examples, or other summary evidence to support any claims you
make (you can reuse the same tools as the qualitative analysis in § 2.2, or introduce new ones).
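The train-domain by test-domain perplexity matrix can be sketched as follows. This is only an illustration with an add-one-smoothed unigram model that ignores sentence boundaries; it is not how data.py computes the matrix. The function names, the unknown-word token, the domain labels, and the toy sentences are all assumptions.

```python
import math
from collections import Counter

UNK = "<unk>"  # hypothetical token that absorbs words unseen in training


def train_unigram(sentences):
    """Return add-one-smoothed unigram log2-probabilities, including an UNK entry."""
    counts = Counter(word for sent in sentences for word in sent)
    vocab = set(counts) | {UNK}
    total = sum(counts.values()) + len(vocab)
    return {w: math.log2((counts[w] + 1) / total) for w in vocab}


def perplexity(model, sentences):
    """Per-token perplexity of sentences; unseen words are scored as UNK."""
    logprob, n_tokens = 0.0, 0
    for sent in sentences:
        for word in sent:
            logprob += model.get(word, model[UNK])
            n_tokens += 1
    return 2 ** (-logprob / n_tokens)


def perplexity_matrix(train_sets, test_sets):
    """matrix[train_domain][test_domain] = perplexity of that model on that test set."""
    models = {name: train_unigram(sents) for name, sents in train_sets.items()}
    return {
        tr: {te: perplexity(model, test_sets[te]) for te in test_sets}
        for tr, model in models.items()
    }


# Made-up toy data standing in for the three domains.
train_sets = {
    "domain_a": [["stocks", "fell", "today"], ["the", "bank", "raised", "rates"]],
    "domain_b": [["the", "knight", "rode", "away"], ["the", "queen", "smiled"]],
    "domain_c": [["the", "jury", "said", "today"], ["the", "mayor", "said", "no"]],
}
test_sets = {
    "domain_a": [["the", "bank", "fell"]],
    "domain_b": [["the", "queen", "rode", "away"]],
    "domain_c": [["the", "mayor", "said", "today"]],
}

matrix = perplexity_matrix(train_sets, test_sets)
names = list(train_sets)
print("train/test".ljust(12) + "".join(n.rjust(12) for n in names))
for tr in names:
    print(tr.ljust(12) + "".join(f"{matrix[tr][te]:12.2f}" for te in names))
```

Rows are training domains and columns are test domains, so the diagonal holds the in-domain results and the off-diagonal entries show how well each model transfers.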
2.4 Extra Credit: Additional corpus (20 points)
Identify a corpus on your own that is substantially different from the included ones, and provide an analysis of
this data similar to the above. Upload this corpus to Canvas with your submission, and mention license restrictions,
if any, in your write-up (I might want to use it for a future offering of the course). You will be allowed an
additional page in your report if you include such an analysis.
3 Statement of Collaboration
It is mandatory to include a Statement of Collaboration in each submission, with respect to the guidelines below.
Include the names of everyone involved in the discussions (especially in-person ones), and what was discussed.
All students are required to follow the academic honesty guidelines posted on the course website. For
programming assignments, in particular, I encourage the students to organize (perhaps using Piazza) to discuss
the task descriptions, requirements, bugs in my code, and the relevant technical content before they start working
on it. However, you should not discuss the specific solutions, and, as a guiding principle, you are not allowed to
take anything written or drawn away from these discussions (i.e. no photographs of the blackboard, written notes,
referring to Piazza, etc.). Especially after you have started working on the assignment, try to restrict the discussion
to Piazza as much as possible, so that there is no doubt as to the extent of your collaboration.
Since we do not have a leaderboard for this assignment, you are free to discuss the numbers you are getting
with others, and again, I encourage you to use Piazza to post your results and compare them with others.
Acknowledgements
This homework is adapted from one by Prof. Yejin Choi of the University of Washington. Thanks, Yejin!
Homework 2, UC Irvine