CS 295: Statistical NLP Winter 2017
There are many variations of this algorithm, known as self-training, that you can use. The first choice is
which labels to include in $\hat{D}_l$ in every iteration: every prediction? the most “confident” predictions? predictions with
a “soft” label (kind of like EM)? only points that an ensemble of simple classifiers agree on? nearest neighbors
of previously labeled points? or something else? The second choice is the stopping criterion: should it
be a fixed number of iterations (determined using development data)? should it be when the set of labels stops
changing (or only a small proportion changes)? or something else?
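To make the two choices concrete, here is a minimal sketch of one possible variant, not a recommended solution. It assumes a tiny multinomial Naive Bayes classifier over tokenized documents, keeps only predictions whose posterior is at least a hypothetical `threshold`, and stops when the set of pseudo-labels stops changing or after `max_iters` iterations. All function and parameter names (`train_nb`, `predict_proba`, `self_train`, `threshold`, `max_iters`) are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit a tiny multinomial Naive Bayes model on tokenized documents."""
    vocab = {w for d in docs for w in d}
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for d, y in zip(docs, labels):
        word_counts[y].update(d)
    return {"vocab": vocab, "word_counts": word_counts,
            "class_counts": class_counts, "alpha": alpha}


def predict_proba(model, doc):
    """Return {label: posterior probability} for one tokenized document."""
    n = sum(model["class_counts"].values())
    v = len(model["vocab"])
    alpha = model["alpha"]
    scores = {}
    for y, c in model["class_counts"].items():
        counts = model["word_counts"][y]
        total = sum(counts.values())
        s = math.log(c / n)
        for w in doc:
            if w in model["vocab"]:  # words unseen in training are skipped
                s += math.log((counts[w] + alpha) / (total + alpha * v))
        scores[y] = s
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    return {y: math.exp(s - m) / z for y, s in scores.items()}


def self_train(labeled_docs, labels, unlabeled_docs, threshold=0.8, max_iters=10):
    """Self-training that keeps only confident predictions (an assumed choice).

    Returns the last trained model, the pseudo-labels (index -> label), and
    the size of the labeled set after each iteration.
    """
    pseudo = {}    # index into unlabeled_docs -> predicted label
    history = []   # labeled-set size per iteration, useful for the analysis
    model = None
    for _ in range(max_iters):
        docs = labeled_docs + [unlabeled_docs[i] for i in pseudo]
        ys = labels + [pseudo[i] for i in pseudo]
        model = train_nb(docs, ys)
        new_pseudo = {}
        for i, d in enumerate(unlabeled_docs):
            probs = predict_proba(model, d)
            label, p = max(probs.items(), key=lambda kv: kv[1])
            if p >= threshold:
                new_pseudo[i] = label
        history.append(len(labeled_docs) + len(new_pseudo))
        if new_pseudo == pseudo:  # stopping criterion: labels stopped changing
            break
        pseudo = new_pseudo
    return model, pseudo, history
```

Swapping the `p >= threshold` test for a different inclusion rule (for example, soft labels or ensemble agreement) changes the variant without touching the rest of the loop. The returned `history` gives the labeled-set size at each iteration.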
When analyzing this approach, I would expect you to include the accuracy and the size of the labeled set (if
appropriate) as they vary across iterations. It may also be useful to identify a few features/words whose weight
changed significantly, and hypothesize why that might have happened.
2.2.2 Designing Better Features
The primary concern when using a small set of labeled data for training text classifiers is the sparsity of the
vocabulary in the training data: irrelevant words might look incredibly discriminative for a label, while relevant
words may not even appear in the training data, because of the small sample size and high-dimensional statistics. A way to
counteract this is to utilize the corpus of unlabeled text to learn something about the word semantics, and use it to
identify the words that are likely to be uttered by the same person. In other words, knowledge from unlabeled
documents can allow us to spread the labels to words that do not even appear in the training data.
For example, suppose a word like “healthcare” does not appear in the training data (and is quite indicative of a
particular candidate, say Obama 2008), but “health”, “insurance”, and “coverage” do (and are as indicative). From
co-occurrence statistics on the unlabeled set of speeches, we can determine that “healthcare” seems to co-occur
with “health”, “insurance”, and “coverage” with a much higher probability than with other words. Thus, we may
hypothesize that “healthcare” should be indicative of Obama 2008 as well, even if it never appears during training.
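As a rough illustration of the co-occurrence statistics in this example, the sketch below counts how often two words appear in the same document and estimates P(other word in document | word in document). Counting at the document level is an assumption made here for simplicity (a sliding window would be another option), and the function names `cooccurrence_counts` and `cooccur_prob` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count document frequencies of words and of unordered word pairs."""
    word_df = Counter()
    pair_df = Counter()
    for doc in docs:
        words = sorted(set(doc))  # each word counts once per document
        word_df.update(words)
        pair_df.update(combinations(words, 2))  # pairs are stored in sorted order
    return word_df, pair_df


def cooccur_prob(word, other, word_df, pair_df):
    """Estimate P(other appears in a document | word appears in it)."""
    if word_df[word] == 0:
        return 0.0
    pair = tuple(sorted((word, other)))
    return pair_df[pair] / word_df[word]
```

Comparing `cooccur_prob("healthcare", w, ...)` across candidate words `w` on the unlabeled speeches would show which labeled words “healthcare” tends to appear with.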
Of course, I am being deliberately vague here in order to not provide a specific solution. If you explore
this direction, you will have to consider how to represent the word contexts (as a word-document matrix?
word-word matrix?), how to compute similarity between word representations (cosine distance? pointwise mutual
information?), how to represent/encode the notion of similar words (fixed or hierarchical clusters? low-dimensional
embeddings? topics?), and finally, how to utilize the labeled data (use as features during training? propagate
labels to nearest points using the new distance? directly set weights of new words?), and so on.
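To show how these design decisions fit together, here is a sketch of just one combination among the many listed above: word-document count vectors as the representation, cosine similarity between them, and directly setting the weight of an unseen word to the similarity-weighted average of the weights of its `k` most similar known words. The names `word_doc_vectors`, `cosine`, `propagate_weight`, and the parameter `k` are hypothetical, and `known_weights` stands in for whatever per-word weights your trained classifier exposes.

```python
import math
from collections import defaultdict


def word_doc_vectors(docs):
    """Map each word to a sparse {doc_index: count} row of a word-document matrix."""
    vecs = defaultdict(dict)
    for j, doc in enumerate(docs):
        for w in doc:
            vecs[w][j] = vecs[w].get(j, 0) + 1
    return vecs


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v[j] for j, x in u.items() if j in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def propagate_weight(word, known_weights, vecs, k=3):
    """Similarity-weighted average of the weights of the k nearest known words."""
    if word not in vecs:
        return 0.0
    sims = [(cosine(vecs[word], vecs[w]), w) for w in known_weights if w in vecs]
    top = sorted(sims, reverse=True)[:k]
    total = sum(s for s, _ in top)
    if total == 0:
        return 0.0  # no similar known word, so no evidence to propagate
    return sum(s * known_weights[w] for s, w in top) / total
```

Raw counts and document-level vectors are the simplest possible choice here. Reweighting the counts, using a word-word matrix, or reducing dimensionality would each give different neighbors and are worth comparing in the analysis.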
If you pick this kind of an approach, the analysis should include why you picked a certain strategy (why
you thought it would be a good idea). You should also include examples of words and/or speeches where the
propagation of information helped (where it worked) and hurt.
3 Statement of Collaboration
It is mandatory to include a Statement of Collaboration in each submission, with respect to the guidelines below.
Include the names of everyone involved in the discussions (especially in-person ones), and what was discussed.
All students are required to follow the academic honesty guidelines posted on the course website. For
programming assignments, in particular, I encourage the students to organize (perhaps using Piazza) to discuss
the task descriptions, requirements, bugs in my code, and the relevant technical content before they start working
on it. However, you should not discuss the specific solutions, and, as a guiding principle, you are not allowed to
take anything written or drawn away from these discussions (i.e. no photographs of the blackboard, written notes,
referring to Piazza, etc.). Especially after you have started working on the assignment, try to restrict the discussion
to Piazza as much as possible, so that there is no doubt as to the extent of your collaboration.
4 Potential Concerns
Since this kind of an assignment might be unfamiliar to some students (lack of very specific instructions, too many
possible solutions, course leaderboard, etc.), I have put together a set of answers that hopefully address some of
these concerns. Please post on Piazza if you have any other questions.
Question: Is the grade based on the performance of my submissions relative to the others in the class?
No, you will primarily be evaluated on the quality and creativity of your write-up, but of course, it is expected
that both of your submissions beat the simple baseline I have provided. If you do get significantly unsatisfactory
Homework 1 UC Irvine 3/ 4