compute the features of a token in a sentence, which you will be extending. The method there returns the computed features for a token as a list of strings (so you do not have to worry about indices, etc.).
◦ struct_perceptron.py: A direct port (with negligible changes) of the structured perceptron trainer from the http://pystruct.github.io project. It is only used for the CRF tagger. Descriptions of the trainer's various hyper-parameters are available on the pystruct site, but you should not modify this file; instead, change the hyper-parameters in the constructor in tagger.py.
◦ viterbi.py and viterbi_test.py: A general-purpose interface to a sequence Viterbi decoder in viterbi.py, which currently has an incorrect implementation. Once you have implemented Viterbi decoding correctly, running python viterbi_test.py should execute successfully without any exceptions.
By default, running python data.py will run logistic regression with the basic features on POS tagging, print the evaluation metrics, and write the prediction files to the data directory. I encourage you to run this, followed by conlleval.pl, to get an understanding of the output and evaluation. The files that you certainly have to change (and include as part of your submission) are viterbi.py and feat_gen.py. More details on what you need to implement appear in the sections below.
2 What to Submit?
Prepare and submit a single write-up (PDF, maximum 5 pages) and your modified viterbi.py and feat_gen.py (compressed in a single zip or tar.gz file; we will not be compiling or executing the code, nor will we be evaluating its quality) to Canvas. Note that Part 1 and Part 2 are completely independent of each other, so you can start with either one (Part 3 builds upon both). The write-up and code should address the following.
2.1 Feature Engineering (35 points)
Take a look at feat_gen.py. It computes features for a given token (at position i) in a given sentence (a sequence of tokens). The current set of features is pretty basic: they just look at the word itself and whether certain default string properties apply to it. Note, however, that a feature here is just a unique string, such as WORD=is or IS_UPPER, so you do not have to worry about indexing it as a vector. Further, with the add_neighs flag, we also add all the basic features of the neighboring tokens, by calling the function recursively and prefixing those features with a distinguishing keyword. I encourage you to run python feat_gen.py to see the features for a simple sentence, and to play around with other sentences. However, you can imagine many other kinds of features that would be useful for both part-of-speech tagging and named entity recognition, and thus your goal here is to introduce new features and evaluate their utility.
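As an illustration of what such features might look like, here is a minimal sketch in the same string-feature style. The helper name is hypothetical, and the function in feat_gen.py that it would extend may have a different signature:

    import re

    def extra_token_features(word):
        """Hypothetical helper: extra string features for one token (sketch only)."""
        feats = []
        # Prefix/suffix features often help POS tagging (e.g., "-ing" suggests VBG).
        for k in (2, 3):
            if len(word) > k:
                feats.append("SUFFIX%d=%s" % (k, word[-k:].lower()))
                feats.append("PREFIX%d=%s" % (k, word[:k].lower()))
        # Word-shape features often help NER (e.g., "U.S." becomes "X.X.").
        shape = re.sub(r"[A-Z]", "X", word)
        shape = re.sub(r"[a-z]", "x", shape)
        shape = re.sub(r"[0-9]", "d", shape)
        feats.append("SHAPE=" + shape)
        if "-" in word:
            feats.append("HAS_HYPHEN")
        return feats

Each returned string behaves as a boolean feature: it either appears in a token's feature list or it does not.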
Your code should just extend this function with more features. Feel free to use the provided lexicons or any other external information that you think will be useful for the task. One thing to keep in mind as you perform feature engineering is that some operations are expensive and should be done only once, during preprocessing. Also keep an eye on how many features you are introducing, since each additional feature can increase the number of parameters by quite a bit, which can significantly slow down training. Note that since all the features are boolean, you cannot directly use word embeddings; clustering on top of embeddings can, however, be incorporated as cluster memberships (in the style of Brown clustering), as sketched below. Finally, by running feat_gen.py independently of the other files, you can test whether you are generating the right features before training a model with them.
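Here is a minimal sketch of both points at once: the expensive work (reading a cluster lexicon) happens once at import time, and membership is then exposed as a boolean-style string feature. The file name, its format, and the function name are all assumptions for illustration:

    # Load a (hypothetical) word -> cluster-id mapping once, at import time,
    # rather than re-reading the file for every token.
    _CLUSTERS = {}
    try:
        with open("data/clusters.txt") as f:  # hypothetical lexicon file
            for line in f:
                cluster_id, word = line.split()[:2]
                _CLUSTERS[word] = cluster_id
    except IOError:
        pass  # no lexicon available; the feature simply never fires

    def cluster_features(word):
        """Return a cluster-membership feature, if the word is in the lexicon."""
        cid = _CLUSTERS.get(word.lower())
        return ["CLUSTER=" + cid] if cid is not None else []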
In the write-up, describe the features that you have implemented. Try to motivate them (why did you think they would be useful), describe them in sufficient detail, give examples (where they aid understanding), and finally, evaluate how much they helped on the dev and test sets for the logistic regression tagger (you should just need to execute data.py after making your changes in feat_gen.py).
2.2 Implement Viterbi decoding (35 points)
As important as having a good set of features for a model is being able to make predictions from it. Unfortunately, the conditional random field implementation I have included lacks this capability, and when we try to predict from it, it gives a pretty stupid sequence. Thankfully, we have covered the use of dynamic programming multiple times in class, and so here you will implement one such algorithm: the Viterbi algorithm for sequence tagging.
The main file you will be modifying is viterbi.py, which needs a function to compute the best sequence (and its score) given the various transition and emission scores (corresponding to the ψs in Section 1.1.2).
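To make the recurrence concrete, here is a minimal sketch of Viterbi decoding over per-position emission scores and pairwise transition scores. The argument names, array layout, and the start/end score vectors are assumptions; the actual interface expected by viterbi.py and viterbi_test.py may differ:

    import numpy as np

    def viterbi_sketch(emission, transition, start, end):
        """Sketch of Viterbi decoding; the interface is an assumption.

        emission:   (N, L) array; emission[i, t] = score of tag t at position i
        transition: (L, L) array; transition[s, t] = score of t following s
        start, end: (L,) arrays of scores for the first/last tag
        Returns (best_score, best_tag_sequence).
        """
        N, L = emission.shape
        # delta[i, t] = score of the best tag sequence for positions 0..i
        # ending in tag t; backptr[i, t] = the tag at i-1 on that best path.
        delta = np.zeros((N, L))
        backptr = np.zeros((N, L), dtype=int)
        delta[0] = start + emission[0]
        for i in range(1, N):
            for t in range(L):
                scores = delta[i - 1] + transition[:, t] + emission[i, t]
                backptr[i, t] = np.argmax(scores)
                delta[i, t] = scores[backptr[i, t]]
        # Fold in the end scores, then follow the back-pointers in reverse.
        last = int(np.argmax(delta[N - 1] + end))
        best_score = float(delta[N - 1, last] + end[last])
        tags = [last]
        for i in range(N - 1, 0, -1):
            tags.append(int(backptr[i, tags[-1]]))
        tags.reverse()
        return best_score, tags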
As a