source tokens). Multi-stack decoding maintains one stack for each number of source words translated; scores within a stack are thus more comparable, and hypotheses can be combined within each stack.
Much of the smarts of this algorithm is in Next, which computes the score $\alpha_q$ based on the partial translation and the next phrase. As we will see next, Compatible is also a very important function: it defines the candidate phrases that may be added, and thus determines the monotonicity of the decoder.
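To make the control flow concrete, here is a minimal sketch of the multi-stack loop in Python. This is not the implementation in decoder.py: the Phrase and State fields, the scores, and the 0-indexed half-open spans are assumptions for illustration, with Next and Compatible reduced to simple stand-ins.

    from collections import namedtuple

    # Hypothetical representations; the real fields live in decoder.py.
    Phrase = namedtuple("Phrase", ["s", "t", "english", "score"])  # covers source[s:t]
    State = namedtuple("State", ["covered", "english", "score"])

    def next_state(q, p):
        # Stand-in for Next: extend q with phrase p and update the score alpha_q.
        # (The real Next would also add language-model and distortion scores.)
        return State(q.covered | frozenset(range(p.s, p.t)),
                     q.english + tuple(p.english),
                     q.score + p.score)

    def compatible(q, p):
        # Monotonic version of Compatible: p must start exactly where the
        # covered prefix of the source sentence ends.
        return p.s == len(q.covered)

    def decode(n, phrases, beam_size=100):
        # stacks[k] holds hypotheses covering exactly k source tokens, so scores
        # within a stack are comparable and each stack can be pruned to a beam.
        stacks = [[] for _ in range(n + 1)]
        stacks[0].append(State(frozenset(), (), 0.0))
        for k in range(n):
            beam = sorted(stacks[k], key=lambda q: q.score, reverse=True)[:beam_size]
            for q in beam:
                for p in phrases:  # PHRASES from the pseudocode
                    if compatible(q, p):
                        q2 = next_state(q, p)
                        stacks[len(q2.covered)].append(q2)
        return max(stacks[n], key=lambda q: q.score)  # best complete hypothesis

Indexing stacks by the number of covered tokens is precisely what makes the per-stack pruning and hypothesis combination described above well-defined.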
1.2 Monotonicity and Non-Monotonicity
The stack decoder always builds the English sentence left to right, i.e. each phrase is added to the end of the English token sequence. However, by default, there is no requirement for the tokens that the phrases cover in the foreign sentence to be left to right. This property defines the monotonicity of the decoder, i.e. whether the predicted translation $p_1, \ldots, p_L$ meets the requirement that $\forall i \in \{2, \ldots, L\},\ t_{p_{i-1}} = s_{p_i}$, and $s_{p_1} = 1$.
Although such a constraint may be too harsh for many languages, such as those that do not follow the SVO structure of English, it is often not a horrible assumption for languages such as French and Spanish. Constraining the decoder to be monotonic for such languages therefore provides significant computational benefits, at some cost in translation accuracy. However, there are still many situations that benefit from the arbitrary word reordering that non-monotonic decoders provide.
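As a quick sanity check, the monotonicity condition above can be tested directly; this snippet assumes each phrase $p_i$ is given by its span $(s_{p_i}, t_{p_i})$, with the 1-indexed convention of the formula:

    def is_monotonic(spans):
        # spans = [(s_1, t_1), ..., (s_L, t_L)]; require s_{p_1} = 1 and
        # t_{p_{i-1}} = s_{p_i} for all i in 2, ..., L.
        return spans[0][0] == 1 and all(
            prev[1] == cur[0] for prev, cur in zip(spans, spans[1:]))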
Non-monotonic decoders differ only in the phrases that they consider compatible with a given state $q$: any phrase $p$ that does not cover any of the tokens already covered by $q$ can be used to extend $q$. Thus the branching factor explodes with the total number of phrases for the sentence, as opposed to the number of phrases for each token for monotonic decoders. To keep this computationally efficient, we described the notion of a distortion limit in class, which bounds the amount of non-monotonicity that is tolerated. It is worth noting that incorporating a language model is extremely important for non-monotonic decoders, since the phrase ordering is otherwise not really constrained.
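A minimal sketch of what a non-monotonic Compatible might look like, reusing the span convention of the earlier sketch; the covered set, the prev_end argument (the end of the most recently added phrase), and the default limit are assumptions for illustration:

    def compatible_nonmono(q, p, prev_end, distortion_limit=4):
        # Any phrase is allowed as long as it overlaps no already-covered
        # source token and stays within the distortion limit of the end of
        # the previously added phrase.
        no_overlap = q.covered.isdisjoint(range(p.s, p.t))
        return no_overlap and abs(prev_end - p.s) <= distortion_limit

With this check, every non-overlapping phrase within the distortion limit is a candidate at each step, which is exactly why the branching factor, and with it the importance of pruning and the language model, grows so quickly.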
1.3 Data and Source Code
I have released the initial source code, available at https://github.com/sameersingh/uci-statnlp/tree/master/hw4, and the data archive is available on Canvas. You will need to uncompress the archive and put it in the data/ folder for the code to work. The released code and data contain the following:
◦ lm.py and lm.gz: Similar to the assignment in HW2, this code provides an implementation of a trigram language model with Kneser-Ney smoothing, along with the parameters of such a model trained on a very large corpus of English documents. Note that since we are computing $P(f|e)P(e)$, we do not require a language model of French in order to perform the decoding. The format of the language model file, known as the ARPA format, consists of 1-, 2-, and 3-grams, with their log probabilities and backoff scores (see the short example after this list). You can load and query the language model using the main function of this file.
◦ phrase.py and phrasetable.txt.gz: Code and data for the French-to-English phrase table. Each line in the file contains a pair of phrases, along with a number of scores for different features of the pair. The code reads this file and computes the single score $g_p$ for each pair of phrases. This code also provides a handy method to get all the possible phrase translations for a given sentence, i.e. phrases() corresponds to PHRASES in the above pseudocode. You can investigate the translation table as shown in the main function.
◦ decoder.py: Implementation of the multi-stack decoding algorithm. This implementation attempts to follow the notation of the pseudocode above (and Collins' notes) as much as possible, deviating as needed for an optimized implementation. The code implements a working monotonic decoder that does not take the language model into account. Keep this in mind especially when you are looking at the code for finding compatible phrases (Compatible), computing the language model score (lm_score), and the distortion score (dist_score). Some code that differs from the pseudocode includes precomputing the set of phrases that should be considered for position $r$ (in index_phrases) and extra fields in the state that make equality comparisons efficient (key in State). You will need to develop a reasonable understanding of this code, so please post privately or publicly on Piazza if you are not able to understand something.
◦ submission.py: Skeleton code for the submission. It contains the three types of decoders, out of which only the first one, MonotonicDecoder, works as intended. You have to implement the other functions in this skeleton, as we will describe next.
◦ data.py and test.en/test.fr: This code reads in the files related to the translation model, reads French sentences from test.fr and the corresponding English translations from test.en, runs the model on the French sentences, and computes the BLEU score on the predictions. It also contains some simple French words and phrases to translate, just to test your decoder.
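For reference, the ARPA format mentioned under lm.py consists of a header of n-gram counts followed by one section per order, where each line lists a log10 probability, the n-gram itself, and (except at the highest order) a backoff weight. The entries below are made up purely for illustration:

    \data\
    ngram 1=4
    ngram 2=2
    ngram 3=1

    \1-grams:
    -1.2041  <s>   -0.3010
    -1.2041  </s>
    -0.9031  the   -0.3010
    -1.5051  cat   -0.2218

    \2-grams:
    -0.6021  <s> the   -0.1761
    -0.9031  the cat

    \3-grams:
    -0.4771  <s> the cat

    \end\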