Many real-world applications contain a small number of labeled instances but a large number of unlabeled instances. Machine learning algorithms that are able to utilize the information from unlabeled instances are known as semi-supervised approaches. The first programming assignment will require you to implement such an algorithm that benefits from large amounts of unlabeled text.
One of the fundamental tasks in natural language processing is probabilistic modeling of language, i.e. assigning probabilities to word sequences so that we can differentiate between a random sequence of words and something we might consider an English sentence. Such language models are used in many applications, such as handwriting recognition, speech recognition, machine translation, and text generation. In this second programming assignment, you will perform language modeling of different kinds of text.
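To make the idea of scoring word sequences concrete, here is a minimal sketch of a bigram language model with add-one smoothing. The toy corpus, the function names `train_bigram` and `log_prob`, and the choice of add-one smoothing are all assumptions made for illustration; they are not the assignment's required design or data.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def log_prob(tokens, unigrams, bigrams):
    """Log-probability of a sentence under an add-one smoothed bigram model."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    total = 0.0
    for prev, word in zip(padded, padded[1:]):
        # Add-one smoothing keeps unseen bigrams from getting zero probability.
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total

# Hypothetical toy corpus, for illustration only.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
unigrams, bigrams = train_bigram(corpus)

fluent = log_prob("the cat sat on the rug".split(), unigrams, bigrams)
shuffled = log_prob("rug the on sat cat the".split(), unigrams, bigrams)
print(fluent > shuffled)  # the sentence-like ordering scores higher
```

Both test sequences contain exactly the same words, so the difference in score comes only from word order, which is what separates a sentence from a random sequence under this model.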
A number of tasks in natural language processing can be framed as sequence tagging, i.e. predicting a sequence of labels, one for each token in the sentence. Such tasks include finer-grained tasks such as tokenization and chunking, as well as coarser-level part-of-speech tagging and named entity recognition. In this homework, you will be looking at the latter two on a corpus of tweets, and investigating two challenges in sequence modeling: inference and feature engineering.
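As an illustration of the inference challenge, the sketch below finds the highest-scoring tag sequence with the Viterbi algorithm, assuming the score of a sequence decomposes into start, transition, and per-token emission log-scores. The function signature, the two-tag tag set, and all scores are hypothetical; the homework's actual model and interface may differ.

```python
def viterbi(tokens, tags, start, trans, emit):
    """Return the highest-scoring tag sequence under additive log-scores.

    Assumes tokens is non-empty, start[t] and trans[(p, t)] are log-scores,
    and emit(token, t) returns the log-score of tagging token with t.
    """
    # best[i][t] = best score of any tag sequence for tokens[:i + 1] ending in t
    best = [{t: start[t] + emit(tokens[0], t) for t in tags}]
    back = []  # back[i][t] = best previous tag when tokens[i + 1] is tagged t
    for token in tokens[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p] + trans[(p, t)])
            scores[t] = best[-1][prev] + trans[(prev, t)] + emit(token, t)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final tag to recover the full path.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical toy scores, for illustration only.
tags = ["NOUN", "VERB"]
start = {"NOUN": 0.0, "VERB": -2.0}
trans = {("NOUN", "NOUN"): -2.0, ("NOUN", "VERB"): 0.0,
         ("VERB", "NOUN"): 0.0, ("VERB", "VERB"): -2.0}
lexicon = {("dogs", "NOUN"): 0.0, ("bark", "VERB"): 0.0}

def emit(token, tag):
    # Token/tag pairs missing from the toy lexicon get a fixed penalty.
    return lexicon.get((token, tag), -3.0)

print(viterbi(["dogs", "bark"], tags, start, trans, emit))  # ['NOUN', 'VERB']
```

The dynamic program costs time proportional to the sentence length times the square of the number of tags, instead of enumerating every possible tag sequence.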
One of the most widespread and public-facing applications of natural language processing is machine translation. It has gained a lot of attention in recent years, both infamously for its inability to understand the nuance in human communication and for the near human-level performance achieved using neural models. In this homework, we will be looking at phrase-based translation from French to English, and implementing stack-based decoders of varying complexity to achieve this.