CS 273A: Machine Learning Fall 2017

Project Details

The course project will be done in groups of three students working together. There are two options for your project: one is more useful if you are interested in learning machine learning but not necessarily pursuing it as a career, and the other is more suitable for students who are already familiar with machine learning and want to pursue a career in it (research or otherwise). The two options are:

  1. Kaggle Competition: Implement and analyze an application of machine learning to a given dataset; this will involve participating in the class-specific Kaggle competition: https://www.kaggle.com/c/uc-irvine-cs-273a-project-fall-2017/ (more details below)
  2. ICLR 2018 Challenge: Implement an existing paper to try to reproduce its results, and submit a report to the ICLR challenge: http://www.cs.mcgill.ca/~jpineau/ICLR2018-ReproducibilityChallenge.html. For advanced students only! (more details below)

More details about the two options appear below.

Forming a Team

Deadline: November 10

Once you have identified your teammates, coordinate and do the following:

  1. Come up with a short and simple team name
  2. One of you should go to the project groups page on Canvas, find an empty project group, and add yourself to it. Do NOT create a new group!
  3. Everyone else should also add themselves to the same group on Canvas.
  4. Once everyone has been added, one of you should submit the team name before the deadline here: https://canvas.eee.uci.edu/courses/6624/assignments/147538
  5. Start experimenting with the options.
    • Option 1: Download the data, and start training classifiers
    • Option 2: Identify the paper you want to implement
  6. Option 2: Pick a paper, and meet with me.

Option 1: Kaggle

URL: https://www.kaggle.com/c/uc-irvine-cs-273a-project-fall-2017/

Data: Our competition data are satellite-based measurements of cloud temperature (infrared imaging), being used to predict the presence or absence of rainfall at a particular location. The data are courtesy of the UC Irvine Center for Hydrometeorology and Remote Sensing, and have been pre-processed to extract features corresponding to a model they use actively for predicting rainfall across the globe. Each data point corresponds to a particular lat-long location where the model thinks there might be rain; the extracted features include information such as IR temperature at that location, and information about the corresponding cloud (area, average temperature, etc.). The target value is a binary indicator of whether there was rain (measured by radar) at that location; you will notice that the data are slightly imbalanced (positives make up about 30% of the training data).
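Since positives make up only about 30% of the training data, you may want to reweight the classes when training. One standard scheme (the same formula scikit-learn uses for class_weight='balanced') weights each class inversely to its frequency; this snippet is a sketch of that computation, not a required part of any submission:

```python
from collections import Counter

def balanced_weights(labels):
    """Weight each class inversely to its frequency: w_c = n / (k * n_c),
    where n is the total count, k the number of classes, n_c the class count."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * nc) for c, nc in counts.items()}

# With 30% positives, positives get weight 100/60 ~ 1.67 and
# negatives get 100/140 ~ 0.71, balancing their total influence.
print(balanced_weights([1] * 30 + [0] * 70))
```

Most learners accept such weights directly (e.g., a sample_weight or class_weight argument), which is often simpler than resampling the data.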

Evaluation: Scoring of predictions is done using AUC, the area under the ROC (receiver operating characteristic) curve. This averages your learner’s performance across different levels of sensitivity to the positive class. This means that you will likely do better if, instead of simply predicting the target class, you also include your confidence in that class value, so that the ROC curve can be evaluated at different levels of specificity. To do so, report your confidence in class +1 (as a real number); the predictions will then be sorted in order of confidence, and the ROC curve evaluated.
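Concretely, AUC equals the probability that a randomly chosen positive example receives a higher confidence than a randomly chosen negative one (ties counting half). The helper below is a sketch of that pairwise view for local sanity checks, not the official scorer; Kaggle computes the score server-side:

```python
def auc(labels, scores):
    """Pairwise (rank-based) AUC: the fraction of positive/negative pairs
    in which the positive example is scored higher; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hard 0/1 predictions collapse many examples into ties; real-valued
# confidences let the ROC curve sweep smoothly over thresholds.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

This also shows why submitting raw probabilities (or any monotone score) is enough: AUC depends only on the ordering of the predictions, not their scale.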

Participating: Download the training data (X_train, Y_train) and the test data features (X_test). You will train your models using the former, make predictions using the test data features, and upload them to Kaggle. Kaggle will then score your predictions, and report your performance on a subset used for placing your team on the current leaderboard (the “leaderboard” data). After the competition, the score on the remainder of the data will be used to determine your final standing; this ensures that your scores are not affected by overfitting to the leaderboard data.

Kaggle Submission Format

Your submission must be a file containing two columns separated by a comma: the first column is the instance number, and the second is the score (confidence) for class 1. The first line of the file should be “ID,Prob1”, i.e., the names of the two columns. We have made a sample submission file available, containing random predictions, called Y_random.txt.
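Writing a file in this format takes only a few lines; this sketch assumes 1-based instance numbers, which you should verify against the provided Y_random.txt sample:

```python
import numpy as np

def write_submission(probs, path="Y_submit.txt"):
    """Write a Kaggle submission: header 'ID,Prob1', then one
    'instance_number,confidence' row per test example."""
    with open(path, "w") as f:
        f.write("ID,Prob1\n")
        for i, p in enumerate(probs, start=1):  # assuming 1-based IDs
            f.write(f"{i},{p}\n")

# Demo with random confidences for 5 hypothetical test points:
write_submission(np.random.rand(5), "Y_submit.txt")
```

In practice `probs` would be your model’s confidence in class +1 for each row of X_test, in the same order as the test file.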

Also, check out this notebook (coming soon) showing how to load the data, train logistic regression and decision trees, and create the submission files that you can upload to Kaggle. I recommend you generate these and try submitting them as soon as possible.
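In the meantime, here is a minimal scikit-learn sketch of the train/predict loop, shown on synthetic data standing in for the rainfall features (the model choice and settings are illustrative, not a recommendation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_and_score(X_train, Y_train, X_test):
    """Train a logistic regression and return P(class = 1) for each
    test point: real-valued confidences, as AUC scoring prefers."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, Y_train)
    return model.predict_proba(X_test)[:, 1]

# Synthetic stand-in data: 200 points, 3 features, label driven by feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)
probs = fit_and_score(X, y, X)
print(probs.shape)  # (200,)
```

With the real data you would load (X_train, Y_train, X_test) instead of generating them, and pass the returned confidences to your submission writer.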

Note: Kaggle will limit you to at most 2 uploads per day, so you cannot simply upload every possible configuration and check its leaderboard quality. You will need to use a validation process to decide on your learners’ parameter settings, etc., and only upload models you believe are doing well.
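One way to ration those uploads is ordinary K-fold cross-validation: score each candidate configuration locally, averaged over folds, and only submit the one or two you trust most. A sketch of the splitting logic (the fold count and seed here are arbitrary):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Partition indices 0..n-1 into k disjoint (train, val) splits."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

# For each parameter setting: train on each fold's train indices, score
# on its val indices, and compare the averaged scores offline instead of
# burning one of the two daily Kaggle uploads.
splits = list(kfold_indices(100, 5))
print(len(splits))  # 5
```

Using the same folds for every candidate configuration makes their validation scores directly comparable.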

Project Requirements

I am looking for several elements to be present in any good project. These are:

  1. Exploration of at least one or two techniques on which we did not spend significant time in class. For example, using neural networks, support vector machines, or random forests are great ideas; but if you do this, you should explore in some depth the various options available to you for parameterizing the model, controlling complexity, etc. (This should involve more than simply varying a parameter and showing a plot of results.) Other options might include feature design, or optimizing your models to deal with special aspects of the data (large numbers of zeros in the data; possible outlier data; etc.). Your report should describe which aspects you chose to focus on.
  2. Performance validation. You should practice good form and use validation or cross-validation to assess your models’ performance, do model selection, combine models, etc. You should not simply upload hundreds of different predictors to the website to see how they do. Think of the website as “test” performance: in practice, you would only be able to measure this once you go live.
  3. Adaptation to under- and over-fitting. Machine learning is not very “one size fits all”; it is impossible to know for sure what model to choose, what features to give it, or how to set the parameters until you see how it does on the data. Therefore, much of machine learning revolves around assessing performance (e.g., is my poor performance due to underfitting, or overfitting?) and deciding how to modify your techniques in response. Your report should describe how, during your process, you decided how to adapt your models and why.
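The under/over-fitting diagnosis in item 3 can be illustrated with a toy nested-model experiment (1-D polynomial fits standing in for any complexity knob): training error can only decrease as model complexity grows, so it is the gap between training and validation error that signals overfitting.

```python
import numpy as np

# Toy data: noisy sine, split into a training and a validation set.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=60)
x_tr, y_tr, x_va, y_va = x[:40], y[:40], x[40:], y[40:]

def mse(coef, xs, ys):
    """Mean squared error of a polynomial (coef) on data (xs, ys)."""
    return float(np.mean((np.polyval(coef, xs) - ys) ** 2))

for degree in (1, 3, 9):
    coef = np.polyfit(x_tr, y_tr, degree)  # least-squares polynomial fit
    print(degree, mse(coef, x_tr, y_tr), mse(coef, x_va, y_va))
# Training error can only fall as the degree grows; watch whether the
# validation error follows it down (underfitting is easing) or diverges
# from it (overfitting has begun).
```

The same train-versus-validation comparison applies to any complexity control: tree depth, regularization strength, network size, and so on.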

Your team will produce a single write-up document, approximately 6 pages long, describing the problem you chose to tackle and the methods you used to address it, including which model(s) you tried, how you trained them, how you selected any parameters they might require, and how they performed on the test data. Consider including tables of performance of different approaches, or plots of performance used to perform model selection (i.e., over parameters that control complexity). Within your document, please describe to the best of your ability who was responsible for which aspects (which learners, etc.), and how the team as a whole put the ideas together.

You are free to collaborate with other teams, including sharing ideas and even code, but please document where your predictions came from. (This also relaxes the proscription against posting code or asking for code help on Piazza, at least for project purposes.) For example, for any code you use, please say in your report who wrote the code and how it was applied (who determined the parameter settings and how, etc.). Collaboration is particularly useful for learning ensembles of predictors: your teams may each supply a set of predictors, and then collaborate to learn an ensemble from the set.


Option 2: ICLR Challenge

URL: http://www.cs.mcgill.ca/~jpineau/ICLR2018-ReproducibilityChallenge.html

Coming Soon