CS 178: Machine Learning Fall 2017

Project Details

The course project will consist of groups of three students working together to create classifiers for a class-specific Kaggle competition.

Forming a Team

Deadline: November 3

Once you have identified your teammates, coordinate and do the following:

  1. Come up with a short and simple team name.
  2. Everyone should sign up for a Kaggle account.
  3. One of you should create a team on Kaggle, with the selected name, and add your teammates to it.
  4. One of you should go here, make sure you’re looking at the “Project Groups” tab, and find an empty project group. Add yourself to it.
  5. Everyone else should then add themselves to the same group on Canvas.
  6. Once everyone has been added, one of you should submit the team name before the deadline here: https://canvas.eee.uci.edu/courses/6623/assignments/147537
  7. Download the code and the data, and try out some submissions!

Evaluation Setup

Data: Our competition data are satellite-based measurements of cloud temperature (infrared imaging), used to predict the presence or absence of rainfall at a particular location. The data are courtesy of the UC Irvine Center for Hydrometeorology and Remote Sensing, and have been pre-processed to extract features corresponding to a model they use actively for predicting rainfall across the globe. Each data point corresponds to a particular latitude/longitude location where the model thinks there might be rain; the extracted features include information such as the IR temperature at that location and information about the corresponding cloud (area, average temperature, etc.). The target value is a binary indicator of whether there was rain (measured by radar) at that location; you will notice that the data are slightly imbalanced (positives make up about 30% of the training data).

Evaluation: Scoring of predictions is done using AUC, the area under the ROC (receiver operating characteristic) curve. This averages your learner’s performance across various levels of sensitivity to the positive class. This means that you will likely do better if, instead of simply predicting the target class, you also report your confidence in that class value, so that the ROC curve can be evaluated at different levels of specificity. To do so, report your confidence in class +1 (as a real number); the predictions will then be sorted in order of confidence, and the ROC curve evaluated.
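To make the value of reporting confidences concrete, here is a minimal sketch of scoring predictions with scikit-learn’s roc_auc_score; the label and score arrays are placeholders, not course data:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    # Yva: true 0/1 labels on a validation set; p1: predicted class-1 confidences.
    # (Both arrays are placeholders -- substitute your own validation split.)
    Yva = np.array([0, 1, 1, 0, 1, 0])
    p1  = np.array([0.2, 0.9, 0.6, 0.4, 0.7, 0.55])

    # AUC depends only on the ranking of the scores, so any monotone confidence
    # measure works; it does not need to be a calibrated probability.
    print("AUC from confidences:", roc_auc_score(Yva, p1))

    # Hard 0/1 predictions discard the ranking information and usually score worse.
    print("AUC from hard labels: ", roc_auc_score(Yva, (p1 > 0.5).astype(float)))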

Kaggle: Download the training data (X_train, Y_train) and the test data features (X_test). You will train your models using the former, make predictions using the test data features, and upload them to Kaggle. Kaggle will then score your predictions and report your performance on a subset of the test data (the “leaderboard” data), which is used to place your team on the current leaderboard. After the competition, the score on the remainder of the data will be used to determine your final standing; this ensures that your scores are not affected by overfitting to the leaderboard data.

Submission Details

Your submission must be a file containing two columns separated by a comma. The first column is the instance number, and the second column is your score (confidence) for class 1. The first line of the file should be “ID,Prob1”, i.e., the names of the two columns. We have made a sample submission file available, containing random predictions, called Y_random.txt.
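As a concrete illustration of that format, here is a minimal sketch using numpy.savetxt; the scores are placeholders, and you should check Y_random.txt to confirm whether instance numbers start at 0 or 1:

    import numpy as np

    # p1: class-1 scores for the test points, in the same order as X_test (placeholder here).
    p1 = np.random.rand(10)

    # First column is the instance number, second column is the class-1 score.
    rows = np.stack([np.arange(len(p1)), p1], axis=1)
    np.savetxt("Y_submit.txt", rows, fmt="%d,%.6f", header="ID,Prob1", comments="")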

Also, check out this notebook (coming soon) showing how to load the data, train logistic regression and decision trees, and create the submission files that you can upload to Kaggle. I recommend you generate these and try submitting them as soon as possible.
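Until the notebook is posted, here is a rough sketch of that workflow using scikit-learn. The file names and delimiter are assumptions; adjust them to match the files you download from Kaggle:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    # Assumed file names and format: plain-text feature matrices, one data point per row.
    X   = np.genfromtxt("X_train.txt", delimiter=None)
    Y   = np.genfromtxt("Y_train.txt", delimiter=None)
    Xte = np.genfromtxt("X_test.txt",  delimiter=None)

    # Two simple baselines: logistic regression and a depth-limited decision tree.
    lr = LogisticRegression(max_iter=1000).fit(X, Y)
    dt = DecisionTreeClassifier(max_depth=8).fit(X, Y)

    # Report class-1 probabilities rather than hard labels, so the AUC can use the ranking.
    p_lr = lr.predict_proba(Xte)[:, 1]
    p_dt = dt.predict_proba(Xte)[:, 1]
    # Write either array out in the "ID,Prob1" format described above.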

Note: Kaggle will limit you to at most 2 uploads per day, so you cannot simply upload every possible configuration and check its leaderboard quality. You will need to use some validation process to decide on your learners’ parameter settings, etc., and only upload models you believe are performing reasonably well.
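One simple way to do this, sketched below under the same file-name assumptions as above, is to hold out part of the training data and compare candidate settings by validation AUC before uploading anything:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score
    from sklearn.tree import DecisionTreeClassifier

    X = np.genfromtxt("X_train.txt", delimiter=None)   # assumed file names, as above
    Y = np.genfromtxt("Y_train.txt", delimiter=None)
    Xtr, Xva, Ytr, Yva = train_test_split(X, Y, test_size=0.25, random_state=0)

    # Compare a few candidate settings locally instead of spending Kaggle uploads on them.
    for depth in [4, 8, 16, None]:
        model = DecisionTreeClassifier(max_depth=depth).fit(Xtr, Ytr)
        auc = roc_auc_score(Yva, model.predict_proba(Xva)[:, 1])
        print(f"max_depth={depth}: validation AUC = {auc:.4f}")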

Project Requirements

Your project will consist of learning several predictors for the Kaggle data, as well as an ensemble “blend” of them, to try to do as well as possible at the prediction task. Specifically, learn at least three different types of models (more is encouraged); suggestions include:

  1. K-Nearest neighbor. Note that a KNN model on these data will need to overcome two issues: the large number of training & test points, and the data dimension. As noted in class, distance-based methods often do not work well in high dimensions, so you may need to perform some kind of feature selection process to decide how many and which features to include. Similarly, the O(mn) cost of computing distances between all m training and n test points will likely prove prohibitive; you will need to reduce the number of training data somehow, either by subsampling, clustering, or selecting “useful” examples to retain. Finally, the right “distance” for prediction may not be Euclidean in the original feature scaling (these are raw numbers); you may want to experiment with scaling features differently, or even learning the distance function. (One simple subsample-and-rescale approach is sketched after this list.)
  2. Linear models. Since you have relatively few features compared to the number of data, getting good performance will require significant feature expansion. You can generate new features systematically (polynomial, etc.), randomly (kitchen sink features), using clustering, or more.
  3. Kernel methods. For a fast implementation of SVMs, I recommend libSVM. However, like KNN, these methods often do not scale well with dimension (at least for the RBF kernel) or number of training data, and you will need to do something about these issues. See KNN discussion.
  4. Random forest. This is essentially what you did for Homework 4; you may use that learner, or elaborate on it further.
  5. Boosted learners. Use AdaBoost, Gradient Boosting, or another boosting algorithm, to train a boosted ensemble of some base learner (perceptron, decision stump or shallow tree, or Gaussian Bayes classifier).
  6. Neural network. The key for learning a NN model on these data will be to ensure that your model is well-optimized. You should monitor its performance, preferably on both training & validation data, during backpropagation, and verify that the training process is working properly and converging to a reasonable performance value (e.g., comparable to the other methods). Start with a few layers (2-3) and moderate numbers of hidden nodes (100-1000) per layer; within these settings you can work to make sure your model is training adequately.
  7. Other. You tell me: apply another class of learners, or a variant or combination of methods like the above. You can use existing libraries or modify course code; just be sure to understand the model you are applying, and why it may work well.
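As an illustration of the workaround described in item 1, here is a sketch of a KNN model trained on a standardized, subsampled version of the data; the file names, subsample size, and value of k are placeholders, not recommended settings:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier

    X   = np.genfromtxt("X_train.txt", delimiter=None)   # assumed file names, as above
    Y   = np.genfromtxt("Y_train.txt", delimiter=None)
    Xte = np.genfromtxt("X_test.txt",  delimiter=None)

    # Rescale features so no single raw measurement dominates the Euclidean distance.
    scaler = StandardScaler().fit(X)
    Xs, Xtes = scaler.transform(X), scaler.transform(Xte)

    # Subsample the training data to keep the distance computations tractable.
    rng = np.random.default_rng(0)
    idx = rng.choice(len(Xs), size=min(10000, len(Xs)), replace=False)

    knn = KNeighborsClassifier(n_neighbors=25).fit(Xs[idx], Y[idx])
    p_knn = knn.predict_proba(Xtes)[:, 1]   # class-1 confidences for submission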

Then, take your models and combine them using a blending or stacking technique – this could be as simple as a straight average / vote, a weighted vote, or a stacked predictor (linear or other model type). Feel free to experiment and see what performance you can get.
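Here is a minimal sketch of the two simplest options, a straight average and a stacked linear combiner; the score arrays are placeholders standing in for your real models’ validation- and test-set outputs:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    # Placeholder scores -- in practice these come from your trained models'
    # predict_proba outputs on a held-out validation set and on the test set.
    v_lr, v_dt, v_knn = rng.random((3, 500))     # validation-set class-1 scores
    Yva = (rng.random(500) > 0.7).astype(int)    # validation labels (placeholder)
    p_lr, p_dt, p_knn = rng.random((3, 2000))    # test-set class-1 scores

    # Simple blend: an unweighted average of the individual confidences.
    p_blend = (p_lr + p_dt + p_knn) / 3.0

    # Stacked predictor: treat each base model's validation score as a feature
    # and fit a small linear model to learn how to weight the base learners.
    V = np.column_stack([v_lr, v_dt, v_knn])
    stacker = LogisticRegression().fit(V, Yva)
    p_stack = stacker.predict_proba(np.column_stack([p_lr, p_dt, p_knn]))[:, 1]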

Note: you should do enough work to make sure each learner achieves “reasonable” performance, e.g., on the same approximate quality level as logistic regression and decision trees. (If it is significantly worse than the other models, it will not contribute usefully to the combination.)

Project Report

Please turn in a two-page report on your individual models and the final ensemble. It should include:

  1. A table listing each model, with its performance on training, validation, and leaderboard data, as well as the same for your blended or stacked combination(s).

  2. A paragraph or two describing each model: what features you gave it (input features, expanded or reduced feature sets, etc.), and how it was trained (learner, software, any parameter settings, and how you decided on those settings).

  3. A paragraph or two on your overall ensemble: how you combined the individual models, why you picked that technique, etc.

Apart from evaluating your project based on the quality of your writeup, you will also be given additional points depending on your team’s position on the leaderboard.