The course project will consist of groups of three students working together to create classifiers for a class-specific Kaggle competition:
Deadline: November 3
Once you have identified your teammates, coordinate and do the following:
Data: Our competition data are satellite-based measurements of cloud temperature (infrared imaging), being used to predict the presence or absence of rainfall at a particular location. The data are courtesy of the UC Irvine Center for Hydrometeorology and Remote Sensing, and have been pre-processed to extract features corresponding to a model they use actively for predicting rainfall across the globe. Each data point corresponds to a particular lat-long location where the model thinks there might be rain; the extracted features include information such as IR temperature at that location, and information about the corresponding cloud (area, average temperature, etc.). The target value is a binary indicator of whether there was rain (measured by radar) at that location; you will notice that the data are slightly imbalanced (positives make up about 30% of the training data).
Evaluation: Scoring of predictions is done using AUC, the area under the ROC (receiver operating characteristic) curve. This gives an average of your learner's performance at various levels of sensitivity to positive data. This means you will likely do better if, instead of simply predicting the target class, you also report your confidence in that class value, so that the ROC curve can be evaluated at different levels of specificity. To do so, report your confidence in class +1 (as a real number); the predictions will then be sorted in order of confidence, and the ROC curve evaluated.
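For example, here is a minimal sketch (with made-up toy numbers) of how scikit-learn's roc_auc_score rewards soft scores over hard 0/1 labels:

    from sklearn.metrics import roc_auc_score

    # Toy validation labels and predictions (illustrative numbers only).
    y_true = [0, 1, 1, 0, 1, 0]
    p_soft = [0.10, 0.40, 0.90, 0.30, 0.45, 0.20]   # confidence in class +1
    y_hard = [1 if p > 0.5 else 0 for p in p_soft]  # thresholded 0/1 predictions

    print(roc_auc_score(y_true, p_soft))  # 1.0  -- the full ranking is used
    print(roc_auc_score(y_true, y_hard))  # ~0.67 -- hard labels discard the ranking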
Kaggle: Download the training data (X_train, Y_train) and the test data features (X_test). You will train your models using the former, make predictions using the test data features, and upload them to Kaggle. Kaggle will then score your predictions, and report your performance on a subset used for placing your team on the current leaderboard (the “leaderboard” data). After the competition, the score on the remainder of the data will be used to determine your final standing; this ensures that your scores are not affected by overfitting to the leaderboard data.
Your submission must be a file containing two columns separated by a comma: the first column is the instance number (ID), and the second is your score for class 1. The first line of the file should be “ID,Prob1”, i.e., the names of the two columns. We have made a sample submission file available containing random predictions, called Y_random.txt.
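For instance, assuming your class-1 scores are in a NumPy array p_test, something along these lines would write a correctly formatted file (the 1-based instance numbering here is a guess; check it against Y_random.txt):

    import numpy as np

    p_test = np.random.rand(100)  # placeholder: your class-1 scores on the test data

    ids = np.arange(1, len(p_test) + 1)  # instance numbers; confirm indexing vs. Y_random.txt
    np.savetxt("Y_submit.txt",
               np.column_stack((ids, p_test)),
               fmt="%d,%.5f",
               header="ID,Prob1",
               comments="")  # comments="" keeps the header line un-prefixed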
Also, check out this notebook (coming soon) showing how to load the data, train logistic regression and decision trees, and create the submission files that you can upload to Kaggle. I recommend you generate these and try submitting them as soon as possible.
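Until that notebook is posted, here is a rough sketch of the kind of pipeline it will cover; the file names and whitespace-delimited format are assumptions, so adjust them to the actual Kaggle data:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    # Assumed file names and format -- check the Kaggle data page.
    X_train = np.genfromtxt("X_train.txt", delimiter=None)
    Y_train = np.genfromtxt("Y_train.txt", delimiter=None)
    X_test  = np.genfromtxt("X_test.txt",  delimiter=None)

    # Two baseline learners; both expose predict_proba for soft class-1 scores.
    lr = LogisticRegression(max_iter=1000).fit(X_train, Y_train)
    dt = DecisionTreeClassifier(max_depth=8).fit(X_train, Y_train)

    p_lr = lr.predict_proba(X_test)[:, 1]
    p_dt = dt.predict_proba(X_test)[:, 1]
    # Write either array out using the submission snippet shown above.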
Note: Kaggle will limit you to at most 2 uploads per day, so you cannot simply upload every possible configuration and check its leaderboard quality. You will need some validation process to decide on your learners' parameter settings, etc., and upload only the models you believe are doing well.
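One simple validation process (a sketch only; the parameter grid is just an example) is to hold out part of the training data and keep whichever setting gives the best held-out AUC:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import roc_auc_score

    # Assumed file names, as in the earlier sketch.
    X_train = np.genfromtxt("X_train.txt", delimiter=None)
    Y_train = np.genfromtxt("Y_train.txt", delimiter=None)

    # Hold out 25% of the training data for validation.
    Xtr, Xva, Ytr, Yva = train_test_split(X_train, Y_train, test_size=0.25, random_state=0)

    best_depth, best_auc = None, -1.0
    for depth in [2, 4, 8, 16, None]:                 # example parameter grid
        model = DecisionTreeClassifier(max_depth=depth).fit(Xtr, Ytr)
        auc = roc_auc_score(Yva, model.predict_proba(Xva)[:, 1])
        if auc > best_auc:
            best_depth, best_auc = depth, auc

    print("best max_depth:", best_depth, "validation AUC: %.3f" % best_auc)
    # Retrain with best_depth on all the training data before uploading.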
Your project will consist of learning several predictors for the Kaggle data, as well as an ensemble “blend” of them, to try to do as well as possible at the prediction task. Specifically, learn at least three (more is encouraged) different types of models; suggestions include:
Then, take your models and combine them using a blending or stacking technique – this could be as simple as a straight average / vote, a weighted vote, or a stacked predictor (linear or other model type). Feel free to experiment and see what performance you can get.
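As one illustration (a sketch under the same assumed file names as above; the three base learners are arbitrary examples), the simplest blend weights each model's class-1 scores by its validation AUC, while a stacked version trains a small second-level model on the base models' held-out predictions:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import roc_auc_score

    X_train = np.genfromtxt("X_train.txt", delimiter=None)
    Y_train = np.genfromtxt("Y_train.txt", delimiter=None)
    X_test  = np.genfromtxt("X_test.txt",  delimiter=None)
    Xtr, Xva, Ytr, Yva = train_test_split(X_train, Y_train, test_size=0.25, random_state=0)

    # Three example base learners, trained on the same training split.
    models = [LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(max_depth=8),
              KNeighborsClassifier(n_neighbors=25)]
    models = [m.fit(Xtr, Ytr) for m in models]

    P_va   = np.column_stack([m.predict_proba(Xva)[:, 1] for m in models])
    P_test = np.column_stack([m.predict_proba(X_test)[:, 1] for m in models])

    # (a) Weighted blend: weight each model by its validation AUC.
    w = np.array([roc_auc_score(Yva, P_va[:, j]) for j in range(P_va.shape[1])])
    p_blend = P_test @ (w / w.sum())

    # (b) Stacked blend: a second-level model fit on the held-out base scores
    #     (never on the base models' training-set predictions).
    stacker = LogisticRegression().fit(P_va, Yva)
    p_stack = stacker.predict_proba(P_test)[:, 1]

Evaluate p_blend or p_stack on held-out data the same way as any single model before deciding what to upload.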
Note: you should do enough work to make sure each learner achieves “reasonable” performance, e.g., on the same approximate quality level as logistic regression and decision trees. (If it is significantly worse than the other models, it will not contribute usefully to the combination.)
Please turn in a two-page report on your individual models and the final ensemble. Please include:
A table listing each model, with its performance on training, validation, and leaderboard data, as well as the same for your blended or stacked combination(s).
A paragraph or two for each model describing it: what features you gave it (input features, expanded or reduced feature sets, etc.), and how it was trained (learner, software, any parameter settings, and how you decided on those settings).
A paragraph or two for your overall ensemble: how did you combine the individual models & why did you pick that technique, etc.
Apart from evaluating your project based on the quality of your writeup, you will also be given additional points depending on your position on the leaderboard.