The course project will be done in groups of three students. There are two options for your project: one is better suited to students who want to learn machine learning but do not necessarily plan to pursue it as a career, and the other is more suitable for students who are already familiar with machine learning and want to pursue a career in it (research or otherwise). The two options are:
More details about the two options are forthcoming.
Deadline: November 10
Once you have identified your teammates, coordinate and do the following:
Data: Our competition data are satellite-based measurements of cloud temperature (infrared imaging), being used to predict the presence or absence of rainfall at a particular location. The data are courtesy of the UC Irvine Center for Hydrometeorology and Remote Sensing, and have been pre-processed to extract features corresponding to a model they use actively for predicting rainfall across the globe. Each data point corresponds to a particular lat-long location where the model thinks there might be rain; the extracted features include information such as IR temperature at that location, and information about the corresponding cloud (area, average temperature, etc.). The target value is a binary indicator of whether there was rain (measured by radar) at that location; you will notice that the data are slightly imbalanced (positives make up about 30% of the training data).
Evaluation: Predictions are scored using AUC, the area under the ROC (receiver operating characteristic) curve. This gives an average of your learner’s performance at various levels of sensitivity to positive data. You will therefore likely do better if, instead of simply predicting the target class, you report your confidence in that class, so that the ROC curve can be evaluated at different levels of specificity. To do so, report your confidence in class +1 (as a real number); the predictions will then be sorted in order of confidence, and the ROC curve evaluated.
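To see why reporting confidences matters, here is a minimal sketch of AUC computed directly from its ranking interpretation: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The function name `auc` and the six toy labels and scores are made up for illustration; they are not the competition data, and Kaggle's own scorer may handle ties or edge cases differently.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (illustrative only): true labels and a learner's confidence in class 1.
labels = [1, 0, 1, 0, 0, 1]
soft = [0.9, 0.2, 0.6, 0.55, 0.1, 0.4]

# The same learner, but reporting only the thresholded class.
hard = [1 if s >= 0.5 else 0 for s in soft]

auc_soft = auc(labels, soft)  # 8 of 9 pairs ranked correctly
auc_hard = auc(labels, hard)  # many ties, so only 6 of 9 pairs' worth of credit
print(f"confidences: {auc_soft:.3f}  hard classes: {auc_hard:.3f}")
```

In this toy example the thresholded predictions lose credit because every tie between a positive and a negative is worth only half a pair. The pairwise loop is quadratic, so for a data set of realistic size you would use a sort-based implementation from a library instead.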
Participating: Download the training data (X_train, Y_train) and the test data features (X_test). You will train your models using the former, make predictions using the test data features, and upload them to Kaggle. Kaggle will then score your predictions, and report your performance on a subset used for placing your team on the current leaderboard (the “leaderboard” data). After the competition, the score on the remainder of the data will be used to determine your final standing; this ensures that your scores are not affected by overfitting to the leaderboard data.
Your submission has to be a file containing two columns separated by a comma. The first column should be the instance number, and the second column your score for class 1. The first line of the file should be “ID,Prob1”, i.e., the names of the two columns. We have made a sample submission file available, containing random predictions, called Y_random.txt.
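The file format described above can be produced with a few lines of standard-library Python. This is only a sketch: the helper name `write_submission` is hypothetical, and the starting instance number (`first_id=0`) and the six-decimal formatting are assumptions, so compare your output against Y_random.txt before uploading.

```python
import csv
import io


def write_submission(fileobj, probs, first_id=0):
    """Write 'ID,Prob1' followed by one 'instance,score' row per prediction.

    first_id is an assumption; check the sample file for the real numbering.
    """
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["ID", "Prob1"])
    for i, p in enumerate(probs, start=first_id):
        writer.writerow([i, f"{p:.6f}"])


# Demo on an in-memory buffer with made-up scores.
buf = io.StringIO()
write_submission(buf, [0.25, 0.8, 0.5])
submission_text = buf.getvalue()
print(submission_text)
```

To write a real file, open it with `open("Y_submit.txt", "w", newline="")` (the file name is just an example) and pass the file object in place of `buf`.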
Also, check out this notebook (coming soon) showing how to load the data, train logistic regression and decision trees, and create the submission files that you can upload to Kaggle. I recommend you generate these and try submitting them as soon as possible.
Note: Kaggle will limit you to at most 2 uploads per day, so you cannot simply upload every possible configuration and check its leaderboard quality. You will need to use some validation process to decide on your learners’ parameter settings, etc., and only upload models you believe are doing OK.
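One common validation process is k-fold cross-validation on the training data. The sketch below only generates the index splits; the name `kfold_indices`, the choice of 5 folds, and the fixed seed are illustrative assumptions, not requirements of the project.

```python
import random


def kfold_indices(n, k, seed=0):
    """Yield (train_indices, validation_indices) for each of k shuffled folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, val


# Demo: 10 data points, 5 folds.
splits = list(kfold_indices(10, 5))
for train, val in splits:
    print("train:", sorted(train), "validate:", sorted(val))
```

For each candidate parameter setting, you would train on the `train` indices, compute AUC on the `val` indices, and average over the folds; only the setting with the best average validation score is worth one of your daily uploads.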
I am looking for several elements to be present in any good project. These are:
Your team will produce a single write-up document, approximately 6 pages long, describing the problem you chose to tackle and the methods you used to address it, including which model(s) you tried, how you trained them, how you selected any parameters they might require, and how they performed on the test data. Consider including tables of performance of different approaches, or plots of performance used to perform model selection (i.e., parameters that control complexity). Within your document, please try to describe to the best of your ability who was responsible for which aspects (which learners, etc.), and how the team as a whole put the ideas together.
You are free to collaborate with other teams, including sharing ideas and even code, but please document where your predictions came from. (This also relaxes the proscription against posting code or asking for code help on Piazza, at least for project purposes.) For example, for any code you use, please say in your report who wrote the code and how it was applied (who determined the parameter settings and how, etc.). Collaboration is particularly relevant for learning ensembles of predictors: your teams may each supply a set of predictors, and then collaborate to learn an ensemble from the set.
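As a starting point for combining predictors from several teams, the sketch below blends their class-1 scores with a weighted average. This is the simplest possible combination rather than a learned ensemble; the function name `average_ensemble`, the team names, and the scores are all made up for illustration. A learned ensemble would instead fit the weights (or a second-stage model) on held-out validation data.

```python
def average_ensemble(score_lists, weights=None):
    """Blend several predictors' scores with a (weighted) average per instance."""
    if not score_lists:
        raise ValueError("need at least one predictor")
    n = len(score_lists[0])
    if any(len(s) != n for s in score_lists):
        raise ValueError("all predictors must score the same instances")
    if weights is None:
        weights = [1.0] * len(score_lists)  # unweighted average by default
    total = sum(weights)
    return [
        sum(w * s[i] for w, s in zip(weights, score_lists)) / total
        for i in range(n)
    ]


# Made-up class-1 scores from two teams on the same three test points.
team_a = [0.9, 0.2, 0.6]
team_b = [0.7, 0.4, 0.2]
blended = average_ensemble([team_a, team_b])
print(blended)
```

Because AUC depends only on the ordering of the scores, predictors whose outputs are on very different scales may need to be rescaled or converted to ranks before averaging.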
Some possible components of a successful project include: