Data Analytics Project

Kaggle Data Science Competition
www.kaggle.com
Overview: The centerpiece of this course is your participation in a data science competition. The
competition is hosted by Kaggle (www.kaggle.com), an online community of data scientists that
hosts private and public data science competitions. The goal of this project is for you to gain hands-on experience with the execution and implementation of real data science projects.
You will participate in one Kaggle competition, the House Prices Competition (www.kaggle.com/c/house-prices-advanced-regression-techniques). It will serve as an application of the data modeling concepts and
techniques that we discuss in class.
Project Deliverables: You will work on building the best possible model to predict house prices. Kaggle will
then score the accuracy of the predictions from that model and return a) your prediction score
and b) your position on the leaderboard (this is a worldwide competition, so do not get discouraged if
you do not find yourself at the top of the leaderboard!). The deliverable for this course is the
submission of your best score and your best leaderboard position. Both will be submitted via Canvas.
More details below…
Getting Started: As a first step, please visit www.kaggle.com and create an account.
Next, locate the competition for our class project:
The House Prices Competition (www.kaggle.com/c/house-prices-advanced-regression-techniques)
At this point, you may want to familiarize yourself with the competition, read the competition
description and download all of the associated data.
Your deliverable – Kaggle House Prices Competition:
Deliverable and due date: This deliverable is due in our last week of class. That is, you will need to have
made at least one proper submission to Kaggle by the last week of our class in order to have a valid
deliverable for this course. However, as you will see, Kaggle allows repeat submissions: you can submit
more than one prediction, up to 5 submissions per day.
You can therefore start making submissions early and, based on your score, refine your
model and re-submit (hopefully with a better score). For this course, only submit your best score:
your submission will be a Kaggle screenshot of your best score as well as your best
position on the leaderboard.
Step-by-step: Below you will find a description of some of the steps that you will need to take in order
to complete the competition.
• Data download: First, you will need to locate the House Prices Competition at
www.kaggle.com/c/house-prices-advanced-regression-techniques, read the competition description
and then download the data. Note that the data contains 4 different files:
o train.csv contains the data that you have available to build your best model. It contains
many different possible input variables (or predictors) such as the building class or the
zoning classification. Note that it also contains the output variable (or response) that we are
interested in: SalePrice. Our goal is to find the best model for SalePrice.
o test.csv contains the data that you will use to test your model. In fact, Kaggle will use that
data to evaluate how well your model predicts (relative to all other teams that have
participated in that competition). In that sense, test.csv contains the same information as
train.csv with one (important) exception: it does not contain information on the sale price,
or in other words, you will not find the column SalePrice in that data. In fact, your task will
be to predict the sale price as accurately as possible for each record in test.csv (using your best
model).
o data_description.txt contains a description of all data fields (but you can also find the same
data descriptions on the competition’s home page)
o sample_submission.csv contains a sample file in the format required for your submission
to Kaggle.

• Data upload to R Commander: In order to get started with the modeling process, you will first need
to upload your data into R Commander. Be sure to select the field separator as “Commas” as the
data files are formatted as comma separated. Also, you may want to enter the name “train” for the
training data and, similarly, “test” for the test data.
Be sure to click on “View data set” in order to check that the data has been uploaded properly.
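For reference, the import that R Commander performs through its menus corresponds roughly to the R commands below. This is a minimal sketch, not the exact script that R Commander generates. So that it runs on its own, it first writes a tiny toy file to a temporary folder (the file name train_toy.csv and its three rows are stand-ins for illustration); with the real data you would point read.csv at your downloaded train.csv and test.csv instead.

```r
# Toy stand-in for train.csv so the sketch runs on its own (illustrative values).
toy_path <- file.path(tempdir(), "train_toy.csv")
writeLines(c("Id,LotArea,SalePrice",
             "1,8450,208500",
             "2,9600,181500",
             "3,11250,223500"),
           toy_path)

# Read a comma-separated file with a header row into a data frame named "train".
# With the real data: train <- read.csv("train.csv"); test <- read.csv("test.csv")
train <- read.csv(toy_path, header = TRUE, sep = ",")

# Quick checks, similar in spirit to clicking "View data set".
str(train)    # column names and types
head(train)   # first few rows
dim(train)    # number of rows and columns
```

If the column names or the number of rows look wrong here, the field separator is the first thing to check.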
• Data Modeling & Model Optimization: The data modeling process will involve iterating through
several linear regression models based on the training data. We will discuss more on the modeling
process in class. In R Commander, you can find the linear regression model under Statistics > Fit
Models > Linear regression…
Be sure to select “SalePrice” as the Response variable and one (or more) variables as the Explanatory
variable(s).
After you click “OK” you will see the resulting estimated regression model.
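The menu steps above amount to fitting a model with R's lm function. The sketch below illustrates this on a small toy data frame (the values are for illustration only). The two explanatory variables, GrLivArea and OverallQual, are assumed here to be columns of train.csv; please check data_description.txt before relying on them. The model name RegModel.1 follows the naming that R Commander uses for its first regression model.

```r
# Toy training data so the sketch runs on its own (illustrative values).
train <- data.frame(
  GrLivArea   = c(1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090),
  OverallQual = c(7, 6, 7, 7, 8, 5, 8, 7),
  SalePrice   = c(208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000)
)

# Fit a linear regression with SalePrice as the response and two
# explanatory variables (the choice of predictors is only an example).
RegModel.1 <- lm(SalePrice ~ GrLivArea + OverallQual, data = train)

# Coefficient estimates, standard errors, R-squared, and so on.
summary(RegModel.1)
```

Iterating on the model then means changing the right-hand side of the formula (adding, removing or transforming explanatory variables) and comparing the resulting summaries.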
• Deploying your best model to the test data: Once you feel confident that you have found your best
model, you want to apply that model to predict the records from the test data. This can be done
directly inside of R Commander. However, in order to make this process as seamless as possible, we
first have to install an add-on package to R Commander. The name of this add-on package is
“Rcmdr.Plugin.UCA”. Here is how you install this add-on package:
• Installing the add-on package Rcmdr.Plugin.UCA:
First, go to the R Console and select “Install package(s)…”
A new window will pop-up asking you to select a mirror (i.e. server); choose a location close to you.
Another window will pop-up with all available R packages. Scroll down until you see
Rcmdr.Plugin.UCA; select that package and click OK; this will start an installation process that may
take several minutes.
When the installation process has finished, close R Commander and R, and re-start: first re-start
R, then re-start R Commander.
Once R Commander has re-started, go to “Tools” and select “Load Rcmdr plug-in(s)”.
Select the plug-in that you just installed (Rcmdr.Plugin.UCA) and click “OK”.
R Commander will ask you to re-start; click OK.
Once R Commander has re-started, you will find the new option “Predict using active model” under the
“Models” menu; we will use that option for our Kaggle Competition.
• At this point, you should have installed the add-on package Rcmdr.Plugin.UCA. You should also have
a pretty good idea about your best model (or at least your first best model that you want to submit
to Kaggle – you can always refine your model and submit additional models later). I will assume that
you have already uploaded the train data train.csv into Rcmdr; be sure that you also upload the test
data test.csv at this point.
The test data should look just like the train data, except that the last column (i.e. the column
with SalePrice) is missing.
Then, with your best model from the training data as the active model (be sure to select “train” in Rcmdr when you
run that model), click on Models > Predict using active model > Add predictions to existing dataset….
Select the test data “test” and click OK.
If you now click on “View data set” for the test data, you should find that a new column has
been appended to the test data; this new column is named “fitted.RegModel.1” (or has a similar
number, depending on how many different regression models you have already run at this
point). This new column is your predicted SalePrice for the test data.
You can export these predictions to a .csv file by going to the “Data” tab, then selecting
“Active data set” and then “Export active data set…”.
Select how you want your data to be exported (you might want to select the field separator
“commas” so that your exported data is also a comma-separated .csv file).
You can now open your exported test data and double check that you see the new column with your
predictions for SalePrice.
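The prediction and export steps above can also be sketched directly in R. The example below is an illustration on toy data, not the code that the plug-in runs internally: it fits a one-variable model, appends a prediction column named fitted.RegModel.1 to the test data, and writes the result to a temporary file. The predictor GrLivArea, the toy values and the output file name are assumptions made for illustration. One toy test record deliberately has a missing predictor value, to show that R's predict function then returns NA for that record.

```r
# Toy data so the sketch runs on its own (illustrative values).
train <- data.frame(
  GrLivArea = c(1710, 1262, 1786, 1717, 2198, 1362),
  SalePrice = c(208500, 181500, 223500, 140000, 250000, 143000)
)
test <- data.frame(Id = c(1461, 1462, 1463),
                   GrLivArea = c(896, 1329, NA))

RegModel.1 <- lm(SalePrice ~ GrLivArea, data = train)

# Append predictions for the test records as a new column.
test$fitted.RegModel.1 <- predict(RegModel.1, newdata = test)

# Count records whose prediction is missing because a predictor was missing.
sum(is.na(test$fitted.RegModel.1))

# Export the test data, including the new column, as a comma-separated file.
out_path <- file.path(tempdir(), "test_with_predictions.csv")
write.csv(test, out_path, row.names = FALSE)
```

If the count of missing predictions is not zero for your real test data, it is worth looking at which of your explanatory variables have missing values before you build the submission file.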
• Creating your submission file: Next, you want to merge the predictions with the identifier (ID)
records from the test data (i.e. from test.csv) in order to get a suitable submission file. The
submission file should only contain the identifier
(ID) in the first column and the predicted sale price (SalePrice) in the second column.
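One possible way to build the two-column file in R is sketched below. It assumes that the identifier column in test.csv is named Id and that Kaggle expects the header names Id and SalePrice; please compare with sample_submission.csv to confirm both. The toy values and the output file name submission.csv are for illustration only.

```r
# Toy test data that already carries the prediction column (illustrative values).
test <- data.frame(
  Id                = c(1461, 1462, 1463),
  GrLivArea         = c(896, 1329, 1629),
  fitted.RegModel.1 = c(120000.5, 158000.2, 183000.9)
)

# Keep only the identifier and the prediction, renamed to SalePrice.
submission <- data.frame(Id = test$Id, SalePrice = test$fitted.RegModel.1)

# Every record needs a prediction; stop early if any are missing.
stopifnot(!anyNA(submission$SalePrice))

# Write the file without R's row names so it has exactly two columns.
sub_path <- file.path(tempdir(), "submission.csv")
write.csv(submission, sub_path, row.names = FALSE)

# Show the header line and the first record of the written file.
readLines(sub_path, n = 2)
```

Setting row.names = FALSE matters here: without it, write.csv adds an extra unnamed first column of row numbers, and the file no longer has the two-column layout described above.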
• Submitting your predictions to Kaggle: Once you have created your submission file, go back to the
home page for the House Prices competition and locate the “submit predictions” tab. Click on that
tab and upload your submission file.
Then complete your submission. If all goes well, you should now see your score,
and you should also find your position on the leaderboard.
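Because the number of daily submissions is limited, it can help to estimate your score locally before you upload. The competition's evaluation page is the authoritative description of the score; the sketch below assumes it is the root-mean-squared error between the logarithm of the predicted and the logarithm of the observed sale price. The helper name rmse_log, the toy values and the simple split into fitting rows and held-out rows are all choices made for illustration.

```r
# Root-mean-squared error between log(predicted) and log(actual) prices
# (assumed scoring rule; confirm on the competition's evaluation page).
rmse_log <- function(actual, predicted) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

# Toy data (illustrative values): hold out the last rows as a validation set.
train <- data.frame(
  GrLivArea = c(1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 1077),
  SalePrice = c(208500, 181500, 223500, 140000, 250000,
                143000, 307000, 200000, 129900, 118000)
)
fit_rows   <- 1:7
valid_rows <- 8:10

# Fit on one part of the training data, predict the held-out part.
model <- lm(SalePrice ~ GrLivArea, data = train[fit_rows, ])
pred  <- predict(model, newdata = train[valid_rows, ])

# Local estimate of the score; lower is better.
rmse_log(train$SalePrice[valid_rows], pred)
```

A local estimate like this will not match the Kaggle score exactly, but it lets you compare candidate models against each other without using up submissions.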
Deliverables:
• Please submit your best score and your best position on the leaderboard to Canvas. Your
score and leaderboard position will be used to evaluate your efforts (relative to those of the other
students in the course) and to assign your grade for this project.
That’s it! Good luck with your data science competition.