Kaggle in 30 minutes: House Prices competition. Part 2

In the previous article we learnt what Kaggle is and what sections it has. Now let’s take a look at one of the most basic Kaggle competitions – House Prices.

The first thing to do is to familiarize yourself with the competition’s purpose, rules and data. It is also worth recalling the Kaggle basics from the first article.

The main goal is to predict house prices based on a number of features such as location, square footage, number of rooms, garage, etc.

This is a regression problem, and we will proceed from that.
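Submissions in this competition are scored by the root mean squared error between the logarithm of the predicted price and the logarithm of the actual price, so it is useful to have that metric handy for local checks. Below is a minimal sketch of such a helper (our own code, not part of the competition materials):

import numpy as np

def rmsle(y_true, y_pred):
    # RMSE between log-prices, mirroring the competition metric
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# A perfect prediction gives 0.0
print(rmsle(np.array([200000.0]), np.array([200000.0])))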

The data consists of four files:

  • train.csv – the training sample.
  • test.csv – the test data, for which we will make our predictions.
  • data_description.txt – a full description of each column.
  • sample_submission.csv – an example of what our answer (submission) should look like.

All code is reproduced in Jupyter notebook cells.
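Before writing any modelling code it can be helpful to peek at sample_submission.csv to see the expected output format – two columns, Id and SalePrice. A minimal sketch (the ../input/ path follows the standard Kaggle kernel layout):

import pandas as pd

sample_sub = pd.read_csv('../input/sample_submission.csv')
sample_sub.head()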

First, we load the test and training samples.

import numpy as np
import pandas as pd
train_df = pd.read_csv('../input/train.csv')
test_df  = pd.read_csv('../input/test.csv')
# Let’s check the data
train_df.head()

Next we need to check the dimensions of the dataset. We will use the training part for the analysis.

train_df.shape

We get (1460, 81), i.e. 1460 rows and 81 columns.

Data overview – target variable

The first thing we have to do is to look at our target variable SalePrice.

train_df['SalePrice'].describe()

It seems that the house price deviates significantly from a normal distribution:

  • The standard deviation is high relative to the mean.
  • The minimum is greater than 0 (which is logical for real estate prices).
  • There is a large gap between the minimum and the 25th percentile.
  • The gap between the 75th percentile and the maximum is much larger than the gap between the minimum and the 25th percentile, i.e. the right tail is longer (the percentile gaps can be checked numerically, as sketched below).
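These observations about the percentile gaps can be verified with a quick sketch on the same column:

# Compare the gap below the 25th percentile with the gap above the 75th percentile
q = train_df['SalePrice'].quantile([0.0, 0.25, 0.75, 1.0])
print('min to 25th percentile:', q[0.25] - q[0.0])
print('75th percentile to max:', q[1.0] - q[0.75])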

We should create a histogram to finally see what kind of distribution we are dealing with.

# Importing necessary libraries for visualisation
import matplotlib.pyplot as plt
import seaborn as sns
# histogram
f, ax = plt.subplots(figsize=(8, 6))
sns.histplot(train_df['SalePrice'], kde=True)  # distplot is deprecated in recent seaborn versions

As we assumed, the distribution is far from ideal. Let’s make more observations:

# Calculate asymmetry and kurtosis
print("Asymmetry: %f" % train_df['SalePrice'].skew())
print("Kurtosis: %f" % train_df['SalePrice'].kurt())

Asymmetry: 1.882876
Kurtosis: 6.536282

Something needs to be done about this. Perhaps a logarithmic transformation of the target variable would help? Let’s create two probability plots: one for the original data and one after applying this transformation:

from scipy import stats
fig = plt.figure(figsize = (14,8))
# Distribution on raw data
fig.add_subplot(1,2,1)
res = stats.probplot(train_df['SalePrice'], plot=plt)
# Distribution after log-transforming SalePrice
fig.add_subplot(1,2,2)
res = stats.probplot(np.log1p(train_df['SalePrice']), plot=plt)

The log transformation clearly improves the situation. Our task now is not just to log-transform the data for the plot, but to apply the transformation to the target variable in the whole training sample:

train_df['SalePrice'] = np.log1p(train_df['SalePrice']) 

Data overview – correlation

Now let’s look at which features the target variable SalePrice correlates with:

# Correlation matrix
corrmat = train_df.corr(numeric_only=True)  # numeric_only skips text columns in newer pandas versions
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);

In such a graph it is not easy to select the features we need.

Let’s try a truncated version and reduce the number of correlated features to 10:

k = 10  # the number of most correlated features we want to see
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(train_df[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True,
                 fmt='.2f', annot_kws={'size': 10},
                 yticklabels=cols.values, xticklabels=cols.values)
plt.show() 

Now we see that SalePrice correlates best with GrLivArea and OverallQual. Let us check these two attributes for outliers:

fig, ax = plt.subplots()
ax.scatter(x = train_df['GrLivArea'], y = train_df['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
fig, ax = plt.subplots()
ax.scatter(x = train_df['OverallQual'], y = train_df['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('OverallQual', fontsize=13)
plt.show()

The outliers are not numerous; however, removing a few of the most extreme points improves the model’s result.

# Removing outliers
# These two removals alone help improve our leaderboard score
# Note: SalePrice was log-transformed above, so the price thresholds are log-transformed too
train_df = train_df.drop(train_df[(train_df['OverallQual'] > 9) & (train_df['SalePrice'] < np.log1p(220000))].index)
train_df = train_df.drop(train_df[(train_df['GrLivArea'] > 4000) & (train_df['SalePrice'] < np.log1p(300000))].index)

Data cleaning and feature selection

It’s worth examining the data for missing values and other things that might spoil the score (and hence our position in the competition leaderboard).

This line of code outputs the 20 columns with the most missing values:

# Missing values
train_df.isnull().sum().sort_values(ascending=False).head(20)

The scale of missing values will be better visible in the chart:

# Let’s visualise
total = train_df.isnull().sum().sort_values(ascending=False)
percent = (train_df.isnull().sum() / train_df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
# Bar chart of the 20 columns with the most missing values
percent_data = percent.head(20)
percent_data.plot(kind="bar", figsize=(8, 6), fontsize=10)
plt.xlabel("Columns", fontsize=20)
plt.ylabel("Share of missing values", fontsize=20)
plt.title("Columns with the most missing values", fontsize=20)

This needs to be dealt with. A large amount of missing data in both training and test datasets will hit the quality of the model very hard. This is a direct road to the bottom of the leaderboard in the competition.

Let’s fix the problem on the merged data.

# Remove rows where the target value is missing
Target = 'SalePrice'
train_df.dropna(axis=0, subset=[Target], inplace=True)
# Combine the training & test datasets to perform our transformations on all the data
all_data = pd.concat([train_df.iloc[:, :-1], test_df], axis=0)
print('Training dataset contains {} rows and {} features'.format(train_df.shape[0], train_df.shape[1]))
print('Test dataset contains {} rows and {} features'.format(test_df.shape[0], test_df.shape[1]))
print('The combined dataset contains {} rows and {} features'.format(all_data.shape[0], all_data.shape[1]))
# Keep the test Ids for the submission file, then drop the useless Id column
test_ids = test_df['Id']
all_data = all_data.drop(columns=['Id'], axis=1)

We can now fully deal with the missing data.

# Function to view missing data
def missingValuesInfo(df):
    total = df.isnull().sum().sort_values(ascending=False)
    percent = round(df.isnull().sum().sort_values(ascending=False) / len(df) * 100, 2)
    temp = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    return temp.loc[temp['Total'] > 0]
missingValuesInfo(all_data)

# Let’s deal with the missing data
# Numerical columns are selected by dtype ('int64', 'float64')
# Categorical columns are selected by dtype ('object')
def HandleMissingValues(df):
    num_cols = [cname for cname in df.columns if df[cname].dtype in ['int64', 'float64']]
    cat_cols = [cname for cname in df.columns if df[cname].dtype == "object"]
    values = {}
    for a in cat_cols:
        values[a] = 'UNKNOWN'
    for a in num_cols:
        # Fill numeric gaps with the column median (one reasonable choice)
        values[a] = df[a].median()
    df.fillna(value=values, inplace=True)
HandleMissingValues(all_data)
# Check it
all_data.isnull().sum().sum()

The result is 0.

Great, we have solved the basic problem. Don’t forget about the categorical features either.

# Dealing with categorical features
def getObjectColumnsList(df):
    return [cname for cname in df.columns if df[cname].dtype == "object"]
def PerformOneHotEncoding(df, columnsToEncode):
    return pd.get_dummies(df, columns=columnsToEncode)
cat_cols = getObjectColumnsList(all_data)
all_data = PerformOneHotEncoding(all_data, cat_cols)
all_data.head()

The basic data cleaning and feature selection is done. Now we can split the combined data back into a training set and a test set – we will train on the former and make predictions for the latter. Note that we removed a few outlier rows earlier, so we split by the actual number of training rows rather than by the original 1460.

# train_df still holds the (log-transformed) target, so save it before overwriting
y = train_df['SalePrice']
ntrain = train_df.shape[0]  # actual number of training rows after outlier removal
train_df = all_data.iloc[:ntrain, :]
test_df = all_data.iloc[ntrain:, :]
X = train_df  # features for training
print(train_df.shape)
print(test_df.shape)

Modelling

The House Prices competition is a regression task, so we will use appropriate models.

Ridge regression

from sklearn.linear_model import RidgeCV
ridge_cv = RidgeCV(alphas = (0.01, 0.05, 0.1, 0.3, 1, 3, 5, 10))
ridge_cv.fit(X, y)
ridge_cv_preds = ridge_cv.predict(test_df)

XGBRegressor

import xgboost as xgb 
model_xgb = xgb.XGBRegressor(n_estimators=340, max_depth=2, learning_rate=0.2)
model_xgb.fit(X, y)
xgb_preds = model_xgb.predict(test_df)
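Before blending the two models, it is worth sanity-checking each of them locally. A minimal sketch using scikit-learn’s cross_val_score (assuming a reasonably recent scikit-learn for the 'neg_root_mean_squared_error' scorer; the scores are RMSE on the log-transformed target, since y is already on the log scale):

from sklearn.model_selection import cross_val_score

# 5-fold RMSE on the log-transformed target (lower is better)
for name, model in [('ridge', ridge_cv), ('xgb', model_xgb)]:
    scores = -cross_val_score(model, X, y, cv=5, scoring='neg_root_mean_squared_error')
    print(name, scores.mean())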

Let’s take the average value from both models:

predictions = (ridge_cv_preds + xgb_preds) / 2

And create a dataframe with our solution.

submission = {
    'Id': test_ids.values,  # the Ids we saved before dropping the column
    'SalePrice': np.expm1(predictions)  # undo the log1p transformation of the target
}
solution = pd.DataFrame(submission)
solution.to_csv('submission.csv', index=False)

Conclusion

The purpose of this article is to give you a basic understanding of the pipeline you need to successfully take part in various Kaggle competitions.

This includes:

  • Loading the data, scrutinising it and then cleaning it up.
  • Selecting features and creating new ones if necessary.
  • Selecting the right model (in advanced cases, an ensemble of several models) and suitable parameters.
  • Making predictions and submitting the solution.

Based on this, you can refine the basic solution described above: for example, handle missing data individually for each feature rather than in a single loop, create new features based on existing ones, or find parameters that will improve the model’s score on the leaderboard.
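As an example of a new feature, house-area columns are often combined into a single total area. The sketch below assumes the standard column names from data_description.txt and should be applied to all_data before splitting it back into train and test:

# Hypothetical extra feature: total area = basement + first floor + second floor
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']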

Finding the best solution in a Kaggle competition is an art, which you can master by combining a wide variety of techniques with unconventional methods.
