
Dataquest Guided Project - Predicting House Sale Prices


In this project, we are going to apply machine learning algorithms to predict the price of a house using ‘AmesHousing.tsv’. In order to do so, we’ll have to transform the data and apply various feature engineering techniques.

We will focus on the linear regression model and use RMSE as the error metric. First, let's explore the data.
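Since RMSE comes up throughout, here is a quick self-contained sketch of the metric with made-up numbers (not from this dataset): it is the square root of the mean squared difference between predictions and actual values.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual vs. predicted sale prices
actual = np.array([215000.0, 105000.0, 172000.0])
predicted = np.array([205000.0, 110000.0, 180000.0])

# RMSE by hand: sqrt(mean((actual - predicted)^2))
rmse_manual = np.sqrt(np.mean((actual - predicted) ** 2))

# Same result via scikit-learn's MSE, followed by a square root
rmse_sklearn = mean_squared_error(actual, predicted) ** 0.5
```

A lower RMSE means predictions are, on average, closer to the actual prices, and it is expressed in the same units as 'SalePrice'.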

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
%matplotlib inline

pd.set_option('display.max_columns', 500)
data = pd.read_csv("AmesHousing.tsv", delimiter='\t')
print(data.shape)
print(len(str(data.shape))*'-')
print(data.dtypes.value_counts())
data.head()
(2930, 82)
----------
object     43
int64      28
float64    11
dtype: int64
(data.head() output: the first five rows across all 82 columns, from 'Order' and 'PID' through 'SalePrice'; too wide to reproduce here.)

Data Cleaning and Feature Engineering


This dataset has a total of 82 columns and 2930 rows. Since we’ll be using the linear regression model, we can only use numerical values in our model. One of the most important aspects of machine learning is knowing the features. Here are a couple things we can do to clean up the data:

  • The ‘Order’ and ‘PID’ columns are not useful for machine learning as they are simply identification numbers.

  • It doesn't make much sense to use 'Year Built' and 'Year Remod/Add' directly in our model. We should generate a new column measuring how old the house is since the last remodelling.

  • We want to drop columns with too many missing values; let's start with a 5% threshold for now.

  • We don’t want to leak sales information to our model. Sales information will not be available to us when we actually use the model to estimate the price of a house.

#Create a new feature, 'years_to_sell'.
data['years_to_sell'] = data['Yr Sold'] - data['Year Remod/Add'] 
data = data[data['years_to_sell'] >= 0]

#Remove features that are not useful for machine learning.
data = data.drop(['Order', 'PID'], axis=1)

#Remove features that leak sales data.
data = data.drop(['Mo Sold', 'Yr Sold', 'Sale Type', 'Sale Condition'], axis=1)

#Drop columns with more than 5% missing values
is_null_counts = data.isnull().sum()
features_col = is_null_counts[is_null_counts < len(data)*0.05].index


data = data[features_col]
data.head()
(data.head() output after cleaning: the first five rows of the remaining columns, now including the new 'years_to_sell' feature; too wide to reproduce here.)

Since we are dealing with a dataset with a large number of columns, it is a good idea to split the data into two dataframes. We'll first work with the 'float' and 'int' columns, then move the 'object' columns into a new dataframe. Once both dataframes contain only numerical values, we can combine them again and use the features in our linear regression model.

There are quite a few NA values in the numerical columns, so we'll fill them with the mode. Some of these columns are categorical in nature, so it wouldn't make sense to use the median or mean here.
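To see why the mode works for filling, note that DataFrame.mode() returns a DataFrame (there can be ties), so we take its first row with .iloc[0]. A toy sketch with hypothetical values, not the actual dataset:

```python
import pandas as pd
import numpy as np

# Hypothetical columns with a missing value
toy = pd.DataFrame({
    'Garage Cars': [2.0, 2.0, 1.0, np.nan],  # mode is 2.0
    'Overall Cond': [5, 5, 7, 6],
})

# .mode() returns a DataFrame; .iloc[0] picks the most frequent value per column
filled = toy.fillna(toy.mode().iloc[0])
```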

numerical_cols = data.dtypes[data.dtypes != 'object'].index
numerical_data = data[numerical_cols]

numerical_data = numerical_data.fillna(data.mode().iloc[0])
numerical_data.isnull().sum().sort_values(ascending = False)
years_to_sell      0
BsmtFin SF 2       0
Gr Liv Area        0
Low Qual Fin SF    0
2nd Flr SF         0
1st Flr SF         0
Total Bsmt SF      0
Bsmt Unf SF        0
BsmtFin SF 1       0
SalePrice          0
Mas Vnr Area       0
Year Remod/Add     0
Year Built         0
Overall Cond       0
Overall Qual       0
Lot Area           0
Bsmt Full Bath     0
Bsmt Half Bath     0
Full Bath          0
Half Bath          0
Bedroom AbvGr      0
Kitchen AbvGr      0
TotRms AbvGrd      0
Fireplaces         0
Garage Cars        0
Garage Area        0
Wood Deck SF       0
Open Porch SF      0
Enclosed Porch     0
3Ssn Porch         0
Screen Porch       0
Pool Area          0
Misc Val           0
MS SubClass        0
dtype: int64

Next, let's check the correlation of each numerical column with 'SalePrice'.

num_corr = numerical_data.corr()['SalePrice'].abs().sort_values(ascending = False)
num_corr
SalePrice          1.000000
Overall Qual       0.801206
Gr Liv Area        0.717596
Garage Cars        0.648361
Total Bsmt SF      0.644012
Garage Area        0.641425
1st Flr SF         0.635185
Year Built         0.558490
Full Bath          0.546118
years_to_sell      0.534985
Year Remod/Add     0.533007
Mas Vnr Area       0.506983
TotRms AbvGrd      0.498574
Fireplaces         0.474831
BsmtFin SF 1       0.439284
Wood Deck SF       0.328183
Open Porch SF      0.316262
Half Bath          0.284871
Bsmt Full Bath     0.276258
2nd Flr SF         0.269601
Lot Area           0.267520
Bsmt Unf SF        0.182751
Bedroom AbvGr      0.143916
Enclosed Porch     0.128685
Kitchen AbvGr      0.119760
Screen Porch       0.112280
Overall Cond       0.101540
MS SubClass        0.085128
Pool Area          0.068438
Low Qual Fin SF    0.037629
Bsmt Half Bath     0.035875
3Ssn Porch         0.032268
Misc Val           0.019273
BsmtFin SF 2       0.006127
Name: SalePrice, dtype: float64

We can drop columns with a correlation below 0.4 for now. Later, we'll make this cutoff an adjustable parameter in a function.

num_corr = num_corr[num_corr > 0.4]
high_corr_cols = num_corr.index

hi_corr_numerical_data = numerical_data[high_corr_cols]

For the 'object' (text) columns, we'll drop any column that has missing values.

text_cols = data.dtypes[data.dtypes == 'object'].index
text_data = data[text_cols]

text_null_counts = text_data.isnull().sum()
text_not_null_cols = text_null_counts[text_null_counts < 1].index

text_data = text_data[text_not_null_cols]

From the documentation, we want to convert any nominal columns into categories. 'MS SubClass' is a numerical column, but it should be categorical.

For the text columns, we’ll take the list of nominal columns from the documentation and use a for loop to search for matches.

nominal_cols = ['MS Zoning', 'Street', 'Alley', 'Land Contour', 'Lot Config', 'Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type', 'House Style', 'Overall Qual', 'Roof Style', 'Roof Matl', 'Exterior 1st',  'Exterior 2nd', 'Mas Vnr Type', 'Foundation', 'Heating', 'Central Air'] 
nominal_num_col = ['MS SubClass']
#Finds nominal columns in text_data
nominal_text_col = []
for col in nominal_cols:
    if col in text_data.columns:
        nominal_text_col.append(col)
nominal_text_col
['MS Zoning',
 'Street',
 'Land Contour',
 'Lot Config',
 'Neighborhood',
 'Condition 1',
 'Condition 2',
 'Bldg Type',
 'House Style',
 'Roof Style',
 'Exterior 1st',
 'Exterior 2nd',
 'Foundation',
 'Heating',
 'Central Air']

We can then select just these columns to keep the relevant features in our text dataframe.

text_data = text_data[nominal_text_col]
for col in nominal_text_col:
    print(col)
    print(text_data[col].value_counts())
    print("-"*10)
MS Zoning
RL         2270
RM          462
FV          139
RH           27
C (all)      25
A (agr)       2
I (all)       2
Name: MS Zoning, dtype: int64
----------
Street
Pave    2915
Grvl      12
Name: Street, dtype: int64
----------
Land Contour
Lvl    2632
HLS     120
Bnk     115
Low      60
Name: Land Contour, dtype: int64
----------
Lot Config
Inside     2138
Corner      510
CulDSac     180
FR2          85
FR3          14
Name: Lot Config, dtype: int64
----------
Neighborhood
NAmes      443
CollgCr    267
OldTown    239
Edwards    192
Somerst    182
NridgHt    165
Gilbert    165
Sawyer     151
NWAmes     131
SawyerW    125
Mitchel    114
BrkSide    108
Crawfor    103
IDOTRR      93
Timber      72
NoRidge     71
StoneBr     51
SWISU       48
ClearCr     44
MeadowV     37
BrDale      30
Blmngtn     28
Veenker     24
NPkVill     23
Blueste     10
Greens       8
GrnHill      2
Landmrk      1
Name: Neighborhood, dtype: int64
----------
Condition 1
Norm      2520
Feedr      164
Artery      92
RRAn        50
PosN        38
RRAe        28
PosA        20
RRNn         9
RRNe         6
Name: Condition 1, dtype: int64
----------
Condition 2
Norm      2898
Feedr       13
Artery       5
PosA         4
PosN         3
RRNn         2
RRAn         1
RRAe         1
Name: Condition 2, dtype: int64
----------
Bldg Type
1Fam      2422
TwnhsE     233
Duplex     109
Twnhs      101
2fmCon      62
Name: Bldg Type, dtype: int64
----------
House Style
1Story    1480
2Story     871
1.5Fin     314
SLvl       128
SFoyer      83
2.5Unf      24
1.5Unf      19
2.5Fin       8
Name: House Style, dtype: int64
----------
Roof Style
Gable      2320
Hip         549
Gambrel      22
Flat         20
Mansard      11
Shed          5
Name: Roof Style, dtype: int64
----------
Exterior 1st
VinylSd    1025
MetalSd     450
HdBoard     442
Wd Sdng     420
Plywood     221
CemntBd     124
BrkFace      88
WdShing      56
AsbShng      44
Stucco       43
BrkComm       6
Stone         2
CBlock        2
AsphShn       2
ImStucc       1
PreCast       1
Name: Exterior 1st, dtype: int64
----------
Exterior 2nd
VinylSd    1014
MetalSd     447
HdBoard     406
Wd Sdng     397
Plywood     274
CmentBd     124
Wd Shng      81
Stucco       47
BrkFace      47
AsbShng      38
Brk Cmn      22
ImStucc      15
Stone         6
AsphShn       4
CBlock        3
PreCast       1
Other         1
Name: Exterior 2nd, dtype: int64
----------
Foundation
PConc     1307
CBlock    1244
BrkTil     311
Slab        49
Stone       11
Wood         5
Name: Foundation, dtype: int64
----------
Heating
GasA     2882
GasW       27
Grav        9
Wall        6
OthW        2
Floor       1
Name: Heating, dtype: int64
----------
Central Air
Y    2731
N     196
Name: Central Air, dtype: int64
----------

Columns with too many categories can cause overfitting. We’ll remove any columns with more than 10 categories. We’ll write a function later to adjust this as a parameter in our feature selection.

nominal_text_col_unique = []
for col in nominal_text_col:
    if len(text_data[col].value_counts()) <= 10:
        nominal_text_col_unique.append(col)
               
text_data = text_data[nominal_text_col_unique]

Finally, we can use the pd.get_dummies function to create dummy columns for all the categorical columns.
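As a toy illustration of what pd.get_dummies produces (hypothetical values, not our housing data):

```python
import pandas as pd

toy = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})

# One 0/1 indicator column per category: 'Street_Grvl' and 'Street_Pave'
dummies = pd.get_dummies(toy)
```

Each row has exactly one 1 among the dummy columns generated from a given source column.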

#Create dummy columns for nominal text columns, then create a dataframe.
for col in text_data.columns:
    text_data[col] = text_data[col].astype('category')   
categorical_text_data = pd.get_dummies(text_data)    
categorical_text_data.head()
(categorical_text_data.head() output: the first five rows of the 0/1 dummy columns, from 'MS Zoning_A (agr)' through 'Central Air_Y'; too wide to reproduce here.)
#Create dummy columns for nominal numerical columns, then create a dataframe.
for col in numerical_data.columns:
    if col in nominal_num_col:
        numerical_data[col] = numerical_data[col].astype('category')  
              
categorical_numerical_data = pd.get_dummies(numerical_data.select_dtypes(include=['category'])) 

Using the pd.concat() function, we can combine these two dummy-coded dataframes.

categorical_data = pd.concat([categorical_text_data, categorical_numerical_data], axis=1)

We end up with one numerical dataframe, and one categorical dataframe. We can then combine them into one dataframe for machine learning.

hi_corr_numerical_data.head()
(hi_corr_numerical_data.head() output: the first five rows of the 15 selected numerical columns, from 'SalePrice' through 'BsmtFin SF 1'; too wide to reproduce here.)
categorical_data.head()
(categorical_data.head() output: the first five rows of the combined dummy columns, from 'MS Zoning_A (agr)' through 'MS SubClass_190'; too wide to reproduce here.)
final_data = pd.concat([hi_corr_numerical_data, categorical_data], axis=1)

Creating Functions with Adjustable Parameters


When we did our data cleaning, we decided to remove columns with more than 5% missing values. We can incorporate this into a function as an adjustable parameter. In addition, this function performs all the data cleaning operations explained above.

def transform_features(data, percent_missing=0.05):
    
    #Add relevant features. Note: both are measured from the sale year, so the
    #>= 0 filters only remove rows with inconsistent dates.
    data['years_since_remod'] = data['Yr Sold'] - data['Year Remod/Add']
    data['years_to_sell'] = data['Yr Sold'] - data['Year Built']
    data = data[data['years_since_remod'] >= 0]
    data = data[data['years_to_sell'] >= 0]
    
    #Remove columns not useful for machine learning
    data = data.drop(['Order', 'PID', 'Year Built', 'Year Remod/Add'], axis=1)
    
    #Remove columns that leak sale data
    data = data.drop(['Mo Sold', 'Yr Sold', 'Sale Type', 'Sale Condition'], axis=1)
    
    #Drop columns with too many missing values defined by the function
    is_null_counts = data.isnull().sum()
    low_NaN_cols = is_null_counts[is_null_counts < len(data)*percent_missing].index
    
    transformed_data = data[low_NaN_cols]    
    return transformed_data

For the feature engineering and selection step, we chose columns that had more than 0.4 correlation with ‘SalePrice’ and removed any columns with more than 10 categories.

Once again, I’ve combined all the work we’ve done previously into a function with adjustable parameters.

def select_features(data, corr_threshold=0.4, unique_threshold=10):  
    
    #Fill missing numerical columns with the mode.
    numerical_cols = data.dtypes[data.dtypes != 'object'].index
    numerical_data = data[numerical_cols]

    numerical_data = numerical_data.fillna(data.mode().iloc[0])

    #Drop text columns with any missing values.
    text_cols = data.dtypes[data.dtypes == 'object'].index
    text_data = data[text_cols]

    text_null_counts = text_data.isnull().sum()
    text_not_null_cols = text_null_counts[text_null_counts < 1].index

    text_data = text_data[text_not_null_cols]

    num_corr = numerical_data.corr()['SalePrice'].abs().sort_values(ascending = False)

    num_corr = num_corr[num_corr > corr_threshold]
    high_corr_cols = num_corr.index

    #Apply the correlation threshold parameter
    hi_corr_numerical_data = numerical_data[high_corr_cols]
    

    #Nominal columns from the documentation
    nominal_cols = ['MS Zoning', 'Street', 'Alley', 'Land Contour', 'Lot Config', 'Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type', 'House Style', 'Overall Qual', 'Roof Style', 'Roof Matl', 'Exterior 1st',  'Exterior 2nd', 'Mas Vnr Type', 'Foundation', 'Heating', 'Central Air'] 
    nominal_num_col = ['MS SubClass']

    #Finds nominal columns in text_data
    nominal_text_col = []
    for col in nominal_cols:
        if col in text_data.columns:
            nominal_text_col.append(col)

    text_data = text_data[nominal_text_col]

    nominal_text_col_unique = []
    for col in nominal_text_col:
        if len(text_data[col].value_counts()) <= unique_threshold:
            nominal_text_col_unique.append(col)
        
        
    text_data = text_data[nominal_text_col_unique]

    #Set all these columns to categorical
    for col in text_data.columns:
        text_data[col] = text_data[col].astype('category')   
    categorical_text_data = pd.get_dummies(text_data)    

    #Change any nominal numerical columns to categorical, then returns a dataframe
    for col in numerical_data.columns:
        if col in nominal_num_col:
            numerical_data[col] = numerical_data[col].astype('category')  
           
    
    categorical_numerical_data = pd.get_dummies(numerical_data.select_dtypes(include=['category'])) 
    final_data = pd.concat([hi_corr_numerical_data, categorical_text_data, categorical_numerical_data], axis=1)

    return final_data

Applying Machine Learning


Now we are ready to apply machine learning. We'll use the linear regression model from scikit-learn; linear regression should work well here since our target column 'SalePrice' is a continuous value. We'll evaluate the model with RMSE as the error metric.

def train_and_test(data):

    train = data[0:1460]
    test = data[1460:]
    features = data.columns.drop(['SalePrice'])
    
    #train
    lr = LinearRegression()
    lr.fit(train[features], train['SalePrice'])
    #predict
    
    predictions = lr.predict(test[features])
    rmse = mean_squared_error(test['SalePrice'], predictions)**0.5
    return rmse
data = pd.read_csv("AmesHousing.tsv", delimiter='\t')

transformed_data = transform_features(data, percent_missing=0.05)
final_data = select_features(transformed_data, 0.4, 10)
result = train_and_test(final_data)
result
28749.561761556044

We've selected the first 1460 rows as the training set and the remaining rows as the testing set. This is not a good way to evaluate a model's performance, because the error will change as soon as we shuffle the data.

We can use KFold cross validation to split the data into k folds. Using the KFold class from scikit-learn, we can get the indices for the training and testing sets.
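Before plugging it into our pipeline, here is a minimal sketch of what KFold.split hands back, on a toy array (sizes are arbitrary):

```python
import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(10)  # stand-in for our dataframe's row positions

kf = KFold(n_splits=5, shuffle=True, random_state=2)
for train_index, test_index in kf.split(toy):
    # Each of the 5 folds: 8 positions for training, 2 for testing;
    # every position lands in a test fold exactly once.
    print(train_index, test_index)
```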

from sklearn.model_selection import KFold

def train_and_test2(data, k=2):  
    rf = LinearRegression()
    if k == 0:
        train = data[0:1460]
        test = data[1460:]
        features = data.columns.drop(['SalePrice'])
    
        #train
        rf.fit(train[features], train['SalePrice'])
        
        #predict    
        predictions = rf.predict(test[features])
        rmse = mean_squared_error(test['SalePrice'], predictions)**0.5
        return rmse
    
    elif k == 1:
        train = data[:1460]
        test = data[1460:]
        features = data.columns.drop(['SalePrice'])
        
        rf.fit(train[features], train["SalePrice"])
        predictions_one = rf.predict(test[features])        
        
        mse_one = mean_squared_error(test["SalePrice"], predictions_one)
        rmse_one = np.sqrt(mse_one)
        
        rf.fit(test[features], test["SalePrice"])
        predictions_two = rf.predict(train[features])        
       
        mse_two = mean_squared_error(train["SalePrice"], predictions_two)
        rmse_two = np.sqrt(mse_two)
        return np.mean([rmse_one, rmse_two])   
    
    else:
        kf = KFold(n_splits=k, shuffle=True, random_state = 2)
        rmse_list = []
        for train_index, test_index in kf.split(data):
            train = data.iloc[train_index]
            test = data.iloc[test_index]
            features = data.columns.drop(['SalePrice'])
    
            #train
            rf.fit(train[features], train['SalePrice'])
        
            #predict    
            predictions = rf.predict(test[features])
        
            rmse = mean_squared_error(test['SalePrice'], predictions)**0.5
            rmse_list.append(rmse)
        return np.mean(rmse_list)
data = pd.read_csv("AmesHousing.tsv", delimiter='\t')

transformed_data = transform_features(data, percent_missing=0.05)
final_data = select_features(transformed_data, 0.4, 10)

results = []
for i in range(100):
    result = train_and_test2(final_data, k=i)
    results.append(result)
    
x = [i for i in range(100)]
y = results 
plt.plot(x, y)
plt.xlabel('Kfolds')
plt.ylabel('RMSE')

print(results[99])
29830.6836474

(plot: average RMSE versus number of KFold splits)

Our error is actually lowest when k = 0. This is not very useful, however, because it means the model only performs well for the particular train/test split we picked. Without validation, there is no way to be sure the model works well on any other split of the data.

This is where cross validation is useful for evaluating model performance. We can see that the average RMSE goes down as we increase the number of folds. This makes sense: the RMSE shown in the graph above is an average over the cross validation folds, and a larger k means less bias towards overestimating the model's true error. The trade-off is that more folds require a lot more computation time.
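As an aside, scikit-learn can run this whole fold loop for us via cross_val_score. A self-contained sketch on synthetic data (not our housing dataframe), since the scorer returns negated MSEs that we convert to RMSEs:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic near-linear data standing in for our features/target
rng = np.random.RandomState(2)
X = rng.rand(100, 3)
y = X @ np.array([3.0, 1.0, 2.0]) + rng.rand(100) * 0.1

kf = KFold(n_splits=10, shuffle=True, random_state=2)

# scoring='neg_mean_squared_error' returns negated MSEs;
# flip the sign and take the square root to get per-fold RMSEs
neg_mses = cross_val_score(LinearRegression(), X, y, cv=kf,
                           scoring='neg_mean_squared_error')
rmses = np.sqrt(-neg_mses)
avg_rmse = rmses.mean()
```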


Learning Summary

Concepts explored: pandas, data cleaning, feature engineering, linear regression, hyperparameter tuning, RMSE, KFold cross validation

Functions and methods used: .dtypes, .value_counts(), .drop(), .isnull(), .sum(), .fillna(), .sort_values(), .corr(), .index, .append(), pd.get_dummies(), .astype(), .predict(), .fit(), KFold(), mean_squared_error()

The files used for this project can be found in my GitHub repository.

This post is licensed under CC BY 4.0 by the author.