Seattle Housing Analysis - An Intro to Multiple-Linear Regression
Shopping for houses can be overwhelming, especially in cities like Seattle. There are tons of factors - location, square footage, number of bedrooms and bathrooms, building condition, and so on - and it can be hard to know how much each factor contributes to the value of a home, or how to recognize a good deal.
What if we had a method to feed in housing data, interpret how much each datapoint impacts the price of a house, and automatically estimate the value of every home on the market? Then we could compare that estimate to the list price and filter down to the homes that seem the most undervalued given the available information.
We can do this by building a predictive model in Python. Predictive models come in many flavors and complexities; for this project we’ll keep it simple with a multiple-linear regression model. To keep this writeup humble in scope, we’ll avoid some of the technical rabbit holes and most of the model validation, and just run through the stages of building the final implementation.
Building a Multiple-Linear Regression Model
Data Preparation
For this exercise we’ll be using the King County housing dataset from 2014-2015 from Kaggle. In order for our model to run correctly, we’ll have to clean and transform our data a little. Once the data is ready, finding the importance of each feature is fairly simple. The main things we want to do for data cleaning are:
Import data into Python
Remove or fill null values
Remove duplicates
Drop unneeded columns
Transform categorical data
Transform numeric data to log numeric
Before any of these steps, we’ll import the Python packages we’ll be using (all of the Python code in this writeup appears in blocks like the one below):
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import folium
import branca.colormap as cm
1.) Importing data into Python
For this exercise our data is located one step down in the ‘data’ folder. We read this with pandas (pd) and store it as the variable ‘data’:
data = pd.read_csv('data/kc_house_data.csv')
2.) Removing or filling null values
In this dataset there are a number of null values, which will break the multiple-linear regression. Null values are different from values of “0” and specifically signify the absence of data. There are a number of methods for dealing with null values, such as filling them with the mean value of the column or dropping the rows entirely. Since most of the missing values in this dataset are categorical, we will create a new category, called “DUNNO”. We will end up dropping the other columns that contain nulls later, as experimentation showed they do not significantly improve the model.
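Before filling anything in, it’s worth checking which columns actually contain nulls; a quick way to do that with pandas:
# Count the null values in each column
print(data.isna().sum())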
data['waterfront'] = data['waterfront'].fillna(value='DUNNO')
data['view'] = data['view'].fillna(value='DUNNO')
3.) Removing duplicates
We also want to avoid any duplicate rows - here we look at the id for each row and drop any rows whose id has already appeared.
data = data.loc[data.duplicated('id') == False]
4.) Dropping unneeded columns
Sometimes it’s hard to know which columns to drop to get the best results, so it’s worth experimenting, building more inclusive models and narrowing down, or building simple models and adding complexity. In this case we drop columns in the next two steps while we selectively transform data.
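For reference, an explicit drop would look something like the sketch below. This line isn’t part of the pipeline we build here (we instead keep only the columns we select in steps 5 and 6), and the column names are just examples from the King County dataset.
# Illustrative only: explicitly dropping a few example columns
dropped_example = data.drop(columns=['date', 'sqft_lot', 'yr_renovated'], errors='ignore')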
5.) Transforming categorical data
Here we take the categorical data and make a new column for each possible option, dropping the first option in each group to avoid redundant, perfectly correlated columns. This is called making dummy columns: each column holds a 1 or a 0, signifying true or false for that category, which is much easier for a linear regression model to interpret. Each of the first five lines performs this operation, and the last line combines the results into a dataframe called “categorical”.
view = pd.get_dummies(data['view'], prefix='view', drop_first=True)
waterfront = pd.get_dummies(data['waterfront'], prefix='waterfront', drop_first=True)
grade = pd.get_dummies(data['grade'], prefix='grade', drop_first=True)
condition = pd.get_dummies(data['condition'], prefix='condition', drop_first=True)
zipcode = pd.get_dummies(data['zipcode'].astype(str), drop_first=True)
categorical = pd.concat([view, waterfront, grade, condition, zipcode], axis=1)
6.) Transforming numeric data to log numeric
Here we take the numeric data and apply a log transformation - this pulls in the long right tails of features like price and square footage, making each distribution closer to normal, which generally improves the performance of a linear regression model.
numeric = data[['price', 'sqft_living', 'yr_built', 'bedrooms', 'bathrooms', 'floors']]
log_numeric = np.log(numeric)
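A quick way to sanity-check that the transformation helped is to compare the skew of each column before and after; values closer to 0 mean a more symmetric distribution.
# Skewness before and after the log transform
print(numeric.skew())
print(log_numeric.skew())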
Last Steps
Since we just split our data into “categorical” and “log_numeric” datasets, we’ll need to recombine them into a single dataframe called “processed” with the following code:
processed = pd.concat([log_numeric, categorical], axis=1)
Lastly we’ll separate the features we use for prediction and the target as X and y respectively. We will then split these into train and test sets; the test set will be used later to see how well our model predicts unseen information. The last line adds a constant (intercept) column to the training data, which statsmodels needs in order to fit an intercept.
X = processed.drop(['price'], axis=1)
y = processed['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=42)
X_train_1 = sm.add_constant(X_train)
Now our data should be fully prepared to feed into our model!
Training the Model
Now that our data is cleaned, feeding it into our model takes just a couple lines of code with the statsmodels package:
model = sm.OLS(y_train, X_train_1)
results = model.fit()
And now it’s trained! To see the results we’ll run:
results.summary()
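results.summary() prints a large table, which we’ll walk through next. Since we went to the trouble of holding out a test set, this is also a natural point for a quick check of how the model does on unseen data - a minimal sketch, using the mean_squared_error we imported earlier (note the test features need the same constant column added):
# Quick check against the held-out test set
X_test_1 = sm.add_constant(X_test)  # add the same intercept column used in training
test_predictions = results.predict(X_test_1)
# Root-mean-squared error in log-price units; smaller is better
rmse_log = np.sqrt(mean_squared_error(y_test, test_predictions))
print(f'Test RMSE (log price): {rmse_log:.3f}')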
Interpreting the Results
The results we get should look something like this:
Top Portion:
Middle Portion:
Bottom Portion:
There’s a lot here that we could dig into, so we’ll just touch on some of the key fields starting from the top.
R-Squared - Represents the proportion of variance in the target (here “price”) that the model explains. A value of 1 would mean the model explains the training data perfectly, which generally signals overfitting.
F-Statistic - Tests whether the model as a whole is statistically significant, i.e. whether at least one feature genuinely helps explain the target. Larger F-statistics correspond to stronger evidence (and smaller p-values) for the model overall.
Coef - The regression coefficients represent how much each feature moves the prediction. Since both the target and the numeric features are log-transformed, the coefficients on those features read as elasticities: a coefficient of 1 means a 1% increase in that feature is associated with roughly a 1% increase in price. For the dummy columns, a coefficient of c corresponds to roughly a (e^c - 1) * 100% change in price.
P>|t| - The p-value tells us how confident we can be that a feature’s coefficient is really different from zero. A low p-value (usually less than .05) indicates the result is statistically significant, meaning there is a very low chance we would see an effect this large purely by chance.
Cond. No. - The condition number indicates whether there is redundant information (multicollinearity) among the features. Redundant features can lead to unstable or misleading coefficient estimates, where small changes to the input data produce large changes in the fitted model.
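If you’d rather pull these numbers out programmatically instead of reading the summary table, the fitted results object exposes them as attributes:
# Key statistics from the fitted model as plain numbers
print('R-squared:       ', results.rsquared)
print('F-statistic:     ', results.fvalue)
print('Condition number:', results.condition_number)
# Coefficients and p-values for each feature, sorted by p-value
coef_table = pd.DataFrame({'coef': results.params, 'p_value': results.pvalues})
print(coef_table.sort_values('p_value').head(10))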
Visualizing the Results
This is all well and good, but we haven’t actually applied the predictive model to help us do anything yet. One way to help us make decisions might be to look at the model’s predicted prices and the (actual) list prices for homes and compare the two. If the list price is lower than the estimate, it might be a good property to look at.
The following code makes a comparison between the list price and the model estimate:
### Here we add the model predictions to a dataset with the actual prices and compare the two.
data_f = data.copy() # New dataset (a copy, so the original is untouched)
data_f['prediction_log'] = results.predict(sm.add_constant(X)) # Adding the model's (log) price prediction for every home
data_f['actual_log'] = y # adding the actual (log) price
data_f['rank'] = data_f['prediction_log'] - data_f['actual_log'] # comparing the prediction with the actual value
# data_f[['rank', 'id']].sort_values(by='rank')
# Now we will make a dataframe of what the model believes are the 500 best values
best_500 = data_f[['rank', 'id']].sort_values(by='rank').tail(500) # grabbing the 500 'best values'
best_500 = pd.merge(best_500, data_f, how='left', on='id') # populating these with more information
best_500['Value'] = best_500['rank_y'] # cleaning up the rank columns
best_500 = best_500.drop(['rank_x', 'rank_y'], axis=1) # cleaning up the rank columns
Next we can visualize these results on a map using the open-source folium package:
### Making a folium map colored by house price
lat = 47.560093
long = -122.213982
fmap = folium.Map([lat, long], zoom_start=10, tiles='CartoDB positron')
# Color map
colors = cm.LinearColormap(['orange', 'yellow', 'yellow', 'lightgreen', 'lawngreen'],
                           index=[.45, .55, .65, .75, .85],
                           vmin=min(best_500['Value']), vmax=1)
# Adding Markers
sample = best_500
for p in range(0, len(sample)):
    folium.Circle(
        location=[sample.iloc[p]['lat'], sample.iloc[p]['long']],
        fill=True,
        fill_color=colors(sample.iloc[p]['Value']),
        fill_opacity=0.9,
        color=False,
        radius=100,
        popup=best_500.iloc[p][['Value', 'price', 'bedrooms', 'bathrooms', 'id']].to_string(index=True)
    ).add_to(fmap)
fmap.add_child(colors)
# Displaying the map
fmap
And that’s it! A quick rundown of how to build a multiple-linear regression model and compare actual and estimated values on a map. There’s a ton we didn’t get into about model validation (how do we know these predictions are accurate?) and optimization (how might we improve this model?), which tends to be the real art form of data science, but those two questions are sort of endless. It’s often less intimidating to just go ahead and build something.
You don’t need a perfect understanding to start seeing results and comparing them with your intuitions about homes and prices, but going beyond those intuitions is where value can really be added in making decisions at this scale.