I am not a betting man, but if anyone had told me three years ago that we would be wearing masks, social distancing and working from home, I would have found it a very hard bet to take. The changes the pandemic brought had consequences of their own: working from home let employees work from anywhere in the country, which drew highly paid tech workers from big cities to mid-sized cities and towns in search of a quieter, healthier lifestyle. This move away from the major cities, along with low-interest loans, pushed house prices up all over.
In this project we look at how the pandemic affected house prices in one of the hottest markets in the US, Austin, Texas, using Exploratory Data Analysis. We then use the cleaned and transformed data to fit several models and check which one performs best for house price prediction.
Data Cleaning and Transformations:
Since the dataset imported from Kaggle has no null values, the data cleaning step is short and revolves mainly around filtering the data, removing redundant variables and removing highly correlated variables.
Data Filtering:
To understand the spread and trends of the data, we draw distribution plots and count plots of the variables.
Distribution plot of latestPrice by density:
Based on the chart above, to keep the distribution of the target variable close to normal, we keep only houses with latestPrice ≤ $1 million.
Distribution plot of living Area (sqft) by density:
Based on the chart above, to keep the distribution of this variable close to normal, we keep only houses with livingAreaSqFt ≤ 4000 sq ft.
Count plots of multiple variables:
Based on the charts above, to remove values with low counts in these variables, we apply the filters below:
- Remove all the rows with numOfBedrooms > 6
- Remove all the rows with numOfBathrooms > 7
- Remove all the rows with parkingSpaces > 6
- Remove all the rows with garageSpaces > 6
Distribution plot of yearBuilt by density:
We keep only houses where yearBuilt > 1920, so that the age stays below roughly 100 years.
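The filters described in this section can be combined into a single boolean mask. The snippet below is a minimal sketch that assumes the data is loaded into a pandas DataFrame; the function name filter_listings and the three-row toy frame are made up for illustration and are not taken from the original notebook.

```python
import pandas as pd


def filter_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the rows that fall inside the ranges chosen from the plots."""
    mask = (
        (df["latestPrice"] <= 1_000_000)
        & (df["livingAreaSqFt"] <= 4000)
        & (df["numOfBedrooms"] <= 6)
        & (df["numOfBathrooms"] <= 7)
        & (df["parkingSpaces"] <= 6)
        & (df["garageSpaces"] <= 6)
        & (df["yearBuilt"] > 1920)
    )
    return df.loc[mask].reset_index(drop=True)


# Made-up rows (not real listings) to show the filter in action.
toy = pd.DataFrame({
    "latestPrice": [350_000, 1_500_000, 420_000],
    "livingAreaSqFt": [1800, 3500, 5200],
    "numOfBedrooms": [3, 5, 4],
    "numOfBathrooms": [2, 4, 3],
    "parkingSpaces": [2, 2, 2],
    "garageSpaces": [2, 2, 2],
    "yearBuilt": [1995, 2005, 1980],
})

filtered = filter_listings(toy)
# Row 2 fails the price cap and row 3 fails the living-area cap.
print(len(filtered))  # 1
```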
Removing redundant variables:
We remove the text variables, since we are not doing any NLP analysis: description, street address and home image URL. We also remove redundant variables such as zipid and number of photos, and we drop the latest sale date because we already have the latest sale year and latest sale month.
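Dropping these columns might look like the sketch below. The exact column spellings in REDUNDANT_COLUMNS are assumptions about the Kaggle file and should be checked against df.columns; errors="ignore" keeps the call from failing if a name does not match.

```python
import pandas as pd

# Assumed column names: verify them against the actual dataset.
REDUNDANT_COLUMNS = [
    "description",
    "streetAddress",
    "homeImage",
    "zpid",
    "numOfPhotos",
    "latest_saledate",
]


def drop_redundant(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the redundant columns."""
    return df.drop(columns=REDUNDANT_COLUMNS, errors="ignore")


# Made-up frame for illustration only.
toy = pd.DataFrame({
    "zpid": [1, 2],
    "description": ["cozy bungalow", "modern condo"],
    "latestPrice": [300_000, 450_000],
    "latest_saledate": ["2019-05-01", "2020-07-15"],
    "latest_saleyear": [2019, 2020],
})

slim = drop_redundant(toy)
print(list(slim.columns))  # ['latestPrice', 'latest_saleyear']
```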
Adding new variable Age:
Removing columns with high correlation:
We need to remove columns that are highly correlated (positively or negatively) with each other, since keeping both columns of such a pair adds little to model performance. The correlations are visualized as a correlation plot using sns.heatmap.
From the correlation plot above we can observe that:
- hasGarage has a high (+) correlation with garageSpaces.
- hasAssociation has a high (-) correlation with Age.
- yearBuilt has a high (-) correlation with Age.
To reduce dimensionality we remove the columns hasGarage, hasAssociation and yearBuilt.
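As a numeric complement to reading the heatmap by eye, highly correlated pairs can also be listed directly from the correlation matrix. This is only a sketch: the 0.8 threshold is an assumed cutoff (the analysis does not state one), the helper high_corr_pairs is hypothetical, and the toy Age column is built as a reference year minus yearBuilt purely for illustration.

```python
import pandas as pd


def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.8):
    """List column pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.select_dtypes("number").corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return pairs


# Made-up data: Age is an exact linear function of yearBuilt here.
year_built = [1950, 1975, 1990, 2000, 2010, 2018]
toy = pd.DataFrame({
    "yearBuilt": year_built,
    "Age": [2021 - y for y in year_built],
    "garageSpaces": [0, 2, 0, 2, 0, 2],
    "hasGarage": [0, 1, 0, 1, 0, 1],
})

pairs = high_corr_pairs(toy)
for a, b, r in pairs:
    print(f"{a} ~ {b}: r = {r}")

# Drop one column from each highly correlated pair.
reduced = toy.drop(columns=["hasGarage", "yearBuilt"])
```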
Correlation plot with Price as focus:
Exploratory Data Analysis:
Exploratory analysis is a key step for understanding the spread of the data, outliers and dependencies between variables, and for spotting anomalies and discovering patterns.
Box plots of Price by multiple variables:
Line plots of Price by Age and average school rating:
From the charts above we can observe that latestPrice decreases as Age increases and then goes up again for older houses, likely because of a vintage-house markup. latestPrice trends upward with avgSchoolRating.
Median latestPrice by latest_saleyear:
The trend we see here is in accordance with COVID-era trends: prices rose by ~21% from 2019 to 2021. The Forbes article below gives trends for multiple metro areas: Median-home-price-increase
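The median price per sale year and the percentage change between two years can be computed with a groupby. The sketch below uses made-up prices chosen for easy arithmetic, so its 25% result is not the figure from the real data.

```python
import pandas as pd

# Made-up prices, for illustration only.
toy = pd.DataFrame({
    "latest_saleyear": [2019, 2019, 2019, 2021, 2021, 2021],
    "latestPrice": [300_000, 400_000, 500_000, 420_000, 500_000, 600_000],
})

# Median latestPrice for each sale year.
median_by_year = toy.groupby("latest_saleyear")["latestPrice"].median()

# Percentage change of the median from 2019 to 2021.
pct_change = (median_by_year.loc[2021] / median_by_year.loc[2019] - 1) * 100
print(f"{pct_change:.1f}%")  # 25.0%
```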
Geo Spatial Data:
We can plot a heat map using latitude and longitude, with latestPrice as the hue. The high-price areas are darker and, if we compare against a map of Austin, are surrounded by water.
Machine Learning to predict price:
Models do not handle string data types well. We use LabelEncoder() from the scikit-learn library in Python to encode categorical columns as integers before fitting machine learning models.
The data must be split into training and test sets for machine learning. The training set is the subset of labelled data used to train the model, and the test set is the subset used to check model performance against new data that did not go into training. For this analysis we use a train (85%) : test (15%) split.
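A minimal sketch of the encoding step follows. The column name homeType and its values are assumptions for illustration. LabelEncoder assigns integer codes in sorted order of the class labels, and keeping one encoder per column makes it possible to map codes back to labels later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up frame; "homeType" is an assumed categorical column name.
toy = pd.DataFrame({
    "homeType": ["Single Family", "Condo", "Townhouse", "Condo"],
    "latestPrice": [450_000, 280_000, 350_000, 300_000],
})

categorical_cols = ["homeType"]
encoders = {}
for col in categorical_cols:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])  # strings -> integer codes
    encoders[col] = enc                     # keep for inverse_transform

print(toy["homeType"].tolist())             # [1, 0, 2, 0]
print(list(encoders["homeType"].classes_))  # ['Condo', 'Single Family', 'Townhouse']
```

The integer codes carry no real ordering. Tree-based models tolerate that, while a linear model may read it as a ranking.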
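An 85:15 split can be sketched with scikit-learn's train_test_split. The synthetic features and the random_state value are assumptions used only to make the example reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the price target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# test_size=0.15 reserves 15% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)
print(X_train.shape, X_test.shape)  # (170, 3) (30, 3)
```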
Machine learning models:
Multivariate Regression:
Multivariate regression is a machine learning model in which we fit a linear function of the input variables, choosing the coefficients that minimize the squared error between predicted and actual prices.
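As an illustration, the sketch below fits scikit-learn's LinearRegression to synthetic data generated from known coefficients and checks that the fit recovers them. The data and coefficient values are made up; whether the original analysis used this exact class is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data from a known linear rule plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_coef = np.array([3.0, -2.0, 0.5])
y = X @ true_coef + 10.0 + rng.normal(scale=0.1, size=200)

# Ordinary least squares fit.
linreg = LinearRegression()
linreg.fit(X, y)

print(np.round(linreg.coef_, 2))    # close to [3.0, -2.0, 0.5]
print(round(linreg.intercept_, 2))  # close to 10.0
```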
Random Forest Regressor:
Random forest is an ensemble learning method based on bootstrap aggregation (bagging): multiple decision trees are created during training, each on a bootstrap sample of the data, and the output is given by taking the mean of the individual trees' predictions.
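The averaging step can be verified directly: in scikit-learn, the prediction of a RandomForestRegressor equals the mean of the predictions of its individual trees. The sketch below uses synthetic data and an assumed n_estimators value.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic non-linear data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)

# Predict with every tree separately, then average by hand.
X_new = rng.uniform(0, 10, size=(5, 2))
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

# The hand-computed mean matches the forest's own prediction.
print(np.allclose(manual_mean, forest.predict(X_new)))
```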
XGBoost Regressor (Best Model):
Gradient boosting refers to a class of ensemble machine learning algorithms that can be used for classification or regression predictive modeling problems. The ensembles are constructed from decision tree models. Trees are added one at a time to the ensemble and fit to correct the prediction errors made by prior models. XGBoost (Extreme Gradient Boosting) is an implementation of this concept.
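The idea of adding trees one at a time to correct earlier errors can be shown with a hand-rolled loop. This is a sketch of plain gradient boosting for squared error, not of XGBoost itself, which adds regularization and other refinements. The learning rate, tree depth and number of trees are assumed values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_trees = 100

# Start from a constant prediction: the mean of the target.
prediction = np.full_like(y, y.mean())
trees = []
mse_history = [float(np.mean((y - prediction) ** 2))]

for _ in range(n_trees):
    residuals = y - prediction              # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                  # the new tree learns those errors
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(float(np.mean((y - prediction) ** 2)))

print(f"training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

Each new tree is fit to the residuals left by the trees before it, so the training error shrinks as the ensemble grows.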
We use GridSearchCV to select the hyperparameters that give the best estimator.
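A grid search might be set up as in the sketch below. To keep the example dependent on scikit-learn only, it uses GradientBoostingRegressor as a stand-in for the XGBoost regressor; the parameter grid, the scoring metric and the number of folds are all assumed values, not the ones used in the analysis.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Hypothetical grid: every combination is tried with cross-validation.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),  # stand-in for XGBoost
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X, y)

best_model = search.best_estimator_  # refit on all data with the best params
print(search.best_params_)
```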
Neural Net Regressor:
A neural net is designed to mimic the biological neural system using statistical methods. Every neuron is connected to every neuron in the next layer, creating a complex multi-layer network. The hidden layers act as a black box, and when a network has more than 2 layers it is classified as a deep neural network.
For a neural net to train efficiently we need to scale the data using MinMaxScaler() or StandardScaler().
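The two scalers behave as in the sketch below, which uses made-up numbers. The scaler is fit on the training data only and then applied to the test data, so no information from the test set leaks into training.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up features: living area (sq ft) and number of bathrooms.
X_train = np.array([[1000.0, 1.0], [2000.0, 3.0], [3000.0, 5.0]])
X_test = np.array([[2500.0, 2.0]])

# MinMaxScaler maps each training column to the range [0, 1].
minmax = MinMaxScaler().fit(X_train)
X_train_mm = minmax.transform(X_train)
X_test_mm = minmax.transform(X_test)
print(X_test_mm)  # [[0.75 0.25]]

# StandardScaler gives each training column zero mean and unit variance.
standard = StandardScaler().fit(X_train)
X_train_std = standard.transform(X_train)
print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
```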
Below is the Sequential neural net we designed, with multiple hidden layers, adam as the optimizer and mean squared error as the loss, since this is a regression problem.
Below is the loss function plot for the neural net:
We tabulate root mean squared error (RMSE), r2 score and explained variance for all 4 models we used to predict house prices.
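The three metrics can be computed with scikit-learn as sketched below. The helper regression_report and the four price values are made up for illustration and are not results from the models.

```python
import numpy as np
from sklearn.metrics import (
    explained_variance_score,
    mean_squared_error,
    r2_score,
)


def regression_report(y_true, y_pred):
    """Return RMSE, r2 score and explained variance as a dict."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    return {
        "rmse": rmse,
        "r2": float(r2_score(y_true, y_pred)),
        "explained_variance": float(explained_variance_score(y_true, y_pred)),
    }


# Made-up actual and predicted prices; every prediction is off by $10,000.
y_true = np.array([300_000.0, 450_000.0, 520_000.0, 610_000.0])
y_pred = np.array([310_000.0, 440_000.0, 530_000.0, 600_000.0])

report = regression_report(y_true, y_pred)
print(report["rmse"])  # 10000.0
print(round(report["r2"], 4), round(report["explained_variance"], 4))
```

Calling the same helper once per model and collecting the dicts into a DataFrame gives a comparison table.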
From the results above we see that all models performed well except Multivariate Regression. Weighing the bias-variance tradeoff, we select the XGBoost regressor as it has the lowest RMSE and the highest r2 score and explained variance.
Below is the feature importance graph for the XGBoost Regressor:
From the graph above we conclude that the location variables (latitude, longitude), lot area, living area and age are the top 5 predictors with the highest impact on price.