Used Car Price Prediction & Analysis

Omkar Gawade
6 min read · Jan 24, 2022

Driving has always been one of my favorite guilty pleasures. Whenever I feel overwhelmed by the multitude of things going on in my life, a short drive accompanied by music helps me clear my mind. The used car market in the US has always been an attractive option for first-time car buyers: more bang for your buck, lower price depreciation, lower insurance premiums, and lower dealership prices. Who wouldn't want that? Despite all those advantages, the US has recently seen an exponential increase in used car prices. The analysis below uncovers used car price trends, the factors that affect price, and the machine learning algorithm that best predicts it.

Dataset:

The dataset for the analysis was exported from the Kaggle Used Car dataset, which contains vehicle listings scraped from Craigslist.com. Below are the variables in the dataset, with descriptions:

Data Cleaning:

Removing columns with the most null values: Null values reduce model performance and skew the data. The first step of data cleaning is to remove columns in which more than 25% of the values are null. The columns ‘cylinders’, ‘VIN’, ‘size’, ‘paint_color’, and ‘county’ were removed.
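A minimal sketch of this step in pandas (the filename vehicles.csv is an assumption about the Kaggle export):

```python
import pandas as pd

# Load the Craigslist listings (filename is an assumption)
df = pd.read_csv("vehicles.csv")

# Fraction of null values in each column
null_ratio = df.isnull().mean()

# Drop every column where more than 25% of the values are null
df = df.drop(columns=null_ratio[null_ratio > 0.25].index)
```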

Removing redundant variables: Variables that contain unimportant data, such as links and free-text descriptions, reduce model performance. The columns ‘id’, ‘url’, ‘region_url’, ‘image_url’, ‘description’, and ‘posting_date’ were removed.

Removing rows with null values: All rows with null values in the remaining columns were removed to produce a consistent dataset.
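A minimal sketch of both steps, continuing with the same df:

```python
# Drop identifier, link, and free-text columns that add no predictive value
redundant = ["id", "url", "region_url", "image_url", "description", "posting_date"]
df = df.drop(columns=redundant)

# Drop any remaining rows that still contain null values
df = df.dropna()
```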

Data filters to reduce high variance: To keep high-variance data from skewing the analysis, we apply filters to several columns and reduce the number of rows; a code sketch of these filters follows the list below.

  • Filter by the top 10 manufacturers with the most postings.
  • Filters for price, odometer, and year: For this analysis we look at the economical segment, i.e., price > $5k and price < $100k. For the odometer, we remove all cars above 200k miles, since buying a used car with such high mileage is rarely sensible, and all cars below 1,000 miles, since those are practically new. For the year variable, we keep cars newer than 2008, since statistically cars start incurring high maintenance costs after about 13 years of age.
  • Filtering by unique model count: To reduce complexity, we filtered to manufacturers with unique model counts > 50.
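A minimal sketch of these filters in pandas (assuming df has columns manufacturer, model, price, odometer, and year; the direction of the last filter is an assumed reading):

```python
# Keep only the 10 manufacturers with the most postings
top10 = df["manufacturer"].value_counts().nlargest(10).index
df = df[df["manufacturer"].isin(top10)]

# Price, odometer, and year filters described above
df = df[(df["price"] > 5_000) & (df["price"] < 100_000)]
df = df[(df["odometer"] >= 1_000) & (df["odometer"] <= 200_000)]
df = df[df["year"] > 2008]

# Keep manufacturers with more than 50 unique models (assumed reading of the filter)
model_counts = df.groupby("manufacturer")["model"].nunique()
df = df[df["manufacturer"].isin(model_counts[model_counts > 50].index)]
```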

Removing columns with high correlation: We remove columns that have a high positive or negative correlation with each other, since keeping both adds no value to model performance. The correlation plot is generated with sns.heatmap(cor, cmap=’mako’, annot=True). From the corrplot below, we can observe that state has a high correlation with latitude and longitude, so to reduce dimensionality we remove the lat and long columns.

Corr plot
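A minimal sketch of computing the correlation matrix, drawing the heatmap, and dropping the correlated columns (continuing with the same df):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix over the numeric columns
cor = df.corr(numeric_only=True)

# Heatmap used to spot highly correlated pairs
sns.heatmap(cor, cmap="mako", annot=True)
plt.show()

# state is highly correlated with latitude/longitude; drop lat and long
df = df.drop(columns=["lat", "long"])
```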

Data Transformation:

We replace the year column with age, since age is a more natural predictor of price.

age = current_year - year
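In pandas (the value 2022 is assumed from the post date):

```python
CURRENT_YEAR = 2022  # assumed analysis year, per the post date

# Replace year with vehicle age
df["age"] = CURRENT_YEAR - df["year"]
df = df.drop(columns=["year"])
```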

Exploratory Data Analysis:

Exploratory analysis is a key step for understanding the data spread, spotting outliers and anomalies, examining dependencies between variables, and discovering patterns.

Univariate Data Analysis:

Univariate analysis explains the spread and trends of a single variable considered by itself.
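As an illustrative sketch, the spread of a single variable such as price can be visualized with a histogram:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of a single variable (price), with a kernel density overlay
sns.histplot(df["price"], kde=True)
plt.show()
```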

Multivariate Data Analysis:

Multivariate analysis examines more than one variable simultaneously; it helps us understand how the target variable changes with the predictor variables.

Median price by year for age = 3 years:

The above chart shows how median prices for 3-year-old cars have changed over 2019, 2020, 2021, and 2022. Vehicles purchased in 2021 were priced about 20% higher than in 2020, due to the semiconductor chip shortage.

Median price depreciation by manufacturer:

Price by manufacturer and drive type:

Price by manufacturer and drive
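As an illustrative sketch (not necessarily the exact chart code), a grouped comparison like the one above can be drawn with seaborn, using the median as the bar estimator:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Median price per manufacturer, split by drive type
plt.figure(figsize=(12, 6))
sns.barplot(data=df, x="manufacturer", y="price", hue="drive", estimator=np.median)
plt.xticks(rotation=45)
plt.show()
```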

Price by manufacturer and state:

Price by manufacturer and state

Machine Learning to predict price:

Model preprocessing:

Models do not handle string datatypes well. We use LabelEncoder() from the scikit-learn library in Python to encode categorical columns as integers before fitting the machine learning models.
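A minimal sketch of that encoding (assuming every object-typed column in df is categorical):

```python
from sklearn.preprocessing import LabelEncoder

# Encode each categorical (object-typed) column as integer labels
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))
```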

Train-Test split:

The data must be split into training and test sets for machine learning: the training set is the subset used to fit the model on labelled data, and the test set is the subset used to measure performance against new data that did not go into training. For this analysis we used a train (70%) / test (30%) split.
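A sketch of the split with scikit-learn (assuming price is the target; the random_state is an arbitrary choice for reproducibility):

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["price"])
y = df["price"]

# 70/30 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
```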

Machine learning models:

Linear Regression: Linear regression is a machine learning model in which we try to find the line that best fits the data and captures the most variance in it.

Random Forest Regressor: Random forest is a machine learning algorithm that fits by bootstrap aggregation (bagging), an ensemble learning method in which multiple decision trees are created during training and the output is the mean of the individual trees’ predictions.

Support Vector Machine Regressor: Support vector machines, although famous for classification problems, can also be used for regression. SVR aims to reduce error by finding the hyperplane that minimizes the difference between predicted and observed values.

XG Boost Regressor (Best model): Gradient boosting refers to a class of ensemble machine learning algorithms that can be used for classification or regression predictive modeling problems. Ensembles are constructed from decision tree models: trees are added to the ensemble one at a time, each fit to correct the prediction errors made by the prior models. XGBoost is short for extreme gradient boosting, an optimized implementation of this idea.
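A hedged sketch of fitting and comparing the four models (default hyperparameters, which are not necessarily the ones used in this analysis; assumes the train/test split from above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "SVR": SVR(),
    "XGBoost": XGBRegressor(random_state=42),
}

# Fit each model and report test-set RMSE and R2
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    print(f"{name}: RMSE = {rmse:,.0f}, R2 = {r2_score(y_test, preds):.3f}")
```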

Model Performance for above models:

From the above table we can conclude that the XG Boost Regressor is the best-performing model: it has the lowest Root Mean Squared Error (RMSE) and the highest R² score, which measures how well the model fits the data.

Feature Importance plot for XG Boost:
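A minimal sketch of producing that plot with xgboost’s built-in helper (assuming the fitted model from the sketch above):

```python
from xgboost import plot_importance
import matplotlib.pyplot as plt

# Plot feature importances from the trained XGBoost model
plot_importance(models["XGBoost"])
plt.show()
```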


Omkar Gawade

Sr. Data Scientist @ Sourceday | Data professional with 5+ years of experience in analytics, visualization, and modeling.