Diamond Price Analysis Using Machine Learning Models in Python

Haoyang Niu
5 min read · May 8, 2021

If you happened to find a diamond on a treasure hunt, would you know how much it is worth? It is almost impossible for most people to estimate the price of a diamond accurately, as it can range from $1,000 to $20,000. Many factors determine the price: the carat, the color, the cut quality, the length, and so on. With the help of machine learning models, we can predict the price of a diamond with an R2 score above 0.90.

This analysis uses the classic diamonds dataset, which contains the prices and other attributes of almost 54,000 diamonds. There are nine features — carat, cut, color, clarity, x (length), y (width), z (depth), depth percentage, and table — and one target variable, the price of the diamond. It is a great dataset for beginners learning data analysis and visualization. The following analysis builds three models: linear regression, decision tree regression, and XGBoost.

Here are all the libraries needed for this analysis.
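
The exact import list isn't reproduced here, but a minimal set that covers the steps below (assuming pandas, seaborn, scikit-learn, and xgboost are installed) would be:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error
import xgboost as xgb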

First, we need to import the data from the CSV file and explore the dataset.
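
A sketch of this step, assuming the CSV is saved locally as diamonds.csv (the file name is an assumption):

# Load the diamonds dataset and take a first look at its structure
df = pd.read_csv('diamonds.csv')
df.info()       # column types and non-null counts
df.describe()   # summary statistics for the numeric columns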

From df.info() we can tell that there is a useless column called “Unnamed: 0” that we should drop. There are also object columns, such as cut and color, that we will need to convert into numerical variables later.
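
Dropping the leftover index column is a one-liner:

# "Unnamed: 0" is just the row index written out when the CSV was exported
df = df.drop(columns=['Unnamed: 0'])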

Here is an overview of what the dataset looks like.
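
The first few rows can be printed with:

df.head()   # first five rows of the dataset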

We can also explore the variables by visualizing them with different plots. Here are some interesting plots that might give us hints about what model to build later.

This scatter plot shows a positive relationship between carat and price. It is obvious that carat plays an important role in determining the price.
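
A plot like this can be reproduced with seaborn (the exact plotting call is my assumption, since the original code isn't shown):

# Scatter plot of carat against price
sns.scatterplot(x='carat', y='price', data=df)
plt.show()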

The histogram of clarity shows that most diamonds have a clarity grade between SI2 and VS2.
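
Because clarity is categorical, a count plot gives the same picture; the worst-to-best ordering of the grades below is my assumption:

# Distribution of diamonds across the clarity grades
sns.countplot(x='clarity', data=df,
              order=['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF'])
plt.show()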

The count plot of prices shows that most diamonds are priced below $5,000; higher-priced diamonds are much rarer.
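
A histogram over price shows the same pattern (plot type and bin count here are assumptions):

# Histogram of diamond prices
sns.histplot(df['price'], bins=50)
plt.xlabel('price')
plt.show()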

Since the target variable, price, is numeric, we can’t use classification methods. Instead, we will build regression models.

To start building models, we first need to turn the categorical features into dummy variables, and then split the data into training and testing sets (30% testing).
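
A sketch of this preprocessing step (the drop_first and random_state values are assumptions):

# One-hot encode the categorical columns and separate the target
X = pd.get_dummies(df.drop(columns=['price']), drop_first=True)
y = df['price']

# 70% training data, 30% testing data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)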

Printing the shapes of the training and testing sets confirms the 70/30 split.

Model 1: Linear Regression

The first and simplest model to build is linear regression.

Here is how we build it and visualize the prediction results.
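
A minimal version of this step; plotting predicted against actual prices is one way to visualize the result, since the original figure isn't shown:

# Fit ordinary least squares on the training data
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

# Compare predicted prices with the actual test prices
plt.scatter(y_test, y_pred, alpha=0.3)
plt.xlabel('Actual price')
plt.ylabel('Predicted price')
plt.show()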

Here is the plot of how the linear regression fit looks; the blue points are the actual data.

To calculate the accuracy of the model and evaluate how it performs, we will use the R2 score and RMSE rather than the accuracy_score function, since accuracy_score is only for classification problems.
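
Both metrics come straight from scikit-learn:

# R2 on the test set, RMSE on both splits
print('R2:', r2_score(y_test, y_pred))
print('Train RMSE:', np.sqrt(mean_squared_error(y_train, lr.predict(X_train))))
print('Test RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))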

The R2 score is about 0.92, which is already pretty good. However, the training and testing RMSE are quite high, about 1125 and 1141 respectively. Thus, we should look for other models with lower RMSE.

Model 2: Decision Tree Regression

The second model is decision tree regression, built by simply running:

dt = DecisionTreeRegressor(random_state=0)
dt.fit(X_train, y_train)

The R2 score is 0.9645, better than the first model. However, the RMSE of the training set is about 9.12 whereas that of the testing set is about 760, far larger than the training error. This suggests that the model is overfitting.

The combined histogram shows that the predictions and the testing data overlap almost perfectly, close to a perfect fit. This also suggests that the model is overfitting.
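
A combined histogram like this can be drawn by overlaying the predicted and actual test prices (the styling here is illustrative):

# Overlay the actual and predicted test prices
sns.histplot(y_test, color='blue', label='actual', bins=50)
sns.histplot(dt.predict(X_test), color='orange', label='predicted', bins=50)
plt.legend()
plt.show()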

To adjust the model and avoid overfitting, we build a new decision tree with extra constraints on its growth.
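
The exact constraints aren't listed in the article, so the hyperparameters below are assumptions chosen to illustrate the idea:

# Constrain the tree's depth and leaf size to reduce overfitting
dt2 = DecisionTreeRegressor(max_depth=8, min_samples_leaf=20, random_state=0)
dt2.fit(X_train, y_train)

print('R2:', r2_score(y_test, dt2.predict(X_test)))
print('Train RMSE:', np.sqrt(mean_squared_error(y_train, dt2.predict(X_train))))
print('Test RMSE:', np.sqrt(mean_squared_error(y_test, dt2.predict(X_test))))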

The new decision tree has a lower R2 score of 0.9417, but now the RMSE of both the training and testing sets is around 970, which indicates the model is no longer overfitting.

Now the combined histogram shows that the predictions no longer fit the test data perfectly.

Model 3: XGBoost

The last model is XGBoost. To build it, we need to create a data matrix (DMatrix) and a new train/test split specifically for this model.
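
A sketch of this step; the hyperparameters and the use of the scikit-learn-style XGBRegressor are assumptions, since the original code isn't shown:

# XGBoost stores data in its own optimized DMatrix structure
data_dmatrix = xgb.DMatrix(data=X, label=y)

# A fresh train/test split for this model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

xg_reg = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=200,
                          learning_rate=0.1, max_depth=6, random_state=0)
xg_reg.fit(X_train, y_train)

preds = xg_reg.predict(X_test)
print('R2:', r2_score(y_test, preds))
print('Test RMSE:', np.sqrt(mean_squared_error(y_test, preds)))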

The R2 score is 0.975, the best of the three. The training and testing RMSE are about 600.4 and 632.3, better than the previous two models, and the small gap between them shows the model is not overfitting.

We can also visualize one of the boosted trees, though it has so many branches that the plot may not be very useful.
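
With the graphviz package installed, a single tree from the ensemble can be drawn:

# Plot the first tree of the boosted ensemble (requires graphviz)
xgb.plot_tree(xg_reg, num_trees=0)
plt.show()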

We can also visualize the importance of the features in this model. It shows that x (length), carat, and y (width) are the top three features that determine the price.
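
xgboost ships a helper for this plot:

# Bar chart of feature importance scores
xgb.plot_importance(xg_reg)
plt.show()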

In conclusion, XGBoost gives us the best accuracy of the three models and also avoids the problem of overfitting. With the trained model and the nine features, we should be able to predict the price of any diamond in the future!
