Product Recommender: Decision Tree and Random Forest
Anton A. Nesterov | an (at) vski.sh
Version 1.0
Product recommendation systems are an integral part of modern e-commerce and streaming services. Their goal is to predict what a user might like, driving engagement and sales. We can build a simple recommendation system using supervised learning models, starting with a basic Decision Tree for interpretability and then refining it with a more powerful Random Forest.
The Dataset: User Preferences
First, let's create a synthetic dataset that simulates user preferences for products based on their features: 'Product_Category', 'Price_Range', and 'Brand'. The target variable is 'User_Preference', which is either 'Liked' or 'Disliked'.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import display, Markdown
# Helper: render a DataFrame as a markdown table in the notebook
printdf = lambda df: display(Markdown(df.to_markdown()))
# Create a synthetic dataset to simulate user preferences
data = {
'Product_Category': ['Electronics', 'Books', 'Clothing', 'Electronics', 'Books', 'Clothing', 'Electronics', 'Books', 'Clothing', 'Electronics', 'Books', 'Clothing', 'Electronics', 'Books', 'Clothing', 'Electronics', 'Books', 'Clothing', 'Electronics', 'Books'],
'Price_Range': ['Low', 'Medium', 'High', 'Medium', 'Low', 'Medium', 'High', 'Low', 'Medium', 'Low', 'High', 'Medium', 'Low', 'Medium', 'High', 'Medium', 'Low', 'Medium', 'High', 'Low'],
'Brand': ['Brand_A', 'Brand_B', 'Brand_C', 'Brand_A', 'Brand_C', 'Brand_B', 'Brand_A', 'Brand_B', 'Brand_C', 'Brand_A', 'Brand_B', 'Brand_C', 'Brand_A', 'Brand_C', 'Brand_B', 'Brand_A', 'Brand_C', 'Brand_B', 'Brand_A', 'Brand_B'],
'User_Preference': ['Liked', 'Disliked', 'Disliked', 'Liked', 'Liked', 'Disliked', 'Liked', 'Disliked', 'Disliked', 'Liked', 'Disliked', 'Disliked', 'Liked', 'Liked', 'Disliked', 'Liked', 'Liked', 'Disliked', 'Liked', 'Liked']
}
df = pd.DataFrame(data)
print("Original Dataset:")
printdf(df.head(3))
Original Dataset:
| | Product_Category | Price_Range | Brand | User_Preference |
|---|---|---|---|---|
| 0 | Electronics | Low | Brand_A | Liked |
| 1 | Books | Medium | Brand_B | Disliked |
| 2 | Clothing | High | Brand_C | Disliked |
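With only 20 rows, class balance matters for the train/test split. A quick check on the `df` built above confirms the labels are roughly even (11 'Liked' vs. 9 'Disliked'):

# Inspect the label distribution; a heavy skew would call for
# stratified splitting or resampling before modeling.
print(df['User_Preference'].value_counts())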
Data Preprocessing
We will use OneHotEncoder to convert the categorical features into a numerical format that the models can process:
# Separate features (X) and target (y)
X = df.drop('User_Preference', axis=1)
y = df['User_Preference']
# Use OneHotEncoder to convert categorical features
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
X_encoded = ohe.fit_transform(X)
# Get the new feature names for better interpretation
feature_names = ohe.get_feature_names_out(X.columns)
target_names = np.unique(y)
#print("\nEncoded Features (first 5 rows):\n", X_encoded[:5])
#print("\nEncoded Feature Names:\n", feature_names)
# Split the data into training and testing sets (20 samples -> 16 train / 4 test)
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)
Model 1: Decision Tree Classifier
A single Decision Tree learns a hierarchy of simple, transparent if-then rules from the training data, which makes it an interpretable baseline for our recommendation system.
# Instantiate and train the Decision Tree model
dt_model = DecisionTreeClassifier(random_state=42, max_depth=3) # Limit depth for visualization
dt_model.fit(X_train, y_train)
# Make predictions on the test set and evaluate performance
dt_pred = dt_model.predict(X_test)
print(f"D-Tree Accuracy: {accuracy_score(y_test, dt_pred):.2f}")
printdf(
pd.DataFrame(classification_report(y_test, dt_pred, target_names=target_names, output_dict=True))
)
# Visualize the Decision Tree to see its rules
plt.figure(figsize=(20, 10))
plot_tree(dt_model,
feature_names=feature_names,
class_names=target_names,
filled=True,
rounded=True,
fontsize=10)
plt.title("Decision Tree for Product Recommendation", fontsize=16)
plt.savefig('decision_tree_recommendation.png')
plt.show()
Decision Tree Accuracy: 1.00

| | Disliked | Liked | accuracy | macro avg | weighted avg |
|---|---|---|---|---|---|
| precision | 1 | 1 | 1 | 1 | 1 |
| recall | 1 | 1 | 1 | 1 | 1 |
| f1-score | 1 | 1 | 1 | 1 | 1 |
| support | 2 | 2 | 1 | 4 | 4 |
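Since the single tree's main selling point is transparency, it is also worth dumping the learned rules as plain text. A minimal sketch using scikit-learn's `export_text` with the `dt_model` and `feature_names` defined above:

from sklearn.tree import export_text

# Print the tree's decision rules as indented text, using the
# one-hot feature names produced by the encoder.
print(export_text(dt_model, feature_names=list(feature_names)))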
Model 2: Random Forest Classifier (Refinement)
To improve performance and robustness, we use a Random Forest. This model builds a collection of decision trees, each trained on a random subset of the data and features, and aggregates their votes into a final, more accurate prediction. Because the individual trees err in different ways, averaging them reduces the risk of any single tree overfitting to the training data.
# Instantiate and train the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Make predictions on the test set and evaluate performance
rf_pred = rf_model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, rf_pred):.2f}")
printdf(
pd.DataFrame(classification_report(y_test, rf_pred, target_names=target_names, output_dict=True))
)
Random Forest Accuracy: 1.00

| | Disliked | Liked | accuracy | macro avg | weighted avg |
|---|---|---|---|---|---|
| precision | 1 | 1 | 1 | 1 | 1 |
| recall | 1 | 1 | 1 | 1 | 1 |
| f1-score | 1 | 1 | 1 | 1 | 1 |
| support | 2 | 2 | 1 | 4 | 4 |
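To turn the classifier into an actual recommender, we encode a new, unseen product with the same fitted encoder and predict the user's preference. Because the encoder was created with handle_unknown='ignore', a category it never saw during training (the hypothetical 'Brand_D' below) is encoded as all zeros rather than raising an error. A minimal sketch:

# A hypothetical new product the models have never seen.
new_product = pd.DataFrame({
    'Product_Category': ['Electronics'],
    'Price_Range': ['Low'],
    'Brand': ['Brand_D'],  # unseen brand -> all-zero brand columns
})

# Reuse the *fitted* encoder; never re-fit it on new data.
new_encoded = ohe.transform(new_product)
print("Predicted preference:", rf_model.predict(new_encoded)[0])

# Class probabilities can serve as a ranking score when ordering
# many candidate products for a recommendation list.
print(dict(zip(rf_model.classes_, rf_model.predict_proba(new_encoded)[0])))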
Explanation
On this tiny synthetic test set (only four samples), both models score a perfect 1.00, so the comparison above illustrates the workflow rather than a real performance gap. On larger and noisier datasets, however, the Random Forest typically outperforms a single Decision Tree: averaging the votes of many de-correlated trees reduces variance and the risk of overfitting to quirks of the training data. This demonstrates a common trade-off in machine learning: the single tree gives a clear, human-readable set of rules, while the ensemble usually delivers superior predictive power, which translates into better recommendations and a more satisfied user base.
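The ensemble is not a complete black box, either: a Random Forest exposes impurity-based feature importances aggregated across all of its trees, recovering some of the interpretability lost by moving beyond a single tree. A short sketch using the fitted `rf_model` and the encoded `feature_names` from above:

# Rank the one-hot encoded features by their mean impurity-based
# importance across the trees in the forest.
importances = pd.Series(rf_model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))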