Product Recommender: Decision Tree and Random Forest
Anton A. Nesterov | an (at) vski.sh
Version 1.0
Product recommendation systems are an integral part of modern e-commerce and streaming services. Their goal is to predict what a user might like, driving engagement and sales. We can build a simple recommendation system using supervised learning models, starting with a basic Decision Tree for interpretability and then refining it with a more powerful Random Forest.
The Dataset: User Preferences
First, let's create a synthetic dataset that simulates user preferences for products based on their features: 'Product_Category', 'Price_Range', and 'Brand'. The target variable is 'User_Preference', which is either 'Liked' or 'Disliked'.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import display, Markdown
# Helper: render a DataFrame as a markdown table in the notebook
printdf = lambda df: display(Markdown(df.to_markdown()))
# Create a synthetic dataset to simulate user preferences
data = {
'Product_Category': ['Electronics', 'Books', 'Clothing', 'Electronics', 'Books', 'Clothing', 'Electronics', 'Books', 'Clothing', 'Electronics', 'Books', 'Clothing', 'Electronics', 'Books', 'Clothing', 'Electronics', 'Books', 'Clothing', 'Electronics', 'Books'],
'Price_Range': ['Low', 'Medium', 'High', 'Medium', 'Low', 'Medium', 'High', 'Low', 'Medium', 'Low', 'High', 'Medium', 'Low', 'Medium', 'High', 'Medium', 'Low', 'Medium', 'High', 'Low'],
'Brand': ['Brand_A', 'Brand_B', 'Brand_C', 'Brand_A', 'Brand_C', 'Brand_B', 'Brand_A', 'Brand_B', 'Brand_C', 'Brand_A', 'Brand_B', 'Brand_C', 'Brand_A', 'Brand_C', 'Brand_B', 'Brand_A', 'Brand_C', 'Brand_B', 'Brand_A', 'Brand_B'],
'User_Preference': ['Liked', 'Disliked', 'Disliked', 'Liked', 'Liked', 'Disliked', 'Liked', 'Disliked', 'Disliked', 'Liked', 'Disliked', 'Disliked', 'Liked', 'Liked', 'Disliked', 'Liked', 'Liked', 'Disliked', 'Liked', 'Liked']
}
df = pd.DataFrame(data)
print("Original Dataset:")
printdf(df.head(3))
Original Dataset:
| | Product_Category | Price_Range | Brand | User_Preference |
|---|---|---|---|---|
| 0 | Electronics | Low | Brand_A | Liked |
| 1 | Books | Medium | Brand_B | Disliked |
| 2 | Clothing | High | Brand_C | Disliked |
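With only 20 rows, class balance matters for the train/test split. A quick check on the `df` built above confirms the labels are roughly even (11 'Liked' vs. 9 'Disliked'):

# Inspect the label distribution; a heavy skew would call for
# stratified splitting or resampling before modeling.
print(df['User_Preference'].value_counts())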
Data Preprocessing
We will use OneHotEncoder to convert the categorical features into a numerical format that the models can process:
# Separate features (X) and target (y)
X = df.drop('User_Preference', axis=1)
y = df['User_Preference']
# Use OneHotEncoder to convert categorical features
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
X_encoded = ohe.fit_transform(X)
# Get the new feature names for better interpretation
feature_names = ohe.get_feature_names_out(X.columns)
target_names = np.unique(y)
#print("\nEncoded Features (first 5 rows):\n", X_encoded[:5])
#print("\nEncoded Feature Names:\n", feature_names)
# Split the data into training and testing sets (20 samples -> 16 train / 4 test)
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)
Model 1: Decision Tree Classifier
A single Decision Tree learns a hierarchy of simple, transparent if-then rules from the training data, which makes it an interpretable baseline for our recommendation system.
# Instantiate and train the Decision Tree model
dt_model = DecisionTreeClassifier(random_state=42, max_depth=3) # Limit depth for visualization
dt_model.fit(X_train, y_train)
# Make predictions on the test set and evaluate performance
dt_pred = dt_model.predict(X_test)
print(f"D-Tree Accuracy: {accuracy_score(y_test, dt_pred):.2f}")
printdf(
pd.DataFrame(classification_report(y_test, dt_pred, target_names=target_names, output_dict=True))
)
# Visualize the Decision Tree to see its rules
plt.figure(figsize=(20, 10))
plot_tree(dt_model,
feature_names=feature_names,
class_names=target_names,
filled=True,
rounded=True,
fontsize=10)
plt.title("Decision Tree for Product Recommendation", fontsize=16)
plt.savefig('decision_tree_recommendation.png')
plt.show()
Decision Tree Accuracy: 1.00

| | Disliked | Liked | accuracy | macro avg | weighted avg |
|---|---|---|---|---|---|
| precision | 1 | 1 | 1 | 1 | 1 |
| recall | 1 | 1 | 1 | 1 | 1 |
| f1-score | 1 | 1 | 1 | 1 | 1 |
| support | 2 | 2 | 1 | 4 | 4 |
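Since the single tree's main selling point is transparency, it is also worth dumping the learned rules as plain text. A minimal sketch using scikit-learn's `export_text` with the `dt_model` and `feature_names` defined above:

from sklearn.tree import export_text

# Print the tree's decision rules as indented text, using the
# one-hot feature names produced by the encoder.
print(export_text(dt_model, feature_names=list(feature_names)))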
Model 2: Random Forest Classifier (Refinement)
To improve performance and robustness, we use a Random Forest. This model builds a collection of decision trees, each trained on a random subset of the data and features, and aggregates their votes into a final, more accurate prediction. Because the individual trees err in different ways, averaging them reduces the risk of any single tree overfitting to the training data.
# Instantiate and train the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Make predictions on the test set and evaluate performance
rf_pred = rf_model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, rf_pred):.2f}")
printdf(
pd.DataFrame(classification_report(y_test, rf_pred, target_names=target_names, output_dict=True))
)
Random Forest Accuracy: 1.00

| | Disliked | Liked | accuracy | macro avg | weighted avg |
|---|---|---|---|---|---|
| precision | 1 | 1 | 1 | 1 | 1 |
| recall | 1 | 1 | 1 | 1 | 1 |
| f1-score | 1 | 1 | 1 | 1 | 1 |
| support | 2 | 2 | 1 | 4 | 4 |
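To turn the classifier into an actual recommender, we encode a new, unseen product with the same fitted encoder and predict the user's preference. Because the encoder was created with handle_unknown='ignore', a category it never saw during training (the hypothetical 'Brand_D' below) is encoded as all zeros rather than raising an error. A minimal sketch:

# A hypothetical new product the models have never seen.
new_product = pd.DataFrame({
    'Product_Category': ['Electronics'],
    'Price_Range': ['Low'],
    'Brand': ['Brand_D'],  # unseen brand -> all-zero brand columns
})

# Reuse the *fitted* encoder; never re-fit it on new data.
new_encoded = ohe.transform(new_product)
print("Predicted preference:", rf_model.predict(new_encoded)[0])

# Class probabilities can serve as a ranking score when ordering
# many candidate products for a recommendation list.
print(dict(zip(rf_model.classes_, rf_model.predict_proba(new_encoded)[0])))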
Explanation
On this tiny synthetic test set (only four samples), both models score a perfect 1.00, so the comparison above illustrates the workflow rather than a real performance gap. On larger and noisier datasets, however, the Random Forest typically outperforms a single Decision Tree: averaging the votes of many de-correlated trees reduces variance and the risk of overfitting to quirks of the training data. This demonstrates a common trade-off in machine learning: the single tree gives a clear, human-readable set of rules, while the ensemble usually delivers superior predictive power, which translates into better recommendations and a more satisfied user base.
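The ensemble is not a complete black box, either: a Random Forest exposes impurity-based feature importances aggregated across all of its trees, recovering some of the interpretability lost by moving beyond a single tree. A short sketch using the fitted `rf_model` and the encoded `feature_names` from above:

# Rank the one-hot encoded features by their mean impurity-based
# importance across the trees in the forest.
importances = pd.Series(rf_model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))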