Predicting Customer Value with K-Nearest Neighbors (KNN)
Customer Lifetime Value (CLV) is a key metric that estimates the total revenue a business can expect from a single customer over their entire relationship. Predicting this value allows companies to prioritize high-value customers, tailor marketing campaigns, and optimize retention strategies.
This post will demonstrate how to use K-Nearest Neighbors (KNN) to predict customer value in two ways:
- Regression: Predicting the continuous CLV value.
- Classification: Predicting a customer's value segment (e.g., "high value" vs. "low value").
KNN is an intuitive, non-parametric algorithm that works by finding the "k" most similar data points (neighbors) to a new data point and using their values to make a prediction.
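To make the mechanics concrete, here is a minimal NumPy sketch of the neighbor-averaging idea on toy data (the values are made up purely for illustration; we use scikit-learn's implementation below):
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Regression: average the neighbors' target values
    return y_train[nearest].mean()

# Five toy customers with two features each, and their known CLVs
X_toy = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 2.2]])
y_toy = np.array([3000.0, 3200.0, 9000.0, 9500.0, 3100.0])
print(knn_predict(X_toy, y_toy, np.array([1.1, 2.1])))  # 3100.0, the mean of the 3 nearby CLVs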
Data Preparation
We'll start by loading the dataset, performing some basic preprocessing, and splitting the data into features (X) and our target (y). For a KNN model it's crucial to scale the features, because distance calculations are highly sensitive to differences in value ranges.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.metrics import mean_squared_error, accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Markdown
printdf = lambda df: display(Markdown(df.to_markdown()))
df = pd.read_csv('Customer-Value-Analysis.csv')
# --- Preprocessing ---
# Drop irrelevant features and the target variable
X = df.drop(columns=['Customer', 'Customer Lifetime Value'])
y = df['Customer Lifetime Value']
# Convert categorical features to numerical using one-hot encoding
X = pd.get_dummies(X, drop_first=True)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the numerical features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
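To see why scaling matters, consider two hypothetical customers that differ by $100 in a dollar-valued feature and by 7 policies. Without scaling, the dollar gap swamps the policy gap; the standard deviations of 5000 and 2 below are assumed, purely for illustration:
# Why scaling matters: a worked distance calculation
# Two hypothetical customers: [income in dollars, number of policies]
a = np.array([45000.0, 2.0])
b = np.array([45100.0, 9.0])
# Raw Euclidean distance: the $100 income gap dominates the 7-policy gap
print(np.linalg.norm(a - b))                              # ~100.24
# After dividing by assumed standard deviations (5000 and 2),
# the policy difference drives the distance instead
print(np.linalg.norm((a - b) / np.array([5000.0, 2.0])))  # ~3.50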
KNN for Regression: Predicting Continuous Customer Value
For regression, KNN predicts a continuous value. The KNeighborsRegressor estimator finds the k nearest neighbors to a new customer and predicts its CLV as the average of those neighbors' CLV values.
# Create a KNN Regressor model
knn_regressor = KNeighborsRegressor(n_neighbors=5)
# Train the model on the scaled training data
knn_regressor.fit(X_train_scaled, y_train)
# Make predictions on the test set
y_pred_reg = knn_regressor.predict(X_test_scaled)
# Evaluate the model's performance using Mean Squared Error
mse = mean_squared_error(y_test, y_pred_reg)
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:.2f}")
# Display a few example predictions vs. actual values
comparison_df = pd.DataFrame({'Actual CLV': y_test, 'Predicted CLV': y_pred_reg})
print("\nSample Predictions:")
printdf(comparison_df.head())
RMSE: 6952.69
Sample Predictions:
|      |   Actual CLV |   Predicted CLV |
|-----:|-------------:|----------------:|
|  708 |      4222.63 |         7320.64 |
|   47 |      5514.34 |         9126.83 |
| 3995 |      3808.12 |         5296.22 |
| 1513 |      7914.82 |        15411.6  |
| 3686 |      7931.72 |         8780.14 |
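As a sanity check on the averaging described above, we can ask the fitted model for the neighbors behind a single prediction; with the default uniform weights, the prediction should equal the mean of those neighbors' CLV values:
# Sanity check: a KNN prediction is literally the mean of its neighbors' targets
distances, neighbor_idx = knn_regressor.kneighbors(X_test_scaled[:1], n_neighbors=5)
neighbor_clv = y_train.iloc[neighbor_idx[0]]
print("Neighbor CLVs:", neighbor_clv.values)
print("Mean of neighbors:", neighbor_clv.mean())
print("Model prediction: ", knn_regressor.predict(X_test_scaled[:1])[0])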
# Plotting the regression results
plt.figure(figsize=(10, 6))
sns.regplot(x=y_test, y=y_pred_reg, scatter_kws={'alpha':0.3, 'color':'skyblue'}, line_kws={'color':'red'})
plt.xlabel("Actual Customer Lifetime Value")
plt.ylabel("Predicted Customer Lifetime Value")
plt.title("KNN Regressor: Actual vs. Predicted CLV")
plt.grid(True)
plt.show()
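The results above use the default k=5. Before settling on a value, it's worth cross-validating a few candidates on the training set; here is a quick sketch (the candidate list is arbitrary):
# Sketch: pick k by 5-fold cross-validated RMSE on the training set
from sklearn.model_selection import cross_val_score
for k in [3, 5, 10, 20, 50]:
    model = KNeighborsRegressor(n_neighbors=k)
    scores = cross_val_score(model, X_train_scaled, y_train,
                             scoring='neg_root_mean_squared_error', cv=5)
    print(f"k={k:>2}: mean CV RMSE = {-scores.mean():.2f}")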
KNN for Classification: Classifying Customer Value Segments
For classification, KNN predicts a discrete class. First, we need to create these classes from our continuous CLV data. A common approach is to segment customers into tiers like "High Value" and "Low Value." The KNeighborsClassifier will then find the k nearest neighbors and predict the class based on the most frequent class among those neighbors.
# --- Data Preparation for Classification ---
# Create a binary target: 'High Value' if CLV is above the median, otherwise 'Low Value'
median_clv = df['Customer Lifetime Value'].median()
df['Value_Segment'] = np.where(df['Customer Lifetime Value'] > median_clv, 'High Value', 'Low Value')
# Drop the original CLV column and re-define features and target
X_clf = df.drop(columns=['Customer', 'Customer Lifetime Value', 'Value_Segment'])
y_clf = df['Value_Segment']
# Re-preprocess and split the data for the classification task
X_clf = pd.get_dummies(X_clf, drop_first=True)
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(X_clf, y_clf, test_size=0.2, random_state=42)
# Fit a fresh scaler on the classification training split
scaler_clf = StandardScaler()
X_train_clf_scaled = scaler_clf.fit_transform(X_train_clf)
X_test_clf_scaled = scaler_clf.transform(X_test_clf)
# Create a KNN Classifier model
knn_classifier = KNeighborsClassifier(n_neighbors=5)
# Train the model
knn_classifier.fit(X_train_clf_scaled, y_train_clf)
# Make predictions and evaluate
y_pred_clf = knn_classifier.predict(X_test_clf_scaled)
accuracy = accuracy_score(y_test_clf, y_pred_clf)
print(f"Accuracy: {accuracy:.2f}")
# Display a detailed report
print("\nClassification Report:")
printdf(pd.DataFrame(classification_report(y_test_clf, y_pred_clf, output_dict=True)))
Accuracy: 0.62
Classification Report:
|           |   High Value |   Low Value |   accuracy |   macro avg |   weighted avg |
|:----------|-------------:|------------:|-----------:|------------:|---------------:|
| precision |     0.612221 |    0.632172 |   0.622879 |    0.622197 |       0.622562 |
| recall    |     0.592045 |    0.651531 |   0.622879 |    0.621788 |       0.622879 |
| f1-score  |     0.601964 |    0.641706 |   0.622879 |    0.621835 |       0.622564 |
| support   |   880        |  947        |   0.622879 | 1827        |    1827        |
cm = confusion_matrix(y_test_clf, y_pred_clf)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['High Value', 'Low Value'], yticklabels=['High Value', 'Low Value'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix for KNN Classifier')
plt.show()
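Because the prediction is a majority vote, predict_proba simply reports the fraction of the k neighbors belonging to each class, which makes the classifier's confidence easy to interpret:
# With uniform weights and k=5, each class probability is a multiple of 1/5
proba = knn_classifier.predict_proba(X_test_clf_scaled[:5])
print("Classes:", knn_classifier.classes_)
print(proba)  # e.g. a row of [0.8, 0.2] means 4 of 5 neighbors were 'High Value'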
Why Use KNN for This Task?
KNN offers a simple yet powerful approach to customer value prediction. Its non-parametric nature means it makes no assumptions about the data's underlying distribution, making it flexible for various datasets.
- For Regression, KNN provides a direct, interpretable prediction based on the average value of similar customers.
- For Classification, it segments customers into meaningful groups based on the majority vote of their nearest neighbors.
While more complex models often outperform KNN, its simplicity and interpretability make it an excellent starting point for understanding your data and building a reliable baseline for customer value analytics.