Cluster Space Reduction Using Prototype Generation and Multi-Objective Genetic Algorithm
This notebook introduces a cluster space reduction technique that uses prototype generation driven by a Multi-Objective Genetic Algorithm (MOGA).
The "Curse of Dimensionality" and its Impact on KNN
The primary impediment to the widespread application of KNN in the context of "big data" is its computational complexity. For a training dataset containing $N$ instances, each described by $D$ features (dimensions), classifying a single new instance requires the calculation of $N$ distances in $D$-dimensional space. This results in a computational cost of $O(N \cdot D)$ for each prediction. When $N$ is in the millions or billions, this cost becomes prohibitive, making the algorithm unsuitable for real-time applications or environments that require rapid classification of many new data points.
Beyond the raw computational and memory costs, KNN's performance is susceptible to a more subtle but equally damaging phenomenon known as the "curse of dimensionality". As the number of features increases, the volume of the feature space grows exponentially. Consequently, the available training data becomes increasingly sparse, meaning that the density of data points in any given region of the space becomes very low.
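To make the per-query cost concrete, here is a minimal sketch of the brute-force distance computation a vanilla KNN query performs against every stored training point (the sizes below are arbitrary assumptions):

import numpy as np

# Assumed, illustrative sizes: N stored training points with D features each
N, D = 100_000, 50
X_demo = np.random.rand(N, D)
query = np.random.rand(D)

# A single prediction must compute N distances in D-dimensional space: O(N * D)
distances = np.linalg.norm(X_demo - query, axis=1)
nearest_indices = np.argsort(distances)[:5]  # the 5 nearest neighbours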
There are two general approaches to space reduction using Evolutionary Algorithms:
1. Evolutionary Instance Selection (IS): Pruning the Training Set. By reducing the number of instances used for classification, IS directly lowers both the memory requirements and the computational time of the KNN algorithm. Finding the optimal subset of instances is an NP-hard problem, as an exhaustive search would require evaluating $2^N$ candidate subsets [1].
2. Evolutionary Prototype Generation (PG): Synthesizing Optimal Class Representatives - While Instance Selection is powerful, it is fundamentally limited by the content of the original training data; it can only select from what is already there. A more advanced and flexible approach, which directly aligns with the concept of evolving a population of representative vectors, is Evolutionary Prototype Generation (PG). Sophisticated PG frameworks, such as the Evolutionary Multi-Objective Prototype Generation and Feature Selection (EMOPG+FS) algorithm, provide a clear blueprint for this process [2].
We'll focus on Evolutionary Prototype Generation for Cluster Space Reduction.
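To make the distinction between the two approaches concrete, here is a purely illustrative sketch of the two genotype encodings (the sizes are arbitrary assumptions): IS evolves a subset of the existing points, while PG evolves free-floating representative vectors.

import numpy as np

rng = np.random.default_rng(0)
n_train, n_features, n_prototypes = 1000, 3, 50

# Instance Selection genotype: a binary inclusion mask over the original training set
is_individual = rng.integers(0, 2, size=n_train)

# Prototype Generation genotype: synthesized coordinates, not tied to existing points
pg_individual = rng.uniform(-10.0, 10.0, size=(n_prototypes, n_features))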
Objectives
The goal is to replace a large cluster of data points with a small, strategically chosen set of "prototypes." A good set of prototypes for a class should be:
- Representative: The prototypes should be close to the actual data points of their own class (high cohesion).
- Discriminative: The prototypes should be far away from the data points of other classes (high separation).
We'll treat these two goals as separate objectives in a MOGA, which is designed to find optimal trade-offs between competing goals.
The Fitness Function
Our fitness function needs to evaluate how good a given set of prototypes is. It will have two objectives:
- Minimize Cohesion ($f_{\text{coh}}$): The average distance from each point in a class to its nearest prototype. A lower value means the prototypes are a better representation of the class.
- Maximize Separation ($f_{\text{sep}}$): The average distance from each prototype to the nearest point in any other class. A higher value means the prototypes are further from confusing neighbors.
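Written out explicitly for a class $C$ with prototype set $P$ and the set $O$ of all points from other classes (this notation is introduced here for illustration; it mirrors the fitness code below):

$$f_{\text{coh}}(P) = \frac{1}{|C|} \sum_{x \in C} \min_{p \in P} \lVert x - p \rVert$$

$$f_{\text{sep}}(P) = \frac{1}{|P|} \sum_{p \in P} \min_{y \in O} \lVert p - y \rVert$$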
Python Implementation
To implement the MOGA, we'll use the DEAP evolutionary computation framework:
! pip install -q numpy deap scikit-learn matplotlib
Generate a synthetic dataset.
We will create a dataset with 5 distinct, 3-dimensional clusters. A KNN classifier trained on the full dataset will serve as our baseline.
import numpy as np
import random
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt
from deap import base, creator, tools, algorithms
import time
N_SAMPLES_PER_CLUSTER = 1000
N_CLUSTERS = 5
N_FEATURES = 3
RANDOM_STATE = 42
X, y = make_blobs(
n_samples=N_SAMPLES_PER_CLUSTER * N_CLUSTERS,
n_features=N_FEATURES,
centers=N_CLUSTERS,
cluster_std=1.5,
random_state=RANDOM_STATE
)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=RANDOM_STATE, stratify=y
)
def plot_original_data(X, y, title="Original Data Clusters"):
"""Helper function to plot the 3D data."""
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap='viridis', marker='o', alpha=0.6)
legend1 = ax.legend(*scatter.legend_elements(), title="Classes")
ax.add_artist(legend1)
ax.set_title(title)
ax.set_xlabel('Feature X')
ax.set_ylabel('Feature Y')
ax.set_zlabel('Feature Z')
plt.show()
plot_original_data(X_train, y_train, title="Training Data Clusters")
Train Baseline Model
Train a standard KNN classifier on the full training dataset. This will be our performance baseline to compare against later.
baseline_knn = KNeighborsClassifier(n_neighbors=5)
start_time = time.time()
baseline_knn.fit(X_train, y_train)
baseline_fit_time = time.time() - start_time
start_time = time.time()
y_pred_baseline = baseline_knn.predict(X_test)
baseline_predict_time = time.time() - start_time
baseline_accuracy = accuracy_score(y_test, y_pred_baseline)
print(f"Baseline KNN trained on {len(X_train)} points.\n")
print(f"Fit time: {baseline_fit_time:.4f} seconds\n")
print(f"Prediction time: {baseline_predict_time:.4f} seconds\n")
print(f"Baseline Accuracy: {baseline_accuracy:.4f}\n")
print("-" * 30 +"\n")
Baseline KNN trained on 3500 points.
Fit time: 0.0025 seconds
Prediction time: 0.0037 seconds
Baseline Accuracy: 0.9720
Fitness Function
In DEAP, the optimization direction of each objective is set by its fitness weight; we give both objectives a weight of -1, so the problem is framed as minimizing cohesion and minimizing the negative of separation.
# Fitness Function
# Objective 1: Minimize intra-class distance (cohesion)
# Objective 2: Minimize negative inter-class distance (maximize separation)
creator.create("FitnessMulti", base.Fitness, weights=(-1.0, -1.0))
creator.create("Individual", list, fitness=creator.FitnessMulti)
toolbox = base.Toolbox()
def evaluate_prototypes(individual, own_class_data, other_classes_data):
"""
Fitness function.
- individual: A list of indices pointing to the selected prototypes.
"""
prototypes = own_class_data[individual]
# Objective 1: Cohesion (Average distance from each point to nearest prototype)
total_min_dist_cohesion = 0
for point in own_class_data:
distances = np.linalg.norm(prototypes - point, axis=1)
total_min_dist_cohesion += np.min(distances)
cohesion = total_min_dist_cohesion / len(own_class_data)
# Objective 2: Separation (Average distance from each prototype to nearest 'other' point)
total_min_dist_separation = 0
for p in prototypes:
distances = np.linalg.norm(other_classes_data - p, axis=1)
total_min_dist_separation += np.min(distances)
separation = total_min_dist_separation / len(prototypes)
    # Cohesion is minimized directly; separation is maximized by minimizing its negative
    return cohesion, -separation
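The nested loops above are easy to read but scale poorly with class size. As an optional sketch, the same two objectives can be computed with pairwise distance matrices via scipy.spatial.distance.cdist (SciPy is already installed as a scikit-learn dependency):

from scipy.spatial.distance import cdist

def evaluate_prototypes_vectorized(individual, own_class_data, other_classes_data):
    """Vectorized equivalent of evaluate_prototypes."""
    prototypes = own_class_data[individual]
    # Cohesion: each class point's distance to its nearest prototype, averaged
    cohesion = cdist(own_class_data, prototypes).min(axis=1).mean()
    # Separation: each prototype's distance to its nearest other-class point, averaged
    separation = cdist(prototypes, other_classes_data).min(axis=1).mean()
    return cohesion, -separation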
Running MOGA
N_PROTOTYPES_PER_CLASS = 50
POPULATION_SIZE = 100
CXPB, MUTPB, NGEN = 0.7, 0.2, 50 # Crossover, Mutation, Generations
all_prototypes = []
all_prototype_labels = []
for i in range(N_CLUSTERS):
# Isolate data for the current class and all other classes
class_indices = np.where(y_train == i)[0]
own_class_data = X_train[class_indices]
other_classes_data = X_train[y_train != i]
# Attribute generator: A random index from the current class's data
toolbox.register("attr_idx", random.sample, range(len(own_class_data)), N_PROTOTYPES_PER_CLASS)
# Structure initializers
toolbox.register("individual", tools.initIterate, creator.Individual, toolbox.attr_idx)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("evaluate", evaluate_prototypes, own_class_data=own_class_data, other_classes_data=other_classes_data)
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutUniformInt, low=0, up=len(own_class_data)-1, indpb=0.05)
toolbox.register("select", tools.selNSGA2) # Using NSGA-II for multi-objective selection
pop = toolbox.population(n=POPULATION_SIZE)
hof = tools.ParetoFront() # Hall of fame to store the best non-dominated solutions
algorithms.eaSimple(pop, toolbox, cxpb=CXPB, mutpb=MUTPB, ngen=NGEN,
stats=None, halloffame=hof, verbose=False)
    # A simple strategy: pick the solution with the largest separation;
    # fitness value [1] stores negative separation, so we take the minimum
    best_individual = min(hof, key=lambda ind: ind.fitness.values[1])
best_prototypes_indices = best_individual
best_prototypes = own_class_data[best_prototypes_indices]
all_prototypes.extend(best_prototypes)
all_prototype_labels.extend([i] * len(best_prototypes))
X_reduced = np.array(all_prototypes)
y_reduced = np.array(all_prototype_labels)
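Choosing the Pareto-front member with the largest separation is only one possible strategy. A minimal, illustrative alternative (the helper name select_balanced is our own) is to pick a balanced trade-off: min-max normalize both minimized objectives and take the member closest to the ideal point. Inside the loop, best_individual = select_balanced(hof) would replace the separation-only choice.

def select_balanced(pareto_front):
    """Pick the Pareto-front member closest to the ideal point after min-max normalization."""
    fits = np.array([ind.fitness.values for ind in pareto_front])
    spans = fits.max(axis=0) - fits.min(axis=0)
    spans[spans == 0] = 1.0  # guard against a degenerate single-point front
    normalized = (fits - fits.min(axis=0)) / spans
    # Both objectives are minimized (the second is negative separation), so the ideal point is (0, 0)
    return pareto_front[int(np.argmin(np.linalg.norm(normalized, axis=1)))]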
def plot_prototypes(X, y, prototypes, prototype_labels, title="Selected Prototypes"):
fig = plt.figure(figsize=(12, 9))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap='viridis', marker='o', alpha=0.1, label='Original Data')
    scatter = ax.scatter(prototypes[:, 0], prototypes[:, 1], prototypes[:, 2],
                         c=prototype_labels, cmap='viridis', marker='x', s=100,
                         linewidth=1.5, label='Prototypes')
legend1 = ax.legend(*scatter.legend_elements(), title="Classes")
ax.add_artist(legend1)
ax.set_title(title)
ax.set_xlabel('Feature X')
ax.set_ylabel('Feature Y')
ax.set_zlabel('Feature Z')
plt.show()
plot_prototypes(X_train, y_train, X_reduced, y_reduced)
Comparing Models
reduced_knn = KNeighborsClassifier(n_neighbors=5)
start_time = time.time()
reduced_knn.fit(X_reduced, y_reduced)
reduced_fit_time = time.time() - start_time
start_time = time.time()
y_pred_reduced = reduced_knn.predict(X_test)
reduced_predict_time = time.time() - start_time
reduced_accuracy = accuracy_score(y_test, y_pred_reduced)
print("\n--- Final Results Comparison ---\n")
print("\nBaseline KNN (Full Dataset):\n")
print(f"Training points: {len(X_train)}\n")
print(f"Fit time: {baseline_fit_time:.4f} s\n")
print(f"Prediction time: {baseline_predict_time:.4f} s\n")
print(f"Accuracy: {baseline_accuracy:.4f}\n")
print("\nReduced KNN (GA Prototypes):\n")
print(f"Training points: {len(X_reduced)}\n")
data_reduction = (1 - len(X_reduced) / len(X_train)) * 100
print(f"Data Reduction: {data_reduction:.2f}%\n")
print(f"Fit time: {reduced_fit_time:.4f} s\n")
print(f"Prediction time: {reduced_predict_time:.4f} s\n")
speed_increase = (baseline_predict_time / reduced_predict_time)
print(f"Prediction Speed Increase: {speed_increase:.2f}x\n")
print(f"Accuracy: {reduced_accuracy:.4f}\n")
print("-" * 30 + "\n")
--- Final Results Comparison ---
Baseline KNN (Full Dataset):
Training points: 3500
Fit time: 0.0025 s
Prediction time: 0.0037 s
Accuracy: 0.9720
Reduced KNN (GA Prototypes):
Training points: 250
Data Reduction: 92.86%
Fit time: 0.0009 s
Prediction time: 0.0022 s
Prediction Speed Increase: 1.64x
Accuracy: 0.9727
Conclusion
This notebook demonstrates that evolutionary prototype generation is a viable space reduction strategy: the reduced model matched the baseline's accuracy while storing roughly 93% fewer training points and predicting faster. It transforms KNN from a "lazy" learner that must store all data into a more compact model, making it practical for applications where prediction speed and memory are critical.
References
- [1] S. Madi and A. R. Baba-Ali, "A multi-stage genetic algorithm for instance selection dedicated to k nearest neighbours classification," International Journal of Metaheuristics.
- [2] "Evolutionary Multi-Objective Approach for Prototype Generation and Feature Selection," Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE).