
Cluster Space Reduction Using Prototype Generation and Multi-Objective Genetic Algorithm


This notebook introduces a cluster space reduction technique based on prototype generation with a Multi-Objective Genetic Algorithm (MOGA).

The "Curse of Dimensionality" and its Impact on KNN

The primary impediment to the widespread application of KNN in the context of "big data" is its computational complexity. For a training dataset containing $N$ instances, each described by $d$ features (dimensions), classifying a single new instance requires the calculation of $N$ distances in $d$-dimensional space. This results in a computational cost of $O(N \cdot d)$ for each prediction. When $N$ is in the millions or billions, this cost becomes prohibitive, making the algorithm unsuitable for real-time applications or environments that require rapid classification of many new data points.
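To make this cost concrete, here is a minimal sketch of a brute-force KNN query in plain NumPy (an illustration only, not the scikit-learn implementation): every prediction computes one $d$-dimensional distance per stored instance, which is where the $O(N \cdot d)$ term comes from.

import numpy as np

def brute_force_knn_query(X_train, x_new, k=5):
    """Return the indices of the k training points nearest to x_new."""
    diffs = X_train - x_new                  # (N, d) array of differences
    dists = np.linalg.norm(diffs, axis=1)    # N Euclidean distances: O(N*d) work
    return np.argsort(dists)[:k]             # indices of the k closest instances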

Beyond the raw computational and memory costs, KNN's performance is susceptible to a more subtle but equally damaging phenomenon known as the "curse of dimensionality". As the number of features $d$ increases, the volume of the feature space grows exponentially. Consequently, the available training data becomes increasingly sparse, meaning that the density of data points in any given region of the space becomes very low.

There are two general approaches to space reduction using Evolutionary Algorithms:

1. Evolutionary Instance Selection (IS): Pruning the Training Set - by reducing the number of instances used for classification, IS directly lowers both the memory requirements and the computational time of the KNN algorithm. The task of finding the optimal subset of instances is an NP-hard problem, as an exhaustive search over $N$ training instances would require evaluating $2^N$ candidate subsets [1].

2. Evolutionary Prototype Generation (PG): Synthesizing Optimal Class Representatives - While Instance Selection is powerful, it is fundamentally limited by the content of the original training data; it can only select from what is already there. A more advanced and flexible approach, which directly aligns with the concept of evolving a population of representative vectors, is Evolutionary Prototype Generation (PG). Sophisticated PG frameworks, such as the Evolutionary Multi-Objective Prototype Generation and Feature Selection (EMOPG+FS) algorithm, provide a clear blueprint for this process [2].

We'll focus on Evolutionary Prototype Generation for Cluster Space Reduction.
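To make the distinction between the two approaches concrete, here is a hedged toy illustration (not taken from the cited papers): IS keeps a subset of the original rows, while PG is free to synthesize entirely new representative vectors, for example k-means centroids.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_class = rng.normal(size=(1000, 3))     # toy data standing in for a single class

# Instance Selection: prototypes are existing rows (a random 5% mask stands in for an evolved one)
mask = rng.random(len(X_class)) < 0.05
is_subset = X_class[mask]

# Prototype Generation: prototypes are newly synthesized vectors, e.g. k-means centroids
pg_prototypes = KMeans(n_clusters=50, n_init=10, random_state=0).fit(X_class).cluster_centers_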

Objectives

The goal is to replace a large cluster of data points with a small, strategically chosen set of "prototypes." A good set of prototypes for a class should be:

  1. Representative: The prototypes should be close to the actual data points of their own class (high cohesion).

  2. Discriminative: The prototypes should be far away from the data points of other classes (high separation).

We'll treat these two goals as separate objectives in a MOGA, which is designed to find optimal trade-offs between competing goals.
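Because the two objectives pull against each other, there is rarely a single best prototype set; a MOGA instead keeps the set of non-dominated (Pareto-optimal) trade-offs. A minimal sketch of the dominance test for our pair of scores, assuming plain (cohesion, separation) tuples:

def dominates(a, b):
    """True if a = (cohesion, separation) dominates b: no worse on both objectives
    and strictly better on at least one (cohesion is minimized, separation maximized)."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better

assert dominates((1.0, 4.0), (1.2, 3.5))      # better on both objectives
assert not dominates((1.0, 3.0), (1.5, 4.0))  # a trade-off: neither solution dominates the other
assert not dominates((1.5, 4.0), (1.0, 3.0))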

The Fitness Function

Our fitness function needs to evaluate how good a given set of prototypes is. It has two objectives:

  1. Minimize Cohesion ($f_{cohesion}$): The average distance from each point in a class to its nearest prototype. A lower value means the prototypes are a better representation of the class.
$$f_{cohesion}(P_i) = \frac{1}{|D_i|} \sum_{x \in D_i} \min_{p \in P_i} d(x, p)$$
  2. Maximize Separation ($f_{separation}$): The average distance from each prototype to the nearest point in any other class. A higher value means the prototypes are further from confusing neighbors.
$$f_{separation}(P_i) = \frac{1}{|P_i|} \sum_{p \in P_i} \min_{y \in D_{other}} d(p, y)$$
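As a quick sanity check with toy numbers (not taken from the dataset below): let $D_i = \{(0,0), (2,0)\}$, $P_i = \{(0,0)\}$ and $D_{other} = \{(5,0)\}$. Then $f_{cohesion} = \frac{1}{2}(0 + 2) = 1$ and $f_{separation} = 5$. Adding a second prototype at $(2,0)$ drives cohesion down to $0$ but drags separation down to $\frac{1}{2}(5 + 3) = 4$: this tension between the two objectives is exactly what the MOGA trades off along its Pareto front.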

Python Implementation

To implement the MOGA we'll use the DEAP evolutionary computation framework:

! pip install -q numpy deap scikit-learn matplotlib

Generate a synthetic dataset.

We will create a dataset with 5 distinct, 3-dimensional clusters. A KNN classifier trained on the full training split will serve as the performance baseline.

import numpy as np
import random
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt
from deap import base, creator, tools, algorithms
import time
 
N_SAMPLES_PER_CLUSTER = 1000
N_CLUSTERS = 5
N_FEATURES = 3
RANDOM_STATE = 42
 
X, y = make_blobs(
    n_samples=N_SAMPLES_PER_CLUSTER * N_CLUSTERS,
    n_features=N_FEATURES,
    centers=N_CLUSTERS,
    cluster_std=1.5,
    random_state=RANDOM_STATE
)
 
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=RANDOM_STATE, stratify=y
)
 
def plot_original_data(X, y, title="Original Data Clusters"):
    """Helper function to plot the 3D data."""
    fig = plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111, projection='3d')
    scatter = ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap='viridis', marker='o', alpha=0.6)
    legend1 = ax.legend(*scatter.legend_elements(), title="Classes")
    ax.add_artist(legend1)
    ax.set_title(title)
    ax.set_xlabel('Feature X')
    ax.set_ylabel('Feature Y')
    ax.set_zlabel('Feature Z')
    plt.show()
 
plot_original_data(X_train, y_train, title="Training Data Clusters")

(Figure: 3D scatter plot of the training data clusters.)

Train Baseline Model

Train a standard KNN classifier on the full training dataset. This will be our performance baseline to compare against later.

baseline_knn = KNeighborsClassifier(n_neighbors=5)
 
start_time = time.time()
baseline_knn.fit(X_train, y_train)
baseline_fit_time = time.time() - start_time
 
start_time = time.time()
y_pred_baseline = baseline_knn.predict(X_test)
baseline_predict_time = time.time() - start_time
 
baseline_accuracy = accuracy_score(y_test, y_pred_baseline)
 
print(f"Baseline KNN trained on {len(X_train)} points.\n")
print(f"Fit time: {baseline_fit_time:.4f} seconds\n")
print(f"Prediction time: {baseline_predict_time:.4f} seconds\n")
print(f"Baseline Accuracy: {baseline_accuracy:.4f}\n")
 
print("-" * 30 +"\n")

Baseline KNN trained on 3500 points.

Fit time: 0.0025 seconds

Prediction time: 0.0037 seconds

Baseline Accuracy: 0.9720


Fitness Function

In DEAP, the optimization direction of each objective is set by the fitness weights: a negative weight means the objective is minimized, a positive weight means it is maximized. We therefore give cohesion a weight of -1.0 (minimize) and separation a weight of +1.0 (maximize).

# Fitness Function
# Objective 1: Minimize intra-class distance (cohesion) -> weight -1.0
# Objective 2: Maximize inter-class distance (separation) -> weight +1.0
creator.create("FitnessMulti", base.Fitness, weights=(-1.0, 1.0))
creator.create("Individual", list, fitness=creator.FitnessMulti)
 
toolbox = base.Toolbox()
 
def evaluate_prototypes(individual, own_class_data, other_classes_data):
    """
    Fitness function.
    - individual: A list of indices pointing to the selected prototypes.
    """
    prototypes = own_class_data[individual]
 
    # Objective 1: Cohesion (Average distance from each point to nearest prototype)
    total_min_dist_cohesion = 0
    for point in own_class_data:
        distances = np.linalg.norm(prototypes - point, axis=1)
        total_min_dist_cohesion += np.min(distances)
    cohesion = total_min_dist_cohesion / len(own_class_data)
 
    # Objective 2: Separation (Average distance from each prototype to nearest 'other' point)
    total_min_dist_separation = 0
    for p in prototypes:
        distances = np.linalg.norm(other_classes_data - p, axis=1)
        total_min_dist_separation += np.min(distances)
    separation = total_min_dist_separation / len(prototypes)
 
    # We want to minimize cohesion and maximize separation
    return cohesion, separation
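The loops above are easy to read, but both objectives can also be computed with a single pairwise-distance call each, which is noticeably faster for larger classes. A hedged, behaviour-equivalent alternative, assuming SciPy is available (it is installed as a dependency of scikit-learn):

from scipy.spatial.distance import cdist

def evaluate_prototypes_vectorized(individual, own_class_data, other_classes_data):
    """Same two objectives as evaluate_prototypes, computed via one distance matrix each."""
    prototypes = own_class_data[individual]
    # Cohesion: for every class point, distance to its nearest prototype
    cohesion = cdist(own_class_data, prototypes).min(axis=1).mean()
    # Separation: for every prototype, distance to the nearest point of any other class
    separation = cdist(prototypes, other_classes_data).min(axis=1).mean()
    return cohesion, separation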
 

Running MOGA

N_PROTOTYPES_PER_CLASS = 50
POPULATION_SIZE = 100
CXPB, MUTPB, NGEN = 0.7, 0.2, 50 # Crossover, Mutation, Generations
 
all_prototypes = []
all_prototype_labels = []
 
for i in range(N_CLUSTERS):
    # Isolate data for the current class and all other classes
    class_indices = np.where(y_train == i)[0]
    own_class_data = X_train[class_indices]
    other_classes_data = X_train[y_train != i]
 
    # Attribute generator: a random sample of unique indices from the current class's data
    toolbox.register("attr_idx", random.sample, range(len(own_class_data)), N_PROTOTYPES_PER_CLASS)
    
    # Structure initializers
    toolbox.register("individual", tools.initIterate, creator.Individual, toolbox.attr_idx)
    toolbox.register("population", tools.initRepeat, list, toolbox.individual)
 
    toolbox.register("evaluate", evaluate_prototypes, own_class_data=own_class_data, other_classes_data=other_classes_data)
    toolbox.register("mate", tools.cxTwoPoint)
    toolbox.register("mutate", tools.mutUniformInt, low=0, up=len(own_class_data)-1, indpb=0.05)
    toolbox.register("select", tools.selNSGA2) # Using NSGA-II for multi-objective selection
 
    pop = toolbox.population(n=POPULATION_SIZE)
    hof = tools.ParetoFront() # Hall of fame to store the best non-dominated solutions
 
    algorithms.eaSimple(pop, toolbox, cxpb=CXPB, mutpb=MUTPB, ngen=NGEN,
                        stats=None, halloffame=hof, verbose=False)
 
    # A simple strategy: pick the one with the best separation score
    best_individual = max(hof, key=lambda ind: ind.fitness.values[1])
    best_prototypes_indices = best_individual
    best_prototypes = own_class_data[best_prototypes_indices]
 
    all_prototypes.extend(best_prototypes)
    all_prototype_labels.extend([i] * len(best_prototypes))
 
X_reduced = np.array(all_prototypes)
y_reduced = np.array(all_prototype_labels)
 
def plot_prototypes(X, y, prototypes, prototype_labels, title="Selected Prototypes"):
    fig = plt.figure(figsize=(12, 9))
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap='viridis', marker='o', alpha=0.1, label='Original Data')
    scatter = ax.scatter(prototypes[:, 0], prototypes[:, 1], prototypes[:, 2],
                         c=prototype_labels, cmap='viridis', marker='x', s=100,
                         linewidth=1.5, label='Prototypes')
    legend1 = ax.legend(*scatter.legend_elements(), title="Classes")
    ax.add_artist(legend1)
    ax.set_title(title)
    ax.set_xlabel('Feature X')
    ax.set_ylabel('Feature Y')
    ax.set_zlabel('Feature Z')
    plt.show()
 
plot_prototypes(X_train, y_train, X_reduced, y_reduced)


(Figure: original training data shown faintly with the GA-selected prototypes marked as crosses.)
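Picking the Pareto-front member with the best separation is just one heuristic; another common choice is a balanced compromise, e.g. the member closest to the ideal point after normalizing both objectives. A hedged sketch under the fitness layout used above (values[0] = cohesion, lower is better; values[1] = separation, higher is better), with a hypothetical helper name:

def pick_balanced(front):
    """Pick the Pareto-front member closest to the ideal point after min-max normalization."""
    fits = np.array([ind.fitness.values for ind in front])  # columns: cohesion, separation
    span = fits.max(axis=0) - fits.min(axis=0)
    span[span == 0] = 1.0                                    # guard against a degenerate front
    norm = (fits - fits.min(axis=0)) / span
    # Ideal point after normalization: cohesion at 0 (its minimum), separation at 1 (its maximum)
    dist_to_ideal = np.sqrt(norm[:, 0] ** 2 + (1.0 - norm[:, 1]) ** 2)
    return front[int(np.argmin(dist_to_ideal))]

# e.g. best_individual = pick_balanced(hof) could replace the max-separation choice inside the loop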

Comparing Models

 
reduced_knn = KNeighborsClassifier(n_neighbors=5)
start_time = time.time()
reduced_knn.fit(X_reduced, y_reduced)
reduced_fit_time = time.time() - start_time
 
start_time = time.time()
y_pred_reduced = reduced_knn.predict(X_test)
reduced_predict_time = time.time() - start_time
reduced_accuracy = accuracy_score(y_test, y_pred_reduced)
 
print("\n--- Final Results Comparison ---\n")
 
print("\nBaseline KNN (Full Dataset):\n")
print(f"Training points: {len(X_train)}\n")
print(f"Fit time: {baseline_fit_time:.4f} s\n")
print(f"Prediction time: {baseline_predict_time:.4f} s\n")
print(f"Accuracy: {baseline_accuracy:.4f}\n")
 
print("\nReduced KNN (GA Prototypes):\n")
print(f"Training points: {len(X_reduced)}\n")
 
data_reduction = (1 - len(X_reduced) / len(X_train)) * 100
print(f"Data Reduction: {data_reduction:.2f}%\n")
print(f"Fit time: {reduced_fit_time:.4f} s\n")
print(f"Prediction time: {reduced_predict_time:.4f} s\n")
 
speed_increase = (baseline_predict_time / reduced_predict_time)
print(f"Prediction Speed Increase: {speed_increase:.2f}x\n")
print(f"Accuracy: {reduced_accuracy:.4f}\n")
 
print("-" * 30 + "\n")

--- Final Results Comparison ---

Baseline KNN (Full Dataset):

Training points: 3500

Fit time: 0.0025 s

Prediction time: 0.0037 s

Accuracy: 0.9720

Reduced KNN (GA Prototypes):

Training points: 250

Data Reduction: 92.86%

Fit time: 0.0009 s

Prediction time: 0.0022 s

Prediction Speed Increase: 1.64x

Accuracy: 0.9727


Conclusion

This notebook demonstrates that evolutionary prototype generation is a viable space reduction strategy: on this synthetic dataset it shrank the KNN reference set by roughly 93% and cut prediction time while matching the baseline accuracy. KNN remains a "lazy" learner, but with far fewer stored points to search through, it becomes practical for applications where prediction speed and memory are critical.

References

Anton Nesterov © 2025 | vski·science