K Nearest Neighbors Method

The k nearest neighbors method is one of the simplest supervised learning methods. It classifies an input by finding the k most similar inputs in a set of known input output pairs and returning the most common output among them.

Here is sample Python code to determine the accuracy of the k nearest neighbors method on data:

#!/usr/bin/env python3

"""
Determines the accuracy of the k nearest neighbors method on data.

Usage:
        ./k_nn <data file> <data split percent> <number of nearest neighbors>

Data files must be space delimited with one input output pair per line.

initialization steps:
        Input output pairs are shuffled.
        Inputs             are min max normalized.

Requires SciPy and NumPy.
"""

import scipy.stats
import numpy
import sys

def minmax(data):
        """
        Finds the column-wise min max normalizations of data, rescaling
        each feature into [0, 1] using its own minimum and maximum.
        """

        minima = numpy.min(data, axis=0)
        maxima = numpy.max(data, axis=0)

        return (data - minima) / (maxima - minima)

def init_data(data_file, data_split):
        """
        Creates the model and testing data.
        """

        data         = numpy.loadtxt(data_file)
        numpy.random.shuffle(data)
        data[:, :-1] = minmax(data[:, :-1])
        data_split   = int((data_split / 100) * data.shape[0])

        return data[:data_split, :], data[data_split:, :]

def accuracy(model_data, test_data, n_nn):
        """
        Calculates the accuracy of the model on the testing data.
        """

        outputs = model(test_data[:, :-1], model_data, n_nn)

        return 100 * (outputs == test_data[:, -1]).mean()

def model_(input_, model_data, n_nn):
        """
        Classifies one input as the most common output among its n_nn
        nearest neighbors.
        """

        squares = (input_ - model_data[:, :-1]) ** 2
        indices = numpy.sum(squares, axis=1).argsort()[:n_nn]

        # keepdims=False makes scipy.stats.mode return scalars (SciPy >= 1.9).
        return scipy.stats.mode(model_data[indices, -1], keepdims=False).mode

def model(inputs, model_data, n_nn):
        """
        Finds the model outputs.
        """

        return numpy.apply_along_axis(lambda e: model_(e, model_data, n_nn),
                                      1,
                                      inputs)

model_data, test_data = init_data(sys.argv[1], float(sys.argv[2]))
n_nn                  = int(sys.argv[3])
print(f"testing data accuracy: {accuracy(model_data, test_data, n_nn):.2f}%")
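The normalization step rescales every feature before modeling so that no single feature dominates the squared-distance calculation. A minimal sketch of per-column min max normalization, using a small illustrative array rather than a real data file:

```python
import numpy

# Illustrative feature matrix: three samples, two features.
features = numpy.array([[1.0, 10.0],
                        [2.0, 30.0],
                        [3.0, 20.0]])

# Rescale each column into [0, 1] using its own minimum and maximum.
minima     = features.min(axis=0)
maxima     = features.max(axis=0)
normalized = (features - minima) / (maxima - minima)

print(normalized)
# column 0 becomes [0.0, 0.5, 1.0]; column 1 becomes [0.0, 1.0, 0.5]
```

Normalizing each column independently keeps features on different scales (for example centimeters versus counts) from drowning each other out.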

Here are sample results for the popular Iris flower dataset, which is available from many sources such as scikit-learn:

% ./k_nn Iris_flower_dataset.csv 80 1
testing data accuracy: 96.67%

% ./k_nn Iris_flower_dataset.csv 80 2
testing data accuracy: 93.33%
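As a quick sanity check of the classification step, the same nearest-neighbor logic can be exercised on a tiny hand-made dataset. The four points and the predict helper below are illustrative, not part of the script; the majority vote here uses numpy.unique instead of scipy.stats.mode, which is equivalent for this purpose:

```python
import numpy

# Illustrative model data: two clusters, labeled 0.0 and 1.0.
# Columns are input, input, output, matching the script's layout.
model_data = numpy.array([[0.0, 0.0, 0.0],
                          [0.1, 0.1, 0.0],
                          [1.0, 1.0, 1.0],
                          [0.9, 1.0, 1.0]])

def predict(input_, model_data, n_nn):
        # Indices of the n_nn rows with the smallest squared distances.
        squares = (input_ - model_data[:, :-1]) ** 2
        indices = numpy.sum(squares, axis=1).argsort()[:n_nn]

        # Majority vote over the nearest neighbors' outputs.
        labels         = model_data[indices, -1]
        values, counts = numpy.unique(labels, return_counts=True)

        return values[counts.argmax()]

print(predict(numpy.array([0.2, 0.0]), model_data, 3))  # 0.0
print(predict(numpy.array([0.8, 0.9]), model_data, 3))  # 1.0
```

A point near the first cluster is classified as 0.0 and a point near the second as 1.0, as expected.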