$k$-nearest neighbors algorithm

Dr. Huidae Cho
Institute for Environmental and Spatial Analysis...University of North Georgia

1   Introduction

A distance-based nonparametric algorithm by Fix and Hodges (1951).

KnnClassification.svg

$k$ is a hyperparameter, which is used to control the learning process, not a parameter, which is learned during training.

2   Applications

Supervised machine learning for data classification

Regression

3   Exercise: $k$-nearest neighbors classification

3.1   Import modules

import matplotlib.pyplot as plt
import numpy as np

3.2   Define hyperparameters

# specify the number of nearest neighbors
k = 3

# specify the size of training data
train_n = 100

3.3   Generate training data

# make this exercise repeatable
np.random.seed(1)

# generate training data
train_x = np.random.rand(train_n, 2)
# generate training labels
train_y = np.random.randint(2, size=train_n)

3.4   Generate new data

# generate new data
new_x = np.random.rand(1, 2)

3.5   Plot data

# plot data
plt.scatter(train_x[train_y==0,0], train_x[train_y==0,1], color='red', marker='^')
plt.scatter(train_x[train_y==1,0], train_x[train_y==1,1], color='blue', marker='s')
plt.scatter(new_x[:,0], new_x[:,1], color='green')
plt.show()

3.6   Calculate Euclidean distances

# calculate Euclidean distances between the new data and training data
dist = ((train_x[:,0]-new_x[:,0])**2 + (train_x[:,1]-new_x[:,1])**2)**0.5

3.7   Find $k$-nearest neighbors

# sort by distance
sort_idx = np.argsort(dist)

# extract k-nearest neighbors
knn_idx = sort_idx[0:k]
knn_x = train_x[knn_idx]
knn_y = train_y[knn_idx]
knn_dist = dist[knn_idx]

3.8   Calculate the new class

# calculate the new class
# https://stackoverflow.com/a/6252400
knn_counts = np.bincount(knn_y)
new_y = knn_counts.argmax()

print(f'counts: {knn_counts}')
print(f'new class: {new_y}')

3.9   Plot everything

# plot everything
plt.scatter(train_x[train_y==0,0], train_x[train_y==0,1], color='red', marker='^')
plt.scatter(train_x[train_y==1,0], train_x[train_y==1,1], color='blue', marker='s')
plt.scatter(new_x[:,0], new_x[:,1], color='green')

for r in knn_dist:
    # https://stackoverflow.com/a/29184075
    circle = plt.Circle((new_x[:,0], new_x[:,1]), r, color='gray', fill=False)
    plt.gca().add_patch(circle)

plt.show()

4   Exercise: Color-based forest classification using the $k$-nearest neighbors algorithm

5   Exercise: Tiling-color-based forest classification using the $k$-nearest neighbors algorithm

nrows, ncols, _ = im.shape

for i in range(0, nrows, tile_size):
    i2 = min(i + tile_size, nrows)
    for j in range(0, ncols, tile_size):
        j2 = min(j + tile_size, ncols)
        tile = im[i:i2,j:j2]

6   References