$k$-nearest neighbors algorithm
Institute for Environmental and Spatial Analysis...University of North Georgia
Contents
1 Introduction
A distance-based nonparametric algorithm by Fix and Hodges (1951).
$k$ is a hyperparameter, which is used to control the learning process, not a parameter, which is learned during training.
2 Applications
Supervised machine learning for data classification
Regression
3 Exercise: $k$-nearest neighbors classification
3.1 Import modules
import matplotlib.pyplot as plt
import numpy as np
3.2 Define hyperparameters
# specify the number of nearest neighbors
k = 3
# specify the size of training data
train_n = 100
3.3 Generate training data
# make this exercise repeatable
np.random.seed(1)
# generate training data
train_x = np.random.rand(train_n, 2)
# generate training labels
train_y = np.random.randint(2, size=train_n)
3.4 Generate new data
# generate new data
new_x = np.random.rand(1, 2)
3.5 Plot data
# plot data
plt.scatter(train_x[train_y==0,0], train_x[train_y==0,1], color='red', marker='^')
plt.scatter(train_x[train_y==1,0], train_x[train_y==1,1], color='blue', marker='s')
plt.scatter(new_x[:,0], new_x[:,1], color='green')
plt.show()
3.6 Calculate Euclidean distances
# calculate Euclidean distances between the new data and training data
dist = ((train_x[:,0]-new_x[:,0])**2 + (train_x[:,1]-new_x[:,1])**2)**0.5
3.7 Find $k$-nearest neighbors
# sort by distance
sort_idx = np.argsort(dist)
# extract k-nearest neighbors
knn_idx = sort_idx[0:k]
knn_x = train_x[knn_idx]
knn_y = train_y[knn_idx]
knn_dist = dist[knn_idx]
3.8 Calculate the new class
# calculate the new class
# https://stackoverflow.com/a/6252400
knn_counts = np.bincount(knn_y)
new_y = knn_counts.argmax()
print(f'counts: {knn_counts}')
print(f'new class: {new_y}')
3.9 Plot everything
# plot everything
plt.scatter(train_x[train_y==0,0], train_x[train_y==0,1], color='red', marker='^')
plt.scatter(train_x[train_y==1,0], train_x[train_y==1,1], color='blue', marker='s')
plt.scatter(new_x[:,0], new_x[:,1], color='green')
for r in knn_dist:
# https://stackoverflow.com/a/29184075
circle = plt.Circle((new_x[:,0], new_x[:,1]), r, color='gray', fill=False)
plt.gca().add_patch(circle)
plt.show()
4 Exercise: Color-based forest classification using the $k$-nearest neighbors algorithm
5 Exercise: Tiling-color-based forest classification using the $k$-nearest neighbors algorithm
nrows, ncols, _ = im.shape
for i in range(0, nrows, tile_size):
i2 = min(i + tile_size, nrows)
for j in range(0, ncols, tile_size):
j2 = min(j + tile_size, ncols)
tile = im[i:i2,j:j2]