What is KNN? How To Use The K Nearest Neighbors Algorithm to Classify Data and Make Predictions

jason on July 8, 2025, 08:19 PM

Ok, so what is KNN? The acronym stands for K Nearest Neighbors. And what does the K stand for? Basically, K is just a variable name that's commonly used in Data Science and Machine Learning to represent a discrete number of something. So KNN is an algorithm that predicts which group or 'class' something belongs to by examining a certain number (K) of data points that are most similar to it. The main assumption behind KNN is that data points which are similar to each other will also be close to each other in terms of their values. Given that primary function, K-Nearest Neighbors is what's called a classification algorithm: you use it when you want to predict which class a new data point belongs to based on its values. A simple example would be predicting a flower's type based on its similarity to the other flowers in your dataset. If all of the other flowers are already labeled with their types, a new flower coming into the dataset can have its label predicted based on its similarity to those labeled flowers.

Because KNN makes its predictions based on pre-labeled data, it fits into a category of Machine Learning algorithms called "Supervised Learning" algorithms. These differ from the other type, "Unsupervised Learning" algorithms, which reveal patterns in data that hasn't been labeled beforehand. So how can we use KNN in real-world applications? And how does it work? Let's go over those questions in a bit more detail.

Overview of How KNN Works

So, the first thing we'll need if we want to use KNN is some data that contains labels. Our goal is to predict the label for a new data point that the model hasn't seen yet. Since KNN works by finding the nearest neighbors, we'll pick the K data points with the smallest distances from our new point and check what their labels are. Whatever label is most common in that neighboring set becomes our prediction for the new data point. Last, but not least, we'll evaluate the performance of our KNN model on the dataset so that we can find which value of K works best. To start off, we'll just pick a value for K that seems reasonable and tweak it until we get optimal performance.

Preparing the Data by Scaling It

Ok, so since we now know that KNN works by finding the distances between points, we have to consider the ranges of the different features and scale them if they vary too much. Consider, for example, a dataset with two features: age and income. Age might range from 20 to 70 years (a range of 50), while income might range from $10,000 to $200,000 per year (a range of 190,000). This means the differences between income values will generally be much larger than the differences between ages, so income would dominate the results we get from calculating distances. This may or may not be what we want, but in most cases it's safest to do the scaling so that all of our features contribute to the result roughly equally. In this example, we'll use sklearn's StandardScaler. It's pretty simple though. All it's doing is making sure each column has a mean of zero and a variance of one. It does that with this simple formula:

scaled_value = (original_value - mean) / standard_deviation

By subtracting the mean and dividing by the standard deviation, we keep the relative distances between points while forcing the values into a smaller range that won't dominate the results. We don't have to worry about remembering the formula, though, since we can just use sklearn's StandardScaler like this (assuming we have a pandas DataFrame called df):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(df.drop('label_we_want_to_predict', axis=1))
scaled_features = scaler.transform(df.drop('label_we_want_to_predict', axis=1))
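If you want to convince yourself that the scaler really is applying that formula, here's a quick optional sanity check (just a sketch, reusing the scaled_features array from the snippet above): every scaled column should end up with a mean of roughly zero and a standard deviation of roughly one.

import numpy as np

# each feature (column) should now have mean ≈ 0 and standard deviation ≈ 1
print(np.round(scaled_features.mean(axis=0), 2))
print(np.round(scaled_features.std(axis=0), 2))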

Finding the Distance Between Points Using the Pythagorean Theorem

So now that we've got our feature values in ranges that will play nicely with each other, we can calculate the distances between them. Using a simple example, imagine a two-dimensional graph with an X and Y axis where we have data points scattered around. If we want to find the distance between two specific points, we always know how far apart they are in the X direction and how far apart they are in the Y direction. If you draw a line directly connecting the two points, that line plus the X and Y differences forms a right-angle triangle. Now, you might remember from your high-school geometry class that the long side of a right-angle triangle is called the hypotenuse, and we can always find its length using the Pythagorean theorem, a² + b² = c², where c is the hypotenuse. That means we can find the distance between any two points by taking the square root of the sum of the squared differences in X and Y: d = √[(x₂-x₁)² + (y₂-y₁)²]. This is what's called Euclidean distance, and even though this is the two-dimensional version of it, the same idea applies to calculating the distance between points in higher dimensions, so the general formula looks like this:

distance = √[(a₁-b₁)² + (a₂-b₂)² + (a₃-b₃)² + ... + (aₙ-bₙ)²]

So for example, calculating the distance between points in 3 dimensions would look like this:

d = √[(x₂-x₁)² + (y₂-y₁)² + (z₂-z₁)²]
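Just as a quick sketch, here's that same 3-dimensional calculation in plain Python, using two made-up points:

import math

# two made-up points in 3 dimensions
a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 8.0]

# square the difference in each dimension, add them up, then take the square root
d = math.sqrt(sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b)))
print(d)                # 7.0710...

# Python 3.8+ also has this built in
print(math.dist(a, b))  # same result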

In general, a “Nearest Neighbors” model can use different types of distance metrics. It doesn’t always have to be Euclidean distance. Some other distance metrics that can be used are Manhattan Distance or Cosine Similarity.
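If you're using sklearn (as we will below), you don't have to implement these metrics yourself; KNeighborsClassifier accepts a metric argument. A minimal sketch:

from sklearn.neighbors import KNeighborsClassifier

# the default is Minkowski distance with p=2, which is just Euclidean distance
model_euclidean = KNeighborsClassifier(n_neighbors=5)

# swap in a different metric, e.g. Manhattan distance, by name
model_manhattan = KNeighborsClassifier(n_neighbors=5, metric="manhattan")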

Finding the Nearest Neighbors to a Point

So, now that we know how to find the distance between any two points, all we have to do to find the nearest neighbors is calculate the distance from our new point to each of the labeled points and keep track of the K points with the smallest distances. That part seems simple enough. But you might have already wondered… how do we know what the best value for K is? On each end of the spectrum: if K is too small, we're letting outliers have too much influence over which group our new point belongs to, i.e. if K is 1 and our new point is near an outlier, the outlier will be its only neighbor. If K is too large, all we're doing is assigning our new point to whatever the largest, or most common, class is. In this example, we'll just use the sklearn Python library to do KNN for us. Here's how that would look:

# import all the necessary functions from libraries
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# split the data into train and test sets so we can do validation on performance
X_train, X_test, y_train, y_test = train_test_split(scaled_features, df['Flower_Species'], test_size=0.3)

# create the model using KNN
model = KNeighborsClassifier(n_neighbors = 2)
model.fit(X_train, y_train)
prediction = model.predict(X_test)

print(confusion_matrix(y_test, prediction))
print(classification_report(y_test, prediction))
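If you're curious which training points the model actually treated as the neighbors for a given test point, KNeighborsClassifier exposes a kneighbors method. A quick sketch, reusing the fitted model and X_test from above:

# distances to, and positions (row indices into X_train) of, the nearest
# neighbors for the first three test points
distances, indices = model.kneighbors(X_test[:3])
print(distances)
print(indices)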

Determining the Best Value for K to Get Optimal Performance

One thing to note about KNN is that we don't really know up front how many neighbors to check to get the best performance. So to find that, we have to do a bit of trial and error. Luckily for us, we're working with code and not having to do it all by hand. What we can do is write a loop that tries different values of K across some range that seems reasonable. For example, we could start with K=1 and try every value up to K=19, then plot a graph of the error rate. When we look at this graph, we should be able to see some area of the plot where the error rate is consistently low.

import numpy as np
import matplotlib.pyplot as plt

error_rate = []

# try every K from 1 to 19 and record the error rate on the test set
for i in range(1, 20):
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(X_train, y_train)
    prediction_at_i = model.predict(X_test)
    error_rate.append(np.mean(prediction_at_i != y_test))

# plot the error rate for each value of K
plt.figure(figsize=(12, 6))
plt.plot(range(1, 20), error_rate, color="orange")
plt.title("Error Rate vs. K Value")
plt.xlabel("K value")
plt.ylabel("Error Rate")
plt.show()

Checking the graph, we can see the value of K where the error rate is generally lowest. We can take the value of K at that point as our optimal value and use it for all future predictions. Now we can refit the model with that specific value of K and use a confusion matrix to evaluate its performance on our test data.

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
prediction = model.predict(X_test)

print(confusion_matrix(y_test, prediction))
print(classification_report(y_test, prediction))
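Reading the plot by eye works fine, but if you'd rather have a starting point chosen programmatically, one simple (if slightly naive) option is to just take the K with the single lowest error from the loop above:

# range(1, 20) starts at K=1, so add 1 to the index of the smallest error
best_k = int(np.argmin(error_rate)) + 1
print(f"Lowest error rate was at K={best_k}")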

The More Dimensions, the Better… Right?

One of the shortcomings of KNN is that, although it can work on higher-dimensional data, its performance drops off as your dataset gains more dimensions. This is sometimes referred to as "The Curse of Dimensionality". The reason is that the higher the dimension of a space, the further apart points tend to be. Think about it like this: in a one-dimensional space, if you plot a bunch of random points, there won't be much empty space between them. If you increase the space to two dimensions and plot on an X-Y plane, you end up with a lot more empty space relative to the actual data points. Increase that to 3D, and you have even more empty space between each point. Even though it's difficult to imagine spaces with more than three dimensions, the same trend continues. Because of this, if you have data with lots of dimensions, it can be helpful to apply some type of dimensionality reduction before using it with KNN.
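If you'd rather see this effect than just imagine it, here's a small simulation sketch (using numpy and scipy on made-up random data): it scatters 1,000 random points in spaces of increasing dimension and prints the average distance to each point's nearest neighbor, which keeps growing as the dimension goes up.

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(42)

for dims in [1, 2, 3, 10, 100]:
    # 1,000 random points in a unit hypercube with `dims` dimensions
    points = rng.random((1000, dims))

    # pairwise Euclidean distances; ignore each point's distance to itself
    dists = cdist(points, points)
    np.fill_diagonal(dists, np.inf)

    avg_nn = dists.min(axis=1).mean()
    print(f"{dims} dimensions: average nearest-neighbor distance = {avg_nn:.3f}")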

Implementing It All From Scratch

Ok, so it's definitely nice to have sklearn do most of the heavy lifting for us. But what if we want to peek under the hood and implement KNN from scratch, just to get a better understanding of how it works? Let's put together a simple version of it.

import math
from typing import List, NamedTuple
from collections import Counter

# a "vector" here is just a list of feature values for one data point
Vector = List[float]

def distance(a: Vector, b: Vector) -> float:
    # Euclidean distance, exactly the formula from the section above
    return math.dist(a, b)

class LabeledPoint(NamedTuple):
    point: Vector
    label: str

def get_most_frequent_labels(labels: List[str]) -> str:
    # count how many times each label appears among the neighbors
    label_counts = Counter(labels)
    mode, mode_count = label_counts.most_common(1)[0]

    # how many labels are tied for that top count?
    num_modes = len([count
                     for count in label_counts.values()
                     if count == mode_count])
    if num_modes == 1:
        return mode  # a single clear winner
    else:
        # tie: drop the farthest neighbor (labels arrive sorted nearest-first) and retry
        return get_most_frequent_labels(labels[:-1])

def knn_classify(k: int, labeled_points: List[LabeledPoint], new_point: Vector) -> str:
    # sort every labeled point by its distance to the new point, nearest first
    by_distance = sorted(labeled_points, key=lambda lp: distance(lp.point, new_point))

    # take the labels of the k closest points
    k_nearest_labels = [lp.label for lp in by_distance[:k]]

    # whichever label is most common among those neighbors is our prediction
    return get_most_frequent_labels(k_nearest_labels)
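And here's a tiny usage example with some made-up 2-dimensional points, just to show the pieces fitting together:

# a few made-up labeled points in 2 dimensions
training_data = [
    LabeledPoint(point=[1.0, 1.0], label="red"),
    LabeledPoint(point=[1.5, 2.0], label="red"),
    LabeledPoint(point=[5.0, 5.0], label="blue"),
    LabeledPoint(point=[6.0, 5.5], label="blue"),
]

# the new point sits closest to the two "red" points, so that's the label we expect
print(knn_classify(3, training_data, [2.0, 1.5]))  # red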

Now you should have a good idea of what KNN is, how it works, and how to use it when you need to make predictions to classify data.
