What is KNN:

Supervised learning algorithm for classification.

Some assumptions:

  • Real world data will not be this nice
  • Assuming that the features that we extracted to represent data are good, we can assume that there will be some form of clustering for each class. More formally, Given and , we can distribute X such that: and
    Meaning: All points must be clustered and no point can be clustered to more than one class

Visualize the data

Task

Image you are given data D = ( , ) where and , Can you find the value of y for a new arbitrary test point (in black)

Training (sort of)

Intuitively:

  • At every new data point, we ask “Which are the closest k points to this”
  • Then, we predict the label by looking at those k things and seeing which y shows up the most
  • If everything is tied, just take the closest one

Given a new test point , we can predict the label ( y_t ) by:

  • Compute the Euclidean distance between ( x_t ) and each ( x_i ) in the D:
  • Identify the k nearest neighbors by selecting the k smallest distances: where correspond to the indices of the nearest neighbors
  • Predict the label by majority voting among the labels of the ( k ) nearest neighbors: where is the indicator function:

Note for ties: If there is a tie among labels, select the label with the closest distance to Formally, let be the set of labels tied for the majority vote. Then:

Example here: I created a meshgrid and predicted every point on it. The graph contours should what label would be predicted if the data point existed at (x, y)

I was playing around with different looking inputs and trying to see the boundaries. It was fun. I deployed it here if you want to do the same: https://knn-implementation.onrender.com/ Pls don’t abuse. This runs on a free tier, and is very slow. If it does not load, wait for a minute till it spins up again lol