K Nearest Neighbor Algorithm: Explained from Scratch.

Akshar Rastogi
4 min read · Jun 25, 2021

K Nearest Neighbor is a supervised machine learning algorithm, which means it assigns objects to predefined classes. (P.S. - The charts and formulas referenced here are from Springer's An Introduction to Statistical Learning.)

Why is K-Nearest Neighbor used?

K Nearest Neighbor can be seen as a practical advancement over the Bayes classifier approach, and it overcomes its main shortcoming: the Bayes classifier assumes we know the true Bayes decision boundary that separates the classes, which is rarely the case for real-life data, as it is distributed randomly. KNN overcomes this by estimating the class probabilities directly from the data around each point.

How does KNN work?

The k in K Nearest Neighbor signifies the number of closest points (by Euclidean distance) we want to consider, which are called the neighbors.

KNN classifies a target based upon a probabilistic calculation over its vicinity, i.e. it classifies based upon how frequently each type occurs among the points around the target. KNN assumes that objects of the same type occur together. (This also leads to a major drawback, which is discussed below.)

Given a positive integer K and a test observation x₀ (read it as x-nought), KNN first identifies the K points in the training data nearest to x₀, represented by N₀. It then estimates the conditional probability for class j as the fraction of points in N₀ whose response equals j. {Here N₀ represents the set of those K nearest points, and j ranges over the values of Y whose probability has to be calculated given that X = x₀ has already been observed.}
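Written out as a formula (this is the estimate described in An Introduction to Statistical Learning, using the notation of the paragraph above):

```latex
\Pr(Y = j \mid X = x_0) \;=\; \frac{1}{K} \sum_{i \in \mathcal{N}_0} I\big(y_i = j\big)
```

where I(yᵢ = j) is an indicator that equals 1 when the i-th neighbor belongs to class j and 0 otherwise.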

In simple words the whole process is :

  • KNN identifies the K number of neighbors.
  • It finds the nearest neighbors by calculating the distance between points. The distance can be any standard metric such as Euclidean, Minkowski, or Manhattan (putting q = 1 or q = 2 in the Minkowski distance gives the Manhattan and Euclidean distances respectively; see the short sketch after this list).
  • After collecting the K neighbors, it performs a probability test, i.e. it computes the fraction of each class among them.
  • In our case there are 2 blue and 1 orange: P(B) = 2/3 > P(O) = 1/3.
  • Hence the target is identified as Blue.
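As a quick illustration of the Minkowski relationship mentioned above (a minimal sketch; the function name minkowski_distance is only for illustration):

```python
def minkowski_distance(a, b, q):
    # Minkowski distance; q = 1 gives Manhattan, q = 2 gives Euclidean.
    return sum(abs(x - y) ** q for x, y in zip(a, b)) ** (1 / q)

p1, p2 = [1.0, 2.0], [4.0, 6.0]
print(minkowski_distance(p1, p2, 1))  # 7.0 -> Manhattan: |1-4| + |2-6|
print(minkowski_distance(p1, p2, 2))  # 5.0 -> Euclidean: sqrt(3**2 + 4**2)
```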

KNN Code Implementation

Here we write the code for finding the Euclidean distance. To confirm that our Euclidean formula is working, we check it for identical and for different points, which should give zero and a non-zero float respectively.
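Since the original snippet is not reproduced here, the sketch below shows one minimal way to implement it (the function name euclidean_distance, and the convention that the last value of each row is the class label and is skipped while measuring distance, are assumptions for illustration):

```python
from math import sqrt

def euclidean_distance(row1, row2):
    # Sum of squared differences over the feature columns;
    # the last column is assumed to hold the class label, so it is skipped.
    distance = 0.0
    for i in range(len(row1) - 1):
        distance += (row1[i] - row2[i]) ** 2
    return sqrt(distance)

# Identical points give 0.0, different points give a positive float.
print(euclidean_distance([2.7, 2.5, 0], [2.7, 2.5, 0]))  # 0.0
print(euclidean_distance([2.7, 2.5, 0], [1.4, 4.8, 1]))  # ~2.64
```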

Now let’s move to the second step and find our nearest neighbors, i.e. the nearest k points.
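The next step refers to this function as get_neighbors(); one possible (hedged) implementation on top of euclidean_distance() from the previous step is:

```python
def get_neighbors(train, test_row, num_neighbors):
    # Pair every training row with its distance to the test row,
    # sort by distance, and keep the closest num_neighbors rows.
    distances = [(train_row, euclidean_distance(test_row, train_row)) for train_row in train]
    distances.sort(key=lambda pair: pair[1])
    return [pair[0] for pair in distances[:num_neighbors]]
```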

The third step is the prediction from these neighbors obtained from get_neighbors() in the above code.
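A minimal sketch of that prediction step, assuming the get_neighbors() above and a majority vote over the neighbors' labels (the dataset here is a made-up toy example):

```python
def predict_classification(train, test_row, num_neighbors):
    # Majority vote over the class labels (last column) of the k nearest neighbors.
    neighbors = get_neighbors(train, test_row, num_neighbors)
    labels = [row[-1] for row in neighbors]
    return max(set(labels), key=labels.count)

# Toy dataset: [x, y, label]
dataset = [
    [2.78, 2.55, 0], [1.46, 2.36, 0], [3.39, 4.40, 0],
    [7.62, 2.75, 1], [5.33, 2.08, 1], [6.92, 1.77, 1],
]
# The trailing 0 in the test row is just a placeholder for the label column.
print(predict_classification(dataset, [3.0, 3.0, 0], num_neighbors=3))  # -> 0
```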

This is the whole process of building KNN from scratch.

The choice of K also determines our model accuracy, and the usual approach is to test several values of K and keep the one that gives the best result. There always exists a k sweet spot where the prediction is best and the bias-variance tradeoff is balanced. When the value of k is very small, high variance and low bias are observed, which indicates overfitting; when k is very large, the decision boundary becomes overly smooth and the model underfits.
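One simple way to hunt for that sweet spot, sketched here with a leave-one-out loop over the toy dataset and predict_classification() from the code above (purely illustrative, not the only approach), is to score a few candidate values of k and keep the best one:

```python
def loo_accuracy(dataset, k):
    # Leave-one-out: hold out each row in turn and predict it from the rest.
    correct = 0
    for i, row in enumerate(dataset):
        train = dataset[:i] + dataset[i + 1:]
        if predict_classification(train, row, k) == row[-1]:
            correct += 1
    return correct / len(dataset)

for k in (1, 3, 5):
    print(f"k={k}  accuracy={loo_accuracy(dataset, k):.2f}")
```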

Where and When Does KNN Lag?

  • KNN basically has no training process; it does its calculations directly at prediction time, finding the Euclidean distance from the test point to the training data to identify the nearest K points.
  • Because of the above point, KNN can’t be used comfortably on big datasets. If the data is very big, KNN will take a lot of time calculating the nearest neighbors of every point, which is not at all practical.
  • When the classes overlap, KNN collapses, because it decides purely from the local class frequencies around the target point; Naive Bayes suffers from a similar problem. SVM, which can separate classes by mapping them into a higher-dimensional space, overcomes this problem.
