Imagine we have a set of labeled and unlabeled data, and we want to build a classifier which takes the unlabeled data as input and outputs labels for that data. In this situation, we need to build a classification model that learns from the already-labeled data (training data). Later we’ll use that model to predict labels for our unlabeled data (test data).

This type of machine learning is called supervised learning: we feed labeled data into a machine learning algorithm, showing it which groups exist and which data points belong to which group.

There are many supervised learning models. Examples include Support Vector Machines (SVM), logistic regression, decision trees, factorization machines, random forests, and K-Nearest Neighbors (KNN) — which will be the focus of this article.

KNN is a non-parametric technique. To classify a data point, it consults that point’s *k* nearest neighbors and assigns the point to the group those neighbors belong to. It works by implementing the following steps.

First, it calculates the distance between the new point and every training point. Second, it finds the *k* training points that are closest based on those distances. Finally, the class is chosen by a majority vote among those surrounding points.
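To make these three steps concrete, here’s a minimal from-scratch sketch in Python. The function name and the toy arrays are hypothetical, purely for illustration:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 1: distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 2: indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among those k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0]])
y_train = np.array([0, 0, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=1))  # -> 0
```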

*K* is a positive integer, and its value varies. If k = 1, a new point is simply assigned to the class of its single nearest neighbor. The choice of *k* is very important in KNN because a larger *k* reduces noise. To choose an optimal *k*, you can use GridSearchCV, which performs an exhaustive search over specified parameter values.

In the above plot, black and red points represent two different classes of data, and we need to classify our blue point as either red or black. If k = 1, KNN assigns the blue point to the class of its single nearest neighbor. If k > 1, a majority vote among the k nearest neighbors decides the class.

We’re going to work through a practical example using Python’s scikit-learn. We’ll need pandas for working with dataframes, numpy for working with numpy arrays, and scikit-learn, the Python machine learning package that implements algorithms like KNN.
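If these aren’t installed yet, pip handles all three:

```
pip install pandas numpy scikit-learn
```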


When using scikit-learn’s KNN classifier, we’re provided with the `KNeighborsClassifier()` method, which takes several optional parameters. Let’s go through them one by one.

- **n_neighbors** — This is an integer parameter that gives our algorithm the number of neighbors *k* to use. By default k = 5, and in practice a better k usually lies between 3 and 10.
- **weights** — Since the prediction is made based on the votes of the nearest points, all the other points in the dataset are completely ignored. This results in a discontinuous function. The best way to smooth this out is by introducing weights. If we don’t define weights, `uniform` is used automatically, which weights all points in each neighborhood equally. `distance` is the other option, which follows the principle that closer neighbors have more influence than ones further away.
- **algorithm** — `auto` is the default algorithm used in this method, but there are other options: `kd_tree` and `ball_tree`. Both of these data structures enable fast nearest neighbor searches in KNN. The main difference between them is that `ball_tree` works with more distance metrics than `kd_tree`.
- Other method parameters include:

  a). `leaf_size` — (default = 30) passed to `kd_tree` and `ball_tree`; this affects the speed of construction and query.

  b). `p` — the power parameter for the Minkowski metric. If p = 2, it is equivalent to using Euclidean distance; if p = 1, it is equivalent to using Manhattan distance.

  c). `metric` — the distance metric for the tree.

  d). `metric_params` — additional keyword arguments for the metric function.

  e). `n_jobs` — the number of parallel jobs to run for the neighbors search.
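Putting these together, a typical instantiation might look like the following sketch. Every argument is optional; the values shown are scikit-learn’s defaults:

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(
    n_neighbors=5,       # k, the number of neighbors that vote
    weights="uniform",   # or "distance" to give closer neighbors more influence
    algorithm="auto",    # or "kd_tree" / "ball_tree"
    leaf_size=30,        # passed to kd_tree and ball_tree
    p=2,                 # 2 = Euclidean distance, 1 = Manhattan distance
    metric="minkowski",  # distance metric for the tree
    metric_params=None,  # extra keyword arguments for the metric function
    n_jobs=None,         # parallel jobs for the neighbors search
)
```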

In our case study, we’re going to use two datasets to show how KNN can be used to create a model and later make a prediction based on the k-nearest neighbors of the test dataset. The first dataset we’re going to use is the commonly-used Iris dataset. This dataset has 150 instances, and each instance has a class of either setosa, versicolor, or virginica (types of flowers). Every type of flower has 50 instances.

The second case study will involve trying to build a KNN model to predict whether a person will be a defaulter or not in a credit scoring system. We’ll use two predictor variables (age, loan amount) and one target variable (default).

Let’s start with our first case study using the Iris dataset. First we’ll import the necessary packages for this project: scikit-learn’s `datasets`, `model_selection`, and `neighbors` modules, plus numpy. The `datasets` module contains the Iris dataset. The `model_selection` module will help us prevent overfitting by partitioning the data into training and testing sets. The `neighbors` module gives us the ability to implement the KNN algorithm in Python.

Scikit-learn’s Iris dataset is already divided into `iris.data` and `iris.target`. `iris.data` has 4 columns, and these are our prediction variables: sepal length, sepal width, petal length, and petal width. `iris.target` holds the label for each row, which is either setosa, versicolor, or virginica. Below is a scatter plot visualizing the distribution of labeled points in 2D, using petal length and width.
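A sketch of loading the data and producing that plot, assuming matplotlib is also available:

```python
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()

# Petal length and petal width are columns 2 and 3 of iris.data.
plt.scatter(iris.data[:, 2], iris.data[:, 3], c=iris.target)
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.show()
```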

We can now split our data into training and test sets using scikit-learn’s `train_test_split` function. Since we don’t have a large dataset, we’ll use 75:25 as the ratio of training to testing data, which is scikit-learn’s default.
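Continuing with the `iris` object loaded above (`test_size` defaults to 0.25, which gives the 75:25 split; the `random_state` value is arbitrary):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42
)
```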

After splitting the data, we can now build our classifier using `sklearn.neighbors.KNeighborsClassifier()`. We will pass our *k* as `n_neighbors=13`, along with the weights and the type of algorithm to use.
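For example, with the weight and algorithm options left at reasonable choices:

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=13, weights="uniform", algorithm="auto")
```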

Finally, let’s fit our training data, then use that model to predict labels for our test set. Using `score()`, we’re able to measure the accuracy of our model in predicting the test data.
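Continuing with the classifier and split from above:

```python
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(knn.score(X_test, y_test))  # fraction of test points labeled correctly
```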

## Getting optimal parameters

In order to make more accurate predictions on your test data, you need optimal parameters. These can be found using GridSearchCV, available in scikit-learn’s `model_selection` module.

Let’s create a dictionary for our parameter values: `k_range` is the range of k values to try, in this case k from 1 to 30. The weights variable holds two weight options — `uniform` and `distance`.
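As a sketch:

```python
k_range = list(range(1, 31))  # k values 1 through 30
weight_options = ["uniform", "distance"]

# Keys must match KNeighborsClassifier's parameter names.
param_grid = {"n_neighbors": k_range, "weights": weight_options}
```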

Now let’s pass to GridSearchCV our classifier and `param_grid`; `cv`, an integer specifying the number of cross-validation folds; and `scoring`, the metric used to evaluate predictions.
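A sketch of the search; the 10-fold `cv` and accuracy `scoring` values here are assumptions, not taken from the original results:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

grid = GridSearchCV(
    KNeighborsClassifier(), param_grid, cv=10, scoring="accuracy"
)
grid.fit(iris.data, iris.target)
print(grid.best_score_)   # best mean cross-validated score
print(grid.best_params_)  # the winning n_neighbors and weights
```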

Our results show that the best score is 0.98, achieved with *k* set to 13 and weights set to `uniform`.

In this case study, we’re going to classify whether a person of age 43 who borrowed a loan of $60,000 is going to repay the loan or default. Our labels are 1 for default and 0 for repay. First we’re going to create a numpy array with training data, with age and amount borrowed as our prediction variables and default as the label.
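A sketch of that training array; the ages, loan amounts, and labels below are made-up illustrative values, not the article’s actual data:

```python
import numpy as np

# Columns: age, loan amount, default (values are illustrative only)
train_data = np.array([
    [25, 40000, 0],
    [35, 60000, 0],
    [45, 80000, 1],
    [20, 20000, 0],
    [35, 120000, 1],
    [52, 18000, 0],
])
```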

We can now convert our array to a pandas dataframe and separate our prediction variables from the labels using pandas’ `drop`:
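Continuing with the illustrative array above:

```python
import pandas as pd

df = pd.DataFrame(train_data, columns=["age", "loan", "default"])

X = df.drop("default", axis=1)  # prediction variables: age, loan
y = df["default"]               # labels: 1 = default, 0 = repay
```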

Finally, we will create a KNN classifier and use it to classify our test data:
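A sketch, with the choice of k arbitrary:

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)  # k chosen arbitrarily for the sketch
knn.fit(X, y)

# Classify a 43-year-old borrowing $60,000 (0 = repay, 1 = default)
print(knn.predict([[43, 60000]]))
```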

Our results using KNN predict the person as a non-defaulter. This is simply because our model found the person’s features (age and loan amount) to be more similar to those of people who repaid their loans than to those of people who defaulted.

KNN is an effective machine learning algorithm that can be used in credit scoring, prediction of cancer cells, image recognition, and many other applications. Its main advantages are that it’s easy to implement and works well with small datasets.

However, KNN also has disadvantages. In particular, it doesn’t scale well to large datasets: for every test point, the distance to every training point must be computed, which requires a great deal of memory and time.

