%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from Data import Data
from kNN import kNN
pd.options.display.precision = 4
The distance functions to be used in kNN.
For cosine similarity, the distance measure used is $1 - \text{cosine similarity}$.
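Written out for two feature vectors $p_1$ and $p_2$ (assumed non-zero, since their norms appear in a denominator), the two distances are:

$$d_{\text{euclidean}}(p_1, p_2) = \lVert p_1 - p_2 \rVert_2 = \sqrt{\sum_i (p_{1,i} - p_{2,i})^2}$$

$$d_{\text{cosine}}(p_1, p_2) = 1 - \frac{p_1 \cdot p_2}{\lVert p_1 \rVert_2 \, \lVert p_2 \rVert_2}$$

The cosine distance ranges from 0 (same direction) to 2 (opposite directions) and does not depend on the magnitude of either vector.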
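def l2norm(p):
    """Euclidean (L2) length of the vector p."""
    return np.sqrt(np.sum(p ** 2))

def euclidean_dist(p1, p2):
    """Straight-line distance between p1 and p2."""
    return l2norm(p1 - p2)

def cosine_dist(p1, p2):
    """1 - cosine similarity; undefined if either vector is all zeros."""
    sim = np.dot(p1, p2) / (l2norm(p1) * l2norm(p2))
    return 1 - sim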
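Read the iris dataset and create a $60/20/20$ train/dev/test split for each fold of a 5-fold cross-validation. With 5 folds, each fold holds out $20\%$ of the data for testing; splitting the remaining $80\%$ in a $75/25$ ratio gives $60\%$ train and $20\%$ dev.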
See Data.py
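The arithmetic behind the split can be sketched as follows. This is a minimal illustration, not the code in Data.py: the function `five_fold_splits` and its parameters are hypothetical names, and it works on row indices only, assuming the 150 samples of the iris dataset.

```python
import numpy as np

def five_fold_splits(n_samples, n_folds=5, train_fraction=0.75, seed=0):
    """Yield (train, dev, test) index arrays, one triple per fold.

    Hypothetical helper: the real Data.read may shuffle or split differently.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, n_folds)
    for i in range(n_folds):
        # Fold i is held out as test data (1/5 = 20% of the samples).
        test_idx = folds[i]
        # The other four folds (80%) are divided into train and dev.
        rest = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        n_train = int(round(train_fraction * len(rest)))
        yield rest[:n_train], rest[n_train:], test_idx

splits = list(five_fold_splits(150))
train_idx, dev_idx, test_idx = splits[0]
print(len(train_idx), len(dev_idx), len(test_idx))  # 90 30 30
```

With 150 samples this gives 90 train, 30 dev, and 30 test rows per fold, which is the $60/20/20$ ratio.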
data_sets = list(Data.read('iris.data', 5, 0.75))
print(f'Train Data = {data_sets[0].train.shape}')
print(f'Development Data = {data_sets[0].dev.shape}')
print(f'Test Data = {data_sets[0].test.shape}')
all_models = {
    'euclidean': [],
    'normalized euclidean': [],
    'cosine': [],
}
for data in data_sets:
    all_models['euclidean'].append(kNN(data, euclidean_dist))
    all_models['normalized euclidean'].append(kNN(data.normalized(), euclidean_dist))
    all_models['cosine'].append(kNN(data, cosine_dist))
hyper_params = [1, 3, 5, 7]
Calculate the average accuracy for each value of $k$ in each model by testing development data against train data.
See kNN.py for k_accuracy()
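For readers without kNN.py at hand, the idea of scoring one value of $k$ can be sketched as below. This is an assumption about the general technique (majority vote among the $k$ nearest training points), not the actual `k_accuracy()` implementation; `knn_predict`, `knn_accuracy`, and the toy data are hypothetical.

```python
from collections import Counter
import numpy as np

def knn_predict(train_X, train_y, point, k, dist):
    """Label `point` by majority vote among its k nearest training rows."""
    distances = [dist(x, point) for x in train_X]
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_y[i] for i in nearest)
    # On a tie, Counter keeps the label that was counted first.
    return votes.most_common(1)[0][0]

def knn_accuracy(train_X, train_y, eval_X, eval_y, k, dist):
    """Fraction of evaluation rows whose predicted label is correct."""
    hits = sum(
        knn_predict(train_X, train_y, x, k, dist) == y
        for x, y in zip(eval_X, eval_y)
    )
    return hits / len(eval_y)

# Toy data, made up purely to exercise the functions.
toy_train_X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
toy_train_y = np.array(['a', 'a', 'b', 'b'])
toy_dev_X = np.array([[0.0, 0.5], [5.0, 5.5]])
toy_dev_y = np.array(['a', 'b'])

def toy_euclidean(p1, p2):
    return np.sqrt(np.sum((p1 - p2) ** 2))

for k in (1, 3):
    print(k, knn_accuracy(toy_train_X, toy_train_y, toy_dev_X, toy_dev_y, k, toy_euclidean))
```

In the notebook itself, the evaluation rows would be the development data and the distance function one of the three configurations defined earlier.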
accuracy = pd.DataFrame(columns=all_models.keys(), index=hyper_params)
for mdl_name, models in all_models.items():
    for k in hyper_params:
        accuracies = [mdl.k_accuracy(k) for mdl in models]
        accuracy.at[k, mdl_name] = np.average(accuracies)
accuracy.plot.bar()
plt.legend(loc='lower right')
plt.show()
accuracy
Calculate the final accuracy against test data.
final_accuracy = pd.DataFrame(columns=all_models.keys(), index=hyper_params)
for mdl_name, models in all_models.items():
    for k in hyper_params:
        accuracies = [mdl.final_accuracy(k) for mdl in models]
        final_accuracy.at[k, mdl_name] = np.average(accuracies)
final_accuracy.plot.bar()
plt.legend(loc='lower right')
plt.show()
final_accuracy
The configuration with the best accuracy during hyper-parameter optimization (on development data) is not always the one with the best final accuracy (on test data).
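One way to make this comparison explicit is to pick the best $k$ per model from the development table and look up how that $k$ fares in the test table. The sketch below assumes two tables shaped like `accuracy` and `final_accuracy` (rows indexed by $k$, one column per model); `compare_selection` is a hypothetical helper, and the numbers in the example are made-up placeholders, not results from this notebook.

```python
import pandas as pd

def compare_selection(dev_acc, test_acc):
    """For each model, report the k chosen on dev data and its test accuracy."""
    dev_acc = dev_acc.astype(float)
    test_acc = test_acc.astype(float)
    best_k = dev_acc.idxmax()  # row label (k) of the column-wise maximum
    rows = {}
    for model, k in best_k.items():
        rows[model] = {
            'best k (dev)': k,
            'dev accuracy': dev_acc.at[k, model],
            'test accuracy at that k': test_acc.at[k, model],
            'best test accuracy': test_acc[model].max(),
        }
    return pd.DataFrame.from_dict(rows, orient='index')

# Placeholder values for illustration only.
toy_dev = pd.DataFrame({'model A': [0.90, 0.95], 'model B': [0.92, 0.91]}, index=[1, 3])
toy_test = pd.DataFrame({'model A': [0.96, 0.94], 'model B': [0.93, 0.90]}, index=[1, 3])
summary = compare_selection(toy_dev, toy_test)
print(summary)
```

In the notebook this could be called as `compare_selection(accuracy, final_accuracy)`; a gap between the last two columns marks a model whose dev-selected $k$ was not the best one on the test data.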