Running all cells in this notebook might take 2-3 minutes!
%matplotlib inline
from Data import Data
from NBC import Model, average_accuracy
from Vocabulary import Vocabulary
import pandas as pd
import matplotlib.pyplot as plt
Reading the vocabulary file included with the data folder.
See Vocabulary.py
vocab = Vocabulary(r'aclImdb\imdb.vocab')
print(f'Stopwords: {vocab.stopwords}')
Reading the train reviews file using 5-fold cross-validation.
See Data.py and Reviews.py
k = 5
data_sets = list(Data.read_train('aclImdb', k))
Creating an NBC model for each of the data sets produced by 5-fold cross-validation.
See NBC.py
models = [Model(x.train, vocab) for x in data_sets]
print(f'{len(models)} NBC models created from the data sets')
reviews = data_sets[0].all_train
index_the = vocab.get_index('the')
Calculating $P[\text{"the"}]$ = (number of documents containing "the") / (number of all documents)
print(f'P["the"] = {reviews.count(index_the) / len(reviews.all)}')
Calculating $P[\text{"the"} \mid \text{Positive}]$ = (number of positive documents containing "the") / (number of all positive review documents)
print(f'P["the" | Positive] = {reviews.count_positive(index_the) / len(reviews.positive)}')
print(f'P["the" | Negative] = {reviews.count_negative(index_the) / len(reviews.negative)}')
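The conditional probabilities above can be sketched as plain document-frequency ratios. This is an illustrative helper, not the `NBC.py` implementation; the counts are made up:

```python
# Illustrative sketch: estimating P[word | class] as the fraction of
# documents of that class that contain the word at least once.
def estimate_conditional(docs_with_word, class_doc_count):
    """Document-frequency estimate of P[word | class]."""
    return docs_with_word / class_doc_count

# e.g. if "the" appeared in 9,900 of 10,000 positive reviews:
p_the_given_pos = estimate_conditional(9900, 10000)
print(p_the_given_pos)  # 0.99
```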
Calculating the average accuracy of these models without any smoothing, ignoring only stop words.
`average_accuracy()` is defined in NBC.py
dev_data = [x.dev for x in data_sets]
accuracy = average_accuracy(models, dev_data, smoothen=0, min_occurrence=0)
print(f'Average accuracy = {accuracy:.4%}')
Calculating the average accuracy using smoothing hyperparameters in the range $[0, 1]$ with step size $0.1$
h_params = {}
for i in (x * 0.1 for x in range(0, 11)):
h_params[i] = average_accuracy(models, dev_data, smoothen=i, min_occurrence=0)
smoothing_accuracies = pd.DataFrame.from_dict(h_params, orient='index', columns=['Accuracy'])
smoothing_accuracies
smoothing_accuracies.plot()
It's worth noting that increasing the smoothing parameter even slightly above 0 improves accuracy considerably.
Without smoothing, any word unseen in a class forces that class's probability to 0,
rendering every other word in the same review useless.
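The zero-probability problem can be made concrete with a small sketch of add-$\alpha$ (Laplace) smoothing. The helper below is an assumption about how `smoothen` behaves, not the `NBC.py` code; the $+2\alpha$ in the denominator assumes a Bernoulli (present/absent) feature:

```python
# Sketch of add-alpha smoothing on document-frequency counts.
# With alpha = 0, a word unseen in a class zeroes out the whole product
# of per-word probabilities for that class.
def smoothed_prob(docs_with_word, class_doc_count, alpha):
    return (docs_with_word + alpha) / (class_doc_count + 2 * alpha)

print(smoothed_prob(0, 10000, 0))    # 0.0 -- kills the product
print(smoothed_prob(0, 10000, 0.1))  # small but nonzero
```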
The second hyperparameter to optimize is `min_occurrence`.
`min_occurrence` specifies the fraction of total reviews that a word must occur in for it to be considered.
`min_occurrence=0.00025` implies that a word must occur in at least $0.025\%$ of the reviews, i.e. $5/20000$.
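As a sketch of that threshold (a hypothetical helper, assuming the filter compares a word's document frequency against the fraction of all reviews):

```python
# Keep a word only if its document count reaches the given
# fraction of all reviews.
def passes_min_occurrence(doc_count, total_docs, min_occurrence):
    return doc_count >= min_occurrence * total_docs

print(passes_min_occurrence(5, 20000, 0.00025))  # True: 5/20000 == 0.025%
print(passes_min_occurrence(4, 20000, 0.00025))  # False
```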
h_params = {}
for i in (x * 0.00025 for x in range(0, 11)):
h_params[i] = average_accuracy(models, dev_data, smoothen=0, min_occurrence=i)
min_occurrence_accuracies = pd.DataFrame.from_dict(h_params, orient='index', columns=['Accuracy'])
min_occurrence_accuracies
min_occurrence_accuracies.plot()
The accuracy improves significantly if we ignore words that occur rarely,
especially those that occur $0$ times in either the Positive or the Negative class.
The above plot shows varying values of `min_occurrence` on the x-axis with `smoothen=0`.
Redrawing the same plot with `smoothen=1`.
h_params = {}
for i in (x * 0.00025 for x in range(0, 11)):
h_params[i] = average_accuracy(models, dev_data, smoothen=1, min_occurrence=i)
min_occurrence_accuracies = pd.DataFrame.from_dict(h_params, orient='index', columns=['Accuracy'])
min_occurrence_accuracies.plot()
The accuracy now decreases as `min_occurrence` grows, but the change is small compared to before.
Simply maximizing both hyperparameters does not yield better results.
The ideal model balances the two hyperparameters, which is rather expensive to compute in this example.
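The balance could be found with a joint grid search over both hyperparameters; the expense comes from running k-fold evaluation for every cell of the grid. This is a minimal sketch with a toy scorer standing in for `average_accuracy(models, dev_data, ...)`:

```python
# Sketch of a joint grid search: evaluate every (smoothen, min_occurrence)
# pair and keep the best-scoring one.
def grid_search(score, smoothen_values, min_occurrence_values):
    return max(
        ((s, m) for s in smoothen_values for m in min_occurrence_values),
        key=lambda sm: score(*sm),
    )

# Toy scorer (peaks at smoothen=1, min_occurrence=0.00025), purely illustrative:
toy = lambda s, m: -(s - 1) ** 2 - (m - 0.00025) ** 2
print(grid_search(toy, [0, 0.5, 1], [0, 0.00025, 0.0005]))  # (1, 0.00025)
```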
data_sets = list(Data.read_all('aclImdb', k))
models = [Model(x.train, vocab) for x in data_sets]
test_data = [x.test for x in data_sets]
accuracy = average_accuracy(models, test_data, smoothen=1, min_occurrence=0.00025)
print(f'Average accuracy = {accuracy:.4%}')
pos_words, neg_words = models[0].top_words(top_count=10, min_occurrence=0)
print(f'Top 10 positive predicting words:\n{pos_words}')
print()
print(f'Top 10 negative predicting words:\n{neg_words}')
The top words seem like typos or otherwise meaningless because
these words occurred just once in their predicted class and never occurred again.
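Why rare words dominate can be sketched by ranking words on the ratio $P[\text{word} \mid \text{Positive}] / P[\text{word} \mid \text{Negative}]$. This is an assumption about `top_words`' criterion (the actual logic lives in NBC.py), shown here with add-one smoothing so the ratio stays finite:

```python
# Sketch: a word's "predictiveness" as the smoothed ratio of its
# per-class document frequencies.
def predictiveness(pos_df, neg_df, pos_total, neg_total, alpha=1):
    p_pos = (pos_df + alpha) / (pos_total + 2 * alpha)
    p_neg = (neg_df + alpha) / (neg_total + 2 * alpha)
    return p_pos / p_neg

# A word seen once, only in positive reviews, outranks a common,
# roughly balanced word:
print(predictiveness(1, 0, 10000, 10000))        # 2.0
print(predictiveness(5000, 4000, 10000, 10000))  # ~1.25
```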
pos_words, neg_words = models[0].top_words(top_count=10, min_occurrence=0.01)
print(f'Top 10 positive predicting words:\n{pos_words}')
print()
print(f'Top 10 negative predicting words:\n{neg_words}')
The top words now seem meaningful after filtering out words that occur too rarely (in less than $1\%$ of the reviews).