A Probabilistic Approach to Pattern Recognition and Bayes' Theorem
In supervised learning, we are provided with data that can be considered as
evidence. For example, in a text classification system, we may have a collection
of texts (a corpus) that can be perceived as evidence of how language is used
in the real world, giving us insight into the text genre, author gender, text
sentiment, etc. Based on this evidence, we can try to form a better-informed
opinion when classifying a new text.
Imagine a website where users can post articles and upload comments, and we are
tasked with identifying language abuse in order to foster a friendly online
environment. We have been given 10,000 labeled comments to help develop a model.
Suppose we already know that 1 out of every 5 comments contains language abuse.
Without knowing anything else at this point, a model that minimizes the
classification error would classify all comments as non-abusive. But since we
now have 10,000 comments to look at as evidence, we should be able to do better
than this pure guesswork. So we seek to answer the question: given the evidence,
what is our renewed belief as to whether a text is abusive?
Bayes’ Theorem - a way to quantify our updated confidence in an event based on new evidence
The definition of conditional probability is this:

P(A|B) = P(A∩B) / P(B)

By the same definition, we also have this:

P(B|A) = P(A∩B) / P(A)

Using algebra, substituting P(A∩B) in the first definition using the second
definition, we have:

P(A|B) = P(B|A) · P(A) / P(B)
The above is referred to as Bayes’ Theorem, named after the English
statistician Thomas Bayes. The critical role this theorem plays in the task of
estimating a conditional probability is that it re-expresses the conditional
probability in terms of another conditional probability that is often easier
to calculate.
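To make the update concrete, here is a small worked example in Python. The 1-in-5 prior comes from the scenario above; the per-class frequencies of the hypothetical indicator word are made-up numbers for illustration:

```python
# Worked Bayes' Theorem example. The 0.2 prior comes from the scenario above;
# the per-class word frequencies are hypothetical numbers for illustration.
p_abusive = 0.2                  # P(A): prior probability a comment is abusive
p_word_given_abusive = 0.5       # P(W|A): word rate in abusive comments (assumed)
p_word_given_clean = 0.05        # P(W|not A): word rate in clean comments (assumed)

# Total probability of seeing the word: P(W) = P(W|A)P(A) + P(W|not A)P(not A)
p_word = p_word_given_abusive * p_abusive + p_word_given_clean * (1 - p_abusive)

# Bayes' Theorem: P(A|W) = P(W|A) * P(A) / P(W)
p_abusive_given_word = p_word_given_abusive * p_abusive / p_word
print(round(p_abusive_given_word, 3))  # the updated belief, up from the 0.2 prior
```

Seeing the word raises our belief that the comment is abusive from 0.2 to roughly 0.71, which is exactly the "updated confidence" the theorem quantifies.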
Maximum likelihood - after seeing the evidence, which event is the most likely
Continuing our imaginary task of identifying language abuse, suppose we define
C1 as the event that the text being classified is abusive, and C2 as the event
that the text being classified is non-abusive. From the 1-in-5 statistic stated
earlier, we have:

P(C1) = 0.2, P(C2) = 0.8
Using Bayes’ Theorem, we can calculate two conditional probabilities:

P(C1|t) = P(t|C1) · P(C1) / P(t)

and

P(C2|t) = P(t|C2) · P(C2) / P(t)
where t is the text being classified. The calculations above can be
interpreted as saying: given evidence t, what is our updated belief that
t is abusive or non-abusive, depending on which calculation we are
performing. We then compare the results of the two calculations (in a case with
more than 2 classes, we perform one calculation per class), and classify
the text into the class for which we have the largest value of the
class-conditional probability.
Since P(t) is class-independent in the estimations for all classes, and it is
the comparative differences between the estimations that determine the result of
classification, we can drop the P(t) in the calculation for each class, and
we have a discriminant function:

g_i(t) = P(t|C_i) · P(C_i)
This same observation also allows us to further transform our discriminant
function using any monotonic function, because the ordering of the values
will be preserved. A logarithm is an obvious one to try first because, in
addition to being monotonic, it transforms multiplication into addition:

g_i(t) = log P(t|C_i) + log P(C_i)
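To see that the log transform leaves the classification unchanged, here is a small sketch with made-up priors and likelihoods; both the product form and the log-sum form pick the same class:

```python
import math

# Hypothetical priors and likelihoods P(t|C_i) for two classes (illustrative numbers).
priors = {"abusive": 0.2, "non-abusive": 0.8}
likelihoods = {"abusive": 1e-12, "non-abusive": 4e-13}  # tiny, as word-product likelihoods tend to be

# Product form of the discriminant: g_i(t) = P(t|C_i) * P(C_i)
g_product = {c: likelihoods[c] * priors[c] for c in priors}

# Log form: g_i(t) = log P(t|C_i) + log P(C_i) -- same argmax, numerically safer
g_log = {c: math.log(likelihoods[c]) + math.log(priors[c]) for c in priors}

# The winning class is identical either way, because log is monotonic.
assert max(g_product, key=g_product.get) == max(g_log, key=g_log.get)
print(max(g_log, key=g_log.get))
```

With likelihoods this small, products of many word probabilities quickly underflow floating point, which is the practical reason the log form is preferred.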
Spam classifier - let’s put all of this into code
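The setup cells are missing from this export; presumably they imported the libraries used below and printed the version strings shown next. A minimal sketch of such a cell:

```python
# Report library versions (the original notebook printed numpy and nltk here).
import numpy
print("numpy:", numpy.__version__)

try:
    import nltk
    print("nltk:", nltk.__version__)
except ImportError:
    print("nltk: not installed")
```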
numpy: 1.11.0
nltk: 3.2.2
The following is the preprocessing that converts the .eml files to plain text
files. Note that the conversion ignores all properties of an email file and only
uses the subject and the body. (There are very good reasons to make use of
properties of an email file, such as the sender IP address. It shouldn’t be
surprising that an industrial spam classifier, such as that of Gmail, makes use
of many properties of an email in feature engineering, and very likely even uses
an ensemble of models.)
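The preprocessing cell itself is missing from this export. A minimal sketch of the subject-plus-body extraction it describes, using Python's standard email package (the function name and file handling are my assumptions):

```python
import email
from email import policy

def eml_to_text(path):
    """Extract only the subject and plain-text body of an .eml file,
    ignoring all other properties (headers, attachments, HTML parts)."""
    with open(path, "rb") as f:
        msg = email.message_from_binary_file(f, policy=policy.default)
    subject = msg["subject"] or ""
    body_parts = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            body_parts.append(part.get_content())
    return subject + "\n" + "\n".join(body_parts)
```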
Naive Bayes - a word on how to encode a text into our discriminant function
Recall that our discriminant function looks like this:

g_i(t) = log P(t|C_i) + log P(C_i)
The P(C) can be derived from the data by simply counting the members of
each class/group and normalizing by the total number of samples. This is
sometimes called the prior (actually, determining the prior is of greater
importance and requires further reasoning; I am leaving this out of the scope of
this article, only pointing out that we are drawing our prior from our
training data set). The P(t|C) is what we are examining right now. I feel
inclined to direct our discussion to a more intuitive level at this point: in
our text classification task, using the discriminant function we have devised
from Bayes’ Theorem, we seek to update our prior P(C) with the evidence
t, in the hope that the evidence improves on the guesswork we would otherwise
have to base solely on the prior. So now we ask: how do we encode our evidence?
Intuitively, a spam email is usually observed to do several things:
such as encouraging the reader to click a link,
or faking a story in order to trick the reader into replying with an email that
contains important information,
or just being a marketing email that bombards the reader with product information,
etc.
Authoring emails like this usually results in a pattern of word choice
(such as “please click LINK”, or “the service is free”). The encoding scheme
we seek for our evidence should capture this intuition. To begin with, we
can think of word frequency. For example, suppose in the 1,000 emails (500 spam
and 500 non-spam) we collected and labeled, we calculate the word
frequency like this:
the vocabulary V is defined by taking the union of all the words in all emails.
the frequency of a word is defined as the total number of occurrences of it in
all emails (within a given group)
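A sketch of that counting step, assuming a naive whitespace tokenizer and in-memory lists of texts (the original notebook likely tokenized with nltk instead):

```python
from collections import Counter

def word_frequencies(texts):
    """Count total occurrences of each word across a list of texts
    (one Counter per class/group)."""
    freq = Counter()
    for text in texts:
        freq.update(text.lower().split())  # naive whitespace tokenizer
    return freq

# Toy two-email "corpora" per group, just to show the shape of the data.
spam_freq = word_frequencies(["please click LINK", "the service is free"])
ham_freq = word_frequencies(["meeting at noon", "the report is attached"])

# The vocabulary V is the union of all words seen in all emails.
vocabulary = set(spam_freq) | set(ham_freq)
print(spam_freq["the"], len(vocabulary))
```

This single pass over the corpus is what gives the process its linear complexity, as noted next.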
The computational complexity of the above process is linear in the amount of
training text and its average length. After this process, we then come
to the next question in line: how do we encode the text t that is
being classified and is our evidence? In principle, if we want to encode a text
as a distribution of words, there are certain things we need to take into
account. The sentence “How are you” is a deterministic combination of words
because of English grammar, so the probabilistic encoding should be:

P(“How are you”) = P(“How”) · P(“are” | “How”) · P(“you” | “How”, “are”)

So, in order for our encoding of the text to qualify as a probabilistic
distribution of words, our encoding is very difficult to calculate, because of
the dependency of one word on another or on others.
An assumption was made by the people who first tried to apply Bayes’ Theorem to
text classification: that one word does not depend on another in its appearance
in texts. By the definition of conditional probability:

P(A∩B) = P(B|A) · P(A)

If we apply that assumption to the above definition, we have:

P(A∩B) = P(B) · P(A)

Note how the dependence on A is assumed away above: P(B|A) becomes simply P(B).
Back to our “How are you” example, we then have:

P(“How are you”) = P(“How”) · P(“are”) · P(“you”)
This assumption is referred to as Naive Bayes, where the word “naive” refers to
the simplification and the word “Bayes” refers to Bayes’ Theorem.
Now we can finally revisit our initial question as to how to encode our evidence
t. With the calculated word frequencies, t is encoded as a frequency
distribution of words (words that are not in the vocabulary V are not
accounted for). Intuitively, a potential spam email likely contains words that
show up a lot in the spam emails, and does so proportionally more than in the
non-spam emails.
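Putting the pieces together, the whole scoring step can be sketched as follows. The word counts and priors are toy numbers, and the add-one smoothing is my assumption to keep unseen-in-class words from zeroing out a product; the original code presumably handles this in its own way:

```python
import math

# Toy per-class word counts and priors (illustrative numbers only).
counts = {
    "spam": {"click": 40, "free": 30, "meeting": 2},
    "ham":  {"click": 3,  "free": 5,  "meeting": 50},
}
priors = {"spam": 0.5, "ham": 0.5}

def classify(text):
    """Pick the class with the largest log-space discriminant:
    g_i(t) = log P(C_i) + sum over words of log P(word|C_i)."""
    vocabulary = set().union(*counts.values())
    scores = {}
    for c in counts:
        total = sum(counts[c].values())
        score = math.log(priors[c])
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words outside V are not accounted for
            # Add-one (Laplace) smoothing, assumed here for illustration
            p = (counts[c].get(word, 0) + 1) / (total + len(vocabulary))
            score += math.log(p)
        scores[c] = score
    return max(scores, key=scores.get)

print(classify("click here for free stuff"))
```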
We are now ready to code.
We define a testing utility that performs cross-validation on the split data
set, which itself has been preprocessed.
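The utility's code is not present in this export; a generic sketch of k-fold cross-validation, with hypothetical train and score callables standing in for the missing model interface:

```python
def cross_validate(samples, labels, k, train, score):
    """Simple k-fold cross-validation: hold out each fold in turn,
    train on the rest, and average the scores. `train` and `score`
    are hypothetical callables standing in for the missing model code."""
    fold_size = len(samples) // k
    scores = []
    for i in range(k):
        lo, hi = i * fold_size, (i + 1) * fold_size
        test_x, test_y = samples[lo:hi], labels[lo:hi]
        train_x = samples[:lo] + samples[hi:]
        train_y = labels[:lo] + labels[hi:]
        model = train(train_x, train_y)
        scores.append(score(model, test_x, test_y))
    return sum(scores) / k
```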
4327 emails parsed
Splitting the data set into a training data set and a testing data set.
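The splitting cell is missing from this export; a sketch of such a split, where the 80/20 ratio and the seeded shuffle are my assumptions:

```python
import random

def split_dataset(samples, labels, train_fraction=0.8, seed=0):
    """Shuffle and split the data into training and testing sets.
    The 80/20 ratio is an assumed default, not from the original code."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * train_fraction)
    train_idx, test_idx = indices[:cut], indices[cut:]
    train = ([samples[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([samples[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test
```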
The update_vocabulary is an abstraction that allows us to fine-tune the model
using observations about it. Basically, it offers the caller the flexibility to
do one thing: remove a subset of the vocabulary. The subset can be specified as:
a regular expression pattern
a set
a frequency threshold: words that have a frequency lower than the lower
bound or higher than the upper bound will be discarded.
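The implementation is not included in this export; a sketch of what an update_vocabulary helper with that behavior might look like (the signature and parameter names are guesses):

```python
import re

def update_vocabulary(freq, pattern=None, stopwords=None, lower=None, upper=None):
    """Remove a subset of the vocabulary from a word-frequency dict:
    words matching a regex pattern, words in an explicit set, or words
    whose frequency falls outside [lower, upper]. Returns a new dict."""
    kept = {}
    purged = 0
    for word, count in freq.items():
        if pattern is not None and re.search(pattern, word):
            purged += 1
        elif stopwords is not None and word in stopwords:
            purged += 1
        elif (lower is not None and count < lower) or (upper is not None and count > upper):
            purged += 1
        else:
            kept[word] = count
    print("purged %d words" % purged)
    return kept
```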
purged 39376 word
purged 41881 word
purged 70627 word
updated vocabulary size: 7230
0.8660508083140878
Our model reports a near 87% success rate, which is not too bad. Now we can
take a look at the model and see the usage of words in the different categories: