Previously, the linear regression model was defined as the linear combination of the parameters and the input:

$$y(\mathbf{x}) = \boldsymbol{\theta}^T \mathbf{x} = \theta_0 + \theta_1 x_1 + \dots + \theta_d x_d$$

Using the Mean Squared Error as the cost, the model is fitted to the training data so that its predictions differ from the data as little as possible.
In classification problems, the goal is to assign a data point x to a class C. Given two classes and the same linear model above, y(x) = 0 can be interpreted as the decision boundary between the two classes. Geometrically, the decision boundary is a hyperplane in the d-dimensional input space, where d is the dimension of x. In the simplest case, d = 2, the decision boundary is a straight line that separates the input space into two regions: x belongs to C1 when y(x) > 0, and to C2 when y(x) < 0. The SVM is closely related to this choice of decision boundary, and a different choice of cost function relates it closely to the other discriminant functions discussed here.
The generalized discriminant function takes the form:

$$y(\mathbf{x}) = f(\boldsymbol{\theta}^T \mathbf{x})$$

where f is a fixed nonlinear activation function.
Sigmoid function
The word sigmoid means S-shaped. The sigmoid function is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
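A minimal sketch that defines the sigmoid and plots its S-shape (the helper name sigmoid is an assumption, reused by the later sketches in this post):

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """Map any real value into the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

# Plot the characteristic S-shape over a symmetric range around 0.
z = np.linspace(-10, 10, 200)
plt.plot(z, sigmoid(z))
plt.axhline(0.5, color='gray', linestyle='--', linewidth=0.5)
plt.xlabel('z')
plt.ylabel('sigmoid(z)')
plt.title('Sigmoid function')
plt.show()
```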
Model representation and interpretation
Using the sigmoid function as the discriminant function, we can define the model as:

$$h_\theta(\mathbf{x}) = \sigma(\boldsymbol{\theta}^T \mathbf{x}) = \frac{1}{1 + e^{-\boldsymbol{\theta}^T \mathbf{x}}}$$
The sigmoid function allows the output of the model to be interpreted as the posterior probability p(C|x). When we have two classes, labelled y = 1 and y = 0, we can write:

$$P(y = 1 \mid \mathbf{x}; \theta) = h_\theta(\mathbf{x})$$

Thus

$$P(y = 0 \mid \mathbf{x}; \theta) = 1 - h_\theta(\mathbf{x})$$

Thus

$$P(y = 1 \mid \mathbf{x}; \theta) + P(y = 0 \mid \mathbf{x}; \theta) = 1$$

That means that if our hypothesis outputs 0.4, the interpretation is that there is a 40% chance that y = 1, or a (1 - 40%) = 60% chance that y = 0.
Cost - Given the labelled data, what are the most likely parameters?
Having defined the discriminant function that outputs the probability P(C|x), we are now motivated to find a cost function that can be used to quantify the quality of the prediction. In the linear regression model, we used the MSE for that. Bear in mind that in a classification problem the data is labelled: for a two-class problem, each x in the training set is labelled either 1, denoting one class, or 0, denoting the other. Since the model outputs a probability P(C|x), it is natural to think of a likelihood function:

$$\mathcal{L}(\theta \mid x) = P(x \mid \theta)$$
which says that the likelihood of a set of parameter values, θ, given outcomes x, is equal to the probability of those observed outcomes given those parameter values. To use the training data X = {x_0, …, x_n} to help determine the parameters, the likelihood function is therefore defined as:

$$\mathcal{L}(\theta \mid X) = P(X \mid \theta) = \prod_{i=0}^{n} P(x_i \mid \theta)$$

In other words, the likelihood of the parameters is the joint probability of the observed x values given the parameters.
The product term in the likelihood function presents computational problems, such as underflow, and sums are generally much more convenient to work with than products. A standard mathematical trick is therefore applied to the likelihood function: taking its natural log. It becomes:

$$\log \mathcal{L}(\theta \mid X) = \log \prod_{i=0}^{n} P(x_i \mid \theta)$$

Because the log maps multiplication into addition,

$$\log(ab) = \log(a) + \log(b),$$

the log-likelihood function becomes:

$$\ell(\theta \mid X) = \sum_{i=0}^{n} \log P(x_i \mid \theta)$$
which is computationally a lot easier to work with (for one thing, it avoids the underflow problem). For our purpose we also take the negation of the log-likelihood, because maximizing the likelihood of the parameters is equivalent to minimizing its negation, and we use it as the cost function of our discriminant function:

$$\mathrm{cost}(h_\theta(\mathbf{x}), y) = \begin{cases} -\log(h_\theta(\mathbf{x})) & \text{if } y = 1 \\ -\log(1 - h_\theta(\mathbf{x})) & \text{if } y = 0 \end{cases}$$

Combining the two cases we get:

$$\mathrm{cost}(h_\theta(\mathbf{x}), y) = -y \log(h_\theta(\mathbf{x})) - (1 - y) \log(1 - h_\theta(\mathbf{x}))$$

If the cost is then defined over a training data set of m examples, we take the mean of the summed error:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log(h_\theta(\mathbf{x}^{(i)})) - (1 - y^{(i)}) \log(1 - h_\theta(\mathbf{x}^{(i)})) \right]$$
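A minimal NumPy sketch of this cost function (the names sigmoid and logistic_cost are placeholders, not taken from the original code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    """Mean cross-entropy cost J(theta) over a training set.

    X is the (m, d) design matrix, y the (m,) vector of 0/1 labels.
    """
    h = sigmoid(X @ theta)
    eps = 1e-12  # guards against log(0) when h saturates at 0 or 1
    return np.mean(-y * np.log(h + eps) - (1 - y) * np.log(1 - h + eps))
```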
The following code plots the sigmoid function side by side with the error
measure:
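A minimal sketch of such a side-by-side plot (the exact styling is an assumption):

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left: the sigmoid itself.
z = np.linspace(-10, 10, 200)
ax1.plot(z, sigmoid(z))
ax1.set_title('sigmoid(z)')
ax1.set_xlabel('z')

# Right: the per-example error as a function of h = h_theta(x).
h = np.linspace(0.001, 0.999, 200)
ax2.plot(h, -np.log(h), label='-log(h), y = 1')
ax2.plot(h, -np.log(1 - h), label='-log(1 - h), y = 0')
ax2.set_title('error measure')
ax2.set_xlabel('h')
ax2.legend()
plt.show()
```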
Intuitively, as the plot above shows, the error measure decreases as h_θ(x) approaches the value indicating the correct output. This choice of error measure also has desirable mathematical properties: -log() is continuous, differentiable, and convex, so the cost has a global minimum.
Income Prediction
An archive of an income survey is used here. In this post, a logistic regression model is built using a subset of features selected by visual inspection. The model was trained on the entire training set, which contains 32561 entries, and reports an accuracy of 16273/16281 on the test set, which contains 16281 entries.
Listing the data files gives:

    manual 5229
    sample 3974305
    test 2003153

After loading, the test data has shape (16281, 15) and the training data has shape (32561, 15).
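A sketch of loading the data with pandas; the file names and column names here are assumptions (the column names are chosen to match the predictor names used later in this post, e.g. eductnum and hrperw):

```python
import pandas as pd

# 14 attributes plus the income label give the 15 columns seen above.
columns = ['age', 'workclass', 'fnlwgt', 'education', 'eductnum',
           'marital', 'occupation', 'relationship', 'race', 'sex',
           'capital_gain', 'capital_loss', 'hrperw', 'country', 'income']

train = pd.read_csv('adult.data', names=columns, skipinitialspace=True)
# The test file starts with a non-data header line, hence skiprows=1.
test = pd.read_csv('adult.test', names=columns, skipinitialspace=True, skiprows=1)

print(test.shape)   # expected: (16281, 15)
print(train.shape)  # expected: (32561, 15)
```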
Categorical Value
Categorical values, or discrete independent predictors in other contexts, are values that can only take on one of a limited set of discrete values, while continuous values can take any value. In the data set we have here, columns such as sex, occupation, relationship, and race are all categorical. In logistic regression, when scaling the features, we expect to normalize them into a range such as [-1, 1] or [0, 1]; with categorical values this does not make sense, so the preprocessing needs to handle them separately, for example by expanding them into indicator (dummy) variables. (There exist other algorithms that can handle categorical values natively, such as Random Forests.)
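A sketch of that preprocessing with pandas, assuming the train DataFrame loaded above (which columns to treat as categorical is an assumption):

```python
import pandas as pd

# Expand each categorical column into 0/1 indicator columns, e.g. the
# 'marital' column becomes marital_Never-married, marital_Divorced, ...
categorical = ['sex', 'occupation', 'relationship', 'race', 'marital', 'education']
encoded = pd.get_dummies(train, columns=categorical)

# Continuous columns can then be scaled into [0, 1] independently.
for col in ['age', 'hrperw', 'eductnum']:
    encoded[col] = (encoded[col] - encoded[col].min()) / \
                   (encoded[col].max() - encoded[col].min())
```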
Let's start by observing the data; this is where we can apply some common sense, I think.
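A sketch of the kind of exploratory plot behind the observations below, grouping the income label by a categorical column such as education (column names as assumed above):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Count how many people at each education level fall into each income class.
counts = pd.crosstab(train['education'], train['income'])
counts.plot(kind='bar', figsize=(10, 4))
plt.ylabel('number of people')
plt.title('Income class by education level')
plt.tight_layout()
plt.show()
```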
It looks like people with less education (from no bachelor degree on down, e.g., hs_grad) fall into the >50K group much less often. It is also obvious that most professors and PhDs make more than 50K (though higher education does not correspond to >50K pay all that strictly; I guess this might be because some of them are students who are not yet working).
It looks like country does not serve as an informative predictor, because a single value accounts for the overwhelming majority of the samples.
Somehow, people in the never-married group do not fall into the >50K group very often.
Summing up all the observations, I am going to experiment with an LR model with these predictors (a sketch of putting them together follows the list):

- age, age
- hours of work per week, hrperw
- years of education, eductnum
- whether never married, marital_never_married
- whether only made it to high school, educt_hs_grad
- whether only made it to some college, educt_some_college
- whether works in a service industry, occup_other_service
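A minimal sketch of assembling these predictors and fitting a model; the original post trains its own logistic regression over the full training data, so scikit-learn's LogisticRegression stands in here for brevity, and the file names and category strings are assumptions based on the standard UCI adult data:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

columns = ['age', 'workclass', 'fnlwgt', 'education', 'eductnum',
           'marital', 'occupation', 'relationship', 'race', 'sex',
           'capital_gain', 'capital_loss', 'hrperw', 'country', 'income']

train = pd.read_csv('adult.data', names=columns, skipinitialspace=True)
test = pd.read_csv('adult.test', names=columns, skipinitialspace=True, skiprows=1)

def build_features(df):
    """Assemble the seven predictors listed above as scaled or 0/1 columns."""
    X = pd.DataFrame({
        'age': df['age'] / df['age'].max(),          # rough [0, 1] scaling
        'hrperw': df['hrperw'] / df['hrperw'].max(),
        'eductnum': df['eductnum'] / df['eductnum'].max(),
        'marital_never_married': (df['marital'] == 'Never-married').astype(int),
        'educt_hs_grad': (df['education'] == 'HS-grad').astype(int),
        'educt_some_college': (df['education'] == 'Some-college').astype(int),
        'occup_other_service': (df['occupation'] == 'Other-service').astype(int),
    })
    # Labels are '>50K' in the training file and '>50K.' in the test file.
    y = df['income'].str.contains('>50K').astype(int)
    return X, y

X_train, y_train = build_features(train)
X_test, y_test = build_features(test)

model = LogisticRegression()
model.fit(X_train, y_train)
print('test accuracy: %.4f' % model.score(X_test, y_test))
```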