Gradient Descent is one of many widely used optimization algorithms. It’s built on measuring the change of a function with respect to its parameters. There are other variants that extend the vanilla version of Gradient Descent and perform better than it, but a good understanding of the vanilla version is the place to begin.
Measure The Rate of Change
The learning process is, in other words, an optimization process. To understand what
gradient descent is and how it works, it is extremely useful to hold onto this
piece of math started by Isaac Newton:
Suppose, with a simple quadratic standing in as an example, this is a function that represents our problem:
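$$f(x) = x^2$$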
This is the derivative of that function:
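$$f'(x) = 2x$$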
It makes things much easier to realize that the derivative is a way of measuring
the rate of change of the function (with respect to its variable). This
realization simplifies the symbolic form of this math into an understanding
that will pave your way toward grasping several complicated algorithms in the
future (e.g., backward propagation).
The function we had above has only one variable. In the problems that many machine
learning algorithms are solving, one can have anywhere from a few to millions of variables:
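$$f(x_1, x_2, \dots, x_n)$$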
There will come a time when you need to measure the rate of change of the
function with respect to one single variable; this measure is called the
partial derivative. It is just the normal derivative taken with respect to
one variable while holding all the rest of the variables constant.
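For example, with a two-variable function (chosen here just for illustration):

$$f(x, y) = x^2 + 3xy, \qquad \frac{\partial f}{\partial x} = 2x + 3y, \qquad \frac{\partial f}{\partial y} = 3x$$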
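First, a setup cell pulls numpy and matplotlib into the interactive namespace (a minimal reconstruction, assuming the %pylab magic that produces the message shown):

```python
# Imports numpy and matplotlib into the interactive namespace and prints:
# "Populating the interactive namespace from numpy and matplotlib"
%pylab inline
```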
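A sketch of the plotting cell, assuming the quadratic example above and four arbitrary tangent positions:

```python
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return x ** 2

def df(x):
    return 2 * x  # derivative of f

xs = np.linspace(-4, 4, 200)
plt.plot(xs, f(xs), label='f(x) = x^2')

# Tangent line at x0: y = f(x0) + f'(x0) * (x - x0)
for x0 in (-3, -1, 1, 3):
    plt.plot(xs, f(x0) + df(x0) * (xs - x0), '--')

plt.legend()
plt.show()
```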
The plot above shows the tangent lines at four positions on the defined
function.
Given one function f, the tangent line at a position (x₀, f(x₀)) is given by:
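$$y = f(x_0) + f'(x_0)\,(x - x_0)$$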
The tangent line gives us much information about that position:

- whether the function is increasing or decreasing
- how fast it increases or decreases
(In fact, the second derivative will reveal even more information to us,
such as the existence of a local minimum in a certain range of the function.
This property is better studied as convexity.)
The cost definition
A cost function is one that describes the quality of the predictions of the
model. The MSE, or Mean Squared Error, measures the average squared
difference between the predictions and the actual outputs over one training
data set. A common form for a linear model h_θ (the factor of 1/2 is a
convenience for the derivative later) is:
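$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$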
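A sketch of that cell (the exact original code is assumed), implementing the cost for a linear model:

```python
import numpy as np

def cost(theta, X, y):
    """MSE cost J(theta) for a linear model h_theta(x) = X @ theta.

    X: (m, n) matrix of training inputs, y: (m,) vector of actual outputs.
    """
    m = len(y)
    residuals = X @ theta - y              # h_theta(x^(i)) - y^(i) for each i
    return (residuals ** 2).sum() / (2 * m)
```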
Utilizing the information revealed by the derivative (the slope of the tangent
line), we can decide how to move x so that the function converges to a
local minimum:
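$$x := x - \lambda \, f'(x)$$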
In the equation above, λ is an added control on the size of the step,
also called the learning rate. Here below is a direct translation of
that observation into our gradient descent algorithm:
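A sketch of the algorithm cell (the original code is assumed); the gradient expression anticipates the derivation just below:

```python
import numpy as np

def predict(theta, x):
    # Linear model: h_theta(x) = sum_j theta_j * x_j
    return sum(t * xj for t, xj in zip(theta, x))

def gradient_descent(X, y, lam=0.01, iterations=1000):
    """Plain, loop-based gradient descent on the MSE cost J(theta)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        grads = np.zeros(n)
        for j in range(n):
            # Partial derivative of J with respect to theta_j (derived below).
            grads[j] = sum((predict(theta, X[i]) - y[i]) * X[i, j]
                           for i in range(m)) / m
        theta = theta - lam * grads        # theta := theta - lambda * gradient
    return theta
```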
The partial derivative of the cost J(θ) defined above with respect to each θ_j
reduces to:
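$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$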
Vectorization - Computation Efficiency
Vectorization transforms the representation of the equation into a form
called the vectorized equation, even though the change is only cosmetic. It
doesn’t change the equation; it merely changes the way we compute it
with computers. There are many libraries, such as numpy in Python, that provide
these vector and matrix representations and the arithmetic over them. Their
internal implementations rely on a technology called
SIMD that works right on the CPU. It is
data parallelism on a computer chip that allows one single CPU instruction to
work over multiple data elements, and thus brings computational efficiency at the
hardware level.
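For the MSE cost assumed above, with X the m × n matrix of training inputs and y the vector of outputs, the whole gradient step collapses into one matrix expression:

$$\theta := \theta - \frac{\lambda}{m}\, X^\top (X\theta - y)$$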
The numpy I used in this writeup is the foundation of popular scientific
computation in the Python ecosystem. Its documentation on broadcasting explains its
implementation of vectorized computation.
Briefly, numpy builds on top of LAPACK, which stands for Linear Algebra Package,
which in turn builds on top of BLAS, which stands for Basic Linear Algebra
Subprograms. The word basic in BLAS refers to the fact that it performs the
lowest-level operations. In fact, BLAS has
three levels of operations: L1 for
scalar, vector, and vector-to-vector operations, L2 for matrix-to-vector operations,
and L3 for matrix-to-matrix operations. LAPACK is designed to exploit the L3
subprograms of BLAS. Because BLAS is a specification, there are other
implementations of BLAS, such as OpenBLAS and Intel®
MKL, that are meant to be more
efficient.
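A sketch of the vectorized cell (the original code is assumed); the loops from before collapse into a couple of matrix operations:

```python
import numpy as np

def gradient_descent_vectorized(X, y, lam=0.01, iterations=1000):
    """Vectorized gradient descent; each step is one matrix expression."""
    m, n = X.shape
    theta = np.zeros(n)
    history = []
    for _ in range(iterations):
        gradient = X.T @ (X @ theta - y) / m     # dJ/dtheta for all j at once
        theta = theta - lam * gradient
        history.append(((X @ theta - y) ** 2).sum() / (2 * m))  # track J
    return theta, history
```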
Predicting the SAT score
Here is my application of the algorithm to a score data set. The linear model
will be trained using the data and then used to predict the score.

There are several factors that affect how GD converges, such as the learning
step and the quality of the training data, so it is useful to observe how GD
behaves during training. One way to show how GD works is to plot the cost against the
number of iterations, to show whether GD is decreasing the cost after each iteration.
Let’s do it here.
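A sketch of those cells, reusing gradient_descent_vectorized from above. The actual data set and its column layout did not carry over, so the file name and columns here are placeholders:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical file and column layout; substitute the real score data set.
data = np.genfromtxt('sat_scores.csv', delimiter=',', skip_header=1)
X_raw, y = data[:, :-1], data[:, -1]

# Prepend a bias column so theta_0 acts as the intercept.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

theta, history = gradient_descent_vectorized(X, y, lam=0.0001, iterations=500)

# Cost against iteration number: a healthy run decreases steadily.
plt.plot(history)
plt.xlabel('iteration')
plt.ylabel('cost J(theta)')
plt.show()
```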
Let’s change our learning step and watch how that affects GD.
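One way to do that (a sketch, reusing the hypothetical X and y above): train with a few different values of λ and overlay the cost curves:

```python
# Reusing X, y, and gradient_descent_vectorized from the cells above.
for lam in (0.00001, 0.0001, 0.001):
    _, history = gradient_descent_vectorized(X, y, lam=lam, iterations=500)
    plt.plot(history, label='lambda = %g' % lam)

plt.xlabel('iteration')
plt.ylabel('cost J(theta)')
plt.legend()
plt.show()
```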