Linear Regression & Gradient Descent: Described Simply, Without the Scikit-Learn Library

Akshar Rastogi
6 min read · Jun 23, 2021

First things first: what do “Linear” and “Regression” actually mean, and why and where should we use them? (Not like other blogs that begin with “it’s the most widely used…” and blah, blah.) P.S. The equations are borrowed from Andrew Ng’s CS229, and I have studied from the same lecture notes.

Linearity means any relationship that can be plotted as a straight line, i.e. a line with a constant slope.

Regression is the process of determining relationships between dependent and independent variables. It is preferably used when the data are continuous and roughly normally distributed; it is usually not applied to categorical variables with classes such as gender, race, or smoker/non-smoker.

So this means we will use Linear regression when we want to establish a linear relationship on some continuous data.

Linear regression, as used here, is about prediction rather than inference. [Prediction vs. Inference]

Linear regression is used for prediction, forecasting, and error reduction. Some examples are stock market prediction, house price prediction, weather forecasting, etc.

Let’s just forget about linear regression for a moment and consider the basic prediction formula

Ŷ = f(X)

This means that Ŷ is the predicted value, which depends on X through a function f(X). f(X) is our black box: we don’t know what’s inside it, but we definitely know it gives us a prediction.

We all know the line equation y = mx + c (never mind if you still don't know it xD). In linear regression we have the same equation:

Y = βX + ε {eqn(1)}

So in the above equation our Y depends on X, but wait a minute, is X dependent on ε? The answer is ‘NO’: ε depends on neither X nor Y (though Y does depend on ε). ε is the error term in our equation.
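To make eqn(1) concrete, here is a tiny sketch that generates data of exactly this form; the slope, sample size, and noise scale below are made-up values, chosen only for illustration.

import random

# Sketch: build Y = beta*X + epsilon with an assumed slope and Gaussian noise.
beta_true = 0.7                                           # assumed slope, purely illustrative
x = [float(i) for i in range(20)]                         # independent variable
y = [beta_true * xi + random.gauss(0, 1.0) for xi in x]   # epsilon is the random noise term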

As Y depends on ε, if we somehow reduce ε we get better prediction results (i.e. our black box gets better), but here comes a twist: errors are of two types, reducible and irreducible.

Reducible error is the part we can shrink by improving our algorithm and its parameters to get better results.

Irreducible error is the variability inherent in the data that is given or fed to us; we cannot remove it, and it will always be there. The reducible part, on the other hand, can be shrunk further with regularization techniques, which is the case with L1 & L2 regularization (LASSO and Ridge).

Code Implementation of Linear Regression:

def mean(arr):
    # arithmetic mean of a list
    total = 0
    for i in range(len(arr)):
        total = total + arr[i]
    return total / len(arr)

def covariance(arr1, arr2):
    # sample covariance of two equal-length lists
    xbar = mean(arr1)
    ybar = mean(arr2)
    total = 0
    for i in range(len(arr1)):
        total = total + (arr1[i] - xbar) * (arr2[i] - ybar)
    return total / (len(arr1) - 1)

def variance(x):
    # sample variance of a list
    xbar = mean(x)
    mean_difference_squared_readings = [pow(xs - xbar, 2) for xs in x]
    return sum(mean_difference_squared_readings) / float(len(x) - 1)

def beta(x, y):
    # slope: beta = cov(x, y) / var(x)
    return covariance(x, y) / variance(x)

def epsilon(x, y):
    # intercept: epsilon = mean(y) - beta * mean(x)
    return mean(y) - beta(x, y) * mean(x)
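As a quick sanity check, the helpers above can be run on a small made-up dataset (the lists below are illustrative, not the data used in the next step):

# Toy data, roughly y = 2x; names are illustrative only.
x_toy = [1, 2, 3, 4, 5]
y_toy = [2.1, 3.9, 6.2, 8.1, 9.8]

b_hat = beta(x_toy, y_toy)                      # estimated slope
a_hat = epsilon(x_toy, y_toy)                   # estimated intercept
y_pred = [a_hat + b_hat * xi for xi in x_toy]   # predictions from Y = βX + ε
print(b_hat, a_hat)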

The weight arrays obtained are-

a = epsilon(df1,df2)
b = beta(df1,df2)
print(a,b)

output-

array([ -791.5065989 , -1309.24081572, -1700.44334054, ..., -278.33176374, 181.59598996, -2604.84740872])
array([0.50833517, 0.70749599, 0.85798289, ..., 0.31092824, 0.13400428, 1.20588698])

The linear relation obtained, plotted on a graph-

import seaborn as sns

sns.lineplot(x=b, y=a)

The above eqn(1) is usually represented as-

hθ(x) = θ₀ + θ₁x    [1]

Here the θ values are the parameters or weights (what we called ε/c and m in eqn(1)), X is the independent variable, and Y is now written as hθ(x).

Now we will define a cost function that we want to make as small as possible to get better results. Don’t worry, this is not a very complex mathematical formula; it’s just the difference between the actual and predicted values, squared. We square the differences to eliminate negatives, since negative differences would otherwise cancel out positive ones and affect our result. This gives rise to ordinary least squares.

J(θ) = ½ Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²    (Cost Function) [2]

Don’t get scared by this formula: hθ(x⁽ⁱ⁾) is the predicted value, y⁽ⁱ⁾ is the actual value, and the superscript i indexes the training examples, ranging from 1 to m in the summation.
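As a sketch, [1] and [2] translate into a few lines of Python; the function and variable names here are my own, chosen for readability, and theta is simply a list [θ₀, θ₁].

def h(theta, xi):
    # hypothesis [1]: h_theta(x) = theta_0 + theta_1 * x  (the predicted value)
    return theta[0] + theta[1] * xi

def cost(theta, x, y):
    # cost function [2]: J(theta) = 1/2 * sum of squared differences
    return 0.5 * sum((h(theta, xi) - yi) ** 2 for xi, yi in zip(x, y))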

Now we need to find a point that minimizes this J(θ), a.k.a. our cost function. The most basic approach is trial and error over every possible value; obviously we would eventually hit the best one. (Just ignore the fact that this is very time-consuming; we are taking baby steps here.)

θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ    [3]

In the above formula, := is a concept borrowed from computer science which means the LHS is assigned the value of the RHS. The parameter α is the learning rate (the size of the step we take toward the minimum), and the partial derivative of J(θ) tells us the direction in which J(θ) increases fastest, so stepping against it keeps decreasing J(θ); in a broad sense, that is what differentiation gives us here.
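Written out as a single numeric step, the update [3] is just an assignment; the numbers below are made up purely to show the mechanics.

alpha = 0.01        # learning rate (illustrative value)
theta_j = 0.5       # current value of the parameter (illustrative)
grad_j = 2.0        # stand-in for dJ(theta)/dtheta_j at the current theta (illustrative)
theta_j = theta_j - alpha * grad_j   # theta_j := theta_j - alpha * dJ/dtheta_j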

Let’s assume a single training example (x, y).

∂J(θ)/∂θⱼ = ∂/∂θⱼ ½(hθ(x) − y)² = (hθ(x) − y) · xⱼ    [4]

The value of hθ(x) is taken from [1]. Now, putting this value of the partial derivative of the cost function into [3], we derive this equation.

θⱼ := θⱼ + α (y⁽ⁱ⁾ − hθ(x⁽ⁱ⁾)) · xⱼ⁽ⁱ⁾    [5]

This update rule is known as the Least Mean Squares (LMS) rule.
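Applying [5] repeatedly over the whole training set gives batch gradient descent for the one-feature model hθ(x) = θ₀ + θ₁x. The sketch below is my own, with made-up data, learning rate, and iteration count, just to show the rule in action; it is not the article's code.

def lms_gradient_descent(x, y, alpha=0.01, n_iters=2000):
    # batch gradient descent using the LMS update [5]
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iters):
        # errors use the old theta values, so both parameters are updated simultaneously
        errors = [yi - (theta0 + theta1 * xi) for xi, yi in zip(x, y)]   # y_i - h_theta(x_i)
        theta0 = theta0 + alpha * sum(errors)                            # x_0 = 1 for the intercept
        theta1 = theta1 + alpha * sum(e * xi for e, xi in zip(errors, x))
    return theta0, theta1

x_demo = [1, 2, 3, 4, 5]
y_demo = [3, 5, 7, 9, 11]                     # exactly y = 2x + 1
print(lms_gradient_descent(x_demo, y_demo))   # approaches (1.0, 2.0)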

The combination of [1], [2], [3], [4], [5] is Gradient Descent.

The objective of gradient descent is to finally find a point that gives the minimum value of the cost function, also called the global minimum.

Code Implementation of Gradient Descent:

This is not a universal replacement for scikit-learn, but simple and easily readable code that will help us understand how gradient descent works.

x_nought = 3                    # where the algorithm starts
alpha = 0.01                    # learning rate
precision = 0.000001
previous_step_size = 1
max_iters = 10000               # maximum number of iterations to be carried out
iters = 0                       # initiating the iteration counter
df = lambda x: 2*(x+10)         # gradient of our function
plot = []                       # x values recorded per iteration (used for the plots below)
iter_plot = []                  # iteration numbers recorded per iteration

while previous_step_size > precision and iters < max_iters:
    prev_x = x_nought                             # store current x value in prev_x
    x_nought = x_nought - alpha*df(prev_x)        # gradient descent step
    previous_step_size = abs(x_nought - prev_x)   # change in x
    iters = iters + 1                             # iteration count
    plot.append(x_nought)                         # record for plotting
    iter_plot.append(iters)
    print('Iterations', iters, 'X value is', x_nought)  # prints all the iterations

print('The local minima is', x_nought, 'Occurs at', iters, 'th iteration')

output:

Iterations 1 X value is 2.74 
Iterations 2 X value is 2.4852000000000003
Iterations 3 X value is 2.2354960000000004
Iterations 4 X value is 1.9907860800000003
Iterations 5 X value is 1.7509703584000003
.
.
Iterations 616 X value is -9.99994880754244
Iterations 617 X value is -9.999949831391591
Iterations 618 X value is -9.999950834763759
Iterations 619 X value is -9.999951818068483
The local minima is -9.999951818068483 Occurs at 619 th iteration

The lambda x: 2*(x+10) in the code is the derivative of (x+10)², which plays the role of the cost function J(θ) in this demo.
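The same loop minimizes any differentiable one-variable function; only df needs to change. For example, to minimize a hypothetical cost f(x) = (x − 3)² instead, the gradient line would become:

df = lambda x: 2 * (x - 3)   # derivative of the hypothetical cost (x - 3)**2; the minimum now sits at x = 3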

sns.lineplot(x=plot, y=iter_plot)
sns.distplot(x=plot)
