# A Ruby Program on Logistic Regression

### Learn about a case where Ruby was even easier than Python for stat crunching.

I wrote (rather translated) a small program in Ruby on logistic regression. There is a GitHub repository "__python-ML-minimal__" that has programs written for Prof. Andrew Ng's machine learning class assignments. The cool thing is they are written with only the core language features without any dependency on libraries.

They are good for learning and understanding basic concepts, but Python has a construct called list comprehension, which looks like a reversed for loop. This construct is not so easy to follow when you are trying to read code and understand the logic behind it. So I promptly translated one program, on __logistic regression__, into Ruby.

Ruby isn't known as a primary language for math. But its functional syntax for operating on collections and its ability to handle formatted files cleanly make it an elegant choice for understanding what an algorithm is doing. Further, in this particular program, I used straightforward loops and array appends in place of the Python list comprehensions, so it is doubly easy to follow: no dependency on non-core libraries and no list comprehensions.
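To make the contrast concrete, here is a minimal sketch of the same dot product written both ways in Ruby. The vectors and variable names are made up for illustration, and the Python line in the comment is an illustrative comprehension, not a quote from the repository. `Array#sum` assumes Ruby 2.4 or later.

```ruby
# Illustrative values only; they are not taken from the data file.
theta = [0.5, -1.0, 2.0]
x = [1.0, 3.0, 0.25]

# A Python list comprehension for this might look like:
#   sum([theta[i] * x[i] for i in range(len(theta))])

# Explicit loop with an accumulator, the style used in this article's program.
total = 0.0
0.upto(theta.length - 1) do |i|
  total += theta[i] * x[i]
end

# Collection-style version: pair up the elements, multiply, and sum.
total_functional = theta.zip(x).sum { |t, v| t * v }

puts total             # prints -2.0
puts total_functional  # prints -2.0
```

Both forms compute the same value; the loop simply spells out the order in which things happen.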

But first a __recap__: In logistic regression, you aim to classify an entity into one of two mutually exclusive classes, such as spam / not-spam or sick / not-sick. The starting concept is the odds, the ratio whose numerator is the probability of the event of interest and whose denominator is 1 minus that probability. The model expresses the log of the odds (the logit) as a weighted sum of the features; passing that value through the logistic (sigmoid) function, which is the inverse of the logit, gives a probability in the range 0 to 1. Using a threshold on that probability, say 0.5, you do the classification.
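The recap can be traced in a few lines of Ruby. This is only a sketch: the helper names `odds`, `log_odds`, `sigmoid`, and `classify` are illustrative and are not part of the program discussed in this article. Here `sigmoid` is the logistic function, the inverse of the logit.

```ruby
# Odds: probability of the event divided by probability of no event.
def odds(prob)
  prob / (1.0 - prob)
end

# Log of the odds, also called the logit.
def log_odds(prob)
  Math.log(odds(prob))
end

# The logistic (sigmoid) function maps log-odds back to a probability in 0..1.
def sigmoid(z)
  1.0 / (1.0 + Math.exp(-z))
end

# Classify with a probability threshold (0.5 by default, as in the text).
def classify(probability, threshold = 0.5)
  probability >= threshold ? 1 : 0
end

prob = 0.8
z = log_odds(prob)
puts odds(prob)            # roughly 4
puts sigmoid(z)            # roughly 0.8, the probability we started with
puts classify(sigmoid(z))  # prints 1
puts classify(0.3)         # prints 0
```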

The data file required to run the program is "logistic_regression_data.txt", and here are a few sample rows:

34.62365962451697, 78.0246928153624, 0

30.28671076822607, 43.89499752400101, 0

35.84740876993872, 72.90219802708364, 0

60.18259938620976, 86.30855209546826, 1

79.0327360507101, 75.3443764369103, 1

Each line is a training example. In machine learning, you have pre-classified data, which is called training data, for some reason that I don't know of. In my programmer-speak, it is a "known" entity.

The program reads the file line by line, splits each line, and appends the fields up to the second-last field as an array to x and the last field to y. The array x thus holds all the training examples, which in my programmer-speak is the array of known entities. m is the number of training examples, and n is the number of features, which in my programmer-speak is the number of attributes of each entity.
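Here is a small sketch of that parsing step on its own. It uses the same split-and-convert logic as the program, but reads two of the sample rows from a `StringIO` instead of standard input so that it runs without the data file; that substitution is an assumption made for illustration.

```ruby
require 'stringio'

# Two rows copied from the sample data shown earlier.
input = StringIO.new(<<~DATA)
  34.62365962451697, 78.0246928153624, 0
  60.18259938620976, 86.30855209546826, 1
DATA

x = []  # feature arrays, one per training example
y = []  # known class of each training example
input.each_line do |line|
  fields = line.chomp.split(',').map(&:to_f)
  x << fields[0..-2]  # every field except the last is a feature
  y << fields[-1]     # the last field is the known class
end

m = x.length     # number of training examples
n = x[0].length  # number of features
puts "m=#{m}, n=#{n}"  # prints m=2, n=2
p y                    # prints [0.0, 1.0]
```

Note that `to_f` ignores the space that follows each comma in the file, so no extra stripping is needed.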

The program then calls the function scalefeatures, passing x, m, and n as arguments. What does the scalefeatures function do? It first calculates the mean and standard deviation of each feature. Then it does feature scaling, a.k.a. data normalization.

You do data normalization when the ranges of values differ widely between variables. For example, age will be a two-digit number, whereas salary can be a five- or six-digit number. Hence you bring them all to comparable values by applying a normalization technique, one of which is feature scaling that converts input data to standard scores: subtract the mean from each value and divide by the standard deviation.
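As a rough illustration of standard scores, the sketch below scales two made-up columns, one in the range of ages and one in the range of salaries. The function name `standard_scores` and the sample values are hypothetical. It divides by the count (population standard deviation), as scalefeatures does, and it assumes the values are not all identical, since the standard deviation would then be zero. `Array#sum` assumes Ruby 2.4 or later.

```ruby
# Convert an array of numbers to standard scores:
# subtract the mean, then divide by the population standard deviation.
def standard_scores(values)
  count = values.length.to_f
  mean = values.sum / count
  variance = values.sum { |v| (v - mean)**2 } / count
  std = Math.sqrt(variance)
  values.map { |v| (v - mean) / std }
end

ages = [20.0, 30.0, 40.0]                  # made-up sample values
salaries = [20_000.0, 30_000.0, 40_000.0]  # made-up sample values

p standard_scores(ages)
p standard_scores(salaries)
```

Although the raw columns differ by three orders of magnitude, their standard scores come out on the same scale, which is the point of the exercise.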

The remaining part of the program is expressed in three functions, each corresponding to a mathematical formula in the logistic regression __lesson__, as given below:

**h_logistic_regression**

```
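# Hypothesis: the sigmoid of the weighted sum theta[0]*x[0] + ... + theta[n]*x[n].
def h_logistic_regression(theta, x, n)
  theta_t_x = 0
  0.upto n do |i|
    theta_t_x += theta[i] * x[i]
  end
  begin
    k = 1.0 / (1 + Math.exp(-theta_t_x))
  rescue
    # Overflow guard. Ruby's Math.exp returns Infinity for large
    # arguments rather than raising, so this branch rarely, if ever, runs.
    if theta_t_x > 10 ** 5
      k = 1.0 / (1 + Math.exp(-100))
    else
      k = 1.0 / (1 + Math.exp(100))
    end
  end
  # Keep k below 1.0 so that Math.log(1 - k) in the cost function stays finite.
  if k == 1.0
    k = 0.99999
  end
  return k
end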
```

**gradientdescent_logistic**

```
# Batch gradient descent: every theta[j] is updated from the same old theta.
def gradientdescent_logistic(theta, x, y, m, n, alpha, iterations)
  0.upto iterations-1 do |i|
    thetatemp = theta.clone
    0.upto n do |j|
      # Sum of (prediction - actual) * feature j over all training examples
      summation = 0.0
      0.upto m-1 do |k|
        summation += (h_logistic_regression(theta, x[k], n) - y[k]) *
                     x[k][j]
      end
      thetatemp[j] = thetatemp[j] - alpha * summation / m
    end
    theta = thetatemp.clone
  end
  return theta
end
```

**cost_logistic_regression**

```
# Average log-loss of the hypothesis over all m training examples.
def cost_logistic_regression(theta, x, y, m, n)
  summation = 0.0
  0.upto m-1 do |i|
    summation += y[i] * Math.log(h_logistic_regression(theta, x[i], n)) +
                 (1 - y[i]) *
                 Math.log(1 - h_logistic_regression(theta, x[i], n))
  end
  return -summation / m
end
```
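For reference, these are the standard formulas that the three functions appear to implement, written in the notation of the code: m training examples, n features (plus the constant term at index 0), and learning rate α. This is my reading of the code, not a quotation from the lesson.

```latex
% Hypothesis (h_logistic_regression)
h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}},
\qquad \theta^{T} x = \sum_{i=0}^{n} \theta_i x_i

% Gradient descent update (gradientdescent_logistic), for j = 0, \dots, n
\theta_j := \theta_j - \frac{\alpha}{m} \sum_{k=1}^{m}
  \left( h_\theta\!\left(x^{(k)}\right) - y^{(k)} \right) x_j^{(k)}

% Cost (cost_logistic_regression)
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[
  y^{(i)} \log h_\theta\!\left(x^{(i)}\right)
  + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta\!\left(x^{(i)}\right)\right)
\right]
```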

The program starts with an initial array of θ's, all set to 0. It applies the gradient descent algorithm to these for 4000 iterations to arrive at the final θ's. It then prints the initial cost and the final cost.
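One quick sanity check on the first number the program prints: with every θ equal to 0, the weighted sum is 0 for any training example, so the hypothesis is 0.5 and each example contributes the same cost, ln 2. The sketch below works that out; `sigmoid` and `example_cost` are illustrative helper names, not functions from the program.

```ruby
# Sigmoid, as computed inside h_logistic_regression.
def sigmoid(z)
  1.0 / (1.0 + Math.exp(-z))
end

# Cost contributed by one training example with prediction h and known class y.
def example_cost(h, y)
  -(y * Math.log(h) + (1 - y) * Math.log(1 - h))
end

h = sigmoid(0.0)         # all-zero theta gives a weighted sum of 0
puts h                   # prints 0.5
puts example_cost(h, 1)  # about 0.693, which is ln 2
puts example_cost(h, 0)  # about 0.693, which is ln 2
```

So the initial cost should come out close to 0.693 whatever the data, and the final cost should be lower if gradient descent has made progress.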

Here's the full program:

```
def scalefeatures(data, m, n)
  mean = [0]
  1.upto n do |j|
    sum = 0.0
    0.upto m-1 do |i|
      sum += data[i][j]
    end
    mean << sum / m
  end
  stddeviation = [0]
  1.upto n do |j|
    temp = 0.0
    0.upto m-1 do |i|
      temp += (data[i][j] - mean[j]) ** 2
    end
    stddeviation << Math.sqrt(temp / m)
  end
  1.upto n do |j|
    0.upto m-1 do |i|
      data[i][j] = (data[i][j] - mean[j]) / stddeviation[j]
    end
  end
  return data
end

def h_logistic_regression(theta, x, n)
  theta_t_x = 0
  0.upto n do |i|
    theta_t_x += theta[i] * x[i]
  end
  begin
    k = 1.0 / (1 + Math.exp(-theta_t_x))
  rescue
    if theta_t_x > 10 ** 5
      k = 1.0 / (1 + Math.exp(-100))
    else
      k = 1.0 / (1 + Math.exp(100))
    end
  end
  if k == 1.0
    k = 0.99999
  end
  return k
end

def gradientdescent_logistic(theta, x, y, m, n, alpha, iterations)
  0.upto iterations-1 do |i|
    thetatemp = theta.clone
    0.upto n do |j|
      summation = 0.0
      0.upto m-1 do |k|
        summation += (h_logistic_regression(theta, x[k], n) - y[k]) *
                     x[k][j]
      end
      thetatemp[j] = thetatemp[j] - alpha * summation / m
    end
    theta = thetatemp.clone
  end
  return theta
end

def cost_logistic_regression(theta, x, y, m, n)
  summation = 0.0
  0.upto m-1 do |i|
    summation += y[i] * Math.log(h_logistic_regression(theta, x[i], n)) +
                 (1 - y[i]) *
                 Math.log(1 - h_logistic_regression(theta, x[i], n))
  end
  return -summation / m
end

def main()
  x = [] # List of training example parameters
  y = [] # List of training example results
  while line = $stdin.gets
    data = line.chomp.split(',').map(&:to_f)
    x << data[0..-2]
    y << data[-1]
  end
  m = x.length # Number of training examples
  n = x[0].length # Number of features
  # Prepend a column of 1's to x
  x.each {|i| i.unshift(1)}
  # Initialize theta's
  initialtheta = [0.0] * (n + 1)
  learningrate = 0.001
  iterations = 4000
  x = scalefeatures(x, m, n)
  # Run gradient descent to get our guessed hypothesis
  finaltheta = gradientdescent_logistic(initialtheta,
                                        x, y, m, n,
                                        learningrate, iterations)
  # Evaluate our hypothesis accuracy
  puts "Initial cost: #{cost_logistic_regression(initialtheta, x, y, m, n)}"
  puts "Final cost: #{cost_logistic_regression(finaltheta, x, y, m, n)}"
end

main()
```

You run it as:

ruby logistic-regression.rb < logistic_regression_data.txt

The data file is available __here__

Published at DZone with permission of Mahboob Hussain, DZone MVB. See the original article here.
