
A Ruby Program on Logistic Regression


Learn about a case where Ruby was even easier than Python for stat crunching.


I wrote (rather, translated) a small Ruby program on logistic regression. There is a GitHub repository, "python-ML-minimal", that has programs written for Prof. Andrew Ng's machine learning class assignments. The cool thing is that they are written with only core language features, without any dependency on libraries.

They are good for learning and understanding basic concepts, but Python has a construct called the list comprehension, which looks like a reversed for loop. This construct is not so easy to follow when you are trying to read the code and understand the logic behind it. So I promptly translated one program, on logistic regression, into Ruby.

Ruby isn't known as a primary language for math, but its functional syntax for operating on collections and its ability to handle formatted files cleanly make it an elegant choice for understanding what an algorithm is doing. Furthermore, in this particular program I used straightforward loops and array appends in place of the Python list comprehensions, so reading it is doubly easy: no dependency on non-core libraries and no list comprehensions. A small illustration of the contrast follows.
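Here is a tiny example, not taken from the program below, showing a Python list comprehension and the equivalent Ruby written as a plain loop with an array append, the style used throughout this program:

# Python: squares = [v * v for v in values if v > 0]
# The same thing in Ruby, as a plain loop with an array append:
values = [3, -1, 4, -1, 5]
squares = []
values.each do |v|
    squares << v * v if v > 0
end
# squares == [9, 16, 25]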


But first, a recap: in logistic regression, you aim to classify an entity into one of two mutually exclusive classes, such as spam/not-spam or sick/not-sick. The starting concept is the odds, the ratio whose numerator is the probability of the event of interest and whose denominator is 1 minus that probability. The logit is the log of the odds; its inverse, the logistic (sigmoid) function, maps the model's linear score back to a probability in the range 0 to 1. Using a threshold on that probability, say 0.5, you do the classification.
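As a minimal sketch of that idea (the sigmoid function and the names here are illustrative, not part of the program below):

# Illustrative sketch only: score -> probability -> class label
def sigmoid(score)
    1.0 / (1 + Math.exp(-score))   # maps any real-valued score into (0, 1)
end

probability = sigmoid(2.3)             # => roughly 0.909
label = probability >= 0.5 ? 1 : 0     # classified as 1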

The data file required to run the program is "logistic_regression_data.txt", and here are a few sample rows:

34.62365962451697, 78.0246928153624, 0

30.28671076822607, 43.89499752400101, 0

35.84740876993872, 72.90219802708364, 0

60.18259938620976, 86.30855209546826, 1

79.0327360507101, 75.3443764369103, 1

Each line is a training example. In machine learning, you have pre-classified data, which is called training data for some reason that I don't know of. In my programmer-speak, it is a "known" entity.

The program reads the file line by line, splits each line, appends the fields up to the second-to-last field as an array to x, and appends the last field to y. Thus x holds all the training examples, which in my programmer-speak is the array of known entities, and m is the number of training examples. n is the number of features, which in my programmer-speak is the number of attributes of each entity.
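For example, the first sample row above is parsed like this (a small illustration using the same calls the program makes):

line = "34.62365962451697, 78.0246928153624, 0"
data = line.chomp.split(',').map(&:to_f)
data[0..-2]   # => [34.62365962451697, 78.0246928153624] -- appended to x
data[-1]      # => 0.0                                   -- appended to y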

The program then calls the function scalefeatures, passing x, m, and n as arguments. What does scalefeatures do? It first calculates the mean and standard deviation of each feature. Then it does feature scaling, a.k.a. data normalization.

You do data normalization when the ranges of values differ too widely between variables. For example, age will be a two-digit number, whereas salary can be five or six digits. Hence you bring them all to comparable values by applying a normalization technique, one of which is feature scaling, which converts the input data to standard scores.
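In formula form, feature scaling replaces each value by its standard score, i.e. its distance from the feature's mean measured in standard deviations (this is what scalefeatures computes, column by column):

\[ x^{(i)}_j \leftarrow \frac{x^{(i)}_j - \mu_j}{\sigma_j} \]

where \( \mu_j \) and \( \sigma_j \) are the mean and standard deviation of feature j over the m training examples.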

The remaining part of the program is expressed in three functions, each corresponding to a mathematical formula from the logistic regression lesson, as given below:

h_logistic_regression
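For reference, this is the hypothesis: the sigmoid (logistic) function applied to the dot product of θ and x:

\[ h_\theta(x) = \frac{1}{1 + e^{-\theta^{T}x}} \]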



def h_logistic_regression(theta, x, n)
    # theta_t_x is the dot product of theta and x (theta transposed times x)
    theta_t_x = 0
    0.upto n do |i|
        theta_t_x += theta[i] * x[i]
    end

    begin
        k = 1.0 / (1 + Math.exp(-theta_t_x))
    rescue
        # If Math.exp raises (e.g. on overflow), fall back to a saturated value
        if theta_t_x > 10 ** 5
            k = 1.0 / (1 + Math.exp(-100))
        else
            k = 1.0 / (1 + Math.exp(100))
        end
    end

    # Keep k strictly below 1 so that Math.log(1 - k) in the cost stays finite
    if k == 1.0
        k = 0.99999
    end

    return k
end


gradientdescent_logistic
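For reference, each iteration updates every θ_j simultaneously, using the previous iteration's θ on the right-hand side:

\[ \theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}_j \]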


def gradientdescent_logistic(theta, x, y, m, n, alpha, iterations)
    0.upto iterations-1 do |i|
        # Update every theta[j] simultaneously from the previous iteration's theta
        thetatemp = theta.clone
        0.upto n do |j|
            summation = 0.0
            0.upto m-1 do |k|
                summation += (h_logistic_regression(theta, x[k], n) - y[k]) *
                             x[k][j]
            end
            thetatemp[j] = thetatemp[j] - alpha * summation / m
        end
        theta = thetatemp.clone
    end
    return theta
end


cost_logistic_regression
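For reference, the cost is the average cross-entropy over the m training examples:

\[ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right] \]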



def cost_logistic_regression(theta, x, y, m, n)
    summation = 0.0
    0.upto m-1 do |i|
        summation += y[i] * Math.log(h_logistic_regression(theta, x[i], n)) +
                     (1 - y[i]) *
                     Math.log(1 - h_logistic_regression(theta, x[i], n))
    end
    return -summation / m
end


The program starts with an initial array of θ's, all set to 0. It uses these to apply the gradient descent algorithm for 4000 iterations and arrive at the final θ's. It then prints the initial cost and the final cost.

Here's the full program:


def scalefeatures(data, m, n)
    # Column 0 is the bias term (all 1's), so only columns 1..n are scaled;
    # index 0 of mean and stddeviation is a placeholder to keep indices aligned
    mean = [0]
    1.upto n do |j|
        sum = 0.0
        0.upto m-1 do |i|
            sum += data[i][j]
        end
        mean << sum / m
    end

    stddeviation = [0]
    1.upto n do |j|
        temp = 0.0
        0.upto m-1 do |i|
            temp += (data[i][j] - mean[j]) ** 2
        end
        stddeviation << Math.sqrt(temp / m)
    end

    1.upto n do |j|
        0.upto m-1 do |i|
            data[i][j] = (data[i][j] - mean[j]) / stddeviation[j]
        end
    end    

    return data
end

def h_logistic_regression(theta, x, n)
    theta_t_x = 0
    0.upto n do |i|
        theta_t_x += theta[i] * x[i]
    end

    begin
        k = 1.0 / (1 + Math.exp(-theta_t_x))
    rescue
        if theta_t_x > 10 ** 5
            k = 1.0 / (1 + Math.exp(-100))
        else
            k = 1.0 / (1 + Math.exp(100))
        end
    end 

    if k == 1.0
        k = 0.99999
    end

    return k
end

def gradientdescent_logistic(theta, x, y, m, n, alpha, iterations)
    0.upto iterations-1 do |i|
        thetatemp = theta.clone
        0.upto n do |j|
            summation = 0.0
            0.upto m-1 do |k|
                summation += (h_logistic_regression(theta, x[k], n) - y[k]) *
                             x[k][j]
            end
            thetatemp[j] = thetatemp[j] - alpha * summation / m
        end
        theta = thetatemp.clone
    end
    return theta
end

def cost_logistic_regression(theta, x, y, m, n)
    summation = 0.0
    0.upto m-1 do |i|
        summation += y[i] * Math.log(h_logistic_regression(theta, x[i], n)) +
                     (1 - y[i]) *
                     Math.log(1 - h_logistic_regression(theta, x[i], n))
    end
    return -summation / m
end

def main()
    x = []  # List of training example parameters
    y = []  # List of training example results

    while line = $stdin.gets
        data = line.chomp.split(',').map(&:to_f)
        x << data[0..-2]
        y << data[-1]
    end

    m = x.length      # Number of training examples
    n = x[0].length   # Number of features

    # Append a column of 1's to x
    x.each {|i| i.unshift(1)}

    # Initialize theta's
    initialtheta = [0.0] * (n + 1)
    learningrate = 0.001
    iterations   = 4000

    x = scalefeatures(x, m, n)

    # Run gradient descent to get our guessed hypothesis
    finaltheta = gradientdescent_logistic(initialtheta,
                                          x, y, m, n,
                                          learningrate, iterations)

    # Evaluate our hypothesis accuracy
    puts "Initial cost: #{cost_logistic_regression(initialtheta, x, y, m, n)}"
    puts "Final cost: #{cost_logistic_regression(finaltheta, x, y, m, n)}"
end

main()

You run it as:

ruby logistic-regression.rb < logistic_regression_data.txt

The data file is available here.



