Derivation of the Normal Equation for Linear Regression

by Eli Bendersky · Jan. 15, 15


I was going through the Coursera "Machine Learning" course, and in the section on multivariate linear regression something caught my eye. Andrew Ng presented the normal equation as an analytical solution to the linear regression problem with a least-squares cost function. He mentioned that in some cases (such as for small feature sets) using it is more effective than applying gradient descent; unfortunately, he left its derivation out.

Here I want to show how the normal equation is derived.

First, some terminology. The following symbols are consistent with the machine learning course, not with the exposition of the normal equation on Wikipedia and other sites; semantically it's all the same, just the symbols are different.

Given the hypothesis function:

$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n$$

We'd like to minimize the least-squares cost:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

where $x^{(i)}$ is the $i$-th sample (from a set of $m$ samples) and $y^{(i)}$ is the $i$-th expected result.

To proceed, we'll represent the problem in matrix notation; this is natural, since we essentially have a system of linear equations here. The regression coefficients $\theta$ we're looking for are the vector:

$$\theta = \begin{pmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{pmatrix}$$

Each of the $m$ input samples is similarly a column vector with $n+1$ rows, $x_0$ being 1 for convenience. So we can now rewrite the hypothesis function as:

$$h_\theta(x^{(i)}) = \theta^T x^{(i)}$$

When this is summed over all samples, we can dip further into matrix notation. We'll define the "design matrix" $X$ (uppercase X) as a matrix of $m$ rows, in which each row is the $i$-th sample (the vector $(x^{(i)})^T$). With this, and letting $y$ denote the column vector of all expected results $y^{(i)}$, we can rewrite the least-squares cost as follows, replacing the explicit sum by matrix multiplication:

$$J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y)$$
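To make the matrix form concrete, here is a minimal NumPy sketch (my own illustration with made-up toy data, not from the original article) that builds a small design matrix with a leading column of ones and checks that the matrix expression for the cost matches the explicit per-sample sum:

```python
import numpy as np

# Toy data: m = 4 samples, n = 2 features, plus the constant x_0 = 1 column.
X = np.array([[1.0, 2.0,  3.0],
              [1.0, 0.5, -1.0],
              [1.0, 4.0,  0.0],
              [1.0, 1.5,  2.5]])     # design matrix, shape (m, n+1)
y = np.array([7.0, 1.0, 9.0, 6.0])   # expected results, shape (m,)
theta = np.array([0.5, 1.0, -0.3])   # some candidate coefficients

m = X.shape[0]

# Explicit sum over samples: J = 1/(2m) * sum_i (theta^T x_i - y_i)^2
J_sum = sum((theta @ X[i] - y[i]) ** 2 for i in range(m)) / (2 * m)

# Matrix form: J = 1/(2m) * (X theta - y)^T (X theta - y)
r = X @ theta - y
J_matrix = (r @ r) / (2 * m)

assert np.isclose(J_sum, J_matrix)
print(J_sum, J_matrix)
```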

Now, using some matrix transpose identities, we can simplify this a bit. I'll throw the $\frac{1}{2m}$ part away, since we're going to set the derivative to zero anyway:

$$J(\theta) = \left((X\theta)^T - y^T\right)(X\theta - y)$$

$$J(\theta) = (X\theta)^T X\theta - (X\theta)^T y - y^T X\theta + y^T y$$

Note that $X\theta$ is a vector, and so is $y$, so when we multiply one by the other it doesn't matter what the order is (both products are the same scalar, i.e. $(X\theta)^T y = y^T X\theta$). So we can further simplify:

$$J(\theta) = \theta^T X^T X \theta - 2(X\theta)^T y + y^T y$$

Recall that here $\theta$ is our unknown. To find where the above function has a minimum, we will differentiate with respect to $\theta$ and set the result to zero. Differentiating with respect to a vector may feel uncomfortable, but there's nothing to worry about: recall that here we only use matrix notation to conveniently represent a system of linear formulae. So we differentiate with respect to each component of the vector, and then combine the resulting derivatives into a vector again. The result is:

$$\frac{\partial J}{\partial \theta} = 2X^T X \theta - 2X^T y = 0$$

Or:

$$X^T X \theta = X^T y$$
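If the vector derivative still feels suspicious, it can be checked numerically. The sketch below (again my own, on arbitrary random data) compares the analytic gradient $2X^T X\theta - 2X^T y$ of the simplified cost $(X\theta - y)^T(X\theta - y)$ against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # design matrix with x_0 = 1
y = rng.normal(size=m)
theta = rng.normal(size=n + 1)

# Simplified cost from the derivation above (the 1/2m factor dropped).
def J(t):
    r = X @ t - y
    return r @ r

# Analytic gradient: dJ/dtheta = 2 X^T X theta - 2 X^T y
grad_analytic = 2 * X.T @ X @ theta - 2 * X.T @ y

# Central finite differences, one component of theta at a time.
eps = 1e-6
grad_numeric = np.array([(J(theta + eps * e) - J(theta - eps * e)) / (2 * eps)
                         for e in np.eye(n + 1)])

assert np.allclose(grad_analytic, grad_numeric)
print(np.max(np.abs(grad_analytic - grad_numeric)))
```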

Now, assuming that the matrix $X^T X$ is invertible, we can multiply both sides by $(X^T X)^{-1}$ and get:

$$\theta = (X^T X)^{-1} X^T y$$

Which is the normal equation.
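In code, the normal equation is essentially a one-liner. Here is a minimal NumPy sketch (my own illustration, not part of the article) that recovers known coefficients from noiseless toy data and cross-checks the result against NumPy's least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(42)
m, n = 100, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # x_0 = 1 column plus n features
theta_true = np.array([4.0, -2.0, 0.5])
y = X @ theta_true  # noiseless targets, so the true coefficients are recoverable

# The normal equation, literally: theta = (X^T X)^{-1} X^T y
theta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Equivalent, but numerically preferable: solve X^T X theta = X^T y directly.
theta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(theta_normal, theta_true)
assert np.allclose(theta_solve, theta_true)
assert np.allclose(theta_lstsq, theta_true)
print(theta_normal)
```

Explicitly forming $(X^T X)^{-1}$ is shown only to mirror the formula; in practice it's better to solve the system $X^T X \theta = X^T y$ (or use a dedicated least-squares routine), both for numerical stability and because, as noted above, the inverse may not exist when features are linearly dependent.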


Published at DZone with permission of Eli Bendersky. See the original article here.
