Building a full-text index of git commits using lunr.js and Github APIs

By Gary Sieling · May. 20, 13

GitHub has a nice API for inspecting repositories – it lets you read gists, issues, commit history, files and so on. Git repository data lends itself to demonstrating the power of combining full-text and faceted search, as there is a mix of free-text fields (commit messages, code) and enumerable fields (committers, dates, committer employers). GitHub APIs return JSON, which has the nice property of being a tree structure – results can be recursed over without fear of infinite loops. Note that to download the entire commit history for a repository, you need to page through it by SHA hash. The API I use here lacks diffs, which must be retrieved elsewhere.

To test this, request a URL like the one below. The configurable parts are the repository owner and the repository name.
https://api.github.com/repos/torvalds/linux/commits

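This is what a commit looks like (in the list response from the URL above, these fields are nested under a "commit" key, which is where flattened names such as commit_author_name come from):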

{
  "sha": "7638417db6d59f3c431d3e1f261cc637155684cd",
  "url": "https://api.github.com/repos/octocat/Hello-World/git/commits/7638417db6d59f3c431d3e1f261cc637155684cd",
  "author": {
    "date": "2008-07-09T16:13:30+12:00",
    "name": "Scott Chacon",
    "email": "[email protected]"
  },
  "committer": {
    "date": "2008-07-09T16:13:30+12:00",
    "name": "Scott Chacon",
    "email": "[email protected]"
  },
  "message": "my commit message",
  "tree": {
    "url": "https://api.github.com/repos/octocat/Hello-World/git/trees/827efc6d56897b048c772eb4087f854f46256132",
    "sha": "827efc6d56897b048c772eb4087f854f46256132"
  },
  "parents": [
    {
      "url": "https://api.github.com/repos/octocat/Hello-World/git/commits/7d1b31e74ee336d15cbd21741bc88a537ed063a0",
      "sha": "7d1b31e74ee336d15cbd21741bc88a537ed063a0"
    }
  ]
}
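
Paging through the full history by SHA, as noted above, might be sketched as follows. This is only an illustration under stated assumptions: fetchPage is a hypothetical caller-supplied function standing in for the HTTP request, and each page after the first is assumed to begin with the commit whose SHA was requested, so that duplicate is dropped.

```javascript
// Collect commits page by page. fetchPage(sha) is a hypothetical,
// caller-supplied function that returns an array of commits, newest first,
// starting at `sha` (or at the branch head when `sha` is undefined).
function collectCommits(fetchPage, maxPages) {
  var all = [];
  var startSha; // undefined on the first request
  for (var page = 0; page < maxPages; page++) {
    var commits = fetchPage(startSha);
    // Assumption: later pages repeat the commit we started from, so drop it.
    if (startSha !== undefined) {
      commits = commits.slice(1);
    }
    if (commits.length === 0) {
      break; // nothing new: the history is exhausted
    }
    all = all.concat(commits);
    startSha = commits[commits.length - 1].sha;
  }
  return all;
}

// Stand-in for the network: five fake commits served two at a time.
var fakeHistory = ["c5", "c4", "c3", "c2", "c1"].map(function (sha) {
  return { sha: sha };
});
function fakeFetchPage(sha) {
  var shas = fakeHistory.map(function (c) { return c.sha; });
  var start = sha === undefined ? 0 : shas.indexOf(sha);
  return fakeHistory.slice(start, start + 2);
}

var collected = collectCommits(fakeFetchPage, 10);
console.log(collected.map(function (c) { return c.sha; }).join(","));
// c5,c4,c3,c2,c1
```

The maxPages argument is a guard I added so a misbehaving fetchPage cannot loop forever; a real client would also need to respect API rate limits.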

To make the test simple, I download these as JSON locally, then start a Python web server. Were I to make many such calls on a public site, I’d set up a proxy to the GitHub APIs.

python -m SimpleHTTPServer    # Python 2; on Python 3 use: python3 -m http.server

This data has a number of nested objects and must be flattened to fit into the lunr.js full-text index. This example uses the commit number (0, 1, 2..N) as the location in the index, but a real environment should use the commit hash to allow partitioning the ingestion process. Nested objects are flattened by joining successive keys with underscores. A production-worthy solution needs to escape these to prevent collisions.

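// Flattened documents, one per commit, indexed by commit number.
var documents = [];

// Walk a value depth-first, building keys like "commit_author_name".
// Only plain objects are descended into; arrays (e.g. "parents") and
// scalars are handed to process() as-is.
function recurse(doc_num, base, obj, value) {
  if ($.isPlainObject(value)) {
    $.each(value, function (k, v) {
      recurse(doc_num, base + obj + "_", k, v);
    });
  } else {
    process(doc_num, base + obj, value);
  }
}

// Store a leaf value as a string; nulls are skipped.
function process(doc_num, key, value) {
  if (documents.length <= doc_num) {
    documents[doc_num] = {};
  }

  if (value !== null) {
    documents[doc_num][key] = value + '';
  }
}

// `data` is the array of commit objects parsed from the downloaded JSON.
$.each(data, function (doc_num, commit) {
  $.each(commit, function (k, v) {
    recurse(doc_num, '', k, v);
  });
});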
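
The flattening above joins keys with a bare underscore, so two different key paths can produce the same field name. One possible way to escape keys is sketched below without jQuery. The backslash escaping scheme and the names escapeKey and flattenCommit are my own assumptions for illustration, not part of lunr or the GitHub API.

```javascript
// Escape backslashes first, then underscores, so that an unescaped "_"
// in the result is always a separator between path segments.
function escapeKey(key) {
  return String(key).replace(/\\/g, "\\\\").replace(/_/g, "\\_");
}

// Flatten nested plain objects into { "a_b_c": "value" } form.
// Arrays and scalars are stringified; null and undefined are skipped.
function flattenCommit(value, prefix, out) {
  out = out || {};
  var isObject = value !== null && typeof value === "object" &&
    !Array.isArray(value);
  if (isObject) {
    Object.keys(value).forEach(function (k) {
      var path = prefix === undefined
        ? escapeKey(k)
        : prefix + "_" + escapeKey(k);
      flattenCommit(value[k], path, out);
    });
  } else if (value !== null && value !== undefined && prefix !== undefined) {
    out[prefix] = String(value);
  }
  return out;
}

var flat = flattenCommit({
  sha: "7638417",
  commit: { author: { name: "Scott Chacon" }, message: "my commit message" }
});
console.log(flat.commit_author_name); // Scott Chacon

// Two key paths that would both become "a__b" without escaping.
var tricky = flattenCommit({ "a_": { b: 1 }, a: { "_b": 2 } });
console.log(Object.keys(tricky).length); // 2
```

The trade-off is that any source key that already contains an underscore gains a backslash in its flattened name, so field names passed to the index would need to use the escaped form.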

Normally, one sets up a lunr full-text index by specifying all the fields up front, much like Solr’s numerous XML config files. Lunr doesn’t have nearly as many configuration options: the only per-field setting is the ‘boost’ parameter, which increases the weight of certain fields in ranking. I imagine this will change as the project grows, at the very least to include type hints.

Given the simplicity of field objects, you can infer the field list from JSON payloads. The code below provides two modes: one where you inspect the entire JSON payload, and one where you limit how many commits you check, a good option when the JSON data is consistent.

The function accepts configuration objects resembling ExtJS config objects, which lets you override as desired. If fields derived from existing data are required, they can be inserted after any documents are inserted.

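function inferIndex(documents, config) {
  return lunr(function() {
    this.ref('id');
    var found = {};   // field names already registered
    var idx = this;

    $.each(documents,
      function(doc_num, doc) {

        // Stop once `limit` documents have been inspected;
        // returning false ends the $.each loop.
        if (config &&
            config.limit &&
            config.limit <= doc_num)
          return false;

        $.each(doc, function(k, v) {
          if (!found[k]) {
            // Per-field options such as {boost: 10} come from config.
            if (config && config[k]) {
              idx.field(k, config[k]);
            } else {
              idx.field(k);
            }
            found[k] = true;
          }
        });
    });
  });
}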
 
var index =
  inferIndex(documents,
    {limit: 1,
     'commit_author_name':{boost:10}});
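
To make the trade-off between the two modes concrete, here is a library-free sketch of field inference alone. The function name inferFields and the sample field commit_note are hypothetical. It shows that limiting inspection to the first document misses a field that only appears later.

```javascript
// Return the distinct field names seen in the first `limit` documents,
// or in all documents when `limit` is omitted.
function inferFields(documents, limit) {
  var seen = Object.create(null); // no inherited keys such as "constructor"
  var fields = [];
  var count = limit === undefined
    ? documents.length
    : Math.min(limit, documents.length);
  for (var i = 0; i < count; i++) {
    Object.keys(documents[i] || {}).forEach(function (k) {
      if (!seen[k]) {
        seen[k] = true;
        fields.push(k);
      }
    });
  }
  return fields;
}

var sampleDocs = [
  { sha: "a", commit_message: "first" },
  { sha: "b", commit_message: "second", commit_note: "only here" }
];

console.log(inferFields(sampleDocs, 1)); // [ 'sha', 'commit_message' ]
console.log(inferFields(sampleDocs));    // also includes 'commit_note'
```

A limit is therefore only safe when every document is known to carry the same keys.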

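Inserting flattened documents into the index becomes simple. The function below takes an optional callback, should you desire to add calculated fields.

// jQuery's $.each passes only (index, value) to its callback, so the
// optional doc_cb is taken as a parameter of the wrapping function.
function addDocuments(index, documents, doc_cb) {
  $.each(documents, function(doc_num, attrs) {
    var doc = $.extend({id: doc_num}, attrs);

    if (doc_cb) {
      doc = doc_cb(doc);
    }

    index.add(doc);
  });
}

addDocuments(index, documents);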
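
As an illustration of what such a callback might compute, the sketch below derives a month bucket and an email domain, the sort of fields that are convenient for faceting by date or employer. The function name addDerivedFields, the derived field names, and the sample email address are all hypothetical.

```javascript
// Return a copy of a flattened document with two calculated fields added:
//   commit_author_month  - "YYYY-MM" taken from the ISO 8601 author date
//   commit_author_domain - the part of the author email after "@"
function addDerivedFields(doc) {
  var out = {};
  Object.keys(doc).forEach(function (k) {
    out[k] = doc[k];
  });

  var date = doc.commit_author_date;
  if (typeof date === "string" && date.length >= 7) {
    out.commit_author_month = date.slice(0, 7);
  }

  var email = doc.commit_author_email;
  if (typeof email === "string" && email.indexOf("@") !== -1) {
    out.commit_author_domain = email.split("@").pop().toLowerCase();
  }

  return out;
}

var enriched = addDerivedFields({
  id: 0,
  commit_author_date: "2008-07-09T16:13:30+12:00",
  commit_author_email: "scott@Example.com" // made-up address
});
console.log(enriched.commit_author_month);  // 2008-07
console.log(enriched.commit_author_domain); // example.com
```

To be searchable, derived fields would presumably also have to be registered as index fields. To facet on them, they need to be present in whatever document store the facet counts are read from.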

At this point we’ve indexed the entire commit history from a git repository, which lets us search for commits by topic. While this is useful, it’d be really nice to be able to facet on fields, which would return the number of documents in a category, like a SQL group by. I’ve found it particularly convenient to facet on author, date, or author’s company.

If you have access to the original documents, you can easily construct facets based on the results of a lunr search:

function facet(index, query, data, field) {
  var results = index.search(query);

  // Count how many matching documents carry each value of `field`.
  var facets = {};
  $.each(results, function(i, searchResult) {
    // searchResult.ref is the document id, i.e. its position in `data`.
    var doc = data[searchResult.ref];
    var value = doc[field];

    facets[value] =
      (facets[value] === undefined ? 0 : facets[value]) + 1;
  });

  return facets;
}

Commit messages in repositories where I work often contain names of clients who requested a feature or bug fix. Consequently, doing a search faceted by author provides a list of who worked with each client the most – this can also tell you who has worked with various pieces of technology.

The following query demonstrates this technique:

var facets =
   facet(index,
        'driver',
        documents,
        'commit_author_name');
{"Wolfram Sang":24,"Linus Torvalds":3}
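
To read off who worked on a topic the most, the facet counts can be ranked. A minimal sketch, with the hypothetical helper name topFacets and the counts from the result above as input:

```javascript
// Turn a { value: count } facet object into an array sorted by
// descending count, keeping at most n entries.
function topFacets(facets, n) {
  return Object.keys(facets)
    .map(function (value) {
      return { value: value, count: facets[value] };
    })
    .sort(function (a, b) {
      return b.count - a.count;
    })
    .slice(0, n);
}

var ranked = topFacets({ "Linus Torvalds": 3, "Wolfram Sang": 24 }, 5);
console.log(ranked[0].value + ": " + ranked[0].count); // Wolfram Sang: 24
```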

The approach shown here works well, but retrieving results requires access to the original document data. If we want to filter the results to a category, we need a richer search API than lunr currently provides, as well as callback options within the search API. In Solr there are also options to skip lower-casing data, since lower-casing may be inappropriate for category titles. Mitigating these issues will be explored further in future essays.
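
Until the search API offers such filtering, one workaround is to filter after the search, again using the original documents. The sketch below assumes search results are objects with a ref property holding the document's position in the documents array, as in the facet function earlier. The helper name filterResults, the sample data and the scores are made up.

```javascript
// Keep only the search results whose source document has
// documents[ref][field] equal to value.
function filterResults(results, documents, field, value) {
  return results.filter(function (result) {
    var doc = documents[result.ref];
    return doc !== undefined && doc[field] === value;
  });
}

var sampleDocuments = [
  { commit_author_name: "Wolfram Sang" },
  { commit_author_name: "Linus Torvalds" },
  { commit_author_name: "Wolfram Sang" }
];
var sampleResults = [
  { ref: "0", score: 0.9 },
  { ref: "1", score: 0.5 },
  { ref: "2", score: 0.3 }
];

var filtered = filterResults(
  sampleResults, sampleDocuments, "commit_author_name", "Wolfram Sang");
console.log(filtered.map(function (r) { return r.ref; }).join(",")); // 0,2
```

Because the comparison runs against the stored document rather than the indexed tokens, the category value keeps its original casing.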


Published at DZone with permission of Gary Sieling. See the original article here.
