Building a full-text index of git commits using lunr.js and Github APIs

By Gary Sieling · May. 20, 13

GitHub has a nice API for inspecting repositories: it lets you read gists, issues, commit history, files, and so on. Git repository data lends itself to demonstrating the power of combining full-text and faceted search, as there is a mix of free-text fields (commit messages, code) and enumerable fields (committers, dates, committer employers). The GitHub APIs return JSON, which has the nice property of being a tree structure, so results can be recursed over without fear of infinite loops. Note that to download the entire commit history for a repository, you need to page through it by sha hash. The API I use here lacks diffs, which must be retrieved elsewhere.

To test this, request a URL like the one below. The configurable parts are the repository owner and the repository name.
https://api.github.com/repos/torvalds/linux/commits
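The paging by sha mentioned above can be sketched roughly as follows. This is a minimal illustration, not the author's code: commitsUrl and appendPage are hypothetical helper names, the sample pages are made up, and it assumes that the sha query parameter makes the listing start at that commit, so each new page repeats the last commit of the previous one.

```javascript
// Build the commits URL for a repository. `startSha` is optional; when given,
// the listing starts at that commit (the API's `sha` query parameter).
function commitsUrl(owner, repo, startSha) {
  var url = 'https://api.github.com/repos/' +
    encodeURIComponent(owner) + '/' + encodeURIComponent(repo) + '/commits';
  return startSha ? url + '?sha=' + encodeURIComponent(startSha) : url;
}

// Append one page of results to the commits collected so far. A page that
// starts at the last known sha repeats that commit, so the duplicate is dropped.
function appendPage(collected, page) {
  var last = collected.length ? collected[collected.length - 1].sha : null;
  var fresh = page.filter(function (commit) { return commit.sha !== last; });
  return collected.concat(fresh);
}

// Hypothetical pages, shaped like heavily trimmed API responses.
var page1 = [{ sha: 'aaa' }, { sha: 'bbb' }];
var page2 = [{ sha: 'bbb' }, { sha: 'ccc' }];
var all = appendPage(appendPage([], page1), page2);

console.log(commitsUrl('torvalds', 'linux', 'bbb'));
console.log(all.map(function (c) { return c.sha; }).join(','));
```

A download loop would request commitsUrl(owner, repo, lastSha) repeatedly and stop once a page contributes no new commits.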

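This is what a commit looks like (this sample is a bare git commit object; in the commit listing above, these fields sit under a "commit" key, which is why the flattened field names used later start with commit_):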

{
  "sha": "7638417db6d59f3c431d3e1f261cc637155684cd",
  "url": "https://api.github.com/repos/octocat/Hello-World/git/commits/7638417db6d59f3c431d3e1f261cc637155684cd",
  "author": {
    "date": "2008-07-09T16:13:30+12:00",
    "name": "Scott Chacon",
    "email": "schacon@gmail.com"
  },
  "committer": {
    "date": "2008-07-09T16:13:30+12:00",
    "name": "Scott Chacon",
    "email": "schacon@gmail.com"
  },
  "message": "my commit message",
  "tree": {
    "url": "https://api.github.com/repos/octocat/Hello-World/git/trees/827efc6d56897b048c772eb4087f854f46256132",
    "sha": "827efc6d56897b048c772eb4087f854f46256132"
  },
  "parents": [
    {
      "url": "https://api.github.com/repos/octocat/Hello-World/git/commits/7d1b31e74ee336d15cbd21741bc88a537ed063a0",
      "sha": "7d1b31e74ee336d15cbd21741bc88a537ed063a0"
    }
  ]
}

To keep the test simple, I download these as JSON locally, then start a Python web server. Were I to make many such calls on a public site, I’d set up a proxy to the GitHub APIs. The command below is the Python 2 form; on Python 3 the equivalent module is http.server.

python -m SimpleHTTPServer

This data has a number of nested objects and must be flattened to fit into the lunr.js full-text index. This example uses the commit number (0, 1, 2..N) as the location in the index, but a real environment should use the commit hash to allow partitioning the ingestion process. Nested objects are flattened by joining successive keys with underscores. A production-worthy solution needs to escape underscores that occur inside key names to prevent collisions.

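// Flattened documents, indexed by commit number.
var documents = [];

// Walk a value, joining nested keys with underscores
// (e.g. commit.author.name becomes commit_author_name).
function recurse(doc_num, base, obj, value) {
  // Arrays (such as parents) are walked too, using the element index as key;
  // otherwise they would be stored as "[object Object]".
  if ($.isPlainObject(value) || Array.isArray(value)) {
    $.each(value, function (k, v) {
      recurse(doc_num, base + obj + "_", k, v);
    });
  } else {
    process(doc_num, base + obj, value);
  }
}

// Store a leaf value as a string; null values are skipped.
function process(doc_num, key, value) {
  if (documents.length <= doc_num)
    documents[doc_num] = {};

  if (value !== null)
    documents[doc_num][key] = value + '';
}

// `data` is the array of commits parsed from the downloaded JSON.
$.each(data, function(doc_num, commit) {
  $.each(commit, function(k, v) {
    recurse(doc_num, '', k, v);
  });
});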
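The collision risk and the commit-hash keying mentioned above could be handled along these lines. This is only a sketch under stated assumptions: escapeKey and flatten are hypothetical names, the backslash escaping scheme is one possible choice rather than anything lunr requires, and the sample commit is invented to force a collision.

```javascript
// Escape backslashes and underscores inside a key so that a bare "_" in the
// flattened name can only ever be a separator between nesting levels.
function escapeKey(key) {
  return String(key).replace(/[\\_]/g, '\\$&');
}

// Flatten nested objects and arrays into a single-level object of strings.
function flatten(value, prefix, out) {
  out = out || {};
  if (value !== null && typeof value === 'object') {
    Object.keys(value).forEach(function (k) {
      var name = prefix ? prefix + '_' + escapeKey(k) : escapeKey(k);
      flatten(value[k], name, out);
    });
  } else if (value !== null && value !== undefined) {
    out[prefix] = String(value);
  }
  return out;
}

// Invented commit: without escaping, both leaves would be named
// "commit_author_name" and one would overwrite the other.
var commits = [
  { sha: 'abc', commit: { author: { name: 'Ann' }, author_name: 'x' } }
];

// Key documents by commit hash instead of by position.
var bySha = {};
commits.forEach(function (c) { bySha[c.sha] = flatten(c, '', {}); });

console.log(Object.keys(bySha['abc']));
```

Keying by sha means separate ingestion runs can add documents without agreeing on a shared counter.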

Normally, one sets up a lunr full-text index by specifying all the fields, much like Solr’s numerous XML config files. Lunr has far fewer configuration options: the only per-field setting is the ‘boost’ parameter, which increases the weight of certain fields in ranking. I imagine this will change as the project grows, at the very least to include type hints.

Given the simplicity of field objects, you can infer the field list from the JSON payloads. The code below provides two modes: one where you inspect the entire JSON payload, and one where you limit how many commits you check, a good option when the JSON data is consistent.

The function accepts configuration objects resembling ExtJS config objects, which lets you override as desired. If fields derived from existing data are required, they can be inserted after any documents are inserted.

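function inferIndex(documents, config) {
  return lunr(function() {
    // Documents are identified by their 'id' property.
    this.ref('id');
    var found = {};
    var idx = this;

    $.each(documents,
      function(doc_num, doc) {

        // With a limit set, only documents 0..limit are inspected;
        // returning false stops the $.each loop early.
        if (config &&
            config.limit &&
            config.limit < doc_num)
          return false;

        $.each(doc, function(k, v) {
          if (!found[k]) {
            // Pass per-field options (such as boost) when configured.
            if (config && config[k]) {
              idx.field(k, config[k]);
            } else {
              idx.field(k);
            }
            found[k] = true;
          }
        });
    });
  });
}

// Infer fields from the first commits and boost matches on author name.
var index =
  inferIndex(documents,
    {limit: 1,
     'commit_author_name':{boost:10}});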
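To see what the sampling trade-off looks like in isolation, here is a lunr-free sketch of the field inference step. The name inferFields and the sample documents are hypothetical, and unlike the function above, limit here is simply the number of documents to sample.

```javascript
// Collect the distinct field names seen in the first `limit` documents
// (all documents when `limit` is omitted), in first-seen order.
function inferFields(documents, limit) {
  var sample = limit ? documents.slice(0, limit) : documents;
  var seen = Object.create(null); // no inherited keys such as "constructor"
  var fields = [];
  sample.forEach(function (doc) {
    Object.keys(doc).forEach(function (k) {
      if (!seen[k]) {
        seen[k] = true;
        fields.push(k);
      }
    });
  });
  return fields;
}

// Made-up flattened documents; the second has a field the first lacks.
var docs = [
  { sha: 'a', commit_message: 'fix driver' },
  { sha: 'b', commit_message: 'add test', commit_author_name: 'Ann' }
];

console.log(inferFields(docs, 1)); // sampled: misses commit_author_name
console.log(inferFields(docs));    // full scan: finds all three fields
```

Sampling is cheaper, but any field that first appears after the sampled documents is never declared, so it only suits payloads with a consistent shape.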

Inserting flattened documents into the index becomes simple. The function below accepts an optional callback, should you want to add calculated fields.

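// doc_cb is optional; when given, it receives each document
// before indexing and returns the (possibly extended) document.
function addDocuments(index, documents, doc_cb) {
  $.each(documents, function(doc_num, attrs) {
    // Copy the flattened fields and attach the 'id' used as the ref.
    var doc =
      $.extend(
        {id: doc_num}, attrs);

    if (doc_cb) {
      doc = doc_cb(doc);
    }

    index.add(doc);
  });
}

addDocuments(index, documents);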
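As an illustration of the kind of callback described above, the sketch below derives two calculated fields from a flattened commit. It is an assumption-laden example: addCalculatedFields and the two new field names are hypothetical, and it assumes the flattened documents carry commit_author_date (ISO 8601) and commit_author_email.

```javascript
// Hypothetical callback: derive facet-friendly fields from a flattened commit.
function addCalculatedFields(doc) {
  // Work on a shallow copy so the input document is left untouched.
  var out = {};
  Object.keys(doc).forEach(function (k) { out[k] = doc[k]; });

  // ISO 8601 dates start with YYYY-MM, so the first 7 characters give a month bucket.
  var date = doc.commit_author_date;
  if (typeof date === 'string' && date.length >= 7) {
    out.commit_author_month = date.slice(0, 7);
  }

  // The part after "@" is a rough stand-in for the author's employer.
  var email = doc.commit_author_email;
  if (typeof email === 'string' && email.indexOf('@') !== -1) {
    out.commit_author_domain =
      email.slice(email.lastIndexOf('@') + 1).toLowerCase();
  }
  return out;
}

var enriched = addCalculatedFields({
  id: 0,
  commit_author_date: '2008-07-09T16:13:30+12:00',
  commit_author_email: 'schacon@gmail.com'
});
console.log(enriched.commit_author_month, enriched.commit_author_domain);
```

To search or facet on such fields, they would also need to be declared on the index and kept alongside the original documents.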

At this point we’ve indexed the entire commit history from a git repository, which lets us search for commits by topic. While this is useful, it’d be really nice to be able to facet on fields, which would return the number of documents in each category, like a SQL GROUP BY. I’ve found it particularly convenient to facet on author, date, or author’s company.

If you have access to the original documents, you can easily construct facets based on the results of a lunr search:

function facet(index, query, data, field) {
  var results = index.search(query);

  // Count how many matching documents share each value of `field`.
  var facets = {};
  $.each(results, function(i, searchResult) {
    // searchResult.ref is the document id, i.e. its position in `data`.
    var doc = data[searchResult.ref];
    var value = doc[field];

    facets[value] =
      (facets[value] === undefined ? 0 : facets[value]) + 1;
  });

  return facets;
}

Commit messages in repositories where I work often contain the names of clients who requested a feature or bug fix. Consequently, a search faceted by author shows who has worked with each client the most. The same approach can also tell you who has worked with various pieces of technology.

The following query demonstrates this technique:

var facets =
  facet(index,
        'driver',
        documents,
        'commit_author_name');
// Result: {"Wolfram Sang":24,"Linus Torvalds":3}
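A facet map like the one above is easier to read once it is ranked. The helper below is a small sketch for that; topFacets is a hypothetical name, and ties are broken alphabetically as an arbitrary choice.

```javascript
// Turn a facet map into [value, count] pairs, largest count first.
// Ties are ordered alphabetically by value.
function topFacets(facets, n) {
  var pairs = Object.keys(facets).map(function (k) {
    return [k, facets[k]];
  });
  pairs.sort(function (a, b) {
    if (b[1] !== a[1]) return b[1] - a[1];
    return a[0] < b[0] ? -1 : (a[0] > b[0] ? 1 : 0);
  });
  return n === undefined ? pairs : pairs.slice(0, n);
}

var ranked = topFacets({ 'Linus Torvalds': 3, 'Wolfram Sang': 24 });
console.log(JSON.stringify(ranked));
// [["Wolfram Sang",24],["Linus Torvalds",3]]
```

Passing a second argument, as in topFacets(facets, 1), keeps only the most frequent values.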

The approach shown here works well, but building facets requires access to the original document data. If we want to filter the results to a category, we need a richer search API than lunr currently provides, as well as callback options within the search API. In Solr there are also options to skip lower-casing data, as lower-casing may be inappropriate for category titles. Mitigating these issues will be explored further in future essays.
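Until such an API exists, one stopgap is to filter the search results after the fact. The sketch below is a client-side workaround rather than the richer search API described above, and it still depends on the original documents. filterResults is a hypothetical name, and the results array is made up but shaped like lunr's { ref, score } objects.

```javascript
// Keep only the search results whose source document has `field` equal to
// `value`. `results` is assumed to be an array of { ref, score } objects and
// `data` the array of flattened documents, indexed by ref.
function filterResults(results, data, field, value) {
  return results.filter(function (r) {
    var doc = data[r.ref];
    return doc !== undefined && doc[field] === value;
  });
}

// Made-up documents and search results for illustration.
var sampleDocs = [
  { commit_author_name: 'Wolfram Sang', commit_message: 'i2c driver fix' },
  { commit_author_name: 'Linus Torvalds', commit_message: 'merge driver tree' }
];
var sampleResults = [{ ref: '0', score: 0.8 }, { ref: '1', score: 0.5 }];

var onlyWolfram =
  filterResults(sampleResults, sampleDocs, 'commit_author_name', 'Wolfram Sang');
console.log(onlyWolfram.length, onlyWolfram[0].ref);
```

Because the comparison runs against the stored documents and not the index, the category value keeps its original casing.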


Published at DZone with permission of Gary Sieling, DZone MVB. See the original article here.
