
Using the New York Times API to Chart Occurrences in Headlines

This weekend, while at a conference, I discovered that the New York Times has a pretty deep developer API. It covers their newspaper data as well as government and regional information. I played around with it a bit and found it very easy to use via JavaScript (I don't think they document it, but they support CORS, so you can skip JSONP), and I thought I'd try to build a little experiment. What if we could use the API to chart the number of times a keyword appeared in headlines over time?

I started by looking at the API documentation for article searching and discovered that it lets you narrow your search term to a specific part of the article. My first example simply searched the entire database for a term in the headline.

$.get("http://api.nytimes.com/svc/search/v2/articlesearch.json", {
    "api-key":API,
    sort:"oldest",
    fq:"headline:(\""+term+"\")",
    fl:"headline,snippet,multimedia,pub_date"}, function(res) {
    console.dir(res.response);
  }, "JSON");

Fairly simple. I also limited the fields returned in each result (the fl parameter). I'm not currently displaying any actual result data, just aggregate data, but I kept a few key fields that I thought I might use later. The important thing, though, is the response object. It contains a "meta" key that holds the total number of results in its "hits" value.
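As a small illustration of reading that total, here is a minimal sketch. The helper name getHitCount is hypothetical (it is not part of the original code), and the sample object only mimics the response.meta.hits shape used later in this post; the numbers are made up.

```javascript
// Hypothetical helper (not from the original post): pull the total hit
// count out of an Article Search response, or return 0 when the expected
// structure is missing.
function getHitCount(res) {
  if (res && res.response && res.response.meta &&
      typeof res.response.meta.hits === "number") {
    return res.response.meta.hits;
  }
  return 0;
}

// Sample object shaped like the response described above; values are made up.
var sample = { response: { meta: { hits: 42 } } };
console.log(getHitCount(sample)); // 42
console.log(getHitCount({}));     // 0
```

Returning 0 for a malformed response is a design choice for this sketch; a real app might prefer to surface the error instead.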

OK, that's halfway there. The next step was to limit the results to one calendar year. Again, the API makes this pretty darn simple. I wrote a function that accepts a year and a search term and performs the search.

function fetchForYear(year, term) {
  //YYYYMMDD
  var startYearStr = year + "0101";
  var endYearStr = year + "1231";
  console.log('doing year '+year);
  
  return $.get("http://api.nytimes.com/svc/search/v2/articlesearch.json", {
    "api-key":API,
    sort:"oldest",
    begin_date:startYearStr,
    end_date:endYearStr,
    fq:"headline:(\""+term+"\")",
    fl:"headline,snippet,multimedia,pub_date"}, function(res) {
      //Ok, currently assume a good response
      //todo - check the response
      //console.dir(res.response);
      totalDone++;
  }, "JSON");
  
}
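The parameter-building part of fetchForYear can also be pulled out into a pure function, which makes it easy to check without hitting the network. This is only a sketch: buildYearParams is a hypothetical name, and stripping double quotes from the term is a simplification assumed here to keep the quoted phrase well-formed, not documented NYT behavior.

```javascript
// Hypothetical helper: build the query parameters for one calendar year.
// Stripping double quotes from the term is an assumption made for this
// sketch so the quoted phrase inside fq cannot be broken by user input.
function buildYearParams(year, term, apiKey) {
  var safeTerm = String(term).replace(/"/g, "");
  return {
    "api-key": apiKey,
    sort: "oldest",
    begin_date: year + "0101", // YYYYMMDD, first day of the year
    end_date: year + "1231",   // YYYYMMDD, last day of the year
    fq: 'headline:("' + safeTerm + '")',
    fl: "headline,snippet,multimedia,pub_date"
  };
}

var params = buildYearParams(1999, 'cold "war"', "YOUR-KEY");
console.log(params.begin_date); // 19990101
console.log(params.fq);         // headline:("cold war")
```

The returned object could be passed as the second argument to $.get in place of the inline literal.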

Note that I'm returning the promise generated by jQuery's Ajax handler. Why? Well, I need to perform a large number of Ajax calls. To know when they are all done, I can collect the promises in an array and wait for all of them to finish. Promises are seriously cool for stuff like this.

function search() {
  var term = $search.val();
  if(term === '') return;
  totalDone = 0;
  
  console.log("Searching for "+term);

  var promises = [];
  
  //Gather up all the promises...
  for(var i=START_YEAR; i<=END_YEAR; i++) {
    promises.push(fetchForYear(i,term));  
  }
  
  //And when done, let's graph this!
  $.when.apply($, promises).done(function() {
    console.log('DONE');
    console.dir(arguments);
  });
  
}
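One detail about the arguments logged in the done handler: when $.when receives several Ajax promises, each argument is an array of [data, textStatus, jqXHR], but when it receives exactly one, the callback gets data, textStatus, and jqXHR directly. A minimal sketch of smoothing over that difference follows; normalizeWhenResults is a hypothetical helper, and the sample inputs below only simulate what jQuery would pass.

```javascript
// Hypothetical helper: turn the `arguments` of a $.when(...).done() callback
// into a plain array of response bodies.
// args: array-like arguments object; promiseCount: how many promises were passed.
function normalizeWhenResults(args, promiseCount) {
  if (promiseCount === 0) return [];
  if (promiseCount === 1) {
    // Single Ajax promise: args is (data, textStatus, jqXHR)
    return [args[0]];
  }
  // Several Ajax promises: each entry is [data, textStatus, jqXHR]
  return Array.prototype.slice.call(args).map(function (entry) {
    return entry[0];
  });
}

// Simulated callback arguments (no jQuery needed to try this out).
var one = normalizeWhenResults([{ id: "a" }, "success", {}], 1);
var two = normalizeWhenResults(
  [[{ id: "a" }, "success", {}], [{ id: "b" }, "success", {}]], 2);
console.log(one.length, two.length); // 1 2
```

Inside a real handler this would be called as normalizeWhenResults(arguments, promises.length).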

This version of search() worked perfectly. I tested it on a bunch of calls, and it correctly handled waiting for them all to finish and gave me easy access to all the results. Then I ran into a problem. I modified my data to go from 10 calls to 100, and I hit the NYT's API rate limiter. For some silly reason (and honestly, this is really stupid) the NYT refuses to tell you what their rate limits are until you sign up and make an app. Now, it is free to sign up, so it doesn't cost you anything, but there is no valid reason for this! If you provide an API, be up front and direct about what your limits are. Ok, rant done.

Their limit is 10 calls per second. This threw me for a loop. I'm aware of code that will limit the number of calls per time period, but these methods kill (as far as I know) the "extra" calls. I wasn't able to find a library that said, "If I give you X in time period Y and the limit is some smaller number, just push the rest till after Y." A follower on Twitter recommended simply building a recursive function and adding a timeout, so that's the approach I took.

I built a function called processSets, which basically treats my desired data as a set of ... well, sets. Given that I want 100+ years of data, and that the NYT wants no more than 10 hits per second, I thought doing 10 hits at once and then recursively calling the function again in a timeout might work. I got this working, although I had to use a few global variables, and that feels a bit dirty. I should really abstract this out into a module. For now though, it works. Here is my current solution.

/*
Given an array of data, I process X items async at a time.
When done, I see if I need to do more, and if so, I call it in
Y milliseconds. The idea being I do chunks of async requests with
a 'pad' between them to slow down the requests.
*/
var globalData;
var searchTerm;
var currentYear = START_YEAR;
var PER_SET = 10;
function processSets() {
  var promises = [];
  for(var i=0;i<PER_SET;i++) {
    var yearToGet = currentYear + i;
    if(yearToGet <= END_YEAR) {
      promises.push(fetchForYear(yearToGet,searchTerm));
    }
  }
  $.when.apply($, promises).done(function() {
    console.log('DONE with Set '+promises.length);
    
    //update progress
    //the range is inclusive, so there are END_YEAR-START_YEAR+1 years in total
    var percentage = Math.floor(totalDone/(END_YEAR-START_YEAR+1)*100);
    $progress.text("Working on data, "+percentage +"% done.");
    
    //massage into something simpler
    // handle cases where promises array is 1
    if(promises.length === 1) {
      var toAddRaw = arguments[0];
      globalData.push({
        year:currentYear,
        results:toAddRaw.response.meta.hits
      });     
    } else {
      for(var i=0,len=arguments.length;i<len;i++) {
        var toAddRaw = arguments[i][0];
        var year = currentYear+i;
        globalData.push({
          year:year,
          results:toAddRaw.response.meta.hits
        });
      }
    }
    currentYear += PER_SET;

    //Am I done yet?
    if(currentYear <= END_YEAR) {
      setTimeout(processSets, 900);
    } else {
      $progress.text("");
      render(globalData); 
    }
  });

}

function search() {
  var term = $search.val();
  if(term === '') return;
  totalDone = 0;
  
  console.log("Searching for "+term);

  globalData = [];
  searchTerm = term;
  //reset so that a second search starts again from the first year
  currentYear = START_YEAR;
  $progress.text("Beginning work...");
  processSets();
}
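For comparison, here is a rough sketch of the same "sets with a pad" idea without global variables. It is not the code used in the demo: chunkYears and fetchInSets are hypothetical names, native Promise support is assumed, and fetchOne stands in for whatever function fetches a single year and resolves with its hit count.

```javascript
// Split [startYear..endYear] into consecutive groups of at most `size` years.
function chunkYears(startYear, endYear, size) {
  var sets = [];
  for (var y = startYear; y <= endYear; y += size) {
    var set = [];
    for (var i = y; i < y + size && i <= endYear; i++) set.push(i);
    sets.push(set);
  }
  return sets;
}

// Run fetchOne(year) for every year, one set at a time, pausing padMs
// between sets. Resolves with [{year, results}, ...] in year order.
// fetchOne is caller-supplied and should return a hit count or a promise of one.
function fetchInSets(startYear, endYear, size, padMs, fetchOne) {
  var sets = chunkYears(startYear, endYear, size);
  var collected = [];

  function next(index) {
    if (index >= sets.length) return Promise.resolve(collected);
    var pending = sets[index].map(function (year) {
      return Promise.resolve(fetchOne(year)).then(function (hits) {
        return { year: year, results: hits };
      });
    });
    return Promise.all(pending).then(function (rows) {
      collected = collected.concat(rows);
      if (index + 1 >= sets.length) return collected; // last set, no pad needed
      return new Promise(function (resolve) {
        setTimeout(resolve, padMs);
      }).then(function () { return next(index + 1); });
    });
  }
  return next(0);
}

console.log(chunkYears(1900, 1924, 10).length); // 3
```

With the functions from this post, a call might look like fetchInSets(START_YEAR, END_YEAR, 10, 900, fn), where fn calls fetchForYear and maps the response to response.meta.hits. Because each row carries its own year, the result no longer depends on a shared currentYear counter.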

Hopefully this makes some sense. Again, I'm sure some of my readers will tear me a new one for how I did this, but this is version 0, so be nice. ;)

The final part of the puzzle was rendering the result. I started off trying to use Chart.js, but I was unable to get it to render a bar/line chart whose X-axis had fewer ticks than my dataset had points. I then switched to Highcharts. I was able to find a demo pretty darn close to what I wanted, but it was a bit of a struggle to understand some of their documentation. In fact, it took me almost as long to get the chart right as it did to get past my "10 per second" issue. Either Highcharts is very complex or I was very tired, but I was able to get it working.

One very cool part of their docs is that they link to JSFiddle. In fact, this is how I finally solved my issue. I used one of their fiddles, modified it, and figured out what I needed to do. The final result is impressive, I think. Remember that I'm no designer, so this could probably be done better, but I really like the look.
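The render function itself is not shown in this post, so the following is only a guess at one way to feed the collected data to Highcharts. buildChartOptions is a hypothetical helper, and the titles and labels are invented for illustration. Passing [year, hits] pairs gives the chart a numeric X-axis, which lets Highcharts choose its own tick spacing rather than labeling every year.

```javascript
// Hypothetical helper: convert [{year, results}, ...] into a Highcharts
// options object. The [x, y] pairs give the chart a numeric x-axis, so it
// can pick its own tick spacing instead of labeling every year.
function buildChartOptions(data, term) {
  var points = data
    .slice() // copy so the caller's array is not reordered
    .sort(function (a, b) { return a.year - b.year; })
    .map(function (row) { return [row.year, row.results]; });
  return {
    chart: { type: "line" },
    title: { text: 'Headlines mentioning "' + term + '"' },
    xAxis: { title: { text: "Year" }, allowDecimals: false },
    yAxis: { title: { text: "Headlines" }, min: 0 },
    series: [{ name: term, data: points }]
  };
}

// Made-up sample data, deliberately out of order.
var options = buildChartOptions(
  [{ year: 1902, results: 3 }, { year: 1901, results: 1 }], "war");
console.log(options.series[0].data[0][0]); // 1901
```

Assuming a page that has loaded Highcharts and contains an element with the id container (both assumptions), the object could be handed to Highcharts.chart("container", options).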

First, an example of a search for "Internet".

And here is a search for "War".

Forgive the typo in the header - apparently my spelling skills are only slightly worse than my ability to dress well. So - want to play with this? I've attached a zip of my code to this blog entry, minus my API key. I'm also going to link to a demo, but please note that I only have 10K API hits allowed per day. My average blog post gets about 1000 page views, and if you all try it, it will quickly expire. I certainly don't fault the NYT for that, but keep it in mind if you try it. I don't have good error reporting (i.e. "any") so check the console. Because I'm assuming I will hit the limit, here is a video of it in action.

Published at DZone with permission of Raymond Camden, DZone MVB. See the original article here.
