The AI revolution has been in full swing in the news lately.
Stephen Hawking, Elon Musk, and Bill Gates fear it; Google is using it to beat the world’s best players at Go; and Mark Zuckerberg is busy telling everyone just how close his AI researchers sit to his own desk. And, of course, we all remember when IBM’s Watson beat its human competition on Jeopardy!
What is most interesting about this AI revolution to me as a developer is that it is being exposed as an API-driven service. Google provides a Prediction API, while Watson is offered as part of IBM’s Bluemix platform.
I have dabbled with projects like Weka, and while I found that the accompanying book was written at a level I could understand, the whole idea of integrating machine learning algorithms into something practical seemed like too much work for not enough reward. But having an API means that all the grunt work of actually configuring and running these algorithms is done for me.
In my case, I wanted to remove the grunt work of scanning a list of blog posts for content that was suitable for syndication. It had the potential to remove a few hours of work from my week, while ensuring that I would not miss a hidden gem in my sea of RSS feeds.
This is what I learned.
Find Existing Examples of Solutions
My initial thought was to attempt to match page content to page views. I’m not the first one to think of this:
To be blunt, I don’t know if it’s even possible to do this accurately and easily. I suspect it isn’t. The main problem is that headlines are far from the only thing that influences the number of pageviews an article sees. I suspect that the quality of an article and its newsworthiness have a lot to do with it, too. I also haven’t been able to find other people or companies trying to do this, which isn’t a terribly good sign.
Eventually, I realized that this problem was really just a spam filter. I wasn’t actually interested in how popular content would be — all I wanted to know was that it was similar to the kind of content that was already chosen for syndication.
A spam filter is essentially just a binary classification: Spam or Ham, or in my case, Syndicate or Ignore. And spam filters are very good these days.
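To make the Syndicate-or-Ignore idea concrete, here is a minimal, hand-rolled multinomial naive Bayes classifier, the same family of algorithm behind classic spam filters. This is an illustrative sketch, not the hosted API I actually used: the training snippets and labels below are invented stand-ins for past syndication decisions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of letters/apostrophes as tokens.
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Bare-bones multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()

    def train(self, text, label):
        counts = self.word_counts.setdefault(label, Counter())
        for word in tokenize(text):
            counts[word] += 1
            self.vocab.add(word)
        self.doc_counts[label] += 1

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label, counts in self.word_counts.items():
            # Log prior plus smoothed log likelihood of each word.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(counts.values())
            for word in tokenize(text):
                score += math.log(
                    (counts[word] + 1) / (total_words + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented examples standing in for previously classified posts.
clf = NaiveBayes()
clf.train("deep dive into docker container networking", "Syndicate")
clf.train("step by step guide to kubernetes deployments", "Syndicate")
clf.train("our office party photos from last friday", "Ignore")
clf.train("welcome to our new marketing intern", "Ignore")
print(clf.classify("a practical guide to docker networking"))  # Syndicate
```

The point of the sketch is how little machinery a binary text classifier needs; a hosted API does the same kind of thing, just at a scale and accuracy I don’t have to maintain myself.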
My first lesson in machine learning: it’s easy to get caught up in the hype and believe these algorithms will give you magic answers. More often, what you actually want is a very simple, well-understood analysis, and there are good examples you can emulate.
Machine Learning is 90% Data Preparation
Where is the “export my data in a useful format” button? Nowhere, that’s where.
While accessing machine learning algorithms through a RESTful API is incredibly easy, it means nothing if you don’t have the raw data to analyse. And often the data you need is spread across multiple systems that you probably don’t have the permissions to access.
Be prepared to write a lot of code to query databases, scrape web pages, traverse REST APIs, and sync directories, because the data you need will never be easy to find, collate, and export.
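To give a flavour of that grunt work, here is a sketch of one small piece of it: flattening an RSS 2.0 feed into (title, plain text) pairs that a classifier could consume, using only the Python standard library. The sample feed is invented for the example; real feeds need far more defensive handling (namespaces, encodings, missing fields, malformed markup).

```python
import re
import xml.etree.ElementTree as ET

def strip_html(fragment):
    """Crude tag stripper; fine for training data, not for display."""
    return re.sub(r"<[^>]+>", " ", fragment)

def items_from_rss(rss_xml):
    """Flatten an RSS 2.0 document into (title, plain_text) pairs."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        # Collapse whitespace left behind by the stripped tags.
        text = " ".join(strip_html(description).split())
        items.append((title, text))
    return items

# A made-up single-item feed standing in for a real blog's RSS.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example Blog</title>
<item><title>Docker networking deep dive</title>
<description>&lt;p&gt;How containers talk to each other.&lt;/p&gt;</description></item>
</channel></rss>"""

items = items_from_rss(SAMPLE_FEED)
print(items[0][0])  # Docker networking deep dive
```

Multiply this by every feed, database, and internal system your data lives in, and the 90% figure starts to look conservative.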
Where to Go From Here
My expectations surrounding AI have been tempered significantly by my attempts to use these platforms. It is clear to me that they have a long way to go before anyone has to worry about their day job being replaced by a machine.
The flip side is that when you do find a manual task that matches the strengths of these algorithms, you may find that the effort required to prepare your data is more than repaid by no longer having to perform tedious, manual decision-making.
There is a reason why the big players in computing are putting a lot of effort into AI. When a few million average developers like me see these APIs and start thinking 'I wonder how that could make my life easier,' you know you are part of a revolution.