When Traditional Programming Meets Machine Learning

Learn how to use Java-based algorithms and collaborative filtering to create a machine learning model that suggests books to readers.


In this post, we are going to develop an autocompletion component using the trie data structure and collaborative filtering to choose the best book title suggestions for a user. It is interesting to note that algorithms, data structures, and machine learning all work together towards the final solution. The full code and a working application are provided together with results.

Problem Formulation

What we want to build from a high-level perspective is an autocomplete field so that when we type some characters, it suggests book titles that start with those characters.

  • From a GUI perspective, what is required is a TextField or ComboBox that displays a list of options provided by some service like findTitlesThatStartWith(char[] ch). Fortunately, such GUI components already exist in Swing (and in JavaScript or jQuery). Building a GUI autocomplete component is not the focus of this post, although it can sometimes be a great challenge in itself.
  • On the other hand, implementing findTitlesThatStartWith(char[] ch) is of more interest, as it gives us an opportunity to optimize what is offered to customers from a data perspective. There could be a long list of titles starting with particular characters, so we return only a limited number of them. What goes into that short list needs to make as much sense from a user perspective as possible (a rough sketch of such a service follows this list).
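As a rough sketch, the service behind the GUI component could look something like the code below. The interface name, the String-based signature, and the limit parameter are illustrative assumptions, not the application's actual code.

import java.util.List;

// Hypothetical service the autocomplete component calls on every keystroke.
// Only a limited number of titles is returned so the drop-down stays usable.
public interface TitleSuggestionService {

    // Returns at most 'limit' book titles that start with the characters
    // the user has typed so far.
    List<String> findTitlesThatStartWith(String prefix, int limit);
}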

There are various options to return the suggestions:

  1. Sort the list by some criteria (alphabetic order) and return only the top 10 (or any number that makes sense).
  2. Keep count of how many times titles are picked up by users and only show the top 10 titles with the highest count.
  3. Show the top 10 most-rated titles by users.
  4. Show the top 10 that are of most interest based on current user preferences.
Once we have clarified, at a high level, what the service will return, it's time to explore how it will search for the titles in a fairly large collection of titles.

Again, there are various options:

  • We search the whole list/array and, for each title, see whether it starts with those characters or not:
// Linear scan: check every title against the characters entered so far.
for (String title : allTitles) {
    if (title.startsWith(charsEnteredByUser)) {
        options.add(title);
    }
}

If N is the size of the list and k is the length of the words, we need θ(N*k) time to search. Inserting a new title takes constant time (θ(1)), although adding new titles happens fairly rarely.

  • Since this is a search problem, a HashTable may be an option because of its very fast constant-time access and insertion (θ(1)). Unfortunately, HashTables can only look up entire word matches, not prefix matches (i.e. titles that start with...).
  • Similarly, we may consider a well-balanced binary tree, as it gives us θ(log(N)) for searching and inserting, where N is the number of titles. Again, binary trees are not helpful because they find an exact match rather than a prefix match.
  • Fortunately, there is an existing data structure ready for finding prefix matches: the trie. The great thing about this data structure is that, with a small modification, it gives you a search time complexity of θ(k), where k is the length of the prefix. Yes, there is a catch: you may need a lot more storage.

Tries

In this section, we will explore how tries can help with searching for a prefix match in a list of titles (words). Tries are fairly easy to understand once you understand how the words are inserted:

We insert each character of a word into a separate node, creating new nodes only where the characters don't already exist and re-using existing ones. We also mark the end of each word with a special sign so that, later on, we know when a full word has been reached.
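A minimal sketch of this insertion logic in Java might look as follows. The class and field names here are illustrative; the article's actual Trie class may differ.

import java.util.HashMap;
import java.util.Map;

// Minimal trie sketch: one node per character, children kept in a HashMap.
class TrieNode {
    Map<Character, TrieNode> children = new HashMap<>();
    boolean endOfWord; // the "special sign" marking that a full word ends here
}

class Trie {
    private final TrieNode root = new TrieNode();

    // Insert a word character by character, reusing existing nodes
    // and creating new ones only where the path does not exist yet.
    void insert(String word) {
        TrieNode current = root;
        for (char c : word.toCharArray()) {
            current = current.children.computeIfAbsent(c, k -> new TrieNode());
        }
        current.endOfWord = true;
    }
}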

Let's see how we can search for titles starting with "te":

When searching, we start at the root and look among its immediate children for a match on our first character (t). When a node matching the character is found, we treat it as the root and continue looking among its direct children for the next character (e). This logic continues until there are no characters left in the prefix. At that point, the suggestion list is the subtree below our last matched node. We traverse all its subtrees and add words whenever we reach the end-of-word sign.
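Continuing the Trie sketch above, a hedged version of this search might look as follows. The method names findWordsStartingWith and collect are illustrative, and both methods are assumed to live inside the Trie class sketched earlier (with the usual java.util imports).

// Walk down the trie for each character of the prefix, then collect
// every complete word in the subtree under the last matched node.
List<String> findWordsStartingWith(String prefix) {
    TrieNode current = root;
    for (char c : prefix.toCharArray()) {
        current = current.children.get(c);
        if (current == null) {
            return new ArrayList<>(); // no title starts with this prefix
        }
    }
    List<String> results = new ArrayList<>();
    collect(current, new StringBuilder(prefix), results);
    return results;
}

// Depth-first traversal: extend the path character by character and
// record a word whenever the end-of-word sign is reached.
private void collect(TrieNode node, StringBuilder path, List<String> results) {
    if (node.endOfWord) {
        results.add(path.toString());
    }
    for (Map.Entry<Character, TrieNode> child : node.children.entrySet()) {
        path.append(child.getKey());
        collect(child.getValue(), path, results);
        path.deleteCharAt(path.length() - 1);
    }
}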

You may be thinking, "Not so fast! The complexity there is not θ(k), where k is the length of the prefix!" Indeed, the complexity is rather θ(k+M), where k is the length of the prefix and M is the size of the suggestion list, i.e. the subtree under the last matched node (immediate children are kept in a HashTable, so constant time is needed to look up character matches). Either way, we need to traverse the subtree to collect the suggested words/titles, and if this list has many results, it can considerably slow down the algorithm.

Of course, it's better than θ(k*N), where k is the length of the prefix and N is the size of the whole list. But can we do better than that? Well, we can slightly augment the nodes to store more information than just the character, as below:

As you've surely noticed by now, we also store the word being inserted in each node alongside the character (in practice, a reference to the word). Step by step, each node ends up with a list of the words whose insertion path passed through it.

This modification greatly helps us avoid going down all the subtrees under the last matched node, since that node already has the list of the words its subtree contains. Let's see below how the search looks now:

The difference with this solution is that when we reach the last node, the list of words starting with the prefix is already there, so there is no need for a subtree traversal. Therefore, the complexity is now θ(k), where k is the length of the prefix.
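A hedged sketch of that augmentation is shown below: each node additionally keeps the list of words whose insertion path passed through it, so the search simply returns that list once the prefix is consumed. The class and field names (SuggestTrieNode, suggestions) are illustrative, not the article's actual code.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Augmented trie node: besides its children, every node remembers
// the words whose insertion path passed through it.
class SuggestTrieNode {
    Map<Character, SuggestTrieNode> children = new HashMap<>();
    List<String> suggestions = new ArrayList<>();
}

class SuggestTrie {
    private final SuggestTrieNode root = new SuggestTrieNode();

    void insert(String word) {
        SuggestTrieNode current = root;
        for (char c : word.toCharArray()) {
            current = current.children.computeIfAbsent(c, k -> new SuggestTrieNode());
            current.suggestions.add(word); // every node on the path remembers the word
        }
    }

    // θ(k) lookup: walk down k characters and return the precomputed list.
    List<String> findWordsStartingWith(String prefix) {
        SuggestTrieNode current = root;
        for (char c : prefix.toCharArray()) {
            current = current.children.get(c);
            if (current == null) {
                return new ArrayList<>();
            }
        }
        return current.suggestions;
    }
}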

Final Change

There is one final small trick before the algorithm is ready to be implemented. Titles are usually sentences rather than single words. It will not be very useful to match only the beginning of the title because, for example, a lot of titles start with "The" or "A," and typing one of those words narrows the list down very little.

The solution is easy! We just insert each word of the title separately into the tree and save the full title sentence in the node suggestion lists. Now, instead of only having word suggestions, each node has a list of sentences. In this way, we can search by middle words and at the same time be able to suggest the complete title sentences. The code is fairly short (50 lines), so please feel free to have a look (Trie and TrieTest). A rough sketch of this change is shown below.
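Building on the augmented trie sketched above, the final change could look roughly like the method below. The name insertTitle is illustrative; the article's Trie and TrieTest classes are the reference implementation.

// Assumed to be a method of the augmented trie sketched above: every word
// of the title gets its own path, but each node stores the full title.
void insertTitle(String title) {
    for (String word : title.split("\\s+")) {
        SuggestTrieNode current = root;
        for (char c : word.toCharArray()) {
            current = current.children.computeIfAbsent(c, k -> new SuggestTrieNode());
            current.suggestions.add(title); // the whole sentence, not just the word
        }
    }
}

With this in place, typing a prefix of any word in the title, not just the first one, returns the complete title sentences stored along that path.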

Recommender System

We can show only a limited number of suggestions, so when it comes to choosing which suggestions to show to the user, I think the best option is to show what is most relevant or closest to the user's interests. This leads us to recommender systems.

A recommender system suggests information based on user preferences and data trends. The main advantage of these systems is that they learn automatically and get to know more about users' preferences. Basically, the more users interact with the system (i.e. like or click particular books or movies), the better (i.e. closer to the user's interest) the suggestions the system is going to make. In a previous post, we explained in detail how this is achieved using the collaborative filtering algorithm, and an application was built for suggesting movies based on user ratings. In this post, we are going to implement the same algorithm for suggesting books instead of movies.
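Since the application uses Spark MLlib's collaborative filtering, a minimal sketch of training such a model on book ratings might look like the following. The CSV file name, integer book ids, and parameter values are simplifying assumptions for illustration, not the application's actual code.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;

public class BookRecommenderSketch {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("BookRecommender").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical input file with one "userId,bookId,rating" triple per line.
        JavaRDD<Rating> ratings = sc.textFile("ratings.csv").map(line -> {
            String[] parts = line.split(",");
            return new Rating(Integer.parseInt(parts[0]),
                              Integer.parseInt(parts[1]),
                              Double.parseDouble(parts[2]));
        });

        // Train collaborative filtering with 50 latent features.
        int rank = 50;
        int iterations = 10;
        double lambda = 0.01;
        MatrixFactorizationModel model =
                ALS.train(JavaRDD.toRDD(ratings), rank, iterations, lambda);

        // Recommend the top 10 books for one user (user id 1 is illustrative).
        Rating[] top10 = model.recommendProducts(1, 10);
        for (Rating r : top10) {
            System.out.println("book " + r.product() + " predicted rating " + r.rating());
        }

        sc.stop();
    }
}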

Data

Thanks to this source for providing enough data to build a meaningful algorithm. The dataset is quite big: approximately 271,000 books, 1.1 million ratings, and 279,000 users.

Application

The application can be downloaded and executed without any knowledge of Java (although Java must be installed on your computer). You can run the application from the source by simply executing the RUN class or, if you do not want to open it in an IDE, by running mvn exec:java.

You can try it out by rating some books (please note that if no books are rated first, no suggestions will be made) and then typing in the field to get autocomplete suggestions. Feel free to play around (50 features do not take much time to train) and notice how the algorithm adapts to your preferences. With 50 features, it will take no more than 30 seconds to give you suggestions with predicted ratings, along with the error of the algorithm (mean squared error). You can also increase the rating data size up to 1,149,000 ratings, but please be aware that the training process will slow down.

The application was built using Swing for the GUI and Spark MLlib for the collaborative filtering algorithm. After running it, the screen below will show up:

(Screenshot of the running application.)

And that's it!


Topics:
ai, machine learning, tutorial, algorithm, personalization, recommendation, tries, collaborative filtering, data structure

