AI/ML Resources

Maze-Walking Algorithm

Maze-walking algorithms seem straight forward enough. Author Nicolas Frankel "walks" us through his experience of attempting to implement some.

April 13, 2016

by Nicolas Fränkel

· 4,780 Views · 1 Like

C++11 Threads, Affinity, and Hyperthreading

C and C++ standards for years treated concurrency and multi-threading as outside the standard. Here's a bit of background, intro, and experimentation related to threads, hyperthreading, and affinity.

March 20, 2016

by Eli Bendersky

· 28,117 Views · 8 Likes

Comparison of Asynchronous Messaging Technologies: JMS, AMQP, and MQTT

A comparison of three different IoT messaging protocols by DZone user Chanaka Fernando.

March 17, 2016

by Chanaka Fernando

· 50,711 Views · 17 Likes

How to Detect and Analyze DDOS Attacks Using Log Analysis

Running anything online leaves you vulnerable to attack. In this post, we go into depth on an attack at Linode that lasted 10 days.

March 11, 2016

by Jennifer Marsh

· 19,001 Views · 6 Likes

Making the Impossible Possible with Tachyon: Accelerate Spark Jobs from Hours to Seconds

Barclays Data Scientist Gianmario Spacagna and Harry Powell, Head of Advanced Analytics, describe how they iteratively process raw data directly from the central data warehouse into Spark and how Tachyon is their key enabling technology.

February 16, 2016

by Henry Powell

· 38,114 Views · 8 Likes

Enable Docker Remote API on Docker Machine on Mac OS X

Check out how to enable the Docker Remote REST API on a Docker Machine running on Mac OS X in this neat tutorial!

February 12, 2016

by Arun Gupta

· 9,683 Views · 2 Likes

LogPacker: A New Log Management Platform

Interested in log management? Check out LogPacker, a new log management platform! Neat features include scanning, aggregation, clustering, and more!

January 30, 2016

by Vladislav Chernov

· 5,538 Views · 4 Likes

Distributed Systems in the Real World

A non-academic’s guide to making good design decisions when building distributed applications.

January 6, 2016

by Tyler Treat

· 19,765 Views · 16 Likes

Deploying a Java App with an Oracle Database in a Docker Container

Are containers good for the enterprise? Yep. Is Oracle used everywhere for enterprise DBMS? Yes, all the numbers say. Learn how to deploy a Java app with two existing Oracle XE databases using Docker.

October 29, 2015

by Amjad Afanah

· 40,947 Views · 9 Likes

Getting Started With Couchbase Using Docker

This blog will explain how you can easily start Couchbase Server 4.0 as a Docker image.

October 23, 2015

by Arun Gupta

· 10,881 Views · 11 Likes

The Basics of Scaling Java EE Applications

How to understand how scaling Java applications works, as well as the different kinds of scaling and specific techniques.

October 21, 2015

by Abhishek Gupta

CORE

· 50,267 Views · 42 Likes

Implementing Simple Sort Algorithms in ARM Assembly

Create your own basic algorithms using assembly for your Raspberry Pi

September 8, 2015

by Kevin Hooke

· 11,815 Views · 1 Like

Creating a RabbitMQ Cluster on a Single Machine

Learn more about installing a cluster on a single machine and how to add more nodes to your cluster.

August 24, 2015

by Alex Theedom

· 7,605 Views · 3 Likes

How to Make a Snake Game in Swing

Learn to create a full Snake game in Swing using width and height constraints and the ImageIcon class.

July 30, 2015

by Das Nic

· 25,062 Views · 1 Like

Design Patterns in Automated Testing

Learn how to make your test automation framework better through Page Objects, Facades, and Singletons.

July 13, 2015

by Anton Angelov

· 81,143 Views · 7 Likes

60 Most Commonly Used R Packages in R Programming Language

A comprehensive list of 60 most commonly used R packages for data science and analytics.

July 6, 2015

by Ajitesh Kumar

· 10,488 Views · 2 Likes

Cornerstone OnDemand and TED join forces to spark innovation in professional learning and development

Curated TED Talk playlists integrated within Cornerstone Learning enable organisations to instantly access new, innovative ideas and share knowledge across their workforce June 30, 2015 - Cornerstone OnDemand, a global leader in cloud-based talent management software solutions, today announced the company is teaming with TED, the non-profit global community devoted to spreading ideas, to deliver curated TED Talks to Cornerstone clients for a new, innovative approach to professional learning and development. The first and only collaboration of its kind, Cornerstone clients now have the ability to provide their workforce with modern, mobile-enabled TED Talks from world-class leaders at the forefront of their fields from within Cornerstone Learning. Cornerstone's collaborative learning functionality also allows organisations to enable peer-to-peer knowledge capture and discussions that can extend the learning impact of TED Talks. Watched and listened to more than 1 billion times this year, TED Talks introduce ideas that can help companies transform how their people think and work. Cornerstone clients will have access to a series of curated TED Talk playlists designed to address key business challenges in an innovative format that is unique, powerful and inspiring. With curated TED Talk playlists through Cornerstone: Inspire your workforce. TED brings together the world's most inspiring and ingenious people whose ideas can strengthen how professionals understand and think about the world around them. TED's curation of talks on behalf of Cornerstone can help organisations generate excitement and engagement among employees, help management crystallise goals, start important conversations, and spark collaborations. Provide the best, most relevant content. Organisations will gain access to the very best collections of TED Talks across a wide range of topics that are central to innovation and talent development, including change management, culture building, leadership, technology, globalisation, diversity and design. Playlists have been curated to reflect talks from visionary leaders across the most influential industries, such as healthcare, education, technology, manufacturing, finance and more. Amplify the value of your learning and development strategy. Employees can view TED Talks from within Cornerstone Learning, the global learning management system (LMS) for over 1,800 leading organisations. Integrating TED Talks into professional development curriculum allows organisations to inspire each individual employee at any stage in their career. Organisations can easily target and deliver learning and development to groups or individuals with the support of TED Talks and measure impact on workforce development from within Cornerstone. Watch and Share Instantly on Mobile: As smartphones emerge as the leading platform for watching video and Web content among busy professionals, TED Talks allow employees to consume and share content on their mobile devices while on the go. Comments on the News "TED Talks are brilliantly crafted and make an emotional connection with viewers. Their ability to convey innovative and complex ideas through powerful, first-person stories is the type of talent management content that can inspire and drive real change in the workforce," said Kirsten Helvey, senior vice president, client success, Cornerstone OnDemand. "We are dedicated to helping people reach their potential by providing our clients with the most innovative talent management solutions that support their professional development and training initiatives." "With the growing demand from companies for TED Talks, Cornerstone provides TED with the expertise and efficiency in reaching millions of learners in organisations across the globe that can benefit from our content," said Deron Triff, TED's director of global distribution and licensing. "This collaboration also provides TED with an important opportunity to understand how the talks can be utilised for professional development to strengthen how we collaborate with the business community. Cornerstone will be a great alliance for bringing TED Talks to companies and sparking innovation among their employees." Additional Resources Learn more about curated TED Talk playlists for Cornerstone via the Cornerstone Marketplace: marketplace.csod.com/#/content/90 Read additional commentary by Cornerstone's director of talent management, Jeff Miller, on the value and influence of TED Talks for empowering today's workforce via the Cornerstone blog: www.cornerstoneondemand.com/blog/how-ted-gets-your-workforce-talking

June 30, 2015

by Fran Cator

· 985 Views

Spark Grows Up and Scales Out

Written by Craig Wentworth. To understand the furor that’s greeted recent vendor announcements around open source analytics computing engine Spark, and some commentary seemingly setting up a Spark versus Hadoop battle, it’s worth taking a moment to recap on what each actually is (and is not). As I covered in last year’s MWD report on Hadoop and its family of tools, when people talk about Apache Hadoop they’re often referring to a whole framework of tools designed to facilitate distributed parallel processing of large datasets. That processing was traditionally confined to MapReduce batch jobs in Hadoop’s early days, though Hadoop 2 brought the YARN resource scheduler and opened up Hadoop to streaming, real-time querying and a wider array of analytical programming applications (beyond MapReduce). Spark has been designed to run on top of Hadoop’s Distributed File System (amongst other data platforms) as an alternative to MapReduce – tuned for real-time streaming data processing and fast interactive queries, and with multi-genre analytics applicability (machine learning, time series, graph, SQL, streaming out-of-the-box). It gets that speed advantage by caching in-memory (rather than writing interim results to disk, as MapReduce does), but with that approach comes a need for higher-spec physical machines (compared with MapReduce’s tolerance for commodity hardware). So, Spark isn’t about to replace Hadoop -- but it may well supplant MapReduce (especially in growing real-time use cases). Those “Spark vs Hadoop” headlines are about as meaningful as one proclaiming “mushrooms vs pizza." Yes, mushroom might be a more suitable topping than, say, pepperoni (especially in a vegetarian use case), but it’ll still be deployed on the same dough and tomato sauce pizza platform. Nobody’s about to suggest the mushroom should go it alone! But what’s behind the headlines and the hype is a story of enterprise adoption – or at least vendors anticipating that adoption and investing in ‘the weaponization of Spark’ as it faces the more exacting standards of security, scaling performance, consistency, etc. which come with mainstream enterprise deployment. Big names like IBM, Databricks (the company formed by the originators of Spark), and MapR made commitments in and around the Spark Summit earlier this month. MapR has announced three new Quick Start Solutions for its Hadoop distribution to help customers get started with Spark in real-time security log analytics, genome sequencing, and time series analytics; Databricks’ cloud-hosted Spark platform (formerly known as Databricks Cloud) has become generally available; and IBM announced a raft of measures designed to give Spark a significant shot in the arm – it’s open sourcing its SystemML technology to bolster Spark’s machine learning capabilities, integrating Spark into its own analytics platforms, investing in Spark training and education, committing 3,500 of its researchers and developers to work on Spark-related projects, and offering Spark as a service on its Bluemix developer cloud. Given the overlap with Databricks’ business model (of offering development, certification, and support for Spark), IBM’s intentions are likely to tread on some toes before long – but for now, at least, both companies are content to focus on the combined push benefiting the Spark community and its enterprise aspirations overall (though clearly IBM’s betting on all this investment buying it some influence over where Spark goes next). It’s worth bearing in mind that not all its supporters champion Spark wholesale and all the interested parties tend to be interested in particular bits of Spark (as wide-ranging as it is) because of overlaps with their own preferred toolsets. For instance, although Spark supports many analytics genres, Cloudera focuses on its machine learning capabilities (as it has its own SQL-on-Hadoop tool in Impala), and MapR and Hortonworks also promote Drill and Hive as their favoured source of SQL-on-Hadoop. IBM’s support is focused on Spark’s machine learning and in-memory capabilities (hence the SystemML open sourcing news). In the face of such strong vendor preferences, how long before some of Spark’s current features fall away (or at least start to show the effects of being starved of as much care and feeding as is bestowed upon vendors’ favourite Spark components)? The Spark community is at much the same place the Hadoop one was at a while back – it’s showing great promise and suitability in key growth workloads (in Spark’s case, such as real-time IoT applications). However, the product as it stands is too immature for many enterprise tastes. Cue enterprise software vendors stepping up to help grow Spark up fast. Their challenge though is to smooth out the edges without smothering what made it so interesting in the first place.

June 28, 2015

by Angela Ashenden

· 2,395 Views

.NET Deployment Tips From New Relic Community Forum Users

[This article was written by Wyatt Lindsay] New Relic’s Community Forum is designed to be a place for our users to share their experiences, questions, problems, and fixes. The collective expertise and creativity of the New Relic community has generated some outstanding solutions to everyday issues, and we want to call out some of them in the area of .NET agent deployment excellence. Basic installation Installing New Relic’s .NET agent is designed to be simple: run the installer on the target host and choose the features you want to include. For Microsoft Azuredeployments, install one of our NuGet packages. Installation requires a reset of IIS for the software to load into your application. To upgrade, we recommend first stopping IIS, installing the newer agent version, and then starting up IIS again. It’s also possible to perform a “silent” (manual) install using msiexec.exe, for example: msiexec.exe /i C:\NewRelicAgent.msi /qb NR_LICENSE_KEY= INSTALLLEVEL=1 See the documentation for complete manual installation options. Leveraging PowerShell Scripting the silent installation provides convenience and flexibility. Here’s an example script provided by community member Jon Carl in this post: $msiName = $licenseKey = $arguments = "/i $msiName /L*v install.log /qn NR_LICENSE_KEY=$licenseKey" if ($msiName -ne $null) { $exitCode = (Start-Process -FilePath "msiexec" -ArgumentList $arguments -Wait -PassThru).ExitCode; if($exitCode -eq 0) { Write-Host "Installation successful!" -ForegroundColor Green } else { Write-Host "Installation unsuccessful. Exitcode: $exitCode" -ForegroundColor Red } } This script works great when run directly on the target machine. Another forum user (Kym McGain) noticed that the installation didn’t complete before the session ended when executing the script remotely. This caused the installer to quit partway through. Kym posted this script that uses a ‘while’ loop to ensure the installer completes. As a bonus, it stops IIS before and restarts it after the software installs. As mentioned above, these steps are usually needed when upgrading. $installNewRelic = { $runProcess = { param($process,$arguments) $res = Start-Process -FilePath $process -ArgumentList $arguments -Wait -PassThru while ($res.HasExited -eq $false) { Write-Host "Waiting for $process..." Start-Sleep -s 1 } $exitCode = $res.ExitCode if($exitCode -eq 0) { Write-Host "$process successful!" -ForegroundColor Green } else { Write-Host "$process unsuccessful. Exitcode: $exitCode" -ForegroundColor Red } } $msiName = $licenseKey = $arguments = "/i $msiName /L*v install.log /qn NR_LICENSE_KEY=$licenseKey" Invoke-Command $runProcess -ArgumentList "IISRESET","/STOP" Invoke-Command $runProcess -ArgumentList "msiexec.exe",$arguments Invoke-Command $runProcess -ArgumentList "IISRESET","/START" } Chef, Puppet, and Chocolatey Deployment options abound for modern Web developers. These solutions often require a known download path and installer name. New Relic offers a consistent filepath and MSI name for the agent in an effort to make automated deployment easier for .NET customers: http://download.newrelic.com/dot_net_agent/release/NewRelicDotNetAgent_x64.msi http://download.newrelic.com/dot_net_agent/release/NewRelicDotNetAgent_x86.msi Several Community members have created packages for these utilities. Chocolatey users are invited to use the following NuGet package created by kireevco: https://chocolatey.org/packages/newrelic-dotnet New Relic community member ePitty built a Puppet module to handle .NET agent deployment: https://github.com/epitty1023/puppet-newrelicappmon Chef users, meanwhile, can check out the following cookbook for .NET and many other platforms New Relic supports: http://community.opscode.com/cookbooks/newrelic New Relic Community member E_Bow wrote a Chef recipe that goes a step further by stopping IIS before the installation and starting it again after completion: #Stop IIS iis_site 'Website' do action [:stop] end # install latest Newrelic agent from web include_recipe 'newrelic::repository' include_recipe node['newrelic']['dotnet-agent']['dotnet_recipe'] license = node['newrelic']['application_monitoring']['license'] windows_package 'Install New Relic .NET Agent' do source node['newrelic']['dotnet-agent']['https_download'] options "/qb NR_LICENSE_KEY=#{license} INSTALLLEVEL=#{node['newrelic']['dotnet-agent']['install_level']}" installer_type :msi action :install end #Start IIS iis_site 'Website' do action [:start] end The author states that they were unable to pull the New Relic license key from the configuration JSON in Chef Overrides, requiring them to modify the config file on each machine and manually enter the key. We invite any Chef experts out there to extend and improve this recipe so that it correctly pulls the license key. We are continually impressed by the smarts and spirit of our New Relic Community Forum members, and jump at the chance to highlight their contributions. Look for more Forum projects in the New Relic blog in the future. Do you have your own approach, tips, or recipes? Please share them in the New Relic Community Forum.

June 26, 2015

by Fredric Paul

· 1,690 Views

Query Autofiltering Revisited -- Let's Be More precise!

In a previous blog post, I introduced the concept of “query autofiltering”, which is the process of using the meta information (information about information) that has been indexed by a search engine to infer what the user is attempting to find. A lot of the information used to do faceted search can also be used in this way, but by employing this knowledge up front or at “query time”, we can answer questions right away and much more precisely than we could without techniques like this. A word about “precision” here – precision means having fewer “false positives” – unintended responses that creep in to a result set because they share some words with the best answers. Search applications with well tuned relevancy will bring the best results to the top of the result list, but it is common for other responses, which we call “noise hits”, to come back as well. In the previous post, I explained why the search engine will often “do the wrong thing” when multiple terms are used and why this is frustrating to users – they add more information to their query to make it less ambiguous and the responses often do not reward that extra effort – in many cases, the response has more noise hits simply because the query has more words. The solution that I discussed involves adding some semantic awareness to the search process, because how words are used together in phrases is meaningful and we need ways to detect user intent from these patterns. The traditional way to do this is to use Natural Language Processing or NLP to parse the user query. This can work well if the queries are spoken or written as if the person were asking another person, as in “Where can I find restaurants in Cleveland that serve Sushi?” Of course, this scenario –which goes back to the early AI days – has become much more important now that we can talk to our cell phones. For search applications like Google with a “box and a button” paradigm, user queries are usually one word or short phrases like “Sushi Restaurants in Cleveland”. These are often what linguists call “noun phrases” consisting of a word that means a person, place or thing (what of who they want to find or where) – e.g. “restaurant” and “Cleveland” and some words that add precision to their query by constraining the properties of the thing they want to find – in this case “sushi”. In other words, it is clear from this query that the user is not interested in just any restaurant – they want to find those that serve raw fish on a ball of rice or vegetable and seafood thingies wrapped in seaweed. The search engine often does the wrong thing because it doesn’t know how to combine these terms – and typically will use the wrong logical or boolean operator – OR when the users intent should be interpreted as AND. It turns out that in many cases now, our search indexes know the difference between Mexican Restaurants (which typically don’t serve Sushi) and Japanese Restaurants (which usually do) because of the metadata that we put into them to do faceted search (Funny story: after posting this, I ran across a Mexican Restaurant in Toms River, New Jersey that does serve sushi – but still, most of them don’t!). The goal of query autofiltering is to use that built in knowledge to answer the question right away and not wait for the user to “drill in” using the facets. If users don’t give us a precise query (like simply “restaurants”), we can still use faceting, but if they do, it would be cool if we could cut to the chase. As you’ll see, it turns out that we can do this. The previous post contained a solution which I called a “Simple” Category Extraction component. It works by seeing if single tokens in the query matched field values in the search index (using a cool Lucene feature that enable us to mine the “uninverted” index for all of the values that were indexed in a field). For example, if it sees the token “red” and discovers that “red” is one of the values of a “color” field, it would infer that the user was looking for things that are “red” in “color” and will constrain the query this way. The solution works well in a limited set of cases, but there are a number of problems with it that make it less useful in a production setting. It does a nice job in cases where the term “red” is used to qualify or more precisely specify a thing – such as “red sofa”. It does not do so well in cases where the term “red” is not used as a qualifier – such as when it is part of a brand or product name such as “Red Baron Pizza” or “Johnny Walker Red Label” (great Scotch, but “Black Label” is even better, maybe I’ll be rich enough to afford “Blue Label” some day – but I digress …). It is interesting to note that the simple extractor’s main shortcomings are due to the fact that it looks at single tokens at a time in isolation from the tokens around it. This turns out to be the same problem that the core search engine algorithms have – i.e., it’s a “bag of words” approach that doesn’t consider – wait for it – semantic context. The solution is to look for patterns of words that match patterns of content attributes. This does a much better job of disambiguation. We can use the same coding trick as before (upgraded for API changes introduced in Solr 5.0), but we need to account for context and usage – as much as we can without having to introduce full-blown NLP which needs lots of text to crunch. In contrast, this approach can work when we just have structured metadata. Searching vs Navigating A little historical background here. With modern search applications, there are basically two types of user activities that are intermingled: searching and navigating. The former involves typing into a box and the latter, clicking on facet links. In the old days, there was a third user interface called an “advanced” search form where users could pick from a set of metadata fields, put in a value and select their logical combination operators– an interface that would be ideally suited for precise searching given rich metadata. The problem is that nobody wants to use it. Not that people ever liked this interface anyway (except those with Master of Library Science degrees), but Google has also done much to demote this interface to a historical reference. Google still has the same problem of noise hits but they have built a reputation for getting the best results to the top (and usually, they do) – and they also eschew facets (they kinda have them at the bottom of the search page now as related searches). Users can also “markup” their query with quotation marks or boolean expressions or ‘+/-’ signs but trust me – they won’t do that either (typically that is). What this means is that the little search box – love it or hate it – is our main entry point – i.e. we have to deal with it, because that is what users want – to just type stuff and then get the “right” answer back. (If poor ease-of-use or the simple joy of Google didn’t kill the advanced search form completely, the migration to mobile devices absolutely will). A Little Solr/Lucene Technology – String fields, Text fields and “free-text” searching: In Solr, when talking about textual data, these two user activities are normally handled by two different types of index field: string and text. String fields are not analyzed (tokenized) and searching them requires an exact match on a value indexed within a field. This value can be a word or a phrase. In other words, you need to use : syntax in the query (and quoted “value here” syntax if the query is multi-term) – something that power users will be OK with but not something that we can expect of the average user. However, string fields are very good for faceted navigation. Text fields on the other hand are analyzed (tokenized and filtered) and can be searched with “freetext” queries – our little box in other words. The problem here is that tokenization turns a stream of text into a stream of tokens (words) and while we do preserve positional information so we can search on phrases, we don’t know a priori where those phrases are. Text fields can also be faceted (in fact, any field can be a facet field in Solr), but in this case, the facets are based on individual tokens which don’t tend to be too useful. So we have two basic field types for text data, one good for searching and one for navigating. In the harder-to-search type, we know exactly where the phrases are but we typically don’t in the easier-to-search type. A classic trade-off scenario. Since string fields are harder to search (at least within the Google paradigm that users love), we make them searchable by copying their data (using the Solr “copyField” directive) into a catchall text field called “text” by default. This works, but in the process we throw away information about which values are meant to be phrases and which are not. Not only that, we’ve lost the context of what these values represent (the string fields that they came from). So although we’ve made these string fields more searchable, we’ve had to do that by putting them into a “bag of words” blender. But the information is still somewhere in the search index, we just need a way to get it back at at “query time”. Then, we can both have our cake AND eat it! Noun Phrases and the Hierarchy of meta information When we talk about things, there are certain attributes that describe what the thing is (type attributes) and others that describe the properties or characteristics of the thing. In a structured database or search index, both of these kinds of attributes are stored the same way – as field/value pairs. There are however, natural or semantic relationships between these fields that the database or search engine can’t understand, but we do. That is, noun phrases that describe more specific sets of things are buried in the relationships between our metadata fields. All we have to do is dig them out. For example, if I have a database of local businesses, I can have a “what” field like business type that has values like “restaurant”, “hardware store”, “drug store”, “filling station” and so forth. Within some of these business types like restaurant, there may be refining information like restaurant type (“Mexican”, “Chinese”, “Italian”, etc) or brand/franchise (“Exxon”, “Sunoco”, “Hess”, “Rite-Aid”, “CVS”, “Walgreens”, etc.) for gas stations and drug stores. These fields form a natural hierarchy of metadata in which some attributes refine or narrow the set of things that are labeled by broader field types. Rebuilding Context: Identifying field name patterns to find relevant phrase patterns So now its time to put Humpty Dumpty back together again. With Solr/Lucene – it is likely that the information that we need to give precise answers to precise questions is available in the search index. If we can identify sub-phrases within a query that refer or map to a metadata field in the index, we can then add the appropriate metadata mapping on behalf of the user. We are then able to answer questions like “Where is the nearest Tru Value hardware store?” because we can identify the phrase “Tru Value” as a business name and “hardware store” as a specific type of store. Assuming that this information is in the index in the form of metadata fields, parsing the query is a matter of detecting these metadata values and associating them with their source fields. Some additional NLP magic can be used to infer other aspects of the question such as “where is the nearest”, which should trigger the addition of a spatial proximity query filter for example. The Query AutoFiltering Search Component To implement the idea set out above, I developed a Solr Search Component called QueryAutoFilteringComponent. Search components are executed as part of the search request handling process. Besides executing a search, they can also do other things like spell checking or query suggestion, return the set of terms that are indexed in a field or the term vectors (term frequency statistics) among other things. The SearchComponent interface defines a number of methods one of which – prepare( ) – is executed by all of the components in a search handler chain before the request is processed. By specifying that a non-standard component is in the “first-components” list – it will be executed before the query is sent to the index by the downstream QueryComponent. This gives these early components a chance to modify the query before it is executed by the Lucene engine (or distributed to other shards in SolrCloud). The QueryAutoFilteringComponent works by creating a mapping of term values to the index fields that contain them. It uses the Lucene UnivertedIndex and the Solr TermsComponent (in SolrCloud mode) to build this map. This “inverse” map of term value -> index field is then used to discover if any sub-phrase within a query maps to a particular index field. If so, a filter query (fq) or boost query (bq) – depending on the configuration – is created from that field:value pair and if the result is to be a filter query, the value is removed from the original query. The result is a series of query expressions for the phrases that were identified in the original query. An example will help to make this clearer. Assuming that we have indexed the following records: This example is admittedly a bit contrived in that the term “red” is deliberately ambiguous – it can occur as a color value or as part of a brand or product_type phrase. So, with the OOTB Solr /select handler, a search for “red lion socks” brings back all 16 records. However, with the QueryAutoFilterComponent, only 2 results are returned (4 and 5) for this query. Furthermore, searching for “red wine” will only bring back one record (11) whereas searching for “red wine vinegar” brings back just record 12. What the filter does is to match terms with fields, trying to find the longest contiguous phrases that match mapped field values. So for the query “red lion socks” – it will first discover that “red” is a color, but then it will discover that “red lion” is a brand and this will supercede the shorter match that starts with “red”. Likewise, with “red wine vinegar”, it will first find “red” == color, then “red wine” == product_type then “red wine vinegar” == product_type and the final match will win because it is the longest contiguous match. It will work across fields too. If the query is “blue red lion socks” – it will discover that “blue” is a color, then that “blue red” is nothing so it will move on to the next unmatched token – “red”. It will then, as before, discover that “red lion” is a brand, reject “red lion socks” which doesn’t map to anything and finally find that “socks” is a product_type. From these three field matches it will construct a filter (or boost) query with the appropriate mapping of field name to field value. The result of all of this is a translation of the Solr query: q=blue red lion socks to a filter query: fq=color:blue&fq=brand:”red lion”&fq=product_type:socks This final query brings back just 1 result as opposed to 16 for the unfiltered case. In other words, we have increased precision from 6.25% to 100%! Adding case sensitivity and synonym support: One of the problems with using string fields as the source of metadata for noun phrases is that they are not analyzed (as discussed above). This limits the set of user inputs that can match – without any changes, the user must type in exactly what is indexed, including case and plurality. To address this problem, support for basic text analysis such as case insensitivity and stemming (singular/plural) as well as support for synonyms was added to the QueryAutoFilteringComponent. This adds to the code complexity somewhat but it makes it possible for the filter to detect synonymous phrases in the query like “couch” or “lounge chair” when “Sofa” or “Chaise Lounge” were indexed. Another thing that can help at an application level is to develop a suggester for typeahead or autocomplete interfaces that uses the Solr terms component and facet maps to build a multi-field suggester that will guide users towards precise and actionable queries. I hope to have a post on this in the near future. Source Code For those that are interested in how the autofiltering component works or would like to use it in your search application, source code and design documentation are available on github. The component has also been submitted to Solr (SOLR-7539 if you want to track it). The source code on github is in two versions, one that compiles and runs with Solr 4.x and the other that uses the new UninvertingReader API that must be used in Solr 5.0 and above. Conclusions The QueryAutoFilteringComponent does a lot more than the simple implementation introduced in the previous post. Like the previous example, it turns a free form queries into a set of Solr filter queries (fq) – if it can. This will eliminate results that do not match the metadata field values (or their synonyms) and is a way to achieve high precision. Another way to go is to use the “boost query” or bq rather than fq to push the precise hits to the top but allow other hits to persist in the result set. Once contextual phrases are identified, we can boost documents that contain these phrases in the identified fields (one of the chicken-and-egg problems with query-time boosting is knowing what field/value pairs to boost). The boosting approach may make more sense for traditional search applications viewed on laptop or workstation computers whereas the filter query approach probably makes more sense for mobile applications. The component contains a configurable parameter “boostFactor” which when set, will cause it to operate in boost mode so that records with exact matches in identified fields will be boosted over records with random or partial token hits.

June 22, 2015

by Lisa Warner

· 2,219 Views

The Latest AI/ML Topics