MongoDB Full Text Search vs. Regular Expressions
Today we would like to introduce you to the new MongoDB full text search and compare its capabilities and performance with simple regular expressions, which are currently the state of the art for searching in MongoDB. We will provide code snippets explaining how to use both features in a Java application as well as an empirical performance evaluation.
What is MongoDB full text search?
MongoDB full text search is a new feature in MongoDB 2.4. However, it is still in beta and is not recommended for use in production systems.
Why not continue to use regular expressions?
There are two major reasons. First, regular expressions have natural limitations: they lack any stemming functionality and cannot handle convenient search queries such as “action -superhero” in a trivial way. Second, they generally cannot use traditional indexes (unless the pattern is anchored to the start of the value), which makes queries on large datasets really slow.
Nevertheless, searching via regular expressions is really easy to implement using Spring Data, as demonstrated by the following code snippet:
import java.util.List;

import org.springframework.data.mongodb.core.MongoOperations;
import org.springframework.data.mongodb.core.query.*;

public List<Movie> searchInDescription(String searchString, int limit, int offset) {
    // match the search string anywhere in the "description" field
    Criteria criteria = Criteria.where("description").regex(searchString);
    Query query = Query.query(criteria);

    // apply pagination, sorting would also be specified here
    query.limit(limit);
    query.skip(offset);

    return mongoOperations.find(query, Movie.class);
}
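To illustrate the first limitation, here is a minimal sketch of what emulating a query such as “action -superhero” with plain regular expressions might involve. The class `RegexTermQuery` and its matching rules (at least one positive term must occur as a whole word, and no negated term may occur) are assumptions made for this illustration; they are not part of the demo application or of MongoDB.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: emulates "action -superhero" style queries with regular expressions. */
final class RegexTermQuery {
    private final List<Pattern> required = new ArrayList<>();
    private final List<Pattern> excluded = new ArrayList<>();

    RegexTermQuery(String searchString) {
        for (String token : searchString.trim().split("\\s+")) {
            if (token.isEmpty() || token.equals("-")) {
                continue; // nothing to search for
            }
            boolean negated = token.startsWith("-");
            String term = negated ? token.substring(1) : token;
            // Quote the term so it is matched literally, as a whole word, ignoring case.
            Pattern pattern = Pattern.compile(
                    "\\b" + Pattern.quote(term) + "\\b", Pattern.CASE_INSENSITIVE);
            (negated ? excluded : required).add(pattern);
        }
    }

    /** True if no negated term occurs and at least one positive term occurs. */
    boolean matches(String text) {
        for (Pattern pattern : excluded) {
            if (pattern.matcher(text).find()) {
                return false;
            }
        }
        for (Pattern pattern : required) {
            if (pattern.matcher(text).find()) {
                return true;
            }
        }
        return false; // assumption: a query without positive terms matches nothing
    }

    public static void main(String[] args) {
        RegexTermQuery query = new RegexTermQuery("action -superhero");
        System.out.println(query.matches("An action movie"));
        System.out.println(query.matches("A superhero action movie"));
        System.out.println(query.matches("Two actions")); // no stemming: "actions" is not "action"
    }
}
```

Even this small sketch needs its own query parsing, and it still treats “actions” and “action” as different words because there is no stemming.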
How to use the MongoDB full text search?
Unfortunately, this is a little harder, as Spring Data does not yet support the feature. To implement our own solution, we first have to understand how a full text search is executed in the mongo shell:
> db.collection.runCommand("text", {search:"action", project:{"_id":1}})
{
    "queryDebugString" : "day||||||",
    "language" : "english",
    "results" : [
        {
            "score" : 0.5089285714285714,
            "obj" : {
                "_id" : ObjectId("51c175a20364281420b1d17d")
            }
        }
    ],
    "stats" : {
        "nscanned" : 1,
        "nscannedObjects" : 0,
        "n" : 1,
        "nfound" : 1,
        "timeMicros" : 77
    },
    "ok" : 1
}
The command returns a single JSON document containing all objects that match the query, along with some statistics on the search that has just been executed. Translating this into Java code that extracts the IDs of all matches works as follows:
import java.util.*;

import com.mongodb.*;
import org.bson.types.ObjectId;
import org.springframework.data.mongodb.core.MongoOperations;

public Collection<ObjectId> findMatchingIds(String searchString) {
    CommandResult result = executeFullTextSearch(searchString);
    return extractSearchResultIds(result);
}

private CommandResult executeFullTextSearch(String searchString) {
    // build the same "text" command that is used in the mongo shell
    BasicDBObject textSearch = new BasicDBObject();
    textSearch.put("text", Movie.COLLECTION_NAME);
    textSearch.put("search", searchString);
    textSearch.put("limit", SEARCH_LIMIT); // override default of 100
    textSearch.put("project", new BasicDBObject("_id", 1));
    return mongoOperations.executeCommand(textSearch);
}

private Collection<ObjectId> extractSearchResultIds(CommandResult result) {
    Set<ObjectId> objectIds = new HashSet<ObjectId>();
    BasicDBList resultList = (BasicDBList) result.get("results");
    Iterator<Object> it = resultList.iterator();
    while (it.hasNext()) {
        // each entry has the shape { "score": ..., "obj": { "_id": ... } }
        BasicDBObject resultContainer = (BasicDBObject) it.next();
        BasicDBObject resultObj = (BasicDBObject) resultContainer.get("obj");
        ObjectId resultId = (ObjectId) resultObj.get("_id");
        objectIds.add(resultId);
    }
    return objectIds;
}
Note that there is no indicator for the field to search in! As MongoDB supports only one text index per collection, this information is specified implicitly by defining the index, either in the shell
db.collection.ensureIndex({"description":"text"})
or from the Java application
mongoOperations.getCollection(Movie.COLLECTION_NAME)
    .ensureIndex(new BasicDBObject("description", "text"));
In order to provide search results with pagination and custom sorting for the application’s UI layer, we need another standard Spring Data query that does exactly that.
import java.util.*;

import org.bson.types.ObjectId;
import org.springframework.data.mongodb.core.MongoOperations;
import org.springframework.data.mongodb.core.query.*;

public List<Movie> searchInDescription(String searchString, int limit, int offset) {
    // step 1: run the full text search and collect the IDs of all matches
    Collection<ObjectId> searchResultIds = findMatchingIds(searchString);

    // step 2: run a regular query restricted to those IDs
    Criteria criteria = Criteria.where("_id").in(searchResultIds);
    Query query = Query.query(criteria);

    // apply pagination, sorting would also be specified here
    query.limit(limit);
    query.skip(offset);

    return mongoOperations.find(query, Movie.class);
}
This two-step approach ensures we can use all the functionality of a regular MongoDB query (sorting, pagination, additional criteria, …) while taking advantage of MongoDB’s current full text search implementation.
What are the limitations of the MongoDB full text search?
The full text search does not work properly for really large datasets, because all matches are returned as a single document and the command does not support a “skip” parameter for retrieving results page by page. Even when projecting to nothing but the “_id” field, a huge set of matches will not be returned in its entirety if the result exceeds MongoDB’s 16 MB per-document limit.
How does the MongoDB full text search perform compared to regular expressions?
To get a feeling for how fast the MongoDB full text search works in different cases, we built a small demo application that imports data from The Movie Database and displays it in a list. After entering a term in the search field, one can decide to run the search with or without the MongoDB full text search. The resulting times in ms are printed to the console.
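A minimal sketch of how such timings could be collected and printed is shown here. The names `SearchTimer` and `timed` are hypothetical and are not taken from the demo application; the sketch simply wraps any call and reports its wall-clock duration.

```java
import java.util.function.Supplier;

/** Hypothetical helper: runs an action, prints its duration in ms and returns its result. */
final class SearchTimer {
    static <T> T timed(String label, Supplier<T> action) {
        long start = System.nanoTime();
        T result = action.get();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(label + " took " + elapsedMs + " ms");
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for a real search call; any Supplier can be measured this way.
        Integer matches = timed("dummy search", () -> 15);
        System.out.println("matches: " + matches);
    }
}
```

In an application like the one described, the supplier would wrap the count query or the page retrieval, for example a call to a search method with a fixed limit and offset.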
For our example, we imported 100,000 movies and searched for three different words, always retrieving the first page with up to 15 entries but also counting the number of all matches (for calculating the number of required pages):
- “movie”, which delivers 3,533 matches with the full text search and 3,317 with regular expressions (the numbers differ due to the full text search’s stemming functionality)
- “newspaper”, which delivers 318 matches with the full text search and 320 with regular expressions
- “mayzie”, which delivers 2 matches in both cases
The following bar chart illustrates the corresponding performance for counting the results and retrieving them:
The chart indicates two major trends. The regular expression search takes longer for queries with just a few results, while the full text search gets faster for such queries and is clearly superior in those cases. Why is that?
Let’s explain the results for searching with regular expressions first. The time for counting the number of matches among the 100,000 entries is pretty consistent at about 200 ms. This is the time required to scan the entire collection document by document, as no index can be used here. On the other hand, the time to retrieve one page goes up tremendously for a smaller number of results. This is because MongoDB uses an index to iterate over all documents in the correct sort order and can stop as soon as 15 entries for the first page have been found. For a query with about 3,500 matches in 100,000 documents (“movie”), the expected number of documents to be scanned is only 15/0.035 ≈ 428.6, while the entire collection needs to be checked for a very rare search term such as “mayzie”.
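The estimate above can be reproduced with a few lines of code. This is only a sketch: the class `ScanEstimate` is hypothetical, and it assumes that matches are spread evenly across the sort order, so that roughly one in every (total / matches) documents is a hit.

```java
/** Hypothetical helper: rough estimate of documents scanned to fill one result page. */
final class ScanEstimate {
    /**
     * Expected number of documents scanned until pageSize matches are found,
     * capped at the collection size (fewer matches than pageSize means a full scan).
     */
    static double expectedScanned(int pageSize, long matches, long totalDocs) {
        if (matches <= 0) {
            return totalDocs; // nothing matches: every document is checked
        }
        double matchRate = (double) matches / totalDocs;
        return Math.min(totalDocs, pageSize / matchRate);
    }

    public static void main(String[] args) {
        // Figures from the article: 100,000 movies, pages of 15 entries.
        System.out.println("movie:     " + expectedScanned(15, 3_500, 100_000));
        System.out.println("newspaper: " + expectedScanned(15, 320, 100_000));
        System.out.println("mayzie:    " + expectedScanned(15, 2, 100_000));
    }
}
```

For “movie” this gives roughly 429 documents, for “newspaper” a few thousand, and for “mayzie” the whole collection, which matches the trend described for the regular expression search.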
Explaining the full text search performance is quite straightforward. In this case, MongoDB can use an index, and hence the query is always efficient. It only requires slightly more time for an increasing number of matches. An important point to understand here is that the time to retrieve the first page includes executing the full text search, extracting all result IDs in the application, and running another standard query as explained above. The time required for the extraction step increases linearly with the number of matches, which is the major reason for the rise in retrieval time for bigger result sets even though only 15 entries are returned.
What do you need to run this demo yourself?
Our demo application uses Spring, Spring Data, Apache Wicket, Gradle and MongoDB.
To get started, download the code from https://github.com/comsysto/mongo-full-text-search-movie-showcase.
To start MongoDB with full text search enabled, shut down your mongod if it is currently running and then run:
mongod --setParameter textSearchEnabled=true
Alternatively, you can add this line to your mongodb.conf file to enable the feature permanently (not recommended in a production environment):
setParameter = textSearchEnabled=true
If you haven’t installed Gradle, follow this manual. Then run
gradle clean build
To start the application, run
gradle jettyRun
What may help if you have problems?
If the full text search is not properly configured, you will always obtain an empty result list, no matter which term you search for. Additionally, the message “### MongoDB full text search does not work properly – cannot retrieve any results.” will be printed to the console.
This behavior can have multiple causes:
- You haven’t started your mongod server with the textSearchEnabled=true option as described above.
- You have specified more than one text index for the collection, which MongoDB cannot handle. You can check this by running the following in a mongo shell:
use movie
db.movie.getIndexes()
[more problems and traps we know about]
How can this demo be extended?
Starting from this little demo, you can extend it as you wish. If you are using The Movie Database, please create your own account here and use your own API key.
Any questions?
If you have any feedback, please write to christan.kroemer@comsysto.com or elisabeth.engel@comsysto.com!
Published at DZone with permission of Comsysto Gmbh, DZone MVB. See the original article here.