jsii – full text search in 1K LOC of JavaScript!

In the previous blog post I tried to introduce node.js and its nice features. Today I will introduce my little search engine prototype called jsii (javascript inverted index).

jsii provides an in-memory inverted index in approx. 1000 lines of JavaScript. Some more lines are necessary to set up a server via node.js, so that the index can be queried via HTTP and returns Solr-compatible JSON or XML. The sources are available on GitHub:

git clone git@github.com:karussell/jsii.git
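To make the term "inverted index" a bit more concrete, here is a minimal sketch of the idea: each term maps to the set of document ids that contain it, and a query intersects those sets. This is not jsii's actual code; the class name `TinyIndex`, its methods and the naive tokenizer are assumptions made up for illustration.

```javascript
// Hypothetical minimal in-memory inverted index (illustration only, not jsii's code).
class TinyIndex {
  constructor() {
    this.postings = new Map(); // term -> Set of document ids
  }

  // Naive tokenizer: lowercase, split on anything that is not a letter or digit.
  static tokenize(text) {
    return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  }

  // Registers every term of the text under the given document id.
  add(id, text) {
    for (const term of TinyIndex.tokenize(text)) {
      if (!this.postings.has(term)) this.postings.set(term, new Set());
      this.postings.get(term).add(id);
    }
  }

  // Returns the ids of documents that contain every query term (AND semantics).
  search(query) {
    const terms = TinyIndex.tokenize(query);
    if (terms.length === 0) return [];
    let result = null;
    for (const term of terms) {
      const ids = this.postings.get(term) || new Set();
      result = result === null
        ? new Set(ids)
        : new Set([...result].filter(id => ids.has(id)));
    }
    return [...result];
  }
}

const index = new TinyIndex();
index.add('1', 'Google search engine');
index.add('2', 'JavaScript search');
console.log(index.search('google search')); // [ '1' ]
```

A real index would also store term frequencies and the documents themselves; the sketch only shows the look-up structure.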

Try it out here: http://pannous.info:8124/select?q=google. Filter queries work like id:xy, and sorting works like &sort=id asc. The parameters start and rows can be used for paging. For those who come too late, e.g. because my server crashed or something ;-), here is an image of the XML response:

[Image: the XML response]
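The query parameters mentioned above can be combined into one request URL. The following is a small sketch of how that might look from Node.js. The helper `buildSelectUrl` is hypothetical, and using `fq` as the name of the filter-query parameter is an assumption borrowed from Solr; the post itself only says that filter queries look like id:xy.

```javascript
// Hypothetical helper that builds a Solr-style /select URL.
// 'fq' is an assumed parameter name (Solr convention), not confirmed for jsii.
function buildSelectUrl(base, { q, fq, sort, start, rows }) {
  const params = new URLSearchParams();
  params.set('q', q);
  if (fq !== undefined) params.set('fq', fq);
  if (sort !== undefined) params.set('sort', sort);
  if (start !== undefined) params.set('start', String(start));
  if (rows !== undefined) params.set('rows', String(rows));
  return `${base}/select?${params.toString()}`;
}

// First page of 10 results for 'google', sorted by id ascending.
const url = buildSelectUrl('http://pannous.info:8124', {
  q: 'google',
  sort: 'id asc',
  start: 0,
  rows: 10,
});
console.log(url);
// http://pannous.info:8124/select?q=google&sort=id+asc&start=0&rows=10
```

For the next page you would keep rows the same and increase start by rows.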


The Solr-compatible XML response format makes it possible to use jsii from applications that use SolrJ. For example, I tried it for Jetwick and the basic search worked. Just specify the XML response parser:

solrj.setParser(new XMLResponseParser());

His-story

The first thing I needed was a BitSet equivalent in JavaScript to perform term look-ups fast and combine them via a bitwise AND operation. Therefore I took the classes and tests from a GWT patch and made them work with my jasmine specs.
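To show why a bit set is handy here, below is a minimal sketch: every term gets a bit set in which bit i is set if document i contains the term, and an AND over two bit sets yields the documents that contain both terms. This is not the GWT-derived class jsii uses; `SimpleBitSet` and its methods are assumptions for illustration.

```javascript
// Hypothetical minimal bit set backed by 32-bit words (illustration only).
class SimpleBitSet {
  constructor(size) {
    this.words = new Uint32Array(Math.ceil(size / 32));
  }

  // Marks position i as set.
  set(i) {
    this.words[i >>> 5] |= 1 << (i & 31);
  }

  // Returns true if position i is set.
  get(i) {
    return (this.words[i >>> 5] & (1 << (i & 31))) !== 0;
  }

  // Returns a new bit set containing only the bits set in both operands.
  and(other) {
    const n = Math.min(this.words.length, other.words.length);
    const out = new SimpleBitSet(n * 32);
    for (let w = 0; w < n; w++) {
      out.words[w] = this.words[w] & other.words[w];
    }
    return out;
  }

  // Lists the positions of all set bits in ascending order.
  toArray() {
    const ids = [];
    for (let i = 0; i < this.words.length * 32; i++) {
      if (this.get(i)) ids.push(i);
    }
    return ids;
  }
}

// Documents 1, 5 and 40 contain 'java'; documents 5, 7 and 40 contain 'search'.
const javaDocs = new SimpleBitSet(64);
[1, 5, 40].forEach(i => javaDocs.set(i));
const searchDocs = new SimpleBitSet(64);
[5, 7, 40].forEach(i => searchDocs.set(i));

console.log(javaDocs.and(searchDocs).toArray()); // [ 5, 40 ]
```

The AND processes 32 documents per operation, which is what makes this approach fast compared to intersecting lists of ids one by one.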

While trying to understand the basics of a scoring function I stumbled over the Lucene docs and this thread, which mentions ‘Section 6 of a book’ as a good reference on that subject.

My understanding of the basics is now the following:

  • The term frequency (tf) weights documents differently. E.g. doc1 contains ‘java’ 10 times but doc2 contains it 20 times, so doc2 is more important for the query ‘java’. If you index tweets you should use tf = min(tf, 3). Otherwise you will often get tweets à la ‘java java java java java java…’ instead of important ones. So for tweets a higher entropy is also relevant.
  • The inverse document frequency (idf) gives certain terms a higher (or lower) weight. If a term occurs in all documents, its idf should be low, so that this query term counts less than other terms that match fewer documents.
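The two points above can be sketched in a few lines. The following uses a simplified textbook variant, score = sum over query terms of min(tf, cap) * log(N / df); it is not Lucene's exact formula and not necessarily what jsii does. The function `scoreDocs` and its parameters are assumptions for illustration.

```javascript
// Hypothetical tf-idf scoring sketch (simplified textbook variant).
// docs: array of token arrays; queryTerms: array of terms;
// tfCap: upper bound for the term frequency, e.g. 3 for tweets.
function scoreDocs(docs, queryTerms, tfCap = Infinity) {
  const n = docs.length;
  return docs.map(doc => {
    let score = 0;
    for (const term of queryTerms) {
      // Document frequency: number of documents that contain the term.
      const df = docs.filter(d => d.includes(term)).length;
      if (df === 0) continue; // unknown term contributes nothing
      const tf = Math.min(doc.filter(t => t === term).length, tfCap);
      const idf = Math.log(n / df); // 0 if the term occurs in every document
      score += tf * idf;
    }
    return score;
  });
}

const tweets = [
  ['java', 'java', 'java', 'java', 'java', 'java'],
  ['java', 'search', 'engine'],
  ['search', 'tips'],
];

const uncapped = scoreDocs(tweets, ['java']);
const capped = scoreDocs(tweets, ['java'], 3);
console.log(uncapped[0] > capped[0]); // true: the cap reduces the 'java java ...' tweet's score
```

With the cap, the repetitive tweet still scores higher than the second one, but only by a factor of 3 instead of 6.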


With jsii you can grab docs from a Solr index or feed it via the JavaScript API. jsii is very basic and simple, but it seems to work reasonably fast. I get fair response times of under 50 ms with ~20k tweets, although I didn’t invest time in improving performance. There are some bugs and yes, jsii is a memory hog, but besides this it is amazing what can be done with a ‘script’ language. BTW: at the moment jsii is a 100% real-time search engine because it does not support transactions or warming up ;-)

Hints

Published at DZone with permission of Peter Karussell. See the original article here.

