jsii – full text search in 1K LOC of JavaScript!
in the previous blog post i tried to introduce node.js and its nice features. today i will introduce my little search engine prototype called jsii (javascript inverted index).
jsii provides an in-memory inverted index in approx. 1000 lines of javascript. some more lines are necessary to set up a server via node.js, so that the index is queryable via http and returns solr compatible json or xml. the sources are available on github:
git clone git@github.com:karussell/jsii.git
try it out here:
http://pannous.info:8124/select?q=google
e.g. filter queries work like id:xy, and queries with sorting work like
&sort=id asc. the parameters start and rows can be used for paging.
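as a small sketch of how such a request could be assembled from javascript: the helper name buildSelectUrl is made up for this example and is not part of jsii, and it assumes the server accepts the usual url encoding (a space in the sort value becomes '+').

```javascript
// hypothetical helper (not part of jsii): builds a /select url from the
// parameters mentioned above (q, sort, start, rows)
function buildSelectUrl(base, { q, sort, start, rows }) {
  const params = new URLSearchParams({ q });
  if (sort !== undefined) params.set("sort", sort);
  if (start !== undefined) params.set("start", String(start));
  if (rows !== undefined) params.set("rows", String(rows));
  return `${base}?${params.toString()}`;
}

// third page of 10 results for 'google', sorted by id ascending
const url = buildSelectUrl("http://pannous.info:8124/select", {
  q: "google",
  sort: "id asc",
  start: 20,
  rows: 10,
});
console.log(url);
// http://pannous.info:8124/select?q=google&sort=id+asc&start=20&rows=10
```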
for those who come too late (e.g. because my server crashed), here is an image of the xml response:
the solr compatible xml response format makes it possible to use jsii
from applications that are using solrj. for example i tried it for
jetwick and the basic search worked, just specify the xml response parser:
solrj.setParser(new XMLResponseParser());
his-story
the first thing i needed was a bitset analogue in javascript to perform term look-ups fast and to combine them via a bitwise and operation. therefore i took the classes and tests from a gwt patch and made them work with my jasmine specs.
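to illustrate the idea, here is a minimal sketch of an inverted index where every term points to a bitset of document ids and a multi-term query is answered with a bitwise and. this is not the code from jsii or from the gwt patch: TinyBitSet, buildIndex and searchAll are names invented for this example, and the tokenizer is a naive split on non-word characters.

```javascript
// hypothetical fixed-size bitset backed by 32-bit words
class TinyBitSet {
  constructor(size) {
    this.size = size;
    this.words = new Uint32Array(Math.ceil(size / 32));
  }
  set(i) {
    this.words[i >>> 5] |= 1 << (i & 31);
  }
  get(i) {
    return (this.words[i >>> 5] & (1 << (i & 31))) !== 0;
  }
  // returns a new bitset containing only the bits set in both operands
  // (assumes both bitsets were created with the same size)
  and(other) {
    const result = new TinyBitSet(this.size);
    for (let w = 0; w < this.words.length; w++) {
      result.words[w] = this.words[w] & other.words[w];
    }
    return result;
  }
  toArray() {
    const ids = [];
    for (let i = 0; i < this.size; i++) {
      if (this.get(i)) ids.push(i);
    }
    return ids;
  }
}

// term -> bitset of the documents (by array position) containing that term
function buildIndex(docs) {
  const index = new Map();
  docs.forEach((text, docId) => {
    for (const term of text.toLowerCase().split(/\W+/).filter(Boolean)) {
      if (!index.has(term)) index.set(term, new TinyBitSet(docs.length));
      index.get(term).set(docId);
    }
  });
  return index;
}

// ids of the documents that contain all of the given terms
function searchAll(index, terms, docCount) {
  let hits = null;
  for (const term of terms) {
    const bits = index.get(term.toLowerCase()) || new TinyBitSet(docCount);
    hits = hits === null ? bits : hits.and(bits);
  }
  return hits === null ? [] : hits.toArray();
}

const docs = [
  "java and node.js",
  "node.js search engine",
  "java search with lucene",
];
const index = buildIndex(docs);
console.log(searchAll(index, ["java", "search"], docs.length)); // [ 2 ]
```

the and over whole 32-bit words is what makes combining terms cheap: one operation intersects 32 documents at a time.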
while trying to understand the basics of a scoring function i stumbled over the lucene docs and this thread, which mentions ‘section 6 of a book’ as a good reference on that subject.
my understanding of the basics is now the following:
- the term frequency (tf) weights documents differently. e.g. doc1 contains ‘java’ 10 times but doc2 contains it 20 times, so doc2 is more important for the query ‘java’. if you index tweets you should use tf = min(tf, 3); otherwise you will often get tweets à la ‘java java java java java java…’ instead of important ones. so for tweets a higher entropy is also relevant
- the inverse document frequency (idf) gives certain terms a higher (or lower) weight. if a term occurs in all documents, its idf weight should be low, so that this query term counts less than other terms that match fewer documents
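the two points above can be sketched as a tiny scoring function. this is only one common tf-idf variant chosen for illustration (tf times log(N / df)) and not necessarily the formula used by jsii or lucene; the function name and the tfCap parameter are made up, and docFreq is assumed to be greater than zero.

```javascript
// hypothetical tf-idf style score for one term in one document
//   termFreq  - how often the term occurs in the document
//   docFreq   - in how many documents the term occurs (assumed > 0)
//   totalDocs - number of documents in the index
//   tfCap     - optional upper bound for tf, e.g. 3 for tweets
function score(termFreq, docFreq, totalDocs, tfCap = Infinity) {
  const tf = Math.min(termFreq, tfCap);
  // idf is 0 when the term occurs in every document and grows for rarer terms
  const idf = Math.log(totalDocs / docFreq);
  return tf * idf;
}

// doc2 (20 x 'java') scores higher than doc1 (10 x 'java')
console.log(score(20, 10, 1000) > score(10, 10, 1000)); // true

// a term that occurs in all documents contributes nothing
console.log(score(5, 1000, 1000)); // 0

// with the cap of 3, a 'java java java java java java' tweet gains nothing over 3 x 'java'
console.log(score(6, 10, 1000, 3) === score(3, 10, 1000, 3)); // true
```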
with jsii you can grab docs from a solr index or feed it via the
javascript api. jsii is very basic and simple, but it seems to work
reasonably fast: i get fair response times of under 50ms with ~20k
tweets, although i didn’t invest time to improve performance. there are
some bugs and yes, jsii is a memory hog, but besides this it is amazing
what can be done with a ‘script’ language. btw: at the moment jsii is a
100% real time search engine because it does not support transactions or
warming up.
hints
- look into the todo file before posting an issue
- jsii feeding is not thread safe
- i read this object oriented js with node and got some suggestions from node.js users
- as ide i’m using netbeans. i reported an issue to create a ‘pure javascript’ project in netbeans.
- git cheat sheet
- there is an older, similar project called jssindex
Published at DZone with permission of Peter Karussell. See the original article here.