In this post I’m going to talk about a set of benchmarks that I’ve run with Solr. The goal is to see how each parameter defined in the schema affects the size of the index and the performance of the system.
The first step was to fetch the set of documents that I was going to use in the tests. I wanted the documents to be composed of real text, so I started to look for sources on the Internet. The first one that I really liked was Twitter. They provide a REST API that allows you to read a continuous stream of tweets, made up of approximately 1% of all the public tweets. Each tweet is expressed as a JSON object, and carries metadata about the message and the author. While this source allowed me to get a good number of documents in a short time (about 1.7 million tweets in 2 days), they were really small, so I started to look for a source of bigger documents, and finally chose Wikipedia. I downloaded the documents over HTTP using the “Random Article” feature on their site, obtaining about 160,000 articles in a couple of days. At the time of writing, the site download.wikipedia.org, which provides an easy way of downloading a bunch of articles, was out of service.
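As a rough illustration, a download loop of this kind could look like the following Python sketch. It is not the script used for these benchmarks: the `Special:Random` URL, the `fetch_random_articles` name, the User-Agent string and the retry limit are all assumptions made for the example. The idea is simply to request the random-article page repeatedly and keep each article once, keyed by the URL the request was redirected to.

```python
import urllib.request

# Assumption: the "Random Article" feature is reachable at this URL and
# answers with a redirect to a randomly chosen article.
RANDOM_URL = "https://en.wikipedia.org/wiki/Special:Random"


def fetch_random_articles(count, open_url=urllib.request.urlopen, max_attempts=None):
    """Collect up to `count` distinct articles as {final_url: raw_html_bytes}.

    `open_url` is injectable so the loop can be exercised without network access.
    """
    articles = {}
    attempts = 0
    # Hypothetical safety limit so the loop ends even if duplicates keep coming back.
    limit = max_attempts if max_attempts is not None else count * 10
    while len(articles) < count and attempts < limit:
        attempts += 1
        request = urllib.request.Request(
            RANDOM_URL, headers={"User-Agent": "solr-benchmark-sketch/0.1"}
        )
        with open_url(request) as response:
            final_url = response.url  # URL after following the redirect
            if final_url not in articles:
                articles[final_url] = response.read()
    return articles
```

Calling `fetch_random_articles(100)` would issue real requests, so in practice you would want to add a delay between them and respect the site’s usage policy.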
The next step was to design the schema. Because one of the objectives is to see how each change in the schema affects the size of the index, I used many different combinations of parameters, so as to measure the influence of each one of them. In each case, the list of stop words was populated using the top 100 terms of each set of documents, obtained from the administration panel of Solr. For both datasets, the “omitNorms”, “termVectors” and “stopWords” parameters refer to the “text” field. In all cases, the value of the parameters “termOffsets” and “termPositions” is the same as that of “termVectors”.
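To make the parameter space concrete, here is a small Python sketch that enumerates the combinations of the three parameters and prints a possible definition of the “text” field for each one. It is only an illustration under some assumptions: that every on/off combination is tested (the actual set of schemas may have been smaller), that the field type is called `text`, and that the field is indexed and stored. Stop words are not a field attribute in Solr (they are configured in the analyzer of the field type), so they appear here only as a flag on the variant.

```python
from itertools import product


def field_definition(omit_norms, term_vectors):
    """Build a <field> element for the "text" field.

    termPositions and termOffsets always follow termVectors, as in the benchmark.
    The type name "text" and indexed/stored="true" are assumptions of this sketch.
    """
    vectors = "true" if term_vectors else "false"
    return (
        '<field name="text" type="text" indexed="true" stored="true" '
        f'omitNorms="{str(omit_norms).lower()}" '
        f'termVectors="{vectors}" termPositions="{vectors}" termOffsets="{vectors}"/>'
    )


def schema_variants():
    """Enumerate every on/off combination of the three benchmark parameters."""
    variants = []
    for omit_norms, term_vectors, stop_words in product([False, True], repeat=3):
        variants.append(
            {
                "omitNorms": omit_norms,
                "termVectors": term_vectors,
                # Hypothetical flag: whether the analyzer uses the top-100 stop list.
                "stopWords": stop_words,
                "field_xml": field_definition(omit_norms, term_vectors),
            }
        )
    return variants


if __name__ == "__main__":
    for variant in schema_variants():
        print(variant["stopWords"], variant["field_xml"])
```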
In the first figure you can see the size of the index for each schema for the Twitter data-set, and which proportion of the index corresponds to each parameter. Remember that this data-set has lots of documents (about 1.7 million) but each one is small (240 bytes on average). There are several remarkable things here. First, the space occupied by the term vectors (~280 MiB when not using stop words) is roughly the same as the space occupied by the inverted index itself (~240 MiB). Second, the space saved by omitting norms is almost negligible (~2 MiB). Third, the share of the index saved by using stop words more than doubles when storing term vectors, going from about 4% of the index to about 10%. Finally, the space occupied by the stored fields (~340 MiB) is considerably bigger than the space occupied by the inverted index itself.
In the second figure you can see the same information for the Wikipedia data-set. The space occupied by the norms is still negligible (< 1 MiB). However, the space saved by using stop words has increased to about 22% of the index size when not storing term vectors, and about 25% when storing them. This time, the space occupied by the term vectors (~1067 MiB) is almost three times the space occupied by the inverted index itself (~380 MiB). Finally, the size of the stored documents (~6330 MiB) is more than four times the size of the index with term vectors stored.
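The ratios quoted for both data-sets can be checked with a few lines of Python. The sizes are the approximate values given in the text; treating “the index with term vectors stored” as the inverted index plus the term vectors is an assumption of this sketch, and the dictionary keys are just illustrative names.

```python
# Approximate sizes in MiB, as quoted in the text for each data-set.
SIZES_MIB = {
    "twitter": {"inverted_index": 240, "term_vectors": 280, "stored_fields": 340},
    "wikipedia": {"inverted_index": 380, "term_vectors": 1067, "stored_fields": 6330},
}


def size_ratios(sizes):
    """Relate term vectors and stored fields to the size of the index."""
    # Assumption: index with term vectors = inverted index + term vectors.
    index_with_vectors = sizes["inverted_index"] + sizes["term_vectors"]
    return {
        "term_vectors / inverted_index": sizes["term_vectors"] / sizes["inverted_index"],
        "stored_fields / inverted_index": sizes["stored_fields"] / sizes["inverted_index"],
        "stored_fields / index_with_vectors": sizes["stored_fields"] / index_with_vectors,
    }


if __name__ == "__main__":
    for name, sizes in SIZES_MIB.items():
        for label, value in size_ratios(sizes).items():
            print(f"{name:10s} {label:36s} {value:.2f}")
```

For Wikipedia this gives about 2.81 for term vectors against the inverted index and about 4.37 for stored fields against the index with term vectors, in line with the “almost three times” and “more than four times” figures above.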
- When the number of fields is small, the size of the norms is negligible, independently of the size and number of documents.
- When the documents are large, stop words help reduce the size of the index significantly. Two things are worth noting here. First, the documents fetched from Wikipedia are written in traditional language, and are all in English, while the documents fetched from Twitter are written in modern language, and in many different languages. Second, I didn’t measure the precision and recall of the system when using stop words, so it is possible that the findability in a real scenario won’t be good.
- If you’re storing the documents, and they are big enough, the term vectors add relatively little to the total size, so if you’re using a feature such as highlighting and you are looking for good performance, you should store them. If you’re not storing documents, or your documents are small, you should think twice before storing the term vectors, because they’re going to significantly increase the size of your index.
I hope you find this post useful. I’m currently working on a set of benchmarks to measure the influence of each one of these parameters on the performance of the system, so if you liked this post, stay tuned!