Solr 4.7 – Efficient Deep Paging

A long time ago, we described a problem called deep paging. In short: the deeper you go into the results, the slower the query gets, because Solr has to prepare the data from the beginning for each query. Until Solr 4.7 there was no good solution to that problem. With the recently released Solr 4.7, we can use a so-called cursor to drastically improve the performance of deep paging.

The problem

The deep paging problem is quite easy to define. To return search results, Solr must prepare an in-memory structure and return part of it. Returning that part is simple if it comes from the beginning of the structure. However, if we want to return page number 10,000 (with 20 results per page), Solr needs to prepare a structure containing at least 200,000 elements (10,000 * 20). As you can see, this costs not only time, but also memory.
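The arithmetic behind that figure can be sketched in a few lines of Python. The function name docs_to_collect is made up for this illustration and is not part of Solr. It assumes 1-based page numbers, so the offset that would be sent as start is (page - 1) * rows.

```python
def docs_to_collect(page: int, rows: int) -> int:
    """Number of sorted documents that must be gathered to serve a 1-based page."""
    if page < 1 or rows < 1:
        raise ValueError("page and rows must be positive")
    start = (page - 1) * rows  # offset that would be sent as the start parameter
    return start + rows        # everything before the page, plus the page itself


print(docs_to_collect(1, 20))       # 20
print(docs_to_collect(10_000, 20))  # 200000
```

The cost grows linearly with the page number, which is why the first pages are cheap and the deep ones are not.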

The good news is that with the release of Solr 4.7 the situation has changed: the cursor has been introduced. A cursor is a logical structure that doesn't require its state to be stored on the server side. It contains information about the sorting and the last document returned in the results. Because of that, Solr doesn't need to start the search from the beginning each time we want to get the next page of results. This gives a drastic performance improvement when going deep into the results.
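The idea can be illustrated with a toy, in-memory sketch. This is not how Solr implements cursors internally, and the names sort_key and page_after are invented here. It assumes documents ordered by score descending and then by id ascending, and it shows that remembering only the sort values of the last returned document is enough to find the next page.

```python
from bisect import bisect_right


def sort_key(doc):
    # score descending, then id ascending (id is unique, so keys never tie)
    return (-doc["score"], doc["id"])


def page_after(sorted_docs, last_key, rows):
    """Return (page, new_last_key); last_key=None means start from the top."""
    # Toy only: the key list is rebuilt on every call for simplicity.
    keys = [sort_key(d) for d in sorted_docs]
    begin = 0 if last_key is None else bisect_right(keys, last_key)
    page = sorted_docs[begin:begin + rows]
    new_last_key = sort_key(page[-1]) if page else last_key
    return page, new_last_key


docs = sorted(
    [{"id": "b", "score": 1.0}, {"id": "a", "score": 1.0}, {"id": "c", "score": 2.0}],
    key=sort_key,
)
first, mark = page_after(docs, None, 2)
second, mark = page_after(docs, mark, 2)
print([d["id"] for d in first])   # ['c', 'a']
print([d["id"] for d in second])  # ['b']
```

Nothing has to be kept on the "server" between calls: the caller hands back the mark, and the position of the next page is found from it.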

Usage

Using the cursor is very simple. To tell Solr to return a cursor, we pass an additional parameter in the first query: cursorMark=*. In the response, apart from the documents, we get a cursor identifier in the nextCursorMark element. Let's look at an example.

The query

Let’s start with a very simple query:

curl 'localhost:8983/solr/select?q=*:*&rows=1&sort=score+desc,id+asc&cursorMark=*'

There are four things here that we are interested in. First off, we either omit the start parameter or set it to 0. The rows parameter can take any value we need; there is no limitation on it. Of course, we passed the cursorMark=* parameter to tell Solr that we want the cursor to be used. The final thing is the sort definition. The cursor needs an explicit sort that tells it how to behave, and that sort has to include the document identifier (the uniqueKey field) so that the order is unambiguous. That's why we needed to overwrite the default sorting and sort not only by score, but also by document identifier.
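Those four rules can be captured in a small helper that builds the query string. This is only a sketch: build_cursor_query is a hypothetical name, the unique key field is assumed to be called id as in the example above, and the sort parsing is deliberately naive (it would not cope with function sorts that contain commas).

```python
from urllib.parse import urlencode


def build_cursor_query(q, rows, sort, cursor_mark="*", unique_key="id"):
    """Return a URL query string for a cursor request (helper name is hypothetical)."""
    # Naive parsing: take the field name from each "field direction" clause.
    sort_fields = [clause.split()[0] for clause in sort.split(",") if clause.strip()]
    if unique_key not in sort_fields:
        raise ValueError(f"sort must include the unique key field '{unique_key}'")
    # start is left out on purpose; rows can be any page size we need.
    return urlencode({"q": q, "rows": rows, "sort": sort, "cursorMark": cursor_mark})


query = build_cursor_query("*:*", 1, "score desc,id asc")
print("localhost:8983/solr/select?" + query)
```

The printed URL is percent-encoded, so it looks different from the hand-written curl command but carries the same parameters.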

Search results

Our query returns the following search results:

<?xml version="1.0" encoding="UTF-8"?>
<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">33</int>
  <lst name="params">
   <str name="sort">score desc,id asc</str>
   <str name="start">0</str>
   <str name="q">*:*</str>
   <str name="cursorMark">*</str>
   <str name="rows">1</str>
  </lst>
 </lst>
<result name="response" numFound="32" start="0">
 <doc>
  <str name="id">0579B002</str>
  <str name="name">Canon PIXMA MP500 All-In-One Photo Printer</str>
  <str name="manu">Canon Inc.</str>
  <str name="manu_id_s">canon</str>
  <arr name="cat">
   <str>electronics</str>
   <str>multifunction printer</str>
   <str>printer</str>
   <str>scanner</str>
   <str>copier</str>
  </arr>
  <arr name="features">
   <str>Multifunction ink-jet color photo printer</str>
   <str>Flatbed scanner, optical scan resolution of 1,200 x 2,400 dpi</str>
   <str>2.5" color LCD preview screen</str>
   <str>Duplex Copying</str>
   <str>Printing speed up to 29ppm black, 19ppm color</str>
   <str>Hi-Speed USB</str>
   <str>memory card: CompactFlash, Micro Drive, SmartMedia, Memory Stick, Memory Stick Pro, SD Card, and MultiMediaCard</str>
  </arr>
  <float name="weight">352.0</float>
  <float name="price">179.99</float>
  <str name="price_c">179.99,USD</str>
  <int name="popularity">6</int>
  <bool name="inStock">true</bool>
  <str name="store">45.19214,-93.89941</str>
  <long name="_version_">1461375031699308544</long></doc>
 </result>
 <str name="nextCursorMark">AoIIP4AAACgwNTc5QjAwMg==</str>
</response>

As we can see, in addition to standard search results, we got the cursor identifier in the nextCursorMark section. Now, to get the next results bound to that cursor, we need to pass that identifier using the cursorMark parameter.
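If we want to pick that identifier out of the XML response programmatically, a minimal sketch using Python's standard library could look like this. The function name next_cursor_mark is our own, and the sample is a trimmed-down version of the response shown above.

```python
import xml.etree.ElementTree as ET


def next_cursor_mark(xml_text):
    """Return the nextCursorMark value from a Solr XML response, or None if absent."""
    root = ET.fromstring(xml_text)
    # nextCursorMark is a direct child of <response>, next to <result>
    node = root.find("./str[@name='nextCursorMark']")
    return node.text if node is not None else None


sample = """<response>
 <result name="response" numFound="32" start="0"/>
 <str name="nextCursorMark">AoIIP4AAACgwNTc5QjAwMg==</str>
</response>"""
print(next_cursor_mark(sample))  # AoIIP4AAACgwNTc5QjAwMg==
```

The value is Base64-like text and may contain characters such as +, / or =, so it is safest to URL-encode it before placing it in the next request.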

Next query

Our next query looks as follows (note the cursorMark parameter value):

curl 'localhost:8983/solr/select?q=*:*&rows=1&sort=score+desc,id+asc&cursorMark=AoIIP4AAACgwNTc5QjAwMg=='

The results were as follows:

<?xml version="1.0" encoding="UTF-8"?>
<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">2</int>
  <lst name="params">
   <str name="sort">score desc,id asc</str>
   <str name="indent">true</str>
   <str name="q">*:*</str>
   <str name="cursorMark">AoIIP4AAACgwNTc5QjAwMg==</str>
   <str name="rows">1</str>
  </lst>
 </lst>
<result name="response" numFound="32" start="0">
 <doc>
  <str name="id">100-435805</str>
  <str name="name">ATI Radeon X1900 XTX 512 MB PCIE Video Card</str>
  <str name="manu">ATI Technologies</str>
  <str name="manu_id_s">ati</str>
  <arr name="cat">
   <str>electronics</str>
   <str>graphics card</str>
  </arr>
  <arr name="features">
   <str>ATI RADEON X1900 GPU/VPU clocked at 650MHz</str>
   <str>512MB GDDR3 SDRAM clocked at 1.55GHz</str>
   <str>PCI Express x16</str>
   <str>dual DVI, HDTV, svideo, composite out</str>
   <str>OpenGL 2.0, DirectX 9.0</str>
  </arr>
  <float name="weight">48.0</float>
  <float name="price">649.99</float>
  <str name="price_c">649.99,USD</str>
  <int name="popularity">7</int>
  <bool name="inStock">false</bool>
  <date name="manufacturedate_dt">2006-02-13T00:00:00Z</date>
  <str name="store">40.7143,-74.006</str>
  <long name="_version_">1461375031846109184</long></doc>
 </result>
 <str name="nextCursorMark">AoIIP4AAACoxMDAtNDM1ODA1</str>
</response>

As we can see, the returned nextCursorMark is different again, which means we can use it to fetch the following page.

Further queries

The logic for further queries is simple: we pass the cursorMark parameter with the value returned with the previous search results. So our next query would look as follows:

curl 'localhost:8983/solr/select?q=*:*&rows=1&sort=score+desc,id+asc&cursorMark=AoIIP4AAACoxMDAtNDM1ODA1'
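Put together, walking through a whole result set can be sketched as a loop. The names fetch_json and iter_all_docs are hypothetical, and the sketch assumes the JSON response writer (wt=json) is available and that Solr runs at the URL used in the examples above. It relies on the behaviour described in the Solr documentation: when the returned nextCursorMark equals the cursorMark we sent, there are no more results.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_json(base_url, params):
    # Assumes the handler accepts wt=json; adjust if your setup differs.
    with urlopen(base_url + "?" + urlencode(params)) as resp:
        return json.load(resp)


def iter_all_docs(base_url, q, sort, rows=100, fetch=fetch_json):
    """Yield every matching document by following nextCursorMark."""
    mark = "*"  # the first request always starts with cursorMark=*
    while True:
        params = {"q": q, "sort": sort, "rows": rows, "cursorMark": mark, "wt": "json"}
        data = fetch(base_url, params)
        yield from data["response"]["docs"]
        next_mark = data["nextCursorMark"]
        if next_mark == mark:  # unchanged mark: nothing left to read
            break
        mark = next_mark


# Usage (needs a running Solr instance; the URL is an assumption):
# for doc in iter_all_docs("http://localhost:8983/solr/select", "*:*", "score desc,id asc"):
#     print(doc["id"])
```

The fetch argument is injectable so that the loop can be exercised without a live server, and urlencode takes care of escaping the cursor mark.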

Summary

A simple API and a massive performance gain for deep paging: that's how I think the cursor introduced in Solr 4.7 could be summarized. I decided not to redo the performance tests, because Chris Hostetter has already run them in his entry about this functionality. If you are interested, please look at: http://searchhub.org/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/.

Published at DZone with permission of Rafał Kuć, DZone MVB. See the original article here.
