Deep paging problem

By Rafał Kuć · Jul. 18, 2011

Imagine the following problem: we have an application that expects Solr to return results sorted on some field. Those results will then be paged in the GUI. However, if the person using the GUI application immediately selects the tenth, twentieth, or fiftieth page of search results, there is a problem: the wait time. Is there anything we can do about it? Yes, we can help Solr a bit.

A few numbers at the beginning

Let's start with the query and some statistics. Imagine that we have the following query, which is sent to Solr to get the five hundredth page of search results:

q=*:*&sort=price+asc&rows=100&start=50000

What must Solr do to retrieve and render the results list? Read the documents from the Lucene index, of course. But how many documents have to be read from the index? Is it 100? Unfortunately, no. Solr must collect 50,100 sorted documents from the Lucene index, because we want 100 documents starting from the 50,000th one. Kind of scary. Now let's compare how long it takes Solr to fetch the first page of search results (the query q=*:*&sort=price+asc&rows=100&start=0) with how long it takes to render the last page (the query q=*:*&sort=price+asc&rows=100&start=999900). The test was performed on an index containing a million documents, each consisting of four fields: id (string), name (text), description (text), and price (long). In every test iteration Solr was started, the query was run, and Solr was shut down. These steps were repeated 100 times for each query, and the times shown in the table are the arithmetic mean of the query execution times. Here are the test results:

Query                                         Average response time (ms)
q=*:*&sort=price+asc&rows=100&start=0         146
q=*:*&sort=price+asc&rows=100&start=999900    3184
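
For reference, the benchmark query can be reproduced with a few lines of SolrJ. This is a minimal sketch rather than the original test harness: the endpoint URL and the collection name products are assumptions, and a SolrJ 7/8-style client is assumed.

import java.io.IOException;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DeepPageBenchmark {
    public static void main(String[] args) throws SolrServerException, IOException {
        // Assumed endpoint and collection name; adjust to your installation.
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/products").build();

        // The "last page" query from the table: 100 documents starting at 999,900.
        SolrQuery query = new SolrQuery("*:*");
        query.setSort("price", SolrQuery.ORDER.asc);
        query.setRows(100);
        query.setStart(999_900);

        QueryResponse response = solr.query(query);
        System.out.println("QTime: " + response.getQTime() + " ms, docs returned: "
                + response.getResults().size());
        solr.close();
    }
}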

Typical solutions

Of course, we can try tuning the caches or the queryResultWindowSize setting, but then there is the problem of how to size them: the window may turn out to be insufficient, or the relevant entry may no longer be in Solr's cache, and then the wait for the n-th page of results will again be very long. We can also try adding warming queries, but we cannot prepare every combination, and even if we could, the cache would have to be huge. So none of these solutions gets us the desired result.
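
For context, these knobs live in solrconfig.xml. A hedged sketch follows; the element names come from the stock Solr example configuration, but the sizes and the warming query are purely illustrative.

<!-- Illustrative solrconfig.xml fragment; sizes are examples, not recommendations. -->
<queryResultCache class="solr.LRUCache" size="512" initialCapacity="512" autowarmCount="0"/>

<!-- Cache a window of this many sorted results per query, so nearby pages are cheap. -->
<queryResultWindowSize>200</queryResultWindowSize>

<!-- A warming query run on each new searcher; one combination among far too many. -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="sort">price asc</str><str name="rows">100</str></lst>
  </arr>
</listener>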

Filters, filters, and filters again

This behavior of Solr (and of other Lucene-based applications, too) is caused by the queue of documents, called the priority queue, which, in order to display the last page of results, must collect all documents matching the query and then return the ones located on the requested page. In our case, if we want the first page of search results, the queue needs only 100 entries. However, if we want the last page, Solr will have to put one million documents into the queue. This is, of course, a big simplification.
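
To make that concrete, here is a small, self-contained Java sketch of the principle (an illustration only, not Lucene's actual collector): to serve rows documents at offset start under an ascending sort, the engine keeps a priority queue of start + rows entries and must offer every single matching document to it.

import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Random;

public class PriorityQueueCost {
    public static void main(String[] args) {
        int start = 50_000, rows = 100;
        int queueSize = start + rows; // 50,100 entries, as in the example above

        // Max-heap on price: the head is the largest of the prices kept so far,
        // i.e. the first candidate for eviction under an ascending sort.
        PriorityQueue<Long> queue = new PriorityQueue<>(queueSize, Comparator.reverseOrder());
        Random random = new Random(42);

        // Every one of the million matching documents must be offered to the queue.
        for (int doc = 0; doc < 1_000_000; doc++) {
            long price = random.nextInt(10_001); // prices 0..10,000, as in the test data
            if (queue.size() < queueSize) {
                queue.add(price);
            } else if (price < queue.peek()) { // smaller than the current worst: keep it
                queue.poll();
                queue.add(price);
            }
        }
        // Only now can the 100 requested documents be cut out of the sorted queue.
        System.out.println("Entries collected just to serve one page: " + queue.size());
    }
}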

The idea is to limit the number of documents Lucene must put in the queue. How do we do that? We use filters, which in Solr means the fq parameter. Using a filter limits the number of search results. The ideal queue size would be exactly the one passed in the rows parameter of the query; however, that ideal is rarely attainable. An additional problem is that when sending a filtered query we cannot know the number of results in advance, because we do not know how many documents the filter will let through. The solution is to make two queries instead of one: the first to see how limiting our filter is (using rows=0 and start=0), and the second with an appropriately calculated start parameter (example below).

The maximum product price in the test data is 10,000 and the minimum is 0. So to the first-page query we will add the bracket <0; 1000>, and to the last-page query we will add the bracket <9000; 10000>.
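
Put together, the two-query variant looks roughly like this in SolrJ. This is a sketch under the same assumptions as before (endpoint and collection name), using the <9000; 10000> bracket to fetch the last page:

import java.io.IOException;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocumentList;

public class FilteredDeepPage {
    public static void main(String[] args) throws SolrServerException, IOException {
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/products").build();
        String bracket = "price:[9000 TO 10000]";
        int rows = 100;

        // Query 1: rows=0 - all we need is numFound, i.e. how limiting the filter is.
        SolrQuery count = new SolrQuery("*:*");
        count.addFilterQuery(bracket);
        count.setRows(0);
        long numFound = solr.query(count).getResults().getNumFound();

        // Query 2: the last page *within the filtered set*. With 100,137 documents
        // passing the filter this works out to start=100037, as in the text.
        SolrQuery page = new SolrQuery("*:*");
        page.addFilterQuery(bracket);
        page.setSort("price", SolrQuery.ORDER.asc);
        page.setRows(rows);
        page.setStart((int) Math.max(0, numFound - rows));

        SolrDocumentList lastPage = solr.query(page).getResults();
        System.out.println("Fetched " + lastPage.size() + " of " + numFound + " filtered docs");
        solr.close();
    }
}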

Disadvantages of the filter-based solution

The filter-based solution has one minus, and it is quite a large one: it may happen that the filter narrows the query down to too few results. What then? We have to widen the chosen bracket. Of course, we can calculate optimal brackets from our data, but that depends heavily on the data and the queries, which is why I won't go into it here.

What is the performance after the change?

So let's repeat the tests, now with the filter-based approach. The first query simply returns the first page of results (q=*:*&sort=price+asc&rows=100&start=0&fq=price:[0+TO+1000]). The second case (actually two queries) first checks the number of results and then fetches them (the queries: q=*:*&rows=0&start=0&fq=price:[9000+TO+10000] and q=*:*&sort=price+asc&rows=100&start=100037&fq=price:[9000+TO+10000]). Note the changed start parameter, which reflects the fact that the fq parameter leaves fewer results to page through. The test was carried out in the same way as the previous one: start Solr, run the query (or queries), shut Solr down. The numbers in the table are the arithmetic mean of the query execution times.

Query (or queries)                                                        Average response time (ms)
q=*:*&sort=price+asc&rows=100&start=0&fq=price:[0+TO+1000]                128
q=*:*&rows=0&fq=price:[9000+TO+10000]
  + q=*:*&sort=price+asc&rows=100&start=100037&fq=price:[9000+TO+10000]   422

As you can see, the query performance changed considerably: the deep page now comes back in 422 ms instead of 3,184 ms. We can therefore conclude that we succeeded. Of course, we could be tempted by further optimizations, but for now let's say we are satisfied with this result. I suspect, however, that you may ask the following question:

How is this handled by other search engines?

Good question, and the answer is actually quite simple: one of the methods is to prevent very deep paging altogether. Google, for example, uses this method. Try searching for the word "ask" and going past the 91st page of search results. Google didn't let me. ;)

Conclusions

As you can see, deep-paging performance has improved significantly after our changes. We can now let users page through search results without worrying that doing so will kill our Solr instance and the machine it runs on. Of course, the method is not without flaws, since we need to know certain parameters in advance (e.g., in our case, the maximum price, so we can build the bracket for the deepest pages), but it is a solution that delivers search results with relatively low latency compared to the pre-optimization approach.

Published at DZone with permission of Rafał Kuć, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
