
Rafał Kuć

Owner at Alud Consulting

Białystok, PL

Joined Jun 2011

About

Rafał Kuć is a team leader and software developer. Right now he is a software architect and a Solr and Lucene specialist, mainly focused on Java but open to any tool or programming language that makes reaching his goal easier and faster. Rafał is also one of the founders of the solr.pl site, where he shares his knowledge and helps people with their problems.

Stats

Reputation: 347
Pageviews: 533.4K
Articles: 6
Comments: 2

Articles

Read-Only Collections in Solr
Learn more about read-only collections in Solr.
October 10, 2019
· 12,177 Views · 1 Like
From Solr Master-Slave to the (Solr)Cloud
If you're moving to Solr's distributed search system, there's a lot to keep in mind ranging from indexing differences to testing procedures to keeping ZooKeeper working.
December 15, 2016
· 7,229 Views · 1 Like
Deal With Multi-Tenant Data in Solr
Different techniques can be used to handle multi-tenant data in Solr. This article discusses routing techniques you can use depending on the size and number of shards.
October 2, 2015
· 6,797 Views · 6 Likes
SolrCloud: What Happens When ZooKeeper Fails – Part Two
In the previous blog post about SolrCloud we talked about what happens when the ZooKeeper connection fails and how Solr handles that situation. However, we only covered the query-time behavior of SolrCloud and said that we would get back to the topic of indexing in the future. That future is finally here – let's see what happens to indexing when the ZooKeeper connection is not available.

Looking back

In the SolrCloud – What Happens When ZooKeeper Fails? blog post we showed that Solr can handle querying without any issues when the connection to ZooKeeper has been lost (which can be caused by different reasons). Of course, this is true only until we need to change the cluster topology. In the case of indexing or cluster-change operations, we can't change the cluster state or index documents when the ZooKeeper connection is not working or ZooKeeper failed to read/write the data we need.

Why can we run queries?

The situation is quite simple – querying is not an operation that needs to alter the SolrCloud cluster state. The only thing Solr needs to do is accept the query, run it against the known shards/replicas and gather the results. Cluster topology is not retrieved with each query, so when there is no active ZooKeeper connection (or ZooKeeper failed) we don't have a problem running queries.

There is also one important and not widely known feature of SolrCloud – the ability to return partial results. By adding the shards.tolerant=true parameter to our queries we inform Solr that we can live with partial results and that it should ignore shards that are not available. This means that Solr will return results even if some of the shards in our collection are unavailable. By default, when this parameter is not present or is set to false, Solr will return an error when running a query against a collection that doesn't have all of its shards available.

Why can't we index data?

So why can't we index data when the ZooKeeper connection is not available or when ZooKeeper doesn't have a quorum? Because there is potentially not enough information about the cluster state to process the indexing operation. Solr may not have fresh information about all the shards, replicas, and so on. Because of that, an indexing operation could be routed to the wrong shard (for example, not to the current leader), which can lead to data corruption – and that is why indexing (and cluster-change) operations are simply not possible. It is generally worth remembering that all operations that update the cluster state or collections won't be possible when the ZooKeeper quorum is not visible to Solr (in our test case, that means losing connectivity to a single ZooKeeper server).

Of course, we could leave you with what we wrote above, but let's check whether all of that is true.

Running ZooKeeper

A very simple step. For the purpose of the test we only need a single ZooKeeper instance, started with the following command from the ZooKeeper installation directory: bin/zkServer.sh start

We should see the following information on the console:

JMX enabled by default
Using config: /users/gro/solry/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED

That means we have a running ZooKeeper server.

Starting two Solr instances

To run the test we used the newest available Solr version – 5.2.1 when this blog post was published. To run two Solr instances we used the following command: bin/solr start -e cloud -z localhost:2181

Solr asked a few questions when it was starting, and the answers were the following: number of instances: 2, collection name: gettingstarted, number of shards: 2, replication count: 1, configuration name: data_driven_schema_configs.

Let's index a few documents

To see that Solr is really running, we indexed a few documents by running the following command: bin/post -c gettingstarted docs/

If everything went well, after running the following command: curl -XGET 'localhost:8983/solr/gettingstarted/select?indent=true&q=*:*&rows=0' we should see Solr respond with XML reporting a non-zero number of found documents. We have indexed our documents and we have Solr running.

Let's stop ZooKeeper and index data

To stop the ZooKeeper server we just run the following command in the ZooKeeper installation directory: bin/zkServer.sh stop

And now let's try to index our data again: bin/post -c gettingstarted docs/

This time, instead of the data being written into the collection, we get an error response similar to the following one:

POSTing file index.html (text/html) to [base]/extract
SimplePostTool: WARNING: Solr returned an error #503 (Service Unavailable) for url: http://localhost:8983/solr/gettingstarted/update/extract?resource.name=%2fusers%2fgro%2fsolry%2f5.2.1%2fdocs%2findex.html&literal.id=%2fusers%2fgro%2fsolry%2f5.2.1%2fdocs%2findex.html
SimplePostTool: WARNING: Response: 503 "Cannot talk to ZooKeeper - Updates are disabled."

As we can see, the lack of ZooKeeper connectivity resulted in Solr not being able to index data. Querying, of course, still works. After turning ZooKeeper on again, indexing will succeed, because Solr will automatically reconnect to ZooKeeper and start working again.

Short summary

Of course, this and the previous blog post about ZooKeeper and SolrCloud only scratch the surface of what happens when the ZooKeeper connection is not available. A very good test that shows data-consistency-related information can be found at http://lucidworks.com/blog/call-maybe-solrcloud-jepsen-flaky-networks/ . I really recommend it if you would like to know what happens to SolrCloud in various emergency situations.
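The tolerant-query behavior described above can be sketched with a small Python snippet that builds such a request URL. This is only an illustration of the query parameters from the article; the tolerant_query_url helper is my own and not part of any Solr client library, and the host and collection are the ones from this test setup.

```python
from urllib.parse import urlencode

def tolerant_query_url(base, collection, q, tolerant=True):
    """Build a Solr select URL; shards.tolerant=true asks Solr to
    return partial results instead of an error when shards are down."""
    params = {"q": q, "shards.tolerant": "true" if tolerant else "false"}
    return f"{base}/solr/{collection}/select?{urlencode(params)}"

url = tolerant_query_url("http://localhost:8983", "gettingstarted", "*:*")
print(url)
```

With tolerant=False (the default Solr behavior), a query against a collection with a missing shard would come back as an error instead of partial results.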
July 2, 2015
· 17,561 Views
Sorting by Function Value in Solr
In Solr 3.1 and later we have a very interesting functionality which enables us to sort by function value. What does that give us? Actually, a few interesting possibilities.

Let's start

The first example that comes to mind, perhaps because of a project I worked on some time ago, is sorting by the distance between two geographical points. Previously, implementing such functionality required changes to Solr (for example, LocalSolr or LocalLucene). With Solr 3.1 and later, you can sort your search results by the value returned by a defined function. For example, Solr has the dist function, which calculates the distance between two points. One variant of the function accepts five parameters: the algorithm and two pairs of points. If, using this function, we would like to sort search results in ascending order of distance from the point with latitude and longitude 0.0, we should add the following sort parameter to the Solr query:

...sort=dist(2, geo_x, geo_y, 0, 0) asc

I suspect that the most commonly used values of the first parameter will be:

1 – calculation based on the Manhattan metric
2 – calculation of the Euclidean distance

A few words about performance

Everything is fine so far, but what does it look like in terms of performance? I ran two simple tests. In the first test I indexed 200,000 documents, each consisting of four fields: an identifier (numeric field), a description (text field) and a location (two numeric fields). In order not to obscure the sorting test results, I used one of the simplest functions currently available in Solr – the sum function, which sums its two arguments. I compared the query time of the default sort (by score) with queries that sorted by the function value. The following table shows the results of the test:

Query                                    Results   Query time   Re-query time
q=*:*&sort=score+desc                    200,000   31 ms        0 ms
q=*:*&sort=sum(geo_x,geo_y)+desc         200,000   813 ms       0 ms
q=opis:ala&sort=score+desc               200,000   47 ms        1 ms
q=opis:ala&sort=sum(geo_x,geo_y)+desc    200,000   797 ms       1 ms

The second test compared sorting by a string field to sorting by a function value. The test was almost identical to the first: I indexed 200,000 documents (with an additional string-typed field: name_sort) and used the sum function. The following table shows the results:

Query                                    Results   Query time   Re-query time
q=*:*&sort=opis_sort+desc                200,000   267 ms       0 ms
q=*:*&sort=sum(geo_x,geo_y)+desc         200,000   823 ms       0 ms
q=opis:ala&sort=opis_sort+desc           200,000   266 ms       1 ms
q=opis:ala&sort=sum(geo_x,geo_y)+desc    200,000   810 ms       1 ms

The tests above show that sorting by a function value is much slower than the default sort order (which you would expect). It is also slower than sorting by a string-based field, but there the difference is not as significant.

A few words at the end

Of course, the above test only glides over the topic of the efficiency of sorting with Solr functions; however, it shows a clear relationship. Given that in most cases this will not be the default sort method, and that it gives us a really powerful tool, it seems to me that this is a feature worth remembering. It will definitely be worth using when the requirements say that we have to sort by a value that depends on both the query and indexed values – as in the case of sorting by distance from a point specified by the user.
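What the first parameter of the dist function selects can be illustrated with a short Python sketch of the two metrics mentioned above. The function names are mine and this only illustrates the math, not Solr's implementation.

```python
import math

def manhattan(x1, y1, x2, y2):
    # dist(1, ...): sum of absolute coordinate differences
    return abs(x1 - x2) + abs(y1 - y2)

def euclidean(x1, y1, x2, y2):
    # dist(2, ...): straight-line distance
    return math.hypot(x1 - x2, y1 - y2)

# Distance from (3, 4) to the origin, as in sort=dist(2, geo_x, geo_y, 0, 0)
print(manhattan(3, 4, 0, 0))  # → 7
print(euclidean(3, 4, 0, 0))  # → 5.0
```

The Euclidean metric is what most people mean by geographic "distance"; the Manhattan metric is cheaper to compute but overestimates diagonal distances.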
September 27, 2011
· 19,534 Views
Lucene and Solr's CheckIndex to the Rescue!
While using Lucene and Solr we are used to very high reliability. However, there may come a day when Solr informs us that our index is corrupted, and we need to do something about it. Is the only way to repair the index to restore it from a backup or to do a full reindex? No – there is hope in the form of the CheckIndex tool.

What is CheckIndex?

CheckIndex is a tool available in the Lucene library which allows you to check the index files and create new segments that do not contain the problematic entries. This means that this tool, with a small loss of data, is able to repair a broken index, and thus save us from having to restore the index from a backup (if we have one) or fully reindex all the documents that were stored in Solr.

Where do I start?

Please note that, according to what we find in the Javadocs, this tool is experimental and may change in the future. Therefore, before starting to work with it we should create a copy of the index. In addition, it is worth knowing that the tool analyzes the index byte by byte, so for large indexes the analysis and repair may take a long time. It is important not to run the tool with the -fix option while the index is being used by Solr or another application based on the Lucene library. Finally, be aware that running the tool in repair mode may result in the removal of some or all of the documents stored in the index.

How to run it

To run the utility, go to the directory where the Lucene library files are located and run the following command:

java -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex index_path -fix

In my case it looked as follows:

java -cp lucene-core-2.9.3.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex e:\solr\solr\data\index\ -fix

After a while I got the following information:

Opening index @ e:\solr\solr\data\index
Segments file=segments_2 numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
  1 of 1: name=_0 docCount=19
    compound=false
    hasProx=true
    numFiles=11
    size (MB)=0,018
    diagnostics = {os.version=6.1, os=Windows 7, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=flush, os.arch=x86, java.version=1.6.0_23, java.vendor=Sun Microsystems Inc.}
    no deletions
    test: open reader.........OK
    test: fields..............OK [15 fields]
    test: field norms.........OK [15 fields]
    test: terms, freq, prox...OK [900 terms; 1517 terms/docs pairs; 1707 tokens]
    test: stored fields.......OK [232 total field count; avg 12,211 fields per doc]
    test: term vectors........OK [3 total vector count; avg 0,158 term/freq vector fields per doc]

No problems were detected with this index.

This means that the index is correct and there was no need for any corrective action. Additionally, you can learn some interesting things about the index.

Broken index

But what happens in the case of a broken index? There is only one way to see it – let's try. So I broke one of the index files and ran the CheckIndex tool. The following appeared on the console:

Opening index @ e:\solr\solr\data\index
Segments file=segments_2 numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
  1 of 1: name=_0 docCount=19
    compound=false
    hasProx=true
    numFiles=11
    size (MB)=0,018
    diagnostics = {os.version=6.1, os=Windows 7, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=flush, os.arch=x86, java.version=1.6.0_23, java.vendor=Sun Microsystems Inc.}
    no deletions
    test: open reader.........FAILED
    WARNING: fixIndex() would remove reference to this segment; full exception:
    org.apache.lucene.index.CorruptIndexException: did not read all bytes from file "_0.fnm": read 150 vs size 152
        at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:370)
        at org.apache.lucene.index.FieldInfos.(FieldInfos.java:71)
        at org.apache.lucene.index.SegmentReader$CoreReaders.(SegmentReader.java:119)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:652)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:605)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:491)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)

WARNING: 1 broken segments (containing 19 documents) detected
WARNING: 19 documents will be lost

NOTE: will write new segments file in 5 seconds; this will remove 19 docs from the index. THIS IS YOUR LAST CHANCE TO CTRL+C!
  5...
  4...
  3...
  2...
  1...
Writing...
OK
Wrote new segments file "segments_3"

As you can see, all 19 documents that were in the index have been removed. This is an extreme case, but you should realize that this tool may work like this.

The end

If you remember the basic assumptions associated with the use of the CheckIndex tool, you may find yourself in a situation when this tool comes in handy and you will not have to ask yourself a question like "when was the last backup made?".
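The backup-then-repair workflow recommended above can be sketched in Python. The build_checkindex_cmd and backup_index helpers are hypothetical (they simply assemble the java invocation shown in the article and copy the index directory); they do not ship with Lucene.

```python
import shutil

def build_checkindex_cmd(core_jar, index_dir, fix=False):
    """Assemble the CheckIndex command line from the article.
    Only pass fix=True when no Solr/Lucene process is using the index."""
    cmd = ["java", "-cp", core_jar, "-ea:org.apache.lucene...",
           "org.apache.lucene.index.CheckIndex", index_dir]
    if fix:
        cmd.append("-fix")  # repair mode may drop documents
    return cmd

def backup_index(index_dir, backup_dir):
    # Always copy the index first: -fix may remove some or all documents.
    shutil.copytree(index_dir, backup_dir)

print(" ".join(build_checkindex_cmd("lucene-core-2.9.3.jar",
                                    "/solr/data/index", fix=True)))
```

Running the command without -fix first is the safer order: it reports broken segments without touching the index, so you can decide whether losing those documents is acceptable.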
September 22, 2011
· 20,888 Views

Comments

Learning Haskell

Dec 27, 2011 · Mr B Loid

:-)
David Flanagan - New Book: The Ruby Programming Language

Nov 08, 2011 · Gerd Storm

Sorry Tommaso, I've made a mistake during translation from Polish to English, the source article has been corrected - http://solr.pl/en/2011/11/07/lucene-eurocon-2011-day-one/ :)

