Doug Turnbull

Consultant at OpenSource Connections

Charlottesville, US

Joined Feb 2013

http://opensourceconnections.com

About

Search and recommendations expert. Get in touch if you need help making search or recommendations more relevant to your users' needs, or if you'd like to build a new search/recommendations experience for your users.

Stats

Reputation: 308
Pageviews: 323.4K
Articles: 7
Comments: 0
Articles

High-Quality Recommendation Systems with Elasticsearch: Part I
We argue that recommendations and search are two sides of the same coin. Both rank content for a user based on “relevance.” The only difference is whether a keyword query is provided. Read this article to find out more.
Updated September 18, 2016
· 11,335 Views · 7 Likes
Precision and Recall by Example: The Two Pillars of Search Relevance
An excerpt from the upcoming book Relevant Search on the fundamentals of search relevance.
April 6, 2016
· 9,949 Views · 7 Likes
Solr vs Elasticsearch: Battle of The Query DSLs
Comparing Solr and Elasticsearch on how they control ranking, and how their query DSLs stack up against each other.
February 10, 2016
· 10,490 Views · 11 Likes
Correctly Using Apache Camel’s AdviceWith in Unit Tests
We care a lot about the stuff that goes around Solr and Elasticsearch in our clients' infrastructure. One area that always seems to be reinvented, for better or worse, is the data ETL/ingest path from data source X to the search engine. One tool we've enjoyed using for basic ETL is Apache Camel. Camel is an extremely feature-rich Java data integration framework for wiring up just about anything to anything else. And by anything I mean anything: file systems, databases, HTTP, search engines, Twitter, IRC, etc.

One area I initially struggled with in Camel was exactly how to test my code. Let's say I have defined a simple Camel route like this:

```java
from("file:inbox")
    .unmarshal(csv)                // parse as CSV
    .split(body())                 // now we're operating on individual CSV lines
    .bean("customTransformation")  // do some custom operation on the CSV line
    .to("solr://localhost:8983/solr/collection1/update");
```

Great! Now if you've gotten into Camel testing, you may know there's something called "AdviceWith". What is this interesting-sounding thing? I think of it as a way of saying "take these routes and muck with them": stub out this, intercept that and don't forward, etc. Exactly the kind of slicing and dicing I'd like to do in my unit tests! I definitely recommend reading the docs, but here's a real step-by-step built around the places where you're probably going to get stuck (because they're where I got stuck!) getting AdviceWith to work for your tests.

1. Use CamelTestSupport

Most importantly, we need to define a test that extends CamelTestSupport. CamelTestSupport automatically creates and starts our Camel context for us.

```java
public class ItGoesToSolrTest extends CamelTestSupport {
    ...
}
```

2. Specify the route builder we're testing

In our test, we need to tell CamelTestSupport where it can find its routes:

```java
@Override
protected RouteBuilder createRouteBuilder() {
    return new MyProductionRouteBuilder();
}
```

3. Register any beans we need

It's probably the case that you're using Java beans with Camel. If you're using the bean integration and referring to beans by name in your Camel routes, you'll need to register those names against instances of your classes:

```java
@Override
protected Context createJndiContext() throws Exception {
    JndiContext context = new JndiContext();
    context.bind("customTransformation", new CustomTransformation());
    return context;
}
```

4. Monkey with our production routes using AdviceWith

Next we need to apply an AdviceWithRouteBuilder before each test:

```java
@Before
public void mockEndpoints() throws Exception {
    AdviceWithRouteBuilder mockSolr = new AdviceWithRouteBuilder() {
        @Override
        public void configure() throws Exception {
            // intercept messages bound for Solr and redirect them to a mock
            interceptSendToEndpoint("solr://localhost:8983/solr/collection1/update")
                .skipSendToOriginalEndpoint()
                .to("mock:catchSolrMessages");
        }
    };
    context.getRouteDefinitions().get(1).adviceWith(context, mockSolr);
}
```

There are a couple of things to notice here. In configure() we simply snag an endpoint (in this case Solr), and then we have complete freedom to do whatever we want; in this case, we're rewiring it to a mock endpoint we can use for testing. Notice also how we fetch the route definition by index (in this case 1) to snag the route we're testing and would like to monkey with. This is how I've seen it in most Camel examples, but it's hard to guess what index Camel is going to assign to your route. A better way is to give our route definition a name:

```java
from("file:inbox")
    .routeId("csvToSolrRoute")
    .unmarshal(csv) // parse as CSV
    ...
```

Then we can refer to this name when retrieving our route:

```java
context.getRouteDefinition("csvToSolrRoute").adviceWith(context, mockSolr);
```

5. Tell CamelTestSupport you want to manually start/stop Camel

One problem you will run into with the normal tutorials is that CamelTestSupport may start routes before your mocks have taken hold. Your mocked routes then won't be part of what CamelTestSupport actually started. You'll be pulling your hair out wondering why Camel insists on forwarding documents to an actual Solr instance and not your test endpoint. Luckily, CamelTestSupport comes to the rescue with a simple method you can override to communicate your intent to start and stop the Camel context yourself:

```java
@Override
public boolean isUseAdviceWith() {
    return true;
}
```

Then in your test, you'll need to be sure to do:

```java
@Test
public void foo() {
    context.start();
    // tests!
    context.stop();
}
```

6. Write a test!

Now you're equipped to try out a real test:

```java
@Test
public void testWithRealFile() throws Exception {
    MockEndpoint mockSolr = getMockEndpoint("mock:catchSolrMessages");
    File testCsv = getTestFile();
    context.start();
    mockSolr.expectedMessageCount(1);
    FileUtils.copyFileToDirectory(testCsv, new File("inbox"));
    mockSolr.assertIsSatisfied();
    context.stop();
}
```

And that's just scratching the surface of Camel's testing capabilities. Check out the Camel docs for information on stimulating endpoints directly with the ProducerTemplate (letting you avoid real files) and all kinds of other goodies. Anyway, hopefully my experiences with AdviceWith help you get it up and running in your tests! I'd love to hear about your experiences, or any tips I'm missing, either in the comments or via email. And if you'd love to use Solr or Elasticsearch for search and analytics but can't figure out how to integrate them with your data infrastructure, contact us! Maybe there's a Camel recipe we could cook up for you that would do just the trick.
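The intercept-and-redirect idea behind AdviceWith is language-agnostic: swap the real sink out for a recording mock before driving the pipeline, then assert on what reached it. Here's a tiny illustrative Python sketch of that pattern (the `ingest` function and its arguments are my own toy, not Camel API):

```python
from unittest import mock

def ingest(lines, send_to_solr):
    """Toy ETL pipeline: split CSV-ish lines and forward each record
    to whatever endpoint callable we were handed."""
    for line in lines:
        record = line.strip().split(',')
        send_to_solr(record)

# In a test, the "endpoint" is a mock that records every call
solr_mock = mock.Mock()
ingest(["1,banana", "2,apple"], send_to_solr=solr_mock)

assert solr_mock.call_count == 2
solr_mock.assert_any_call(['1', 'banana'])
```

The assertions play the role of `expectedMessageCount(1)` and `assertIsSatisfied()` above: the production logic runs unchanged, only the endpoint is swapped.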
May 16, 2014
· 23,544 Views · 1 Like
Build Your Own Custom Lucene Query and Scorer
Every now and then we'll come across a search problem that can't simply be solved with plain Solr relevancy. This usually means a customer knows exactly how documents should be scored. They may have little tolerance for close approximations of this scoring through Solr boosts, function queries, etc. They want Lucene-based technology for text analysis and performant data structures, but they need to be extremely specific in how documents are scored relative to each other. For those extremely specialized cases, we can prescribe a little outpatient surgery for your Solr install: building your own Lucene Query.

This Is the Nuclear Option

Before we dive in, a word of caution. Unless you just want the educational experience, building a custom Lucene Query should be the "nuclear option" for search relevancy. It's very fiddly and there are many ins and outs. If you're actually considering this to solve a real problem, you've already gone down the following paths:

1. You've used Solr's extensive set of query parsers and features, including function queries, joins, etc. None of this solved your problem.
2. You've exhausted the ecosystem of plugins that extend the capabilities in (1). That didn't work.
3. You've implemented your own query parser plugin that takes user input and generates existing Lucene queries to do the work. This still didn't solve your problem.
4. You've thought carefully about your analyzers, massaging your data so that at index time and query time text lines up exactly as it should to optimize the behavior of existing search scoring. This still didn't get what you wanted.
5. You've implemented your own custom Similarity that modifies how Lucene calculates the traditional relevancy statistics: query norms, term frequency, etc.
6. You've tried Lucene's CustomScoreQuery to wrap an existing Query and alter each document's score via a callback. This still wasn't low-level enough; you needed even more control.

If you're still reading, either you think this is going to be fun and educational (good for you!) or you're one of the minority that must control exactly what happens with search. (If you don't know which you are, you can of course contact us for professional services.) OK, back to the action...

Refresher: Lucene Searching 101

Recall that to search in Lucene, we need to get hold of an IndexSearcher. The IndexSearcher performs searches over an IndexReader. Assuming we've created an index, with these classes we can perform searches like in this code:

```java
Directory dir = new RAMDirectory();
IndexReader idxReader = DirectoryReader.open(dir);
IndexSearcher idxSearcher = new IndexSearcher(idxReader);
Query q = new TermQuery(new Term("field", "value"));
idxSearcher.search(q);
```

Let's summarize the objects we've created:

- Directory: Lucene's interface to a file system. This is pretty straightforward; we won't be diving in here.
- IndexReader: access to the data structures in Lucene's inverted index. If we want to look up a term and visit every document it exists in, this is where we'd start. If we wanted to play with term vectors, offsets, or anything else stored in the index, we'd look here for that stuff as well.
- IndexSearcher: wraps an IndexReader for the purpose of taking search queries and executing them.
- Query: how we expect the searcher to perform the search, encompassing both scoring and which documents are returned. In this case, we're searching for "value" in field "field". This is the bit we want to toy with.

In addition to these classes, a support class exists behind the scenes:

- Similarity: defines the rules/formulas for calculating norms at index time and query normalization.

Now, with this outline, let's think about a custom Lucene Query we can implement to help us learn. How about a query that searches for terms backwards? If a document matches a term backwards (like "ananab" for "banana"), we'll return a score of 5.0. If the document matches the forwards version, we'll still return the document, but with a score of 1.0 instead. We'll call this query BackwardsTermQuery. The full example is hosted on GitHub.

A Tale of 3 Classes: A Query, a Weight, and a Scorer

Before we sling code, let's talk about general architecture. A custom Lucene query follows this general structure:

- A custom Query class, inheriting from Query
- A custom Weight class, inheriting from Weight
- A custom Scorer class, inheriting from Scorer

These three objects wrap each other: a Query creates a Weight, and a Weight in turn creates a Scorer.

A Query is itself a fairly straightforward class. One of its main responsibilities, when passed to the IndexSearcher, is to create a Weight instance. Beyond that, there are additional responsibilities to Lucene and to users of your Query that we'll discuss in the "Query" section below.

Why does a Query create a Weight? Lucene needs a way to track IndexSearcher-level statistics specific to each query while retaining the ability to reuse the query across multiple IndexSearchers. This is the role of the Weight class. When performing a search, the IndexSearcher asks the Query to create a Weight instance. This instance becomes the container holding high-level statistics for the Query, scoped to this IndexSearcher (we'll go over these steps more in the "Weight" section below). The IndexSearcher safely owns the Weight and can abuse and dispose of it as needed. If the Query later gets reused by another IndexSearcher, a new Weight simply gets created.

Once an IndexSearcher has a Weight and has calculated any IndexSearcher-level statistics, its next task is to find matching documents and score them. To do this, the Weight in turn creates a Scorer. Just as the Weight is tied closely to an IndexSearcher, a Scorer is tied to an individual IndexReader. Now, this may seem a little odd: in our code above, the IndexSearcher always has exactly one IndexReader, right? Not quite.
See, a little hidden implementation detail is that IndexReaders may actually wrap other, smaller IndexReaders, each tied to a different segment of the index. An IndexSearcher therefore needs the ability to score documents across multiple, independent IndexReaders. How your Scorer should iterate over matches and score documents is outlined in the "Scorer" section below.

So, to summarize, we can expand the last line from our example above...

```java
idxSearcher.search(q);
```

...into this pseudocode:

```java
Weight w = q.createWeight(idxSearcher);
// IndexSearcher-level calculations for weight
for (each IndexReader idxReader) {
    Scorer s = w.scorer(idxReader);
    // collect matches and score them
}
```

Now that we have the basic flow down, let's pick apart the three classes in a little more detail for our custom implementation.

Our Custom Query

What should our custom Query implementation look like? Query implementations always have two audiences: (1) Lucene and (2) users of your Query implementation. For your users, expose whatever methods you need to modify how a searcher matches and scores with your query. Want to return as matches only a third of the documents that match the query? Want to punish the score because the document is longer than the query? Add the appropriate modifier on the query that influences the scorer's behavior. For our BackwardsTermQuery, we don't expose accessors that modify the behavior of the search; the user simply uses the constructor to specify the term and field to search. In our constructor, we reuse Lucene's existing TermQuery for searching individual terms in a document.

```java
private TermQuery backwardsQuery;
private TermQuery forwardsQuery;

public BackwardsTermQuery(String field, String term) {
    // A wrapped TermQuery for the reversed string
    Term backwardsTerm = new Term(field, new StringBuilder(term).reverse().toString());
    backwardsQuery = new TermQuery(backwardsTerm);
    // A wrapped TermQuery for the forward term
    Term forwardsTerm = new Term(field, term);
    forwardsQuery = new TermQuery(forwardsTerm);
}
```

Just as importantly, be sure your Query meets the expectations of Lucene. Most importantly, you MUST override the following:

- createWeight()
- hashCode()
- equals()

The method createWeight() we've discussed: this is where you'll create a Weight instance for an IndexSearcher. Pass along any parameters that will influence the scoring algorithm, as the Weight will in turn be creating a Scorer. Even though they are not abstract methods, overriding hashCode() and equals() is very important. These methods are used by Lucene/Solr to cache queries and results: if two queries are equal, there's no reason to rerun the query. Failing to implement them correctly could mean seeing the results of your first query repeated for later queries. You'll see your search for "peas" work great; then you'll search for "bananas" and see the "peas" search results. Override equals() and hashCode() so that "peas" != "bananas".

Our BackwardsTermQuery implements createWeight() by creating a custom BackwardsWeight that we'll cover below:

```java
@Override
public Weight createWeight(IndexSearcher searcher) throws IOException {
    return new BackwardsWeight(searcher);
}
```

BackwardsTermQuery has fairly boilerplate equals() and hashCode() implementations that pass through to the wrapped TermQuerys. Be sure equals() includes all the boilerplate, such as the check for self-comparison, the call to super.equals(), the class comparison, etc. By using Lucene's unit test suite, we can get a lot of good checks that our implementation of these is correct.

```java
@Override
public boolean equals(Object other) {
    if (this == other) {
        return true;
    }
    if (!super.equals(other)) {
        return false;
    }
    if (getClass() != other.getClass()) {
        return false;
    }
    BackwardsTermQuery otherQ = (BackwardsTermQuery) other;
    if (otherQ.getBoost() != getBoost()) {
        return false;
    }
    return otherQ.backwardsQuery.equals(backwardsQuery)
        && otherQ.forwardsQuery.equals(forwardsQuery);
}

@Override
public int hashCode() {
    return super.hashCode() + backwardsQuery.hashCode() + forwardsQuery.hashCode();
}
```

Our Custom Weight

You may choose to use Weight simply as a mechanism to create Scorers (where the real meat of search scoring lives). Even so, your custom Weight class must at least provide boilerplate implementations of the query normalization methods, even if it largely ignores what is passed in:

- getValueForNormalization()
- normalize()

These methods participate in a little ritual that the IndexSearcher puts your Weight through, together with the Similarity, for query normalization. To summarize the query normalization code in IndexSearcher:

```java
float v = weight.getValueForNormalization();
float norm = getSimilarity().queryNorm(v);
weight.normalize(norm, 1.0f);
```

Great, what does this code do? A value is extracted from the Weight and passed to a Similarity instance that "normalizes" it; the Weight then receives the normalized value back. In short, this lets the IndexSearcher give the Weight some information about how its "value for normalization" compares to the rest of what this searcher is searching. At a very high level, "value for normalization" could mean anything, but it generally means "what I think my weight is", and what the Weight receives back is the searcher saying "no, really, here is your weight". The details of what that means depend on the Similarity and Weight implementations. It's expected that the Weight's generated Scorer will use this normalized weight in scoring.
You can choose to do whatever you want in your own Scorer, including completely ignoring what's passed to normalize(). While our Weight doesn't factor into the scoring calculation, for consistency's sake we'll participate in the little ritual by overriding these methods:

```java
@Override
public float getValueForNormalization() throws IOException {
    return backwardsWeight.getValueForNormalization()
         + forwardsWeight.getValueForNormalization();
}

@Override
public void normalize(float norm, float topLevelBoost) {
    backwardsWeight.normalize(norm, topLevelBoost);
    forwardsWeight.normalize(norm, topLevelBoost);
}
```

Outside of these query normalization details, and implementing scorer(), little else happens in the Weight. You may, however, perform whatever other work requires an IndexSearcher in the Weight constructor. In our implementation, we don't perform any additional steps with the IndexSearcher.

The final and most important requirement of Weight is to create a Scorer. For BackwardsWeight, we construct our custom BackwardsScorer, passing it scorers created from each of the wrapped queries:

```java
@Override
public Scorer scorer(AtomicReaderContext context, boolean scoreDocsInOrder,
                     boolean topScorer, Bits acceptDocs) throws IOException {
    Scorer backwardsScorer = backwardsWeight.scorer(context, scoreDocsInOrder, topScorer, acceptDocs);
    Scorer forwardsScorer = forwardsWeight.scorer(context, scoreDocsInOrder, topScorer, acceptDocs);
    return new BackwardsScorer(this, context, backwardsScorer, forwardsScorer);
}
```

Our Custom Scorer

The Scorer is the real meat of the search work. Responsible for identifying matches and providing scores for those matches, this is where the lion's share of our customization occurs. It's important to note that a Scorer is also a Lucene DocIdSetIterator. A DocIdSetIterator is a cursor into a set of documents in the index. It provides three important methods:

- docID(): the ID of the current document (this is an internal Lucene ID, not the Solr "id" field you might have in your index)
- nextDoc(): advance to the next document
- advance(target): advance (seek) to the target

One uses a DocIdSetIterator by first calling nextDoc() or advance(), then reading docID() to get the iterator's current location. The docIDs only increase as they are iterated over. By implementing this interface, a Scorer acts as an iterator over matches in the index. A Scorer for the query "field1:cat" can be iterated over in this manner to return all the documents that match the "cat" query. In fact, if you recall from my previous article, this is exactly how terms are stored in the search index. You can choose either to figure out how to correctly iterate through the documents in a search index yourself, or to use other Lucene queries as building blocks. The latter is often simplest: for example, if you wish to iterate over the set of documents containing two terms, simply use the Scorer corresponding to a BooleanQuery for iteration purposes.

The first method of our Scorer to look at is docID(). It works by reporting the lowest docID() of our underlying scorers. That scorer can be thought of as sitting "before" the other in the index, and since we want to report numerically increasing docIDs, we always want to choose its value:

```java
@Override
public int docID() {
    int backwardsDocId = backwardsScorer.docID();
    int forwardsDocId = forwardsScorer.docID();
    if (backwardsDocId <= forwardsDocId && backwardsDocId != NO_MORE_DOCS) {
        currScore = BACKWARDS_SCORE;
        return backwardsDocId;
    } else if (forwardsDocId != NO_MORE_DOCS) {
        currScore = FORWARDS_SCORE;
        return forwardsDocId;
    }
    return NO_MORE_DOCS;
}
```

Similarly, we always want to advance the scorer with the lowest docID, moving it ahead. We then report our current position by returning docID(), which, as we've just seen, reports the docID of the scorer that advanced the least in the nextDoc() operation:

```java
@Override
public int nextDoc() throws IOException {
    int currDocId = docID();
    // advance one or both scorers
    if (currDocId == backwardsScorer.docID()) {
        backwardsScorer.nextDoc();
    }
    if (currDocId == forwardsScorer.docID()) {
        forwardsScorer.nextDoc();
    }
    return docID();
}
```

In our advance() implementation, we let each scorer advance. An advance() implementation promises to land docID() either exactly on or past the target. The call to docID() after we advance will report either that one or both scorers are on the target, or the lowest docID past the target:

```java
@Override
public int advance(int target) throws IOException {
    backwardsScorer.advance(target);
    forwardsScorer.advance(target);
    return docID();
}
```

What a Scorer adds on top of DocIdSetIterator is the score() method. When score() is called, it is expected to return a score for the current document (the doc at docID()). Using the full capabilities of the IndexReader, any amount of information stored in the index can be consulted to arrive at a score, either in score() or while iterating documents in nextDoc()/advance(). Given the docID, you can access the term vector for that document (if available) to perform more sophisticated calculations. In our query, we simply keep track of whether the current docID came from the wrapped backwards term scorer (indicating a match on the reversed term) or the forwards scorer (indicating a match on the normal, unreversed term). Recall that docID() is called on every advance()/nextDoc(); notice that it updates currScore every time the document advances.

```java
@Override
public float score() throws IOException {
    return currScore;
}
```

A Note on Unit Testing

Now that we have an implementation of a search query, we'll want to test it! I highly recommend using Lucene's test framework. Lucene will randomly inject different implementations of various support classes and index implementations to throw your code off balance. Additionally, Lucene provides test implementations of classes such as IndexReader that check whether your Query correctly fulfills its contract. In my work, I've had numerous cases where tests failed intermittently, pointing to places where my use of Lucene's data structures subtly violated the expected contract. An example unit test is included in the GitHub project associated with this blog post.

Wrapping Up

That's a lot of stuff, and I didn't even cover everything there is to know! As an exercise for the reader, you can explore the Scorer methods cost() and freq(), as well as the rewrite() method of Query, used optionally for optimization. Additionally, I haven't explored how most of the traditional search queries use a framework of scorers/weights that don't actually inherit from Scorer or Weight, known as SimScorer and SimWeight. These support classes consult a Similarity instance to customize the calculation of certain search statistics: tf, converting a payload to a boost, etc.

In short, there's a lot here, so tread carefully; there are plenty of fiddly bits out there! But have fun: creating a custom Lucene query is a great way to really understand how search works, and it's the last resort for solving relevancy problems short of creating your own search engine. And if you have relevancy issues, contact us! If you don't know whether you do, our search relevancy product, Quepid, might be able to tell you.
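The lowest-docID iteration that docID() and nextDoc() perform above is really a merge of two sorted posting lists. Here's an illustrative Python sketch of that merge, independent of Lucene (the function and list representation are my own toy, with the article's scores: 5.0 for a backwards match, 1.0 for a forwards match, backwards winning a tie):

```python
def merge_doc_ids(backwards_docs, forwards_docs):
    """Merge two sorted posting lists into (docID, score) pairs,
    always emitting the lowest docID first, like the custom Scorer."""
    bi, fi = 0, 0
    out = []
    while bi < len(backwards_docs) or fi < len(forwards_docs):
        b = backwards_docs[bi] if bi < len(backwards_docs) else float("inf")
        f = forwards_docs[fi] if fi < len(forwards_docs) else float("inf")
        if b <= f:
            out.append((b, 5.0))  # backwards match (ties score as backwards)
            bi += 1
            if b == f:
                fi += 1  # both scorers sat on the same doc; advance both
        else:
            out.append((f, 1.0))  # forwards match
            fi += 1
    return out

print(merge_doc_ids([2, 5], [3, 5, 9]))
# [(2, 5.0), (3, 1.0), (5, 5.0), (9, 1.0)]
```

The `float("inf")` sentinel plays the role of NO_MORE_DOCS: an exhausted scorer can never be the lowest, so the other side drains naturally.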
February 10, 2014
· 13,798 Views
Debugging “Wrong FS expected: file:///” exception from HDFS
I just spent some time putting together some basic Java code to read data from HDFS. Pretty basic stuff; no MapReduce involved. Pretty boilerplate code, like the code in this popular tutorial on the topic. No matter what, I kept hitting my head on this error:

```
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:9000/user/hadoop/DOUG_SVD/out.txt, expected: file:///
```

If you check out the tutorial above, what's supposed to happen is that an instance of Hadoop's Configuration encounters an fs.default.name property in one of the config files it's given. The Configuration should see that this property has a value of hdfs://localhost:9000. When you then use the Configuration to create a Hadoop FileSystem instance, it should happily read this property and process paths from HDFS. That's a long way of saying these lines of Java code:

```java
// pick up config files off the classpath
Configuration conf = new Configuration();
// explicitly add other config files
conf.addResource("/home/hadoop/conf/core-site.xml");
// create a FileSystem object needed to load file resources
FileSystem fs = FileSystem.get(conf);
// load files and stuff below!
```

Well... my Hadoop config files (core-site.xml) appear to be set up correctly. They appear to be on my CLASSPATH. I'm even trying to explicitly add the resource. Basically, I've followed all the troubleshooting tips you're supposed to follow when you encounter this exception, but I'm STILL getting it. Head, meet wall. This has to be something stupid.

Troubleshooting Hadoop's Configuration and FileSystem Objects

Before I reveal my dumb mistake in the code above: it turns out there are some helpful functions for debugging these kinds of problems. Since Configuration is just a bunch of key/value pairs loaded from a set of resources, it's useful to know which resources it thinks it loaded and which properties it thinks it loaded from those files:

- getRaw(): returns the raw value for a configuration item (like conf.getRaw("fs.default.name"))
- toString(): Configuration's toString() shows the resources it loaded

You can similarly check out FileSystem's helpful toString() method. It nicely lays out where it thinks it's pointing (native vs. HDFS vs. S3, etc.). So if you are similarly looking for a stupid mistake like I was, pepper your code with printouts of these bits of info. They will at least point you in a new direction to search for your dumb mistake.

Drumroll, Please

It turns out I missed the crucial step of passing a Path object, not a String, to addResource(). The two overloads do slightly different things: adding a String adds a resource relative to the classpath, while adding a Path adds a resource at an absolute location and does not consider the classpath. So, to explicitly load the correct config file, the code above becomes (drumroll, please):

```java
// pick up config files off the classpath
Configuration conf = new Configuration();
// explicitly add other config files
// PASS A PATH, NOT A STRING!
conf.addResource(new Path("/home/hadoop/conf/core-site.xml"));
FileSystem fs = FileSystem.get(conf);
// load files and stuff below!
```

Then, ta-da, everything magically works! Hopefully these tips save you some time the next time you encounter these kinds of problems.
March 27, 2013
· 17,281 Views
Escaping Solr Query Characters In Python
I've been working on some Python Solr client code. One area where bugs have cropped up is query terms that need to be escaped before being passed to Solr. For example, how do you send Solr a term with a ":" in it? Or a "("? It turns out that generally you just put a \ in front of the character you want to escape. So to search for ":" in the "funnychars" field, you would send q=funnychars:\:.

PHP programmer Mats Lindh has solved this pretty well using str_replace, a convenient, general-purpose function that does batch string replacement. For example, you can do:

```php
$matches = array("Mary", "lamb", "fleece");
$replacements = array("Harry", "dog", "fur");
str_replace($matches, $replacements, "Mary had a little lamb, its fleece was white as snow");
```

Python doesn't quite have str_replace. There is translate, which does single-character-to-single-character batch replacement. That can't be used for escaping because the destination values are strings (i.e. "\:"), not single characters. There's a general-purpose str_replace drop-in replacement at this Stack Overflow question:

```python
edits = [("Mary", "Harry"), ("lamb", "dog"), ("fleece", "fur")]  # etc.
for search, replace in edits:
    s = s.replace(search, replace)
```

You'll notice that this algorithm requires a full pass through the string for each search/replacement pair. That's because earlier replacements may impact later ones. For example, what if edits were this?

```python
edits = [("Mary", "Harry"), ("Harry", "Larry"), ("Larry", "Barry")]
```

First our str_replace will replace Mary with Harry in pass 1, then Harry with Larry in pass 2, and so on.

It turns out that escaping characters is a narrower string replacement problem that can be done more efficiently without too much complication. The only rule that can impact the other rules is escaping the \ itself: since the other rules insert \ characters, we wouldn't want those double-escaped. Aside from this caveat, all the escaping rules can be applied in a single pass through the string, which my solution below does, performing a heck of a lot faster:

```python
# These rules are all independent; the order of
# escaping doesn't matter.
escapeRules = {'+': r'\+', '-': r'\-', '&': r'\&', '|': r'\|', '!': r'\!',
               '(': r'\(', ')': r'\)', '{': r'\{', '}': r'\}',
               '[': r'\[', ']': r'\]', '^': r'\^', '~': r'\~',
               '*': r'\*', '?': r'\?', ':': r'\:', '"': r'\"',
               ';': r'\;', ' ': r'\ '}

def escapedSeq(term):
    """Yield the next string based on the next character
    (either this char or its escaped version)."""
    for char in term:
        if char in escapeRules:
            yield escapeRules[char]
        else:
            yield char

def escapeSolrArg(term):
    """Apply escaping to the passed-in query term,
    escaping special characters like :, etc."""
    term = term.replace('\\', r'\\')  # escape \ first
    return "".join(escapedSeq(term))
```

Aside from being a good general solution to this problem, in some basic benchmarks this turns out to be about five orders of magnitude faster than doing it the naive way! Pretty cool, though you'll probably rarely notice the difference. Nevertheless, it could matter in specialized cases if you are automatically constructing and batching large or gnarly queries that require a lot of escaping work. Anyway, enjoy! I'd love to hear what you think!
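As a quick sanity check of the behavior, here is a condensed, self-contained sketch of the same single-pass approach (the function name escape_solr_arg, the SPECIAL_CHARS constant, and the sample inputs are my own, not from the code above):

```python
SPECIAL_CHARS = '+-&|!(){}[]^~*?:"; '  # Solr query syntax characters

def escape_solr_arg(term):
    """Single-pass Solr escaping: double the backslash first so the
    backslashes we insert below don't get escaped a second time."""
    term = term.replace('\\', r'\\')
    return ''.join('\\' + c if c in SPECIAL_CHARS else c for c in term)

print(escape_solr_arg('funnychars:va(lue'))  # funnychars\:va\(lue
```

Note that the backslash is deliberately absent from SPECIAL_CHARS; it is handled by the up-front replace, which is exactly the ordering caveat described above.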
February 8, 2013
· 4,179 Views
