This post summarizes our experience around data ingestion in Search. Almost any search project begins with feeding search engine by existing data. Here we are mostly focusing on old good relational databases as data source. I don’t even hesitate what to to type: SQL database, or not-NoSQL DB?. Needless to say most of these considerations are applicable to any other data sources like files, web services, NoSQL DBs and distributed file systems.
Solr Data Import Handler - DIH
Let me come out at first, I’m a big fan of Data Import Handler. It’s as handy as any other ETL tool - you don’t need to write Java code calling SolrJ and debug SQL query results in IDE. With DIH you are jungling with configs and copy-paste queries and play with the queries and data right in the SolrAdmin. So, DIH is perfect for fast prototyping, but what about running it in production? There are couple of issues here (add yours):
- lack of concurrency - single thread processing keeps hardware idle and takes a lot of time to complete;
- lack of performant join (look further).
We will check these points below, however, we have an evidence that DIH is used for really huge deployments. So, despite of these limitations many people run DIH in production. To solve the concurrency problem, we can logically shard data using specially crafted queries and launch per-shard imports in parallel. Join performance problem is solved by persistent caches, which presumably makes join operation faster, yet I am skeptical about this approach. Now let’s look at some practical questions.
Indexing Blocks by DIH
Nested Entities (Data Join) in DIH
Now, lets come back to joining entities in DIH. As it’s perfectly described in SOLR-2382 referred above: “Using SqlEntityProcessor with Child Entities can cause an "n+1 select” problem”. This term reminds my earlier challenges in IT. Thus, it can’t process any considerable amount of data in a reasonable time. You may ask why don’t just ask RDBMS to join entities and process joined result set in DIH. Well, it’s possible if both tables reside in the same DB, however, you should remember about cartesian product problem which can happen if you join two or more child entities. Recommended approach in this case is to cache one side of the relation on heap with hashmap, and when heap is exceeded (the only case I care) it’s suggested to put data off-heap in BDB files.
Merge Join in DIH
It turns out ETL community is aware of this problem and had solution recipe for ages - external merge join. For example a gain between naive N+1 subqueries to merge join for hundred of millions records is the speedup from several hours to several minutes. It’s worth to mention that we meet the same algorithm - merging sorted sequences over and over again in search engines implementation. Here is the good news: starting from 5.0 merge join is available for any EntityProcessor in DIH by specifying join=”zipper” in child entity. Sure you need to sort both inputs, thankfully that’s done by RDBMS’ indices quite well. Interesting to see how Kettle ETL reminds about it every time when merge join configuration is amended. DIH doesn’t bother you with such pop-up, but throws an exception if inbound streams are not ordered. You can also process many-to-many relations, but it requires join and sort in RDBMS, that’s usually fine. It’s time to talk about threads and concurrent processing. Before this, let’s make a note that merge join algorithm is not easily parallelizable, thus “hash join” (lookup in cached data) is more suitable for multithreaded processing.
There are no threads in DIH. Sic. We have usual pitfall of producer-consumer with the sequential processing - everyone waits its’ counterpart:
|What we have in DIH||What we want to have|
Just a note, the same problem we have if run DIH with SolrCloud, in this case DIH feeds Solr one document by one synchronously, and blocks until every document is sent to shard leader by DistributingUpdateProcessor.
Okay, enough about problems, let’s talk about opportunities. We can parallelize outbound flow (consumer):
- if DIH sent updates via ConcurrentUpdateSolrClient or CloudSolrClient, it would unblock producer from consumer, giving the opportunity to fully utilize Solr machines for indexing. However, this is not possible with current DIH design, but we have a great attempt to breakthrough - SOLR-7188. Heads up! It would be a great win, letting us to run DIH as a real ETL tool.
- there is another patch adds threads on UpdateRequestProcessors layer - SOLR-3585 you can think about it as a server side ConcurrentUpdateSolrClient. Despite we have a positive feedback from the production usage, I’ve changed my mind since its’ contribution and don’t consider it as an architecturally wise approach. I suppose it’s a client’s duty (i.e. ETL) to provide the proper level of concurrent load and throttling. Nevertheless, you can use it if you are in the trouble, for example if you have some legacy script posting files into Solr http endpoint.
We can also think how to prefetch data in JdbcDataSource in background thread that avoids blocking producer. I must have such patch somewhere, let me know if you need one.
Another ETL Tool: Kettle
I also played with a couple of open source ETL tools, I choose Kettle as an example. It has many useful facilities built-in and you can definitely use it as a tool box for data ingestion. However, we are keen in the particular problem - building Solr XMLs. The difficulty which I've faced is the limitations of flat relational tuples (call it rows or records). However, what we need is to have at least three levels of nesting like here - parent-child-attributes. One of the possible workarounds is to use XML DOM as a data structure, however it can not be transferred between transformation steps as-is, and needs to be converted into string, concatenated and parsed again and again something like we have here. I've found XML Join is not well scalable, it rather works like in-memory XPath database, it's great, but not what I need. One of the possible solutions is to introduce DOM XML as a first class datatype in Kettle, and let some steps to process it as-is.
Stay tuned, we will present such a proof-of-concept soon. Don't hesitate to share your vision, experience and findings. Here is also a deck which I used during my talk about ETL.