Complementing an RDBMS With Solr
In a LucidWorks article, Solr and RDBMS: Designing Your Application for the Best of Both, it is suggested that Solr and an RDBMS can complement each other in the design of an application. This is exactly what was done in an application I recently worked on. In this article, I plan to share how this complementary relationship benefited our application, along with some details of how it was implemented. The LucidWorks article suggests using the Solr DataImportHandler (DIH) to keep the RDBMS and Solr in sync. That was not the approach we took in our application; rather, it was the responsibility of the Java layer to keep these two data repositories in sync.
The application that I’m referring to is proprietary, so I’ll have to change up some of the specifics. But in the words of Dragnet: “The story you are about to see is true. The names have been changed to protect the innocent.”
Example Application
For this article, we’ll consider an application that manages Companies, People, and Products and the expected relationships between these. A person can be an employee of a company and a company produces a collection of products. Here’s some sample data:
| Company | Employees | Products |
| --- | --- | --- |
| Foo-Bar Jive | Ella Fitzgerald, Louis Armstrong, Dizzy Gillespie | Bebop, Boogie, Scat |
| Dizzy's Hop Shop | John Coltrane, Miles Davis, Charles Mingus | Swing, Fusion |
In typical fashion, the RDBMS schema will have a table for each of these three entities, storing the entity name as well as other pertinent information. For example, the people table might store each person's date of birth, and both the companies and people tables might have a foreign key relationship to an addresses table. Whereas the RDBMS will store all the data and relationships our application needs for each of the entities, Solr will store the terms a user might use to search for an entity. Certainly, name is at the top of this list, and for simplicity, names will be the focus of this article. But in a real-world application, there are likely other identifiers that a user may want to search by, e.g. a social security number for a person or an SKU for a product.
Depending on a user's context within the application, only a certain type of entity may be appropriate for search results. For example, on the product page for Dizzy's Hop Shop, search results would be filtered to just those of type "product" where the company is "Dizzy's Hop Shop". On the other hand, from a front-page search, we will want the search to return all results unfiltered. For example:
- If the user enters the search term "dizzy", the search should return both the person "Dizzy Gillespie" and the company "Dizzy’s Hop Shop".
- If the user enters the search term "scat", we may want the search to return not just the product "Scat" but also the company "Foo-Bar Jive", which produces Scat.
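In SolrJ terms, this kind of context-sensitive restriction maps naturally onto a filter query (fq), which would be attached to the request with SolrQuery.addFilterQuery(). As a dependency-free sketch, the following builds the main query string and a product-page filter string; the helper names are hypothetical, and the type and parent field names anticipate the schema described later in this article:

```java
public class SearchContextSketch {

    // Main query: match the user-entered string against the name field.
    static String mainQuery(String searchString) {
        return String.format("name:(%s)", searchString);
    }

    // On a company's product page, restrict results to that company's products.
    // In SolrJ this string would be passed to SolrQuery.addFilterQuery().
    static String productPageFilter(String companyName) {
        return String.format("type:product AND parent:\"%s\"", companyName);
    }

    public static void main(String[] args) {
        System.out.println(mainQuery("dizzy"));
        // name:(dizzy)
        System.out.println(productPageFilter("Dizzy's Hop Shop"));
        // type:product AND parent:"Dizzy's Hop Shop"
    }
}
```

A front-page search would simply omit the filter query, letting results of all three types come back together.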
Solr Configuration
A Solr collection can be thought of much like a single RDBMS table. The collection’s schema defines the columns, or, more correctly, fields, that make up this table. Documents make up the rows. Like a database table column, each field has a type, but here, the type not only specifies datatype (string, number, boolean, etc.) but also search functionality, such as how the contents of the field are to be tokenized and if searches against the field are to be case-sensitive. Fields can be defined as single-valued or multi-valued.
Since we want to be able to obtain mixed search results, we will use a single collection for all three database entities. Here’s a high-level view of the schema for this collection:
| Field | Description |
| --- | --- |
| id | Like an RDBMS primary key, this is a unique identifier for each document in the collection. In the case of our application, where data is stored both in an RDBMS and Solr, this is the linkage between the two (more about this later). |
| type | The document type: Company, Person, or Product. |
| name | The name of the entity (e.g. "Ella Fitzgerald"). |
| alternateId | Other identifiers for the entity. I don't want to go into too much detail here, but just want to raise the idea that, depending on the application, there will likely be other identifiers. The specifics of what's stored could differ between entities. For people, we might store a familiar nickname (e.g. "Satchmo", "Satch", "Pops"). For products, we could store an SKU. |
| parent | For people and products, this is the name of the company they are associated with. For companies, it is empty. |
| children | A multivalued field listing the names of all the employees and products of a company. It is empty for people and products. |
The Java Layer
In our application, the Java layer supports all user/client interaction with the application. Users/clients are expected to never interact directly with either the RDBMS or Solr. Client requests for the creation, updating, and deletion of application entities (i.e. companies, people, products) are submitted to the Java layer, which is then responsible for appropriate database and Solr updates. Likewise, search queries are submitted to the Java layer, which forwards the query to Solr. The Java layer then hydrates the search result with full entity data from the RDBMS and returns the result to the client.
Key to the ability of the Java layer to map documents in a Solr search result to entities in the RDBMS is the id field defined for each document. For each of the database tables in our application (at least those that are searchable via Solr), we use a serial numeric primary key. Then, when forming Solr documents, the id field is the concatenation of the Java class name and the entity's primary key in the database, i.e. ldb.solr.entity.Person.57.
Java Layer: Keeping the RDBMS and Solr in Sync
As suggested in the LucidWorks article, Solr DIH is one option for keeping the RDBMS and Solr (loosely) in sync. The reason we're not taking this approach for our application is that DIH is a batch process. Even if we run DIH at a high frequency, Solr will always be a bit behind the RDBMS. Instead, for our application, we require the Java layer to make real-time updates to Solr. We might consider a batch process like DIH as a secondary mechanism for relaying database updates to Solr. This would allow the application to continue to run if the Solr server were temporarily down, and the batch process could then bring Solr up to date when the server was back up. But for normal operations, the Java layer has the primary responsibility of keeping Solr in sync with changes made to the database.
Like most Java applications storing data in a database, our application uses the Java Persistence API (JPA), and in fact, we wish to keep our application independent of any specific JPA implementation (e.g. Hibernate). We also wish to keep Solr updates isolated from other application code; most of our application code should be unaware of the updates being made to Solr.
There are three main classes involved in updating Solr:

- SolrIndexed is an interface that must be implemented by each entity class that is to be maintained in Solr (Company, Person, Product).
- SolrUpdate is a JPA entity listener responsible for using SolrJ to make the Solr updates.
- SolrDto is a data transfer object that represents a Solr document.
SolrDto Class
The members of the SolrDto class match up one-to-one with the schema for our Solr collection. Each member field is annotated with @Field:
```java
public class SolrDto {
    @Field
    private String id;

    @Field
    private String name;
    …

    public String getId() {
        return id;
    }

    public void setId(Class<? extends SolrIndexed> clazz, Long id) {
        this.id = new SolrId(clazz, id).toString();
    }
    …
}
```
Note that the setter for id does not take the expected String argument (which would match the type of the id field) but instead takes the class of the entity and its id (database primary key).
The SolrId class handles forming the id of a Solr document (for updates to Solr) and parsing the document ids of returned search results.
```java
public class SolrId {
    private final Class<? extends SolrIndexed> entityClass;
    private final Long id;

    public SolrId(Class<? extends SolrIndexed> clazz, Long id) {
        this.entityClass = clazz;
        this.id = id;
    }

    public SolrId(String solrId) {
        int dotPos = solrId.lastIndexOf('.');
        if (dotPos == -1) {
            throw new RuntimeException("Invalid SolrId: " + solrId);
        }
        try {
            this.entityClass = (Class<? extends SolrIndexed>) Class.forName(solrId.substring(0, dotPos));
        } catch (ClassNotFoundException e) {
            throw new RuntimeException("SolrId has invalid class: " + solrId);
        }
        this.id = Long.valueOf(solrId.substring(dotPos + 1));
    }

    public Class<? extends SolrIndexed> getEntityClass() {
        return entityClass;
    }

    public Long getId() {
        return id;
    }

    @Override
    public String toString() {
        return String.format("%s.%d", entityClass.getCanonicalName(), id);
    }
}
```
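To illustrate the round trip, here is a condensed, self-contained variant of this id scheme. SolrIdDemo is a hypothetical stand-in that uses Class<?> in place of the SolrIndexed bound (and java.lang.String as the demo entity class) so the example runs on its own:

```java
public class SolrIdDemo {

    // Form the document id: fully-qualified class name + "." + primary key.
    static String format(Class<?> clazz, Long id) {
        return String.format("%s.%d", clazz.getCanonicalName(), id);
    }

    // Parse a document id back into its class and primary key.
    // The class name and key are separated by the last dot.
    static Object[] parse(String solrId) throws ClassNotFoundException {
        int dotPos = solrId.lastIndexOf('.');
        Class<?> entityClass = Class.forName(solrId.substring(0, dotPos));
        Long id = Long.valueOf(solrId.substring(dotPos + 1));
        return new Object[] { entityClass, id };
    }

    public static void main(String[] args) throws Exception {
        String solrId = format(String.class, 57L);
        System.out.println(solrId); // java.lang.String.57
        Object[] parsed = parse(solrId);
        System.out.println(parsed[0] == String.class && parsed[1].equals(57L)); // true
    }
}
```

Using the last dot as the separator is what lets the scheme work even though the class name itself contains dots.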
SolrIndexed Interface
Each of our entity classes (Company, Person, Product) will implement the SolrIndexed interface. Implementing this interface means that each of these classes must provide a getSolrDto() method.
```java
public interface SolrIndexed {
    SolrDto getSolrDto();
}
```
Thus, each of these classes is responsible for providing a mapping from the entity to SolrDto:
```java
@Transient
@Override
public SolrDto getSolrDto() {
    SolrDto dto = new SolrDto();
    dto.setId(this.getClass(), id);
    dto.setName(name);
    …
    return dto;
}
```
SolrUpdate Class
The SolrUpdate
class is a JPA event listener. A JPA event listener contains callback methods that respond to JPA lifecycle events. In the case of the SolrUpdate
class, these methods use the SolrJ API to update Solr based on the JPA lifecycle events.
```java
public class SolrUpdate extends SolrServer {

    @PostPersist
    @PostUpdate
    public void onCreateUpdate(SolrIndexed indexed) {
        SolrClient client = getSolrClient();
        try {
            client.addBean(indexed.getSolrDto());
            client.commit();
        } catch (IOException | SolrServerException e) {
            e.printStackTrace();
        }
    }

    @PostRemove
    public void onDelete(SolrIndexed indexed) {
        SolrClient client = getSolrClient();
        try {
            client.deleteById(indexed.getSolrDto().getId());
            client.commit();
        } catch (IOException | SolrServerException e) {
            e.printStackTrace();
        }
    }
}
```
In order for SolrUpdate to be called for JPA lifecycle events, it must be listed in the @EntityListeners annotation on each of the SolrIndexed entities (Company, Person, Product).
```java
@Entity
@EntityListeners(SolrUpdate.class)
@Table(name = "people")
public class Person implements SolrIndexed {
    …
}
```
Performance Considerations
This approach to syncing the RDBMS and Solr will likely be very chatty: for each entity insert/update/delete, a corresponding update is made to Solr. The SolrJ API supports updating and/or deleting multiple documents in a single call, but the limiting factor here is JPA, which only provides pre/post callbacks for individual database actions.

If we're not restricted to pure JPA but can make use of extensions of the JPA provider, Hibernate does offer an option that reduces this chattiness. In addition to the JPA entity listeners, Hibernate provides interceptors that are called at certain lifecycle events, in particular at the end of a transaction. With this available, the call to SolrClient.commit() can be delayed until the Hibernate transaction completes. In this case, the SolrClient must be managed as a singleton for the life of the transaction. The JPA event listeners would make addBean() and deleteById() calls on that SolrClient but not call commit(); the commit() call would then be made at the completion of the transaction by the Hibernate interceptor.
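The deferred-commit pattern described above can be sketched without the Hibernate or SolrJ dependencies. In this illustration, RecordingSolrClient is a hypothetical stand-in for SolrClient, and the two static hook methods stand in for the JPA entity listener and Hibernate's end-of-transaction interceptor callback:

```java
import java.util.ArrayList;
import java.util.List;

public class DeferredCommitSketch {

    // Stand-in for SolrClient: records operations and counts commits.
    static class RecordingSolrClient {
        final List<String> ops = new ArrayList<>();
        int commits = 0;
        void addBean(Object bean) { ops.add("add:" + bean); }
        void commit() { commits++; }
    }

    // One client per transaction, held in a ThreadLocal so the entity
    // listeners and the end-of-transaction hook share the same instance.
    static final ThreadLocal<RecordingSolrClient> TX_CLIENT =
            ThreadLocal.withInitial(RecordingSolrClient::new);

    // What the JPA entity listener would do: buffer the update, no commit.
    static void onCreateUpdate(Object dto) {
        TX_CLIENT.get().addBean(dto);
    }

    // What the Hibernate interceptor would do at transaction completion:
    // issue a single commit covering all buffered updates.
    static void afterTransactionCompletion() {
        TX_CLIENT.get().commit();
        TX_CLIENT.remove();
    }

    public static void main(String[] args) {
        RecordingSolrClient client = TX_CLIENT.get();
        onCreateUpdate("person-1");
        onCreateUpdate("person-2");
        onCreateUpdate("product-7");
        afterTransactionCompletion();
        System.out.println(client.ops.size()); // 3 buffered operations
        System.out.println(client.commits);    // 1 commit for the whole transaction
    }
}
```

The point of the sketch is the shape of the interaction: three entity changes produce three buffered document updates but only one Solr commit.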
Java Layer: Supporting Queries
For our application, we want to support searching against a user-entered string. The Java layer:

- Builds a Solr (Lucene) query based on that string:
  - The query requests only the id field to be returned.
  - The query requests the results to be sorted by score (quality of match).
- Submits it to the Solr server via the SolrJ API.
- Hydrates the returned ids to full entities by use of JPA.
Here’s the sample code:
```java
public List<SolrIndexed> search(String searchString) throws IOException, SolrServerException {
    final String[] fields = {"id"};
    SolrQuery query = new SolrQuery();
    query.setQuery(String.format("name:(%s)", searchString));
    query.setFields(fields);
    query.setSort("score", SolrQuery.ORDER.desc);
    SolrDocumentList documents = getSolrClient().query(query).getResults();

    EntityManager entityManager = EntityManagerUtility.getEntityManager();
    entityManager.getTransaction().begin();
    List<SolrIndexed> resultsList = documents.stream()
            .map(doc -> (String) doc.get("id"))
            .map(SolrId::new)
            .map(solrId -> entityManager.find(solrId.getEntityClass(), solrId.getId()))
            .collect(Collectors.toList());
    entityManager.getTransaction().commit();
    entityManager.close();
    return resultsList;
}
```
Let's examine building the query string itself, String.format("name:(%s)", searchString), in a bit more detail.
First, in the sample code, we're simply searching for matches between the searchString and the name field. Suppose, however, our requirements were to look for matches against both the name field and the alternateId field, giving priority to name matches. This could be accomplished by instead forming the Lucene query as:

```java
String.format("name:(%s)^2 OR alternateId:(%s)", searchString, searchString)
```
Here, we've asked for documents where either the name field or the alternateId field matches the searchString, with the ^2 boost giving name matches twice the weight.

Second, notice that we've wrapped searchString in parentheses. This is necessary since the user-entered searchString may include whitespace. In Lucene syntax, parentheses indicate grouping. Wrapping searchString in parentheses says to search for documents where the name or alternateId field matches any of the tokens in the searchString. The more tokens that match, the higher the score assigned to the document. So, referring back to our sample data, if the user's searchString is "Dizzy Gillespie", Solr will find a match both for the person "Dizzy Gillespie" and the company "Dizzy's Hop Shop". The person will score higher, since there is a match on both tokens, and will be returned first; "Dizzy's Hop Shop" is a match, but with a lower score.
Conclusion
Using both an RDBMS and Solr as dual data repositories for an application allows the application to store detailed entity information and relationships in typical RDBMS fashion while also supporting rich search capabilities with Solr. The examples given here are necessarily simplistic. In a real application, both the RDBMS and Solr schemas would be more complex. For example, copy fields in the Solr schema could define edge-gram and n-gram versions of the name field, which the search query could also include as lower-quality matches.
Published at DZone with permission of Doug Baughman. See the original article here.