Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Complementing an RDBMS With Solr

DZone's Guide to

Complementing an RDBMS With Solr

For this article, we’ll consider an application that manages Companies, People, and Products and the expected relationships between them with the example of famous jazz musicians.

· Database Zone ·
Free Resource

Built by the engineers behind Netezza and the technology behind Amazon Redshift, AnzoGraph is a native, Massively Parallel Processing (MPP) distributed Graph OLAP (GOLAP) database that executes queries more than 100x faster than other vendors.  

In a LucidWorks article, Solr and RDBMS: Designing Your Application for the Best of Both, it is suggested how the use of both Solr and an RDBMS can complement each other in the design of an application. This is exactly what was done in an application I recently worked on. In this article, I plan to share how this complementary relationship benefited our application and some of the details of how it was implemented. The LucidWorks article suggests the use of the Solr DataImportHandler (DIH) to keep the RDBMS and Solr in sync. That was not the approach we took in our application; rather, it was the responsibility of the Java layer to keep these two data repositories in sync.

The application that I’m referring to is proprietary, so I’ll have to change up some of the specifics. But in the words of Dragnet: “The story you are about to see is true. The names have been changed to protect the innocent.”

Example Application

For this article, we’ll consider an application that manages Companies, People, and Products and the expected relationships between these. A person can be an employee of a company and a company produces a collection of products. Here’s some sample data:

Company

Employees

Products

Foo-Bar Jive

Ella Fitzgerald

Louis Armstrong

Dizzy Gillespie

Bebop

Boogie

Scat

Dizzy’s Hop Shop

John Coltrane

Mile Davis

Charles Mingus

Swing

Fusion

In typical fashion, the RDBMS schema will have a table for each of these three entities, storing the entity name as well as other pertinent information. For example, the People table might store each person’s date of birth and both the companies and people table might have a foreign key relationship to an Addresses table. Whereas the RDBMS will store all the data and relationships our application needs for each of the entities, SOLR will store terms that a user might use to search for the entity. Certainly, "name" is at the top of this list, and for simplicity, names will be the focus of this article. But in a real-world application, there are likely other identifiers that a user may want to search by, i.e. a social security number for a person or an SKU for a product.

Dependent upon a user’s context within the application, only a certain type of entity may be appropriate for search results. For example, on the product page for Dizzy’s Hop Shop, search results would be filtered to just those of type "product" where the company is "Dizzy’s Hop Shop". On the other hand, from a front-page search, we will want the search to return all results unfiltered. For example:

  • If the user enters the search term "dizzy", the search should return both the person "Dizzy Gillespie" and the company "Dizzy’s Hop Shop".
  • If the user enters the search term "scat", we may want the search to return not just the product "scat" but the company "Foo-Bar Jive", which produces scat, as well.

Solr Configuration

A Solr collection can be thought of much like a single RDBMS table. The collection’s schema defines the columns, or, more correctly, fields, that make up this table. Documents make up the rows. Like a database table column, each field has a type, but here, the type not only specifies datatype (string, number, boolean, etc.) but also search functionality, such as how the contents of the field are to be tokenized and if searches against the field are to be case-sensitive. Fields can be defined as single-valued or multi-valued.

Since we want to be able to obtain mixed search results, we will use a single collection for all three database entities. Here’s a high-level view of the schema for this collection:

Field

Description

Id

Like an RDBMS primary key, this is a unique identifier for each document in the collection. In the case of our application where data is stored both in an RDBMS and Solr, this is the linkage between the two (more about this later).

Type

The document type: Company, Person, or Product.

Name

The name of the entity (i.e. "Ella Fitzgerald").

AlternateId

Other identifiers for the entity. I don’t want to go into too much detail here, but just want to raise the idea that depending on the application, there will likely be other identifiers. The specifics of what’s stored could differ between entities. For people, we might store a familiar nickname (i.e. "Satchmo", "Satch", "Pops"). For products, we could store an SKU.

Parent

For people and products, this would be the name of the company they are associated with. For companies, it would be empty.

Children

This would be a multivalued field listing the name of all the employees and products for a company. It would be empty for employees and products.

The Java Layer

In our application, the Java layer supports all user/client interaction with the application. Users/clients are expected to never interact directly with either the RDBMS or Solr. Client requests for the creation, updating, and deletion of application entities (i.e. companies, people, products) are submitted to the Java layer, which is then responsible for appropriate database and Solr updates. Likewise, search queries are submitted to the Java layer, which forwards the query to Solr. The Java layer then hydrates the search result with full entity data from the RDBMS and returns the result to the client.

Key to the ability of the Java layer to map documents in a Solr search result to entities in the RDBMS is the Id field defined for each document. For each of the database tables in our application (at least those that are searchable by Solr), we use a serial numeric as the primary key. Then, when forming Solr documents, the Id field is the concatenation of the Java class name and the entities primary key in the database, i.e. ldb.solr.entity.Person.57.

Java Layer: Keeping the RDBMS and Solr in Sync

As suggested in the LucidWorks article, Solr DIH is one option for keeping the RDBMS and Solr (loosely) in sync. The reason we’re not taking this approach for our application is that DIH is a batch process. Even if we run DIH at a high frequency, SOLR will always be a bit behind the RDBMS. Instead, for our application, we require the Java layer to make real-time updates to Solr. We might consider a batch process like DIH as a secondary mechanism for relaying database updates to Solr. This would allow the application to continue to run if the Solr server were temporarily down and then the batch process could catch up to SOLR when the server was back up. But for normal operations, the Java layer has the primary responsibility of keeping Solr up-to-date with changes made to the database.

Like most Java applications storing data in a database, our application will use the Java Persistence API (JPA) and in fact, we wish to keep our application independent of any specific JPA implementations (i.e. Hibernate). We also wish to keep Solr updates isolated from other application code, most of our application code should be unaware of the updates being made to Solr.

There are three main classes involved in updating SOLR:

  1. SolrIndexed is an interface that must be implemented by each entity class which is to be maintained in Solr (Company, Person, Product).
  2. SolrUpdate is a JPA entity listener responsible for using SolrJ to make the Solr updates.
  3. SolrDto is a data transfer object that represents a Solr document.

SolrDto Class

The members of the SolrDto class match up one-to-one with the schema for our Solr collection. Each member field is annotated with @Field:

public class SolrDto {
   @Field
    private String id;
    @Field
    private String name;
   …

    public String getId() {
        return id;
    }

    public void setId(Class<? extends SolrIndexed>clazz, Long id){
        this.id = new SolrId(clazz, id).toString();
   }

   …
}

Note that the setter for ID does not take the expected String argument, which would match the type of Id, but instead takes the class of the entity and its Id (database primary key).

The SolrId class handles forming the Id of the Solr document (for updates to Solr) and parsing the document IDs of returned search results.

    public class SolrId{
        private final Class<? extends SolrIndexed> entityClass;
        private final Long id;

        public SolrId(Class<? extends SolrIndexed> clazz, Long id) {
            this.entityClass = clazz;
            this.id = id;
        }

        public SolrId( String solrId){
            int dotPos = solrId.lastIndexOf('.');
            if( dotPos == -1){
                throw new RuntimeException("Invalid SolrId:  " + solrId);
            }
            try {
                this.entityClass = (Class<? extends SolrIndexed>) Class.forName(solrId.substring(0, dotPos));
            } catch (ClassNotFoundException e) {
                throw new RuntimeException("SolrId has invalid class: " + solrId);
            }
            this.id = Long.valueOf(solrId.substring(dotPos +1));
        }

        public Class<? extends SolrIndexed> getEntityClass() {
            return entityClass;
        }

        public Long getId() {
            return id;
        }

        @Override
        public String toString() {
            return String.format("%s.%d", entityClass.getCanonicalName(), id);
        }
    }

SolrIndexed Interface

Each of our entity classes (Company, Person, Product) will implement the SolrIndexed interface. Implementing this interface means that each of these classes must provide a method getSolrDto().

public interface SolrIndexed {
    SolrDto getSolrDto();
}

Thus, each of these classes is responsible for providing a mapping from the entity to SolrDto.

    @Transient
    @Override
    public SolrDto getSolrDto() {
        SolrDto dto = new SolrDto();
        dto.setId(this.getClass(), id);
        dto.setName(name);
        …

        return dto;
}

SolrUpdate Class

The SolrUpdate class is a JPA event listener. A JPA event listener contains callback methods that respond to JPA lifecycle events. In the case of the SolrUpdate class, these methods use the SolrJ API to update Solr based on the JPA lifecycle events.

public class SolrUpdate extends SolrServer {

    @PostPersist
    @PostUpdate
    public void onCreateUpdate( SolrIndexed indexed){
        SolrClient client = getSolrClient();
        try {
            client.addBean(indexed.getSolrDto());
            client.commit();
        } catch (IOException | SolrServerException e) {
            e.printStackTrace();
        }
    }

    @PostRemove
    public void onDelete( SolrIndexed indexed){
        SolrClient client = getSolrClient();
        try {
            client.deleteById(indexed.getSolrDto().getId());
            client.commit();
        } catch (IOException | SolrServerException e) {
            e.printStackTrace();
        }
    }

}

In order for SolrUpdate to be called for JPA lifecycle events, it must be listed in the annotation @EntityListeners on each of the SolrIndexed entities (Company, Person, Product).

@Entity
@EntityListeners(SolrUpdate.class)
@Table(name = "people")
public class Person implements SolrIndexed {

Performance Considerations

This approach to sync’ing the RDBMS and SOLR will likely be very chatty. For each entity insert/update/delete a corresponding update is made to SOLR. The SolrJ API supports updating and/or deleting multiple documents in a single call, the limiting factor here is JPA which only provides for pre/post database action call-backs.

If we’re not restricted to pure JPA but can make use of extensions of the JPA provider, Hibernate does offer an option that will reduce this chatty-ness. In addition to the JPA entity listeners, Hibernate provides ‘interceptors’ that are called at certain life-cycle events, in particular at the ending of a transaction. With this available the call to SolrClient.commit() can be delayed until the Hibernate transaction completes. In this case the SolrClient must be managed as a singleton for the life of the transaction. The JPA event listeners would make addBean() and deleteById() calls to that SolrClient, but not call commit(). The commit() call would then be made at the completion of the transaction by the Hibernate interceptor.

Java Layer: Supporting Queries

For our application, we want to support searching against a user-entered string. The Java Layer:

  • Builds a Solr (Lucene) query based on that string
    • The query requests only the Id field to be returned.
    • The query requests the results to be sorted by score (quality of match).
  • Submits it to the SOLR server via the SolrJ API.
  • Hydrates the returned IDs to full entities by use of JPA.

Here’s the sample code:

public List<SolrIndexed> search(String searchString) throws IOException, SolrServerException {

        final String[]fields = {"id"};
        SolrQuery query = new SolrQuery();
        query.setQuery(String.format("name:(%s)", searchString));
        query.setFields(fields);
        query.setSort("score", SolrQuery.ORDER.desc);

        SolrDocumentList documents = getSolrClient().query(query).getResults();

        EntityManager entityManager = EntityManagerUtility.getEntityManager();
        entityManager.getTransaction().begin();
        List<SolrIndexed> resultsList = documents.stream().
                map(doc->(String)doc.get("id")).
                map(SolrDto.SolrId::new).
                map( solrId-> entityManager.find(solrId.getEntityClass(), solrId.getId())).
                collect(Collectors.toList());
        entityManager.close();
        return resultsList;
}

Let’s examine building the query string itself — String.format("name:(%s)", searchString) — in a bit more detail.

First, in the sample code, we’re simply searching for matches between the searchString and the name field. Suppose, however, our requirements were to look for matches both against the name field and the alternateId field, giving priority to name matches. This could be accomplished by instead forming the Lucene query as:

String.format("name:(%s)^2 OR alternateId:(%s)", searchString, searchString)

Here, we’ve asked for documents where either the name field matches searchString or the second notice that we’ve wrapped searchString in parenthesis. This is necessary since the user entered searchString, which may include white space. In Lucene syntax, parentheses indicate grouping. Wrapping searchString in parentheses says to search for documents where the name or alternateId field matches any of the tokens in the searchString. The more tokens that match, the higher score assigned to the document. So, referring back to our sample data, if the user searchString is "Dizzy Gillespie", Solr will find a match both for the Person ‘"Dizzy Gillespie" and the Company "Dizzy’s Hop Shop". The Person will score higher since there is a match on both tokens and be returned first. "Dizzy’s Hop Shop" is a match, but with a lower score.

Conclusion

Using both an RDBMS and Solr as dual data repositories for an application allows the application to store detailed entity information and relationships in typical RDBMS fashion while also supporting rich search capabilities with Solr. The examples given here are necessarily simplistic. In a real application, both the RDBMS and Solr schemas would be more complex. For example, copy fields in the Solr schema could define edge-gram and n-gram versions of the name fiel,d which the search query could also include as lower -uality matches.

Download AnzoGraph now and find out for yourself why it is acknowledged as the most complete all-in-one data warehouse for BI style and graph analytics.  

Topics:
solr ,rdbms ,jpa ,database ,tutorial ,java ,queries

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}