Querying Cached Data - Paradigm Shift
Why would you ever query cached data if you can query your persistent
store, such as a database? The answer is the same as for accessing
data by key from a cache vs. getting it from the database - performance
and scalability. However, querying a cache is not exactly the same as
querying your database. The main difference is that if the cache holds
only a subset of the data stored in the database, then you will be
querying only that subset, so the query result will reflect only the
in-memory state. Does this matter? That depends on your application
requirements and on the amount of data you are able to store in cache.
With the introduction of cloud computing and virtual instances, the amount of memory available to your grid in the cloud becomes virtually limitless. Adding nodes to your grid has become as simple as calling the AWS API on EC2 whenever your application demands it. On top of that, if GridGain swap space is configured, data that cannot fit in memory on a single node will overflow to disk. Also, your application may not even have that much data, or querying cached data, which usually contains data that has been accessed relatively recently, may be good enough. Thus in many cases querying a cache is starting to look more and more like querying your database.
Now that you have decided to query cached data in your project, the next question is how to cache query results. Most of us are familiar with Hibernate and its support for 2nd Level Caching, which also comes with a Query Cache. The way the query cache works in Hibernate is generally the way we are used to thinking about caching queried data. In a nutshell, a query is issued against the database and the results of the query are then stored in cache in a single collection. If you have multiple queries, then multiple collections containing query results are stored. Now if you ever update a single bean in Hibernate in a way that can potentially affect the query result (pretty much any change to the queried tables), Hibernate is forced to invalidate (remove) the cached query results and reload them on demand next time. Storing a collection per query significantly increases memory consumption, and frequent invalidation of cached query results performs horribly and does not scale at all. Even Hibernate itself discourages its users from relying on it. Here is the quote from the Hibernate documentation:
... most queries do not benefit from caching of their results. So by default, individual queries are not cached even after enabling query caching ...
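The invalidation pattern described above can be sketched in a few lines of plain Java. This is a simplified illustration under stated assumptions, not Hibernate's implementation: the class `QueryResultCacheSketch`, its methods, and the `databaseHits` counter are all hypothetical names, and a `HashMap` stands in for the database table.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical illustration only; this is not how Hibernate is implemented.
final class QueryResultCacheSketch {
    // Stand-in for a database table: employee id -> salary.
    private final Map<Long, Integer> salariesById = new HashMap<>();
    // One cached result collection per distinct query name.
    private final Map<String, List<Long>> cachedResults = new HashMap<>();
    // Counts how often a query had to be re-run against the "database".
    int databaseHits = 0;

    // Any write may affect any query result, so all cached results are dropped.
    void update(long id, int salary) {
        salariesById.put(id, salary);
        cachedResults.clear();
    }

    // Returns matching ids, served from the result cache when it is still valid.
    List<Long> query(String queryName, Predicate<Integer> salaryFilter) {
        List<Long> cached = cachedResults.get(queryName);
        if (cached != null) {
            return cached;
        }
        databaseHits++;
        List<Long> ids = new ArrayList<>();
        for (Map.Entry<Long, Integer> e : salariesById.entrySet()) {
            if (salaryFilter.test(e.getValue())) {
                ids.add(e.getKey());
            }
        }
        cachedResults.put(queryName, ids);
        return ids;
    }

    public static void main(String[] args) {
        QueryResultCacheSketch cache = new QueryResultCacheSketch();
        cache.update(1L, 500);
        cache.update(2L, 1500);
        cache.query("low", s -> s <= 1000); // runs against the table
        cache.query("low", s -> s <= 1000); // served from the result cache
        cache.update(3L, 700);              // invalidates every cached result
        cache.query("low", s -> s <= 1000); // has to run against the table again
        System.out.println("database hits: " + cache.databaseHits);
    }
}
```

With a write-heavy workload, nearly every query ends up going back to the table, which is the scaling problem the paragraph above describes.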
So, how does querying cached data help? It helps by removing the need for a Query Result Cache altogether. SQL queries on your indexed cached data are executed in memory and perform very fast, so there is no need to cache query results. Just run your SQL query on your cached data and get the results whenever you need them. However, it is important to note that without rich SQL support, cache queries will not be able to replace database queries within your project. In the example below, where Person relates to Organization, if your cache did not support SQL joins, you would not be able to find all people working for the same company, which may be quite limiting. Hence, it is extremely important to evaluate how rich the SQL support of a given cache product is before making a decision to query cached data.
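To make the join requirement concrete, here is a minimal plain-Java sketch of what a cache has to do to answer "find all people working for a given organization" when people and organizations are cached separately. No GridGain API is used; the class `InMemoryJoinSketch`, its nested types, and the method `peopleWorkingFor` are assumptions made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration in plain Java; no GridGain API is used here.
final class InMemoryJoinSketch {
    static final class Organization {
        final long id;
        final String name;
        Organization(long id, String name) { this.id = id; this.name = name; }
    }

    static final class Person {
        final String name;
        final long orgId;
        Person(String name, long orgId) { this.name = name; this.orgId = orgId; }
    }

    // Roughly what this SQL asks for:
    //   select Person.name from Person, Organization
    //   where Person.orgId = Organization.id and Organization.name = ?
    static List<String> peopleWorkingFor(String orgName,
                                         List<Person> people,
                                         List<Organization> orgs) {
        // Build side: index the organizations that match the name by id.
        Map<Long, Organization> matchingOrgs = new HashMap<>();
        for (Organization o : orgs) {
            if (o.name.equals(orgName)) {
                matchingOrgs.put(o.id, o);
            }
        }
        // Probe side: keep every person whose orgId hits the index.
        List<String> result = new ArrayList<>();
        for (Person p : people) {
            if (matchingOrgs.containsKey(p.orgId)) {
                result.add(p.name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Organization> orgs = Arrays.asList(
            new Organization(1L, "GridGain"), new Organization(2L, "Other"));
        List<Person> people = Arrays.asList(
            new Person("Alice", 1L), new Person("Bob", 2L), new Person("Carol", 1L));
        System.out.println(peopleWorkingFor("GridGain", people, orgs));
    }
}
```

A cache that only supports lookups by key or single-type filters leaves this work to the application, which is why join support matters when evaluating a product.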
In GridGain 3.0 the support for cache queries is virtually without limitations. If you know SQL, you can run queries against cached data, including any type of join, where-clause keywords, order by, group by, and so on. In addition to SQL queries, GridGain also supports text queries backed by either Lucene or H2 TEXT indexing. You can also run predicate-based FULL SCAN queries, which iterate over all cache elements on remote nodes and include only the ones that pass the predicate filter.
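Conceptually, a predicate-based full scan visits every entry on every node and keeps the ones the filter accepts. The plain-Java sketch below illustrates just that idea; it is not the GridGain API. The class `FullScanSketch` and its `scan` method are hypothetical, and each "node" is modeled as an ordinary `Map`.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

// Hypothetical illustration in plain Java; each Map stands in for one node's cache.
final class FullScanSketch {
    // Visits every entry on every node and keeps those accepted by the filter.
    static <K, V> Map<K, V> scan(List<Map<K, V>> nodes, BiPredicate<K, V> filter) {
        Map<K, V> matches = new HashMap<>();
        for (Map<K, V> node : nodes) {
            for (Map.Entry<K, V> e : node.entrySet()) {
                if (filter.test(e.getKey(), e.getValue())) {
                    matches.put(e.getKey(), e.getValue());
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<Long, Integer> nodeA = new HashMap<>();
        nodeA.put(1L, 500);
        nodeA.put(2L, 1500);
        Map<Long, Integer> nodeB = new HashMap<>();
        nodeB.put(3L, 2500);

        // Keep only entries whose salary value is above 1000.
        Map<Long, Integer> wellPaid =
            scan(Arrays.asList(nodeA, nodeB), (id, salary) -> salary > 1000);
        System.out.println(wellPaid.keySet());
    }
}
```

Because no index is involved, the cost grows with the total number of cached entries, so a scan like this suits filters that cannot be expressed through indexed SQL or text queries.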
As an example, take a look at some of the GridGain 3.0 queries below. Note the JDBC PreparedStatement syntax for passing arguments and the SQL join performed between the Person and Organization classes. Also note how you can cherry-pick the set of nodes on which you would like to execute your query.
// Create a query which selects employees of a given organization
// whose salary falls within a given range.
GridCacheQuery<Long, Person> qry = cache.createQuery(SQL, Person.class,
    "from Person, Organization where Person.orgId = Organization.id " +
    "and Organization.name = ? and Person.salary > ? and Person.salary <= ?");
// Query all nodes to find all cached GridGain employees
// with salaries greater than 0 and up to 1000.
qry.queryArguments("GridGain", 0, 1000).execute(grid);
// Query only remote nodes to find all remotely cached GridGain employees
// with salaries greater than 1000 and up to 2000.
qry.queryArguments("GridGain", 1000, 2000).execute(grid.remoteProjection());
// Query local node only to find all locally cached GridGain employees
// with salaries greater than 2000.
qry.queryArguments("GridGain", 2000, Integer.MAX_VALUE).
Here is an example of a text query which will scan all resumes for the word "master":
// Will query fields annotated with @GridCacheQueryLuceneField annotation.
GridCacheQuery<Long, Person> mastersQry =
    cache.createQuery(LUCENE, Person.class, "Master");
// Query all cache nodes.