RavenDB 5.1: Empowering Your Queries
Learn about the latest release of RavenDB 5.1 including new features for sentiment analysis and full text search directly in the database engine and much more.
RavenDB has been around for a while. The first production deployment (the backend for monitoring bird migration patterns in Norway) went live in November 2010. After a decade of working on the database, I thought that I had a pretty good idea about the kind of features that RavenDB has and the direction the project is going.
Archimedes famously said “δῶς μοι πᾶ στῶ καὶ τὰν γᾶν κινάσω”. That probably doesn’t make sense to you, since he spoke in Ancient Greek. But you can probably guess, since this is one of the most famous sayings in the world: “Give me one firm spot on which to stand, and I shall move the earth”. In the same sense, a couple of minor new features have been instrumental in opening up a huge leap in the capabilities of RavenDB.
In RavenDB 5.1, among other features, we added the ability to access the contents of an attachment during the indexing process. This means your indexing function can operate on attachments of documents and thus make the content of the attachment available for queries.
Another feature that we added was to enable NuGet integration for RavenDB indexes. The idea was that instead of making you deploy additional assemblies to extend RavenDB’s indexing processes, you’ll be able to define the additional behavior using NuGet packages.
NuGet is a package manager for .NET and pretty much all the libraries for the .NET platform are available through NuGet.
It was only when we had these two features complete and ready that we realized what kind of new capability we had added to RavenDB. By enabling you to reach the entire .NET ecosystem through NuGet, we made it possible to do far more inside your indexes. And by allowing you to access the contents of attachments in indexing, we gave you a lot more targets to use this new capability on.
The ability to use NuGet is the lever and the access to the attachments during indexing is the place to stand on. Now we can move the Earth. It may seem that I’m getting overly excited about this, I know. Let me give you some examples of how these features can be utilized together to great effect.
Integrating Office Into the Database
A document can have any number of attachments. An attachment can be of any size, and there are various mechanisms in place to optimize how RavenDB works with them.
An attachment is stored separately from the document data. We have users that store attachments in the GB range, for example. In practice, attachments in RavenDB are used very similarly to attachments in email. And like emails, a large number of attachments are Office documents, such as Word and Excel.
For example, you might attach the Word document for the Lease in your database for the property management application. Or an employee might submit their hours by uploading an Excel spreadsheet. Up until RavenDB 5.1, attachments were mostly static in RavenDB. They were managed by RavenDB and took part in transactions, replication, and backups, but they weren’t something that you would generally act upon. The new features in RavenDB 5.1 change the situation considerably.
We can utilize the DocumentFormat.OpenXml NuGet package to process Word documents in our indexes. The simplest scenario that we have is the ability to index the content of Office documents as part of searching inside of RavenDB.
The following blog post guides you through the process of exposing the content of Office documents as part of your RavenDB indexes. The final result is that you are now able to issue queries such as:
In this case, the Contract field we are searching on doesn’t actually exist in the Lease entity inside of RavenDB. Instead, this is a searchable field that is generated as part of the indexing process by reading the content of the contract attached to the Lease entity.
This sort of integration means that you can skip storing Office documents in SharePoint or trying to integrate multiple systems in your backend. This can reduce the cost and complexity of operating such a system significantly.
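To make the idea of an index-only field concrete, here is a minimal Python sketch. It is not RavenDB code and uses no RavenDB API: the lease dictionaries, the `extract_text` helper, and the `build_index` and `search_contract` functions are all hypothetical stand-ins, and the attachment is plain UTF-8 text rather than a real Word file. It only shows how a field that never exists on the stored document can still be searched.

```python
from collections import defaultdict

# Hypothetical stored documents: none of them has a "Contract" field.
leases = {
    "leases/1": {"Tenant": "Tenant A", "attachment": b"Tenant may keep pets on the premises."},
    "leases/2": {"Tenant": "Tenant B", "attachment": b"Smoking is not permitted in the building."},
}

def extract_text(attachment: bytes) -> str:
    """Stand-in for a real document parser (assumption: plain UTF-8 text)."""
    return attachment.decode("utf-8")

def build_index(docs):
    """Map each lowercase term of the attachment to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        contract = extract_text(doc["attachment"])  # derived at index time only
        for term in contract.lower().replace(".", " ").split():
            index[term].add(doc_id)
    return index

def search_contract(index, term):
    """Return the sorted ids of documents whose attachment contains the term."""
    return sorted(index.get(term.lower(), set()))

contract_index = build_index(leases)
print(search_contract(contract_index, "pets"))
```

The stored lease is never modified; the derived text lives only in the index, which is why the query can use a field the entity does not have.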
Machine Learning Integration
The ability to use NuGet packages in the indexing process leads to some fascinating possibilities for integrating RavenDB with the global .NET ecosystem. One of the more fascinating options is to integrate machine learning and prediction directly into RavenDB. Let’s consider the Microsoft.ML NuGet package as such a source.
I have trained a model using the Twitter Airline dataset to classify texts as positive, negative or neutral. I was then able to create the following index with RavenDB. This uses several advanced capabilities of RavenDB’s indexes at once.
The interesting part of this index is the SentimentAnalysis class, which is defined as an Additional Source in the index. Essentially, this is a way to extend how RavenDB indexes data. The full class can be found here; it is roughly 60 lines of mostly glue code. What SentimentAnalysis glues together is interesting, because this index also utilizes the ability to import packages from NuGet:
We import the Microsoft.ML package and then use the prediction engine to run prediction on the data as part of our indexing process. What this means is that I’m then able to run queries such as:
But we aren’t stopping here. Among other things, this index is using dynamic fields (utilizing the CreateField() call). That means that we can ask questions such as how negative a comment is.
We have made the results of machine learning prediction searchable and accessible to all. You can train your model to detect all sorts of interesting details and the results are going to be transparent to the rest of the system.
Instead of struggling with the complexities of machine learning, you can let RavenDB manage all of that for you. What is unique in the approach that RavenDB takes is that you don’t need a complicated pipeline or to integrate multiple systems. You can have everything delivered in the box, so to speak.
Another example of using Machine Learning in RavenDB is running image classification, but you are limited only by your imagination. You can use the same framework to run anomaly detection, predict the next value in a time series and more.
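The effect of dynamic fields can be sketched in a few lines of Python. This is an illustration only, not the index from this article and not RavenDB's CreateField() implementation: the `predict` function is a hypothetical keyword counter standing in for a trained model, and `index_entry` is an assumed helper that creates one field per predicted label.

```python
# Hypothetical stand-in for a trained model: a keyword count, not real machine learning.
NEGATIVE_WORDS = {"late", "lost", "rude"}
POSITIVE_WORDS = {"great", "thanks", "friendly"}

def predict(text):
    """Return a score per sentiment label; the scores sum to 1."""
    words = text.lower().split()
    neg = sum(w in NEGATIVE_WORDS for w in words)
    pos = sum(w in POSITIVE_WORDS for w in words)
    total = neg + pos + 1  # the extra 1 is the weight given to "Neutral"
    return {"Negative": neg / total, "Positive": pos / total, "Neutral": 1 / total}

def index_entry(comment):
    """Build an index entry with one dynamically named field per label."""
    entry = {"Id": comment["Id"]}
    for label, score in predict(comment["Text"]).items():
        entry[label] = score  # the field name comes from the data, not a fixed schema
    return entry

comments = [
    {"Id": "comments/1", "Text": "flight was late and the crew was rude"},
    {"Id": "comments/2", "Text": "great service thanks"},
]
entries = [index_entry(c) for c in comments]

# "How negative is a comment?" becomes a range filter on the dynamic field.
very_negative = [e["Id"] for e in entries if e["Negative"] > 0.5]
print(very_negative)
```

Because each label becomes its own numeric field, a question about degree (how negative) turns into an ordinary range comparison rather than a special-purpose query.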
Aggregating Machine Learning Results
A powerful feature in RavenDB is its ability to run large scale aggregations very efficiently. RavenDB is able to return results on datasets composed of billions of elements in milliseconds. This is done by utilizing a Map/Reduce system.
Usually, Map/Reduce is a pattern that is used to handle distributed computation. RavenDB uses it in a unique manner: instead of distributing computation over multiple machines, we are distributing the computation over time. Instead of recomputing an aggregation every time the data changes, RavenDB incrementally processes just the additions, deletions, or modifications to the dataset.
The output of the aggregation is then available for your queries to operate on. In essence, RavenDB takes the path of the Scout and makes sure to be prepared. By the time that you ask the question, your query can be answered with very little work.
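A toy Python sketch can show what "distributing the computation over time" means. This is a simplification under my own assumptions, not a description of RavenDB's internals: the `IncrementalSum` class and the order documents are hypothetical, and the only aggregation supported is a per-key sum.

```python
from collections import defaultdict

class IncrementalSum:
    """Keeps a per-key total up to date by applying only the change from each write."""

    def __init__(self, map_fn):
        self.map_fn = map_fn            # document -> (key, value)
        self.mapped = {}                # doc_id -> last (key, value) it contributed
        self.totals = defaultdict(int)  # key -> current aggregate

    def _retract(self, doc_id):
        """Remove a document's previous contribution, if it had one."""
        if doc_id in self.mapped:
            key, value = self.mapped.pop(doc_id)
            self.totals[key] -= value

    def put(self, doc_id, doc):
        """Handle an insert or an update: undo the old contribution, add the new one."""
        self._retract(doc_id)
        key, value = self.map_fn(doc)
        self.mapped[doc_id] = (key, value)
        self.totals[key] += value

    def delete(self, doc_id):
        self._retract(doc_id)

    def query(self, key):
        """Answering a query is a lookup; no documents are rescanned."""
        return self.totals.get(key, 0)

orders = IncrementalSum(lambda d: (d["Company"], d["Total"]))
orders.put("orders/1", {"Company": "acme", "Total": 10})
orders.put("orders/2", {"Company": "acme", "Total": 5})
orders.put("orders/1", {"Company": "acme", "Total": 7})  # modification
orders.delete("orders/2")                                 # deletion
print(orders.query("acme"))
```

Each write costs a small, constant amount of work, and the query itself does no aggregation at all. The work has been spread across the writes that came before it.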
The same abilities that we have previously explored in this article are also available for Map/Reduce operations. Using the Excel timecard as an example, we can define a Map/Reduce index that will tell us how many hours each employee worked this month and how much they should be paid.
The sentiment analysis we looked at can be used to compute the negativity score for each commentator in the system and measure the controversy associated with authors and topics.
RavenDB is a powerful database and the recent 5.1 release gives you a lot more options about how you can work with your data and process it. The capability to run sentiment analysis or full text search over the content of Office documents is now within reach, without having to integrate multiple separate systems.
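The shape of such a timecard aggregation can be sketched in Python. Everything here is an assumption made for illustration: CSV text stands in for the Excel attachment, the hourly rates are invented, all rows fall within a single month, and `map_timecard` and `reduce_entries` are hypothetical names rather than RavenDB API.

```python
import csv
import io
from collections import defaultdict

# Hypothetical timecard attachments; CSV text stands in for an Excel sheet.
timecards = {
    "employees/1": "date,hours\n2020-12-01,8\n2020-12-02,6.5\n",
    "employees/2": "date,hours\n2020-12-01,4\n",
}
HOURLY_RATE = {"employees/1": 30.0, "employees/2": 45.0}  # assumed rates

def map_timecard(employee_id, attachment_text):
    """Map step: emit one (employee, hours, pay) entry per row in the sheet."""
    for row in csv.DictReader(io.StringIO(attachment_text)):
        hours = float(row["hours"])
        yield {"Employee": employee_id, "Hours": hours,
               "Pay": hours * HOURLY_RATE[employee_id]}

def reduce_entries(entries):
    """Reduce step: group by employee and sum hours and pay."""
    totals = defaultdict(lambda: {"Hours": 0.0, "Pay": 0.0})
    for e in entries:
        totals[e["Employee"]]["Hours"] += e["Hours"]
        totals[e["Employee"]]["Pay"] += e["Pay"]
    return dict(totals)

mapped = [e for emp, text in timecards.items() for e in map_timecard(emp, text)]
monthly = reduce_entries(mapped)
print(monthly["employees/1"])
```

The map step is where the attachment gets read; the reduce step never sees the file, only the entries the map step emitted.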
Running such tasks directly inside the database engine can dramatically simplify your operational environment and reduce the number of balls you have to juggle in order to get the functionality you need.