Solr: Not Just For Text Anymore

When Solr came out, it was supposed to be an open-source text search engine. Now it has a big place in Big Data. Read what Ness's CTO, Moshe Kranc, has to say about how it has evolved.

by Moshe Kranc · Aug. 08, 16 · Opinion

When Solr was first created in 2004, it was intended to be an open-source text search engine to provide Google-like search capabilities for uses such as corporate websites and internal document search. Based on the Lucene search library, Solr added a client-server architecture, a RESTful API, and some syntactic sugar for text queries.

Fast forward to 2016 and Solr has evolved from an enterprise search engine or a poor man’s Google into a viable choice for real-time Big Data analytics, competing with products such as Redshift, Spark, and Presto. The metamorphosis was gradual, so you may have missed it. Here are some of the highlights:

  • Support for non-text fields: Early on, Solr introduced the ability to define non-text fields such as numbers and dates. Why is this useful in a text search engine? For example, in addition to a textual field that describes a movie’s title, you might also wish to define the year in which the movie was released. A user could then search for all movies made between 2005 and 2008 whose title includes the word “Battle.”
  • Faceted search: This is the dynamic clustering of search results into categories, so that the user can drill down into search results based on any value in a field. For example, suppose a database of available jobs includes a field for City and a field for Position. The user can then search for all Software Engineer jobs and see how many open Software Engineer jobs there are in each city. Or, the user can search for all jobs in Boston and see a breakdown of how many openings there are for each type of position in Boston. (Note that faceting is really a form of high-speed aggregation, i.e., counting the number of instances of all values for a given field, without the need for pre-aggregation.)
  • High availability and scalability: SolrCloud, released in 2012, provides clustering of Solr nodes. Data is automatically sharded and replicated across nodes in the cluster, queries are automatically distributed across the cluster, and node failover is performed automatically. With SolrCloud, Solr became an industrial-strength product that could be trusted with mission-critical data and operations.
  • Performance improvements: In its early days, adding new data to Solr required rebuilding the entire index. This made Solr a very static product: index rebuilds were scheduled for off hours, and until then no new data was searchable. Later versions implemented near-instantaneous updates via an in-memory index that complements the main disk-based index. Solr also added several layers of caching, so that frequently repeated queries (or portions of queries) do not need to be re-run.
  • SQL support: The Solr query language is similar to SQL, but it is not SQL, so it will not work with SQL-compliant tools, e.g., analytic visualization tools such as Tableau. The recent Solr 6 release added support for SQL queries, as well as a JDBC driver. For query workloads, Solr can now stand in for a relational database behind SQL-based tools.
  • Schema-less support for unstructured data: Solr needs to know the type of a given field in order to index it correctly (indexing text is very different from indexing a number). This is fine for relational tables, where all the columns are known in advance. But in a NoSQL world, where columns are not known in advance and data is a set of arbitrary key-value pairs, how can Solr know the field type? Solr came up with a solution based on user-defined naming conventions, e.g., if the field name starts with “t_” then it is a text field. Thanks to this, Solr can support NoSQL unstructured data.
  • Bloomberg Analytics Component for Solr: Bloomberg Financial Services uses Solr extensively, and found the existing statistical packages woefully lacking. So, they developed a high-performance framework that can perform complex calculations and aggregations on time-series data, and then released it as open source.
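
To make the faceting bullet concrete, here is a minimal Python sketch of what a facet count amounts to: filter the documents, then count the values of one field in the result. This is not Solr code and calls no Solr API. The job records and the `search` and `facet_counts` helpers are invented purely for illustration.

```python
from collections import Counter

# Hypothetical job postings; the fields mirror the City/Position example above.
JOBS = [
    {"position": "Software Engineer", "city": "Boston"},
    {"position": "Software Engineer", "city": "Austin"},
    {"position": "Software Engineer", "city": "Boston"},
    {"position": "Data Analyst", "city": "Boston"},
]


def search(docs, **filters):
    """Return the documents whose fields equal every given filter value."""
    return [d for d in docs if all(d.get(k) == v for k, v in filters.items())]


def facet_counts(docs, field):
    """Count how many of the given documents carry each value of `field`."""
    return Counter(d[field] for d in docs)


# "All Software Engineer jobs, broken down by city."
engineer_jobs_by_city = facet_counts(search(JOBS, position="Software Engineer"), "city")

# "All jobs in Boston, broken down by position."
boston_jobs_by_position = facet_counts(search(JOBS, city="Boston"), "position")
```

The point of the sketch is that the counts are computed at query time from whatever documents matched, which is why no pre-aggregation is needed. A search engine does the same work against an index rather than a Python list.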

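The naming-convention idea from the schema-less bullet can be sketched as a simple prefix-to-type lookup. Only the “t_” prefix comes from the example above. The other prefixes, the fallback type, and the `infer_field_type` function are assumptions made for illustration, and in Solr itself such rules are declared in the schema rather than written as code.

```python
# Hypothetical prefix rules; only "t_" is taken from the article's example.
PREFIX_TYPES = {
    "t_": "text",
    "i_": "integer",
    "d_": "date",
}


def infer_field_type(field_name, default="string"):
    """Pick an index type from the field name's prefix, else use `default`."""
    for prefix, field_type in PREFIX_TYPES.items():
        if field_name.startswith(prefix):
            return field_type
    return default


# An arbitrary key-value document whose columns were not known in advance.
document = {"t_title": "Battle Royale", "i_year": 2005, "color": "red"}
field_types = {name: infer_field_type(name) for name in document}
```

Because the type is derived from the name alone, a document can introduce a brand-new field without anyone editing the schema first.
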
Today, Solr is not just for text search anymore. It is a high-speed, high-availability SQL/NoSQL database that can perform aggregations and other complex calculations in real time. This is not just theory: Ness has customers who use Solr in production to provide real-time aggregation and time-series analysis for hundreds of simultaneous users. Solr has evolved into a viable alternative to other products, such as Spark and Amazon Redshift, that perform real-time aggregation on Big Data.

A closing note: Solr has a younger competitor named Elasticsearch, which is also based on Lucene. The two products compete neck and neck as far as capabilities go, and a new feature in one product rapidly finds its way into the other. I do not mean to take a side in this competition; everything written here about Solr is also true of Elasticsearch. But the Solr story is more compelling because of the metamorphosis Solr had to undergo over the past twelve years. As the joke goes, G-d could create the world in 6 days only because he didn’t have to support an installed base. The Solr team had to re-create Solr as a real-time analytic engine while continuing to support an installed base, and for that, they deserve our admiration.

Opinions expressed by DZone contributors are their own.
