Full Text Searchable Log Repository Using Cassandra and Lucene

Logging is a cross-cutting concern relevant to almost all applications. Here's a look at building searchable logging with Lucene and Cassandra!

By Sutanu Dalui · Dec. 13, 2015 · Tutorial


Abstract

Logging is a cross-cutting concern common to almost all applications. We have robust libraries such as Log4j, Logback, or the JDK logging API for it. In many projects the logs are stored in a database for later analysis.

In this article we propose such a persistent, searchable, scalable logging infrastructure, one that can be customized or simply used as a plugin to extend an application's existing logging framework. We will discuss it as a Log4j plugin, using a custom Log4j appender to feed the framework.

Working code can be found at:

https://github.com/javanotes/weblogs

https://github.com/javanotes/weblog4jappender

The project uses the Stratio secondary index plugin for Cassandra, a Lucene-based full text search implementation on top of Cassandra. The core project is developed as a Spring Boot application that can run with an embedded servlet container or be deployed as a webapp. It provides:

  1. A RESTful API for log request ingestion. For example, the Log4j appender POSTs logging requests to this API
  2. A web-based console for viewing and searching through logs, along with some data visualization
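To make the ingestion flow concrete, here is a minimal sketch, in Python's logging API for brevity, of what such an appender could look like: each log record is serialized to a JSON document destined for the central ingestion endpoint. The endpoint and JSON field names are illustrative assumptions, not the actual API of the weblog4jappender project.

```python
import json
import logging

class RestLogHandler(logging.Handler):
    """Sketch of the custom-appender idea: every log record is turned
    into a JSON document and shipped to a central ingestion API.
    Field names and the endpoint are illustrative only."""

    def __init__(self, endpoint, app_id):
        super().__init__()
        self.endpoint = endpoint
        self.app_id = app_id

    def emit(self, record):
        payload = {
            "appId": self.app_id,                    # partition key in the log store
            "level": record.levelname,               # searchable severity
            "logText": self.format(record),          # full-text-indexed message
            "timestampMillis": int(record.created * 1000),
        }
        self.post(json.dumps(payload))

    def post(self, body):
        # A real appender would POST `body` to self.endpoint over HTTP,
        # ideally buffered/batched to stay non-intrusive; stubbed here so
        # the sketch stays self-contained.
        raise NotImplementedError
```

In practice the transport would be asynchronous, so a slow or unreachable log server never blocks the application thread.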

The Problem

A logging service that should be:

  • Persistent - Logs generated should be persisted as time series data onto a disk-backed system
  • Scalable - Logs can grow arbitrarily, so the infrastructure should scale to persist massive amounts of data
  • Non-intrusive - Logging should be a 'non-intrusive' cross-cutting concern of an application; that is, the performance and functionality of the application should be affected negligibly, if at all, by the logging service
  • Centralized - The infrastructure should be centralized, with multiple logging clients allowed to utilize it
  • Searchable - The persisted log data should be full text searchable and date-wise searchable, with support for paging through results
  • Pluggable - The solution should be extensible so that it can be plugged into any popular logging framework easily
  • Analyzable - The service should support some analytics, such as error count trends

The Solution

A persistent and searchable data store needs to be a database. While some traditional RDBMSs (MySQL, that I know of) do support full text search, storing a massive amount of data at a high write rate can quickly become a bottleneck. So we rule out an RDBMS solution.

Lucene-based datastores like Solr and Elasticsearch can be a good proposition, preferably Elasticsearch if we go by ease of use. However, these tools index document-wise; simplistically speaking, a complete record with all its fields is indexed. These solutions are a good fit for searching structurally complex records, but can be overkill for a limited search facility ('phrase' or 'term' search only) with no need for, say, relevance ranking.

This takes us to the NoSQL options. The immediate candidates that come to mind, purely on the basis of popularity (in internet search results, at least!), are Cassandra and MongoDB. In fact, Mongo has built-in support for full text search as well. However, keeping in mind the time series nature of the data, a key-value store (Cassandra) seems a better fit than a document-oriented store (Mongo). Also, given the write-heavy workload, Cassandra felt like the better candidate. The last inference is simply based on a previous project I had worked on, where we prototyped a big data ingestion platform on Cassandra. So... Cassandra!

Approach

We need to keep the following things in mind when using Cassandra:

  • A basic challenge: Cassandra (v2.2.3) has no out-of-the-box support for full text search
  • Dataset pagination is not trivial in Cassandra and has some limitations on previous/next fetches
  • A Cassandra data model needs to be designed top down; that is, we design how we store the data based on how we want to query it, not the other way around
  • The partition key needs to be based on something that will always be provided while querying. It should also distribute data evenly across the cluster
  • The timestamp column should be the first clustering key, with a timeuuid datatype. A timeuuid serves a dual purpose: it gives time series data a natural ordering, and it provides a unique 'row id' that is useful for pagination queries
  • Any other searchable field, say the logging level, can be kept as a subsequent clustering key
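The dual role of the timeuuid can be illustrated with Python's version-1 UUIDs, which are the same UUID variant as Cassandra's timeuuid; the helper below is illustrative, not part of the project:

```python
import uuid

# A version-1 UUID embeds a 60-bit timestamp (100-ns ticks since
# 1582-10-15) plus a clock sequence and node id, so it is both
# time-ordered and unique.
GREGORIAN_OFFSET = 0x01B21DD213814000  # ticks between 1582-10-15 and 1970-01-01

def timeuuid_to_unix_millis(u):
    """Recover the wall-clock timestamp embedded in a version-1 UUID."""
    return (u.time - GREGORIAN_OFFSET) // 10_000

earlier = uuid.uuid1()
later = uuid.uuid1()

# Natural ordering: the embedded timestamp never goes backwards
# for UUIDs generated in sequence by the same process.
assert earlier.time <= later.time
# Uniqueness: even within one clock tick, clock-seq/node disambiguate.
assert earlier != later
```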

Full Text Search

For full text search capability, we will use a custom secondary index plugin for Cassandra. A custom secondary index in Cassandra is an external Java extension that Cassandra loads as a dynamic library.

Stratio has developed a Lucene-based custom secondary index implementation as part of their core big data platform, and it is open source. From their GitHub wiki:

It is a plugin for Apache Cassandra that extends its index functionality to provide near real time search such as ElasticSearch or Solr, including full text search capabilities and free multivariable, geospatial and bitemporal search.

Keeping in mind the above points, a simple data model can be as follows:

CREATE TABLE data_points (
    app_id   text,
    event_ts timeuuid,
    level    text,
    log_text text,
    lucene   text,
    PRIMARY KEY (app_id, event_ts)
) WITH CLUSTERING ORDER BY (event_ts DESC);

The column 'lucene' is a dummy column that the Stratio plugin uses to index the designated fields. More details can be found in their documentation.
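As a rough sketch of how the index itself is attached to the dummy column (the index name, refresh interval, analyzer, and field mappings here are assumptions based on the Stratio plugin's documented syntax, not the project's actual configuration):

```sql
-- Illustrative Stratio Lucene index on the dummy 'lucene' column
CREATE CUSTOM INDEX data_points_index ON data_points (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds': '1',
    'schema': '{
        fields: {
            log_text: {type: "text", analyzer: "english"},
            level:    {type: "string"},
            event_ts: {type: "date", pattern: "yyyy/MM/dd HH:mm:ss.SSS"}
        }
    }'
};
```

Only the fields declared in the schema are Lucene-indexed; everything else stays a plain Cassandra column.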

Queries can then be formed to fetch time-series-ordered data as follows:

SELECT * FROM data_points
WHERE app_id = <app_log_id>
  AND lucene = '{
        filter: {type: "boolean", must: [
            {type: "phrase", field: "log_text", value: "<search_phrase_or_term>"},
            {type: "match", field: "level", value: "INFO"}
        ], should: [], not: []},
        sort: {fields: [{field: "event_ts", reverse: false}]}
    }'
  AND event_ts >= minTimeuuid(<lower_bound_timeuuid>)
  AND event_ts <= maxTimeuuid(<upper_bound_timeuuid>)
LIMIT <page_size>;
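The JSON search expression embedded in the lucene column is easy to get wrong with string concatenation; a small helper (hypothetical, not part of the weblogs project) can build it from a dict instead, which also takes care of escaping:

```python
import json

def build_lucene_filter(search_text, level=None, reverse=False):
    """Build the Stratio search expression placed in the 'lucene'
    column of the query above. Emitting it via json.dumps guarantees
    well-formed JSON."""
    must = [{"type": "phrase", "field": "log_text", "value": search_text}]
    if level:
        must.append({"type": "match", "field": "level", "value": level})
    expr = {
        "filter": {"type": "boolean", "must": must, "should": [], "not": []},
        "sort": {"fields": [{"field": "event_ts", "reverse": reverse}]},
    }
    return json.dumps(expr)
```

The returned string is then bound to the `lucene = ?` clause of the prepared CQL statement.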

The important points to note for pagination:

  1. For NEXT, <upper_bound> comes from the last record in the current page
  2. For PREVIOUS, <lower_bound> comes from the first record in the current page
  3. For NEXT the limit applies from the 'head'; for PREVIOUS it applies from the 'tail'. Accordingly, for a PREVIOUS query the timeuuid sort should be reversed
  4. Skipping to arbitrary page numbers is not supported, since there is no concept of 'offset'
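The head/tail behavior above can be sketched with plain integers standing in for timeuuid timestamps (an illustrative model, not the project's code):

```python
# Rows are held newest-first, mirroring CLUSTERING ORDER BY (event_ts DESC).

def next_page(rows_desc, last_ts, page_size):
    """NEXT: rows strictly older than the last row of the current page,
    limit applied from the head of the time-descending order."""
    older = [r for r in rows_desc if r < last_ts]
    return older[:page_size]

def previous_page(rows_desc, first_ts, page_size):
    """PREVIOUS: rows strictly newer than the first row of the current
    page; the limit applies from the 'tail', i.e. fetch in ascending
    order, take page_size, then present newest-first again."""
    newer_asc = sorted(r for r in rows_desc if r > first_ts)
    return sorted(newer_asc[:page_size], reverse=True)

rows = list(range(100, 0, -10))        # [100, 90, ..., 10], newest first
page1 = rows[:3]                        # [100, 90, 80]
page2 = next_page(rows, page1[-1], 3)   # [70, 60, 50]
back = previous_page(rows, page2[0], 3)
assert page2 == [70, 60, 50]
assert back == page1
```

In the real queries the same effect is achieved with minTimeuuid/maxTimeuuid bounds and the `reverse` flag in the Lucene sort.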

