OpenSearch: Introduction and Data Management Patterns
A practical guide to OpenSearch for data and platform engineers—covering core concepts, usage patterns, and best practices for scalable search and analytics.
In this article, we will provide an introduction to OpenSearch for data engineers and platform engineers. We will introduce basic concepts and briefly demonstrate how OpenSearch can be used correctly for data ingestion of log and analytics data at scale.
Introduction to OpenSearch
OpenSearch is an open-source search and analytics database engine, which developers use to build solutions for various applications, including search, data observability, data ingestion, Security Information and Event Management (SIEM), vector databases, and more.
It’s designed for scalability and offers powerful full-text search capabilities, supporting both structured and unstructured data. Over time, OpenSearch has evolved into a standalone platform, distinguished by its unique features and capabilities.
Amazon Web Services (AWS) leads the OpenSearch initiative, which is now driven by a community steering committee. Because the OpenSearch Project is community-led, new features and innovations are continually proposed and developed to meet evolving search needs.
Basic OpenSearch Concepts
An OpenSearch cluster consists of nodes, sometimes having different roles. A node is a server that runs the OpenSearch software, and it works with documents instead of rows and tables.
OpenSearch stores data in indexes. An index is a collection of documents, and it is where they are stored and searched. An index is not directly comparable to a table in a relational database, but the analogy is a useful point of reference.
A document is a piece of information that contains multiple fields, similar to a row with columns, and it is stored in an index. It is the unit of data you store in OpenSearch, and it is expected to be in JSON format.
An index does not have any notion of order, and documents are added without any particular sequence.
An index is split into shards, which are partitions of the index that let it scale across nodes, and each shard can have replicas, which are copies of that shard.
A cluster is a collection of nodes, and an index can be replicated across multiple nodes in the cluster.
The schema of an index is called a mapping. It dictates how each field in the index is treated, and it can be defined when the index is created or managed afterwards using the mapping API.
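As a rough sketch of how shards, replicas, and a mapping come together, the snippet below builds the request you would send to create an index. The index name and field names are hypothetical, and the helper only constructs the request; sending it with your HTTP client of choice is left out.

```python
import json


def build_create_index_request(index_name, shards=3, replicas=1):
    """Return (method, path, body) for creating an index with explicit
    shard/replica settings and a mapping. Field names are illustrative."""
    body = {
        "settings": {
            "index": {
                "number_of_shards": shards,      # primary shards
                "number_of_replicas": replicas,  # copies of each primary shard
            }
        },
        "mappings": {
            "properties": {
                "@timestamp": {"type": "date"},
                "service": {"type": "keyword"},   # exact-match field
                "message": {"type": "text"},      # full-text field
                "status_code": {"type": "integer"},
            }
        },
    }
    return "PUT", f"/{index_name}", body


method, path, body = build_create_index_request("app-logs-000001")
print(method, path)
print(json.dumps(body, indent=2))
```

The number of primary shards is fixed when the index is created, so it is worth deciding on it up front; the replica count can be changed later.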
OpenSearch provides endpoints to access information about nodes, indices, and shards, allowing users to monitor and manage their data. The easiest one to use is the _cat API.
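A small sketch of how _cat URLs are put together is shown below. The base URL is an assumption (a local, unsecured cluster), and the helper only builds the URL so it can be pasted into a browser or passed to any HTTP client.

```python
from urllib.parse import urlencode


def cat_url(base_url, resource, **params):
    """Build a _cat URL. `v=true` asks for a header row in the text output;
    extra parameters such as format="json" are appended as given."""
    query = {"v": "true", **params}
    return f"{base_url.rstrip('/')}/_cat/{resource}?{urlencode(query)}"


# Hypothetical local cluster address.
BASE_URL = "http://localhost:9200"

for resource in ("nodes", "indices", "shards"):
    print(cat_url(BASE_URL, resource))

# JSON output is easier to consume from scripts than the default text table.
print(cat_url(BASE_URL, "indices", format="json"))
```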
Use Cases for OpenSearch
There are typically three main use cases for OpenSearch: log and metric analytics, search (for example, catalog search or enterprise search), and vector search. Combining traditional search (full-text search) with vector search is often referred to as Hybrid Search.
Log analytics involves ingesting log data, IoT data, or events, and then using dashboards for operational and security analytics.
Application or catalog search use cases involve allowing users to search through data, such as real estate listings, using various search methods like vector search, text search, or geography-based search.
Vector search, or semantic search, allows a more advanced search than just finding the text that appears in a document in an exact form, and is often useful in GenAI and Retrieval-Augmented Generation (RAG) use-cases.
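To make the vector search case more concrete, here is a minimal sketch of an index definition and a k-NN query body, assuming the k-NN plugin is available on the cluster. The field name `embedding`, the three-dimensional vectors, and the helper names are made up for illustration; real embeddings typically have hundreds of dimensions.

```python
import json


def build_vector_index_body(field, dimension):
    """Index body that enables k-NN and declares a vector field."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                field: {"type": "knn_vector", "dimension": dimension},
                "title": {"type": "text"},
            }
        },
    }


def build_knn_query(field, query_vector, k=5):
    """Search body asking for the k nearest neighbours of query_vector."""
    return {
        "size": k,
        "query": {"knn": {field: {"vector": list(query_vector), "k": k}}},
    }


index_body = build_vector_index_body("embedding", dimension=3)
query = build_knn_query("embedding", [0.1, 0.2, 0.3], k=2)
print(json.dumps(index_body, indent=2))
print(json.dumps(query, indent=2))
```

In a RAG setup, the query vector would come from the same embedding model that produced the stored vectors.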
Data Management Patterns in OpenSearch
Based on these use cases, there are data management patterns to be aware of, including index life cycle management patterns for data that becomes static over time.
Index life cycle management patterns, or Index State Management (ISM) in OpenSearch, involve applying tiers to data, with less frequently accessed data stored in less expensive tiers, and implementing retention patterns to manage older data.
Older data can be managed using time-based indexes, such as daily or monthly indexes, or other patterns and APIs.
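A retention pattern like this can be sketched as an ISM policy. The example below builds a simple hot, warm, delete policy; the policy ID, index pattern, and age thresholds are assumptions chosen for illustration, and the "warm" state here only marks the index read-only rather than moving it to cheaper storage.

```python
import json


def build_ism_policy(policy_id, index_pattern, warm_after="7d", delete_after="30d"):
    """Return (method, path, body) for a hot -> warm -> delete ISM policy."""
    body = {
        "policy": {
            "description": "Keep recent data hot, make older data read-only, then delete it.",
            "default_state": "hot",
            "states": [
                {
                    "name": "hot",
                    "actions": [],
                    "transitions": [
                        {"state_name": "warm", "conditions": {"min_index_age": warm_after}}
                    ],
                },
                {
                    "name": "warm",
                    "actions": [{"read_only": {}}],
                    "transitions": [
                        {"state_name": "delete", "conditions": {"min_index_age": delete_after}}
                    ],
                },
                {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
            ],
            # Attach the policy automatically to new indexes matching the pattern.
            "ism_template": {"index_patterns": [index_pattern], "priority": 100},
        }
    }
    return "PUT", f"/_plugins/_ism/policies/{policy_id}", body


method, path, policy = build_ism_policy("logs-retention", "app-logs-*")
print(method, path)
print(json.dumps(policy, indent=2))
```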
Index templates are essential tools for defining index mappings and settings on a cluster. When a new index is created and its name matches the template's index pattern, the template's mappings and settings are applied automatically.
Index templates can be used to manage mappings for time-based indexes or any other type of index, and using them is recommended in nearly every case.
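The following sketch builds an index template request and shows, with a local wildcard check, which index names the pattern would cover. The template name, pattern, and fields are hypothetical, and the local matching with `fnmatch` is only an approximation of how the cluster matches patterns.

```python
import json
from fnmatch import fnmatchcase


def build_index_template(template_name, pattern):
    """Return (method, path, body) for a composable index template."""
    body = {
        "index_patterns": [pattern],
        "priority": 100,  # higher priority wins when several templates match
        "template": {
            "settings": {"index": {"number_of_shards": 3, "number_of_replicas": 1}},
            "mappings": {
                "properties": {
                    "@timestamp": {"type": "date"},
                    "message": {"type": "text"},
                }
            },
        },
    }
    return "PUT", f"/_index_template/{template_name}", body


method, path, template = build_index_template("app-logs-template", "app-logs-*")
print(method, path)
print(json.dumps(template, indent=2))

# Rough local check of which new index names the template would apply to.
for name in ("app-logs-2025.01.01", "app-logs-000002", "metrics-000001"):
    print(name, fnmatchcase(name, template["index_patterns"][0]))
```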
The rollover API switches writes to a fresh index once the current one meets a condition, and it is an important tool for managing data in OpenSearch.
Index rollovers and data streams allow for maintaining rolling indexes, even if they are not date-based, by capping each index at a certain size, document count, or age.
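As an illustration, a rollover request against a write alias might be built like this. The alias name and the thresholds are assumptions; the rollover happens only if at least one of the listed conditions is met at the time the request is made.

```python
import json


def build_rollover_request(write_alias, max_age="1d", max_size="50gb", max_docs=None):
    """Return (method, path, body) for a conditional rollover of an alias."""
    conditions = {"max_age": max_age, "max_size": max_size}
    if max_docs is not None:
        conditions["max_docs"] = max_docs
    return "POST", f"/{write_alias}/_rollover", {"conditions": conditions}


method, path, body = build_rollover_request("app-logs-write", max_docs=100_000_000)
print(method, path)
print(json.dumps(body, indent=2))
```

In practice this call is usually not made by hand; an ISM policy with a rollover action, or a data stream, triggers it on a schedule.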
Maintaining an index per day can be problematic due to varying usage patterns, such as holidays and busy shopping days like Black Friday, which can lead to imbalanced cluster and index management issues.
Rollovers and data streams enable the management of indexes to keep them at a consistent size, which is important for avoiding issues related to imbalance in the cluster.
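The imbalance can be seen with a toy simulation. The daily volumes below are invented (one spike day stands in for something like Black Friday), and the size-capped model is idealized: real rollovers are checked periodically, so actual indexes overshoot the cap somewhat.

```python
def size_capped_index_sizes(daily_gb, cap_gb):
    """Split a stream of daily volumes into indexes that roll over
    as soon as they reach cap_gb (idealized, no overshoot)."""
    sizes = []
    current = 0
    for volume in daily_gb:
        remaining = volume
        while remaining > 0:
            taken = min(cap_gb - current, remaining)
            current += taken
            remaining -= taken
            if current >= cap_gb:
                sizes.append(current)  # index is full: roll over
                current = 0
    if current > 0:
        sizes.append(current)  # the index still being written to
    return sizes


# Hypothetical ingest volume per day, in GB, with one spike day.
daily_gb = [20, 22, 18, 150, 25, 19, 21]

per_day = list(daily_gb)  # one index per day: size equals that day's volume
capped = size_capped_index_sizes(daily_gb, cap_gb=50)

print("index per day :", per_day, "largest/smallest =", max(per_day) / min(per_day))
print("size-capped   :", capped)
```

With one index per day, the largest index is more than eight times the size of the smallest; with a size cap, every completed index is the same size and only the index currently being written is smaller.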
Data streams are in fact just a layer of abstraction on top of rollovers, allowing users to write to a data stream, which maintains the underlying indexes, creating a new index when the previous one reaches a certain size or age.
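A minimal sketch of the requests involved in setting up and writing to a data stream follows. The names are hypothetical, and the example assumes the default timestamp field, `@timestamp`, which every document written to a data stream is expected to carry.

```python
import json


def build_data_stream_template(template_name, pattern):
    """Index template that marks matching names as data streams."""
    body = {"index_patterns": [pattern], "data_stream": {}, "priority": 200}
    return "PUT", f"/_index_template/{template_name}", body


def build_create_data_stream(name):
    """Explicitly create the data stream (no request body needed)."""
    return "PUT", f"/_data_stream/{name}", None


def build_append_document(name, doc):
    """Write a document to the stream; the cluster routes it to the
    current backing index."""
    if "@timestamp" not in doc:
        raise ValueError("documents in a data stream need an @timestamp field")
    return "POST", f"/{name}/_doc", doc


requests = [
    build_data_stream_template("app-logs-stream-template", "app-logs-stream*"),
    build_create_data_stream("app-logs-stream"),
    build_append_document(
        "app-logs-stream",
        {"@timestamp": "2025-01-01T12:00:00Z", "message": "service started"},
    ),
]
for method, path, body in requests:
    print(method, path, json.dumps(body))
```

The writer only ever addresses the stream name; which backing index receives the document is decided by the cluster.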
Data Preprocessing and Ingestion
Preprocessing data before ingesting it into OpenSearch is usually recommended, but the platform also allows for preprocessing during ingestion using ingest pipelines and processors.
Ingest pipelines are defined on the cluster and run a chain of processors on each document as it arrives, for example to drop fields, set values, or perform geo enrichment. They can be difficult to debug, however, and they consume CPU resources on the cluster, often on the data nodes.
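Here is a hedged sketch of a small pipeline definition, together with a request to the simulate endpoint, which is one way to ease the debugging problem by running sample documents through a pipeline without indexing them. The pipeline ID, field names, and values are invented for the example.

```python
import json


def build_pipeline_request(pipeline_id):
    """Return (method, path, body) defining a two-processor pipeline."""
    body = {
        "description": "Drop a noisy field and tag documents with their environment.",
        "processors": [
            {"remove": {"field": "debug_payload", "ignore_missing": True}},
            {"set": {"field": "environment", "value": "production"}},
        ],
    }
    return "PUT", f"/_ingest/pipeline/{pipeline_id}", body


def build_simulate_request(pipeline_id, sample_docs):
    """Return (method, path, body) that dry-runs the pipeline on samples."""
    body = {"docs": [{"_source": doc} for doc in sample_docs]}
    return "POST", f"/_ingest/pipeline/{pipeline_id}/_simulate", body


define = build_pipeline_request("app-logs-cleanup")
simulate = build_simulate_request(
    "app-logs-cleanup",
    [{"message": "user logged in", "debug_payload": "..."}],
)
for method, path, body in (define, simulate):
    print(method, path)
    print(json.dumps(body, indent=2))
```

Once defined, a pipeline can be applied per request with the `pipeline` query parameter or set as an index's default pipeline.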
Index Optimization Techniques
Index sorting is an optimization technique for time-based data. Documents are stored pre-sorted at the index level (for example, by timestamp), so a query that sorts the same way can stop early instead of scanning the entire index.
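Index sorting is configured through index settings at creation time and cannot be changed on an existing index. The sketch below builds such a creation request; the index name and the choice of `@timestamp` in descending order (newest first) are assumptions for a typical log workload.

```python
import json


def build_sorted_index_request(index_name, sort_field="@timestamp", order="desc"):
    """Return (method, path, body) creating an index whose segments are
    stored sorted by sort_field. The sort field must be in the mapping."""
    body = {
        "settings": {
            "index": {
                "sort.field": sort_field,
                "sort.order": order,
            }
        },
        "mappings": {
            "properties": {
                sort_field: {"type": "date"},
                "message": {"type": "text"},
            }
        },
    }
    return "PUT", f"/{index_name}", body


method, path, body = build_sorted_index_request("app-logs-sorted-000001")
print(method, path)
print(json.dumps(body, indent=2))
```

Sorting adds some cost at write time, so it tends to pay off when most queries ask for the latest documents in the same order.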
Rolled-up indexes can be created when dealing with a large number of events that carry metrics. The data is aggregated into summary events at a coarser granularity, such as from ten seconds to one hour, which is useful for dashboard analysis over longer periods of time.
Rolled Up Indexes and Real-World Applications
This concept is applied in real-world scenarios, such as with Pulse, where a large volume of events is received in real time, but the high granularity is not necessary for dashboard analysis of past data.
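The roll-up idea itself can be shown in plain Python, independent of any OpenSearch API. This is a conceptual sketch with synthetic data, not the way Pulse or OpenSearch implements it: events arriving every ten seconds are collapsed into one summary document per hour.

```python
from collections import defaultdict


def roll_up(events, bucket_seconds=3600):
    """Aggregate (epoch_seconds, value) events into one summary per bucket."""
    buckets = defaultdict(lambda: {"count": 0, "sum": 0.0, "min": None, "max": None})
    for timestamp, value in events:
        start = timestamp - (timestamp % bucket_seconds)  # start of the bucket
        stats = buckets[start]
        stats["count"] += 1
        stats["sum"] += value
        stats["min"] = value if stats["min"] is None else min(stats["min"], value)
        stats["max"] = value if stats["max"] is None else max(stats["max"], value)
    return [
        {"bucket_start": start, **stats, "avg": stats["sum"] / stats["count"]}
        for start, stats in sorted(buckets.items())
    ]


# Two hours of synthetic metrics, one event every ten seconds.
events = [(t, float(t % 60)) for t in range(0, 7200, 10)]
rolled = roll_up(events)

print(len(events), "raw events ->", len(rolled), "rolled-up documents")
for doc in rolled:
    print(doc)
```

Storing the count, sum, minimum, and maximum per bucket keeps averages and extremes answerable after the raw events have been deleted.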
Reindexing Data and Conclusion
The reindex API copies documents from one index to another, which is useful for making mapping changes. Because the type of an existing field cannot be changed in place, you create a new index with the updated mapping and then repopulate it, either by re-ingesting from the source database or by reindexing from the existing OpenSearch index.
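A mapping change along these lines can be sketched as two requests: create the new index with the corrected mapping, then copy the documents over. The index names and the field being changed (`status_code` from text to integer) are hypothetical.

```python
import json


def build_new_index_request(index_name):
    """Create the destination index with the updated mapping."""
    body = {
        "mappings": {
            "properties": {
                "@timestamp": {"type": "date"},
                "status_code": {"type": "integer"},  # was mapped as text before
                "message": {"type": "text"},
            }
        }
    }
    return "PUT", f"/{index_name}", body


def build_reindex_request(source_index, dest_index):
    """Copy all documents from source_index into dest_index."""
    body = {"source": {"index": source_index}, "dest": {"index": dest_index}}
    return "POST", "/_reindex", body


steps = [
    build_new_index_request("app-logs-v2"),
    build_reindex_request("app-logs-v1", "app-logs-v2"),
]
for method, path, body in steps:
    print(method, path)
    print(json.dumps(body, indent=2))
```

If clients read through an alias, the alias can be pointed at the new index once the copy finishes, so the switch is invisible to them.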
Understanding data management patterns, including rolled up indexes and the reindex API, is crucial for efficiently managing data in OpenSearch.
This article is based on the OpenSearch for Data and Platform Engineers video tutorial series I've produced in collaboration with Pulse.