An Introduction to Elasticsearch

How to start querying data and documents with Elasticsearch with a few detailed examples.

By Hasan Rahhal · May. 18, 16 · Tutorial

Elasticsearch is an open source, RESTful search engine built on top of Apache Lucene and released under the Apache license. It is Java-based and can index and search documents in diverse formats.

Elasticsearch has been compared to Apache Solr and offers several notable features:

  • Provides a scalable search solution.
  • Performs near-real-time searches.
  • Provides support for multi-tenancy.
  • An index can be easily recovered in case of a server crash.
  • Uses JavaScript Object Notation (JSON) and Java application program interfaces (APIs).
  • Automatically indexes JSON documents.
  • Each index can have its own settings.
  • Searches can be done with Lucene-based query strings.

Indices and Types

Every time you store data in Elasticsearch, it gets saved inside an index, under a type. Compared to MongoDB, an index is similar to a database, and a type is similar to a collection. Compared to SQL, an index would be like a database, and a type like a table.

Convention:

localhost:9200/{index}/{type}/

Important note: different types living in the same index cannot have the same field name with a different config or field type.

For example, the following two documents can't co-exist, since they share the same index and both have a city attribute, but of different types (string and object, respectively):

localhost:9200/test/users/1
{
    "city": "cityID123"
}
localhost:9200/test/city/1
{
    "city": {
        "name": "Toronto"
    }
}
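To make the conflict concrete, here is a minimal Python sketch that scans the mappings of several types in one index and reports fields declared with different types. It is an illustration only: the helper name find_field_conflicts and the dictionary layout are assumptions, and it is Elasticsearch itself that rejects the conflicting mapping on the server.

```python
def find_field_conflicts(mappings):
    """Return {field: {type_name: field_type}} for every field that is
    declared with different field types across the types of one index.

    `mappings` is a hypothetical dict shaped like the "mappings" section
    of an index: {type_name: {"properties": {field: config}}}.
    """
    seen = {}
    for type_name, mapping in mappings.items():
        for field, config in mapping.get("properties", {}).items():
            # A field with sub-properties and no explicit "type" is an object.
            field_type = config.get("type", "object")
            seen.setdefault(field, {})[type_name] = field_type
    # Keep only the fields whose declared types disagree.
    return {f: t for f, t in seen.items() if len(set(t.values())) > 1}


# The two documents above, expressed as the mappings they would imply.
test_index = {
    "users": {"properties": {"city": {"type": "string"}}},
    "city": {"properties": {"city": {"properties": {"name": {"type": "string"}}}}},
}

conflicts = find_field_conflicts(test_index)
print(conflicts)  # {'city': {'users': 'string', 'city': 'object'}}
```

Running a check like this before sending mappings can surface a clash early, but it does not replace the server-side validation.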

When developing with Elasticsearch, there are three main steps we have to consider: mapping, indexing, and searching data.

1. Mapping

Mapping is used to define how Elasticsearch should store and index a particular document and its fields.

However, if no mapping is defined for a field before indexing, Elasticsearch will dynamically assign a generic type to that field. Although this may sound tempting, it is NOT, since generic types are very basic and most of the time do not meet query expectations.

Moving forward with this tutorial we will base our examples on the following data schema:

{
    "first_name": "bam",
    "last_name": "margera",
    "gender": "male",
    "age": 36
}

To make things more efficient, we're going to create the index, type, and mapping for the schema in one request, which looks like the following:

PUT localhost:9200/test/

{
    "mappings": {
        "users": {
            "properties": {
                "age": {
                    "type": "long"
                },
                "first_name": {
                    "type": "string"
                },
                "gender": {
                    "type": "string"
                },
                "level": {
                    "type": "string"
                },
                "last_name": {
                    "type": "string"
                }
            }
        }
    }
}

This creates an index called test and a type called users with 5 fields (the four from the schema above plus level).

Note that field types include, among others, the following values: string, date, long, double, boolean, ip, object, nested, geo_point, and geo_shape.

If everything goes well, we should get the following response:

{
  "acknowledged": true
}
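The same request can be sketched from a script. The Python below builds the mapping body shown above and defines a helper that would send it with the standard library; the base URL, the helper names build_mapping and create_index, and the assumption that a node of the same Elasticsearch version is listening locally are all illustrative, not part of the original tutorial.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local node, as in the examples


def build_mapping(type_name, fields):
    """Build a create-index body with one type and simple field types."""
    properties = {name: {"type": field_type} for name, field_type in fields.items()}
    return {"mappings": {type_name: {"properties": properties}}}


def create_index(index, body, base_url=BASE_URL):
    """PUT the body to /{index}/ and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/{index}/",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


mapping_body = build_mapping(
    "users",
    {
        "age": "long",
        "first_name": "string",
        "gender": "string",
        "level": "string",
        "last_name": "string",
    },
)
print(json.dumps(mapping_body, indent=4))
```

Calling create_index("test", mapping_body) against a running node would perform the PUT; nothing is sent until that call is made.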

Now that we've told Elasticsearch what kind of data we want to insert, let's go ahead and index (store) it.

2. Indexing

Indexing, or storing, is the process of inserting data into Elasticsearch through the Index API to make it searchable.

So let's index 3 simple documents:

POST localhost:9200/test/users/
{
    "first_name": "Bam",
    "last_name": "Margera",
    "gender": "male",
    "level": "super awesome",
    "age": 36
}

POST localhost:9200/test/users/

{
    "first_name": "Stephanie",
    "last_name": "Hodge",
    "gender": "female",
    "level": "awesome",
    "age": 34
}

POST localhost:9200/test/users/

{
    "first_name": "Johnny",
    "last_name": "Knoxville",
    "gender": "male",
    "level": "awesome",
    "age": 45
}

On success for any of the docs above, we should see a response like this:

{
  "_index": "test",
  "_type": "users",
  "_id": "AVRQDOka0YBBUjDwpzQQ",
  "_version": 1,
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "created": true
}

Where _id is an ID generated by Elasticsearch: a 20-character, URL-safe, Base64-encoded GUID string.

We can also specify our own id after the type like this:

POST localhost:9200/test/users/MyID123
{
    "first_name": "Bam",
    "last_name": "Margera",
    "gender": "male",
    "level": "super awesome",
    "age": 36
}
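As a rough sketch of the two forms of the index request, the Python below prepares (but does not send) a POST with and without an explicit ID. The helper name build_index_request and the base URL are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local node, as in the examples


def build_index_request(index, type_name, document, doc_id=None, base_url=BASE_URL):
    """Prepare a POST to /{index}/{type}/ or /{index}/{type}/{id}.

    Without doc_id, Elasticsearch generates the _id itself.
    """
    url = f"{base_url}/{index}/{type_name}/"
    if doc_id is not None:
        # Escape the ID so it stays a single path segment.
        url += urllib.parse.quote(str(doc_id), safe="")
    return urllib.request.Request(
        url,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


bam = {
    "first_name": "Bam",
    "last_name": "Margera",
    "gender": "male",
    "level": "super awesome",
    "age": 36,
}

auto_id_request = build_index_request("test", "users", bam)
own_id_request = build_index_request("test", "users", bam, doc_id="MyID123")
print(auto_id_request.full_url)  # http://localhost:9200/test/users/
print(own_id_request.full_url)   # http://localhost:9200/test/users/MyID123
```

Passing either request to urllib.request.urlopen would send it to a running node.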

Now that we have our data indexed, let's move forward to query it.

3. Searching

In this section we will cover Elasticsearch queries, filters, and aggregations.

To search in a specific index and type, the following convention is used:

POST localhost:9200/test/users/_search

Sending this request returns a response that looks like:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "users",
        "_id": "AVRQQlCE0YBBUjDwpzQZ",
        "_score": 1,
        "_source": {
          "first_name": "Bam",
          "last_name": "Margera",
          "gender": "male",
          "level": "super awesome",
          "age": 36
        }
      },
      ... the other 2 docs go here
    ]
  }
}

By looking at this response, we can see that the data we inserted is found inside the hits.hits array, in each hit's _source object. Since we didn't actually specify anything to search for, we get a _score of 1 for all docs.

At the top level, hits.total is the total number of docs that matched (all of them here, since the search query is empty), and max_score is the highest score among the docs matching the query. In our case it's 1, since no query was specified.

The value of _shards.total is the number of shards, each of which is a Lucene index, that Elasticsearch created for that index. In the Elasticsearch version used here, the default is 5 unless we specify otherwise at index creation time. The Elasticsearch documentation explains shards in more detail.
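Reading such a response from code comes down to walking the hits object. The following Python sketch assumes the response shape shown above (in particular, hits.total as a plain number); the helper name summarize_hits and the trimmed sample response are illustrative.

```python
def summarize_hits(response):
    """Pull the useful parts out of a search response.

    Returns the total match count, the maximum score, and one
    (_id, _score, _source) tuple per returned document.
    """
    hits = response["hits"]
    docs = [(hit["_id"], hit["_score"], hit["_source"]) for hit in hits["hits"]]
    return {"total": hits["total"], "max_score": hits["max_score"], "docs": docs}


# Trimmed copy of the response above, with only the first hit included.
sample_response = {
    "took": 4,
    "timed_out": False,
    "_shards": {"total": 5, "successful": 5, "failed": 0},
    "hits": {
        "total": 3,
        "max_score": 1,
        "hits": [
            {
                "_index": "test",
                "_type": "users",
                "_id": "AVRQQlCE0YBBUjDwpzQZ",
                "_score": 1,
                "_source": {"first_name": "Bam", "last_name": "Margera", "age": 36},
            }
        ],
    },
}

summary = summarize_hits(sample_response)
for doc_id, score, source in summary["docs"]:
    print(doc_id, score, source["first_name"])
```

Note that hits.total counts every matching doc, while hits.hits holds only the docs returned in this page of results, so the two can differ.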

a. Queries

Queries are what we use to get results with scoring (relevance).

To ask a question like

  • level = "super awesome"

Using the match query, which performs full-text search on analyzed fields, we would write:

POST localhost:9200/test/users/_search

{
    "query": {
        "match": {
            "level": "super awesome"
        }
    }
}

The response will be:

{
  "took": 19,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.2712221,
    "hits": [
      {
        "_index": "test",
        "_type": "users",
        "_id": "AVRQQlCE0YBBUjDwpzQZ",
        "_score": 0.2712221,
        "_source": {
          "first_name": "Bam",
          "level": "super awesome",
          ...
        }
      },
      {
        "_index": "test",
        "_type": "users",
        "_id": "AVRQRtYW0YBBUjDwpzQa",
        "_score": 0.09848769,
        "_source": {
          "first_name": "Stephanie",
          "level": "awesome",
          ...
        }
      },
      {
        "_index": "test",
        "_type": "users",
        "_id": "AVRQRx-E0YBBUjDwpzQf",
        "_score": 0.09848769,
        "_source": {
          "first_name": "Johnny",
          "level": "awesome",
          ...
        }
      }
    ]
  }
}

As we can see, the user Bam scored the highest with 0.2712221, since his level is "super awesome", whereas Stephanie and Johnny both scored 0.09848769, since their level is just "awesome".

For exact values (non-analyzed fields, numbers, dates, and Booleans), it's better to use the term query:

  • age = 36

POST localhost:9200/test/users/_search

{
     "query": {
        "term": {
            "age": 36
        }
    }
}

This query will return only Bam.

To combine more than one query, we can use the bool query to find:

  • level = "super awesome" AND age < 40

POST localhost:9200/test/users/_search

{
     "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "level": "super awesome"
                    }
                },
                {
                    "range": {
                        "age": {
                            "lt": 40
                        }
                    }
                }
            ]
        }
    }
}

Here must is an array whose clauses are combined with AND. bool also supports should, implying OR, and must_not, implying NOT.

Moreover, we used the range query with "lt" to match ages less than 40; range also supports "lte", "gt", and "gte".
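When these bodies are assembled in code, small builder functions keep the nesting manageable. The Python sketch below reproduces the bool request above; the function names match, range_query, and bool_query are hypothetical helpers, not part of any Elasticsearch client.

```python
def match(field, text):
    """Full-text match clause on an analyzed field."""
    return {"match": {field: text}}


def range_query(field, **bounds):
    """Range clause; only lt, lte, gt, and gte are accepted as bounds."""
    allowed = {"lt", "lte", "gt", "gte"}
    unknown = set(bounds) - allowed
    if unknown:
        raise ValueError(f"unsupported range operators: {sorted(unknown)}")
    return {"range": {field: dict(bounds)}}


def bool_query(must=(), should=(), must_not=()):
    """Combine clauses: must acts as AND, should as OR, must_not as NOT."""
    clauses = {}
    for name, items in (("must", must), ("should", should), ("must_not", must_not)):
        if items:  # leave out empty sections entirely
            clauses[name] = list(items)
    return {"query": {"bool": clauses}}


# level = "super awesome" AND age < 40
body = bool_query(must=[match("level", "super awesome"), range_query("age", lt=40)])
print(body)
```

The resulting dictionary serializes to the same JSON as the request body above.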

b. Filters

Filters are non-scoring queries that can be used when the score has no importance. A filter simply answers "yes" or "no" for each doc, and the score is always 1.

Executing the following filter has no effect on the score, but will return only 2 docs:

POST localhost:9200/test/users/_search

{
    "filter": {
       "match": {
            "gender": "male"
        }
    }
}

Whereas combining this with a previous query:

  • level = "super awesome" AND only return gender = "male"

POST localhost:9200/test/users/_search

{
     "query": {
        "match": {
            "level": "super awesome"
        }
    },
     "filter": {
        "match": {
            "gender": "male"
        }
    }
}

This will return only 2 users, Bam and Johnny, scoring 0.2712221 and 0.09848769 respectively, where Bam has a more relevant level than Johnny.

Although this works fine, it is bad for performance, since it executes the query first and then applies the filter to the returned results.

To make Elasticsearch apply the filter first, limiting the number of docs before the query runs, we should wrap everything in a bool clause and add the filter next to must:

POST localhost:9200/test/users/_search

{
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "level": "super awesome"
                    }
                }
            ],
            "filter": {
                "match": {
                    "gender": "male"
                }
            }
        }    
    }
}


More More More...

  • level = "super awesome", and age < 40 but only return gender = "male"

We would write:

POST localhost:9200/test/users/_search

{
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "level": "super awesome"
                    }
                },
                {
                    "range": {
                        "age": {
                            "lt": 40
                        }
                    }
                }
            ],
            "filter": {
                "match": {
                    "gender": "male"
                }
            }
        }    
    }
}

This will return only 1 user, Bam, scoring 1.0253175.

Important note: We can also combine more than 1 filter using a bool clause.

So as Elasticsearch states: "As a general rule, use query clauses for full-text search or for any condition that should affect the relevance score, and use filters for everything else."
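Following that rule, one way to sketch the previous search is to keep only the full-text condition in must and move every yes/no condition into filter. The Python below assumes that the filter section of bool accepts a list of clauses; the helper name filtered_search is hypothetical.

```python
def filtered_search(must, filters):
    """Build a bool query: `must` clauses are scored, `filters` are not."""
    return {"query": {"bool": {"must": list(must), "filter": list(filters)}}}


body = filtered_search(
    # Affects relevance, so it stays a query clause.
    must=[{"match": {"level": "super awesome"}}],
    # Plain yes/no conditions, so they go into the filter.
    filters=[
        {"match": {"gender": "male"}},
        {"range": {"age": {"lt": 40}}},
    ],
)
print(body)
```

With the age range moved into the filter, it no longer contributes to the score, so the score for Bam would be expected to differ from the one reported above.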

c. Aggregations

Aggregations are a big part of Elasticsearch; they are used to calculate stats about our data. They are divided into 3 different types:

  • Metrics Aggregations
  • Bucket Aggregations
  • Pipeline Aggregations

In this tutorial I'm gonna cover the terms aggregation, which is one of the Bucket Aggregations.

  • How many females and males do we have in our index/type?

We can write the following:

POST localhost:9200/test/users/_search

{
    "size": 0,
    "aggs" : {
        "genders" : {
            "terms" : { "field" : "gender" }
        }
    }
}

We set "size" to 0 since we don't want to see any search results, just the aggs results. "aggs" is a predefined Elasticsearch property, followed by "genders", which is a name we can choose freely; we could call it "xyz" if we wanted. "terms" indicates that we are performing a terms aggregation, and "field" specifies the field we want to aggregate on: gender.

Response:

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "genders": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "male",
          "doc_count": 2
        },
        {
          "key": "female",
          "doc_count": 1
        }
      ]
    }
  }
}

What we want is everything inside the buckets array, which tells us that we have 1 female and 2 males.
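In code, the buckets array turns naturally into a dictionary of counts. This Python sketch builds the aggregation body and reads a response shaped like the one above; the helper names terms_agg_body and bucket_counts, and the trimmed sample response, are illustrative assumptions.

```python
def terms_agg_body(agg_name, field):
    """Request body for a terms aggregation that returns no search hits."""
    return {"size": 0, "aggs": {agg_name: {"terms": {"field": field}}}}


def bucket_counts(response, agg_name):
    """Map each bucket key to its doc_count for the named aggregation."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Trimmed copy of the aggregation response above.
sample_response = {
    "aggregations": {
        "genders": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "male", "doc_count": 2},
                {"key": "female", "doc_count": 1},
            ],
        }
    }
}

counts = bucket_counts(sample_response, "genders")
print(counts)  # {'male': 2, 'female': 1}
```

The name passed to bucket_counts has to match the name chosen in the request, since that name is the key under "aggregations" in the response.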

The power of aggs is that it can be combined with any filter/query.

So using the last filter we created, we can simply say:

{
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "level": "super awesome"
                    }
                },
                {
                    "range": {
                        "age": {
                            "lt": 40
                        }
                    }
                }
            ],
            "filter": {
                "match": {
                    "gender": "male"
                }
            }
        }    
    },
    "aggs" : {
        "genders" : {
            "terms" : { "field" : "gender" }
        }
    }
}

This will return Bam with a male count = 1 : )

TADAH!

Published at DZone with permission of Hasan Rahhal. See the original article here.
