Elasticsearch for Dummies

Get to know the basics of Elasticsearch, its advantages, how to install it, and how to index documents using Elasticsearch.

Have you heard about the popular open-source search and indexing tool used by giants like Wikipedia and LinkedIn? No? I’m pretty sure you’ve at least heard of it in passing.

I’m talking about Elasticsearch. In this post, you’ll get to know the basics of Elasticsearch, its advantages, how to install it, and how to index documents using Elasticsearch.

What Is Elasticsearch?

Elasticsearch is an open-source, enterprise-grade search engine built for fast search and data-discovery applications. With Elasticsearch, we can store, search, and analyze large volumes of data in near real time. It is generally used as the underlying engine for applications whose search features range from simple to complex.

Advantages of Elasticsearch

  • Built on top of Lucene: Being built on top of Lucene, it inherits Lucene’s powerful full-text search capabilities.

  • Document-oriented: It stores complex entities as structured JSON documents and indexes all fields by default, so any field can be searched without extra configuration.

  • Schema-free: It stores large quantities of semi-structured (JSON) data in a distributed fashion. It also attempts to detect the structure of incoming data and index it accordingly, making the data searchable without a predefined schema.

  • Full-text search: Elasticsearch performs linguistic searches against documents and returns the documents that match the search condition. Result relevancy for a given query is scored with a similarity algorithm from the TF/IDF family; since version 5.0 the default is BM25, a refinement of classic TF/IDF.

  • RESTful API: Elasticsearch exposes a REST API over HTTP, which keeps the interface lightweight. We can query Elasticsearch through the REST API with the Chrome plug-in Sense. Sense provides a simple user interface with features such as autocompletion of Elasticsearch query syntax and copying a query as a cURL command.
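To make the relevance scoring mentioned in the full-text search bullet more concrete, here is a minimal Python sketch of the textbook TF/IDF idea: a term counts for more when it is frequent in a document and rare across the collection. This is not Lucene’s actual scoring formula, which adds normalization and other factors; the function name tf_idf_scores and the sample product names are made up for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(query, docs):
    """Score each document against the query with a textbook TF-IDF sum.

    Illustrative only: Lucene's real scoring adds normalization and
    other factors that are omitted here.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Term frequency: how often the term occurs in this document.
            tf = counts[term] / len(tokens) if tokens else 0.0
            # Document frequency: how many documents contain the term.
            df = sum(1 for t in tokenized if term in t)
            # Smoothed inverse document frequency; rarer terms weigh more.
            idf = math.log((1 + n_docs) / (1 + df)) + 1.0
            score += tf * idf
        scores.append(score)
    return scores


# Hypothetical product names standing in for indexed documents.
docs = [
    "red running shoes",
    "blue running jacket",
    "red wine glasses",
]
scores = tf_idf_scores("red shoes", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best])
```

The document containing both query terms ranks first, the one containing only the more common term ranks second, and the one containing neither scores zero.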

Elasticsearch Terminology

  • Cluster: A collection of nodes that share data.

  • Node: A single server that is part of the cluster, stores the data, and participates in the cluster’s indexing and search capabilities.

  • Index: A collection of documents with similar characteristics. An index is roughly equivalent to a database (schema) in an RDBMS.

  • Type: There can be multiple types within an index. For example, an e-commerce application can have used products in one type and new products in another type of the same index. One index can have multiple types, just as one database can have multiple tables.

  • Document: A basic unit of information that can be indexed. It is like a row in a table.

  • Shards and replicas: Elasticsearch indexes are divided into multiple pieces called shards, which allows the index to scale horizontally. Elasticsearch also allows us to make copies of index shards, which are called replicas.
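A rough way to picture how an index split into shards decides where a document lives is to hash a routing value (by default the document ID) and take the result modulo the number of primary shards. The Python sketch below is a simplified illustration of that idea only: zlib.crc32 stands in for Elasticsearch’s internal hash function, and pick_shard and the shard count of 5 are hypothetical choices for the example.

```python
import zlib


def pick_shard(doc_id, num_primary_shards):
    """Map a document ID to a shard number in [0, num_primary_shards).

    Stand-in hash: zlib.crc32 is used only because it ships with
    Python; Elasticsearch uses its own hash function internally.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


# Hypothetical index with 5 primary shards.
assignments = {doc_id: pick_shard(doc_id, 5) for doc_id in ["1", "2", "3", "42"]}
print(assignments)
```

Because the result depends on the number of primary shards, the same ID could land on a different shard if that number changed, which is one way to see why the primary shard count is fixed when an index is created.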

Use Cases

E-commerce websites use Elasticsearch to index their entire product catalog and inventory, along with all the product attributes that the end user can search against.

Whenever a user searches for a product on the website, the corresponding query will hit an index with millions of products and retrieve the product in near real-time.

Or, say you want to collect log or transaction data and then analyze and mine it for statistics, summarizations, or anomalies. In this case, you can index the data into Elasticsearch. Once the data is in Elasticsearch, we can visualize it with Timelion or D3.js to better understand the collected logs.

Installation

Let’s assume that you are in a Linux-based environment and have Java 8 or later installed, which Elasticsearch 5.x requires. Download Elasticsearch using the command below:

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.4.0.tar.gz

Then extract it:

tar -zxvf elasticsearch-5.4.0.tar.gz

Go to the folder where Elasticsearch has been extracted:

cd elasticsearch-5.4.0

Start the Elasticsearch server:

bin/elasticsearch

Once the server is up, it listens at http://localhost:9200. Here, localhost denotes the host (server), and 9200 is Elasticsearch’s default port. To confirm everything is working, open that address in your web browser. You should see something like this:

{
  "name" : "90AzDAw",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "e6t_hv6eQCi280elcktrUQ",
  "version" : {
    "number" : "5.4.0",
    "build_hash" : "780f8c4",
    "build_date" : "2017-04-28T17:43:27.229Z",
    "build_snapshot" : false,
    "lucene_version" : "6.5.0"
  },
  "tagline" : "You Know, for Search"
}

Indexing Documents

Elasticsearch uses Lucene indexes to store and retrieve data. Adding data to Elasticsearch is known as indexing. During an indexing operation, Elasticsearch converts raw data into its internal documents. Each document is simply a set of keys and their corresponding values: the keys are strings, and the values can be any of several data types, such as strings, numbers, lists, and dates.
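As a small illustration of such a key/value document, the Python sketch below builds one as a dictionary and serializes it to the JSON that would be sent for indexing. Apart from name, the fields (age, allergies, admitted) are hypothetical, and writing the date as an ISO 8601 string is an assumption about how one might choose to represent it.

```python
import json
from datetime import date

# Keys are strings; values cover several of the data types mentioned above.
patient = {
    "name": "John",                            # string
    "age": 42,                                 # number
    "allergies": ["pollen", "peanuts"],        # list
    "admitted": date(2017, 5, 1).isoformat(),  # date, written as an ISO 8601 string
}

# JSON has no native date type, so the date travels as a string.
payload = json.dumps(patient)
print(payload)
```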

We can query Elasticsearch using the methods mentioned below:

  • cURL command

  • Using an HTTP client

  • Querying with the JSON DSL

Elasticsearch provides a REST API that we can interact with in a variety of ways through common HTTP methods like GET, POST, PUT, and DELETE, which correspond to the familiar CRUD operations.
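For readers who prefer an HTTP client over cURL, the following Python sketch shows one way the CRUD operations could map onto HTTP requests, using only the standard library. It builds the requests without sending them, so it runs even when no server is up. The helper build_request is a hypothetical name, the index and type names are the ones used in the commands that follow, and the _update endpoint for partial updates is not covered in this article.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumes a local node on the default port


def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request for the REST API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


# Rough CRUD-to-HTTP mapping for a single document.
crud = {
    "create": build_request("PUT", "/patient/outpatient/1", {"name": "John"}),
    "read": build_request("GET", "/patient/outpatient/1"),
    # Partial update endpoint; not covered in the article.
    "update": build_request(
        "POST", "/patient/outpatient/1/_update", {"doc": {"name": "Jane"}}
    ),
    "delete": build_request("DELETE", "/patient/outpatient/1"),
}

for operation, request in crud.items():
    print(operation, request.get_method(), request.full_url)

# To actually send one against a running node:
# urllib.request.urlopen(crud["read"])
```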

Now, let’s try indexing some data in our Elasticsearch instance.

curl -XPUT 'http://localhost:9200/patient/outpatient/1?pretty' -H 'Content-Type: application/json' -d '
{
  "name" : "John",
  "City" : "California"
}'

This command inserts the JSON document into an index named patient with the type named outpatient. 1 is the ID here. If you would rather not choose an ID, send the request with POST to patient/outpatient (without the trailing ID) and Elasticsearch will generate one for you. pretty is used to pretty-print the JSON response. To replace an existing document with updated data, we just PUT it again.

With the above method, we can insert only one document at a time. To bulk load data, we can use Elasticsearch’s Bulk API.

curl -XPOST 'localhost:9200/patient/outpatient/_bulk?pretty&refresh' -H 'Content-Type: application/x-ndjson' --data-binary "@/home/ubuntu/Ex.json"

The above command loads the Ex.json file into the patient index.
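The contents of Ex.json are not shown here. The Bulk API expects newline-delimited JSON in which each document is preceded by an action line, and the whole body ends with a newline. The Python sketch below generates such a body; the helper name to_bulk_ndjson and the sample records are hypothetical, and because the index and type are already given in the URL, the action lines carry only the ID.

```python
import json


def to_bulk_ndjson(docs):
    """Render (id, document) pairs as a Bulk API request body."""
    lines = []
    for doc_id, doc in docs:
        # Action line: index the document under the given ID.
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical sample records.
patients = [
    ("1", {"name": "John", "City": "California"}),
    ("2", {"name": "Mary", "City": "Texas"}),
]
body = to_bulk_ndjson(patients)
print(body, end="")

# To save it for the curl command:
# with open("Ex.json", "w") as handle:
#     handle.write(body)
```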

Retrieving a Document

A document in an index can be retrieved with a GET request.

curl -XGET 'localhost:9200/patient/outpatient/1?pretty'

The response of this command contains the resulting JSON document under the _source field.

{
  "_index" : "patient",
  "_type" : "outpatient",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "name" : "John",
    "City" : "California"
  }
}

It returns the document with the ID 1 and some metadata about the document.
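In client code, we usually want only the original document and not the surrounding metadata. A minimal Python sketch of that step, assuming the response body has already been read into a string (extract_source is a hypothetical helper name):

```python
import json

# The response shown above, pasted in as a string for illustration.
response_text = """
{
  "_index": "patient",
  "_type": "outpatient",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "name": "John",
    "City": "California"
  }
}
"""


def extract_source(text):
    """Return the stored document, or None if it was not found."""
    result = json.loads(text)
    if not result.get("found"):
        return None
    return result["_source"]


document = extract_source(response_text)
print(document["name"], document["City"])
```

Checking found first avoids a KeyError when the requested ID does not exist, since such a response carries no _source field.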

Deleting a Document

This API allows us to delete a JSON document from an index.

curl -XDELETE 'localhost:9200/patient/outpatient/1?pretty'

This command deletes the JSON document with the ID 1. To delete all documents that match a specific condition, we can use the _delete_by_query API.

curl -XPOST 'localhost:9200/patient/_delete_by_query?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "match": { "City": "California" }
  }
}'
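One detail worth keeping in mind is that field names in Elasticsearch are case-sensitive, so the field in the query has to be spelled exactly as it was at index time ("City" in the indexing command earlier). The Python sketch below builds the same kind of request body programmatically; delete_by_match_body is a hypothetical helper name.

```python
import json


def delete_by_match_body(field, value):
    """Build a _delete_by_query body that matches one field value."""
    return {"query": {"match": {field: value}}}


# Field names are case-sensitive: the document indexed earlier used "City".
body = delete_by_match_body("City", "California")
print(json.dumps(body, indent=2))
```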

That’s how we index, retrieve, and delete documents using Elasticsearch.

Be it in terms of configuration or usage, Elasticsearch is quite elastic in comparison to its peers. Systems working with big data may encounter I/O bottlenecks during data analysis and search operations. For systems like these, Elasticsearch is a strong choice.

Published at DZone with permission of
