Apache Superset in the Production Environment

We take a look at this interesting open source BI tool from the Apache Foundation, and show you how to set it up in a Docker container.

Visualizing data builds a deeper understanding of it and speeds up analysis. Several mature paid products are available on the market. Recently, I explored an open source product named Apache Superset, which I found to be a very promising entry in this space. Some prominent features of Superset are:

  • A rich set of data visualizations.

  • An easy-to-use interface for exploring and visualizing data.

  • The ability to create and share dashboards.

After reading about Superset, I wanted to try it. Because Superset is a Python-based project, it can be installed easily using pip, but I decided to set it up as a Docker container instead. The Apache Superset GitHub repo contains code for building and running Superset as a container. Since I wanted to run Superset in a completely distributed manner, I modified that code, changing as little as possible, so that it can run in several different modes.

Below is a list of the specific changes and enhancements made to the code.

  • Different versions of the Superset image can be built using the same code.

  • Superset configurations can be easily edited and mounted into the container, with no need to rebuild the image.

  • Queries can be executed asynchronously through Celery-based executors, which can be managed through the Flower UI.
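
To make the last two points concrete, here is a minimal sketch of the kind of Celery settings a mounted Superset configuration file might carry. It is not the author's actual superset-config.py: the REDIS_URL environment variable is a hypothetical name, and the lowercase setting names follow recent Superset documentation (older releases used uppercase names such as BROKER_URL), so check them against the version you run.

```python
import os

# Assumption: the Redis endpoint is supplied through a REDIS_URL environment
# variable. This variable name is hypothetical, not something the image defines.
REDIS_URL = os.environ.get("REDIS_URL", "redis://localhost:6379/0")


class CeleryConfig:
    # Lowercase names follow recent Superset documentation; older releases
    # used uppercase equivalents such as BROKER_URL and CELERY_RESULT_BACKEND.
    broker_url = REDIS_URL            # queue the workers read tasks from
    imports = ("superset.sql_lab",)   # module that holds the SQL Lab query task
    result_backend = REDIS_URL        # where Celery keeps task state


# Superset looks up its Celery settings under this name.
CELERY_CONFIG = CeleryConfig
```

Storing the query results themselves additionally requires a RESULTS_BACKEND setting; its import path varies between Superset versions, so it is left out of this sketch.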

Exploration Made Easy

Development mode is an excellent choice for exploring a project, but it is better if the initial exploration covers all the features; in the case of Superset, for instance, running queries in async mode and storing the results in a cache. You can explore Superset with the commands below.

First, pull a Superset Docker image from Docker Hub:

docker pull abhioncbr/docker-superset:<tag>

Get docker-compose.yml and superset-config.py from the code base, keeping the same directory structure.

Lastly, start the Superset image as a container in local or prod mode using docker-compose:

cd docker-files/ && SUPERSET_ENV=<local | prod> SUPERSET_VERSION=<tag> docker-compose up -d
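
If you would rather script this step, the same invocation can be sketched in Python. The helper names below (compose_invocation, start_superset) are hypothetical; the only things taken from the command above are the docker-files directory, the two environment variables, and `docker-compose up -d`. The sketch defaults to a dry run so nothing is started by accident.

```python
import os
import subprocess

VALID_ENVS = ("local", "prod")


def compose_invocation(superset_env, version_tag, compose_dir="docker-files"):
    """Return (command, environment, working directory) for docker-compose."""
    if superset_env not in VALID_ENVS:
        raise ValueError(f"SUPERSET_ENV must be one of {VALID_ENVS}")
    env = dict(os.environ)                  # keep PATH and friends
    env["SUPERSET_ENV"] = superset_env      # local or prod
    env["SUPERSET_VERSION"] = version_tag   # image tag to run
    return ["docker-compose", "up", "-d"], env, compose_dir


def start_superset(superset_env, version_tag, dry_run=True):
    command, env, cwd = compose_invocation(superset_env, version_tag)
    if dry_run:
        # Only report what would run; nothing is started.
        print(f"(in {cwd}) SUPERSET_ENV={superset_env} "
              f"SUPERSET_VERSION={version_tag} {' '.join(command)}")
        return None
    return subprocess.run(command, env=env, cwd=cwd, check=True)
```

Calling `start_superset("local", "<tag>")` with a real tag in place of `<tag>` prints the command; pass `dry_run=False` to actually run it.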

Running Superset in a Completely Distributed Mode

As I understand it, a Superset deployment that serves thousands of end users in production should be distributed in nature and easy to scale as requirements grow. The image below depicts such a setup:


[Image: distributed Superset setup]

The published Docker image of Superset can be used to build the setup shown above.

  • A load balancer in front routes each client request to one of the server containers.

  • Run multiple containers in server mode to serve the Superset UI. A server container can be started with docker run as follows:

docker run -p 8088:8088 \
-v config:/home/superset/config/ \
abhioncbr/docker-superset:<tag> \
cluster server <db_url> <redis_url>

  • Use multiple containers in worker mode to execute SQL queries asynchronously with the Celery executor. A worker container can be started with docker run as follows:

docker run -p 5555:5555 \
-v config:/home/superset/config/ \
abhioncbr/docker-superset:<tag> \
cluster worker <db_url> <redis_url>

  • Use a centralized Redis container or Redis cluster as the cache layer and as the Celery task queue for the workers.

  • Use a centralized Superset metadata database.
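
The pieces of this list can be tied together with a small planning script. The sketch below only builds the docker run argument lists shown above for a given number of server and worker containers; it does not start anything. The function names are hypothetical, and the incrementing host ports are an assumption that only matters when several containers share one host.

```python
IMAGE = "abhioncbr/docker-superset"
# Container ports taken from the docker run commands above.
CONTAINER_PORTS = {"server": 8088, "worker": 5555}


def docker_run_args(mode, tag, db_url, redis_url, host_port=None):
    """Build the docker run argument list for one server or worker container."""
    if mode not in CONTAINER_PORTS:
        raise ValueError(f"mode must be one of {sorted(CONTAINER_PORTS)}")
    container_port = CONTAINER_PORTS[mode]
    if host_port is None:
        host_port = container_port
    return [
        "docker", "run",
        "-p", f"{host_port}:{container_port}",
        "-v", "config:/home/superset/config/",
        f"{IMAGE}:{tag}",
        "cluster", mode, db_url, redis_url,
    ]


def plan_cluster(tag, db_url, redis_url, servers=2, workers=2):
    """Return one argument list per container, servers first, then workers."""
    plan = []
    for mode, count in (("server", servers), ("worker", workers)):
        for i in range(count):
            # Assumption: containers share a host, so each gets its own host port.
            host_port = CONTAINER_PORTS[mode] + i
            plan.append(docker_run_args(mode, tag, db_url, redis_url, host_port))
    return plan


if __name__ == "__main__":
    # Placeholders are kept as-is; substitute real values before running.
    for args in plan_cluster("<tag>", "<db_url>", "<redis_url>"):
        print(" ".join(args))
```

Every container receives the same `<db_url>` and `<redis_url>`, which is what makes the metadata database and Redis the shared, centralized parts of the setup.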

I found that setting up Superset as a Docker container is quite easy, and it can be used for different environments.
