Apache Superset in the Production Environment
We take a look at this interesting open source BI tool from the Apache Software Foundation, and show you how to set it up in a Docker container.
Visualizing data helps build a deeper understanding of it and speeds up analysis. Several mature paid products are available on the market. Recently, I explored an open source product named Apache Superset, which I found to be a very promising product in this space. Some prominent features of Superset are:
A rich set of data visualizations.
An easy-to-use interface for exploring and visualizing data.
The ability to create and share dashboards.
After reading about Superset, I wanted to try it. Since Superset is a Python-based project, we can easily install it using pip, but I decided to set it up as a container based on Docker. The Apache Superset GitHub repo contains code for building and running Superset as a container. Since I wanted to run Superset in a completely distributed manner while changing as little as possible, I modified that code so that it could run in multiple different modes.
Below is a list of the specific changes and enhancements made to the code.
Different versions of the Superset image can be built using the same code.
Superset configurations can be easily edited and mounted into the container, with no need to rebuild the image.
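As a rough illustration of the second point, the sketch below builds the -v argument for mounting a local configuration directory. The helper name config_mount_arg is hypothetical, and the container path /home/superset/config/ is taken from the docker run commands shown later in this article. The helper resolves the directory to an absolute path because Docker treats a bare name such as config as a named volume rather than a host directory.

```shell
# Hypothetical helper, not part of the docker-superset repository.
# Prints a "-v" argument value that bind-mounts a local config directory
# at the path the image reads its configuration from.
config_mount_arg() {
  # Resolve to an absolute path; fails if the directory does not exist.
  abs_dir=$(cd "$1" && pwd) || return 1
  printf '%s:/home/superset/config/\n' "$abs_dir"
}
```

Passing -v "$(config_mount_arg ./config)" to docker run should then expose files edited in ./config to the container without an image rebuild.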
Exploration Made Easy
Development mode is an excellent choice for exploring a project. However, it would be great if the initial exploration happened with all the features enabled; in the case of Superset, that means running queries in async mode and storing the results in a cache. You can explore Superset smoothly by using the commands below.
First, pull a Superset Docker image from Docker Hub:
docker pull abhioncbr/docker-superset:<tag>
Then, start the Superset image as a container in local or prod mode using:
cd docker-files/ && SUPERSET_ENV=<local | prod> SUPERSET_VERSION=<tag> docker-compose up -d
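Since SUPERSET_ENV accepts only local or prod, a small wrapper can catch a typo before docker-compose starts anything. The following is a minimal sketch, not part of the repository: the function names validate_env and start_superset are hypothetical, and it assumes you run it from the directory that contains docker-files/.

```shell
# Hypothetical helper: echoes the mode when it is "local" or "prod",
# and fails with a message on stderr otherwise.
validate_env() {
  case "$1" in
    local|prod) echo "$1" ;;
    *) echo "unsupported SUPERSET_ENV: $1" >&2; return 1 ;;
  esac
}

# Hypothetical wrapper: start_superset <env> <tag>
# Runs the same docker-compose command as above after validating <env>.
start_superset() {
  env_name=$(validate_env "$1") || return 1
  tag="$2"
  # The subshell keeps the caller's working directory unchanged.
  ( cd docker-files/ && SUPERSET_ENV="$env_name" SUPERSET_VERSION="$tag" docker-compose up -d )
}
```

After start_superset prod <tag> returns, running docker-compose ps inside docker-files/ lists the services that were started.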
Running Superset in a Completely Distributed Mode
As I understand it, a Superset deployment that serves thousands of end users in a production environment should be distributed in nature and easy to scale as requirements change. The below image depicts such a setup:
The published Docker image of Superset can be leveraged to build the setup shown above.
A load balancer in front routes requests from clients to one of the server containers.
Multiple containers in server mode serve the Superset UI. Starting a server container using docker run can be done as follows:
docker run -p 8088:8088 \
  -v config:/home/superset/config/ \
  abhioncbr/docker-superset:<tag> \
  cluster server <db_url> <redis_url>
Multiple containers in worker mode execute the SQL queries asynchronously using Celery. Starting a worker container using docker run can be done as follows:
docker run -p 5555:5555 \
  -v config:/home/superset/config/ \
  abhioncbr/docker-superset:<tag> \
  cluster worker <db_url> <redis_url>
Use a centralized Redis container or Redis cluster as the cache layer and as the Celery task queue for the workers.
Use a centralized Superset metadata database.
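To make the scaling idea concrete, here is a rough sketch that prints the docker run commands for several server and worker containers on a single host, so they can be reviewed before being executed. The function names container_cmd and print_cluster are hypothetical. The -d flag (run detached) and the incrementing host ports are my assumptions for a single-host trial; on separate hosts, each container could keep the original port mapping.

```shell
# Hypothetical helper: container_cmd <mode> <host_port> <tag> <db_url> <redis_url>
# Prints one docker run command for a server or worker container.
container_cmd() {
  mode="$1"; host_port="$2"; tag="$3"; db_url="$4"; redis_url="$5"
  case "$mode" in
    server) container_port=8088 ;;  # port published by the server command above
    worker) container_port=5555 ;;  # port published by the worker command above
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  # -d is an assumption: it detaches so several containers can be started in a row.
  printf 'docker run -d -p %s:%s -v config:/home/superset/config/ abhioncbr/docker-superset:%s cluster %s %s %s\n' \
    "$host_port" "$container_port" "$tag" "$mode" "$db_url" "$redis_url"
}

# Hypothetical helper: print_cluster <servers> <workers> <tag> <db_url> <redis_url>
# Prints one command per container, giving each a distinct host port.
print_cluster() {
  servers="$1"; workers="$2"
  i=0
  while [ "$i" -lt "$servers" ]; do
    container_cmd server $((8088 + i)) "$3" "$4" "$5" || return 1
    i=$((i + 1))
  done
  i=0
  while [ "$i" -lt "$workers" ]; do
    container_cmd worker $((5555 + i)) "$3" "$4" "$5" || return 1
    i=$((i + 1))
  done
}
```

Calling print_cluster 2 3 <tag> <db_url> <redis_url> prints two server commands and three worker commands. The load balancer would then route to the server host ports, starting at 8088.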
I found that setting up Superset as a Docker container is quite easy, and it can be used for different environments.