Exploring CockroachDB With Jupyter Notebook and Python
Today, we're going to explore CockroachDB from the Data Science perspective, using a popular exploratory web tool called Jupyter Notebook and Python language.
Join the DZone community and get the full member experience.
Join For FreeThis is the next installment in the series of tutorials on CockroachDB and Docker Compose.
I was inspired to write this post based on this article. The article goes over using Jupyter with Oracle, MySql, and Postgresql, we're going to do the same with Cockroach! One caveat here is the heavy reliance on ipython-sql library. We're going to use Pandas library as the ipython-sql magic functions are not compatible with Cockroach today. Hopefully, you will find it useful.
You can find the older posts here: Part 1, Part 2, Part 3, Part 4.
- Information on CockroachDB can be found here.
- Information on Jupyter Notebook can be found here.
We're going to continue using our trusty docker-compose. Jupyter project publishes its images on Docker Hub. We're going to use the minimal image for this tutorial.
Here's my compose file:
version: '3.9'
services:
crdb:
image: cockroachdb/cockroach:v21.2.3
container_name: crdb-1
ports:
- "26257:26257"
- "8080:8080"
command: start-single-node --insecure
volumes:
- ${PWD}/cockroach-data/crdb:/cockroach/cockroach-data:rw
jupyter:
image: jupyter/minimal-notebook
container_name: jupyter
environment:
- GRANT_SUDO=yes
ports:
- "8888:8888"
volumes:
- $PWD:/home/jovyan/work
If you've been following my posts, you may notice that I switched to CockroacihDB version 19.2.3. For the Jupyter environment, I'm passing GRANT_SUDO=yes as an environment variable in case, we need to install some additional software inside the container. I'm also exposing the notebook at port 8888.
- Open Jupyter notebook
When docker-compose up is executed, Jupyter will output an access token to the Jupyter notebook.
To access the notebook, open this file in a browser:
file:///home/jovyan/.local/share/jupyter/runtime/nbserver-6-open.html
Or copy and paste one of these URLs:
http://1b8009ed3252:8888/?token=37a818efee156c6e3b2d3bd21db07827ce250560a336987b
or http://127.0.0.1:8888/?token=37a818efee156c6e3b2d3bd21db07827ce250560a336987b
You can also access this information with the `docker logs ' command. Copy one of the provided URLs and paste it into a browser.
- Once you're in the notebook, you will notice a
workdirectory. Clicking into it will expose all of the files in your host's current directory.
This behavior can be explained by the following volume mapping.
volumes:
- $PWD:/home/jovyan/work
Let's create a new notebook by clicking New and selecting Python 3. You can click on the title of the notebook and name it to your liking.
- Install prerequisite software
The beauty of Jupyter's environment is its ability to adapt to different scenarios. In the next step, we're going to install Python libraries using pip from within the notebook!
!pip install sqlalchemy cockroachdb pandas psycopg2-binary matplotlib
NOTE: Since the release of this post, cockroachdb adapter for sqlalchemy has been renamed to sqlalchemy-cockroachdb.
Then click on the Run icon to execute.
Before we go to the next step, let's switch to Cockroach and load some data.
- Initialize a workload.
docker exec -it crdb-1 ./cockroach workload init movr
This will initialize a workload for our application. We've chosen a sample movr application.
10:18 $ docker exec -it crdb-1 ./cockroach workload init movr I200211 15:18:30.420866 1 workload/workloadsql/dataload.go:135 imported users (0s, 50 rows) I200211 15:18:30.431182 1 workload/workloadsql/dataload.go:135 imported vehicles (0s, 15 rows) I200211 15:18:30.468309 1 workload/workloadsql/dataload.go:135 imported rides (0s, 500 rows) I200211 15:18:30.498384 1 workload/workloadsql/dataload.go:135 imported vehicle_location_histories (0s, 1000 rows) I200211 15:18:30.528526 1 workload/workloadsql/dataload.go:135 imported promo_codes (0s, 1000 rows)
Now we're ready to populate the database:
docker exec -it crdb-1 ./cockroach workload run movr --duration=1m
The workload will run for 1m, specified by the --duration parameter.
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total 59.9s 0 439 7.3 1.9 1.9 2.4 2.5 4.5 addUser _elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total 59.9s 0 124 2.1 3.6 3.7 4.5 4.5 5.0 addVehicle ...
There are different workload generators available with Cockroach, feel free to explore them here. When load completes, you can log in to the database and poke around in the database.
docker exec -it crdb-1 ./cockroach sql --insecure
root@:26257/defaultdb> show databases;
database_name
+---------------+
bank
defaultdb
movr
postgres
system
(5 rows)
Time: 2.847ms
root@:26257/defaultdb> use movr;
SET
Time: 856.8µs
root@:26257/movr> show tables;
table_name
+----------------------------+
promo_codes
rides
user_promo_codes
users
vehicle_location_histories
vehicles
(6 rows)
Time: 2.459ms
root@:26257/movr>
While we're there, let's also add a user with access to our database.
- Create a user and grant access
CREATE USER IF NOT EXISTS maxroach; GRANT ALL ON DATABASE movr TO maxroach; GRANT ALL ON TABLE movr.* TO maxroach; \q
We can test access for users maxroach with:
docker exec -it crdb-1 ./cockroach sql --insecure --user=maxroach --database=movr
- Access the database in Jupyter.
import pandas as pd
from sqlalchemy.engine import create_engine
engine = create_engine('cockroachdb://maxroach@crdb-1:26257/movr')
df = pd.read_sql('select count(*) from vehicles', engine)
If you're familiar with Python, sqlalchemy is a popular Python framework for working with databases. Feel free to explore our docs for more in-depth examples.
Once a database connection is established, we use pandas data frame to read the results of the select statement. Click on Run to execute. Nothing happened!
Because we stored the result in a variable called df, we need to use the methods provided by the library.
Type the following and hit Run.
df.head()
The good news, the result matches the total count of vehicles in the database!
Let's try another query:
df = pd.read_sql('select city, count(city) from vehicles group by city', engine)
print(df)
- Use
matplotlibto graph a plot
Jupyter is also capable of graphing your results, here's a simple graph chart of the data frame.
df.plot(x ='city', y='count', kind = 'bar')
- Graph a pie chart.
df.plot.pie(y='count',figsize=(5, 5),autopct='%1.1f%%', startangle=90)
This is just a taste of the powerful Jupyter Notebook and CockroachDB Python capabilities. Feel free to visit the docs for more.
Please share your feedback in the comments.
Published at DZone with permission of Artem Ervits. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments