
Using SingleStore as a Time Series Database

Exploring SingleStore's support for Time Series data using historical S&P 500 stock data from Kaggle. We'll also build a quick dashboard to visualize candlestick charts.

By Akmal Chaudhri · Feb. 09, 22 · Tutorial

Abstract

SingleStore is a very versatile database system. It is based upon Relational Technology and supports multi-model capabilities, such as Key-Value, JSON, Full-Text Search, Geospatial, and Time Series.

This article will explore SingleStore's support for Time Series data using historical S&P 500 stock data from Kaggle. We'll also build a quick dashboard to visualize candlestick charts using Streamlit.

The SQL scripts, Python code, and notebook files used in this article are available on GitHub. The notebook files are available in DBC, HTML, and iPython formats.

Introduction

Since the advent of Relational Database Technology, many new requirements for managing data have emerged. Luminaries such as Martin Fowler have proposed Polyglot Persistence as one solution for managing diverse data and data processing requirements, as shown in Figure 1.

Figure 1. Polyglot Persistence.

However, Polyglot Persistence comes with costs and has attracted criticisms, such as:

In an often-cited post on polyglot persistence, Martin Fowler sketches a web application for a hypothetical retailer that uses each of Riak, Neo4j, MongoDB, Cassandra, and an RDBMS for distinct data sets. It's not hard to imagine his retailer's DevOps engineers quitting in droves.
 — Stephen Pimentel

And also:

What I've seen in the past has been is if you try to take on six of these [technologies], you need a staff of 18 people minimum just to operate the storage side — say, six storage technologies. That's not scalable and it's too expensive.
 — Dave McCrory

In recent years, there have also been proposals for implementing a Polyglot Persistence architecture with microservices. However, SingleStore can provide a simpler solution by supporting diverse data types and processing requirements in a single multi-model database system. This offers many benefits, such as a lower TCO, a smaller learning burden on developers, and no integration pains. In a series of articles, we'll discuss SingleStore's multi-model capabilities in more detail. We'll start with Time Series data.

To begin with, we need to create a free Managed Service account on the SingleStore website and a free Community Edition (CE) account on the Databricks website. At the time of writing, the Managed Service account from SingleStore comes with $500 of credits, which is more than adequate for the case study described in this article. For Databricks CE, we need to sign up for the free account rather than the trial version. We are using Spark because, in a previous article, we noted that Spark was great for ETL with SingleStore.

If you don't have an account at Kaggle, create one and download the all_stocks_5yr.csv file. The Kaggle website states that this file is 29.58 MB in size. The dataset consists of the following fields:

  • date: Spans a five-year daily period from 8 February 2013 until 7 February 2018. No missing values.
  • open: Opening price. 11 missing values.
  • high: High price. 8 missing values.
  • low: Low price. 8 missing values.
  • close: Closing price. No missing values.
  • volume: Total shares traded. No missing values.
  • name: Trading symbol. 505 unique values. No missing values.

For our initial exploration, we'll select just the date, close, and name fields.
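
Before loading anything, a quick local check with pandas can confirm the column names and missing-value counts listed above. This is an optional sketch; it assumes the downloaded file sits in the current working directory and that the column names match the field list above:

Python
 
import pandas as pd

# Load the downloaded Kaggle file; adjust the path if you saved it elsewhere.
stocks_df = pd.read_csv("all_stocks_5yr.csv")

# Column names as they appear in the file, plus missing values per column.
# The open, high, and low columns should show a handful of gaps, as noted above.
print(stocks_df.columns.tolist())
print(stocks_df.isna().sum())

# The date column should span 2013-02-08 to 2018-02-07 (ISO date strings compare correctly).
print(stocks_df["date"].min(), stocks_df["date"].max())
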

Configure Databricks CE

A previous article provides detailed instructions on how to Configure Databricks CE for use with SingleStore. We can use those exact instructions for this use case.

Upload CSV file

To use the CSV file, we need to upload it into the Databricks CE environment. A previous article provides detailed instructions on how to upload a CSV file. We can use those exact instructions for this use case.
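
Once the upload completes, we can confirm the file landed where we expect by listing it from a notebook cell. This is just a sanity check; the path assumes the default FileStore upload location used later in this article:

Python
 
# Confirm the uploaded file is where spark.read.csv will look for it.
display(dbutils.fs.ls("/FileStore/all_stocks_5yr.csv"))
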

Create the Database Table

In our SingleStore Managed Service account, let's use the SQL Editor to create a new database. Call this timeseries_db, as follows:

SQL
 
CREATE DATABASE IF NOT EXISTS timeseries_db;


We'll also create a table, as follows:

SQL
 
USE timeseries_db;

CREATE ROWSTORE TABLE IF NOT EXISTS tick ( 
     ts     DATETIME SERIES TIMESTAMP, 
     symbol VARCHAR(5), 
     price  NUMERIC(18, 4),
     KEY(ts)
);


Each row has a time-valued attribute called ts. We'll use DATETIME rather than DATETIME(6), since we are not working with fractional seconds in this example. SERIES TIMESTAMP specifies a table column as the default timestamp. We'll create a KEY on ts as this will allow us to filter on ranges of values efficiently.
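
If you prefer to run the DDL from Python rather than the SQL Editor, a minimal sketch using pymysql (which we install later for the Streamlit application) might look like the following. The <TO DO> placeholders are the host and password of your Managed Service cluster:

Python
 
import pymysql

# Connect to the SingleStore Managed Service cluster (admin user, default port 3306).
conn = pymysql.connect(host = "<TO DO>", port = 3306, user = "admin", password = "<TO DO>")

with conn.cursor() as cur:
    cur.execute("CREATE DATABASE IF NOT EXISTS timeseries_db")
    cur.execute("""
        CREATE ROWSTORE TABLE IF NOT EXISTS timeseries_db.tick (
            ts     DATETIME SERIES TIMESTAMP,
            symbol VARCHAR(5),
            price  NUMERIC(18, 4),
            KEY(ts)
        )
    """)

conn.close()
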

Fill Out the Notebook

Let's now create a new Databricks CE Python notebook. We'll call it Data Loader for Time Series. We'll attach our new notebook to our Spark cluster.

In a new code cell, let's add the following code:

Python
 
from pyspark.sql.types import *

tick_schema = StructType([
  StructField("ts", TimestampType(), True),
  StructField("open", DoubleType(), True),
  StructField("high", DoubleType(), True),
  StructField("low", DoubleType(), True),
  StructField("price", DoubleType(), True),
  StructField("volume", IntegerType(), True),
  StructField("symbol", StringType(), True)
])


This schema ensures that we have the correct column types.

We'll create a new Dataframe in the next code cell, as follows:

Python
 
tick_df = spark.read.csv("/FileStore/all_stocks_5yr.csv",
                         header = True,
                         schema = tick_schema)


This reads the CSV file and creates a Dataframe called tick_df. We also tell Spark that there is a header row and ask it to use the previously defined schema.

In the next code cell, let's get the number of rows:

Python
 
tick_df.count()


Executing this, we obtain the value 619040.
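
As an optional sanity check, we can also confirm that the timestamps span the expected five-year window (8 February 2013 to 7 February 2018) and that all 505 symbols are present. A minimal sketch:

Python
 
from pyspark.sql import functions as F

# Earliest and latest timestamps, plus the number of distinct trading symbols.
tick_df.select(
  F.min("ts").alias("first_ts"),
  F.max("ts").alias("last_ts"),
  F.countDistinct("symbol").alias("symbols")
).show()
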

We'll remove some of the columns based upon our earlier decision for the initial analysis, as follows:

Python
 
tick_df = tick_df.drop("open", "high", "low", "volume")


And sort the data:

Python
 
tick_df = tick_df.sort("ts", "symbol")


In the next code cell, we'll take a look at the structure of the Dataframe:

Python
 
tick_df.show(10)


The output should look like this:

Plain Text
 
+-------------------+-------+------+
|                 ts|  price|symbol|
+-------------------+-------+------+
|2013-02-08 00:00:00|  45.08|     A|
|2013-02-08 00:00:00|  14.75|   AAL|
|2013-02-08 00:00:00|   78.9|   AAP|
|2013-02-08 00:00:00|67.8542|  AAPL|
|2013-02-08 00:00:00|  36.25|  ABBV|
|2013-02-08 00:00:00|  46.89|   ABC|
|2013-02-08 00:00:00|  34.41|   ABT|
|2013-02-08 00:00:00|  73.31|   ACN|
|2013-02-08 00:00:00|  39.12|  ADBE|
|2013-02-08 00:00:00|   45.7|   ADI|
+-------------------+-------+------+
only showing top 10 rows


We are now ready to write the Dataframe to SingleStore. In the next code cell, we can add the following:

Python
 
%run ./Setup


In the Setup notebook, we need to ensure that the server address and password have been added for our SingleStore Managed Service cluster.
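
If you are recreating the Setup notebook from scratch, it only needs to define the two variables referenced in the next cell. A minimal sketch, with <TO DO> placeholders to be replaced by your own cluster details:

Python
 
# Setup notebook: connection details for the SingleStore Managed Service cluster.
cluster = "<TO DO>"    # the DDL endpoint host shown for your cluster in the portal
password = "<TO DO>"   # the password for the admin user
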

In the next code cell, we'll set some parameters for the SingleStore Spark Connector, as follows:

Python
 
spark.conf.set("spark.datasource.singlestore.ddlEndpoint", cluster)
spark.conf.set("spark.datasource.singlestore.user", "admin")
spark.conf.set("spark.datasource.singlestore.password", password)
spark.conf.set("spark.datasource.singlestore.disablePushdown", "false")


Finally, we are ready to write the Dataframe to SingleStore using the Spark Connector:

Python
 
(tick_df.write
   .format("singlestore")
   .option("loadDataCompression", "LZ4")
   .mode("ignore")
   .save("timeseries_db.tick"))


This will write the Dataframe to the tick table in the timeseries_db database. We can then check that this table was successfully populated in SingleStore.
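
One quick way to do this from the notebook itself is to read the table back through the Spark Connector and count the rows; we'd expect the same 619040 rows we wrote. A minimal sketch:

Python
 
# Read the tick table back from SingleStore and verify the row count.
check_df = (spark.read
    .format("singlestore")
    .load("timeseries_db.tick"))

print(check_df.count())


Alternatively, running SELECT COUNT(*) FROM tick; in the SQL Editor should return the same value.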

Example Queries

Now that we have built our system, we can run some queries. SingleStore supports a range of useful functions for working with Time Series data. Let's look at some examples.

Average Aggregate

The following query illustrates how to compute a simple average aggregate over all Time Series values in the table:

SQL
 
SELECT symbol, AVG(price)
FROM tick
GROUP BY symbol
ORDER BY symbol;


The output should be:

Plain Text
 
+--------+---------------+
| symbol |  AVG(price)   |
+--------+---------------+
| A      | 49.20202542   |
| AAL    | 38.39325226   |
| AAP    | 132.43346307  |
| AAPL   | 109.06669849  |
| ABBV   | 60.86444003   |
  ...      ...


Time Bucketing

Time bucketing can aggregate and group data for different time series by a fixed time interval. SingleStore supports several functions:

  • FIRST: The value associated with the minimum timestamp.
  • LAST: The value associated with the maximum timestamp.
  • TIME_BUCKET: Normalizes time to the nearest bucket start time.

The documentation contains additional details and examples for each of these functions.

For instance, we can use TIME_BUCKET to find the average time series value grouped by five-day intervals, as follows:

SQL
 
SELECT symbol, TIME_BUCKET("5d", ts), AVG(price)
FROM tick
WHERE symbol = "AAPL"
GROUP BY 1, 2
ORDER BY 1, 2;


The output should be:

Plain Text
 
+--------+-----------------------+--------------+
| symbol | TIME_BUCKET("5d", ts) |  AVG(price)  |
+--------+-----------------------+--------------+
| AAPL   | 2013-02-08 00:00:00.0 | 67.75280000  |
| AAPL   | 2013-02-13 00:00:00.0 | 66.36943333  |
| AAPL   | 2013-02-18 00:00:00.0 | 64.48960000  |
| AAPL   | 2013-02-23 00:00:00.0 | 63.63516667  |
| AAPL   | 2013-02-28 00:00:00.0 | 61.51996667  |
  ...      ...                     ...


We can also combine these functions to create candlestick charts that show the high, low, open, and close for a stock over time, bucketed by a five-day window, as follows:

SQL
 
SELECT TIME_BUCKET("5d") AS ts,
     symbol,
     MIN(price) AS low,
     MAX(price) AS high,
     FIRST(price) AS open,
     LAST(price) AS close
FROM tick
WHERE symbol = "AAPL"
GROUP BY 2, 1
ORDER BY 2, 1;


The output should be:

Plain Text
 
+------------+--------+----------+----------+----------+----------+
|     ts     | symbol |   low    |   high   |   open   |  close   |
+------------+--------+----------+----------+----------+----------+
| 2013-02-08 | AAPL   | 66.8428  | 68.5614  | 67.8542  | 66.8428  |
| 2013-02-13 | AAPL   | 65.7371  | 66.7156  | 66.7156  | 65.7371  |
| 2013-02-18 | AAPL   | 63.7228  | 65.7128  | 65.7128  | 64.4014  |
| 2013-02-23 | AAPL   | 63.2571  | 64.1385  | 63.2571  | 63.5099  |
| 2013-02-28 | AAPL   | 60.0071  | 63.0571  | 63.0571  | 60.0071  |
  ...          ...      ...        ...        ...        ...


Smoothing

We can smooth Time Series data using AVG as a windowed aggregate. Here is an example where we are looking at the price and the moving average of the price over the current tick and the three preceding ticks:

SQL
 
SELECT symbol, ts, price, AVG(price)
OVER (ORDER BY ts ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS smoothed_price
FROM tick
WHERE symbol = "AAPL";


The output should be:

Plain Text
 
+--------+-----------------------+----------+----------------+
| symbol |          ts           |  price   | smoothed_price |
+--------+-----------------------+----------+----------------+
| AAPL   | 2013-02-08 00:00:00.0 | 67.8542  | 67.85420000    |
| AAPL   | 2013-02-11 00:00:00.0 | 68.5614  | 68.20780000    |
| AAPL   | 2013-02-12 00:00:00.0 | 66.8428  | 67.75280000    |
| AAPL   | 2013-02-13 00:00:00.0 | 66.7156  | 67.49350000    |
| AAPL   | 2013-02-14 00:00:00.0 | 66.6556  | 67.19385000    |
  ...      ...                     ...        ...
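
If we want to cross-check this windowed aggregate outside the database, the same moving average can be reproduced with a pandas rolling window. A minimal sketch, assuming the <TO DO> connection details are filled in as in the secrets file later in this article:

Python
 
import pandas as pd
import pymysql

# Assumption: reuse the Managed Service connection details from the secrets/Setup steps.
conn = pymysql.connect(host = "<TO DO>", port = 3306, user = "admin",
                       password = "<TO DO>", database = "timeseries_db")

aapl_df = pd.read_sql("SELECT ts, price FROM tick WHERE symbol = 'AAPL' ORDER BY ts", conn)

# Matches AVG(price) OVER (ORDER BY ts ROWS BETWEEN 3 PRECEDING AND CURRENT ROW):
# the current row plus up to three preceding rows.
aapl_df["smoothed_price"] = aapl_df["price"].rolling(window = 4, min_periods = 1).mean()

print(aapl_df.head())
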


AS OF

Finding a table row that is current AS OF a point in time is also a common Time Series requirement. This can be easily achieved using ORDER BY and LIMIT. Here is an example:

SQL
 
SELECT *
FROM tick
WHERE ts <= "2021-10-11 00:00:00"
AND symbol = "AAPL"
ORDER BY ts DESC
LIMIT 1;


The output should be:

Plain Text
 
+-----------------------+--------+----------+
|          ts           | symbol |  price   |
+-----------------------+--------+----------+
| 2018-02-07 00:00:00.0 | AAPL   | 159.5400 |
+-----------------------+--------+----------+


Interpolation

Time Series data may have gaps. We can interpolate missing points. The SingleStore documentation provides an example stored procedure that can be used for this purpose when working with tick data.
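
The stored procedure in the documentation is the idiomatic in-database approach. As a rough client-side illustration (not the documented procedure), we could also pull one symbol's prices into pandas, reindex them onto a regular daily calendar, and linearly interpolate the gaps. A minimal sketch, again with <TO DO> connection details to be filled in:

Python
 
import pandas as pd
import pymysql

# Assumption: reuse the Managed Service connection details from the secrets/Setup steps.
conn = pymysql.connect(host = "<TO DO>", port = 3306, user = "admin",
                       password = "<TO DO>", database = "timeseries_db")

prices = pd.read_sql("SELECT ts, price FROM tick WHERE symbol = 'AAPL' ORDER BY ts",
                     conn, index_col = "ts", parse_dates = ["ts"])["price"]

# Reindex onto a regular daily calendar (introducing NaN for missing days),
# then linearly interpolate the gaps.
filled = prices.asfreq("D").interpolate(method = "linear")

print(filled.head(10))
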

Bonus: Streamlit Visualization

Candlestick charts were mentioned earlier, and it would be great to see these in graphical rather than tabular format. We can do this quite easily with Streamlit. A previous article showed how easily we could connect Streamlit to SingleStore.

Install the Required Software

We need to install the following packages:

Plain Text
 
streamlit
pandas
plotly
pymysql


These can be found in the requirements.txt file on GitHub. Install them as follows:

Shell
 
pip install -r requirements.txt


Example Application

Here is the complete code listing for streamlit_app.py:

Python
 
# streamlit_app.py

import streamlit as st
import pandas as pd
import plotly.graph_objects as go
import pymysql

# Initialize connection.

def init_connection():
    return pymysql.connect(**st.secrets["singlestore"])

conn = init_connection()

symbol = st.sidebar.text_input("Symbol", value = "AAPL", max_chars = None, key = None, type = "default")
num_days = st.sidebar.slider("Number of days", 2, 30, 5)

# Perform query.

data = pd.read_sql("""
SELECT TIME_BUCKET(%s) AS day,
    symbol,
    MIN(price) AS low,
    MAX(price) AS high,
    FIRST(price) AS open,
    LAST(price) AS close
FROM tick
WHERE symbol = %s
GROUP BY 2, 1
ORDER BY 2, 1;
""", conn, params = (str(num_days) + "d", symbol.upper()))

st.subheader(symbol.upper())

fig = go.Figure(data = [go.Candlestick(
    x = data["day"],
    open = data["open"],
    high = data["high"],
    low = data["low"],
    close = data["close"],
    name = symbol,
  )])

fig.update_xaxes(type = "category")
fig.update_layout(height = 700)

st.plotly_chart(fig, use_container_width = True)

st.write(data)


Create Secrets file

Our local Streamlit application will read secrets from a file .streamlit/secrets.toml in our application's root directory. We need to create this file as follows:

Plain Text
 
# .streamlit/secrets.toml

[singlestore]
host = "<TO DO>"
port = 3306
database = "timeseries_db"
user = "admin"
password = "<TO DO>"


The <TO DO> for host and password should be replaced with the values obtained from the SingleStore Managed Service when creating a cluster.

Run the Code

We can run the Streamlit application as follows:

Shell
 
streamlit run streamlit_app.py


The output in a web browser should look like Figure 2.

Figure 2. Streamlit.

On the web page, we can enter a new stock symbol in the text box and use the slider to change the number of days for TIME_BUCKET. Feel free to experiment with the code to suit your needs.

Summary

This article showed that SingleStore is a capable solution for working with Time Series data. Using the power of SQL and built-in functions, we can achieve a great deal. SingleStore has extended its support for Time Series with the addition of FIRST, LAST, and TIME_BUCKET.

Acknowledgments

I thank Dr. John Pickford for advice and pointers to suitable Time Series datasets.

I am also grateful to Part-Time Larry for his excellent video on Streamlit, Building Financial Dashboards with Python, and for the accompanying GitHub code, which inspired the Streamlit Visualization in this article.


