
AWS NoSQL Performance Lab Using Python

This is a coding lab to evaluate the performance of AWS-managed NoSQL databases such as DynamoDB, Cassandra, Redis, and MongoDB using Python.

By David Shilman · Oct. 23, 23 · Analysis

In most financial firms, online transaction processing (OLTP) often relies on static or infrequently updated data, also called reference data. Reference data stores don't always require ACID transaction capabilities; instead, they need fast read queries, often based on simple data access patterns, and an event-driven architecture to keep target systems up to date. NoSQL databases emerge as ideal candidates to meet these requirements, and cloud platforms such as AWS offer managed and highly resilient data ecosystems.

In this article, I am not going to determine which AWS NoSQL database is better: the concept of a "better" database only exists within a specific context. I will share a coding lab to measure the performance of AWS-managed NoSQL databases such as DynamoDB, Cassandra, Redis, and MongoDB.

Performance Testing

I will start by defining the performance test case, which will concurrently insert a JSON payload 200 times and then read it 200 times.

JSON Payload
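
The payload file instrument.json ships with the repo, and its exact contents aren't reproduced here; the sketch below only illustrates the shape a typical instrument reference-data record might take (all field names are hypothetical):

Python
 
# Hypothetical instrument reference-data payload; the real instrument.json in
# the db_performance repo may differ. Field names are illustrative only.
instrument = {
    "isin": "US0378331005",
    "symbol": "AAPL",
    "name": "Apple Inc.",
    "exchange": "NASDAQ",
    "currency": "USD",
    "tick_size": 0.01,
}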

The base/parent class in base_db.py implements the test case logic: it spawns 10 concurrent threads that together create and then read 200 records.

Python
 
# imports (reconstructed; the repo's base_db.py may differ slightly)
import json
import statistics
import threading
import time

import pandas as pd


class BaseDB:

    def __init__(self, file_name='instrument.json', threads=10, records=20):

        # Reconstructed from usage elsewhere in the class: load the JSON
        # payload and initialize the state shared by the worker threads.
        with open(file_name) as f:
            self.json_data = json.load(f)
        self.num_threads = threads
        self.num_records = records
        self.performance_data = {}
      
    def execute(self):

        create_threads = []
        for i in range(self.num_threads):
            thread = threading.Thread(
                target=self.create_records, args=(i,))
            create_threads.append(thread)
            thread.start()

        for thread in create_threads:
            thread.join()

        read_threads = []
        for i in range(self.num_threads):
            thread = threading.Thread(target=self.read_records, args=(i,))
            read_threads.append(thread)
            thread.start()

        for thread in read_threads:
            thread.join()

        self.print_stats()


Each thread executes the write and read routines in create_records and read_records, respectively. Notice that these functions contain no database-specific logic; they only measure the duration of each individual write and read call.

Python
 
def create_records(self, thread_id):

  for i in range(1, self.num_records + 1):
    key = int(thread_id * 100 + i)
    start_time = time.time()
    self.create_record(key)
    end_time = time.time()
    execution_time = end_time - start_time
    self.performance_data[key] = {'Create Time': execution_time}


def read_records(self, thread_id):

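  # Note: each read thread iterates every recorded key, so each record is
  # read once per thread and the last measured read time wins.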
  for key in self.performance_data.keys():
    start_time = time.time()
    self.read_record(key)
    end_time = time.time()
    execution_time = end_time - start_time
    self.performance_data[key]['Read Time'] = execution_time


Once the test case has run, the print_stats function prints execution metrics such as the read/write means and standard deviations (stdev), which indicate database read/write performance and consistency (a smaller stdev implies more consistent execution times).

Python
 
def print_stats(self):

  if len(self.performance_data) > 0:
    # Create a Pandas DataFrame from performance data
    df = pd.DataFrame.from_dict(self.performance_data, orient='index')

    if not df.empty:
      df.sort_index(inplace=True)
      # Calculate mean and standard deviation for each column
      create_mean = statistics.mean(df['Create Time'])
      read_mean = statistics.mean(df['Read Time'])
      create_stdev = statistics.stdev(df['Create Time'])
      read_stdev = statistics.stdev(df['Read Time'])

      print("Performance Data:")
      print(df)
      print(f"Create Time mean: {create_mean}, stdev: {create_stdev}")
      print(f"Read Time mean: {read_mean}, stdev: {read_stdev}")


NoSQL Code

Unlike relational databases, which share standard SQL, each NoSQL database has its own SDK. The child test case class for each NoSQL database only needs to implement a constructor and the create_record/read_record functions, which use the database's proprietary SDK to instantiate a connection and to create/read records in a few lines of code.

DynamoDB Test Case

Python
 
import boto3
from base_db import BaseDB

class DynamoDB (BaseDB):

    def __init__(self, file_name='instrument.json', threads=10, records=20):

        super().__init__(file_name, threads, records)

        dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
        table_name = 'Instruments'
        self.table = dynamodb.Table(table_name)

    def create_record(self, key):

        item = {
            'key': key,
            'data': self.json_data
        }
        self.table.put_item(Item=item)

    def read_record(self, key):

        self.table.get_item(Key={'key': key})


if __name__ == "__main__":

    DynamoDB().execute()


AWS Setup

To execute these performance test cases in an AWS account, follow these steps:

  1. Create an EC2 IAM role with privileges to access the required AWS data services.
  2. Launch an EC2 instance and assign the newly created IAM role.
  3. Create each NoSQL database instance.

IAM Role
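
The role itself is easiest to create in the console, but if you prefer scripting, here is a hedged sketch of a permissions policy for it. The action lists and resource ARN are my assumptions covering the DynamoDB and Keyspaces calls the test code makes; Redis and MongoDB connectivity is handled by VPC networking and database credentials rather than IAM:

Python
 
import json
import boto3

# Sketch of a lab policy; actions/resources are assumptions based on the
# DynamoDB and Keyspaces operations used by the test cases.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["dynamodb:PutItem", "dynamodb:GetItem"],
            "Resource": "arn:aws:dynamodb:us-east-1:*:table/Instruments",
        },
        {
            "Effect": "Allow",
            "Action": ["cassandra:Select", "cassandra:Modify"],
            "Resource": "*",
        },
    ],
}

iam = boto3.client("iam")
iam.create_policy(PolicyName="db-performance-lab",
                  PolicyDocument=json.dumps(policy))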

DynamoDB Table
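
If you prefer scripting the table creation, a boto3 sketch like the following matches the schema the DynamoDB test case above expects: a table named Instruments with a numeric partition key named key (on-demand billing is my assumption):

Python
 
import boto3

# Create the Instruments table used by dynamo_db.py. Table and key names come
# from the test code above; PAY_PER_REQUEST billing is an assumption.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.create_table(
    TableName="Instruments",
    KeySchema=[{"AttributeName": "key", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "key", "AttributeType": "N"}],
    BillingMode="PAY_PER_REQUEST",
)
table.wait_until_exists()  # block until the table is ACTIVE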

Cassandra Keyspace/Table
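
The CassandraDB constructor below connects to the db_performance keyspace. The table DDL isn't shown in this article, so the statements here are a sketch under assumptions: a table named instruments with an int key and a blob data column, matching how the test code encodes the JSON payload to bytes (SingleRegionStrategy is the replication class Amazon Keyspaces uses):

Python
 
# Hypothetical CQL DDL for the Keyspaces schema; the repo's actual table name
# and columns may differ. Run these in the Keyspaces console CQL editor or via
# the session created in the CassandraDB constructor below.
CREATE_KEYSPACE = """
CREATE KEYSPACE IF NOT EXISTS db_performance
    WITH REPLICATION = {'class': 'SingleRegionStrategy'};
"""

CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS db_performance.instruments (
    key int PRIMARY KEY,
    data blob
);
"""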

Please note: the DB host and credentials in the mongo_db.py and redis_db.py modules were hardcoded and have been removed; you will need to update them with the database connection settings for your AWS account. To connect to DynamoDB and Cassandra, I opted to use the temporary Boto3 session credentials assigned to the db_performnace_iam_role IAM role. This code will run in any AWS account in the us-east-1 region without modification.

Python
 
from ssl import SSLContext, PROTOCOL_TLSv1_2, CERT_REQUIRED
import json

import boto3
from cassandra.cluster import Cluster
from cassandra_sigv4.auth import SigV4AuthProvider

from base_db import BaseDB
# Note: DecimalEncoder (a json.JSONEncoder that handles Decimal values) is
# assumed to be defined in the repo, e.g., alongside BaseDB.

class CassandraDB(BaseDB):

    def __init__(self, file_name='instrument.json', threads=10, records=20):

        super().__init__(file_name=file_name, threads=threads, records=records)

        self.json_data = json.dumps(
            self.json_data, cls=DecimalEncoder).encode()

        # Cassandra Keyspaces configuration
        contact_points = ['cassandra.us-east-1.amazonaws.com']
        keyspace_name = 'db_performance'

        ssl_context = SSLContext(PROTOCOL_TLSv1_2)
        ssl_context.load_verify_locations('sf-class2-root.crt')
        ssl_context.verify_mode = CERT_REQUIRED

        boto_session = boto3.Session(region_name="us-east-1")
        auth_provider = SigV4AuthProvider(session=boto_session)

        cluster = Cluster(contact_points, ssl_context=ssl_context, auth_provider=auth_provider,
                          port=9142)
        self.session = cluster.connect(keyspace=keyspace_name)

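The constructor is the only part of CassandraDB shown here; the full class lives in the repo. For completeness, here is a sketch of what the create_record/read_record overrides might look like (the instruments table name is the same assumption as above, and Amazon Keyspaces only accepts writes at LOCAL_QUORUM consistency):

Python
 
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

class CassandraDB(BaseDB):

    # ... constructor as shown above ...

    def create_record(self, key):
        # Keyspaces rejects writes below LOCAL_QUORUM consistency.
        stmt = SimpleStatement(
            "INSERT INTO instruments (key, data) VALUES (%s, %s)",
            consistency_level=ConsistencyLevel.LOCAL_QUORUM)
        self.session.execute(stmt, (key, self.json_data))

    def read_record(self, key):
        self.session.execute(
            "SELECT data FROM instruments WHERE key = %s", (key,))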

Connect to the EC2 instance (I used Session Manager) and run the following shell script to perform these tasks:

  1. Install Git.
  2. Install Python3.
  3. Clone the GitHub db_performance repository.
  4. Create and activate a Python3 virtual environment.
  5. Install third-party libraries/dependencies.
  6. Execute each test case.
Shell
 
sudo yum install git
sudo yum install python3

git clone https://github.com/dshilman/db_performance.git
sudo git pull

cd db_performance
python3 -m venv venv
source ./venv/bin/activate

sudo python3 -m pip install -r requirements.txt

cd code
sudo python3 -m dynamo_db
sudo python3 -m cassandra_db
sudo python3 -m redis_db
sudo python3 -m mongo_db


You should see the following output for the first two test cases:

(venv) sh-5.2$ sudo python3 -m dynamo_db
Performance Data:
     Create Time  Read Time
1       0.336909   0.031491
2       0.056884   0.053334
3       0.085881   0.031385
4       0.084940   0.050059
5       0.169012   0.050044
..           ...        ...
916     0.047431   0.041877
917     0.043795   0.024649
918     0.075325   0.035251
919     0.101007   0.068767
920     0.103432   0.037742

[200 rows x 2 columns]
Create Time mean: 0.0858926808834076, stdev: 0.07714510154026173
Read Time mean: 0.04880355834960937, stdev: 0.028805479258627295
Execution time: 11.499964714050293

(venv) sh-5.2$ sudo python3 -m cassandra_db
Performance Data:
     Create Time  Read Time
1       0.024815   0.005986
2       0.008256   0.006927
3       0.008996   0.009810
4       0.005362   0.005892
5       0.010117   0.010308
..           ...        ...
916     0.006234   0.008147
917     0.011564   0.004347
918     0.007857   0.008329
919     0.007260   0.007370
920     0.004654   0.006049

[200 rows x 2 columns]
Create Time mean: 0.009145524501800537, stdev: 0.005201661271831082
Read Time mean: 0.007248317003250122, stdev: 0.003557610695674452
Execution time: 1.6279327869415283


Test Results


            DynamoDB        Cassandra       MongoDB         Redis
Create      mean: 0.0859    mean: 0.0091    mean: 0.0292    mean: 0.0028
            stdev: 0.0771   stdev: 0.0052   stdev: 0.0764   stdev: 0.0049
Read        mean: 0.0488    mean: 0.0072    mean: 0.0509    mean: 0.0012
            stdev: 0.0288   stdev: 0.0036   stdev: 0.0027   stdev: 0.0016
Exec Time   11.45 sec       1.6279 sec      10.2608 sec     0.3465 sec

My Observations

  • I was blown away by Cassandra’s fast performance. Cassandra’s support for CQL (a SQL-like query language) allows rich access-pattern queries, and AWS Keyspaces offers cross-region replication.
  • I find DynamoDB’s performance disappointing despite the AWS hype around it. You should avoid cross-partition table scans, which means creating an index for each data access pattern. DynamoDB global tables enable cross-region data replication.
  • MongoDB has a very simple SDK, is fun to use, and has the best support for the JSON data type. You can create indexes and run complex queries on nested JSON attributes. However, as new binary data formats emerge, MongoDB may lose its appeal.
  • Redis performance is amazingly fast; however, at the end of the day, it is a key/value cache, even if it supports complex data types. Redis also offers powerful features such as pipelining (batching commands into a single round trip) and server-side scripting to further improve query performance, as the sketch below illustrates.
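
For illustration, here is a minimal redis-py pipelining sketch; the endpoint is a placeholder, just like the connection settings stripped from redis_db.py:

Python
 
import redis

# Minimal pipelining sketch; the host is a placeholder for your ElastiCache
# endpoint (TLS/auth settings depend on your cluster configuration).
r = redis.Redis(host="<your-elasticache-endpoint>", port=6379)

pipe = r.pipeline()
for key in range(1, 11):
    pipe.get(str(key))       # queue ten GETs on the client side
values = pipe.execute()      # send them to the server in one round trip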

Conclusion

Choosing an AWS-managed NoSQL database for your enterprise reference data platform depends on your specific priorities. If performance and cross-region replication are your primary concerns, AWS Keyspaces (Cassandra) stands out as a clear winner. DynamoDB integrates well with other AWS services, such as Lambda and Kinesis, and is therefore a great option for AWS-native or serverless architectures. For applications requiring robust support for JSON data types, MongoDB takes the lead. However, if your focus is on fast lookups or session management for high availability, Redis proves to be an excellent option. Ultimately, the decision should align with your organization's unique requirements.

As always, you can find the code in the GitHub repo linked earlier in this article (see Shell script task #3 above). Feel free to contact me if you need help running this code or with the AWS setup.


Opinions expressed by DZone contributors are their own.
