Refcard #270

Persistent Container Storage

Containers are great for building applications with ephemeral data. But what if you need your data to persist? Download this Refcard to learn what you need for container storage, discover the benefits of cloud-native storage, and more!

Written By

Alan Hohn
Director, Software Strategy, Lockheed Martin
Table of Contents
Overview • Container Storage Requirements • Cloud Native Storage • Provisioning Container Storage • Options for Container Storage Infrastructure • Application Example • Conclusion
Section 1

Overview

Ephemeral storage is a major selling point of containers. “Start a container from an image. Make whatever changes you want. Then you stop it and start a new one. Look, a whole new file system that resets back to the content of the image!”

In Docker terms, that might look like this:

# docker run -it centos
[root@d42876f95c6a /]# echo "Hello world" > /hello-file
[root@d42876f95c6a /]# exit
exit
# docker run -it centos
[root@a0a93816fcfe /]# cat /hello-file
cat: /hello-file: No such file or directory

When we build applications around containers, this ephemeral storage is incredibly useful. It makes it easy to scale horizontally: we just create multiple instances of containers from the same image, and each one gets its own isolated file system. It makes it easy to upgrade: we just create a new version of the image, and we don’t have to worry about upgrade-in-place or capturing anything from existing container instances. It makes it easy to move from a single system to a cluster, or from on-premises to cloud: we only need to make sure the cluster or cloud can access our image in a registry. And it makes it easy to recover: no matter what our application might have done to its file system on its way to a horrible crash, we just start a new, fresh container instance from the image and it’s like the failure never happened.

So, we don’t want our container engine to stop providing ephemeral, temporary storage. But we do have a problem when we transition from tutorial examples to real applications. Real applications must keep state somewhere. Often, we push our state back into some data store (SQL-based or NoSQL-based). But that just raises the question of where to put the data store application. Is it also in a container? Ideally, the answer is “yes,” so we can take advantage of the same rolling upgrades, redundancy, and failover that we use for our application layer. To run our data store in a container, however, we can no longer be satisfied with just ephemeral, temporary storage. Our container instances need to be able to access persistent storage.

For simple cases where we just run our Docker containers directly, this is easy. We have two main choices: we can identify a directory on the host file system, or we can have Docker manage the storage for us. Here’s how it looks when Docker manages the storage:

# docker volume create data
data
# docker run -it -v data:/data centos
[root@5238393087ae /]# echo "Hello world" > /data/hello-file
[root@5238393087ae /]# exit
exit
# docker run -it -v data:/data centos
[root@e62608823cd0 /]# cat /data/hello-file
Hello world

Docker does not keep the root file system from the first container, but it does keep the “data” volume, and that same volume is mounted in the second container as well, so the storage is persistent.
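The other option mentioned above, a host directory, uses a bind mount instead of a named volume. A minimal sketch, assuming Docker is installed locally (the host path /srv/app-data is just an example, not anything Docker requires):

```shell
# Create a directory on the host; any path works
mkdir -p /srv/app-data

# Bind-mount the host directory into the container at /data;
# files written under /data survive the container and are
# visible on the host at /srv/app-data
docker run -it -v /srv/app-data:/data centos
```

Bind mounts tie the data to a specific host path, which is convenient for development but, as discussed below, does not help once containers can move between nodes.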

This works on a single system, but access to persistent storage gets more complicated in a clustered container environment like Kubernetes or Docker Swarm. If our data store container might get started on any one of hundreds of nodes and might migrate from one node to another at any time, we can’t just rely on one server’s file system to store the data. We need a storage solution that is aware of containers and distributed processing and integrates seamlessly with both.

This Refcard will describe the solution to this need for container-aware storage and will show how getting the storage solution right is a key element of building reliable containerized applications that excel in production.

Section 2

Container Storage Requirements

Before looking at solutions for container storage, we should look at what we want the solution to look like, so we’ll better understand the design decisions for the container storage solutions that are out there.

Redundant

One of the reasons for moving our application into containers and deploying those containers into an orchestration environment is that we can have many physical nodes and can tolerate the failure of some of those nodes. In just the same way, we want our storage solution to be able to tolerate disk and node failure and keep our application running. With storage, the need for redundancy is even more important because we can’t afford to lose any data even if we have some downtime.

Distributed

The need for redundant storage drives us to some kind of distributed solution, at least with respect to disks. But we also want distributed storage for performance. As we scale our container environment up to hundreds or thousands of nodes, we don’t want those nodes to be competing for data on the same few disks. And as we expand our environment to multiple geographic regions to reduce latency for our users, we want to distribute our storage geographically so access to storage is fast from anywhere.

Dynamic

Container architectures undergo continuous change: new versions are built, updates are rolled out incrementally, and applications are added and removed. Test instances are created, put through automated tests, and destroyed. In this architecture, it must be possible to provision and release storage dynamically as well. In fact, provisioning storage should be declarative in the same way that we can declare container instances, services, and network connectivity.

Flexible

Container technology is moving quickly, and we need to be able to introduce new storage solutions or port our application to new environments with different underlying storage infrastructure. Our storage solution needs to be able to support any underlying infrastructure, from a single machine used by a developer for testing purposes, to an on-premises environment, to a public cloud deployment.

Transparent

We need to provide storage to any kind of application, and we need to update our storage solution over time. This means we can’t tie our application to just one storage solution. Instead, storage needs to look native, whether that means looking like a file system or like some existing, well-understood API.

Section 3

Cloud Native Storage

Another way to put it is that we want our container storage solution to be “Cloud Native.” The Cloud Native Computing Foundation (CNCF) has identified three properties for cloud native systems. We can apply these to storage:

  1. Container packaged. Ultimately, our physical or virtual disks exist outside the container, but we want to present storage specifically to containers (so that containers are not sharing storage unless that was specifically requested). Additionally, we may want to containerize the storage control software itself, so we can use the advantages of containerization to manage and update the software that manages storage.
  2. Dynamically managed. For continuous deployment of stateful containers, we need to be able to allocate storage for new containers and clean up storage that is no longer needed, without manual intervention by some administrator.
  3. Microservices oriented. When we define a container, it should explicitly express its dependency on storage. Additionally, the storage control software itself should be based on microservices so it’s easier to scale and to distribute geographically.

The CNCF Storage Working Group is working on a whitepaper covering the CNCF storage landscape. In the meantime, there are some good resources, including a primer on cloud native storage and 8 principles for cloud native storage.

Section 4

Provisioning Container Storage

To answer this container storage need, both Kubernetes and Docker Swarm provide a set of declarative resources for provisioning and attaching storage to containers. These storage capabilities are built on top of some storage infrastructure. Later in this Refcard we’ll look at some choices for container storage, but first let’s look at how each of these two environments allows containers to declare storage dependencies.

Kubernetes

In Kubernetes, containers live in Pods. Each pod includes one or more containers that all share the same network stack and storage. Storage is defined in the volumes section of the pod definition, and volumes are available to be mounted in any container in the pod.

For example, here is a Kubernetes pod definition using an emptyDir volume to share information between two containers. As the name suggests, the emptyDir volume starts out empty, but it stays persistent while the pod is allocated to a node (which means it survives ordinary container crashes but doesn’t survive node failure or pod deletion).

apiVersion: v1
kind: Pod
metadata:
  name: hello-storage
spec:
  restartPolicy: Never
  volumes:
  - name: shared-data
    emptyDir: {}
  containers:
  - name: nginx-container
    image: nginx
    volumeMounts:
    - name: shared-data
      mountPath: /usr/share/nginx/html
  - name: debian-container
    image: debian
    volumeMounts:
    - name: shared-data
      mountPath: /pod-data
    command: ["/bin/sh"]
    args: ["-c", "echo Hello from the debian container > /pod-data/index.html"]

If we save this to a file called two-containers.yaml and deploy it to Kubernetes using kubectl create -f two-containers.yaml, we can browse to the NGINX server using the pod’s IP address and retrieve the created index.html file.
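Those steps might look like the following session (the pod IP placeholder must be replaced with the address reported by kubectl):

```shell
# kubectl create -f two-containers.yaml
# kubectl get pod hello-storage -o wide    # shows the pod's IP address
# curl http://<pod-ip>/
Hello from the debian container
```

The response is the index.html that the debian container wrote into the shared emptyDir volume, served by the nginx container.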

This is an important example, because it shows how Kubernetes allows us to declare a storage dependency in a pod using the volumes section. However, this still isn’t truly permanent storage. If our Kubernetes cluster is running on Amazon Web Services Elastic Compute Cloud (AWS EC2), we can use permanent storage backed by an Elastic Block Store (EBS) volume. We can then destroy and re-create the pod, and the same storage will be provided to the new pod, no matter which node runs the container. However, this still does not provide dynamic storage, as we must separately create the EBS volume before we can create our pod:

apiVersion: v1
kind: Pod
metadata:
  name: webserver
spec:
  containers:
  - image: nginx
    name: nginx
    volumeMounts:
    - mountPath: /usr/share/nginx/html
      name: web-files
  volumes:
  - name: web-files
    awsElasticBlockStore:
      volumeID: <volume-id>
      fsType: ext4
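Creating that EBS volume up front might be done with the AWS CLI. This is a sketch; the zone, size, and volume type here are illustrative choices, and the volume must be in the same availability zone as the node that will run the pod:

```shell
# Create a 10 GiB general-purpose volume; the VolumeId in the
# response is what goes into the pod's volumeID field
aws ec2 create-volume --availability-zone us-east-1d \
    --size 10 --volume-type gp2
```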

To get dynamic storage from Kubernetes, we need two other important concepts. The first is the StorageClass: Kubernetes allows us to create a StorageClass resource that collects information about a storage provider. We combine this with a PersistentVolumeClaim, a resource that asks Kubernetes to request storage dynamically from the StorageClass we’ve chosen. Here’s an example, still using AWS EBS:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: file-store
provisioner: kubernetes.io/aws-ebs
parameters:
  type: io1
  zones: us-east-1d, us-east-1c
  iopsPerGB: "10"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: web-static-files
spec:
  resources:
    requests:
      storage: 8Gi
  storageClassName: file-store
---
apiVersion: v1
kind: Pod
metadata:
  name: webserver
spec:
  containers:
  - image: nginx
    name: nginx
    volumeMounts:
    - mountPath: /usr/share/nginx/html
      name: web-files
  volumes:
  - name: web-files
    persistentVolumeClaim:
      claimName: web-static-files

As you can see, we still use the volumes section of the pod to specify our need for storage, but we use a separate PersistentVolumeClaim to ask Kubernetes to provision the resource for us. In general, the cluster administrator would deploy StorageClasses once per cluster to represent the available underlying storage. Then the application developer would specify PersistentVolumeClaims once per application when storage is first needed. The pod is then deployed and replaced as needed for application upgrades without losing the data in storage. 

Docker Swarm

Docker Swarm leverages the same core volume management capabilities we saw with a single-node Docker volume, but with the ability to provide storage to a container on any node. To provision containers in Docker Swarm, we use the docker stack command together with a Docker Compose file. For example:

version: "3"
services:
  webserver:
    image: nginx
    volumes:
      - web-files:/usr/share/nginx/html
volumes:
  web-files:
    driver: storageos
    driver_opts:
      size: 20
      storageos.feature.replicas: 2


When we use docker stack deploy, Docker Swarm will create the web-files volume if it doesn’t exist. This volume will be retained even if we remove the stack with docker stack rm.
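Deploying and removing the stack might look like this (the stack name webapp is arbitrary; stack volumes are typically prefixed with the stack name):

```shell
# docker stack deploy -c docker-compose.yml webapp
# docker stack rm webapp
# docker volume ls        # the webapp_web-files volume is still present
```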

Overall, we can see how both Kubernetes and Docker Swarm meet our criteria for cloud native storage. They allow containers to declare storage as a dependency, and they dynamically manage storage to make it available to an application on-demand. They are also able to provide this storage to a container no matter where in the cluster the container is running.

Of course, to provide this dynamic, distributed storage, both Kubernetes and Docker Swarm rely on configuring some underlying storage infrastructure. Let’s now look at our options and how we can decide what kind of storage infrastructure we want for our container environment.

Section 5

Options for Container Storage Infrastructure

There are numerous storage options out there for both Kubernetes and Docker Swarm, but we can group them into a few categories. For each category, we’ll look at options within and outside the cloud and discuss how well it meets our overall requirements toward cloud native storage.

Raw Block Device

Outside the cloud:
  • Simplest option.
  • High performing.
  • Allows container direct access to disk.
  • Ties container to node or disk interface (e.g., SCSI, Fibre Channel).

Inside the cloud:
  • Can use underlying cloud resources (e.g., EBS, Azure, OpenStack Cinder).
  • Extra work to make storage usable (e.g., partition, format).

Network Attached Storage

Outside the cloud:
  • Uses well-understood protocols such as NFS and iSCSI.
  • Can integrate with existing on-premises NAS.
  • Lack of data locality can hurt performance.

Inside the cloud:
  • Can leverage in-cloud storage providers (managed NFS).
  • Storage is easy to access from inside and outside containers.
  • Not optimized for the dynamic addition and removal of container volumes.

Distributed File Systems

Outside the cloud:
  • Can operate on the same container infrastructure.
  • Opportunity for excellent data locality.
  • Storage controller software can run within container environments.

Inside the cloud:
  • Can operate on the same container infrastructure.
  • Opportunity for excellent data locality, even in geo-distributed situations.
  • Requires some additional provisioning for storage and controller software.

Object Stores

Outside the cloud:
  • Suitable for file transfer but not read-write random access.
  • Typically requires additional underlying storage infrastructure.

Inside the cloud:
  • Can use underlying cloud resources (e.g., S3, OpenStack Swift).
  • Suitable for file transfer but not read-write random access.

Software Defined Storage

Outside the cloud:
  • Able to specify and receive performance guarantees.
  • Can insert value-added services such as data de-duplication and snapshots.

Inside the cloud:
  • Able to specify and receive performance guarantees.
  • Aligns with cloud native storage.
  • Can insert value-added services such as data de-duplication and snapshots.

The final category, Software Defined Storage, is not a new concept, but it is becoming a more popular term. It continues the trend toward storage abstraction that started with logical storage in the Redundant Array of Independent Disks (RAID) and Logical Volume Manager (LVM) and was extended with virtualized storage through Storage Area Networks (SANs) and distributed file systems such as Ceph and Gluster. Software Defined Storage adds an abstraction layer that can incorporate de-duplication, built-in backup and archiving, change auditing to establish data provenance, and snapshot capabilities. For persistent container storage, it operates much like a distributed file system, with a single API to provision and manage storage and the ability to localize data at the point of use, but it includes other capabilities that may be desirable from a broader storage management standpoint.

While there’s no one “right answer” to the type of persistent storage we should deploy to our container environment, it is important to refer to the list of required elements to make sure we wind up with declarative, performant storage, even as our container environment scales to multiple geographic regions and a large number of nodes.

Section 6

Application Example

To complete our look at persistent container storage, let’s deploy a Spring Boot application that uses PostgreSQL. We’ll focus on Kubernetes for this example, but the same ideas apply in Docker Swarm.

Secrets and Configuration

So far, our persistent storage discussion has been about files and volumes, and for good reason, since most applications see storage in those terms. However, sometimes we need to provide small pieces of information to our containerized applications, including configuration files, database credentials, and environment variables. For these cases, we want the ability to maintain this information securely and keep it during application rollover or updates, but we don’t want to bundle it with the application because it might be specific to the environment or information that must be kept secret. 

For small files and variables, it seems like a waste to provide a whole storage volume (in addition to making it more complicated to update from outside the container). Instead, both Kubernetes and Docker Swarm have explicit support for storing configuration data and secrets and providing them to containers. 

Our Spring Boot application needs database connection information for the PostgreSQL database. Knowing that we need different information in development and production, we will use variables in application.properties: 

spring.datasource.url=${DB_URL}
spring.datasource.username=${DB_USER}
spring.datasource.password=${DB_PASS}
spring.datasource.driver-class-name=${DB_DRIVER}

To provide this information to our container, we will first declare a ConfigMap and a Secret. 

kind: ConfigMap
apiVersion: v1
metadata:
  name: myapp-config
  namespace: myapp
data:
  database.url: jdbc:postgresql://mydb.myapp.pod/myapp
  database.username: dbuser
  database.driver: org.postgresql.Driver


There are a couple of things to note here. First, we use a namespace to keep the resources for “myapp” separate from other applications. Second, we assume we have deployed Kubernetes DNS to detect our PostgreSQL pod and make a DNS entry for it. Finally, note that we do not include the password: that belongs in a Secret, which Kubernetes handles more carefully (for example, Secrets can be encrypted at rest).

For our secret, we’ll use the Kubernetes command line because we just have a single value to store:

# kubectl create secret generic myapp-secret --namespace=myapp \
  --from-literal=password='correcthorsebatterystaple'


Application Deployment

We can now use our ConfigMap and Secret in the Pod definition for our application: 

kind: Pod
apiVersion: v1
metadata:
  name: webapp
  namespace: myapp
spec:
  containers:
  - image: registry.mycompany.com/myapp
    name: myapp
    env:
    - name: DB_URL
      valueFrom:
        configMapKeyRef:
          name: myapp-config
          key: database.url
    - name: DB_USER
      valueFrom:
        configMapKeyRef:
          name: myapp-config
          key: database.username
    - name: DB_DRIVER
      valueFrom:
        configMapKeyRef:
          name: myapp-config
          key: database.driver
    - name: DB_PASS
      valueFrom:
        secretKeyRef:
          name: myapp-secret
          key: password


While this example uses environment variables, ConfigMaps and Secrets can be treated as volumes in Kubernetes, so we could also include the application.properties file or a Spring Boot YAML configuration file in a Secret, and then have Kubernetes place it in the file system of our container so our application could load it. This would allow us to avoid editing the Pod definition to add new properties.  
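As a sketch of that volume approach, a ConfigMap holding a properties file could be mounted into the container’s file system like this (the volume name and mount path here are illustrative choices, not anything Kubernetes requires):

```yaml
kind: Pod
apiVersion: v1
metadata:
  name: webapp
  namespace: myapp
spec:
  containers:
  - image: registry.mycompany.com/myapp
    name: myapp
    volumeMounts:
    - name: config-volume
      # Each key in the ConfigMap appears as a file under /config
      mountPath: /config
  volumes:
  - name: config-volume
    configMap:
      name: myapp-config
```

Updating the ConfigMap then changes the mounted files without any edit to the Pod definition.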

PostgreSQL Database

As the last step in our application example, let’s combine secrets and volumes to show how we might provide persistent storage for the PostgreSQL database that supports our Spring Boot application. 

We are going to provide our PostgreSQL container with two separate persistent volumes. The first will be used for the PostgreSQL data directory, and the second for backups and a write-ahead log (WAL). The WAL volume would allow us to configure a PostgreSQL standby server, though actually configuring PostgreSQL in this active/passive failover configuration is outside the scope of this Refcard. 

We’ll provide storage to our database using StorageOS, a Software Defined Storage solution. Our Kubernetes cluster will use the StorageOS API to provision the requested storage. We start by creating a Secret to hold the information needed to connect to the StorageOS API. This secret has three values, so we’ll use a YAML definition. To do this, Kubernetes requires us to base-64 encode each value. Then we can declare the Secret:

kind: Secret
apiVersion: v1
metadata:
  namespace: default
  name: storageos-api
type: "kubernetes.io/storageos"
data:
  apiAddress: <base 64 encoded URL>
  apiUsername: <base 64 encoded username>
  apiPassword: <base 64 encoded password>
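Each base-64 value can be generated on the command line; for example, encoding a placeholder username (dbuser is illustrative):

```shell
# Use -n so the trailing newline is not included in the encoded value
echo -n 'dbuser' | base64    # prints ZGJ1c2Vy
```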


Next, we configure the storageClass using that secret:

kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: fast
provisioner: kubernetes.io/storageos
parameters:
  pool: default
  fsType: ext4
  adminSecretNamespace: default
  adminSecretName: storageos-api


Note that the API Secret is kept in the default namespace, since the storage class will be shared across many applications (StorageClass resources themselves are cluster-wide rather than namespaced). Finally, we’re ready to create our two PersistentVolumeClaims and our database Pod:

 

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pgsql-data
  namespace: myapp
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: fast
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pgsql-backup
  namespace: myapp
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 800Gi
  storageClassName: fast
---
kind: Pod
apiVersion: v1
metadata:
  name: mydb
  namespace: myapp
spec:
  containers:
  - image: postgres:9.4
    name: mydb
    volumeMounts:
    - mountPath: /var/lib/pgsql/data
      name: data
    - mountPath: /backup
      name: backup
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: pgsql-data
  - name: backup
    persistentVolumeClaim:
      claimName: pgsql-backup

We allocate 10GB for pgsql-data. At the moment (Kubernetes 1.9), the ability to expand a persistent volume claim is in alpha and only supported for a few storage classes, so this volume needs to be large enough to hold the expected full size of our database. Also, note that while we use a single storage class here, it might be beneficial to provision multiple storage classes with different policies, so that large volumes where speed is less critical, such as backups, can use a cheaper storage option.
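For instance, a second class for bulk storage might point at a cheaper pool. This is a hypothetical sketch: the class name bulk and the pool name archive are assumptions for illustration, not values from StorageOS documentation:

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: bulk
provisioner: kubernetes.io/storageos
parameters:
  pool: archive            # hypothetical pool backed by cheaper disks
  fsType: ext4
  adminSecretNamespace: default
  adminSecretName: storageos-api
```

The pgsql-backup claim could then specify storageClassName: bulk while pgsql-data stays on fast.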

Section 7

Conclusion

In this Refcard, we’ve looked at the need to provision persistent storage for our containers so we can deploy the stateful parts of our application to our container environment. While the use of a distributed container environment like Kubernetes or Docker Swarm made this more complex, it also created the opportunity for distributed storage, data locality, redundancy, and the ability to deploy our storage controller components directly on the container environment. 

In choosing persistent storage infrastructure, we have options that range from basic raw block devices, where we have a simple implementation but limited scalability and redundancy, to sophisticated Software Defined Storage solutions. Software Defined Storage solutions not only guarantee performance, but also allow workloads to be deployed in a platform agnostic manner so the same solution is used for both on-premises and cloud environments. However, whatever solution we choose, we can arrive at a storage solution that our containers can provision dynamically using declarative logic and standard APIs. This gives us the ability to provide our containerized applications with the storage they need while keeping them independent of the underlying infrastructure, which allows us to deploy a broader set of use cases and more complex workloads to cloud native and containerized platforms.
