AWS S3 Strategies for Scalable and Secure Data Lake Storage
Optimize your Amazon S3 data lake with strategic bucket configurations, data layers, encryption, and lifecycle policies for security, efficiency, and cost savings.
Amazon S3 is an object storage service that offers scalability, data availability, security, and performance. S3 is the main component of your data lake, and creating buckets with the right strategy and properties can help you consume the data from the data lake in an efficient and secure way.
This article will guide you through bucket strategies for creating a data lake, as well as other considerations to keep in mind.
1. Choosing the Right Bucket Config
Step 1
Choose the correct region to optimize for latency, minimize costs, or address regulatory requirements.
Step 2
Use multiple S3 buckets. Functionally, there is little difference between a single bucket and several. However, separate S3 buckets can have different lifecycle configurations, versioning settings, access policies, and so on.
Because of these factors, organizations prefer separate buckets: one for storing raw data, where versioning is on (and potentially cross-region replication is turned on) and access policies are more restrictive.
The second bucket carries the lifecycle policy, application-level tagging, Requester Pays settings, and encryption at rest. If the data is PII, you might want more restrictive access at a different level; thus, having multiple buckets makes even more sense. A minimal provisioning sketch follows.
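As a starting point, here is a hedged boto3 sketch that creates separate raw and stage buckets in a chosen region with public access blocked. The bucket names and region are hypothetical examples, not prescribed values:

```python
import boto3

region = "eu-west-1"  # example: chosen for latency, cost, or regulatory needs
s3 = boto3.client("s3", region_name=region)

for bucket in ["acme-dl-raw-example", "acme-dl-stage-example"]:
    # Create the bucket in the chosen region
    s3.create_bucket(
        Bucket=bucket,
        CreateBucketConfiguration={"LocationConstraint": region},
    )
    # Restrictive default: block every form of public access
    s3.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )
```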
2. S3 Recommended Data Layers
The recommendation is to use at least three data layers in your data lakes, and each layer uses a separate S3 bucket. However, some use cases might require an additional S3 bucket and data layer, depending on the data types that you generate and store.
For example, if you store sensitive data, we recommend that you use a landing zone data layer and a separate S3 bucket. The following list describes the three recommended data layers for your data lake:
- Raw data layer: Contains raw data and is the layer in which data is initially ingested. If possible, we recommend that you retain the original file format and turn on versioning in the S3 bucket.
- Stage data layer: Contains intermediate, processed data that is optimized for consumption (for example, raw CSV files converted to Apache Parquet, or other data transformations). An AWS Glue job reads the files from the raw layer and validates the data. The job then stores the data in an Apache Parquet-formatted file, and the metadata is stored in a table in the AWS Glue Data Catalog (a minimal job sketch follows this list).
- Analytics data layer or processed layer: Contains the aggregated data for your specific use cases in a consumption-ready format (for example, Apache Parquet).
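To make the stage layer concrete, below is a hedged sketch of such an AWS Glue job. The bucket paths, database, and table names are hypothetical placeholders, and your validation logic would sit between the read and the write:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw CSV files from the raw layer (hypothetical path)
raw = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://acme-dl-raw-example/supply-chain/orders/"]},
    format="csv",
    format_options={"withHeader": True},
)

# ... validate / transform the data here ...

# Write Parquet to the stage layer and update the Data Catalog
sink = glue_context.getSink(
    connection_type="s3",
    path="s3://acme-dl-stage-example/supply-chain/orders/",
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
)
sink.setCatalogInfo(catalogDatabase="example_db", catalogTableName="orders")
sink.setFormat("glueparquet")
sink.writeFrame(raw)

job.commit()
```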
3. S3 Bucket Naming Conventions
Prefixes
Prefixes should be designed carefully for the required security, data separation, and read and write performance on S3. S3 request-rate limits apply per prefix, so spreading objects across more prefixes increases the aggregate read and write throughput you can achieve.
Dataset Naming
In cases where there are thousands of different datasets, usually derived from SQL tables, we recommend using a prefix (namespace) that creates data groupings aligned with the business to which those data belong.
For example, for a database with hundreds of tables, one way to name the datasets belonging to it is `data-lake.supply-chain.orders`, `data-lake.supply-chain.Table_A`, and so on. At a later time, those datasets can be easily correlated using their namespace (prefix).
Recommendation
Favor a scheme of buckets, each with a very specific purpose, to simplify access control while staying within the hard limit of 1,000 buckets per account.
Form bucket names with a structure like the following, using the rules below.
- Order the elements of the bucket name starting with lower-cardinality variables
- Establish a constant prefix to increase bucket name uniqueness
- Data lake zone (3 possible values: raw, refined, consumption)
- Line of business (lob)
- Data classification (4 different values, such as public, internal, restricted, confidential)
- Environment (4 different values, such as dev, test, prod, discovery)
- AWS Region suffix, used to avoid name conflicts between replicated buckets
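Putting those elements together, here is a tiny, hypothetical helper; all element values are examples, and the length check mirrors S3's 3-63 character bucket-name rule:

```python
def bucket_name(prefix: str, zone: str, lob: str,
                classification: str, environment: str, region: str) -> str:
    # S3 bucket names must be lowercase, DNS-compliant, and 3-63 characters
    name = "-".join([prefix, zone, lob, classification, environment, region])
    assert 3 <= len(name) <= 63, "bucket name must be 3-63 characters"
    return name.lower()

print(bucket_name("acme-dl", "raw", "supply-chain", "internal", "prod", "us-east-1"))
# -> acme-dl-raw-supply-chain-internal-prod-us-east-1
```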
This recommended naming structure applies across the three data lake layers and also separates multiple business units, file formats, and partitions.
You can adapt data partitions according to your organization's requirements, but you should use lowercase key-value pairs (for example, `year=yyyy`, not `yyyy`) so that you can update the catalog with the `MSCK REPAIR TABLE` command.
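For instance, once new `year=...` partitions land in S3, you could refresh the catalog through Athena. The sketch below is hedged: the region, database, table, and results bucket are all example values:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # example region
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE orders",  # 'orders' is a placeholder table
    QueryExecutionContext={"Database": "example_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```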
4. Versioning
Object versioning is also an important feature to consider. You should turn on versioning for your raw layer's S3 buckets, because you want to be able to see previous versions if the data changes. However, versioning might not be necessary for all the layers in your data lake, and retaining multiple versions can cause unnecessary costs.
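Enabling versioning on the raw bucket is a one-call operation in boto3; the bucket name here is a hypothetical example:

```python
import boto3

s3 = boto3.client("s3")
# Turn on versioning so previous object versions remain retrievable
s3.put_bucket_versioning(
    Bucket="acme-dl-raw-example",
    VersioningConfiguration={"Status": "Enabled"},
)
```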
5. Tags
S3 tagging is a very important feature and should be enabled. One example use case is cost allocation. A cost allocation tag is a key-value pair that you associate with an S3 bucket. After you activate cost allocation tags, AWS uses them to organize your resource costs on your cost allocation report. Note that cost allocation tags can only be used to label buckets, not individual objects.
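A hedged sketch of attaching such tags with boto3 follows; the tag keys and values are examples, and the tags must also be activated in the Billing console before they appear on reports:

```python
import boto3

s3 = boto3.client("s3")
# Bucket-level tags that can later be activated as cost allocation tags
s3.put_bucket_tagging(
    Bucket="acme-dl-raw-example",
    Tagging={
        "TagSet": [
            {"Key": "cost-center", "Value": "supply-chain"},
            {"Key": "environment", "Value": "prod"},
        ]
    },
)
```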
6. Encryption
The data in S3 needs to be secured, and you can further restrict access to it using encryption. There are multiple ways to encrypt data in S3 (see the sketch after this list):
- Server-side encryption with Amazon S3 managed keys (SSE-S3)
- Server-side encryption with KMS keys stored in AWS Key Management Service (SSE-KMS)
- Server-side encryption with customer-provided keys (SSE-C)
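As a minimal sketch, the following sets SSE-S3 as a bucket's default encryption at rest; swapping the algorithm to `aws:kms` (plus a `KMSMasterKeyID`) would select SSE-KMS instead. The bucket name is an example:

```python
import boto3

s3 = boto3.client("s3")
# Default encryption: every new object is encrypted with S3-managed keys
s3.put_bucket_encryption(
    Bucket="acme-dl-stage-example",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    },
)
```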
7. S3 Life Cycle Configuration
To manage S3 objects throughout their lifecycle, S3 provides lifecycle configurations. For example, you can move objects created 180 days ago from S3 Standard storage to S3 Standard-IA (Infrequent Access), which gives you the same durability as S3 Standard at a lower storage cost.
Based on the requirements and data classification, you can later move the data to S3 Glacier, which allows you to store the data at an even lower cost. However, with Glacier, you won't be able to retrieve the data immediately. After a certain period of time, if the data is no longer required, you can purge it as well.
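These rules can be expressed in a single lifecycle configuration. The sketch below uses hypothetical day counts and a hypothetical bucket name:

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="acme-dl-raw-example",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tiering-and-expiry",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Transitions": [
                    {"Days": 180, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
                # Purge objects no longer required (example retention period)
                "Expiration": {"Days": 730},
            }
        ]
    },
)
```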
8. Performance Considerations
If you routinely process thousands of S3 requests per second, you might need to consider different approaches to achieving higher request rates. S3 supports at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix, and there is no limit on the number of prefixes in a bucket, so spreading load across multiple prefixes and parallelizing requests raises the aggregate throughput you can reach.
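One hypothetical way to spread load is to shard keys by a hash of the filename; everything below (dataset path, shard count) is illustrative:

```python
import hashlib

def sharded_key(dataset: str, filename: str, shards: int = 16) -> str:
    # Derive a stable shard number from the filename so the same file
    # always maps to the same prefix; more shards = more prefixes, and S3
    # request-rate limits apply per prefix
    shard = int(hashlib.md5(filename.encode()).hexdigest(), 16) % shards
    return f"{dataset}/shard={shard:02d}/{filename}"

# e.g. supply-chain/orders/shard=NN/2024-06-01.parquet (NN varies by filename)
print(sharded_key("supply-chain/orders", "2024-06-01.parquet"))
```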
Conclusion
In conclusion, implementing the above strategies produces a robust and efficient S3 bucket architecture, ensuring seamless scalability and optimization for I/O-intensive applications. This approach enhances performance while ensuring cost-effectiveness, security, and reliability.
With proper configuration and best practices, the architecture can handle dynamic workloads while maintaining high availability and durability. Ultimately, this sets the foundation for a resilient and future-proof storage solution.