The IAM Conundrum
AWS's IAM solution is supposed to make identity authentication easier, but it may be worth looking around to see whether other solutions meet your needs better.
Amazon Web Services Identity and Access Management (AWS IAM) service is designed to track system users and information regarding how they get authenticated. It is commonly used to protect objects, such as data files, in Amazon’s Simple Storage Service (S3), which, in turn, forms the most important layer of an S3 Data Lake.
With various levels of security layers and different departments responsible for various types of data, there are a number of intricacies and challenges involved in managing the security and governance of AWS IAM.
This article looks at the various security layers, how IAM works, if IAM is slowing down data projects, and the access control requirements that are needed in data lakes.
Security Layers
Before we look at the difficulties of managing access to resources in S3, we are going to look at the broader picture of defining a state-of-the-art security architecture. There are more layers than just authentication and authorization, both of which are covered by IAM. You also need, for example, an airtight perimeter control, leading us to the common security stack shown in the diagram and explained in the next sections.
Perimeter Security
Usually, the technology for this first layer is network-based and involves tools such as firewalls, virtual networks, and intrusion detection services. Drawing on an airport analogy, this is the TSA agent controlling access when you enter an airport. It is literally the first line of defense.
Authentication
Keeping with the above example, next is the border control agent asking you for authentication, which is your passport. For computer systems, authentication is the equivalent: like presenting a physical passport, it means proving that you are who you claim to be. Here you find services such as Kerberos, or token-based systems such as JWT or OpenID.
Authorization
Once you have been granted access, the services need to know what you are allowed to do, which is part of the authorization process. Typically this is a list of permissions or policies that define your role or business purpose. These are stored in directories or other enterprise-wide information services (for instance, Active Directory or OpenLDAP) and commonly define group memberships. For example, user Jane Doe belongs to the group Marketing US East, which means she needs to be able to access all relevant data to fulfill her duties but may be restricted from looking at the data meant for US West Sales.
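As a minimal sketch of the group-based model described above, the following Python snippet resolves a user's groups to the datasets they may read. The directory contents, group names, and dataset names are hypothetical stand-ins for what would normally live in Active Directory or OpenLDAP.

```python
# Hypothetical directory data; a real deployment would query Active Directory or OpenLDAP.
GROUP_MEMBERS = {
    "marketing-us-east": {"jane.doe"},
    "sales-us-west": {"john.roe"},
}

# Hypothetical mapping from a group to the datasets that group may read.
GROUP_DATASETS = {
    "marketing-us-east": {"campaigns_us_east", "leads_us_east"},
    "sales-us-west": {"orders_us_west"},
}


def groups_of(user: str) -> set[str]:
    """Return every group that lists the user as a member."""
    return {group for group, members in GROUP_MEMBERS.items() if user in members}


def can_read(user: str, dataset: str) -> bool:
    """Allow the read only if one of the user's groups is granted the dataset."""
    return any(dataset in GROUP_DATASETS.get(group, set()) for group in groups_of(user))


print(can_read("jane.doe", "campaigns_us_east"))  # True
print(can_read("jane.doe", "orders_us_west"))     # False
```

The point of the indirection is that permissions are granted to groups, not to people, so moving Jane to another team only changes her group membership.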
Service Access
Next up is the actual storage layer, which comprises the services storing and providing the business data. The storage systems have their own idiosyncrasies and proprietary permission features. This is where you deal with file ACLs or database grants.
These layers certainly make sense and mirror real-world processes nicely. But what is not as obvious is the ownership of these security levels. It can get messy as business units (lines of business, or LoB) want to control who has access to what data, while IT departments traditionally are responsible for infrastructure, enterprise-wide services, such as the network, the directory, and storage. This is where, in practice, the data lake architectures fall apart and turn into data swamps if not managed properly.
AWS IAM in a Nutshell
AWS offers a free service with every account, referred to as IAM, short for Identity and Access Management, which provides common entities, such as users, groups and roles, to the account owner. These are used to enable access to other AWS services based on permissions, commonly specified as policies in JSON format. For example, the following policy grants all DynamoDB actions on a specific table within the DynamoDB service:
{
  "Version": "2012-10-17",
  "Statement": {
    "Effect": "Allow",
    "Action": "dynamodb:*",
    "Resource": "arn:aws:dynamodb:us-east-2:123456789012:table/Books"
  }
}
These policies can then be attached to the above entities (referred to as identity-based policies), or directly to the resource of a service (if supported by that service, referred to as resource-based policies). A common example in the context of a data lake in AWS S3 is to
- create IAM roles that have permission to access the S3 service endpoints (defined as a set of actions),
- define S3 bucket policies to allow or deny access to the contained objects.
The latter is especially needed in shared AWS accounts, because an identity-based policy that grants S3 actions on a broad resource (for example, a Resource of "*") applies to every bucket the account owns unless a bucket policy explicitly denies it.
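To make the two attachment points listed above concrete, here is a hedged sketch that builds an identity-based policy for a role and a companion bucket policy as Python dictionaries, then serializes them to JSON. The bucket name, role name, and "marketing/" prefix are hypothetical. The policy elements and the aws:PrincipalArn condition key are part of the standard IAM policy grammar, but treat the overall design as an illustration, not a recommended baseline.

```python
import json

BUCKET = "example-data-lake"   # hypothetical bucket name
ACCOUNT_ID = "123456789012"    # placeholder account ID
ROLE_ARN = f"arn:aws:iam::{ACCOUNT_ID}:role/marketing-analyst"  # hypothetical role

# Identity-based policy: attached to the role, states what the role may do.
role_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
    }],
}

# Resource-based policy: attached to the bucket, denies reads under one prefix
# to every principal except the role above.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{BUCKET}/marketing/*",
        "Condition": {"ArnNotEquals": {"aws:PrincipalArn": ROLE_ARN}},
    }],
}

print(json.dumps(role_policy, indent=2))
print(json.dumps(bucket_policy, indent=2))
```

Note that the same access decision is now spread over two documents owned by potentially different teams, which is exactly where the coordination problems start.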
Now, if you are still with me, it should be obvious that IAM-based security is not trivial: you have to handle proprietary JSON attributes and lock down or open up access to resources in multiple places, which invites mistakes that can expose sensitive data.
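One reason mistakes creep in is that the effective permission is the combination of every applicable statement. The sketch below is a deliberately simplified model of that combination within a single account: an explicit Deny always wins, otherwise an Allow is required, and the default is deny. It ignores principals, conditions, permission boundaries, and cross-account rules, and the is_allowed helper is hypothetical, not an AWS API.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM accepts either a single string or a list for Action and Resource."""
    return [value] if isinstance(value, str) else list(value)


def _matches(statement: dict, action: str, resource: str) -> bool:
    """True if the statement's Action and Resource patterns cover the request."""
    # Action names are case-insensitive in IAM; resources are compared as written.
    # fnmatch also gives [..] a special meaning, which IAM does not; fine for a sketch.
    action_hit = any(
        fnmatchcase(action.lower(), pattern.lower())
        for pattern in _as_list(statement["Action"])
    )
    resource_hit = any(
        fnmatchcase(resource, pattern) for pattern in _as_list(statement["Resource"])
    )
    return action_hit and resource_hit


def is_allowed(statements: list[dict], action: str, resource: str) -> bool:
    """Simplified evaluation: explicit Deny beats Allow, no match means deny."""
    hits = [s for s in statements if _matches(s, action, resource)]
    if any(s["Effect"] == "Deny" for s in hits):
        return False
    return any(s["Effect"] == "Allow" for s in hits)


# Statements gathered from a role policy and a bucket policy (hypothetical names).
statements = [
    {"Effect": "Allow", "Action": "s3:*",
     "Resource": "arn:aws:s3:::example-data-lake/*"},
    {"Effect": "Deny", "Action": "s3:GetObject",
     "Resource": "arn:aws:s3:::example-data-lake/finance/*"},
]

print(is_allowed(statements, "s3:GetObject",
                 "arn:aws:s3:::example-data-lake/marketing/q1.csv"))  # True
print(is_allowed(statements, "s3:GetObject",
                 "arn:aws:s3:::example-data-lake/finance/q1.csv"))    # False
```

Even in this reduced form, the answer for a given object depends on statements that may live in different policies, so removing or loosening one of them can silently change the outcome.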
Is IAM Slowing Down Data Projects?
If we go back to IAM, and AWS as a whole, the issue is the same: IAM is overloading IT and business interests just like the pre-data lake system architectures did. You need to:
- Have someone capable of writing highly complex JSON policies (they can grow to multiple pages and run up against the maximum allowed length of 20 KB),
- Ensure you include all business units as either users, groups or roles that should have access,
- Assign the policies to the right entities, and
- Verify access is working as expected.
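Parts of this checklist can be automated before a policy is ever deployed. The following sketch checks a policy's serialized size against the 20 KB limit mentioned above and reports expected principals that no statement mentions. The helper names, role names, and bucket are hypothetical, and measuring the compact JSON encoding is an assumption about how the size is counted.

```python
import json

MAX_POLICY_BYTES = 20 * 1024  # the 20 KB limit cited in the text


def policy_size_bytes(policy: dict) -> int:
    """Size of the policy when serialized as compact UTF-8 JSON."""
    return len(json.dumps(policy, separators=(",", ":")).encode("utf-8"))


def missing_principals(policy: dict, expected_arns: set[str]) -> set[str]:
    """Return the expected principal ARNs that no statement mentions."""
    mentioned = set()
    for statement in policy["Statement"]:
        principal = statement.get("Principal", {})
        if isinstance(principal, dict):  # skip the "*" shorthand
            aws = principal.get("AWS", [])
            mentioned.update([aws] if isinstance(aws, str) else aws)
    return expected_arns - mentioned


def role_arn(unit: str) -> str:
    """Hypothetical naming scheme: one role per business unit."""
    return f"arn:aws:iam::123456789012:role/{unit}"


policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": role_arn(unit)},
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::example-data-lake/{unit}/*",
        }
        for unit in ["marketing", "sales"]
    ],
}

expected = {role_arn(unit) for unit in ["marketing", "sales", "finance"]}
print(policy_size_bytes(policy) <= MAX_POLICY_BYTES)  # True
print(missing_principals(policy, expected))           # the finance role is missing
```

A check like this does not replace verifying real access, but it catches the forgotten business unit before a ticket has to be reopened.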
This spans many IT and business roles, which usually do not work hand in hand, but through some mediation system (for example, a ticket management system such as JIRA or ServiceNow). Again, errors are likely, and the process is cumbersome and slow, which hinders innovation and fast iteration on data projects.
Lastly, AWS IAM only has very limited support for fine-grained access control. Only recently did AWS Glue add the ability to filter rows or grant access on a per-column level. This does not include any masking, tokenization or other advanced features, such as differential privacy. Nor does Glue extend to all AWS services yet, let alone non-AWS ones.
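To illustrate what row filtering and per-column access mean in practice, here is a small, self-contained sketch. It is not how Glue implements these features. The role name, rule structure, and read_table helper are all hypothetical.

```python
# Hypothetical per-role rules: which columns are visible and which rows pass the filter.
ROLE_RULES = {
    "marketing-us-east": {
        "columns": ["customer_id", "region", "campaign"],
        "row_filter": lambda row: row["region"] == "us-east",
    },
}


def read_table(rows: list[dict], role: str) -> list[dict]:
    """Apply the role's row filter first, then keep only the permitted columns."""
    rules = ROLE_RULES.get(role)
    if rules is None:
        return []  # unknown roles see nothing
    return [
        {col: row[col] for col in rules["columns"]}
        for row in rows
        if rules["row_filter"](row)
    ]


rows = [
    {"customer_id": 1, "region": "us-east", "campaign": "spring", "email": "a@example.com"},
    {"customer_id": 2, "region": "us-west", "campaign": "spring", "email": "b@example.com"},
]

print(read_table(rows, "marketing-us-east"))
# [{'customer_id': 1, 'region': 'us-east', 'campaign': 'spring'}]
```

Object-level policies cannot express this kind of rule, because they grant or deny a whole file and never look inside it.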
Requirements of Access Control in Data Lakes
The limitations and challenges of IAM as Access Control are becoming even more obvious as privacy regulations (like GDPR and CCPA) are increasingly demanding a lot more data security, like the “right to be forgotten” and so on. Based on conversations with many IAM supporters and detractors, the following requirements should be considered for anyone managing an enterprise data lake.
- IAM Integration - Any new access control or policy system has to integrate with the existing IAM service. However, since S3 bucket policies are single JSON structures containing all access rules for a given bucket, they are error-prone to manage, so there is a need for an easier-to-use access control system in which granular access control can be managed safely.
- Distributed Access - The access control system mentioned above should let you delegate the permissions easily to the data owners so that they can control access to their data. This removes the need for time-consuming, cross-functional processes.
- Single Unified Layer - Ideally, you want this access control mechanism to act as a single unified security layer that covers not only authentication and authorization, but also audit event logging.
- Unified Governance - No access control system is complete unless it offers you the capability to see who is using what resources in your data lake.
- Privacy Compliance - To keep your company from making the headlines for privacy breaches and to avoid the steep fines imposed under regulations such as GDPR and CCPA, it is important that data is discovered automatically and tagged correctly, and that the right information is presented to the right people without making numerous copies of the data.
- Sensitive Data Protection - Not only is it important that the right information is made accessible to the right roles within the company, it is also important that sensitive data be obfuscated so that it is protected from insiders as well as outsiders.
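As a rough illustration of the obfuscation requirement, the sketch below shows two common techniques: keyed tokenization, which replaces a value with a stable token so joins still work, and partial masking. The key, field names, and helper functions are hypothetical, and a production system would keep the key in a secrets manager, not in code.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-only-key"  # hypothetical; never hard-code a real key


def tokenize(value: str) -> str:
    """Deterministic keyed token: the same input always yields the same token."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def mask_email(email: str) -> str:
    """Keep the first character and the domain, hide the rest of the local part."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"


record = {"customer_id": "C-1001", "email": "jane.doe@example.com"}
protected = {
    "customer_id": tokenize(record["customer_id"]),
    "email": mask_email(record["email"]),
}

print(protected["email"])  # j***@example.com
```

Because the token is derived with a secret key, an insider who sees only the protected record cannot reverse it by hashing guessed IDs without also holding the key.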
Take Away
As your data lake continues to grow with many more types of workloads and users, it is imperative that you evaluate alternatives to IAM that may be better suited to your needs. With enterprises looking at hybrid cloud solutions to avoid service-related risks, it becomes even more important to unify data governance across those environments, and IAM access control may not be the panacea for that.
The good news is that many startups are working to solve this problem and many alternatives are now available.