Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Qubole Security Model for Authorization in AWS Cloud

DZone's Guide to

Qubole Security Model for Authorization in AWS Cloud

Learn about storing data with enterprise-level security and data governance on the cloud using the new Qubole SQL Authorization feature.

Free Resource

Learn best practices according to DataOps. Download the free O'Reilly eBook on building a modern Big Data platform.

Qubole has defined a new security model to improve enterprise-level security and data governance on the cloud. The model integrates the cloud vendor's storage authorization with Hive authorization. This improves usability for both cloud-storage administrators and data administrators (DBA) while eliminating errors that arise from end-user authorization problems.

This is an important milestone on the way to Qubole's goal of building a secure, enterprise-level cloud platform. Qubole is one of the first vendors to add cloud storage-level checks at query compile time, and consequently, offers the most secure platform in the cloud.

History of Database Security Model

In traditional databases (RDBMS and NoSQL), the database had complete control over the catalog, compute, and storage, so administrators and users used the database as the single source of truth for authentication and authorization.

In early versions of Apache Hive, the catalog (Hive Metastore), compute (M/R), and storage (HDFS) were separate. Seamless authorization was not possible because each system had to be administered independently. Apache Hive released SQL Standard-Based Hive Authorization in Hive 0.13, which integrated all the systems. This mechanism runs storage-level checks through the Metastore server and checks the GRANT tables during query compile. As a prerequisite, a user must be registered with Hive and HDFS, and both must have a single view of users and their credentials. Similarly, Apache Ranger and Apache Sentry both provide HDFS-level file permission checks when the same user is trying to access tables through SQL.

Cloud Security Model

In the case of public clouds, data is stored in the cloud, and not in HDFS. Authorization on cloud storage is managed by the cloud vendor. SQL Standard Based Hive Authorization in Hive does not work in this case because there is no unified view of users and credentials.

Consider the typical representation of a Hive table in the cloud depicted the diagram above.

A Hive table consists of files in the cloud storage and catalog information in the Hive Metastore. Consequently, users have to be assigned Roles/Keys for cloud storage as well as USER or ROLE with the Hive database. The authorization modules of the cloud storage and the database are separate, and it is very hard to keep them synchronized. A mapping or coordination is needed to manage the pairs of roles and use them effectively.

Qubole Security Model

Qubole has integrated cloud storage authorization with Hive authorization by introducing a new security model. In QDS, users can choose between two models: L1 or L2. We'll discuss the two models in detail below, using AWS S3 storage as an example.

L1: Cloud Storage Authorization (AWS)

Users use cloud storage permissions to control table access. Users may also combine SQL statements to define access policy and rules, but will mainly use such statements for error-proofing (for example, in the case of a user accidentally dropping a table).

If the organization has two tables:

Image title

... the following is what the administrator needs to define in User1 and User2's IAM policy.

For user1's IAM-role:

...

{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:DeleteObject"
],
"Resource": "arn:aws:s3:::org/datawarehouse/tables/A/*"
},

...

For user2’s IAM-role:

...

{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:DeleteObject"
],
"Resource": "arn:aws:s3:::org/datawarehouse/tables/B/*"
},

...

Some issues with L1:

Image title

L2: Qubole SQL Authorization

L2's design principle is to provide Database Administrators with a unified view of database and storage permissions. No co-ordination with cloud storage administrators is needed.

Qubole Hive implements this design principle by assigning a separate role to the compute nodes (IAM-C). This role provides access to all the data.

Qubole Hive then executes the following checks query compile time:

  • GRANT tables in metastore
  • Storage-level checks with the user's IAM role (IAM-A) as needed for location-related DDLs only.

The compute role and query compile-time checks provide a seamless way for database administrators to define authorization policies without needing to make any changes in cloud storage policies.

Image title

L2: Limitations

The L2 model covers data access only through metadata policies in QDS.

Qubole recommends customers also secure their cluster access outside of Qubole (for example, direct access to clusters or storage) to build a completely secure environment.

Use Case

A big organization has three departments: Sales, Marketing and Finance. These need to have different privileges for access to tables stored in the protected data area.

Image title

Storage Configuration

Three tables are stored in the S3 data warehouse, which is a protected area:

Protected Data Area:
s3a://datalake/datawarehouse/customer/ (customer table)
s3a://datalake/datawarehouse/store_sales/ (store_sales table)
s3a://datalake/datawarehouse/promotion/ (promotion table)

IAM Role Creation

  • IAM-Cluster has full access to s3a://datalake/datawarehouse

  • IAM-AccountDefault has access only to s3a://datalake/defloc/

  • If admin wants to add a temp folder for everyone: s3a://datalake/temp

Then, follow Dual-IAM Role Documentation to set up Dual-IAM role for the account.

QDS Account Setup

All users can belong to the same QDS account. Here's a user list that the administrator needs to grant permission to:

Image title

On the Account Settings page in the QDS UI, put IAM-AccountDefault role credentials.

HiveQL DDL Privilege Setup

After Hive Authorization is enabled for the account:

Set role admin;
Use demo_database;

Create role sales;
Create role marketing;
Create role finance;

Grant select, insert on customer to role sales;
Grant select on customer to role finance;
Grant select on customer to role marketing;

Grant select, insert on store_sales to role sales;
Grant select on store_sales to role finance;
Grant select on store_sales to role marketing;

Grant select on promotion to role sales;
Grant select on promotion to role finance;
Grant select, insert on promotion to role marketing

GRANT role finance to USER john;
GRANT role sales to USER david;
GRANT role marketing to USER mary;

Future Work

Other engines (Spark, Presto, etc.) and applications (such as Zeppelin Notebook) will support the L2 Model over the next couple releases. We'll write blogs to elaborate on the mechanism and provide use cases for these engines to demonstrate our cross-engine security solution for data authorization.

Find the perfect platform for a scalable self-service model to manage Big Data workloads in the Cloud. Download the free O'Reilly eBook to learn more.

Topics:
big data ,aws ,data security

Published at DZone with permission of Abhishek Somani, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}