Qubole Security Model for Authorization in AWS Cloud
Learn about storing data with enterprise-level security and data governance on the cloud using the new Qubole SQL Authorization feature.
Qubole has defined a new security model to improve enterprise-level security and data governance on the cloud. The model integrates the cloud vendor's storage authorization with Hive authorization. This improves usability for both cloud-storage administrators and data administrators (DBAs) while eliminating errors that arise from end-user authorization problems.
This is an important milestone on the way to Qubole's goal of building a secure, enterprise-level cloud platform. Qubole is one of the first vendors to add cloud storage-level checks at query compile time and, consequently, offers the most secure platform in the cloud.
History of Database Security Model
In traditional databases (RDBMS and NoSQL), the database had complete control over the catalog, compute, and storage, so administrators and users used the database as the single source of truth for authentication and authorization.
In early versions of Apache Hive, the catalog (Hive metastore), compute (MapReduce), and storage (HDFS) were separate. Seamless authorization was not possible because each system had to be administered independently. Apache Hive released SQL standard-based Hive authorization in Hive 0.13, which integrated all the systems. This mechanism runs storage-level checks through the metastore server and checks the grant tables during query compilation. As a prerequisite, a user must be registered with both Hive and HDFS, and both must have a single view of users and their credentials. Similarly, Apache Ranger and Apache Sentry both provide HDFS-level file permission checks when the same user tries to access tables through SQL.
Cloud Security Model
In public clouds, data is stored in cloud storage rather than in HDFS, and authorization on cloud storage is managed by the cloud vendor. Hive's SQL standard-based authorization does not work in this case because there is no unified view of users and credentials.
Consider the typical representation of a Hive table in the cloud, depicted in the diagram above.
A Hive table consists of files in cloud storage and catalog information in the Hive metastore. Consequently, users have to be assigned roles/keys for cloud storage as well as a user or role in the Hive database. The authorization modules of the cloud storage and the database are separate, and it is very hard to keep them synchronized. A mapping or some other form of coordination is needed to manage the pairs of roles and use them effectively.
Qubole Security Model
Qubole has integrated cloud storage authorization with Hive authorization by introducing a new security model. In Qubole Data Service (QDS), users can choose between two models: L1 or L2. We'll discuss the two models in detail below, using AWS S3 storage as an example.
L1: Cloud Storage Authorization (AWS)
Users use cloud storage permissions to control table access. Users may also combine SQL statements to define access policies and rules, but will mainly use such statements for error-proofing (for example, to guard against a user accidentally dropping a table).
If the organization has two tables:
... the following is what the administrator needs to define in the IAM policies of user1 and user2.
For user1's IAM role:
...
{
  "Effect": "Allow",
  "Action": [
    "s3:PutObject",
    "s3:GetObject",
    "s3:DeleteObject"
  ],
  "Resource": "arn:aws:s3:::org/datawarehouse/tables/a/*"
},
...
For user2's IAM role:
...
{
  "Effect": "Allow",
  "Action": [
    "s3:PutObject",
    "s3:GetObject",
    "s3:DeleteObject"
  ],
  "Resource": "arn:aws:s3:::org/datawarehouse/tables/b/*"
},
...
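Because every table under L1 needs its own statement in some user's IAM policy, the statements above follow a repeating pattern. The following is a minimal Python sketch of that pattern, not part of Qubole's tooling: the helper name `table_policy_statement` is hypothetical, and the bucket and prefixes are the ones from the example policies.

```python
import json


def table_policy_statement(bucket: str, table_prefix: str) -> dict:
    """Build one IAM statement allowing object put/get/delete under a table's S3 prefix.

    Hypothetical helper for illustration only; it mirrors the statements shown above.
    """
    return {
        "Effect": "Allow",
        "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
        "Resource": f"arn:aws:s3:::{bucket}/{table_prefix}/*",
    }


# One statement per user/table pair, as in the example above.
user1_statement = table_policy_statement("org", "datawarehouse/tables/a")
user2_statement = table_policy_statement("org", "datawarehouse/tables/b")

print(json.dumps(user1_statement, indent=2))
```

Each new table or user means another statement to generate and attach, which is the kind of storage-side bookkeeping the L1 model leaves to the administrator.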
Some issues with L1:
L2: Qubole SQL Authorization
L2's design principle is to provide database administrators with a unified view of database and storage permissions. No coordination with cloud storage administrators is needed.
Qubole Hive implements this design principle by assigning a separate role to the compute nodes (IAM-C). This role provides access to all the data.
Qubole Hive then executes the following checks at query compile time:
- Checks against the grant tables in the metastore.
- Storage-level checks with the user's IAM role (IAM-A), as needed, for location-related DDLs only.
The compute role and the query compile-time checks provide a seamless way for database administrators to define authorization policies without needing to make any changes to cloud storage policies.
L2: Limitations
The L2 model covers data access only through metadata policies in QDS.
Qubole recommends that customers also secure access paths outside of Qubole (for example, direct access to clusters or storage) to build a completely secure environment.
Use Case
A big organization has three departments: sales, marketing, and finance. Each needs different privileges for access to tables stored in the protected data area.
Storage Configuration
Three tables are stored in the S3 data warehouse, which is a protected area:
Protected data area:
s3a://datalake/datawarehouse/customer/ (customer table)
s3a://datalake/datawarehouse/store_sales/ (store_sales table)
s3a://datalake/datawarehouse/promotion/ (promotion table)
IAM Role Creation
- iam-cluster has full access to s3a://datalake/datawarehouse
- iam-accountdefault has access only to s3a://datalake/defloc/
- If the admin wants to add a temp folder for everyone: s3a://datalake/temp
Then, follow the dual-IAM role documentation to set up the dual-IAM role for the account.
QDS Account Setup
All users can belong to the same QDS account. Here's the list of users to whom the administrator needs to grant permissions:
On the Account Settings page in the QDS UI, enter the iam-accountdefault role credentials.
HiveQL DDL Privilege Setup
After Hive authorization is enabled for the account:
set role admin;
use demo_database;
create role sales;
create role marketing;
create role finance;
grant select, insert on customer to role sales;
grant select on customer to role finance;
grant select on customer to role marketing;
grant select, insert on store_sales to role sales;
grant select on store_sales to role finance;
grant select on store_sales to role marketing;
grant select on promotion to role sales;
grant select on promotion to role finance;
grant select, insert on promotion to role marketing;
grant role finance to user john;
grant role sales to user david;
grant role marketing to user mary;
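The statements above encode a small privilege matrix: each role gets select on all three tables, and insert only on the tables it owns. As a rough illustration, the same matrix can be kept as data and the statements generated from it. The Python sketch below is an assumption-laden example rather than a Qubole feature: the names `PRIVILEGES`, `ROLE_MEMBERS`, and `grant_statements` are hypothetical, while the tables, roles, and users come from the setup above.

```python
# Privilege matrix from the use case: table -> role -> privileges.
PRIVILEGES = {
    "customer": {"sales": ["select", "insert"], "finance": ["select"], "marketing": ["select"]},
    "store_sales": {"sales": ["select", "insert"], "finance": ["select"], "marketing": ["select"]},
    "promotion": {"sales": ["select"], "finance": ["select"], "marketing": ["select", "insert"]},
}

# Role membership from the use case: role -> users.
ROLE_MEMBERS = {"finance": ["john"], "sales": ["david"], "marketing": ["mary"]}


def grant_statements(privileges: dict, role_members: dict) -> list:
    """Render the matrix as HiveQL statements in the style used above (hypothetical helper)."""
    statements = []
    # Create each role that appears anywhere in the matrix.
    roles = sorted({role for per_table in privileges.values() for role in per_table})
    statements += [f"create role {role};" for role in roles]
    # One grant per (table, role) pair.
    for table, per_table in privileges.items():
        for role, privs in per_table.items():
            statements.append(f"grant {', '.join(privs)} on {table} to role {role};")
    # Attach users to their roles.
    for role, users in role_members.items():
        statements += [f"grant role {role} to user {user};" for user in users]
    return statements


statements = grant_statements(PRIVILEGES, ROLE_MEMBERS)
print("\n".join(statements))
```

The generated statements would still need to be run after `set role admin;` and `use demo_database;`, as in the setup above.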
Future Work
Other engines (Spark, Presto, etc.) and applications (such as Zeppelin notebooks) will support the L2 model over the next couple of releases. We'll publish blog posts that elaborate on the mechanism and provide use cases for these engines to demonstrate our cross-engine security solution for data authorization.
Published at DZone with permission of Abhishek Somani, DZone MVB. See the original article here.