architects,bigdata,tool,tools & methods,big data
Amazon Web Services Redshift – Data Warehouse in the Cloud
IntroductionAmazon Web Services has made publicly available its fully managed, petabyte-scale data warehouse cloud service in February, 2013. It promises a high performance, secure, easily scalable data warehouse solution that costs 1/10th of a traditional data warehouse (less than 1,000 USD/TB/year, according to the AWS Introduction to Redshift presentation: http://aws.amazon.com/redshift/) , it is compatible with the traditional BI tools and ready to be running within minutes. As of writing this article the service is available in US East region only but supposed to be rolled out to other regions, too. The service is manageable via the regular AWS tools: AWS management console, command line tools (aws commands based on python) and API based on HTTP requests/responses.
Under the hoodUnder the hood, AWS Redshift is based on PostgreSQL 8.0.2. The architecture consist of 1 leader node - a node which is responsible for managing the communications with the clients, developing the execution plan and then distributing the compiled code to the compute nodes-, and 1 or more compute nodes that are exetung the code and then sending back the result to the leader node for aggregation. The compute nodes can have either 2-cores, 15GB RAM and 2 TB storage node (dubbed as XL node) or a 16-cores, 120 GB RAM and 16 TB storage node (dubbed as 8XL node). More details about the Redshift archtecture can be found at http://docs.aws.amazon.com/redshift/latest/dg/c_internal_arch_system_operation.html
Launching a clusterThe easiest way to launch a cluster is via AWS console. We need to define the basic attributes like cluster identifier, database name, database port, master username and password: Then we need to select the node type (XL or 8XL) and the number of compute nodes. A cluster can be single or multi-node, the minimum config is a one XL node cluster, while the maximum config is sixteen 8XL nodes - you can do the math in terms of cores, memory and storage. Then we can configure additional parameters (like database encyption or security groups) We can then review the configuration and are ready to launch the service: The status will be first "creating" for a while then it will become "available". This is when the JDBC url will become known and can be used for configuring the clients. In order to make the service accessible, we need to configure the security options (either a security group - if Redshift is going to be accessed from EC2 - or a CIDR/IP (Classless- Inter-Domain Routing IP range) - if Redshift is to be accessed from public Internet. The system will automatically recognise the IP address of the client connected to AWS console. And that is it! From then on the client can be connected to the Redshift cluster. We used SQLWorkbench to test the service, the same way as suggested by AWS Redshift documentation. It is a Java based open source SQL tool. The connection parameters are the standard JDBC attributes: The PostgreSQL version can be checked using
select version(); version PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.546We tested the service with Amazon stock prices downloaded from Yahoo Finance. The content has been uploaded to a S3 bucket called stockprice (S3://stockprice/amzn.csv). We had to make it accessible for everyone to read. (open/download). Then we could create the appropriate table using standard SQL command:
CREATE TABLE stockprice ( stockdate date not null, stockopen decimal(6,2), stockhigh decimal(6,2), stocklow decimal(6,2), stockclose decimal(6,2), stockvolume integer, stockadjclose decimal(6,2) ); Table 'stockprice' created Execution time: 0.15s desc stockprice COLUMN_NAME DATA_TYPE PK NULLABLE DEFAULT AUTOINCREMENT REMARKS POSITION stockdate date NO YES NO 1 stockopen numeric(6,2) NO YES NO 2 stockhigh numeric(6,2) NO YES NO 3 stocklow numeric(6,2) NO YES NO 4 stockclose numeric(6,2) NO YES NO 5 stockvolume integer NO YES NO 6 stockadjclose numeric(6,2) NO YES NO 7To load the data into stockprice table, we had to use copy command with the S3 source file (it could also be an Amazon DynamoDB source).
copy stockprice from 's3://stockprices/amzn.csv' CREDENTIALS 'aws_access_key_id=XXXXXXX;aws_secret_access_key=XXXXXXX' delimiter ',';If there is any error during the load operation, it can be verified by running a select statement on the stl_load_errors table. (e.g. incorrect data format). And then we can run our SQL statements to analyse the data.
select * from stockprice order by stockadjclose desc limit 100; stockdate stockopen stockhigh stocklow stockclose stockvolume stockadjclose 2013-01-25 275.00 284.72 274.40 283.99 4968100 283.99 2013-01-28 283.78 284.48 274.40 276.04 4321400 276.04 2013-03-05 274.00 276.68 269.99 275.59 3686200 275.59 2013-03-13 275.24 276.50 272.64 275.10 1884200 275.10 2013-03-08 275.00 275.44 271.50 274.19 1879800 274.19 2013-03-12 271.00 277.40 270.36 274.13 3246200 274.13 2013-03-07 274.10 274.80 271.85 273.88 1939000 273.88 2013-03-06 275.76 276.49 271.83 273.79 2050700 273.79 2013-01-24 269.37 276.65 269.37 273.46 3417000 273.46 2013-03-04 265.36 273.30 264.14 273.11 3453000 273.11 2013-01-30 283.00 284.20 267.11 272.76 13075400 272.76 2013-01-14 268.00 274.26 267.54 272.73 4275000 272.73 2013-01-18 270.83 274.50 269.60 272.12 2942000 272.12 2013-01-15 270.68 272.73 269.30 271.90 2326900 271.90 2013-03-11 273.43 273.99 270.40 271.24 1904900 271.24AWS console supports various management functions of the cluster, we can reboot the cluster, we can modify parameters, we can resize it by defining different node type (XL->8XL) or decreasing/increasing the number of nodes. We can also delete the cluster via AWS console.