Step-by-Step: How to Load Your Google Analytics Data into Amazon Redshift
How to load and transform Google Analytics data in Amazon's Redshift platform.
Join the DZone community and get the full member experience.Join For Free
the basic free google analytics (ga) functionality is great for small and medium websites and mobile applications, but as your online business expands, you will start looking to get more of this data. however, ga lacks advanced capabilities such as automated user segmentation, anomaly detection, and predictive modeling. that is why users are looking to integrate their ga within their data warehouses , for example, aws redshift. but can these be easily integrated?
over the last few months i was in charge of creating our google analytics extension here at panoply.io that helps in loading, transforming and analyzing the data to our platform. as we developed this feature, we learned how to use ga custom dimensions, manually export the data using ga reporting apis , and finally get this data loaded into redshift.
this article provides step-by-step guidelines on how to extract and load data to amazon redshift and use it for comprehensive web analytics.
for the purpose of illustration, i would like to check the number of pageviews per client, based on the source traffic country and corresponding network service provider. this will help us to determine which client is using what kind of network service providers, according to country, and how many pages have been viewed per network service provider or country.
to show you how to do this, i have created three custom ga dimensions . first, i created two session scope custom dimension names, client id and session id , as well as a hit scope custom dimension name, hit timestamp . the purpose of creating these custom dimensions is to generate session and hit scope data together, so we can assign client ids and session ids based on browser, along with a hit timestamp based on pageviews. we’ll then load the results of our queries to redshift and combine them to generate detailed tracking of pageviews both by client and session.
creating custom dimensions and tags with google tag manager
to create your custom dimension, sign into your ga account, then perform these steps:
once the custom dimensions are setup and defined tags using google tag manager, it’s time to collect the data from ga.
pulling the data from ga
pulling the data from ga is not a straightforward task. although the google reporting api allows us to view metric values for a number of dimensions, it doesn’t provide any built-in dimension for incoming hit stream data . so here you can use a python script, which will extract the raw data off ga and save it to a csv file. you can obtain the python code and full instructions here .
from our custom dimensions, you’ll create two different csv files. for the client id custom dimension, customize the following lines in the python script:
return service.data().ga().get( ids=table_id, start_date='2016-11-24', end_date='2016-12-24', metrics='ga:sessionsperuser', dimensions='ga:country,ga:serviceprovider,ga:clientid', sort='-ga:sessionsperuser', start_index='1', max_results='25')
then save the output to a csv file named pageviews.csv. the output will resemble the following:
for our session id custom dimension, customize the following lines and save to a file called sessions.csv:
return service.data().ga().get( ids=table_id, start_date='2016-11-24', end_date='2016-12-24', metrics='ga:pageviews', dimensions='ga:country,ga:serviceprovider,ga:sessionid', sort='-ga:pageviews', start_index='1', max_results='25')
the output will resemble the following:
next, upload these files to your s3 bucket and from there to redshift.
loading the data to redshift
before loading data directly into amazon redshift, it is important to decide how the raw data will be structured within redshift. in our example, we issued the following sql commands to create our clients and sessions tables:
create table client ( country varchar(64) not null, serviceprovider varchar(64) not null, clientid integer not null, sessionsperuser integer not null, ); create table session ( country varchar(64) not null, serviceprovider varchar(64) not null sessionid varchar(64) not null pageviews integer not null, );
next i copied the data from our s3 bucket into those tables. to do so, we executed the following commands:
copy pageviews from 's3://awssampledbuswest2/clients.csv’ credentials 'aws_access_key_id=<your-access-key-id>;aws_secret_access_key=<your-secret-access-key>' copy usertype from 's3://awssampledbuswest2/sessions.csv’ credentials 'aws_access_key_id=<your-access-key-id>;aws_secret_access_key=<your-secret-access-key>'
once the data has been copied to our tables in redshift, i can execute, for example, a query to join the two tables into one:
select client.country, client.serviceprovider, client.clientid, client.sessionsperuser, session.sessionid, session.pageviews from session join client on session.country = client.country
the sample output will resemble the following:
note: to keep the data up to date, a cron job should be set up to update the redshift table at regular intervals.
i hope this article has helped you. i am already working on next week’s article, and continuing this example showing how to analyze the data loaded to redshift using tableau.
Published at DZone with permission of Alon Brody. See the original article here.
Opinions expressed by DZone contributors are their own.