Inexpensive Analytics for Cloud SaaS Product

Keeping a watchful eye. Get better insights into the internal and customer-facing aspects of your SaaS product with this budget-conscious solution.

By Anshuman Singh · Aug. 21, 18 · Tutorial

Do you care about the growth and metrics of your product? This guide will walk you through setting up an end-to-end data pipeline. It targets AWS, but it should work on Google Cloud Platform (GCP) or Azure if you replace the AWS services with their equivalents.

Data Collection

It's important to make sure that you are collecting data from all user and system interaction points. Create a spreadsheet and document all interactions, events, and their destinations. Typically, they fall into these three categories:

  • Monitoring Data goes into Log Files: system events, alerts, warnings, and execution-time logs.
  • User Events go to Google Analytics (GA): data for all user interactions from the UI, emails, services, chatbots, and other integrations.
  • Application Data goes into the Database: the actual user domain data used by the application.

Collecting Log Files

Data written to log files is pushed to Logstash, which then forwards it to the AWS Elasticsearch service. When writing the logs, make sure you use consistent formatting and include enough detail.

  • Log Context: The active user, team, request, and activity that was going on when the log was written. This makes it easy to trace a problem back to its origin and to keep track of which parts of the application cause the most problems.
  • Consistent Formatting: Make it easy to extract the information later. While some people use JSON, we have found it's easier to use key-value pairs. Not ideal, but quick to write, small in size, and easy to read (a minimal logging helper is sketched after this list):
userId=xxxxx team=xxxxx event=xxxxxx error=xxxxxxx Some Plain Text Description


  • Logging Size: If you are doing something in a loop, maybe you only need to log a summary. Keeping the logs summarized and at the right level will make any later analysis easier and help keep a tab on data collection costs. You won't lose much by not logging 100 KB per request inside a loop.
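The block below is a minimal sketch of such a logging helper in Python, using only the standard logging module; the function name and the example field values are illustrative, not taken from the original setup.

import logging

logger = logging.getLogger("app")

def log_event(level, description, **context):
    """Write one log line as key=value pairs followed by a plain-text description."""
    pairs = " ".join(f"{key}={value}" for key, value in context.items())
    logger.log(level, "%s %s", pairs, description)

# Example (illustrative values):
# log_event(logging.ERROR, "Payment gateway timed out",
#           userId="u_123", team="t_456", event="checkout", error="GatewayTimeout")
# -> userId=u_123 team=t_456 event=checkout error=GatewayTimeout Payment gateway timed out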

User Events

All user interactions are tracked and sent to Google Analytics (GA). This includes user events from the UI, which are straightforward, but it also includes non-UI events like:

  • Email Opens: You can use a single-pixel image to track email opens. The image URL carries a tracker key; when the image is fetched, a custom event is sent to the GA backend.
  • Chatbot Interactions: These are another kind of user interaction. Messages trigger an action via a webhook URL, and the application code handling the webhook sends a custom event to the GA backend.
  • Integrations: Your SaaS product will have integrations with other products. Whenever a backend service is called, send a custom event to the GA backend to track that activity.

With GA, you should switch on custom User ID tagging. This is a UUID generated for every user in your app and sent to GA. It has multiple benefits, chief among them associating activity from multiple sessions with a single signed-up user. Make sure you send the right User ID for all custom events tracked from the backend.
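As a rough illustration of a backend-sent custom event (not the article's exact code), here is how the Universal Analytics Measurement Protocol can be called with the app-generated User ID; the tracking ID, event names, and helper name below are placeholders.

import requests

GA_ENDPOINT = "https://www.google-analytics.com/collect"  # Universal Analytics Measurement Protocol
GA_TRACKING_ID = "UA-XXXXXXXX-1"                          # placeholder property ID

def send_ga_event(user_id, category, action, label=None):
    """Send a backend custom event to GA, tagged with the app's UUID for the user."""
    payload = {
        "v": "1",               # protocol version
        "tid": GA_TRACKING_ID,  # tracking/property ID
        "cid": user_id,         # client ID (required); reusing the app's user UUID here
        "uid": user_id,         # custom User ID, so sessions roll up to one signed-up user
        "t": "event",           # hit type
        "ec": category,         # event category, e.g. "email"
        "ea": action,           # event action, e.g. "open"
    }
    if label:
        payload["el"] = label   # optional event label, e.g. the tracker key
    requests.post(GA_ENDPOINT, data=payload, timeout=5)

# e.g. when the tracking pixel for an email is fetched:
# send_ga_event(user_uuid, "email", "open", label=tracker_key)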

Application Data

This is straightforward. You are already storing user data; just add these extra fields to (almost) every table and data entity. It will help with analytics later on.

  • createdDate: when was the entity created
  • createdBy: who created the entity
  • lastModifiedDate: when was the entity last modified
  • lastModifiedBy: who modified the entity last

This helps with analytics and with auditing when things go wrong. For critical entities, you can add more fields, e.g. timesModified. Another option is to keep track of each and every modification; in such cases, a CQRS-like, command-based modification flow ensures you have the full change history for the entity.
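For instance, if you use an ORM such as SQLAlchemy (an assumption; the article doesn't name one), the audit fields can live in a mixin so every entity picks them up. The Project entity below is purely illustrative.

from sqlalchemy import Column, DateTime, Integer, String, func
from sqlalchemy.orm import declarative_base  # SQLAlchemy 1.4+; older versions import from sqlalchemy.ext.declarative

Base = declarative_base()

class AuditMixin:
    # Audit fields shared by (almost) every entity
    created_date = Column(DateTime, server_default=func.now())
    created_by = Column(String(64))            # set by application code from the active user
    last_modified_date = Column(DateTime, server_default=func.now(), onupdate=func.now())
    last_modified_by = Column(String(64))      # set by application code from the active user

class Project(AuditMixin, Base):  # hypothetical entity, for illustration only
    __tablename__ = "project"
    id = Column(Integer, primary_key=True)
    name = Column(String(255))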

ETL Pipeline

All your data is now being collected. However, it is in multiple places. For analytics, we want to get it all in one place so we can run queries and get insights from it. This is what you can do.

Elasticsearch Data with Kibana and Elastalert

You shouldn't be logging user events in the logs; those go to GA. If you ensure that, that's one less ETL activity. With that settled, there are two types of problems to solve using the Elasticsearch logs:

  • Monitoring Dashboards: AWS provides Kibana out of the box with the Elasticsearch service. Set it up and create dashboards for monitoring application health and metrics, e.g. jobs processed, response times, etc.
  • Alerting: Just monitoring is not enough; you need to know when something is going wrong or is about to. Yelp has an excellent utility called Elastalert. You write search queries to monitor, and if the results cross a certain threshold, it can send a message to your team's Slack channel. Just set it up on a t2.micro machine to run as a daemon (a rough sketch of the threshold-alert idea follows this list).
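Elastalert rules are written as YAML config, so the sketch below is not Elastalert itself; it is a rough Python rendition of the same threshold-alert idea (count recent errors in Elasticsearch, post to a Slack webhook). The index name, field names, and webhook URL are placeholders.

import requests

ES_URL = "http://localhost:9200"                                # placeholder Elasticsearch endpoint
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder incoming webhook
THRESHOLD = 50

def check_error_rate():
    """Count error-level log lines in the last 15 minutes and alert Slack if too many."""
    query = {
        "query": {
            "bool": {
                "must": [
                    {"match": {"level": "ERROR"}},
                    {"range": {"@timestamp": {"gte": "now-15m"}}},
                ]
            }
        }
    }
    resp = requests.get(f"{ES_URL}/logstash-*/_count", json=query, timeout=10)
    count = resp.json()["count"]
    if count > THRESHOLD:
        requests.post(SLACK_WEBHOOK, json={"text": f"{count} errors in the last 15 minutes"}, timeout=10)

# Run from cron or a small daemon loop; this is roughly what an Elastalert
# "frequency" rule with a Slack alert gives you out of the box.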

DMS Pipeline for Importing Application Data

Reporting queries can't run on the production database. AWS offers the Database Migration Service (DMS), which can keep two databases in sync. You can use DMS to set up a replication task and a data warehouse on an AWS RDS instance. If you choose AWS Aurora as the warehouse database, the first six months of the DMS instance are free. DMS uses log replication for syncing, so the load it puts on the production database is minimal.

In this step, you should exclude certain tables and fields (a hedged example of the mapping rules follows this list):

  • Tables not needed for analytics, such as audit tables and temporary data
  • Large text and binary fields not needed for analytics
  • Personally identifiable information, e.g. name, email, address
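Here is a hedged example of DMS table-mapping rules that do this kind of exclusion; the schema, table, and column names are made up, and the resulting JSON can be pasted into the DMS console or passed to the replication task via boto3.

import json

# Selection and transformation rules for the DMS replication task (illustrative names)
table_mappings = {
    "rules": [
        {   # replicate everything in the application schema by default
            "rule-type": "selection", "rule-id": "1", "rule-name": "include-app",
            "object-locator": {"schema-name": "app", "table-name": "%"},
            "rule-action": "include",
        },
        {   # skip audit/temporary tables not needed for analytics
            "rule-type": "selection", "rule-id": "2", "rule-name": "exclude-audit",
            "object-locator": {"schema-name": "app", "table-name": "audit_%"},
            "rule-action": "exclude",
        },
        {   # drop a PII column from the replicated users table
            "rule-type": "transformation", "rule-id": "3", "rule-name": "drop-email",
            "rule-target": "column",
            "object-locator": {"schema-name": "app", "table-name": "users", "column-name": "email"},
            "rule-action": "remove-column",
        },
    ]
}

print(json.dumps(table_mappings, indent=2))  # paste into the DMS task's table mappings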

GA API for Importing User Event Data

The Google Analytics API can be used directly to import data. However, a different approach can work, too. Metabase is an open-source project for creating beautiful dashboards. It integrates with data sources like MySQL, Postgres, AWS Redshift, and Google Analytics. You will be using Metabase to create dashboards later anyway, so you might as well use it for the GA integration. Here is how you can set it up:

  • Set up Metabase and connect your GA account to it
  • Create reports in Metabase for the events and data you want to import
  • Metabase exposes an API to download report results as CSV
  • Set up an hourly Python script to download the CSV and upload it to the Aurora DB
  • You will need an API key (session token), which can be generated using a Metabase user:
    curl -X POST -H "Content-Type: application/json" \
         -d '{"username": "<user>", "password": "<password>"}' \
         https://data.meetnotes.co/api/session


You want the uploads to be overlapping and incremental. The easy way to do that is to run the script every 6 hours and have the Metabase report return data for the last 24 hours. The Python script then deletes the existing data for the date range it already has and re-uploads everything from the response. This does more work than strictly necessary, but it is simple to implement.
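Here is a rough sketch of that script, assuming a MySQL-compatible Aurora target, a Metabase question whose first column is a date, and placeholder names throughout (card ID, table, credentials); the CSV export endpoint may vary slightly between Metabase versions.

import csv
import io
import requests
import pymysql

METABASE = "https://data.example.com"   # placeholder Metabase URL
CARD_ID = 42                            # placeholder report (question) ID

# 1. Get a session token (same call as the curl example above)
session = requests.post(f"{METABASE}/api/session",
                        json={"username": "<user>", "password": "<password>"}).json()["id"]

# 2. Download the report as CSV (the report itself returns the last 24 hours of events)
resp = requests.post(f"{METABASE}/api/card/{CARD_ID}/query/csv",
                     headers={"X-Metabase-Session": session})
rows = list(csv.reader(io.StringIO(resp.text)))[1:]  # skip the header row

# 3. Delete the overlapping date range, then re-insert everything from the response
dates = sorted({row[0] for row in rows})
conn = pymysql.connect(host="aurora-host", user="etl", password="...", database="warehouse")
with conn.cursor() as cur:
    cur.execute("DELETE FROM ga_events WHERE event_date BETWEEN %s AND %s", (dates[0], dates[-1]))
    cur.executemany("INSERT INTO ga_events (event_date, user_id, category, action, count) "
                    "VALUES (%s, %s, %s, %s, %s)", rows)
conn.commit()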

Data Warehouse

Now you have all the useful data in the Aurora DB. Next, we need a reporting schema that is both easy to write reports against and fast to query. The ETL pipeline delivers normalized data; we will build a star schema on top of it for reporting. Here is how that is done.

Creating the Schema

  • Identify Dimensions and create tables for them. For example, if user, date, and country are dimensions, you would have dim_users, dim_dates, and dim_country as dimension tables. A useful way to think about dimensions is that they show up in the WHERE clauses of your SQL queries.
  • Identify dimension attributes for each dimension, e.g. whether it is a paid user, their signup date, etc.
  • Identify generated attributes for the dimensions. Sometimes it's a calculated attribute that you want to use in queries, e.g. user age in days since signup.
  • Create Fact Tables for combinations of dimensions. For example, how many times a user logged in on a particular day would be a fact for dim_users and dim_dates. Many combinations are possible, and the right ones depend on your domain and the reports you want to generate (a hedged DDL sketch follows this list).
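As a hedged sketch of what that could look like for the user/date example, the script below creates illustrative dimension and fact tables on a MySQL-compatible Aurora warehouse; every table and column name here is made up for illustration.

import pymysql

DDL = [
    """CREATE TABLE IF NOT EXISTS dim_users (
           user_id      BIGINT PRIMARY KEY,
           is_paid      BOOLEAN,
           signup_date  DATE,
           age_in_days  INT            -- generated attribute: days since signup
       )""",
    """CREATE TABLE IF NOT EXISTS dim_dates (
           date_id      DATE PRIMARY KEY,
           day_of_week  TINYINT,
           is_weekend   BOOLEAN
       )""",
    """CREATE TABLE IF NOT EXISTS fact_user_daily (
           user_id          BIGINT,
           date_id          DATE,
           logins           INT,
           invitations_sent INT,
           PRIMARY KEY (user_id, date_id)
       )""",
]

conn = pymysql.connect(host="aurora-host", user="etl", password="...", database="warehouse")
with conn.cursor() as cur:
    for statement in DDL:
        cur.execute(statement)
conn.commit()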

Data Population

Once the schema is created, write queries to populate the data. This is straightforward, except that the queries will be run periodically, so you have to think about incremental updates.

Incremental Updates

Depending on the amount of data, there are multiple approaches. The simplest one is to wipe the complete data set and repopulate it, but that does more work than necessary and becomes untenable for larger data sizes. Here is a quick take on making things incremental:

  • Dimension table rows are easily made incremental: just insert the rows not already present, based on primary keys.
  • Dimension table columns fall into two cases: those that never change, such as the signup date, can be part of the insert queries; those that change, like total_time_spent, need to be updated.
  • Fact table rows need additional rows based on newly inserted dimensions. It's a cross join, excluding the rows that are already present.
  • Fact table columns again fall into two cases. The first are those with a temporal quality, meaning they don't need to be updated once calculated; an example is logins_per_day, which, once calculated for a past date, won't change. The others are attributes that have to be recalculated every time, like invitations_sent.

Schema creation scripts are run once; the incremental updates then run from an hourly cron job, as in the sketch below.
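Below is a hedged sketch of such an incremental update job, continuing the illustrative dim_users/fact_user_daily schema from the earlier sketch and assuming the replicated application tables are called users and logins (all names are placeholders).

import pymysql

INCREMENTAL_SQL = [
    # New dimension rows: insert users not already present, keyed on the primary key
    """INSERT INTO dim_users (user_id, is_paid, signup_date, age_in_days)
       SELECT u.id, u.is_paid, DATE(u.createdDate), DATEDIFF(CURDATE(), u.createdDate)
       FROM users u
       WHERE NOT EXISTS (SELECT 1 FROM dim_users d WHERE d.user_id = u.id)""",

    # Changing dimension columns: recalculate the ones that drift, e.g. age since signup
    """UPDATE dim_users d
       JOIN users u ON u.id = d.user_id
       SET d.age_in_days = DATEDIFF(CURDATE(), u.createdDate)""",

    # New fact rows: cross join the dimensions, excluding combinations already present
    """INSERT INTO fact_user_daily (user_id, date_id, logins, invitations_sent)
       SELECT d.user_id, dd.date_id, 0, 0
       FROM dim_users d CROSS JOIN dim_dates dd
       WHERE dd.date_id >= DATE_SUB(CURDATE(), INTERVAL 1 DAY)
         AND NOT EXISTS (SELECT 1 FROM fact_user_daily f
                         WHERE f.user_id = d.user_id AND f.date_id = dd.date_id)""",

    # Fact columns with a temporal quality: only recalculate the recent window
    """UPDATE fact_user_daily f
       JOIN (SELECT user_id, DATE(createdDate) AS day, COUNT(*) AS n
             FROM logins
             WHERE createdDate >= DATE_SUB(CURDATE(), INTERVAL 1 DAY)
             GROUP BY user_id, DATE(createdDate)) l
         ON l.user_id = f.user_id AND l.day = f.date_id
       SET f.logins = l.n""",
]

conn = pymysql.connect(host="aurora-host", user="etl", password="...", database="warehouse")
with conn.cursor() as cur:
    for statement in INCREMENTAL_SQL:
        cur.execute(statement)
conn.commit()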

Ongoing Maintenance

Once you set this up, there isn't much ongoing maintenance needed to keep it operating as is. However, as things change and new reports are needed, you will have to do the following:

  • Performance & Query Optimization: Some reports will start to run slowly as data grows. Create indexes and follow standard query optimization techniques to improve their runtime.
  • Reporting Schema Changes: Simple schema changes can be additive, but from time to time you will need a more destructive change. If so, just make the change, wipe out the schema, recreate it with the changes, and re-run the data population scripts.
  • Production DB Schema Changes: Fix the queries and run them again. If it's a big change, recreate the warehouse as in the previous step.
  • AWS DMS Management: DMS tasks fail from time to time, though it's pretty infrequent, less than once a month. If that happens, just restart the tasks.

As a best practice, create a dashboard to monitor the pipeline itself. It should keep track of the last update from each source; if that becomes too old, it means something is broken.

That should set you up with a solid, simple, and relatively inexpensive analytics pipeline. But that's just the start. Now go ahead and build reports and dashboards to understand user behavior and grow your product.

Some configuration and black magic are needed for each of the parts above to work. If you are planning to do this and run into setup issues, leave a comment or message. I'll be happy to help.

Published at DZone with permission of Anshuman Singh. See the original article here.
