Inexpensive Analytics for Cloud SaaS Product
Keeping a watchful eye: get better insight into the internal and customer-facing aspects of your SaaS product with this budget-conscious solution.
Do you care about the growth and metrics of your product? This guide will walk you through setting up an end-to-end data pipeline. It is written for AWS, but it should also work on Google Cloud Platform (GCP) or Azure if you replace the AWS services with their equivalents.
Data Collection
It's important to make sure that you are collecting data from all user and system interaction points. Create a spreadsheet and document all interactions, events, and their destinations. Typically, they fall into these three categories:
- Monitoring Data goes into Log Files: These are system events, alerts, warnings, and execution-time logs.
- User Events go to Google Analytics (GA): Data for all user interactions, from the UI, emails, services, chatbots, and other integrations.
- Application Data in the Database: The actual user domain data used by the application.
Collecting Log Files
Data written to log files is pushed to Logstash, which then pushes it to an AWS Elasticsearch service. When writing the logs, make sure you use consistent formatting and include enough detail.
- Log Context: The active user, team, request, and activity that was going on when the log was written. This makes it easy to trace a problem back to its origin and to keep track of which parts of the application cause the most problems.
- Consistent Formatting: Makes it easier to extract the information later. While some people use JSON, we have found it's easier to use key-value pairs. Not ideal, but quick to write, small in size, and easy to read (a small Python helper is sketched after this list):
userId=xxxxx team=xxxxx event=xxxxxx error=xxxxxxx Some Plain Text Description
- Logging Size: If you are doing something in a loop, maybe you just need to log a summary. Keeping the logs summarized and at the right level will make any later analysis easier and help keep a tab on data collection costs. It isn't a big deal if you don't log 100 KB per request from inside a loop.
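To keep that format consistent across the codebase, it helps to centralize it in a small helper. Here is a minimal sketch using Python's standard logging module; the field names and the log_event helper are illustrative, not part of any particular framework.

import logging

logger = logging.getLogger("app")
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

def log_event(event, message="", **context):
    # Emit one line as space-separated key=value pairs followed by free text
    pairs = " ".join(f"{key}={value}" for key, value in {"event": event, **context}.items())
    logger.info("%s %s", pairs, message)

# e.g. after processing a batch in a loop, log only the summary:
log_event("invoice_sync", "Nightly batch finished", userId="u-123", team="t-456", processed=250, failed=3)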
User Events
All user interactions are tracked and sent to Google Analytics (GA). This includes user events from the UI, which is straightforward. However, it also includes non-UI events, like:
- Email Opens: You can use a single-pixel image to track email opens. The image URL carries a tracker key; when the image is fetched, your server sends a custom event to the GA backend.
- Chatbot Interactions: These are user interactions too. Messages trigger an action via a webhook URL, and the application code handling webhooks sends a custom event to the GA backend.
- Integrations: Your SaaS product will likely integrate with other services. Whenever a backend service is called, send a custom event to the GA backend to track that activity.
With GA, you should switch on custom User ID tracking. The ID would be a UUID generated for every user in your app and sent to GA. This has multiple benefits and allows associating user activity from multiple sessions with a single signed-up user. Make sure you send the right User ID with all custom events tracked from the backend.
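For the backend-originated events above, here is a minimal sketch of sending a custom event with a User ID via the Universal Analytics Measurement Protocol; the property ID, event names, and the send_ga_event helper are placeholders, and GA4 properties use a different endpoint.

import uuid
import requests

GA_TRACKING_ID = "UA-XXXXXXX-1"  # placeholder property ID

def send_ga_event(user_id, category, action, label=None):
    payload = {
        "v": "1",                  # protocol version
        "tid": GA_TRACKING_ID,     # tracking/property ID
        "cid": str(uuid.uuid4()),  # client ID, required by the protocol
        "uid": user_id,            # your app's UUID for the signed-up user
        "t": "event",              # hit type
        "ec": category,            # event category, e.g. "email"
        "ea": action,              # event action, e.g. "open"
    }
    if label:
        payload["el"] = label
    requests.post("https://www.google-analytics.com/collect", data=payload, timeout=5)

# e.g. called from the tracking-pixel handler or a chatbot webhook:
send_ga_event(user_id="<user-uuid>", category="email", action="open", label="weekly_digest")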
Application Data
This is straightforward. You are already storing user data; just store these extra fields for (almost) every table and data entity. It will help analytics later on.
- createdDate: when was the entity created
- createdBy: who created the entity
- lastModifiedDate: when was the entity last modified
- lastModifiedBy: who modified the entity last
This helps in analytics and auditing when things go wrong. For critical entities, you can add more fields, e.g. timesModified. Another option is to keep track of each and every modification; in that case, a CQRS-like, command-based modification flow will make sure you have the full change history for the entity.
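If your application uses an ORM, these columns are easy to apply everywhere with a mixin. Below is a minimal sketch assuming a Python/SQLAlchemy stack; the class and column names are illustrative, so adapt them to your own models.

from datetime import datetime, timezone
from sqlalchemy import Column, DateTime, String

class AuditColumnsMixin:
    # Inherit from this mixin alongside your declarative Base to get the audit fields
    created_date = Column(DateTime, default=lambda: datetime.now(timezone.utc), nullable=False)
    created_by = Column(String(36))        # UUID of the creating user
    last_modified_date = Column(DateTime, onupdate=lambda: datetime.now(timezone.utc))
    last_modified_by = Column(String(36))  # UUID of the last modifying user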
ETL Pipeline
All your data is now collected. However, it is in multiple places. For analytics, we want to get it all in one place, to run queries and get insights from it. This is what you can do.
ElasticSearch Data with Kibana & Elastalert
You shouldn't be logging user events in log files; those go to GA. If you ensure that, that's one less ETL activity. With that, there are two types of problems to solve using the Elasticsearch logs.
- Monitoring Dashboards: AWS provides Kibana out of the box on the Elasticsearch service. Set it up and create dashboards for monitoring application health and metrics, e.g. jobs processed, response times, etc.
- Alerting: Just monitoring is not enough; you need to know when something is going wrong, or is about to. Yelp has an excellent utility called ElastAlert. You can use it to write search queries to monitor, and if the results cross a certain threshold, it can send a message to your team's Slack channel. Just set it up on a t2.micro machine to run as a daemon.
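ElastAlert rules are plain YAML files. Here is a minimal sketch of one that pings Slack when error log lines spike; the rule name, index pattern, query, and webhook URL are placeholders to adapt to your own log format.

name: too-many-errors
type: frequency
index: app-logs-*
num_events: 10        # alert if more than 10 matching documents...
timeframe:
  minutes: 5          # ...show up within 5 minutes
filter:
- query:
    query_string:
      query: "level:ERROR"
alert:
- "slack"
slack_webhook_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"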
DMS Pipeline for Importing Application Data
Reporting queries shouldn't run against the production database. AWS offers the Database Migration Service (DMS), which can keep two databases in sync. You can use DMS to set up a replication task that feeds your data warehouse, an AWS RDS instance. If you choose AWS Aurora as the database, the first six months of the DMS instance are free. DMS uses log-based replication for syncing, so the load it adds to the production database is minimal.
In this step, you should exclude certain tables and fields (a sample table-mapping configuration is sketched after this list):
- Tables that are not needed for analytics, such as audit tables and temporary data
- Large text and binary fields not needed for analytics
- Personally identifiable information, e.g. name, email, address
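DMS expresses these exclusions as JSON table mappings on the replication task. The sketch below includes everything, drops audit tables, and removes a PII column; the schema, table, and column names are placeholders.

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-everything",
      "object-locator": { "schema-name": "app", "table-name": "%" },
      "rule-action": "include"
    },
    {
      "rule-type": "selection",
      "rule-id": "2",
      "rule-name": "exclude-audit-tables",
      "object-locator": { "schema-name": "app", "table-name": "audit_%" },
      "rule-action": "exclude"
    },
    {
      "rule-type": "transformation",
      "rule-id": "3",
      "rule-name": "drop-email-column",
      "rule-target": "column",
      "object-locator": { "schema-name": "app", "table-name": "users", "column-name": "email" },
      "rule-action": "remove-column"
    }
  ]
}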
GA API for Importing User Event Data
The Google Analytics API can be used directly to import data. However, a different approach works too. Metabase is an open-source project for creating beautiful dashboards. It integrates with data sources like MySQL, Postgres, AWS Redshift, and Google Analytics. You will be using Metabase for creating dashboards later anyway, so you might as well use it for the GA integration. Here is how you can set it up:
- Set up Metabase and connect your GA account to it
- Create reports in Metabase for the events and data you want to import
- Metabase exposes an API to download a report's results as a CSV
- Set up a scheduled Python script to download the CSV and upload it to the Aurora DB (sketched below)
- You would need a session token, which can be generated using a Metabase user's credentials:
curl -X POST -H "Content-Type: application/json" \
  -d '{"username": "<user>", "password": "<password>"}' \
  https://data.meetnotes.co/api/session
You would want the uploads to be overlapping and incremental. The easy way to do this is to run the script every six hours and have the Metabase report return data for the last 24 hours. The Python script then deletes the existing data for the date range it already has and re-uploads everything from the response. This does more work than necessary but is simple to implement.
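Here is a minimal sketch of that periodic script, assuming a saved Metabase report (card) that returns the last 24 hours of GA events and an Aurora MySQL target table named ga_events; the card ID, host names, credentials, and column names are placeholders, and the CSV export endpoint should be checked against your Metabase version.

import csv
import io
import pymysql
import requests

METABASE = "https://data.meetnotes.co"
CARD_ID = 42  # placeholder ID of the saved GA report

# 1. Authenticate and get a session token
session_id = requests.post(
    f"{METABASE}/api/session",
    json={"username": "<user>", "password": "<password>"},
).json()["id"]

# 2. Download the report results as CSV
response = requests.post(
    f"{METABASE}/api/card/{CARD_ID}/query/csv",
    headers={"X-Metabase-Session": session_id},
)
rows = list(csv.DictReader(io.StringIO(response.text)))

# 3. Delete the overlapping date range, then re-insert everything from the response
conn = pymysql.connect(host="<warehouse-endpoint>", user="etl",
                       password="<password>", database="analytics")
with conn.cursor() as cur:
    cur.execute("DELETE FROM ga_events WHERE event_date >= CURDATE() - INTERVAL 1 DAY")
    cur.executemany(
        "INSERT INTO ga_events (event_date, user_id, category, action, hits) "
        "VALUES (%s, %s, %s, %s, %s)",
        [(r["date"], r["user_id"], r["category"], r["action"], r["hits"]) for r in rows],
    )
conn.commit()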
Data Warehouse
Now you have all the useful data in the Aurora DB. We need to create a reporting schema that is both easy to write reports against and fast. The ETL pipeline delivers normalized data; we will next create a star schema for reporting. Here is how that is done.
Creating the Schema
- Identify dimensions and create tables for them. For example, if user, date, and country are dimensions, then you would have dim_users, dim_dates, and dim_country as dimension tables. The way to think about dimensions is that they show up in your SQL WHERE clauses.
- Identify dimension attributes for the dimensions, e.g. whether it is a paid user, their signup date, etc.
- Identify generated attributes for the dimensions. Sometimes it's a calculated attribute that we want to use in our queries, e.g. user age in days since signup.
- Create fact tables for combinations of dimensions. For example, how many times a user logged in on a particular day would be a fact for dim_users and dim_dates. There are many possible combinations, and the right ones depend on your domain and the reports you want to generate (a sample schema follows this list).
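For concreteness, here is a minimal sketch of such a schema in MySQL/Aurora flavor; the table and column names are illustrative and should follow your own dimensions and facts.

CREATE TABLE dim_users (
    user_id      CHAR(36) PRIMARY KEY,           -- the app's user UUID
    is_paid      BOOLEAN NOT NULL DEFAULT FALSE, -- dimension attribute
    signup_date  DATE NOT NULL,                  -- attribute that never changes
    age_in_days  INT                             -- generated attribute, refreshed by the ETL
);

CREATE TABLE dim_dates (
    date_id     DATE PRIMARY KEY,
    day_of_week TINYINT NOT NULL,
    is_weekend  BOOLEAN NOT NULL
);

CREATE TABLE fact_user_daily (
    user_id          CHAR(36) NOT NULL,
    date_id          DATE NOT NULL,
    logins_per_day   INT NOT NULL DEFAULT 0,   -- temporal: fixed once the day has passed
    invitations_sent INT NOT NULL DEFAULT 0,   -- recalculated on every run
    PRIMARY KEY (user_id, date_id),
    FOREIGN KEY (user_id) REFERENCES dim_users (user_id),
    FOREIGN KEY (date_id) REFERENCES dim_dates (date_id)
);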
Data Population
Once the schema is created, write queries to populate data. This is straightforward, except the queries would be run periodically, so you have to think about incremental updates.
Incremental Updates
Depending on the amount of data, there are multiple approaches. The simplest one is to wipe the complete data set and repopulate it, but that does more work than necessary and becomes untenable for larger data sizes. Here is a quick take on making things incremental (sample queries follow the list).
- Dimension table rows are easily made incremental: just insert rows not already present, based on primary keys.
- Dimension table columns fall into two cases: those that never change, such as the signup date, can be part of the insert queries; those that change, like total_time_spent, are the only ones you need to update.
- Fact table rows need additional rows for newly inserted dimension values. It's a cross join of the dimensions, excluding rows that are already present.
- Fact table columns again fall into two cases. The first are those with a temporal quality, meaning they don't need to be updated once calculated; for example, logins_per_day, once calculated for a past date, won't change. The others are attributes that have to be recalculated every time, like invitations_sent.
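Here is a minimal sketch of what those incremental queries can look like against the sample schema above, in MySQL/Aurora flavor; the replicated production table (users) and its columns are placeholders.

-- New dimension rows: insert only users not already present.
INSERT INTO dim_users (user_id, is_paid, signup_date)
SELECT u.id, u.is_paid, DATE(u.createdDate)
FROM users u
LEFT JOIN dim_users d ON d.user_id = u.id
WHERE d.user_id IS NULL;

-- Changing dimension columns: refresh only the attributes that move.
UPDATE dim_users SET age_in_days = DATEDIFF(CURDATE(), signup_date);

-- New fact rows: cross join of dimensions, excluding rows already present
-- (restricted to recent dates to keep the join small).
INSERT INTO fact_user_daily (user_id, date_id)
SELECT d.user_id, dt.date_id
FROM dim_users d
CROSS JOIN dim_dates dt
LEFT JOIN fact_user_daily f ON f.user_id = d.user_id AND f.date_id = dt.date_id
WHERE dt.date_id >= CURDATE() - INTERVAL 2 DAY
  AND f.user_id IS NULL;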
Schema creation scripts are run once; the incremental updates then run from an hourly cron job.
Ongoing Maintenance
Once you set this up, there isn't really any ongoing maintenance needed to keep it operating as is. However, as things change and new reports are needed, you will need to do the following.
- Performance & Query Optimization: Some reports will start running slowly as data grows. Create indexes and follow standard query optimization techniques to improve their runtime.
- Reporting Schema Changes: Simple schema changes can be additive, but from time to time you will need a more destructive change. If so, just make the change, wipe out the schema, recreate it with the changes, and re-run the data population scripts.
- Production DB Schema Changes: Fix the population queries and run them again. If it's a big change, recreate the warehouse as in the previous step.
- AWS DMS Management: DMS tasks fail from time to time, but it's pretty infrequent, less than once a month. If that happens, just restart the task.
As a best practice, create a dashboard to monitor the pipeline. It should keep track of the last update from each source; if that becomes too old, it means something is broken.
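A single freshness query is usually enough to drive that dashboard. Here is a minimal sketch; the source tables and timestamp columns are placeholders matching the examples above.

SELECT 'ga_events' AS source, MAX(event_date) AS last_update FROM ga_events
UNION ALL
SELECT 'users', MAX(lastModifiedDate) FROM users
UNION ALL
SELECT 'fact_user_daily', MAX(date_id) FROM fact_user_daily;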
That should set you up with a solid, simple, and relatively inexpensive analytics pipeline. But that's just the start. Now go ahead and build reports and dashboards to understand user behavior and grow your product.
Some configuration and black magic is needed for each of the parts above to work. If you are planning to do it and face setup issues, leave a comment or message. I'll be happy to help.