How to Provide Consistent Data Analysis Within a Big Data Team
We’ll talk about the challenges of being a data engineer versus a software engineer, and the relationship between their departments.
In previous posts, we were mainly looking at the process aspects of making a company data-driven, like how to define good metrics and the relationship between analysts and business users. Now we’ll talk about the challenges of being a data engineer versus a software engineer and the relationship between their departments.
People often treat these as completely separate entities, which is not true if the company is data driven.
Data Challenges
When we think about data pipelines, we typically imagine some bearded guy with dark circles under his eyes who builds and maintains ETLs. This is valid for cases related to the integration of mature external products (Google and Facebook ads, Zendesk, etc.) into your analytics environment.
Data engineers can handle different tasks independently from the software engineering department. These can be pulling data from clear, well-documented, stable APIs designed for retrieval of historical data from a third party. Plus, today there is a range of companies that build their business around syncing your data for you.
At the same time, most startups are software-first companies that build both customer facing and internal software, and this is an entirely different story.
This software is often not clear, not well-documented, not stable, and not designed for retrieval of historical data.
There are good reasons behind this:
- Resource constraints and sacrifices made for the sake of the iteration speed.
- Lack of data awareness among developers. The typical developer mindset concentrates primarily on how to make things work here and now, and how to make them work properly. Change logging is usually at the bottom of the priority list.
- Modern frameworks based on the ORM concept. Developers describe entities in the code itself without thinking too much about the database model. The framework creates the database model automatically from the declaration of classes and their relationships.
This seems (and truly is) very convenient; however, this paradigm creates a bias toward the “here and now” behavior of the objects rather than proper logging and analytics-friendly database storage.
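As a rough illustration (a hypothetical Django model, not taken from any particular codebase), this is all it takes to get a working subscription table. The schema falls out of the class declaration, and nothing here prompts the developer to think about preserving history:

```python
# Hypothetical Django model: the framework derives the database schema
# directly from this class declaration, so the database model is never
# designed explicitly, and nothing in it preserves historical states.
from django.db import models


class Subscription(models.Model):
    user_id = models.IntegerField()
    start_date = models.DateField()
    cancellation_date = models.DateField(null=True, blank=True)
    plan = models.CharField(max_length=20)  # "basic" or "premium"
```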
This causes a “garbage in, garbage out” problem, where the data you get from your production system is not suitable for building business metrics. The data requires redundant transformation or is simply incurable, because the entities change and their previous states are lost forever.
For any process to go smoothly, participants should have balanced responsibility and authority and sufficient resources, but there is a problem in both cases:
- Responsibility is complete while authority is incomplete. The data team is the one who is accountable for the quality of the final data product, while the quality is partially outside of the team’s control and depends on the software design.
- Lack of resources. The ratio of data engineers to software engineers in a company is far from 1 to 1 (it can be 1 to 10).
This causes a lot of stress, which could be avoided if the software were designed right. This is why it is crucial to have an understanding at all levels that the data pipeline for the company-written software starts in the engineering department.
Let’s take a look at a very simple example. Imagine a company selling subscriptions that has two plans: basic that costs $10/month and premium that is $15/month. The most basic subscription record would look like this:
| user_id | start_date | cancellation_date | plan |
|---------|------------|-------------------|-------|
| 999     | 2018-01-01 | null              | basic |
The fields are self-explanatory. What happens when the subscription is upgraded to premium on 2018-02-01? A bare minimum would be to overwrite the “plan” column:
| user_id | start_date | cancellation_date | plan    |
|---------|------------|-------------------|---------|
| 999     | 2018-01-01 | null              | premium |
Also, if the developer wants to show in the UI the date when the current plan started, they might want to update the “start_date” field as well:
| user_id | start_date | cancellation_date | plan    |
|---------|------------|-------------------|---------|
| 999     | 2018-02-01 | null              | premium |
The customer will see the plan updated in the UI, premium functionality will become available, and the billing system will use the updated plan to charge $15 going forward.
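In code, the change behind this is trivially small. A minimal sketch, assuming the hypothetical Django Subscription model from earlier, might look like this:

```python
# Hypothetical upgrade handler: this is the "bare minimum" change that keeps
# the UI and billing correct, but silently destroys the subscription history.
# Subscription is the illustrative model sketched earlier in this article.
from datetime import date


def upgrade_to_premium(user_id: int) -> None:
    sub = Subscription.objects.get(user_id=user_id)
    sub.plan = "premium"               # previous plan is overwritten
    sub.start_date = date(2018, 2, 1)  # "current plan started" date for the UI
    sub.save()                         # the old state is now lost forever
```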
Now things are all good, except this is a nightmare for a data analyst because the data is overwritten and the past state of the subscriber is lost forever, as well as the transition date to the new state. Typical data questions about subscriptions include:
- How many subscriptions by plan did the company have on day X?
- What was the subscription revenue in month X?
- How many subscribers switched from one plan to another and how did this impact the revenue?
None of these questions can be answered accurately (or answered at all) when the state of the objects is overwritten.
The subscription above would be counted as premium even before the transition date; the revenue would be skewed as well, because it would count $15/month starting from 2018-01-01, which is incorrect; and the plan switches would be impossible to identify at all.
Even worse, if “start_date” is updated to 2018-02-01, the revenue for January wouldn’t be counted at all. The same applies to contract pausing (when someone ends and then restarts their subscription), product repricing, and other changes.
This problem has been known for a while and is called the slowly changing dimensions problem; however, today it’s not slow anymore.
In traditional BI, it was explained with less common examples, such as “the customer changed their state of residence” or “the customer changed their name.” In today’s startup environment, business is rapidly changing and there are lots of moving parts and hypotheses being tested, so this problem requires very special attention.
Ways to Handle This Problem
There are multiple ways to handle this problem:
- Regularly dump the table somewhere outside the database and sync dumps into a data warehouse.
The same record in consecutive dumps would look like this:
| Dump date  | user_id | start_date | cancellation_date | plan    |
|------------|---------|------------|-------------------|---------|
| 2018-01-30 | 999     | 2018-01-01 | null              | basic   |
| 2018-01-31 | 999     | 2018-01-01 | null              | basic   |
| 2018-02-01 | 999     | 2018-01-01 | null              | premium |
| 2018-02-02 | 999     | 2018-01-01 | null              | premium |
The great advantage of this approach is that the data is very clean and easy to work with:
- Every day you’ll get the exact state of the objects without needing to make any assumptions.
- Any change that is made in the software will be reflected, just take the output from the black box.
- Data is already in entity/day format, which is a great format for analysis.
- Data is easy to aggregate – just write group by queries or connect any BI tool.
The downsides of this approach:
- It comes with a maintenance burden: the job needs to be executed successfully every day.
- It demands storage, which is less of a problem as storage gets cheaper and cheaper.
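A daily dump job can be as small as the following sketch (hypothetical names and paths, assuming a PostgreSQL production database and the psycopg2 driver):

```python
# Minimal daily dump job: copy the whole table to a dated CSV file that is
# later synced into the data warehouse. All names here are illustrative.
import datetime

import psycopg2


def dump_subscriptions(dsn: str, out_dir: str = "/dumps") -> str:
    today = datetime.date.today().isoformat()
    path = f"{out_dir}/subscriptions_{today}.csv"
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur, open(path, "w") as f:
        # Stream the table out of Postgres as CSV with a header row.
        cur.copy_expert("COPY subscriptions TO STDOUT WITH CSV HEADER", f)
    return path
```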
- Regularly take a snapshot of the changing table right in the database.
The data will look the same as in the first option, except it will live right in the database, and it will be enough to simply sync your production database into the data warehouse.
The difference is that it comes with a lower maintenance burden because there are more tools to do this out of the box. However, the storage problem becomes more serious because you do this inside the database.
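A minimal sketch of such a snapshot job, assuming a hypothetical subscription_snapshots table created next to the production table:

```python
# Append today's state of every subscription to a history table that lives
# in the same database; table and column names are illustrative.
import psycopg2

SNAPSHOT_SQL = """
    INSERT INTO subscription_snapshots (dump_date, user_id, start_date, cancellation_date, plan)
    SELECT CURRENT_DATE, user_id, start_date, cancellation_date, plan
    FROM subscriptions;
"""


def snapshot_subscriptions(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(SNAPSHOT_SQL)
```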
- Use object tracking tools like django-reversion that will dump the previous state of the object into a table on any change.
This inflates the database a lot, and queries to get meaningful data out of the snapshots will be quite ugly and slow (more work on the data side). The advantage of this approach is that it takes nearly zero time to implement on the engineering side.
Use this option only if you’ve developed the software already and you’d like to collect past states as soon as possible while working on a more long-term solution.
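Roughly, wiring django-reversion into the hypothetical Subscription model looks like this (a sketch based on the library's documented register/create_revision usage):

```python
# django-reversion stores a serialized copy of the object every time it is
# saved inside a revision block, so past states can be recovered later.
import reversion
from django.db import models


@reversion.register()
class Subscription(models.Model):
    user_id = models.IntegerField()
    start_date = models.DateField()
    cancellation_date = models.DateField(null=True, blank=True)
    plan = models.CharField(max_length=20)


def upgrade_to_premium(user_id: int) -> None:
    with reversion.create_revision():  # the old state is serialized on save
        sub = Subscription.objects.get(user_id=user_id)
        sub.plan = "premium"
        sub.save()
        reversion.set_comment("subscription plan upgrade")
```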
- Use event-driven architecture.
The software will create a special record that stores all necessary data and metadata for any business event. In our use case, the event table can look like this:
| event                    | related_id | timestamp           | old_value | new_value |
|--------------------------|------------|---------------------|-----------|-----------|
| new subscription         | 999        | 2018-01-01 12:00:00 | null      | basic     |
| subscription plan update | 999        | 2018-02-01 18:00:00 | basic     | premium   |
This is very storage-friendly and robust. On the flip side, unlike the above-mentioned methods, this one requires development effort, preferably at the earlier stages of the project.
In addition, this data can’t be used “as is” by most analytics tools. It has to be transformed into the same entity/day historical table that has the field values changing over time based on the event sequence.
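A minimal sketch of such an event log in Django terms (hypothetical model and field names mirroring the table above):

```python
# Every business change writes an immutable event row; the current object can
# still be mutated, but its history is now recoverable from the event stream.
# Subscription is the illustrative model sketched earlier in this article.
from django.db import models
from django.utils import timezone


class SubscriptionEvent(models.Model):
    event = models.CharField(max_length=50)    # e.g. "subscription plan update"
    related_id = models.IntegerField()         # the subscription/user it refers to
    timestamp = models.DateTimeField(default=timezone.now)
    old_value = models.CharField(max_length=20, null=True)
    new_value = models.CharField(max_length=20)


def change_plan(user_id: int, new_plan: str) -> None:
    sub = Subscription.objects.get(user_id=user_id)
    SubscriptionEvent.objects.create(
        event="subscription plan update",
        related_id=user_id,
        old_value=sub.plan,
        new_value=new_plan,
    )
    sub.plan = new_plan
    sub.save()
```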
- Design the class and methods in a way that allows the database to keep historical states.
This will require adding an “end_date” property, letting any method “close” the current object (by assigning “end_date”), and creating a new object with the new state on every change.
Below is what the data would look like before a change:
| subscription_id | user_id | start_date | end_date | plan  |
|-----------------|---------|------------|----------|-------|
| 1000            | 999     | 2018-01-01 | null     | basic |
… and after a change:
| subscription_id | user_id | start_date | end_date   | plan    |
|-----------------|---------|------------|------------|---------|
| 1000            | 999     | 2018-01-01 | 2018-01-31 | basic   |
| 1001            | 999     | 2018-02-01 | null       | premium |
Basically, for every change a new subscription record is created, and every subscription has its own time span. This time span doesn’t intersect with any other subscription of the same user/product.
This model also supports contract pausing. In this case, there will be a gap between the end_date and the next start_date.
It doesn’t take a lot of extra storage or extra effort from a software engineering standpoint, and this data structure is easily converted into daily state records like those in the first and second options.
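A sketch of what this can look like in the hypothetical Django model, with a method that closes the current record and opens a new one on every plan change:

```python
# Historical model: each row owns a non-overlapping time span, and a change
# closes the current row and creates a new one instead of overwriting it.
from datetime import date, timedelta

from django.db import models, transaction


class Subscription(models.Model):
    user_id = models.IntegerField()
    start_date = models.DateField()
    end_date = models.DateField(null=True, blank=True)  # null = currently active
    plan = models.CharField(max_length=20)

    def change_plan(self, new_plan: str, change_date: date) -> "Subscription":
        with transaction.atomic():
            self.end_date = change_date - timedelta(days=1)  # close the old span
            self.save()
            return Subscription.objects.create(
                user_id=self.user_id,
                start_date=change_date,
                plan=new_plan,
            )
```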
Here is a rough summary of the above methods to help prioritize them for your business case:
| Method                    | Software Eng. development | Data Eng. development | Data Eng. maintenance | Need to transform data | Query performance | Storage sensitivity |
|---------------------------|---------------------------|-----------------------|-----------------------|------------------------|-------------------|---------------------|
| Flat dump                 | Zero                      | High                  | High                  | Zero                   | High              | Medium              |
| DB snapshot table         | Medium                    | Zero                  | Low                   | Zero                   | High              | High                |
| Object tracking           | Low                       | Zero                  | Low                   | High                   | Low               | High                |
| Event-driven architecture | Medium                    | Zero                  | Low                   | Medium                 | Medium            | Low                 |
| Historical model          | Medium                    | Zero                  | Low                   | Low                    | Medium            | Low                 |
How to Implement the Solution
Practice shows that the last two options are the best, and it is better to make one of them the company’s engineering standard.
Here are the ways to do it:
- Promote data awareness among software engineers. Make it clear that all objects should remain unchanged once created, or that an event should be issued on any change. No updates should be made without storing the information about the old state. In addition to that, there are soft methods. One good way to seed data awareness is to give presentations about the company’s metrics, how they’re defined, and which historical data is used. Engineers are smart and curious people. They are typically interested in the company they work for, and it starts making more sense for them when they see the metrics. If they know about the analytical pipeline behind the software they write, they will be more conscious about proposing the right designs. Another possible way is to promote the idea that:
- A good developer thinks about the present: how to develop a bug-free feature that works according to business requirements.
- A great developer thinks about the future: they see the big picture of the overall architecture and consider how well the new feature will fit with possible future changes and other subsystems.
- An awesome developer thinks about the past: proper logging of meaningful information for the business.
Everyone wants to be awesome; give them a chance!
- Let the analytics manager review and approve developer planning documentation and/or pull requests (assuming they have an engineering background). There should be a right to say “no” when a planned change to the software is not compliant with the analytical workflows. At the same time, data engineers should provide a constructive explanation of how to make it compliant.
Besides blocking changes that would ruin analytical pipelines, this is especially useful for knowledge transfer. Even if the software is poorly documented, the data team participates in discussions and gets to know how the software actually works.
- Schedule at least one 1:1 meeting with the analytics manager as a mandatory part of the onboarding plan for newly hired engineers. This creates ties between data engineers and software engineers, which are very important here, and prepares developers to understand the points above.
As already said, this initiative will require approval on the highest level, but the results will be rewarding. Putting slightly more thought into the software design from the beginning will save engineering time. You will avoid further iterations to fix historical tracking, and can collect the right data from the very beginning, which is very important for business.
Conclusion
In this article, we covered the most basic points about making a company data driven. This is, of course, not a complete list of problems the company might face during the data journey. Nevertheless, paying attention to them might already bring significant positive impact and be useful for both technical and non-technical stakeholders.