Intro to Data Integration Patterns – Aggregation
[This article was originally written by Darko Vukovic.]
In this post I want to close the loop by introducing you to the last of the five initial patterns that we are basing our Anypoint Templates on. I am sure that as we continue creating templates we will continue discovering new data integration patterns. If you are just entering at this post, I would recommend that you look through the previous four posts to understand the other patterns. I generally do not repeat things that overlap between the patterns, so I would encourage you to work your way through all five posts if interested.
Pattern 5: Aggregation
What is it?
Aggregation is the act of taking or receiving data from multiple systems and inserting it into one. For example, let's say I have my customer data in three different systems, and I want to generate a report which uses data from all three of them. I could create a daily migration from each of those systems to a data repository and then query against that database. But then I would have a whole other database to worry about and keep synchronized. As things change in the three other systems, I would have to constantly make sure that I am keeping the data repository up to date. Another downside is that the data would be a day old, so if I wanted to see what was going on today, I would have to either initiate the migrations manually or wait.

If I set up three broadcast applications instead, I could achieve a situation where the reporting database is always up to date with the most recent changes in each of the systems. Still, I would need to maintain a database whose only purpose is to store replicated data so that I can query it every so often, not to mention the number of wasted API calls needed to keep that database within x minutes of reality.

This is where the aggregation pattern is really handy. If you build an application on it, or use one of our templates that is built on it, you will notice that you can query multiple systems on demand, merge the data sets, and do as you please with the result. So in the example above, you can build an integration app which queries the various systems, merges the data, and then produces a report. This way you avoid having a separate database, and you can have the report arrive in a format like .csv, or a format of your choice. Similarly, if there is a system where you store reports, you can place the report there directly.
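The on-demand flavor of the pattern can be sketched in a few lines. This is a minimal illustration, not the template implementation: each source system is stood in for by a hypothetical supplier of records, and the "merge" is a simple concatenation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Minimal sketch of on-demand aggregation: query every source at request
// time and combine the results, so no staging database is needed.
public class OnDemandAggregator {

    // Each Supplier stands in for a live query against one source system.
    public static List<String> aggregate(List<Supplier<List<String>>> sources) {
        List<String> merged = new ArrayList<>();
        for (Supplier<List<String>> source : sources) {
            merged.addAll(source.get()); // live query, not a replicated copy
        }
        return merged;
    }
}
```

In a real integration app each supplier would be a connector query against a live system, so the merged result is as fresh as the moment you ran it.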
Why is it valuable?
The aggregation pattern derives its value from allowing you to extract and process data from multiple systems in one application. This means that the data is up to date at the time that you need it, does not get replicated, and can be processed/merged to produce the dataset you want. You can see the aggregation pattern in action with our Anypoint Templates where we use it to generate and email a CSV report of content in two different Salesforce instances.
When is it useful?
The aggregation pattern is valuable if you are creating orchestration APIs to “modernize” legacy systems, especially when you are creating an API which gets data from multiple systems, and then processes it into one response. The other great use case is for creating reports or dashboards which similarly have to pull data from multiple systems and create an experience with that data. Lastly, you may have systems that you use for compliance or auditing purposes which need to have related data from multiple systems. The aggregation pattern can be very helpful here in making it so that your compliance data lives in one system but can be the amalgamation of relevant data from multiple systems. This way you can reduce the amount of learning that needs to take place across the various systems to ensure you have visibility into what is going on.
What are the key things to keep in mind when building applications using the aggregation pattern?
Many of the considerations covered in the previous posts apply here as well, so I would again recommend reading those to understand them better. The considerations most specific to the aggregation pattern are below.
- Collecting data:
There are two ways to approach collecting data in an aggregation pattern. You can create a listener-type application which waits for messages from multiple systems and aggregates them in real time; the input could be a file, a bulk set of data, or streaming individual objects. This form of application is always available and listens for events. Or, you can create an application that is triggered, like our migration pattern: it is initiated, does what it does, and gives you a result.

In general, I find it easier to think about applications as multiple broadcasts that potentially feed into the same destination system, rather than one application which is responsible for collecting data from multiple sources in real time. In other words, it is easier to think about design in terms of "I have this data, which other systems need to have it?" rather than "I need this data, which systems can I get it from?" The main reason is that when you stand up a new system, it is easier to go to the systems/queues that already have the data and add the new system to the publish list than to build a new application for collecting data. If you make changes to the data at the source system, you then have only one place to adjust what happens to that data, instead of first having to know which applications get notifications from that system and then updating each of them. So, in general, we reserve the aggregation pattern for collecting data on demand or for orchestrating data, rather than for synchronizing data.

One last thing to disambiguate is the difference between a many-to-one migration and aggregation. Although you can create a migration that collects data from many systems and inserts it into one system, we generally think of a migration as the movement of existing data to the destination system where the data will reside.
You can think of a migration as a solid connection which is meant to move the data, transforming each object to a new model while preserving its value. Aggregation, by contrast, is meant to correlate, merge, and potentially completely transform the objects, and in general the new data set is not stored in a system where it will continue to be used while the origin is deprecated. There is a blurry line on which pattern you should use when you have two existing systems that do the same thing and you want to move the data from both of them to a third system that will be the new system that does that thing. You can either run two migrations or use one aggregation template. I would recommend the migration pattern, because it was designed to handle much larger datasets: it processes sets of records at a time, whereas the aggregation pattern tries to pull the whole set at once, which could be a problem for extremely large data sets. For systems with thousands of records you would be fine using either.
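To make the scaling difference concrete, here is a rough sketch of the paging that makes migration friendlier to very large datasets. The `fetchPage(offset, limit)` source API here is hypothetical; the point is that only one fixed-size page is held at a time, rather than the whole set as in a single aggregation pull.

```java
import java.util.List;
import java.util.function.BiFunction;

// Sketch of page-at-a-time processing, the reason migration copes with
// much larger datasets than a single pull-everything aggregation.
public class PagedFetcher {

    // fetchPage(offset, limit) is a hypothetical source API returning at most
    // `limit` records starting at `offset`; an empty page means we are done.
    public static int migrateInPages(BiFunction<Integer, Integer, List<String>> fetchPage,
                                     int pageSize) {
        int processed = 0;
        int offset = 0;
        while (true) {
            List<String> page = fetchPage.apply(offset, pageSize);
            if (page.isEmpty()) {
                break; // source exhausted
            }
            processed += page.size(); // insert this page into the destination here
            offset += pageSize;
        }
        return processed;
    }
}
```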
- Source data scope:
Since we use different connectors for each system we are pulling data from, or different instantiations of the same system (in the case of our Salesforce to Salesforce templates), you have the ability to write a different query for each of those systems. This means that you can do things like get accounts that are older than two years in one system and accounts that are older than one year in the other.
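A toy sketch of per-source scoping, assuming record creation dates stand in for full account objects: each source gets its own cutoff, mirroring how each connector in the template can carry its own query.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.stream.Collectors;

// Sketch: each source system gets its own scoping filter, just as each
// connector in the template can run a different query.
public class SourceScope {

    // Keep only the records created before the given cutoff date.
    public static List<LocalDate> olderThan(List<LocalDate> created, LocalDate cutoff) {
        return created.stream()
                .filter(d -> d.isBefore(cutoff))
                .collect(Collectors.toList());
    }
}
```

System A could be scoped with a cutoff of two years ago and system B with one year ago, simply by passing a different cutoff per source.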
- Merging the multiple datasets:
Once you have queried across the systems, you will most likely have stored the results in a local store for the duration of the app run. At this point, you need to choose how to merge the datasets, or what else to do with them. In our aggregation templates we store the results as Java objects, at which point we use custom logic to merge them. This is one of those cases where each problem will have different needs for how the data should be merged, which is why we use a custom Java class in our templates: it leaves room for the developer who extends or modifies the templates to build in their own logic.

For example, our logic takes the two sets and generates a list where we first show the objects that are in Salesforce A with some fields, then the ones in Salesforce B, and lastly the objects that are in both. You can imagine how a report like this is valuable to someone who is checking for data consistency, or wants to see what is being managed where and where the intersection lies. Similarly, you can imagine how someone may have completely different needs: they may want a report which only shows the objects that are in both, or one that merges the data and takes specific fields from one system and none from the others. Or maybe one that looks for fields that are missing between the objects and then kicks off a flow to populate the missing data, or that looks for contact information and generates a report where the email value for a contact differs between the two systems. Since there is no prescriptive solution here, we left this as a custom class to give you the flexibility to do what you need with the aggregated data.
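The classification described above (objects only in A, only in B, and in both) can be sketched like this. The string keys are hypothetical stand-ins for Salesforce record identifiers; real merge logic would compare full objects field by field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of merge logic: classify records by key into A-only, B-only,
// and both, the three sections of the example report described above.
public class MergeReport {

    public static Map<String, List<String>> classify(List<String> a, List<String> b) {
        List<String> onlyA = new ArrayList<>();
        List<String> onlyB = new ArrayList<>();
        List<String> both = new ArrayList<>();
        for (String k : a) {
            (b.contains(k) ? both : onlyA).add(k); // shared keys go to "both"
        }
        for (String k : b) {
            if (!a.contains(k)) {
                onlyB.add(k);
            }
        }
        Map<String, List<String>> out = new LinkedHashMap<>();
        out.put("onlyA", onlyA);
        out.put("onlyB", onlyB);
        out.put("both", both);
        return out;
    }
}
```

Swapping in different logic here, for instance keeping only the "both" bucket, or diffing individual fields, is exactly the kind of change the custom class leaves room for.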
- Formatting the data:
Once you have performed the merge logic on the aggregated data, this is where you convert it from the generic Java object to the format that you want. In our aggregation templates we convert the Java object which is the result of the merge to a .csv format. We chose .csv because our template sends the file in an email, so it is generic enough to serve as a good illustration of what is possible. If you were using this pattern in an API, you might want to convert to a JSON or XML format; if you are inserting the result into a database, you might keep the Java objects but transform them further.
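As an illustration of this formatting step, here is a minimal CSV conversion, assuming the merged result has already been flattened into rows of string fields. It is a sketch, not the template's transformer; among other simplifications, it quotes only fields containing commas and does not handle embedded quotes or newlines.

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the final formatting step: turn merged records (rows of
// string fields) into CSV text ready to be attached to an email.
public class CsvFormatter {

    public static String toCsv(List<String> header, List<List<String>> rows) {
        StringBuilder sb = new StringBuilder(String.join(",", header)).append("\n");
        for (List<String> row : rows) {
            sb.append(row.stream()
                    .map(CsvFormatter::escape) // quote fields that contain commas
                    .collect(Collectors.joining(",")))
              .append("\n");
        }
        return sb.toString();
    }

    private static String escape(String field) {
        return field.contains(",") ? "\"" + field + "\"" : field;
    }
}
```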
- Insert scope:
This only applies if you are inserting the data into another system rather than sending it out or saving it as a file. If you are inserting the data into another system, you still have control over the insert statement where you can additionally modify how the data will be stored. For example, if you pulled data out of your ERP and Issue Tracking systems, and are inserting that data into a CRM, you may want to create a new object where the merged data will be instantiated, or you may choose to extend an existing object and have the new fields be stored.
- Additional destinations:
Lastly, our aggregation templates to date have all been built to just send an email of the report that is generated. You may decide that you want the application to update a database, update Salesforce, and email a report. To do this, you can expand our template by adding additional outbound flows to be called from our mainFlow. Because the data is processed and formatted in the same way, you would just need to consume it in three different places and handle how you map and insert the data into the three different systems.
To summarize, the aggregation pattern is the fifth pattern that we identified for the initial set of templates that we have provided. It is very useful when building APIs that collect and return data from multiple systems, or when you want to report across multiple systems. It uses custom logic that can be modified to merge and format the data as you would like, and it can easily be extended to insert the data into multiple systems.
I hope that this series was valuable in understanding the base patterns that our applications are built from. Again, if you have any questions or comments, or would like to discuss these further, please leave a comment below!
Published at DZone with permission of Ross Mason , DZone MVB. See the original article here.