DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

The software you build is only as secure as the code that powers it. Learn how malicious code creeps into your software supply chain.

Apache Cassandra combines the benefits of major NoSQL databases to support data management needs not covered by traditional RDBMS vendors.

Generative AI has transformed nearly every industry. How can you leverage GenAI to improve your productivity and efficiency?

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workloads.

Related

  • A Guide to Using Amazon Bedrock Prompts for LLM Integration
  • Boosting Efficiency: Implementing Natural Language Processing With AWS RDS Using CloudFormation
  • Setting Up CORS and Integration on AWS API Gateway Using CloudFormation
  • AWS CDK: Infrastructure as Abstract Data Types, Part 3

Trending

  • Integrating Model Context Protocol (MCP) With Microsoft Copilot Studio AI Agents
  • The End of “Good Enough Agile”
  • How Kubernetes Cluster Sizing Affects Performance and Cost Efficiency in Cloud Deployments
  • Implementing API Design First in .NET for Efficient Development, Testing, and CI/CD
  1. DZone
  2. Testing, Deployment, and Maintenance
  3. Deployment
  4. AWS Step Function for Modernization of Integration Involving High-Volume Transaction: A Case Study

AWS Step Function for Modernization of Integration Involving High-Volume Transaction: A Case Study

The serverless offerings of AWS are getting more and more popular. But it remains a challenge to know them well enough to leverage them properly.

By 
Satyaki Sensarma user avatar
Satyaki Sensarma
·
Oct. 09, 22 · Analysis
Likes (3)
Comment
Save
Tweet
Share
4.1K Views

Join the DZone community and get the full member experience.

Join For Free

The serverless offerings of AWS (or any other public cloud provider today) are getting more and more popular. One of the biggest reasons is of course not needing to think about infra (server provisioning and so on). The other reasons include the offerings themselves, which at times fit well into the technical need for finding solutions to the given problems. But on the other side, it remains a challenge to know the offerings in detail to leverage them properly. There could be always some new feature hidden deep inside the bulk of the documentation.

Background

A retail company decided to leverage the AWS Serverless offering for modernizing their on-premises integrations for order management business process. We can go serverless in the cloud in various ways, but the initial solutioning was decided to leverage the Step Function and Lambda Function along with some more integration offerings from AWS, namely SNS, SQS, and EventBridge. 

The coding was done, so the unit testing and system integration testing followed by the UAT. Everything looked fine. It went live and started handling the transactions quite well. And then one day, there was an unplanned AWS outage spanning almost a whole business day. What could happen next? Let’s see.

Challenge With the Initial Solution

The outage brought on a disastrous situation. While the integration was designed to get triggered by an EventBridge schedule, it was the Lambda Function which would pull the inbound payloads from the SQS which would in turn subscribe to the SNS topic where the order management system would publish the payloads when an order would be placed from the portal, due to the outage, the inbound payloads got piled up in the SQS. Not only that, but the messages which were ingested into the database got piled up unprocessed.

And then the real challenge began once the outage was restored. The flows started progressing, but the rate of the payloads getting processed were too low to cope with the business need. That low rate being allowed, the whole set of pending transactions would take months to complete, which was obviously not rationally acceptable. So, the need for the moment became to enhance the solution in a way that scales up when needed to reduce the backlogs.

Improved Solution

The current design had three Lambda Functions — the first one would pull the payloads from the SQS and insert it into the database, the second one would parse those and insert the values as records in another table in the database, the third one would process the records, i.e., invoke Oracle ERP APIs (the company is using Oracle ERP as matter of their choice) to create, update, or delete the purchase orders there. The Step Function would be triggered by the EventBridge rule. At first, the Lambda 1 would be executed inserting the pulled payloads into the table. The Lambda 2 would then iterate over the unparsed payloads in the database to parse each of the payloads and insert the respective records into the database. Lambda 3 would then iterate over the unprocessed records in the table and process those as required by invoking APIs and so on.

The catch is: The queries used to select the unparsed or unprocessed records from the database would be limited by the maximum number of records those could fetch based on how much time the respective Lambda Function would take to process those. This is required because Lambda Function in AWS can run only up to 15 minutes, beyond which it will time out and stop its execution abruptly. 

That’s fair enough from the AWS offering perspective, because Lambda Function is not intended to be used for long-running jobs. Also, it's fair to raise a question on the initial design decision to use Lambda Function for this integration. A better solution could be leveraging an EC2 instance or a containerized solution leveraging AWS Fargate, if staying serverless was one of the key requirements. But here, we would focus only on the capability of the Step Function to scale up to handle the given situation. The same feature of the Step Function can be leveraged in other scenarios that might appear when modernizing integrations, especially when there is a need for orchestration or workflow and when we prefer to go serverless.

AWS Step Function comes with some features that help achieving parallel execution. The State Machine of a Step Function is basically a collection of states and there are some defined state types which must be mentioned for each state. The state type that is commonly used to trigger a Lambda Function is "Task." As the name explains, the Lambda Function is supposed to perform some tasks. 

Out of the remaining state types, two which are relevant to this topic are "Map" and "Parallel." Well, the answer to this problem is Map State, but why am I mentioning the "Parallel?" There is a thin line of difference between them, but the names can really be misleading at times. It is worth explaining the difference in this context. 

To explain in simple words — a Parallel State is for performing two or more set of tasks in parallel but on the same input. A Map State is for performing a same set of tasks concurrently on each item of a given list of inputs. The below diagrams can be helpful to understand it better.

The above diagram may make it clear why Map State is the desired option in this case. To explain the improved solution more specifically in detail:

  1. The Lambda 2 and 3 need to be inside two Map States respective to each of them.
  2. A new Lambda 0 needs to be incorporated which would first prepare the input arrays of the Lambda 2 and 3. It would do that by querying the database tables to fetch the unparsed and unprocessed records.
  3. The Lambda 1 will not require to be in a Map State, as that would still pull the messages from the SQS and insert into the table.
  4. In every execution of the State Machine, the Lambda 0 would prepare the input arrays based on what was inserted into the tables by the Lambda 1 and the Lambda 2 in the previous execution of the State Machine which would respectively become the input for the Lambda 2 and the Lambda 3 in the current execution of the State Machine.
  5. The input array of Lambda 2 would be a list of the list of unique identifiers of the inserted payloads. The input array of Lambda 3 would be a list of the list of unique identifiers of the parsed records.
  6. Step Function requires the input array to be set at the field ItemsPath (this is specific to the Map State, i.e., “Type”: “Map”. It would be set from the Lambda Function code of the Lambda 0 and would be propagated through the Step Function.
  7. The value of the field MaxConcurrency would be set as 0 (the default value so also the same) which means there is no quota on parallelism and the State Machine would execute as much in parallel as possible. However, the architect/development team would decide the sizes of the inner and outer lists of the input arrays which would respectively mean the number of payloads/records that the Lambda Function would be allowed to process per instance of execution and the number of instances of the execution (in other words, the number of iterations) which in turn would be the limit of parallelism as well. The database would be queried by the Lambda 0 according to these configurable numbers (Lambda Function’s environment variable could be leveraged).

After this solution was implemented, the backlog took around 24 hours to get cleared. However, it was not an all-green situation. The backlog was cleared but the Operations Team observed further challenges over time which were even more deep-rooted to the fundamental design decisions of this solution.

Further Challenges

Like it happens in any standard retail order management business process, the high-level steps of the order journey remain, like Purchase Order (PO) Creation > Receipt of the PO > Invoice Generation of the PO > Return/Replacement Processing (if any). There were integrations for each of these modules which were modernized following the same design pattern. That said, the challenges observed over the time were as below.

  • Time Lag Issue – Many times, it was observed that the PO for which the integration tried to generate the invoice or receipt was not even created in the Oracle ERP at that point of time. It could be due to various reasons, but typically, the observation was depending on the transactional volume. Some PO creation and the respective receipt creation would get delayed, and by the time those were processed, the integration would get the invoice generation request. This is mainly because the source application responsible for the invoice generation is agnostic about the PO creation status.

A possible solution could be usage of SNS FIFO topic with SQS FIFO queue to achieve the fan-out pattern across all these integrations to ensure the sequence of events are processed in the same sequence as generated. But there could be many other ways of solving this issue (e.g., publishing the PO creation status to an SNS topic and have all the involved systems subscribed to that topic to sync it across those systems and so on) subject to the technical feasibilities and dependencies.

  • Intermittent Lack of Parallelism Issue – Despite using the Map State in the Step Function, it was observed that sometimes some iterations would not start until the previous ones are completed. This issue increases especially when the input array size is greater than 40. This is a known limitation as mentioned in the AWS documentation.

There could be two possible ways to resolve this.

  1. The concurrency limit could be restricted such that it does not exceed 40. However, that might again impact or limit the desired performance of the integration.
  2. In order to achieve higher concurrency, the State Machines could be nested that cascades the Map States. In this way, the Step Function code would become more complex but higher concurrency can be achieved.

Takeaway

Modernization of integrations using Lambda Functions alone may not always be a favorable option, especially when the integration has huge transformation logic and a high volume of transactional requirements. It may shoot up the cost too high in those cases, or sometimes fail miserably to scale up. However, some of the Step Function features (when used to orchestrate the tasks defined in Lambda Functions) may be helpful to come up with decent and flexible solutions in terms of scalability.

Sample Code of Step Function

The below is sample code with similar orchestration logic as described in the improved solution for this case study. This is not the exact solution, but just an indicative sample.

JSON
 
{
  "Comment": "A description of my state machine",
  "StartAt": "Process ID Concurrently",
  "States": {
    "Process ID Concurrently": {
      "Type": "Map",
      "ItemsPath": "$.id",
      "Iterator": {
        "StartAt": "Perform ID Task",
        "States": {
          "Perform ID Task": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:495125011821:function:ProcessUniqueID:$LATEST",
            "End": true
          }
        }
      },
      "ResultPath": null,
      "Next": "Process Name Concurrently"
    },
    "Process Name Concurrently": {
      "Type": "Map",
      "ItemsPath": "$.name",
      "Iterator": {
        "StartAt": "Perform Name Task",
        "States": {
          "Perform Name Task": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:495125011821:function:ProcessUniqueName:$LATEST",
            "End": true
          }
        }
      },
      "End": true
    }
  }
}

Sample Execution

The above sample Step Function, when executed with an input as seen in the screenshot below, identifies the iterations for each Map State and executes them concurrently.

The below timestamps show that each of the three iterations of the first Map State started exactly at the same time, ensuring concurrency.

The below timestamps show that each of the two iterations of the second Map State started exactly at the same time, ensuring concurrency. Also, it is noticeable that this state started after the first Map State was completed, i.e., two separate Map States are not concurrently executed.

The below screenshot shows the three iterations for the first Map State have taken each item in the outer list of the JSON input as their individual inputs.

The below screenshot shows the two iterations for the second Map State have taken each item in the outer list of the JSON input as their individual inputs.

The below screenshot shows the CouldWatch Log for the Lambda (ProcessUniqueID) which is invoked from the first Map State to perform the ID Task.

The below screenshot shows the CouldWatch Log for the Lambda (ProcessUniqueName) which is invoked from the second Map State to perform the Name Task.


AWS Integration

Opinions expressed by DZone contributors are their own.

Related

  • A Guide to Using Amazon Bedrock Prompts for LLM Integration
  • Boosting Efficiency: Implementing Natural Language Processing With AWS RDS Using CloudFormation
  • Setting Up CORS and Integration on AWS API Gateway Using CloudFormation
  • AWS CDK: Infrastructure as Abstract Data Types, Part 3

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!