Good Workflows? Think Software Design
For the most efficient and well-run workflows, start by thinking about the software design behind the workflow itself.
Code is Code
Orchestrating data pipelines, training ML models, detecting fraud in AML flows, or applying preventive maintenance analytics to maximize oil well production are all use cases in which workflow solutions play a significant role. As an industry, we generally focus on the code being run by workflow tools but rarely allocate the same level of attention to the design and maintenance of the workflows themselves. This blog puts forward the proposition that workflows are code, whether written in JSON, XML, Python, Perl, or Bash, and all code should be treated with the respect it deserves, especially the code that frequently is responsible for operationalizing and thus delivering the applications and services that power our digital economy.
Here are some practices submitted for your consideration along with capabilities that you may want to include in your tool selection.
What Does “Good” Look Like?
Well, It Should Be Code
I’ll begin with what should be universally agreed but is frequently not done: version control is mandatory. The corollary is that workflows are or can be written in some text notation that can be managed by version control tools.
The code in version control is the code that’s run everywhere. If different environments have different requirements, like names or filesystem references or credentials, etc., there should be a deployment process that can apply rules that modify the code automatically for the target environment.
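As a minimal sketch of such a deployment step, assume the workflow is stored as a JSON document and that environment differences can be expressed as simple key overrides. The workflow name, the field names (`host`, `input_dir`, `credential_ref`), and the environment names are all hypothetical and not tied to any particular tool.

```python
import copy
import json

# Workflow definition exactly as committed to version control (hypothetical fields).
WORKFLOW_JSON = """
{
  "name": "LoadCustomerData",
  "host": "dev-host",
  "input_dir": "/data/dev/incoming",
  "credential_ref": "dev-db-login"
}
"""

# Per-environment rules, also kept under version control (hypothetical values).
ENV_RULES = {
    "test": {
        "host": "test-host",
        "input_dir": "/data/test/incoming",
        "credential_ref": "test-db-login",
    },
    "prod": {
        "host": "prod-host",
        "input_dir": "/data/prod/incoming",
        "credential_ref": "prod-db-login",
    },
}


def render_for_environment(workflow: dict, env: str) -> dict:
    """Return a copy of the workflow with the rules for `env` applied."""
    if env not in ENV_RULES:
        raise ValueError(f"no deployment rules defined for environment {env!r}")
    rendered = copy.deepcopy(workflow)  # never mutate the committed definition
    rendered.update(ENV_RULES[env])
    return rendered


committed = json.loads(WORKFLOW_JSON)
prod_workflow = render_for_environment(committed, "prod")
print(json.dumps(prod_workflow, indent=2))
```

The point of the sketch is that nobody edits the production copy by hand: the committed definition and the rules are both versioned, and the pipeline produces the environment-specific result.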
Do not allow or tolerate any manual modification of the code you tested and committed to the SCM. If changes are required for bug fixes or any other reasons, check out the code, fix it, test it, commit it, and drive the release through an automated CI/CD pipeline.
Avoid monoliths. This applies to workflows just as it does to application code. Identify functional components or services. Use an “API-like” approach for workflow components to make it easy to connect, re-use and combine them, like this:
Service (Flow) A:
Do something1, emit “something1 done”
Emit “something2 running”, Do something 2, Emit “something2 done”
Emit “Service A” done
Wait for “Service A” done
Do B thing, emit BThing done
DO NOT run while something2 is running
Wait for BThing done
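The outline above can be sketched in a few lines of Python. This is only an illustration under assumptions: the in-memory event bus is hypothetical (a real workflow tool would supply its own eventing), the names `service_b` and `service_c` are my labels for the flows that wait on "Service A done" and "BThing done", and the "DO NOT run while something2 is running" exclusion is not modeled.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical in-memory event bus; a real workflow tool would provide this.
subscribers: Dict[str, List[Callable[[], None]]] = defaultdict(list)
event_log: List[str] = []


def on(event: str, handler: Callable[[], None]) -> None:
    """Register a flow to start when `event` is emitted."""
    subscribers[event].append(handler)


def emit(event: str) -> None:
    """Record the event and start every flow that is waiting on it."""
    event_log.append(event)
    for handler in list(subscribers[event]):
        handler()


def service_a() -> None:
    emit("something1 done")      # emitted after doing something1
    emit("something2 running")
    emit("something2 done")      # emitted after doing something2
    emit("Service A done")


def service_b() -> None:
    emit("BThing done")          # emitted after doing the B thing


def service_c() -> None:
    emit("Service C done")       # hypothetical completion event


on("Service A done", service_b)  # this flow waits for "Service A done"
on("BThing done", service_c)     # this flow waits for "BThing done"

service_a()
print(event_log)
```

Each flow knows only the events it emits and the events it waits for, which is what makes the components easy to recombine.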
Don’t Reinvent the Wheel
If you have a common function, create a single workflow “class” that can be “instantiated” as frequently as required yet maintained only once.
Instead of creating multiple versions of a service, use variables or parameters that can accommodate the variety.
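One way to picture a workflow "class" is a single parameterized definition that is instantiated once per use. The sketch below assumes a simple file-loading flow; the class name, parameters, and task descriptions are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileLoadWorkflow:
    """One definition, maintained once, instantiated per data source."""

    source_dir: str
    target_table: str
    file_pattern: str = "*.csv"

    def tasks(self) -> List[str]:
        # The task sequence is identical for every instance; only parameters differ.
        return [
            f"wait for {self.file_pattern} in {self.source_dir}",
            f"cleanse files from {self.source_dir}",
            f"load into {self.target_table}",
        ]


orders = FileLoadWorkflow(source_dir="/incoming/orders", target_table="orders")
returns = FileLoadWorkflow("/incoming/returns", "returns", file_pattern="*.json")

for step in orders.tasks() + returns.tasks():
    print(step)
```

A fix to the cleanse step is made in one place and every instance picks it up, instead of being copied into several near-identical workflows.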
Rerun Versus Instantiate
Consider the differences between launching several instances of a workflow versus a single instance that executes over and over. For example, a set of tasks pulls new data from a location where it is dropped by a third party, cleanses the data, and then loads it into some persistent store.
If there is no particular order or relationship among the various datasets that are processed, multiple instances of the flow may be a good approach. If datasets must be processed in sequence, execution of each instance should be dependent on the previous instance. If the duration of the non-processing time is important (e.g. once a dataset is processed, we must wait x minutes before processing the next one) a single cycling instance that begins a fixed amount of time after the previous cycle ends may be the right solution.
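Two of these options can be made concrete with a small sketch. The function names and dataset identifiers below are hypothetical: the first function chains separate instances so that each waits for its predecessor, and the second computes the next start of a single cycling instance from the end of the previous cycle.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def chain_instances(dataset_ids: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Pair each instance with the instance it must wait for (None for the first)."""
    return [
        (current, dataset_ids[i - 1] if i > 0 else None)
        for i, current in enumerate(dataset_ids)
    ]


def next_cycle_start(previous_end: datetime, wait: timedelta) -> datetime:
    """Single cycling instance: the gap is measured from the END of the last cycle."""
    return previous_end + wait


print(chain_instances(["ds-001", "ds-002", "ds-003"]))

previous_end = datetime(2024, 1, 15, 12, 7)  # arbitrary example timestamp
print(next_cycle_start(previous_end, timedelta(minutes=10)))
```

Measuring the wait from the end of the previous cycle, rather than from a fixed clock schedule, is what guarantees the non-processing gap regardless of how long each cycle runs.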
Data lineage is frequently cited as a major requirement in complex flows to support problem analysis.
Process lineage is just as important and a mandatory requirement for effective data lineage. Without the ability to track the sequence of processing that brought a flow to a specific point, it is very difficult to analyze problems.
The need for process lineage arises quickly when a problem occurs in a pub/sub or “launch and forget” approach used in triggering workflows.
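A minimal sketch of process lineage, assuming each run records which upstream run triggered it. The `RunRecord` structure, the `triggered_by` field, and the workflow names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RunRecord:
    run_id: str
    workflow: str
    triggered_by: Optional[str]  # run_id of the upstream run, None for a root


def lineage(run_id: str, records: Dict[str, RunRecord]) -> List[str]:
    """Walk back from a run to its root and return the chain root-first."""
    chain: List[str] = []
    seen = set()  # guards against a corrupt, cyclic trigger chain
    current: Optional[str] = run_id
    while current is not None and current not in seen:
        seen.add(current)
        record = records[current]
        chain.append(f"{record.workflow} ({record.run_id})")
        current = record.triggered_by
    chain.reverse()
    return chain


records = {
    "r1": RunRecord("r1", "FileWatcher", None),
    "r2": RunRecord("r2", "CleanseData", "r1"),
    "r3": RunRecord("r3", "LoadWarehouse", "r2"),
}
print(" -> ".join(lineage("r3", records)))
```

With a "launch and forget" trigger, this upstream link is exactly the information that is lost unless the tool records it at launch time.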
Make the Work Visible
Process relationships should be visible.
One scenario where such visualization is particularly valuable is when everything appears perfectly normal but nothing is running. A clear line of sight between a watcher or sensor that is waiting for an event and the downstream process that wasn’t triggered because the event did not occur makes that situation much easier to diagnose.
The best way to define a non-event as an error is by defining an “expectation,” commonly called a service level.
At its most basic, an unmet SLA is identified as an error. For example, we expect a file to arrive between 4:00 PM and 6:00 PM. It takes approximately 15 minutes to cleanse and enrich the file and another 30 minutes to process it, so we can set the SLA to 6:45 PM. If processing hasn’t completed by then, whether it is running late or hasn’t even started, the error is recognized at 6:45 PM.
A more sophisticated approach is to use trending data to predict an SLA error as early as possible. We know the cleanse step runs approximately 15 minutes because we collect the actual execution time for the last n occurrences. The same for the processing step. If the cleanse step hasn’t finished by 6:15, we know we will be late. Similarly, if the processing step hasn’t started by 6:15, and so on. We can generate alerts and notifications as soon as we know, so that we have the maximum time to react and possibly rectify the problem.
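The arithmetic behind this prediction can be sketched by working backwards from the SLA, using the average of recent run times as the expected duration. The history values, step names, and function names below are hypothetical, and a real tool may use a more robust estimate than a plain average.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical run times in minutes for the last n runs of each step.
history = {
    "cleanse": [14, 15, 16, 15],
    "process": [29, 31, 30, 30],
}
steps_in_order = ["cleanse", "process"]
sla = datetime(2024, 1, 15, 18, 45)  # 6:45 PM; the date is arbitrary


def latest_times(sla, steps, history):
    """Work backwards from the SLA to each step's latest start and finish."""
    result = {}
    finish_by = sla
    for step in reversed(steps):
        expected = timedelta(minutes=mean(history[step]))
        start_by = finish_by - expected
        result[step] = {"start_by": start_by, "finish_by": finish_by}
        finish_by = start_by  # the previous step must finish before this one starts
    return result


def predicted_late(step, now, deadlines, started, finished):
    """True as soon as the step can no longer meet its latest start or finish."""
    if not started:
        return now > deadlines[step]["start_by"]
    if not finished:
        return now > deadlines[step]["finish_by"]
    return False


deadlines = latest_times(sla, steps_in_order, history)
for step in steps_in_order:
    print(step, deadlines[step]["start_by"].time(), deadlines[step]["finish_by"].time())

# At 6:16 PM the cleanse step is still running, so the SLA is already at risk.
print(predicted_late("cleanse", datetime(2024, 1, 15, 18, 16), deadlines, True, False))
```

With these numbers the cleanse step must start by 6:00 PM and finish by 6:15 PM, and the processing step must start by 6:15 PM, which matches the example in the text.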
A final enhancement is providing “slack time” to inform humans how much time remains for course correction. In the above scenario, if the cleanse step doesn’t start on time, at 6:00 PM, there are 45 minutes available to fix the problem before the SLA is breached.
Separate Instance Data From Config Data
Workflows frequently require application-level configuration data and task-specific code. For example, running a database query requires credentials or connection information together with the specific SQL that performs the required task. These should be separated so that the credentials/connection “object” can be re-used by any query tasks without requiring duplication.
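A minimal sketch of that separation, assuming a named connection profile that tasks refer to rather than embed. The class names, the host, and the `secret_ref` convention are hypothetical; the profile holds a reference to a secret store entry, never the password itself.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ConnectionProfile:
    name: str
    host: str
    database: str
    secret_ref: str  # reference to a secret store entry, not the credential itself


@dataclass(frozen=True)
class QueryTask:
    name: str
    connection: str  # name of a ConnectionProfile, defined once and shared
    sql: str


profiles: Dict[str, ConnectionProfile] = {
    "warehouse": ConnectionProfile(
        "warehouse", "db.example.internal", "sales", "vault:warehouse-login"
    ),
}

tasks = [
    QueryTask("CountOrders", "warehouse", "SELECT COUNT(*) FROM orders"),
    QueryTask("PurgeStaging", "warehouse", "DELETE FROM staging_orders"),
]


def resolve(task: QueryTask, profiles: Dict[str, ConnectionProfile]) -> ConnectionProfile:
    """Look up the shared connection profile a task refers to."""
    return profiles[task.connection]


for task in tasks:
    profile = resolve(task, profiles)
    print(f"{task.name}: run on {profile.host}/{profile.database}")
```

Changing the host or rotating the credential touches one profile, and none of the query tasks need to be edited.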
Not only is this a more elegant approach, but in the case of sensitive credentials, it may be mandatory.
As you are designing your workflow “microservices” and connecting tasks into process flows, make sure you tag objects with meaningful values that will help you identify relationships, ownerships, and other attributes that are important to your organization.
Imagine creating an API for credit card authorization and calling it “Validate.” If your response is, “Sounds good,” go read some other blog. I’m hoping most will think the name should be more like “CreditCardValidation” or something similarly meaningful.
This point is simply to think about the workflows you create in a similar way. It may be great to call a workflow “MyDataPipeLine” when you are experimenting on your own machine but that gets pretty confusing even for yourself, never mind the dozens or hundreds of other folks, once you start running in a multi-user environment.
Think of Others
You may be in the unusual position of being the only person running your workflow. More likely, that won’t be the case.
But even if it is, you may have a bunch of workflows, and you don’t want to have to re-learn each one every time you need to analyze a problem or modify or enhance it.
Include comments or descriptions or, if it’s really complicated, some documentation. And remember to rev the documentation together with the workflow.
Inquiring minds want to know…everything. Who built the workflow? Who ran it? If it was killed or paused, who did it and why? Did it always run successfully or did it fail? If so, when and why? How was it fixed?
And so on, and so on.
Basically, when it comes to workflows for important applications, you can never have too much information, so make sure your tool collects what you need.
Prepare for the Worst
You know tasks will fail. Make sure you collect the data that will be needed to fix the problem and that you keep it around for a while. That way, not only can you meet the “Keep Track” requirement but when problems occur, you can compare this failure to past failures or to previous successes to help determine the problem.
Finally, look for flexibility in determining what is success and what is failure. It’s correct and proper to expect good code, but we have all seen a task issue catastrophic error messages and still complete with an exit code of zero.
You should be able to define what is an error and what is not and accordingly you should be able to define recovery actions for each specific situation.
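As a sketch of what such definitions might look like, the rules below classify a task by its output as well as its exit code and attach a recovery action to each outcome. The patterns, outcome labels, and recovery action names are hypothetical; this is not how any particular tool expresses them.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class OutcomeRule:
    pattern: str   # regular expression searched for in the task output
    outcome: str   # "failure" or "success"
    recovery: str  # hypothetical recovery action name


RULES = [
    # Error text in the output means failure even when the exit code is zero.
    OutcomeRule(r"FATAL|catastrophic", "failure", "rerun_from_checkpoint"),
    # A non-zero exit code with this message is an acceptable outcome.
    OutcomeRule(r"no new files found", "success", "none"),
]


def classify(exit_code: int, output: str, rules: List[OutcomeRule] = RULES) -> Tuple[str, str]:
    """Apply output rules first, then fall back to the exit code."""
    for rule in rules:
        if re.search(rule.pattern, output, re.IGNORECASE):
            return rule.outcome, rule.recovery
    if exit_code == 0:
        return "success", "none"
    return "failure", "notify_owner"


print(classify(0, "FATAL: disk full while writing output"))
print(classify(1, "No new files found"))
print(classify(0, "Loaded 1200 rows"))
```

Rules are checked in order, so the first matching pattern wins; putting the failure patterns first is a deliberate choice in this sketch.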
Do you have some practices or requirements you would add or remove? Is there another point of view you would like to put forward?
Workflow orchestration, in one way or another, is almost universally used but rarely discussed. It would be great to add lots more voices to this discussion.
Opinions expressed by DZone contributors are their own.