Sharpen Your Data Science Toolkit With CI/CD
Maximizing efficiency is about knowing how the data science puzzles fit together and then executing them.
Today, a mass of new software packages and repositories is arriving on the scene, making the data science process more interactive, nuanced, and user-driven than ever before. For evidence, just check the Towards Data Science homepage on any given day. In the face of this new wave of choices, it is important to understand the basic structure of development pipelines.
Data scientists have become newly minted developers in their own right. As developers, it is worth understanding the principles software engineers use to iteratively test, construct, and shape the code they deploy.
In this article, we’ll talk about some often-misunderstood development principles that will guide you to developing more resilient, production-ready development pipelines using CI/CD tools. Then, we’ll make it concrete with a tutorial about how to set up your own pipeline using Buddy.
The Development Pipeline
Understanding the components of development is the first step to understanding which pieces go where and how things fit together. Each of these elements is a crucial building block toward the coveted end-to-end pipeline. In this case, “end-to-end” is jargon for “you make a code-level change, and the end user experiences the effect”.
In a nutshell, you as the data scientist would use the development pipeline to push changes from your local machine to a version control tool, and have these changes be reflected in the cloud deployment service for your end users.
Next, we’ll break down an example development pipeline step by step.
- Dashboard: Let’s say you have created an awesome new data science dashboard that uses data to answer a burning question within your organization. In this example, we’ll use a sample project from Streamlit that visualizes Uber rides in New York City. Here, the dashboard is a project folder on your local machine containing images, Python scripts, datasets, and more.
- Version Control: The next step is to use a version control tool like GitHub to host your project. This helps you and your team track code changes and edits over time, write documentation, and collaborate with the wider community. Many data scientists are already intimately familiar with GitHub as a place to host and showcase their projects.
- Cloud Deployment: Finally, you use a cloud hosting service to deploy your project. In this example, we’ll use AWS to deploy the dashboard visualization and share it with your end users.
At this point, many data scientists stop and call it a good day at the office.
However, we’re still missing a crucial piece of the puzzle. Uploading code to GitHub and setting up an AWS deployment is great, but if the codebase changes or is upgraded, the AWS deployment will not automatically reflect it. Instead, each new version has to be deployed manually. Besides the wasted effort and time, there is also the risk that your flashy new update breaks the basic functionality of the original dashboard. That risk is compounded when a whole team of data scientists is working on the product.
CI/CD With Buddy
To patch this missing puzzle piece, we introduce the concept of Continuous Integration / Continuous Deployment, abbreviated as CI/CD. CI/CD bridges the gap between development and operations through automation, helping you test and integrate your new changes into the existing body of work. Buddy is an excellent option for this role when setting up your deployment pipeline.
You might be wondering: how is this going to stop my development pipeline from breaking? Let’s explore the value of using Buddy. CI/CD is really a process, one that adds testing, automation, and delivery benchmarks to connect your GitHub repository to your cloud configuration.
Let’s examine each element in turn:
- Rigorous Testing: After installing the relevant environment, you can run unit tests with a full suite of tools, including Cypress, split testing, and tests written natively in Python, PHP, Django, and more. Each time the pipeline runs, all your tests are re-run to ensure production-quality code is deployed.
- Automated Deployment: Once your GitHub repository is connected, the pipeline will run according to your setup instructions. Just by pushing to the master branch or using a recurring trigger, the pipeline will automatically start, run tests, and update your cloud instance if there are no errors.
- Notifications: The CI/CD workflow lets you connect notification settings across a wide array of options, including SMS, email, and Slack, as well as mainstream performance and app monitoring tools like Datadog and Sentry. If there are any errors in integrating changes, you can be notified and go straight in to fix the issue.
Now that we have established the premise of CI/CD and its uses, let’s dive right into a first look of Buddy’s platform and how you can get a basic pipeline off the ground.
Tutorial: The CI/CD Pipeline
1. Setting Up the Pipeline
First off, head over to Buddy and make an account. It is recommended to use your version control login here, as it will save you the step of connecting your repositories. Either way, connecting your version control software will allow Buddy to connect to any of those repositories.
We’ll use the demo-uber-nyc-pickups repository for the purposes of this tutorial, which is an interactive dashboard built with Streamlit. After forking the repository on GitHub, it will show up in our repo list within Buddy. Clicking on the name leads us to the next screen.
Here, Buddy has already detected that the repository’s contents contain a Python app and shows us more options for setting up the relevant Python environment. At this step, we also have to select how the pipeline should trigger.
2. Adding Actions to the Pipeline
After naming the pipeline, we can choose what action will trigger the pipeline. Since we care about deploying new changes to the AWS instance, we can set it to run the pipeline every time a new push is made to the master branch. Alternatively, we can set it to only trigger manually, or even on a timed basis (e.g. every day at 5pm, every Friday, etc).
As mentioned, Buddy has detected that our app is written in Python, so we’ll click on that icon first. Here’s where we can configure the environment and choose the relevant Python version (in this case, 3.7). A quick look at the project’s README.md tells us the BASH lines needed to get the app up and running: the first line ensures that we are running the latest version of Streamlit, and requirements.txt contains the remaining dependencies we need to run our app.
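As a hedged reconstruction of the pipeline's command configuration (the exact lines live in the repo's README.md; verify against it before copying), the setup described above is likely of the form:

```shell
# Likely form of the setup commands, paste into Buddy's command editor
# after checking them against the project's README.md:
pip install --upgrade streamlit    # first line: ensure the latest Streamlit
pip install -r requirements.txt    # install the remaining dependencies
```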
At the bottom, we can also notice the Exit Code Handling section, which lets us define what happens when an error occurs at any step in the pipeline. We can either soldier on (not recommended for obvious reasons), stop the pipeline where it broke and send a notification that something went wrong, or try running different commands. Identifying where something has broken is perhaps the most frustrating part of fixing a broken process. Proactively setting error-handling behavior and notifications as a priority will help keep frustrations to a minimum when some element inevitably breaks.
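The three behaviours above can be sketched in plain shell (here `false_step` is a stand-in for any failing pipeline command; the exact mechanics inside Buddy are configured in the UI rather than scripted):

```shell
# A failing pipeline step returns a non-zero exit code.
false_step() { return 1; }

# Option 1: soldier on regardless of the error (not recommended)
false_step || true

# Option 2: stop where it broke and report the exit code
status=0
false_step || status=$?
if [ "$status" -ne 0 ]; then
  echo "step failed with exit code $status; stopping pipeline and notifying"
fi

# Option 3: fall back to a different command on failure
false_step || echo "primary command failed; trying a fallback instead"
```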
3. Building the Environment
Next, we’ll try running the very basic pipeline so far and see whether it works. Click “Run Pipeline” in the top left, and sit back as Buddy checks out the latest commit on the master branch, prepares the environment, and executes the BASH setup. On subsequent runs, the cache means only changes are rebuilt, so the process gets faster over time.
Awesome! The build completes without errors. If you are following along and hit errors at this stage, check that the Python version is exactly 3.7, because that is what this particular app’s dependencies require.
4. Adding Testing Functionality
Testing is a central element in software development, but it is unfortunately not prioritized or taught in most data science curriculums.
“Unit tests give you the confidence that your code does what you think it does”
Adding unit tests can be as simple as adding Python files to the same repository. To run these tests, we return to Step 3: Building the Environment and add a new line that runs them.
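A minimal sketch of what that looks like, assuming a hypothetical `tests/test_smoke.py` file (the file names and the trivial assertion are illustrative; real tests would import and exercise your dashboard code). The last line is the single extra command you would add to the Buddy action:

```shell
# Create a hypothetical placeholder test in the repository
mkdir -p tests
cat > tests/test_smoke.py <<'EOF'
import unittest

class SmokeTest(unittest.TestCase):
    # Trivial placeholder: real tests would import your dashboard code
    def test_sanity(self):
        self.assertEqual(1 + 1, 2)
EOF

# The one line to add to the Buddy pipeline action:
python3 -m unittest discover -s tests -v
```

Using the standard library's unittest module keeps the example dependency-free; a pytest invocation (`pytest tests/`) would work the same way if pytest is listed in requirements.txt.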
When the tests have been implemented, this is where we would expect to see the results. In this case, error handling setup becomes particularly important, as Buddy can share notifications if some tests fail.
5. Adding Notification Functionality
Adding in notifications is critical to ensuring we know where breaks in the pipeline occur, or which tests have failed. From the pipeline overview, click on the “Actions Run On Failure” section, where we can decide what actions will run if there is an error anywhere in the pipeline. For our purposes, it will be sufficient to set this up using environmental variables that will indicate which execution or test broke the pipeline execution.
- $BUDDY_PIPELINE_NAME gives us the name of the pipeline that broke.
- $BUDDY_EXECUTION_ID gives us the unique identifier of the pipeline execution that produced the error.
- $BUDDY_FAILED_ACTION_LOGS gives an extensive overview of the logs of what went wrong, which is convenient for diagnosing any issues that pop up. Often you can solve the issue just by glancing at the email, fixing the code, and making a new commit to patch it, without ever needing to visit the CI/CD tool at all.
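A sketch of the kind of failure message these variables let you compose. In a real run, Buddy injects the variables automatically; the values below are hypothetical stand-ins for demonstration:

```shell
# Hypothetical values: Buddy sets these for you during a real execution.
BUDDY_PIPELINE_NAME="Deploy dashboard"
BUDDY_EXECUTION_ID="42"
BUDDY_FAILED_ACTION_LOGS="pip install failed: could not resolve host"

# A notification body of the sort you might send on failure:
printf 'Pipeline "%s" (execution %s) failed.\nLogs:\n%s\n' \
  "$BUDDY_PIPELINE_NAME" "$BUDDY_EXECUTION_ID" "$BUDDY_FAILED_ACTION_LOGS"
```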
6. Add More Actions + Iterate
This is the last step we will take in setting up the pipeline. By connecting this pipeline to a free-tier AWS EC2 machine, we will arrive at an end-to-end pipeline, as per the overall goal.
To do this, select the SFTP action and make the connection between Buddy and the public IPv4 address of the EC2 machine. Here, I’ve entered my Hostname & Port and Login information, and used my private SSH key to give Buddy access to the EC2 machine.
There are two caveats to mention here:
- The EC2 instance must be running at the time you run your pipeline so that the host is live and can be found; if the instance is stopped or interrupted, the connection will fail.
- It is useful to have tmux running Streamlit on your EC2 instance, since the Buddy pipeline updates the files in place, letting the visualization reflect those changes without any manual input.
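A minimal sketch of the tmux setup on the EC2 machine. The session name and `streamlit_app.py` are hypothetical placeholders for your own app; the guard keeps the snippet harmless on machines without tmux installed:

```shell
# Hypothetical session and script names; adjust for your EC2 setup.
SESSION="dashboard"
if command -v tmux >/dev/null 2>&1; then
  # -d starts the session detached, so the app keeps running after you log out
  tmux new-session -d -s "$SESSION" "streamlit run streamlit_app.py"
  echo "Streamlit running in tmux session: $SESSION"
else
  echo "tmux not found; install it first (e.g. sudo apt-get install tmux)"
fi
```

Because the session is detached, the dashboard survives SSH disconnects, and each Buddy SFTP upload simply replaces the files the running app reads.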
Having completed this final step, we can run the pipeline by simply making a change at the code-level, and then committing the change to the master branch.
Hooray! We set up the environment correctly and pushed changes from the local machine to GitHub. The pipeline then built the code, ran the unit tests, and uploaded everything to the EC2 machine, where the changes were reflected in our visualization.
Let’s take a look at the final product:
This is the front-end visualization, powered by Streamlit. To review, we’ve taken Python code and committed it to a version control tool (in this case, GitHub). The repo is linked to a CI/CD tool (Buddy), which syncs, tests, and integrates our commits into the overall build, hosted on an AWS EC2 machine.
In conclusion, every time we make a new commit to GitHub, it will trigger a Buddy pipeline execution which will:
- Build the environment
- Run any tests
- Upload our changes to AWS using SFTP.
In the event of any errors or snags during execution, we’ll receive an email highlighting exactly what went wrong. With this level of detail and refinement, Buddy’s CI/CD tooling elevates our deployment of data science platforms and makes it easier than ever to maintain user-driven products.
It’s worth noting that the specifics of the EC2 setup are not in the purview of this article, but some helpful advice and content on setting up Streamlit applications on AWS is available here.
Published at DZone with permission of Saif Bhatti. See the original article here.