DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Because the DevOps movement has redefined engineering responsibilities, SREs now have to become stewards of observability strategy.

Apache Cassandra combines the benefits of major NoSQL databases to support data management needs not covered by traditional RDBMS vendors.

The software you build is only as secure as the code that powers it. Learn how malicious code creeps into your software supply chain.

Generative AI has transformed nearly every industry. How can you leverage GenAI to improve your productivity and efficiency?

Related

  • The Transformative Power of Artificial Intelligence in Cloud Security
  • Simplify Your Compliance With Google Cloud Assured Workloads
  • Building a Modern Data Architecture: From Legacy Systems to Cloud Integration
  • Oxide Computer: Revolutionizing Cloud Computing for the Modern Data Center

Trending

  • Prioritizing Cloud Security Risks: A Developer's Guide to Tackling Security Debt
  • Strategies for Securing E-Commerce Applications
  • Data Lake vs. Warehouse vs. Lakehouse vs. Mart: Choosing the Right Architecture for Your Business
  • Yet Another GenAI Nightmare: Seven Shadow AI Pitfalls to Avoid
  1. DZone
  2. Software Design and Architecture
  3. Cloud Architecture
  4. Optimizing Airflow: A Case Study in Cloud Resource Efficiency

Optimizing Airflow: A Case Study in Cloud Resource Efficiency

A case study troubleshooting the running of an orchestration tool for a limited amount of time per day, resulting in optimization and cost efficiency.

By 
Aliaksandr Sheliutsin user avatar
Aliaksandr Sheliutsin
·
Dec. 20, 23 · Tutorial
Likes (32)
Comment
Save
Tweet
Share
10.8K Views

Join the DZone community and get the full member experience.

Join For Free

Throughout my career, I've worked with many companies that required an orchestration tool for a limited amount of time per day. For example, one of my first freelance clients needed to run an Airflow instance for only 2-3 hours per day, resulting in the instance being idle the rest of the time and wasting money.

Because it wasn't a large company, the client asked if I could intervene. The infrastructure was hosted on Google Cloud, which I was familiar with.

After a quick online search, I found the official manual that precisely addressed my needs. I estimated that the task would take about 20 hours. Here's the design diagram.

The image from the official manual

The image from the official manual

I have to stop here to explain why there were exactly 20 hours:

  1. I had to transfer the code from App Engine (I'm not sure why Airflow was initially deployed to App Engine). 
  2. Currently, I use the official Airflow Docker Compose for such cases, but in the past, I installed raw Airflow in LocalExecutor mode, and the database was running in the same instance (I know it's bad, but you can't blame me if you've never done shitcode).
  3. The dags themselves must be slightly refactored to accommodate the new schedule, and as you might expect, there was a lot of low-quality code that I had carefully reviewed.

To cut a long story short, I squeezed in 18 hours, and the result was as follows:

before and afterThis is my “before and after.” I am really proud of it.

The main disadvantage of the solution was that the pipeline execution could take much longer than three hours, which I was not aware of at the time. There were occasions when pipelines should take 5 hours or even 12 hours, so what should we do?

Pretty simple: if we look closely at the design, we can see that there is a job in Cloud Scheduler that sends a message to the PubSub topic, which triggers the Cloud Function, which stops the Airflow instance. So why can't we just turn it off and send the message to the topic via Airflow? It's simple, just a few lines of code using the PubSubPublishMessageOperator:

Python
 
check_that_still_latest >> PubSubPublishMessageOperator(

   task_id="send_pub_sub_message",

   project_id=conf.GCP_PROJECT_ID,

   topic=conf.TOPIC_TO_SHUTDOWN_AIRFLOW_INSTANCE,

   messages=[conf.AIRFLOW_SHUTDOWN_MESSAGE],

   gcp_conn_id=conf.GCP_CONN_ID,

   trigger_rule=TriggerRule.NONE_SKIPPED,

   execution_timeout=timedelta(minutes=5)

)


Have I mentioned setting the trigger_rule and the previous check_that_still_latest?

Yes, after a few issues with the pipeline, I realized two things:

  • Because the pipeline can take more than 24 hours at times, we should consider whether we should skip the instance termination.
  • Even if the previous tasks fail, we must still terminate the instance.

Because I am not supposed to check that on a regular basis, I used Google Cloud Monitoring for automated monitoring to avoid any unnecessary interactions. When an issue with the Airflow pipeline is detected, a message is sent to PubSub, allowing the GC Monitoring service to raise an alert and send my client and me an email with all relevant information. The client is aware that I will contribute hours to the timesheet and check the error, but I will not have to waste time manually monitoring the potential errors on a regular basis.

This solution has proven to be effective after more than a year of operation with no changes. During this time, I only had to restart one pipeline twice.

Cloud Google (verb) Cloud computing

Published at DZone with permission of Aliaksandr Sheliutsin. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • The Transformative Power of Artificial Intelligence in Cloud Security
  • Simplify Your Compliance With Google Cloud Assured Workloads
  • Building a Modern Data Architecture: From Legacy Systems to Cloud Integration
  • Oxide Computer: Revolutionizing Cloud Computing for the Modern Data Center

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!