Developing a Nationwide Real-Time Telemetry Analytics Platform Using Google Cloud Platform and Apache Airflow
Designed a real-time telemetry analytics platform using GCP and Airflow to process 10TB+ daily data, reduce support escalations, and improve operational visibility.
During my tenure at TELUS, I was assigned a high-profile, technically demanding project: building a telemetry analytics platform that could analyze data in real time from more than 100,000 set-top boxes (STBs) deployed across Canada. The objective was not just scale; the platform was meant to help teams make faster operational decisions and improve the experience for millions of customers. Early on, I recognized that the outdated data infrastructure was a bottleneck that kept data from reaching the teams who needed it most. This article describes how we modernized that infrastructure using Google Cloud Platform (GCP), Apache Airflow, and Infrastructure-as-Code tools to overcome those obstacles and deliver a future-proof solution.
The Problem: Legacy Bottlenecks and Blind Spots
Before this overhaul, we relied mainly on siloed, batch-oriented data pipelines that could not support real-time diagnostics. Key concerns included:
- Long delays in pipeline execution
- Limited insight into device-level health and performance
- Little automation in infrastructure management, and configuration drift
- Inability to correlate STB telemetry with customer issues in a timely manner
One incident that has stayed with me is a major regional outage during which we could not identify the affected devices in real time; all analysis had to be done post-mortem using stale batch data. It highlighted the urgent need for a more responsive, dependable, and predictive data platform.
The Solution: A Cloud-Based Streaming Telemetry Platform
We set out to design a comprehensive telemetry analytics solution built around scalability, reliability, and automation. Our goals were to eliminate manual intervention, accelerate decision-making, and provide a central repository for telemetry data. Here’s how we achieved it:
- Google Cloud Platform (GCP)
  - BigQuery served as the central analytics engine, handling both structured and semi-structured telemetry data.
  - Cloud Storage staged raw ingestion data from the STBs.
  - Cloud Functions triggered lightweight transformation jobs and alerting logic.
- Apache Airflow for Task Management
  - Modular DAGs gave us precise control over each stage of the ETL life cycle.
  - We used PythonVirtualenvOperator to isolate task dependencies, especially when working with external libraries such as Paramiko.
  - We used time-based scheduling and event-based triggers, depending on the sensitivity and nature of each pipeline.
- Infrastructure-as-Code
  - With Pulumi and Terraform, we automated infrastructure provisioning across the development, testing, and production environments.
  - We kept all configurations YAML-driven and version-controlled, which simplified onboarding and rollbacks.
- Compression and Export Layers
  - For downstream ingestion (including Adobe Experience Platform), we exported telemetry as gzip-compressed CSV (.csv.gz) to reduce file size.
  - The platform securely transferred files via SFTP using a combination of Paramiko scripts and the GCS client.
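To make the export layer concrete, here is a minimal sketch of producing a .csv.gz payload with only the Python standard library. It is an illustration under stated assumptions, not the production code: the function name `rows_to_csv_gz` and the column names are hypothetical, and the whole file is buffered in memory, which a real multi-gigabyte export would likely avoid by streaming.

```python
import csv
import gzip
import io


def rows_to_csv_gz(rows, fieldnames):
    """Serialize dict rows to gzip-compressed CSV bytes, header included.

    Hypothetical helper for illustration; column names are assumptions.
    """
    text = io.StringIO(newline="")
    writer = csv.DictWriter(text, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    # mtime=0 keeps the gzip header deterministic, so identical input
    # yields identical bytes (useful for checksums and safe retries).
    return gzip.compress(text.getvalue().encode("utf-8"), mtime=0)


if __name__ == "__main__":
    sample = [{"serial_number": "STB001", "firmware_version": "1.2.3", "cpu_pct": 41}]
    payload = rows_to_csv_gz(sample, ["serial_number", "firmware_version", "cpu_pct"])
    print(f"{len(payload)} compressed bytes")
```

The resulting bytes could then be wrapped in `io.BytesIO` and handed to an SFTP client, for example Paramiko's `SFTPClient.putfo`; the connection setup is omitted here because the article does not describe it.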

Figure 1: End-to-End Architecture of TELUS Real-Time Telemetry Analytics Platform
This diagram illustrates how telemetry data flows from nationwide set-top boxes (STBs) into Google Cloud Platform, where modular Airflow DAGs orchestrate data ingestion, processing, and export. BigQuery serves as the analytics engine, while Looker dashboards and secure SFTP exports deliver insights to internal and external stakeholders.
Operational Architecture Overview
- Telemetry data flowed from the STBs to Cloud Storage on GCP, mostly in real time, and from there Airflow DAGs orchestrated structured pipelines for validation, transformation, enrichment, and export. For example, we used BigQuery streaming inserts so that the most recent metrics could be analyzed immediately. We also implemented data quality rules directly within DAG tasks to raise alerts on schema mismatches.
- We developed a centralized Looker dashboard, powered by BigQuery views, that allowed operations teams to search telemetry by serial number, firmware version, or regional cluster, giving them self-service access to previously siloed data.
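A data quality rule of the kind described above can be as simple as a function that a DAG task calls on each batch. The sketch below is one possible shape, not the rules we actually ran: `EXPECTED_SCHEMA`, its field names, and both function names are assumptions made for illustration.

```python
# Hypothetical expected schema: field name -> required Python type.
EXPECTED_SCHEMA = {
    "serial_number": str,
    "firmware_version": str,
    "region": str,
    "cpu_pct": float,
}


def find_schema_violations(record, schema=EXPECTED_SCHEMA):
    """Return a list of problems for one record; an empty list means it conforms."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}, got {actual}"
            )
    # Sorted so the report is stable when several extra fields appear.
    for field in sorted(record.keys() - schema.keys()):
        problems.append(f"unexpected field: {field}")
    return problems


def validate_batch(records, schema=EXPECTED_SCHEMA):
    """Return the record count, or raise ValueError listing every bad record."""
    failures = {}
    for index, record in enumerate(records):
        problems = find_schema_violations(record, schema)
        if problems:
            failures[index] = problems
    if failures:
        raise ValueError(f"schema check failed for {len(failures)} record(s): {failures}")
    return len(records)
```

Raising an exception is a deliberate choice here: when a Python callable raises inside an Airflow task, the task is marked as failed, so whatever failure alerting is configured for the DAG fires without extra plumbing.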
Results: Quantifiable Operational Advantages
The platform’s impact was immediate and quantifiable. Within the first month:
- Customer support escalations related to STB issues fell by approximately 25%, thanks to proactive diagnostic alerts.
- We achieved 98%+ accuracy in telemetry transformation through strict schema validations and reconciliation checks.
- Real-time insights appeared on our dashboards with less than 5 minutes of latency from source ingestion to visualization.
- The system supported daily processing of over 10 TB of telemetry data across 100,000+ STBs.
- Most importantly, for the first time, operations teams could observe live device health and anticipate outages before they evolved into customer complaints.
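For readers unfamiliar with reconciliation checks, the idea is to compare what entered a pipeline with what was loaded at the end. The sketch below shows one simple way to do that by counting record keys on both sides. It is an assumption-laden illustration: the article does not state how the 98% figure was computed, and the function name, the key format, and the `threshold` default are all hypothetical.

```python
from collections import Counter


def reconcile_counts(source_keys, loaded_keys, threshold=0.98):
    """Compare record keys seen at the source with keys present after loading.

    Keys are hypothetical identifiers, e.g. serial number plus event timestamp.
    """
    source = Counter(source_keys)
    loaded = Counter(loaded_keys)
    missing = source - loaded       # present upstream, absent after load
    unexpected = loaded - source    # loaded without an upstream match
    matched = sum((source & loaded).values())
    total = sum(source.values())
    match_rate = matched / total if total else 1.0
    return {
        "match_rate": match_rate,
        "passed": match_rate >= threshold,
        "missing": dict(missing),
        "unexpected": dict(unexpected),
    }


if __name__ == "__main__":
    report = reconcile_counts(["A", "A", "B", "C"], ["A", "B", "B", "C"])
    print(report)
```

Using multiset counts rather than plain sets means duplicated or dropped records for the same key are caught, not just keys that disappear entirely.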
One of the most gratifying moments was seeing the reaction of our network reliability team when we demonstrated the first live dashboard. “We’ve never possessed this kind of visibility before,” one of them told me. That moment validated the countless late nights spent troubleshooting DAG failures and refactoring Terraform scripts.
Key Insights
- Cloud-based architectures significantly improve scalability, observability, and data accessibility.
- Modular Airflow DAGs allowed us to refine and optimize individual stages without disrupting upstream or downstream flows.
- Pulumi and Terraform gave us complete control over deployments and eliminated configuration drift.
- BigQuery let us consolidate streaming and batch use cases with minimal operational overhead.
- Giving end users tools such as Looker dashboards markedly improves incident response and broadens access to data within the organization.
Conclusion
This platform has fundamentally changed the way we manage device health and customer-impacting issues. What began as a technical transformation became a business catalyst. Our telemetry analytics architecture is now a repeatable pattern across teams and has influenced how data pipelines are scaled in other domains. Building it wasn’t merely a data engineering milestone; it was a lesson in how people, pipelines, and platforms come together to drive innovation.
Opinions expressed by DZone contributors are their own.