A Framework for Securing Open-Source Observability at the Edge

Build secure observability solutions for distributed edge environments using open-source telemetry. Achieve zero security incidents and full compliance.

By Prakash Velusamy · Oct. 31, 25 · Analysis

The Edge Observability Security Challenge 

Deploying an open-source observability solution to distributed retail edge locations creates a fundamental security challenge. With thousands of locations processing sensitive data such as payments and customers' personally identifiable information (PII), every telemetry component running at the edge becomes a potential entry point for attackers. Edge environments have limited physical security, bandwidth that is shared with business-critical application traffic, and no on-site technical staff for incident response.

Traditional centralized monitoring security models do not fit these conditions because they assume abundant resources, dedicated security teams, and controlled physical environments. None of these exist at the edge.

This article explores how to secure an OpenTelemetry (OTel)-based observability framework built on Cloud Native Computing Foundation (CNCF) projects. It covers metrics, distributed tracing, and logging through Fluent Bit and Fluentd.

Securing OTel Metrics

Mutual Transport Layer Security (TLS) 

Metrics are secured through mutual TLS (mTLS) authentication, in which both the client and the server must prove their identity with certificates before communication is established. This ensures that only trusted systems exchange telemetry. Unlike traditional Prometheus deployments that expose unauthenticated Hypertext Transfer Protocol (HTTP) endpoints for every service, OTel's push model allows us to require mTLS for all connections to the collector (see Figure 1).

Figure 1: OpenTelemetry security architecture. Multi-stage security through PII removal, mTLS communication, and 95% volume reduction


Security configuration, otel-config.yaml 

YAML
 
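receivers:
  # Server-side TLS only: the collector presents a certificate, clients are not verified.
  otlp:
    protocols:
      grpc:
        endpoint: mysite.local:55690
        tls:
          cert_file: server.crt
          key_file: server.key
  # mTLS: client_ca_file makes the collector require and verify client certificates.
  # Both receivers listen on the same endpoint, so enable only one of them in a pipeline.
  otlp/mtls:
    protocols:
      grpc:
        endpoint: mysite.local:55690
        tls:
          client_ca_file: client.pem
          cert_file: server.crt
          key_file: server.key

exporters:
  # Client side of mTLS: ca_file verifies the server, cert_file and key_file identify this client.
  otlp:
    endpoint: myserver.local:55690
    tls:
      ca_file: ca.crt
      cert_file: client.crt
      key_file: client-tss2.key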


Multi-Stage PII Removal for Metrics 

Metrics often capture sensitive data by accident through labels and attributes. A customer identifier (ID) in a label, or a credit card number in a database query attribute, can turn compliant metrics into a regulatory violation. Multi-stage PII removal addresses this problem with defense in depth at the data level.

Stage 1: Application-level filtering.

The first stage happens at the application level, where developers use OTel Software Development Kit (SDK) instrumentation that hashes user identifiers with the SHA-256 algorithm before creating metrics. Uniform Resource Locators (URLs) are scanned to strip query parameters such as tokens and session IDs before they become span attributes.
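A minimal sketch of this application-level stage is shown below, using only the Python standard library. The function names and the list of sensitive query parameters are assumptions made for illustration; they are not part of the OTel SDK, and a real service would call helpers like these before recording a metric label or span attribute.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical denylist of query parameters that must never reach telemetry.
SENSITIVE_PARAMS = {"token", "access_token", "session_id", "sessionid", "api_key"}


def hash_user_id(user_id: str) -> str:
    """Return the SHA-256 hex digest of a user identifier."""
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()


def scrub_url(url: str) -> str:
    """Drop sensitive query parameters and keep the rest of the URL intact."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SENSITIVE_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

For example, `scrub_url("https://shop.example/cart?item=7&token=abc123")` keeps `item=7` and drops the token, while `hash_user_id` always maps the same identifier to the same 64-character digest, so counts per user remain possible without storing the raw ID.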

Stage 2: Collector-level processing.

The second stage occurs in the OTel Collector's attribute processor. It applies three patterns: complete deletion for high-risk PII, one-way hashing of identifiers using SHA-256 with a cryptographic salt, and regex-based cleanup of complex data.
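The three patterns can be illustrated with a small Python model of the logic. This is a sketch of the idea, not collector configuration: the attribute names, the salt, and the card-number regex are all assumptions chosen for the example.

```python
import hashlib
import re

# Hypothetical attribute names and salt. A real deployment would load the salt
# from a secret store rather than hard-coding it.
DELETE_KEYS = {"card.number", "user.email"}
HASH_KEYS = {"customer.id"}
SALT = b"example-salt"
# Matches 13 to 16 digit runs, a rough stand-in for card numbers in free text.
CARD_PATTERN = re.compile(r"\b\d{13,16}\b")


def process_attributes(attrs: dict) -> dict:
    """Apply delete, salted-hash, and regex-redact rules to one attribute map."""
    cleaned = {}
    for key, value in attrs.items():
        if key in DELETE_KEYS:
            continue  # pattern 1: complete deletion
        if key in HASH_KEYS:
            # pattern 2: salted one-way hash
            cleaned[key] = hashlib.sha256(SALT + str(value).encode("utf-8")).hexdigest()
        elif isinstance(value, str):
            # pattern 3: regex cleanup of free-form strings
            cleaned[key] = CARD_PATTERN.sub("[REDACTED]", value)
        else:
            cleaned[key] = value
    return cleaned
```

Deletion is reserved for values with no analytical use, hashing keeps identifiers joinable without revealing them, and the regex pass catches PII embedded in longer strings such as query text.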

Stage 3: Backend-level scanning.

The third stage provides backend-level scanning where centralized systems perform data loss prevention (DLP) scanning to detect any PII that reached storage, triggering alerts for immediate remediation. When the backend scanner detects PII, it generates an alert indicating the edge filters need updating, creating a feedback loop that continuously improves protection.  

Aggressive Metric Filtering 

Security is not only about encryption and authentication, but also about removing unnecessary data. Transmitting less data reduces the attack surface, minimizes exposure windows, and makes anomaly detection easier. There may be hundreds of metrics available out of the box, but forwarding only the metrics that are needed can cut metric volume by up to 95%. This saves compute resources and network bandwidth and reduces management overhead.
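One way to picture allowlist-based filtering is the sketch below. The metric names and wildcard patterns are hypothetical; the point is only that anything not explicitly listed is dropped before it leaves the edge.

```python
from fnmatch import fnmatchcase

# Hypothetical allowlist; the metric names are illustrative only.
ALLOWED_METRICS = ["http.server.duration", "system.cpu.*", "payment.*.count"]


def is_allowed(name: str) -> bool:
    """Return True if the metric name matches any allowlist pattern."""
    return any(fnmatchcase(name, pattern) for pattern in ALLOWED_METRICS)


def filter_metrics(metrics: list) -> list:
    """Forward only metrics whose name is on the allowlist."""
    return [metric for metric in metrics if is_allowed(metric["name"])]
```

An allowlist is the safer default here: a newly added metric is not forwarded until someone has reviewed it, whereas a denylist would forward it silently.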

Resource Limits as Security Controls 

The OTel Collector sets strict resource limits that prevent denial-of-service attacks. 

| Resource | Limit | Protection against |
| --- | --- | --- |
| Memory | 500MB hard cap | Out-of-memory attacks |
| Rate limiting | 1,000 spans/sec/service | Telemetry flooding attacks |
| Connections | 100 concurrent streams | Connection exhaustion |


These limits ensure that even when an attack happens, the collector maintains stable operation and continues to collect required telemetry from applications. 
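The per-service span rate limit can be sketched as a token bucket. The class names below are hypothetical and the code only models the idea in Python; it is not how the collector itself is implemented. The clock is injectable so the behavior can be checked without waiting in real time.

```python
import time


class TokenBucket:
    """Allow up to `rate` events per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self._now = now
        self._last = now()

    def allow(self) -> bool:
        current = self._now()
        elapsed = current - self._last
        self._last = current
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class SpanRateLimiter:
    """Keep one bucket per service so a noisy service cannot starve the others."""

    def __init__(self, spans_per_sec: float = 1000, now=time.monotonic):
        self.spans_per_sec = spans_per_sec
        self._now = now
        self._buckets = {}

    def allow(self, service: str) -> bool:
        bucket = self._buckets.get(service)
        if bucket is None:
            bucket = TokenBucket(self.spans_per_sec, self.spans_per_sec, self._now)
            self._buckets[service] = bucket
        return bucket.allow()
```

Keying the buckets by service name means a flood from one compromised workload exhausts only its own budget, which matches the per-service limit in the table.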

Distributed Tracing Security 

Trace Context Propagation Without PII 

Distributed traces can be secured through the W3C Trace Context standard, which propagates trace context without exposing sensitive data. The traceparent header carries only a version, a trace ID, a parent span ID, and trace flags. No business data, user identifiers, or secrets are allowed (see Figure 1).
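Because the traceparent header has a fixed layout, a service can reject any value that carries extra data. The sketch below validates the version 00 format from the W3C Trace Context specification; the function name is an assumption for this example.

```python
import re

# Version 00 layout: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header: str):
    """Return (trace_id, parent_id, flags), or None if the header is malformed."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid under the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, flags
```

A header such as `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01` parses cleanly, while the same value with an appended `-user=alice` is rejected because it no longer matches the fixed layout.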

Critical Rule Often Violated 

Never put PII in baggage. Baggage is transmitted in HTTP headers across every service hop, creating multiple exposure opportunities through network monitoring, log files, and services that accidentally log baggage. 

Span Attribute Cleaning at Source 

Span attributes must be cleaned at the source, before they are set on a span, because the application can no longer modify a span once it has ended and been exported. Common mistakes that expose PII include capturing full URLs with authentication tokens in query parameters, adding database queries containing customer names or account numbers, capturing HTTP headers with cookies or authorization tokens, and logging error messages with sensitive data that users submitted. Implementing filter logic at the application level removes or hashes sensitive data before spans are created.
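Two of these mistakes, sensitive HTTP headers and literal values in database queries, can be handled with a filter like the sketch below. The header denylist, the helper names, and the attribute keys are assumptions for illustration, and the SQL regex is deliberately simple rather than a full parser.

```python
import re

# Hypothetical denylist of header names that must never become span attributes.
SENSITIVE_HEADERS = {"authorization", "cookie", "set-cookie", "x-api-key"}
# Quoted string literals (with '' escapes) and standalone numbers in SQL text.
SQL_LITERAL = re.compile(r"'(?:[^']|'')*'|\b\d+\b")


def safe_headers(headers: dict) -> dict:
    """Drop headers that carry credentials or session state."""
    return {k: v for k, v in headers.items() if k.lower() not in SENSITIVE_HEADERS}


def parameterize_sql(statement: str) -> str:
    """Replace literal values with placeholders so only the query shape remains."""
    return SQL_LITERAL.sub("?", statement)


def build_span_attributes(headers: dict, statement: str) -> dict:
    """Assemble attributes that are safe to attach when the span is created."""
    attrs = {
        f"http.request.header.{name.lower()}": value
        for name, value in safe_headers(headers).items()
    }
    attrs["db.statement"] = parameterize_sql(statement)
    return attrs
```

The resulting dictionary is what the instrumentation would pass to the tracer, so the raw header values and query literals never enter the span in the first place.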

Security-Aware Sampling Strategy 

Sampling away up to 90% of normal-operation traces supports the General Data Protection Regulation (GDPR) principle of data minimization while maintaining 100% visibility for security-relevant events.

The following sampling approach serves both performance and security by intelligently deciding which traces to keep based on their value. 

| Trace type | Sample rate | Rationale |
| --- | --- | --- |
| Error spans | 100% | Potential security incidents require full investigation |
| High-value transactions | 100% | Fraud detection and compliance requirements |
| Authentication/authorization | 100% | Security-critical paths need complete visibility |
| Normal operations | 10-20% | Maintains statistical validity while minimizing data collection |
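The sampling rules in the table above can be expressed as a small decision function. This is a sketch under assumptions: the flag names are hypothetical, and the 10% default is simply the lower end of the stated range. Because error status is usually known only once a span has finished, a decision like this would typically run as tail-based sampling.

```python
import hashlib

NORMAL_SAMPLE_RATE = 0.10  # lower bound of the 10-20% range for normal operations


def should_keep(trace_id: str, *, is_error: bool = False, high_value: bool = False,
                is_auth: bool = False, rate: float = NORMAL_SAMPLE_RATE) -> bool:
    """Keep every security-relevant trace and a fixed share of the rest."""
    if is_error or high_value or is_auth:
        return True
    # Hash the trace ID into [0, 1) so every component reaches the same
    # decision for the same trace without coordinating.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Deriving the decision from the trace ID rather than a random number keeps traces whole: either all spans of a normal trace are kept or none are.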


Logging Security With Fluent Bit and Fluentd 

Real-Time PII Masking 

Application logs are the highest-risk telemetry because they contain unstructured text that may include anything developers print. Real-time masking of PII before logs leave the pod is the most critical security control in the entire observability stack. The scanning and masking happen in microseconds, adding minimal overhead to log processing. If developers accidentally log sensitive data, it is caught before network transmission (see Figure 2).

Figure 2: Fluent Bit and Fluentd security architecture. Logging security enabled through two-stage DLP, real-time masking in microseconds, TLS 1.2+ end to end, rate limiting, and zero log-based PII leaks
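As a rough illustration of the masking step, the sketch below applies a list of regex rules to each log line. In the article's setup this logic runs inside the log pipeline; Python is used here only to show the idea. The patterns and placeholder labels are assumptions and would need tuning against real log samples.

```python
import re

# Illustrative patterns only; production rules need tuning and testing.
MASK_RULES = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    (re.compile(r"\b(?:\d[ -]?){12,15}\d\b"), "[CARD]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]


def mask_line(line: str) -> str:
    """Replace every match of every rule with its placeholder label."""
    for pattern, replacement in MASK_RULES:
        line = pattern.sub(replacement, line)
    return line
```

Replacing matches with a typed placeholder, rather than deleting them, keeps the log line readable and still shows that something was redacted and what kind of data it was.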


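Security configuration, fluent-bit.yaml

YAML
 
pipeline:
  inputs:
    # HTTP input served over TLS with a self-signed certificate.
    - name: http
      port: 9999
      tls: on
      # Verification is off only because the certificate is self-signed.
      # Use a CA-signed certificate and turn this on in production.
      tls.verify: off
      tls.crt_file: self_signed.crt
      tls.key_file: self_signed.key

  outputs:
    # Forward logs to Fluentd over TLS.
    - name: forward
      match: '*'
      host: x.x.x.x
      port: 24224
      tls: on
      # Verify the Fluentd certificate against tls.ca_file.
      # For mutual TLS, also set tls.crt_file and tls.key_file for this client.
      tls.verify: on
      tls.ca_file: '/etc/certs/fluent.crt'
      tls.vhost: 'fluent.example.com'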

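Security configuration, fluentd.conf

<source>
  @type forward
  port 24224
  <transport tls>
    cert_path /root/cert.crt
    private_key_path /root/cert.key
    # Require and verify a client certificate (mutual TLS)
    client_cert_auth true
    ca_cert_path /root/ca.crt
  </transport>
</source>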


Secondary DLP Layer 

Fluentd provides secondary DLP scanning with different regex patterns designed to catch what Fluent Bit missed. These patterns cover private keys, newly identified PII formats, other sensitive data, and context-based detection.

Encryption and Authentication for Log Transit 

Log transmission is secured with TLS 1.2 or higher and mutual authentication. Fluent Bit authenticates to Fluentd using certificates, and Fluentd authenticates to Splunk using tokens. This approach prevents attacks that capture logs in transit, man-in-the-middle attacks that modify logs, and unauthorized log injection.

Rate Limiting as Attack Prevention 

Preventing log flooding avoids both performance and security issues. An attacker generating a massive volume of logs can hide malicious activity in the noise, consume all disk space and cause denial of service, overwhelm centralized log systems, or increase cloud costs until logging is disabled to save money. Rate limiting at 10,000 logs per minute per namespace prevents these attacks.
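The per-namespace limit can be modeled as a fixed-window counter, sketched below. The class name is hypothetical and timestamps are passed in explicitly, in seconds, so the logic is easy to follow; the logging pipeline itself would enforce the limit through its own configuration.

```python
class NamespaceLogLimiter:
    """Fixed-window counter: at most `limit` records per namespace per window."""

    def __init__(self, limit: int = 10_000, window_seconds: int = 60):
        self.limit = limit
        self.window_seconds = window_seconds
        self._state = {}  # namespace -> (window index, count in that window)

    def allow(self, namespace: str, timestamp: float) -> bool:
        window = int(timestamp // self.window_seconds)
        seen_window, count = self._state.get(namespace, (window, 0))
        if seen_window != window:
            count = 0  # a new window started, so reset the counter
        if count >= self.limit:
            self._state[namespace] = (window, count)
            return False
        self._state[namespace] = (window, count + 1)
        return True
```

Records rejected by `allow` would be dropped or counted rather than forwarded, so a flood in one namespace cannot fill disks or bury activity from other namespaces.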

Security Comparison: Three Telemetry Types 

| Aspect | Metrics (OTel) | Traces (OTel) | Logs (Fluent Bit/Fluentd) |
| --- | --- | --- | --- |
| Primary Risk | PII in labels/attributes | PII in span attributes/baggage | Unstructured text with any PII |
| Authentication | mTLS with 30-day cert rotation | mTLS for trace export | TLS 1.2+ with mutual auth |
| PII Removal | 3-stage: App --> Collector --> Backend | 2-stage: App --> Backend DLP | 3-stage: Fluent Bit --> Fluentd --> Backend |
| Data Minimization | 95% volume reduction via filtering | 80-90% via smart sampling | Rate limiting + filtering |
| Attack Prevention | Resource limits (memory, rate, connections) | Immutable spans + sampling | Rate limiting + buffer encryption |
| Compliance Feature | Allowlist-based metric forwarding | 100% sampling for security events | Real-time regex-based masking |
| Key Control | Attribute processor in collector | Cleaning before span creation | Lua scripts in sidecar |


 Key Outcomes 

  • Secured open-source observability across distributed retail edge locations
  • Achieved full Payment Card Industry Data Security Standard (PCI DSS) and GDPR compliance
  • Reduced bandwidth consumption by 96% 
  • Minimized attack surface while maintaining complete visibility 

Conclusion 

Securing a Cloud Native Computing Foundation-based observability framework at the retail edge is both achievable and essential. By implementing comprehensive security across OTel metrics, distributed tracing, and Fluent Bit/Fluentd logging, organizations can achieve zero security incidents while maintaining complete visibility across distributed locations.
