Monitoring with DataDog

By Giorgio Sironi · Apr. 10, 13

Recently I found myself sending more and more business metrics to Datadog, a Software as a Service solution that promises to collect all your data points, build business metrics from them, display them as graphs, and trigger alerts whenever they reach critically low (or high) levels.

The goals

The more your automated tests raise their level of abstraction, the more they become oriented toward external quality (what the customer wants and does) rather than internal quality (low coupling and high cohesion in the software design). The largest end-to-end tests that we have in place at Onebip connect several different projects on an integration server and run everything from the creation of a purchase or subscription to its renewal and termination (events that would happen months after creation).

However, even end-to-end tests cannot guarantee that our applications work against external resources, such as merchants, mobile carriers, and ISPs. The only way to catch integration problems is monitoring. These problems, like a mobile carrier experiencing an outage, may be due to our errors or to external conditions; but they should nevertheless be discovered as early as possible.

The infrastructure

Datadog is the only data-collection service that passed the stress tests of SLL, our solution architect. It ships as a UDP server that you pay for based on the number of machines you want to run it on; for example, a preproduction and a production server are a common choice to start out. The server collects data locally and periodically uploads it to Datadog in bursts, where you can access it via a web application or via APIs, in case you want to call it from your build.

The UDP protocol is aligned with the goals of metric collection: a silent server that decouples the sending of metrics from the rest of the business logic:

  • UDP packets are simply lost if no process is listening for them: no errors are raised if the server crashes or is not running or installed for some reason (for instance, on development machines).
  • The monitoring code, which you write, should be as decoupled and asynchronous as possible. The part that talks over the network is already externalized in the Datadog server, but you don't want the user to wait because you have to send some strange number.
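The fire-and-forget behavior of the first point can be sketched as follows. This is a hypothetical sender class, not the article's actual code; it assumes the agent accepts StatsD-style counter packets (`metric:value|c`) on the conventional port 8125:

```php
<?php
// Minimal sketch of a fire-and-forget UDP metric sender (hypothetical class).
class UdpMetricSender
{
    private $host;
    private $port;

    public function __construct($host = '127.0.0.1', $port = 8125)
    {
        $this->host = $host;
        $this->port = $port;
    }

    // Formats a StatsD-style counter increment, e.g. "payments.mo:1|c".
    public function formatCounter($metric, $value = 1)
    {
        return sprintf('%s:%d|c', $metric, $value);
    }

    public function increment($metric)
    {
        // @ suppresses connection warnings: with UDP, if nothing is
        // listening the packet is simply dropped and the caller moves on.
        $socket = @fsockopen('udp://' . $this->host, $this->port);
        if ($socket !== false) {
            fwrite($socket, $this->formatCounter($metric));
            fclose($socket);
        }
    }
}

$sender = new UdpMetricSender();
$sender->increment('payments.mo'); // never blocks, never throws
```

Note that `increment()` behaves identically whether or not an agent is running, which is exactly the property that makes UDP attractive here.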

So the internal part (sending via UDP) is performed in Listener objects that implement the Observer pattern. These objects still have to be wrapped in all-encompassing try/catch constructs so that any error in the monitoring part never influences the business logic. Again, you don't want a payment to fail because of an exception in how the monitoring's DateTime objects are built.

For PHP we built a SilentListener class to wrap all of our objects:

class SilentListener
{
    private $wrapped;

    public function __construct($wrapped)
    {
        $this->wrapped = $wrapped;
    }

    public function __call($method, $args)
    {
        try {
            call_user_func_array(array($this->wrapped, $method), $args);
        } catch (Exception $e) {
            $this->log($e);
        }
    }

    // Record the failure without letting it propagate to the caller.
    private function log(Exception $e)
    {
        error_log($e->getMessage());
    }
}


An example

In some countries, we receive payments through mobile-originated messages (MO), a fancy term for an SMS sent by the end user. So a simple way to monitor whether we are receiving payments, or whether the server has exploded, is to upload a counting metric every time we receive one (pseudo-JSON format to show you the data):

{
  counter: 1
}

However, we can be more precise than this: an external outage or an integration problem may happen at a lower level than the whole application. For example, MOs can be delayed in Argentina, or by a single carrier, while the rest of the world is still working fine.

So our data points look like this:

{
  counter: 1,
  tags: {
    country: "IT",
    carrier: "Vodafone",
    merchant: "Tasty Cookies, Inc."
  }
}

and in turn, graphs on Datadog or calls to its API can set up filters so that we can, if necessary, view only the data related to any combination of country, carrier, and merchant.
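A tagged data point like the one above can be serialized into the DogStatsD wire format, where tags ride along after a `#` separator. The function and the slugged tag values below are illustrative, not the article's code:

```php
<?php
// Sketch: format a tagged counter in the DogStatsD wire format
// ("metric:value|c|#tag:value,tag:value,..."). Tag values are slugs
// because commas and colons are delimiters in this format.
function formatTaggedCounter($metric, array $tags, $value = 1)
{
    $pairs = array();
    foreach ($tags as $key => $val) {
        $pairs[] = $key . ':' . $val;
    }
    return sprintf('%s:%d|c|#%s', $metric, $value, implode(',', $pairs));
}

echo formatTaggedCounter('payments.mo', array(
    'country'  => 'IT',
    'carrier'  => 'vodafone',
    'merchant' => 'tasty-cookies',
));
// payments.mo:1|c|#country:IT,carrier:vodafone,merchant:tasty-cookies
```

Each key:value pair then becomes a filterable dimension on the Datadog side, which is what makes the per-country or per-carrier graphs possible.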

The nice thing, SLL says, is that you just start sending data from production, and only after you have data points available do you build a graph or an alert system based on what appear to be the most important tags. For example, a big merchant may benefit from some dedicated monitoring, while minor countries such as Vietnam can be monitored as a whole, since their traffic is by far lower than that of the others.


Opinions expressed by DZone contributors are their own.
