Monitoring at eBay: Big Data Problems

by Matt O'Keefe · Nov. 14, '12 · Interview

This post is based on a talk by Bhaven Avalani and Yuri Finklestein at QCon SF 2012 (slides). Bhaven and Yuri work on the Platform Services team at eBay.

This is a big data talk with monitoring as the context. The problem domain includes operational management (performance, errors, anomaly detection), triaging (root-cause analysis), and business monitoring (customer behavior, clickstream analytics). Customers of monitoring include Dev, Ops, InfoSec, management, research, and the business team. How much data? In 2009 it was tens of terabytes per day; now it is more than 500 TB/day. The drivers of this volume are business growth, SOA (many small pieces log more data), business insights, and ops automation.

The second aspect is data quality. There are logs, metrics, and events, with decreasing entropy in that order: logs are free-form, whereas events are well defined. Veracity increases in the same order, so logs may be inaccurate where events can be trusted.

There are tens of thousands of servers in multiple datacenters generating logs, metrics, and events that feed into a data distribution system. The data is distributed to OLAP, Hadoop, and HBase for storage. Some of the data is handled in real time, while other processing, such as OLAP metric extraction, is not.

Logs
How do you make logs less “wild”? Typically there is no schema, typing, or governance. At eBay they impose a log format as a requirement. The log entry types include open and close for transactions, with timestamps for transaction begin and end, a status code, and arbitrary key-value data. Transactions can be nested. Another type is the atomic transaction. There are also types for events and heartbeats. They generate 150 TB of logs per day.
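The imposed format can be pictured as typed entries. Here is a minimal Python sketch of the idea; the entry type names, field names, and nesting scheme are assumptions for illustration, not eBay's actual format:

```python
import time
import uuid

def make_entry(entry_type, txn_id, parent_id=None, status=None, **kv):
    """Build one typed log entry; arbitrary key-value data rides along."""
    return {
        "ts": time.time(),     # timestamp of the entry
        "type": entry_type,    # OPEN / CLOSE / ATOMIC / EVENT / HEARTBEAT
        "txn": txn_id,
        "parent": parent_id,   # set when a transaction is nested
        "status": status,      # status code, recorded on CLOSE
        "kv": kv,
    }

# An outer transaction with a nested inner one, plus a heartbeat:
outer = str(uuid.uuid4())
inner = str(uuid.uuid4())
log = [
    make_entry("OPEN", outer, op="checkout"),
    make_entry("OPEN", inner, parent_id=outer, op="payment"),
    make_entry("CLOSE", inner, parent_id=outer, status=0),
    make_entry("CLOSE", outer, status=0),
    make_entry("HEARTBEAT", "-", host="app42"),
]
```

The point of the open/close pairing is that transaction duration falls out of matching the two timestamps, and the parent pointer makes the nesting explicit rather than something a parser has to infer from free-form text.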

Large-scale data distribution
The hardest part of distributing such large amounts of data is fault handling. It is necessary to buffer data temporarily and to handle large spikes. Their solution is similar to Scribe and Flume, except that the unit of work is a log entry with multiple lines, and the lines must be processed in the correct order. The fault domain manager copies the data into downstream domains. It uses a system of queues to handle the temporary unavailability of a destination domain such as Hadoop or messaging. Queue depth indicates the pressure being produced by the tens of thousands of publisher clients. The queues are implemented as circular buffers so that they can start dropping data if the pressure is too great. Different policies, such as drop-head and drop-tail, are applied depending on the domain’s requirements.
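A toy Python illustration of a per-destination queue as a bounded buffer with drop-head and drop-tail policies; the class name, capacity, and policy names are invented for the sketch, and the real fault domain manager is far richer than this:

```python
from collections import deque

class BoundedQueue:
    def __init__(self, capacity, policy="drop_head"):
        self.buf = deque()
        self.capacity = capacity
        self.policy = policy   # "drop_head": shed oldest; "drop_tail": refuse newest
        self.dropped = 0

    def publish(self, unit_of_work):
        if len(self.buf) >= self.capacity:  # buffer full: destination is too slow
            self.dropped += 1
            if self.policy == "drop_tail":
                return                      # refuse the newest entry
            self.buf.popleft()              # shed the oldest entry
        self.buf.append(unit_of_work)

    def depth(self):
        # Queue depth is the back-pressure signal from publishers.
        return len(self.buf)

q = BoundedQueue(capacity=3, policy="drop_head")
for i in range(5):
    q.publish(f"log-entry-{i}")
```

Drop-head keeps the freshest data (useful for live dashboards), while drop-tail preserves what is already buffered (useful when older entries must complete in order), which is why the policy is chosen per destination domain.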

Metric extraction
The raw log data is a great source of metrics and events. The client does not need to know ahead of time what is of interest. The heart of the system that does this is distributed OLAP. There are multiple dimensions such as machine name, cluster name, datacenter, transaction name, etc. The system maintains in-memory counters over hierarchically described data. Traditional OLAP systems cannot keep up with the amount of data, so they partition across layers consisting of publishers, buses, aggregators, combiners, and query servers. The aggregators produce OLAP cubes: multidimensional structures of counters. The combiner then produces one gigantic cube that is made available for queries.
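The aggregate-then-combine idea can be sketched with plain counters keyed by dimension tuples: each aggregator builds a partial cube, the combiner merges them, and a query slices the merged cube. The dimension names here are illustrative assumptions:

```python
from collections import Counter

DIMS = ("datacenter", "cluster", "machine", "transaction")

def aggregate(entries):
    """One aggregator: roll raw entries up into a partial cube of counters."""
    cube = Counter()
    for e in entries:
        cube[tuple(e[d] for d in DIMS)] += 1
    return cube

def combine(cubes):
    """The combiner: merge partial cubes into one big queryable cube."""
    total = Counter()
    for c in cubes:
        total.update(c)
    return total

def query(cube, **where):
    """Sum counters matching a partial dimension filter (a cube slice)."""
    idx = {d: i for i, d in enumerate(DIMS)}
    return sum(v for k, v in cube.items()
               if all(k[idx[d]] == val for d, val in where.items()))

a1 = aggregate([{"datacenter": "dc1", "cluster": "web", "machine": "m1", "transaction": "search"}])
a2 = aggregate([{"datacenter": "dc1", "cluster": "web", "machine": "m2", "transaction": "search"},
                {"datacenter": "dc2", "cluster": "web", "machine": "m3", "transaction": "buy"}])
big = combine([a1, a2])
```

Because counter merging is associative, the aggregator and combiner layers can each be scaled out independently, which is what lets the partitioned design keep up where a single traditional OLAP server cannot.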

Time-series storage
RRD was a remarkable invention when it came out, but it can’t deal with data at this scale. One solution is to use a column-oriented database such as HBase or Cassandra; however, you don’t know what your row size should be, and handling very large rows is problematic. OpenTSDB, on the other hand, uses fixed row sizes based on time intervals. At eBay’s scale, with millions of metrics per second, you need to down-sample based on metric frequency. To solve this, they introduced the concept of multiple row spans for different resolutions.
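A sketch of fixed-interval row keys with a different row span per resolution, in the spirit of the OpenTSDB-style layout described above; the span values and key shape are invented for illustration:

```python
ROW_SPAN = {            # seconds of data stored per row, by resolution
    "1s": 3600,         # high-frequency metrics: one hour per row
    "1m": 86400,        # minutely roll-ups: one day per row
    "1h": 86400 * 30,   # hourly roll-ups: ~one month per row
}

def row_key(metric, resolution, ts):
    """Return the row a data point lands in, plus its offset within the row."""
    span = ROW_SPAN[resolution]
    base = ts - (ts % span)   # align the timestamp to the row's start boundary
    return (metric, resolution, base), ts - base

key, offset = row_key("cpu.busy", "1s", 1_700_000_123)
```

Fixing the row boundary per resolution bounds the row width regardless of metric volume: high-frequency data gets short spans so rows never grow unmanageably large, while down-sampled data gets long spans so range scans touch few rows.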

Insights

  • Entropy is important to look at; remove it as early as possible
  • Data distribution needs to be flexible and elastic
  • Storage should be optimized for access patterns

Q&A
Q. What are the outcomes in terms of value gained?
A. Insights into the availability of the site are important, as they release code every day. Business insights into customer behavior are great too.

Q. How do they scale their infrastructure and do deployments?
A. Each layer is horizontally scalable, but they’re still working on auto-scaling at this time. eBay is looking to leverage cloud automation to address this.

Q. What is the smallest element that you cannot divide?
A. Logs must be processed atomically. It is hard to parallelize metric families.

Q. How do you deal with security challenges?
A. Their security team applies governance. There is also a secure, encrypted channel for when you absolutely need to log sensitive data.





Published at DZone with permission of Matt O'Keefe, DZone MVB. See the original article here.

