
What is Streamdrill Good For?

By Mikio Braun · Jan. 10, 13


A few weeks ago, we released the beta (oh, sorry, the "β") of streamdrill. One question we heard quite often was, "Well, what is it good for?" So I'll try to elaborate a bit more on what it does.

First of all, streamdrill is not yet another general-purpose big data framework. In its current version it doesn't even support clustering (although that's something we plan to add in the coming months).

In a nutshell: streamdrill is useful for counting activities in event streams over different time windows and finding the most active ones. For the sake of simplicity, let's call this the top-k problem.
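To make the problem concrete, here is a deliberately naive, exact sketch in Python of top-k counting over a sliding time window (the function and variable names are mine, not from streamdrill's API):

```python
from collections import Counter

def top_k_in_window(events, k, window, now):
    """Exact top-k: count keys whose timestamp falls in the last `window` seconds."""
    counts = Counter(key for t, key in events if now - t <= window)
    return counts.most_common(k)

# Four web-log events as (timestamp, key) pairs.
events = [(0, "page_a"), (10, "page_b"), (20, "page_a"), (95, "page_c")]
print(top_k_in_window(events, 2, window=100, now=100))
```

Note that this exact approach has to keep every raw event around; the point of a system like streamdrill is to answer the same kind of query without storing the raw stream.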

So let's break this down a bit.

Event streams are actually commonplace. Some examples:

  • Really any kind of server logs
  • User interactions in a social network
  • Logs of web accesses, page impressions, etc.
  • Monitoring applications of any kind

In general, such logs may contain all kinds of data with varying levels of structure, but it is often possible to identify groups of events that share a similar structure. For example:

  • For server logs: performance reports, errors, exceptions thrown, etc.
  • For user interactions: a user posting something, a user sending messages, etc.
  • For web logs: page requests consisting of path, referrer, client IP address, and user agent

Such event streams usually contain an enormous amount of data, much too much for a single human being to grasp. As a first step, you're often interested in doing some kind of aggregation, extracting some basic statistics. For example:

  • For server logs: what was the average performance over the last hour across the whole cluster? Which exceptions are thrown most often?
  • For user interactions: who are the most active users? Which media are trending (i.e., viewed/reshared most often)?
  • For web logs: which web pages are viewed most often, and from which locations? Who are the most active referrers?

This is exactly what streamdrill does. For a given type of event, it counts activities and aggregates them over time windows. Furthermore, it lets you filter results so that you can drill down into your trends.

For streamdrill, an event consists of a fixed number of fields (called entities). Currently, these are all plain strings, but this might change in the future.
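As an illustration, a web-log event with four entities might look like the following. The field names here are hypothetical, chosen to match the web-log example above, not taken from the streamdrill API:

```python
# A hypothetical web-log event: a fixed set of named fields ("entities").
event = {
    "path": "/index.html",
    "referrer": "http://example.com/",
    "ip": "203.0.113.42",
    "user_agent": "Mozilla/5.0",
}

# All entities are currently plain strings.
assert all(isinstance(v, str) for v in event.values())
```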

For every event, you send an update command to streamdrill, which then automatically updates the counts for the configured timescales. These timescales aggregate the counters using a technique called exponential decay counters, which is very memory-efficient. In addition, streamdrill keeps an ordered index of all entries so that you can quickly query the most active ones, optionally filtering on some of the entities (for example, showing only entries for a given path, user agent, IP address, etc.)
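As a rough intuition for how an exponentially decaying counter can work, here is a toy sketch assuming a simple half-life parameterization; the actual streamdrill implementation and its API differ:

```python
import heapq
import math

class DecayCounter:
    """Toy exponentially decaying counter with top-k queries."""

    def __init__(self, half_life):
        # Decay rate such that a value halves every `half_life` seconds.
        self.rate = math.log(2) / half_life
        self.counts = {}  # key -> (decayed value, time of last update)

    def update(self, key, t, amount=1.0):
        value, last = self.counts.get(key, (0.0, t))
        # Decay the stored value forward to time t, then add the new activity.
        value *= math.exp(-self.rate * (t - last))
        self.counts[key] = (value + amount, t)

    def top_k(self, k, t):
        # Decay every value to the common query time t before ranking.
        decayed = ((value * math.exp(-self.rate * (t - last)), key)
                   for key, (value, last) in self.counts.items())
        return heapq.nlargest(k, decayed)

counter = DecayCounter(half_life=3600.0)  # one-hour half-life
counter.update("page_a", t=0)
counter.update("page_a", t=0)
counter.update("page_b", t=0)
```

The appeal of this scheme is that memory grows with the number of distinct keys rather than the number of events, and old activity fades away smoothly instead of being dropped at a hard window edge.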

In the next post, we'll discuss why you would want to use streamdrill for this purpose instead of cooking up something of your own.


Published at DZone with permission of Mikio Braun, DZone MVB. See the original article here.

