Two Helpful Data Concepts
I’ve been batting around a couple of terms for years while talking about various technical solutions: real-timey-ness and correcty-ness. I’ve found them useful when selecting and constructing technical solutions for managing data. They’ve helped me both build solutions and explain what I need from one.
When considering solutions for your data needs, you may say you need something real-time, but you’re unlikely to ever find such a solution. Most of them will incur some latency while persisting a record, and a record only becomes available once that write has completed. Aggregate data will be delayed further, since it depends on the underlying records being available first. There are rarely any true real-time solutions; most solutions are real-timey: some amount of time, measured in milliseconds, seconds, minutes, or hours, will pass before the data becomes usable. In addition, most data is not actionable for some time after it becomes available; a single data point is rarely enough to make a decision.
Is your answer correct? This is a fundamental question in dealing with data, and most people I’ve dealt with make one fundamentally incorrect assumption: that the data they collect is exactly correct. For most metrics, the data you collect is only mostly correct. You will lose some transactions, you will overcount others, and your software will have bugs, all of which lead to inaccuracy in your data. To improve your correcty-ness you’ll likely have to trade resolution or latency in order to push yourself closer to your achievable accuracy. Correcty-ness, unlike real-timey-ness, can be improved over time. You can look at higher-resolution data later and adjust your initial measurement for a time period.
A Practical Example
The scalability of your application will generally be determined by your requirements around real-timey-ness and correcty-ness. Take, for example, a visits counter on a web page. The most straightforward implementation would be to increment a counter for each page render and store it in a database (UPDATE counter_table SET views_count = views_count + 1 WHERE page_id = 5). This solution will have a high degree of accuracy, but it comes with limited scalability, since we’d be locking a row in order to increment its value. Furthermore, the accuracy of this solution may actually degrade as usage increases, since aborted page loads may fail to increment the counter. This solution would have high real-timey-ness and correcty-ness, at the expense of scalability.
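As a rough illustration of the first approach, here is a minimal sketch of the increment-on-render counter. It uses an in-memory SQLite database as a stand-in for whatever database you actually run, and `make_db` and `render_page` are hypothetical names chosen for this example. The table and column names come from the UPDATE statement above.

```python
import sqlite3


def make_db():
    # In-memory SQLite stands in for the real database (an assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE counter_table ("
        "page_id INTEGER PRIMARY KEY, views_count INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO counter_table (page_id, views_count) VALUES (5, 0)")
    conn.commit()
    return conn


def render_page(conn, page_id):
    # The increment happens inside the page render, so every render
    # pays for a write and contends for the same row.
    with conn:  # commits on success, rolls back on an exception
        conn.execute(
            "UPDATE counter_table SET views_count = views_count + 1 "
            "WHERE page_id = ?",
            (page_id,),
        )
        row = conn.execute(
            "SELECT views_count FROM counter_table WHERE page_id = ?",
            (page_id,),
        ).fetchone()
    return row[0]


conn = make_db()
views = 0
for _ in range(3):
    views = render_page(conn, 5)  # three renders, three writes
```

The count is available as soon as the render finishes, which is where the high real-timey-ness comes from. The cost is one synchronous write per page view.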
A more complex solution would be to look at the request logs and increment the row in batches using an asynchronous process. This solution will update the counter in whatever time it takes to aggregate the last set of logs. The correcty-ness of the count at any given point will be lower than that of the above solution, as will the real-timey-ness, since the answer is only current as of the last aggregation. However, the solution will support larger request volumes, since the aggregation of requests takes place outside of the page render.
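The batch approach could be sketched roughly like this. The log format here (timestamp, method, path, status separated by spaces) is invented for the example, and `aggregate_views` and `apply_batch` are hypothetical helpers. A real access log would need a proper parser.

```python
from collections import Counter


def aggregate_views(log_lines):
    """Count successful GET requests per path in one batch of log lines.

    Assumes a simplified, made-up format: '<timestamp> <method> <path> <status>'.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed lines rather than guess
        _, method, path, status = parts
        if method == "GET" and status == "200":
            counts[path] += 1
    return counts


def apply_batch(totals, batch_counts):
    # One update per page per batch, instead of one per page view.
    for path, n in batch_counts.items():
        totals[path] = totals.get(path, 0) + n
    return totals


totals = {}  # stands in for the counter table
batch = [
    "2024-01-01T00:00:01Z GET /page/5 200",
    "2024-01-01T00:00:02Z GET /page/5 200",
    "2024-01-01T00:00:02Z GET /page/7 200",
    "2024-01-01T00:00:03Z GET /page/5 500",  # failed render, not counted
]
apply_batch(totals, aggregate_views(batch))
```

Between batches the stored totals are stale, which is the real-timey-ness you give up. In exchange, the page render never touches the counter row.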
The first solution presented is perfect for a small web application. The small number of requests you receive at small scale can make asynchronous solutions look broken, and the latency incurred per page render is relatively small. At small scale it’s probably better to favor a higher degree of correcty-ness.
The second solution will perform much better at larger scale. Its lack of correcty-ness and real-timey-ness will be hidden by the hit counter incrementing by large numbers with each refresh. This solution would generally be called eventually consistent, but you can never really achieve consistency without looking at a fixed time window that is no longer being updated.
A Third Solution
During each page render, a UDP packet could be sent to an application that increments an in-memory counter. The page could then pull this count from the secondary application and display the current count. To achieve consistency, the request logs could be aggregated on a given interval, with the result replacing the base value of the counter.
This solution will have a high degree of real-timey-ness, since page views will be aggregated immediately. However, the correcty-ness of the application will be lower than that of the first solution, since the data transmission method is less reliable. This is a fairly scalable solution that balances actionable real-time data with the ability to correct measuring errors. That said, it is likely less scalable than the two previous solutions.
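A minimal sketch of this third approach might look like the following. Everything here is an assumption made for illustration: the `HitCounter` class, the `drain` helper, the loopback address, and the one-datagram-per-hit protocol. It also ignores hits that arrive between the log cutoff and the rebase, which a real implementation would have to handle.

```python
import socket


class HitCounter:
    """In-memory count: a base value taken from the logs plus hits seen since."""

    def __init__(self):
        self.base = 0
        self.since_base = 0

    def record_hit(self):
        self.since_base += 1

    @property
    def count(self):
        return self.base + self.since_base

    def rebase(self, log_total):
        # Replace the base with the total aggregated from the request logs
        # and discard the less reliable UDP tally (a simplification).
        self.base = log_total
        self.since_base = 0


def drain(sock, counter, max_packets):
    """Receive up to max_packets datagrams, counting one hit per datagram."""
    received = 0
    while received < max_packets:
        try:
            sock.recvfrom(64)
        except socket.timeout:
            break  # a lost packet is simply never counted
        counter.record_hit()
        received += 1
    return received


server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))  # ephemeral port on loopback
server.settimeout(1.0)
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

counter = HitCounter()
for _ in range(3):
    # Fire-and-forget: the page render does not wait for an acknowledgement.
    client.sendto(b"hit", server.getsockname())
drain(server, counter, 3)
live_count = counter.count

counter.rebase(log_total=4)  # pretend the logs show 4 views for this window
corrected_count = counter.count

client.close()
server.close()
```

The live count is immediately available but may undercount if datagrams are dropped. The periodic rebase from the logs is what pulls the number back toward the correct value.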
Great, What Now?
When designing an application, take the time to think about the acceptable real-timey-ness and correcty-ness. In general, high correcty-ness and real-timey-ness create slow applications at larger scale. So, when spec’ing out an application, consider assigning a real-timey-ness value to the data you present to users. I would typically define it as a time window, e.g. data presented in the UI will be at least 5 seconds old, and no more than 2 minutes old. As for correcty-ness, I would define it as the acceptable accuracy within a given time period. For example, data must have an error no larger than 50% within 5 seconds, 10% within 5 minutes, and 0.1% within 7 minutes.
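One way to make such a specification concrete is to write it down as data and check measurements against it. The sketch below is only an illustration: `DataSpec` and its methods are hypothetical, and it reads "within 5 seconds" as "once the data is 5 seconds old", which is an interpretation rather than something the numbers above pin down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSpec:
    min_age_s: float      # real-timey-ness: data shown is at least this old
    max_age_s: float      # ... and no older than this
    error_budget: tuple   # (age_s, max_relative_error) pairs, ascending by age

    def is_fresh_enough(self, age_s):
        return age_s <= self.max_age_s

    def max_error_at(self, age_s):
        """Tightest bound that applies at age_s, or None before the first checkpoint."""
        allowed = None
        for checkpoint_s, max_err in self.error_budget:
            if age_s >= checkpoint_s:
                allowed = max_err
        return allowed

    def meets_accuracy(self, age_s, reported, actual):
        allowed = self.max_error_at(age_s)
        if allowed is None:
            return True  # no bound applies yet
        if actual == 0:
            return reported == 0
        return abs(reported - actual) / actual <= allowed


# The example numbers from the paragraph above.
spec = DataSpec(
    min_age_s=5,
    max_age_s=120,
    error_budget=((5, 0.50), (300, 0.10), (420, 0.001)),
)

early_ok = spec.meets_accuracy(age_s=10, reported=140, actual=100)   # 40% error, 50% allowed
later_ok = spec.meets_accuracy(age_s=300, reported=140, actual=100)  # 40% error, 10% allowed
```

Writing the spec this way makes the trade-off explicit: the same 40% error is acceptable ten seconds in and unacceptable five minutes later.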
Deciding what these numbers should be is a different problem. You can generally address real-timey-ness at the expense of correcty-ness, but it’s hard to improve both correcty-ness and real-timey-ness at scale. I would generally look at the scale of the application to decide how important each is. Solutions that are low scale won’t generally have contention issues, so I would favor high real-timey-ness and correcty-ness. In addition, people are more likely to notice issues when the numbers increase by small amounts (e.g. the counter not going from 2 to 3 after 4 page views), and are more likely to complain about problems in the software. For large-scale solutions I would look at how long the data takes to become actionable. Being accurate within 1 to 5 minutes may be enough to help you derive a result from your data, but it may take 24 hours before you can conclude anything. Think about the amount of time this will take, and then build your specification accordingly.
Hopefully these concepts are useful, and can be put to use when designing solutions for data.
Published at DZone with permission of Geoffrey Papilion , DZone MVB. See the original article here.