Time, Identity, Meaning, and Observability: Necessary Qualities for Actionable Data
Time, identity, meaning, and observability are fundamentally necessary for extracting insights from trends. These insights serve as the building blocks of data-driven decisions.
In addressing data quality and usability, we often focus on attributes like data format and storage type. But in considering what makes data truly actionable, there are qualities that are more important — without these qualities, data becomes difficult to analyze regardless of format. Time, identity, meaning, and observability are fundamentally necessary for extracting insights from trends. These insights serve as the building blocks of data-driven business decisions.
Single events in time are not sufficient representations of trends; they serve as evidence of a trend. A process can only be truly understood through observation over time. The process could be a learning arc or progression, macroeconomic activity, website usage, or the migration paths of a particular animal, but none of these processes could be identified or understood by a single event alone.
There are a couple of pitfalls we see in event time storage that can muddy the waters and reduce the accuracy of event data: reliability and specificity.
Avoiding the first of these pitfalls requires us to keep event time separate from storage or ingestion time. We sometimes see systems that conflate the two, which loses the accurate event time. xAPI, for instance, accounts for this by providing two different time fields: one for the event timestamp and one for the time the data was stored. Without this separation, an ingestion system asserts when something occurred based on its own constraints instead of providing a true record of the event.
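As a minimal sketch of keeping the two times apart, the record below carries both an event timestamp and a stored time. The field names follow the xAPI convention of timestamp and stored; the Event class, the ingest function, and the sample values are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Event:
    name: str
    timestamp: datetime  # when the event happened, as reported by the source
    stored: datetime     # when this system persisted the record


def ingest(name: str, timestamp: datetime, now: Optional[datetime] = None) -> Event:
    """Record the storage time without overwriting the source's event time."""
    stored = now if now is not None else datetime.now(timezone.utc)
    return Event(name=name, timestamp=timestamp, stored=stored)


# Illustrative values: the event occurred 4.5 minutes before it was stored.
occurred = datetime(2017, 3, 1, 14, 5, 0, tzinfo=timezone.utc)
received = datetime(2017, 3, 1, 14, 9, 30, tzinfo=timezone.utc)
event = ingest("video-completed", occurred, now=received)
lag = event.stored - event.timestamp  # ingestion delay, kept as separate knowledge
```

Because both values survive, the ingestion delay becomes something we can measure rather than an error silently baked into the event time.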
The second pitfall is time specificity.
Sometimes we see data that has had its timestamp truncated to show only the date, potentially leaving a large number of events blobbed into a cluster with no clear event series apparent. Certain data analysis techniques can utilize this type of data, but these techniques are unable to surface trends that occur within the timescale of a day. This data is effectively unorganized and unordered within a day, making it impossible to identify the event series within that time block.
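A small sketch makes the loss concrete. The event names and times below are invented for illustration; the point is only that sorting on a date-only key can no longer recover the sequence within the day.

```python
from datetime import datetime

# Hypothetical events, listed in the order they happened to arrive
events = [
    ("logout", datetime(2017, 3, 1, 17, 45)),
    ("login", datetime(2017, 3, 1, 9, 0)),
    ("quiz-start", datetime(2017, 3, 1, 9, 20)),
]

# With full timestamps, the series within the day is recoverable.
by_time = [name for name, ts in sorted(events, key=lambda e: e[1])]

# After truncation to a date, every key ties, so sorting just
# preserves arrival order and the real sequence is gone.
truncated = [(name, ts.date()) for name, ts in events]
by_date = [name for name, day in sorted(truncated, key=lambda e: e[1])]
distinct_keys = len({day for _, day in truncated})
```

Here by_time yields login, quiz-start, logout, while by_date stays in arrival order because all three events share a single key.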
Even when a full timestamp is saved with data, it sometimes fails to include necessary context. When we tell time in our everyday lives, we just say, "It's 4:35 PM," and that's all the context needed. But as anyone who's ever had to book an international conference call knows, time is complicated. Time data should be recorded according to an international standard so that specificity is not lost and all systems which need to consume the data will have a shared understanding of when the event occurred. Use Coordinated Universal Time (UTC) for recording when events occur; it is the primary time standard by which the world regulates clocks and time. This will steer you clear of the time zone quagmire.
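One way to apply this, sketched with Python's standard datetime module: convert every timestamp to UTC before storing it, and refuse timestamps that carry no offset at all. The helper name to_utc_iso and the two sample offsets are our own assumptions for the example.

```python
from datetime import datetime, timedelta, timezone


def to_utc_iso(moment: datetime) -> str:
    """Return an ISO 8601 string in UTC; refuse timestamps with no offset."""
    if moment.tzinfo is None or moment.utcoffset() is None:
        raise ValueError("timestamp has no UTC offset, so the instant is ambiguous")
    return moment.astimezone(timezone.utc).isoformat()


# The same instant as seen from two illustrative fixed offsets
new_york = datetime(2017, 3, 1, 16, 35, tzinfo=timezone(timedelta(hours=-5)))
berlin = datetime(2017, 3, 1, 22, 35, tzinfo=timezone(timedelta(hours=1)))

utc_a = to_utc_iso(new_york)
utc_b = to_utc_iso(berlin)
```

Both calls produce 2017-03-01T21:35:00+00:00, so any consuming system sees one unambiguous instant regardless of where the event was recorded.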
To keep your systems in the clear with regard to time specificity, you will need to demand that the systems you integrate with handle time with full specificity, or you may need to supplement the data they provide so that each event is tied to one finite point in time.
A consistent and secure means of defining and resolving identity is a must.
There are many possible forms of identity. In the United States, we might think of personal identity as being defined by your name and Social Security number. A pseudonymous system might identify you by a randomly generated unique identifier. The scope of identity is not limited to an individual; it can also encompass an identified or anonymous group or cohort. When designing or integrating with a system, it is crucial to establish concepts of identity up front. The domain of the data can affect what kind of identity is needed; for instance, in education, a group or class is likely important. At the outset of any project that will use identified data, consider what identity means in your dataset at the individual and group level, and what is needed to resolve it.
We've established that clear identity is critical to actionable data. But how you manage identity has a huge impact on how secure and consistent your systems are.
Much in the way that retailers maintain a clear separation between internal systems and financial systems that touch your personal financial information, it's more secure, efficient, and consistent to store sensitive data in a separate system.
Avoid packaging identity in each stored data statement; instead, store a reference to an identity held in a secure system. This way, you avoid directly exposing people's personally identifiable information (PII) to anyone who has access to a dataset. For example, an actor within an xAPI statement can be identified using a trusted external system's unique internal identifier for that person. This is accomplished by using the account Inverse Functional Identifier (IFI). The only way to convert this opaque identifier into PII is to have access to both the data statement and the secondary system. An additional benefit of storing identity this way is that it reduces the likelihood of conflicting identities. Your system is not trying to assert identity; it is relying on a trusted source to provide that information. On top of all that, your system is now not storing any sensitive PII at all!
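To illustrate, here is a rough sketch of the split. The actor shape, an account with homePage and name, follows the xAPI account IFI; everything else (the in-memory identity store, the function names, the example.com home page, and the sample person) is hypothetical and stands in for a real, secured identity system.

```python
import uuid

# Hypothetical identity service: the only place PII is kept
_identity_store = {}


def register_person(name: str, email: str) -> str:
    """Store PII in the identity system and hand back an opaque identifier."""
    opaque_id = str(uuid.uuid4())
    _identity_store[opaque_id] = {"name": name, "email": email}
    return opaque_id


def make_actor(opaque_id: str, home_page: str = "https://identity.example.com") -> dict:
    """Build an xAPI-style actor that references the person without naming them."""
    return {"objectType": "Agent", "account": {"homePage": home_page, "name": opaque_id}}


def resolve(actor: dict) -> dict:
    """Only callers with access to the identity system can recover the PII."""
    return _identity_store[actor["account"]["name"]]


person_id = register_person("Ada Example", "ada@example.com")
actor = make_actor(person_id)
```

The actor dictionary is what travels with each statement. Someone holding only the dataset sees a random identifier; resolving it requires access to the second system as well.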
While some systems do not track identity — and for good reason — data without identity provides a much lower resolution view of activity and behavior. In a corporate setting, for instance, identity would let you track behaviors of groups and teams, showing what content is most accessed by different groups and what kinds of learning patterns different employees demonstrate. If all the pattern data you have is limited to broad categorization, you will never be able to build dynamic personalization engines, automated career path trees, etc.
We assume that all data has meaning. We value data and we pay for disk space and infrastructure because we believe that the data we store has meaning and therefore value.
But resolving meaning within stored data can be messy if no systems exist to establish canonical, resolvable definitions for data.
Consider a single word: "watched." This verb has contextual meaning that is not embedded in the word alone, that we as humans in verbal communication would infer from a variety of time and place and interaction specific elements. We would understand intuitively the difference between watching a video and watching an event in person. But when we attempt to record this event and store the word "watched" as a piece of data, all that real-world context is missing.
We could pack all the necessary elements into the event data, but packing super-detailed information into every data point takes up a ton of space. You end up duplicating encyclopedic volumes of information for each data point, which means your costs to store all that data will quickly become unmanageable.
Much as our identity management system allows us to establish consistent, resolvable identity, we need systems that allow us to store and resolve canonical meaning for words and data. Canonical meaning systems allow us to be fully specific about the meaning of a term in a given context and free us from the burden of storing all the definitional context in each piece of event data.
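A minimal sketch of the idea, using the "watched" example: each event stores only a URI, and the definition lives once in a shared vocabulary. The example.com URIs, the definitions, and the actor identifiers are all made up for illustration.

```python
# Hypothetical vocabulary: each URI names exactly one sense of "watched"
VOCABULARY = {
    "https://vocab.example.com/verbs/watched-video": "Viewed a recorded video.",
    "https://vocab.example.com/verbs/watched-live": "Observed an event in person.",
}

# Events carry a reference to the meaning, not the meaning itself
events = [
    {"actor": "a1", "verb": "https://vocab.example.com/verbs/watched-video"},
    {"actor": "b2", "verb": "https://vocab.example.com/verbs/watched-live"},
]


def meaning_of(event: dict) -> str:
    """Look up the canonical definition that the event's verb URI points to."""
    return VOCABULARY[event["verb"]]


meanings = [meaning_of(e) for e in events]
```

Both events would read as "watched" to a human, yet each resolves to a distinct definition, and the definitional context is stored once rather than in every data point.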
A prerequisite for any system that provides canonical definitions of meaning and pathways to those definitions is that each pathway must be unique and follow a standardized format. For example, dictionary citations allow someone to find a word and its associated meaning. If we were to reference the word "crane" as defined within a traditional print dictionary, we would need to provide a reference to our source to ensure the usage of the word is not misinterpreted: the bird, "Crane [Def. 1]. (2005). In Webster's II New College Dictionary (3rd ed., p. 269). Boston, MA: Houghton Mifflin." versus the machine, "Crane [Def. 3]. (2005). In Webster's II New College Dictionary (3rd ed., p. 269). Boston, MA: Houghton Mifflin." These citations provide the necessary lookup information so that anyone can come along and find the intended meaning; we have the name of the source, the word in question, the definition number of the word, the year in which this definition was asserted, and the page number. As long as we have access to the source material, we will always be able to find the intended meaning due to the uniqueness of the citation. When it comes to referencing online definitions, we provide another form of citation: a URI.
In order to translate the reliability of a print citation to an online citation, the same principles that guarantee uniqueness must be followed for a URI. This seems simple enough: I provide a web address at which my definition is located and then I have access to the referenced material. But what happens when the content at that address is overwritten, updated, or deleted? Well, now, we have a disconnect between our citation and the material being cited. To avoid this, a system hosting canonical definitions must structure its URIs in a way similar to traditional print citations; the URI must contain enough information to guarantee uniqueness. Unique URIs are not enough, though. The additional constraint on online canonical systems is that the information located at a unique URI must never change.
Historically, when a publisher needed to update material, they wouldn't recall all copies of the previous work, make their updates, and then re-release the text; they would leave the old text alone and publish an updated version with its own unique, identifying properties. This same process needs to be followed by an online system providing canonical definitions; otherwise, the reliability and usefulness of any reference is lost.
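The publish-a-new-edition rule can be sketched as a registry that never overwrites. The ImmutableRegistry class, its URI layout with a version segment, and the sample definitions are assumptions made for this example, not a description of any particular vocabulary server.

```python
class ImmutableRegistry:
    """Stores definitions under versioned URIs that can never be overwritten."""

    def __init__(self, base: str = "https://vocab.example.com"):
        self._base = base
        self._entries = {}

    def publish(self, term: str, version: int, definition: str) -> str:
        """Add a definition at a new URI; refuse to touch an existing one."""
        uri = f"{self._base}/{term}/v{version}"
        if uri in self._entries:
            raise ValueError(f"{uri} is already published and cannot change")
        self._entries[uri] = definition
        return uri

    def resolve(self, uri: str) -> str:
        return self._entries[uri]


registry = ImmutableRegistry()
v1 = registry.publish("crane", 1, "A large wading bird.")
# An update is a new edition at a new URI; the old one stays as it was.
v2 = registry.publish("crane", 2, "A large, long-legged wading bird.")
```

Data that cited v1 keeps resolving to exactly the text it cited, while new data can point at v2.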
The reliability of a canonical system is paramount when considering streams of big data. Imagine you set up a stream of data from one system to another. In order to maximize efficiency, you would only send the information necessary for the receiving system to interpret the data it is consuming. This data will most likely just be a collection of references the receiving system can then use to look up the information which provides the contextual information you didn't want to send over the wire. Now, imagine those references were broken and when resolved, actually referenced nothing or, even worse, inaccurate information. You have now either lost critical data or are using inaccurate data as the basis for data-driven decisions. In the case of receiving inaccurate data, this can be a hard thing to catch, as it requires someone familiar with the data domain on both ends of the stream to analyze the data flow and say, "Hey, something's not right here." In some cases, this realization is only possible if the analyst is familiar with what the data is supposed to mean without an accurate reference.
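One possible safeguard, offered here as our own assumption rather than an established practice: the sender attaches a SHA-256 fingerprint of the definition it meant, so the receiver can tell a reference that resolves to nothing from one that resolves to altered content. The reference shape, function names, and sample definitions are hypothetical.

```python
import hashlib


def fingerprint(definition: str) -> str:
    """SHA-256 hex digest of a definition's text."""
    return hashlib.sha256(definition.encode("utf-8")).hexdigest()


def check_reference(reference: dict, definitions: dict) -> str:
    """Classify a {'uri', 'sha256'} reference as 'ok', 'missing', or 'changed'."""
    definition = definitions.get(reference["uri"])
    if definition is None:
        return "missing"
    if fingerprint(definition) != reference["sha256"]:
        return "changed"
    return "ok"


original = "Viewed a recorded video."
uri = "https://vocab.example.com/verbs/watched/v1"
sender_ref = {"uri": uri, "sha256": fingerprint(original)}

statuses = {
    "intact": check_reference(sender_ref, {uri: original}),
    "deleted": check_reference(sender_ref, {}),
    "overwritten": check_reference(sender_ref, {uri: "Attended in person."}),
}
```

A check like this only detects that a definition changed after the reference was created. It cannot tell whether the original definition was the right one, so a domain expert's judgment still matters.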
Data doesn't do you any good if you can't see it. Data you can see is only useful if you can drill down and explore it in ways not strictly predetermined by a system's designer.
Your ability to actually access the insights in data can be affected in a number of ways, including by the format of your data. It could be that your data is in some artisanal format; all the data points are there, but it's hard to access. Or your data could be in a legacy format, again, making it difficult to access.
Technology offers a lot of help with this. Mechanically, we can get you access to almost any kind of data. But even if you can see inside data stored in an old, forgotten format, the format itself obscures the content. The signal-to-noise ratio in this scenario is not favorable. So typically, we do quite a lot of work normalizing data and putting it into databases to get it to a point where the contents are accessible to us. But ultimately, our relationship with the data relies not just on accessibility, but on observability and explorability.
Different kinds of databases let you explore data to different extents. For instance, a local credit union's online portal may only show you the last few days of your transactions; this is not a fully explorable, observable dataset. Observability would be a full ledger of all your activities with the bank. There are no regulations saying the bank can't show you this; they just haven't chosen to implement it as a tool for their users.
Another example would be a visualization that gives an interesting insight but updates infrequently; this is the result of a less observable dataset. Visualizations like the ones we build, which update live with the data and can be configured for intuitive exploration, make a dataset more observable.
Published at DZone with permission of Milt Reder, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.