
Application Data Value Characteristics: Everything Is Not The Same

In the first post of a five-part series, we look at general application server storage I/O characteristics that have an impact on data value as well as access.


This is part one of a five-part mini-series looking at Application Data Value Characteristics: Everything Is Not The Same as a companion excerpt from Chapter 2 of my new book, Software-Defined Data Infrastructure Essentials: Cloud, Converged, and Virtual Fundamental Server Storage I/O Tradecraft (CRC Press 2017), available at Amazon.com and other global venues.

In this post, we start things off by looking at general application server storage I/O characteristics that have an impact on data value as well as access.

Everything is not the same across different organizations, including IT data centers, data infrastructures, and the applications and data they support. For example, so-called big data can comprise many small files, objects, blobs, or data and bit streams representing telemetry, clickstream analytics, and logs, among other information.

Keep in mind that applications impact how data is accessed, used, processed, moved, and stored. What this means is that a focus on data value, access patterns, and other related topics needs to also consider application performance, availability, capacity, and economic (PACE) attributes.

If everything is not the same, why are so many applications, along with their data, treated the same from a PACE perspective?

Data infrastructure resources, including servers, storage, and networks, might be inexpensive; however, there is a cost to managing them along with the data itself.

Managing includes data protection (backup, restore, BC, DR, HA, security), along with other activities. Likewise, there is a cost to the software, along with cloud services, among others. By understanding how applications interact with and use data, smarter, more informed data management decisions can be made.

[Figure: IT applications and data infrastructure layers.]

Keep in mind that everything is not the same across various organizations, data centers, data infrastructures, data, and the applications that use them. Also keep in mind that programs (applications) = algorithms (code) + data structures (how data is defined, organized, structured, or unstructured).

There are traditional applications, along with those tied to the Internet of Things (IoT), artificial intelligence (AI) and machine learning (ML), big data, and other analytics, including real-time clickstream, media and entertainment, security and surveillance, and log and telemetry processing, among many others.

What this means is that there are many different applications with varying attributes, along with different resource needs (i.e., server compute, I/O, network, memory, and storage) and service requirements.

Common Application Characteristics

Different applications will have various attributes, as well as different ways in which they are used; for example, database transaction activity vs. reporting or analytics; logs and journals vs. redo logs; indices; tables; import/export; and scratch and temp space.

[Figure: Application PACE attributes (via Software-Defined Data Infrastructure Essentials).]

All applications have PACE attributes; however:

  • PACE attributes vary by application and usage.
  • Some applications and their data are more active than others.
  • PACE characteristics may vary within different parts of an application.

Think of the PACE attributes associated with an application as its personality: how it behaves, what it does, how and when it does it, along with its value, benefit, and cost, as well as its quality-of-service (QoS) attributes.
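As a rough sketch of that idea, an application's PACE personality could be captured in a simple record. The following is hypothetical Python (the PaceProfile class, its field names, and the sample numbers are illustrative, not from the book):

    from dataclasses import dataclass

    @dataclass
    class PaceProfile:
        """Hypothetical record of an application's PACE personality."""
        name: str
        performance_iops: int      # P: expected activity rate
        availability_pct: float    # A: availability target, e.g., 99.99
        capacity_gb: int           # C: space the application occupies
        monthly_budget_usd: float  # E: economic constraint

    # Two applications with very different personalities:
    oltp = PaceProfile("order-db", performance_iops=50_000,
                       availability_pct=99.99, capacity_gb=500,
                       monthly_budget_usd=4_000)
    archive = PaceProfile("cold-archive", performance_iops=50,
                          availability_pct=99.0, capacity_gb=200_000,
                          monthly_budget_usd=1_500)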

Understanding applications in different environments, including data values and associated PACE attributes, is essential for making informed server, storage, I/O, and data infrastructure decisions. Data infrastructure decisions range from configuration to acquisitions or upgrades; when, where, why, and how to protect; and how to optimize performance, including capacity planning, reporting, and troubleshooting, not to mention addressing budget concerns.

Primary PACE attributes for active and inactive applications and data are:

  • P: Performance and activity (how things get used).
  • A: Availability and durability (resiliency and data protection).
  • C: Capacity and space (what things use or occupy).
  • E: Economics and energy (people, budgets, and other barriers).

Some applications need more performance (i.e., server compute, storage, and network I/O), while others need more space capacity (i.e., storage, memory, network, I/O connectivity). Likewise, some applications have different availability needs (i.e., data protection, durability, security, resiliency, backup, business continuity, disaster recovery) that determine the tools, technologies, and techniques to use.

Budgets are also nearly always a concern, which, for some applications, means enabling more performance per cost while others are focused on maximizing space capacity and protection level per cost. PACE attributes also define or influence policies for QoS (i.e. performance, availability, capacity), as well as thresholds, limits, quotas, retention, and disposition.
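Continuing the hypothetical PaceProfile sketch above, PACE attributes could drive such policies programmatically; the tiering rule below is purely illustrative:

    def qos_policy(profile: PaceProfile) -> dict:
        """Toy example: derive QoS thresholds and quotas from PACE
        attributes. The 10,000 IOPS tiering cutoff is made up."""
        tier = "performance" if profile.performance_iops > 10_000 else "capacity"
        return {
            "tier": tier,
            "iops_limit": profile.performance_iops * 2,  # headroom quota
            "capacity_quota_gb": profile.capacity_gb,
            "availability_target_pct": profile.availability_pct,
        }

    print(qos_policy(oltp))     # lands on the performance tier
    print(qos_policy(archive))  # lands on the capacity tier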

Performance and Activity (How Resources Get Used)

Some applications or components that comprise a larger solution will have more performance demands than others. Likewise, the performance characteristics of applications along with their associated data will also vary. Performance applies to the server, storage, and I/O networking hardware, along with associated software and applications.

For servers, performance is focused on how much CPU or processor time is used, along with memory and I/O operations. For I/O operations that create, read, update, or delete (CRUD) data, considerations include the activity rate (frequency or data velocity) of I/O operations per second (IOPS), the volume or amount of data being moved (bandwidth, throughput, transfer), response time or latency, and queue depths.

Activity is the amount of work to do, or being done, in a given amount of time (seconds, minutes, hours, days, weeks), which can be expressed as transactions, rates, or IOPS. Additional performance considerations include latency, bandwidth, throughput, response time, queues, reads or writes, gets or puts, updates, lists, directories, searches, page views, files opened, videos viewed, or downloads.

Server, storage, and I/O network performance include the following (a measurement sketch follows this list):

  • Processor CPU usage time and queues (user and system overhead).
  • Memory usage effectiveness, including page and swap.
  • I/O activity, including between servers and storage.
  • Errors, retransmission, retries, and rebuilds.
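As a minimal sketch of gathering some of these metrics on a live host, the third-party psutil package (an assumption here, not something the book prescribes; install with pip install psutil) exposes CPU, memory, and disk I/O counters:

    import psutil  # third-party: pip install psutil

    # Sample disk I/O counters over a one-second window.
    before = psutil.disk_io_counters()
    cpu = psutil.cpu_times_percent(interval=1.0)  # blocks for the window
    after = psutil.disk_io_counters()

    iops = ((after.read_count - before.read_count) +
            (after.write_count - before.write_count))
    mb_per_s = ((after.read_bytes - before.read_bytes) +
                (after.write_bytes - before.write_bytes)) / 1e6

    print(f"CPU user/system: {cpu.user:.1f}% / {cpu.system:.1f}%")
    print(f"Memory used: {psutil.virtual_memory().percent:.1f}%")
    print(f"Disk activity: {iops} IOPS, {mb_per_s:.2f} MB/s")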

The following figure shows a generic performance example of data being accessed (i.e., mixed reads, writes, random, sequential, big, small, low and high latency) on a local and a remote basis. The example shows how, for a given time interval (lower right), applications access and work with data via different data streams (left center). Also shown are queues and I/O handling, along with end-to-end (E2E) response time.

[Figure: Server I/O performance fundamentals (via Software-Defined Data Infrastructure Essentials).]

Also shown on the left in the above figure is an example of E2E response time from the application through the various data infrastructure layers, as well as, lower center, the response time from the server to the memory or storage devices.

Various queues are shown in the middle of the above figure; they indicate how much work is occurring and whether processing is keeping up with the work or backlogs are forming. Context is needed for queues, as they exist in the server, I/O networking devices, and software drivers, as well as in storage, among other locations.
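One standard way to relate the queues in the figure to activity and response time is Little's Law: the average number of outstanding items equals the arrival rate multiplied by the average time in the system. A quick check with made-up numbers:

    # Little's Law: queue depth L = arrival rate (lambda) x time in system (W).
    iops = 20_000            # lambda: I/O arrivals per second (made up)
    response_time_s = 0.002  # W: 2 ms average response time (made up)

    queue_depth = iops * response_time_s
    print(f"Implied average outstanding I/Os: {queue_depth:.0f}")  # 40

    # At the same arrival rate, a 10 ms response time implies 200
    # outstanding I/Os: a sign of a building backlog.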

Some basic server, storage, and I/O metrics that matter include the following (a worked example follows this list):

  • Queue depth of I/Os waiting to be processed and concurrency.
  • CPU and memory usage to process I/Os.
  • I/O size, or how much data can be moved in a given operation.
  • I/O activity rate, or IOPS = (amount of data moved / I/O size) per unit of time.
  • Bandwidth = data moved per unit of time = I/O size x I/O rate.
  • Latency usually increases with larger I/O sizes and decreases with smaller requests.
  • I/O rates usually increase with smaller I/O sizes and vice versa.
  • Bandwidth increases with larger I/O sizes and vice versa.
  • Sequential stream access data may have better performance than some random access data.
  • Not all data access is conducive to sequential streaming; some access is inherently random.
  • Lower response time is better; higher activity rates and bandwidth are better.
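As a worked example of the size, rate, and bandwidth relationships above (with illustrative numbers only):

    # Bandwidth = I/O size x I/O rate; numbers below are made up.
    io_size_kb, iops = 8, 40_000          # small, random-style I/O
    bandwidth_mb_s = iops * io_size_kb / 1024              # ~312 MB/s

    io_size_kb_big, iops_big = 1024, 400  # large, sequential-style I/O
    bandwidth_big_mb_s = iops_big * io_size_kb_big / 1024  # 400 MB/s

    # High IOPS does not automatically mean high bandwidth, or vice versa.
    print(f"Small I/O: {iops} IOPS -> {bandwidth_mb_s:.0f} MB/s")
    print(f"Large I/O: {iops_big} IOPS -> {bandwidth_big_mb_s:.0f} MB/s")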

Queues with high latency and small I/O sizes or rates could indicate a performance bottleneck. Queues with low latency and high I/O rates, with good bandwidth or data being moved, could be a good thing. An important note is to look at several metrics, not just IOPS, activity, bandwidth, queues, or response time. Also, keep in mind that the metrics that matter for your environment may be different from those for somebody else.

Something to keep in perspective is that there can be a large amount of data with low performance, or a small amount of data with high performance, not to mention many other variations. The important thing to note is that as space capacity scales, that does not mean that performance also improves or vice versa; after all, everything is not the same.

Where to Learn More

Learn more about application data value, application characteristics, PACE, data protection, software-defined data centers (SDDC), software-defined data infrastructures (SDDI), and related topics in Software-Defined Data Infrastructure Essentials and in the rest of this series.

