Effective Data Management
We discuss several aspects of working with big data sets, such as data management, data mining, data warehousing, and more.
The most useful analytics come from data that is stored properly, categorized correctly, and mined thoroughly. To effectively store and use the data your business collects, you must first incorporate the following aspects.
Over the last half-century, data management has significantly changed how data is organized for processing by a computer. Today, data can be stored non-sequentially and still be used effectively. Proper data management remains as useful as ever, as its principles extend far beyond how data is stored.
Data must be verified before it can be used, and there needs to be a built-in timeline for the life of this data. Survey and customer data need to be checked for outliers and incorrect entries.
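One common way to check for outliers is the interquartile range (IQR) rule. The sketch below is only an illustration of that rule, not a prescribed method: the function name flag_outliers, the multiplier k, and the sample ages are all hypothetical.

```python
from statistics import quantiles

def flag_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical survey responses: ages, with one mistyped entry (390).
ages = [34, 29, 41, 38, 27, 45, 33, 390, 36, 31]
suspect = flag_outliers(ages)
print(suspect)
```

Flagged values are candidates for review rather than automatic deletion, since an unusual entry is not always an incorrect one.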
Data eventually becomes irrelevant as consumer demands, demographics, and offered products change. Expect that data gained from your business will have a finite lifespan.
How you store this data, and how useful it is, comes down to how you categorize it and the time scale you use to measure it. For example, with wages and revenue, we try to work in small time scales, whereas for overhead costs, we work in month-long time scales.
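To compare figures recorded on different time scales, the finer-grained data can be rolled up to the coarser one. Here is a minimal sketch, assuming revenue is recorded as (date, amount) pairs; the function name monthly_totals and the sample figures are hypothetical.

```python
from collections import defaultdict
from datetime import date

def monthly_totals(daily):
    """Sum (date, amount) pairs into totals keyed by (year, month)."""
    totals = defaultdict(float)
    for day, amount in daily:
        totals[(day.year, day.month)] += amount
    return dict(totals)

# Hypothetical daily revenue figures.
daily_revenue = [
    (date(2024, 1, 30), 100.0),
    (date(2024, 1, 31), 50.0),
    (date(2024, 2, 1), 25.0),
]
print(monthly_totals(daily_revenue))
```

Once revenue is expressed per month, it can be set against month-long overhead costs directly.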
Depending on the scale of your business and what your data management needs are, data mining will range in complexity. With data mining, we gain insight into large data sets by running a series of examinations on the data to make sense of emerging patterns, or the lack thereof.
In its simplest form, this can be running a regression analysis on two large data sets and searching for a correlation. Data mining is commonly mistaken for looking for useful data in information that is already stored; however, what is really being mined is the patterns in, and the significance of, large data sets.
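The simplest case described above, a regression on two data sets plus a check for correlation, can be sketched in a few lines. This is a minimal illustration using ordinary least squares and the Pearson coefficient; the function names and the ad spend and sales figures are hypothetical.

```python
from math import sqrt

def _centered_sums(xs, ys):
    """Return (covariance sum, x variance sum, y variance sum, mean_x, mean_y)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov, var_x, var_y, mean_x, mean_y

def pearson_r(xs, ys):
    """Pearson correlation coefficient, between -1 and 1."""
    cov, var_x, var_y, _, _ = _centered_sums(xs, ys)
    return cov / sqrt(var_x * var_y)

def least_squares(xs, ys):
    """Slope and intercept of the best-fit line y = slope * x + intercept."""
    cov, var_x, _, mean_x, mean_y = _centered_sums(xs, ys)
    slope = cov / var_x
    return slope, mean_y - slope * mean_x

# Hypothetical data: advertising spend and units sold per period.
ad_spend = [10, 20, 30, 40, 50]
units_sold = [25, 41, 62, 79, 105]
slope, intercept = least_squares(ad_spend, units_sold)
r = pearson_r(ad_spend, units_sold)
print(slope, intercept, r)
```

A coefficient near 1 or -1 suggests a strong linear relationship, while a value near 0 is itself a finding: the lack of a pattern.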
Combining data sets so that they can be analyzed as a whole is referred to as data integration. In business terms, it is most commonly used during company mergers and acquisitions. In these situations, large troves of data exist from two companies that provide similar services and products.
To get the most out of all of this data, the data needs to be combined, removing data that is not relevant to both sets. For businesses outside of mergers, data integration can also come in the form of using data from similar firms. Some software packages, such as ClearStory Data, will integrate data from other businesses and public records.
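Combining two sets while removing data that is not relevant to both might look like the following. This is a sketch under the assumption that each company's records are dictionaries and that "relevant to both" means fields present in every record; the function name integrate and the sample records are hypothetical.

```python
def integrate(records_a, records_b):
    """Combine two record lists, keeping only fields found in every record."""
    all_records = records_a + records_b
    if not all_records:
        return []
    shared = set(all_records[0])
    for record in all_records:
        shared &= set(record)  # drop fields missing from any record
    return [{field: record[field] for field in sorted(shared)}
            for record in all_records]

# Hypothetical customer records from two merging companies.
company_a = [{"name": "Ann", "email": "ann@example.com", "loyalty_tier": "gold"}]
company_b = [{"name": "Bob", "email": "bob@example.com", "region": "west"}]
combined = integrate(company_a, company_b)
print(combined)
```

In practice the two companies rarely use identical field names, so a mapping step usually comes before this one.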
Data warehousing refers to bringing together, and keeping records of the analysis of, data from sources that have no immediate relation to one another. For small businesses, this can take the form of using public records to bolster analysis of a particular topic.
For larger businesses, this takes the form of using data from large firms that specialize in different disciplines. While the data sets are not directly related, similarities in the size of the firms or in how the data can be used make this a necessary step in data synthesis.
Data management, data mining, data integration, and data warehousing work together to form the types of analysis that benefit businesses the most.
Each component is necessary for different forms of analysis. Starting with data management, the verification and categorization of datasets make the data usable for a business. Data mining is a repeated step and is the scanning of data for useful patterns and statistics. This is done through regressions among data sets and other statistical data to find emerging patterns that describe the data as a whole.
Data integration is the combining of data sets across multiple businesses, bolstering the data that can be mined from any one source. Data warehousing is the merging of data sets that are not related to each other for types of analysis that cannot be done on any one data set. For data warehousing, imagine a company using its own records for sales of a product or service, and correlating this with data from public records on public transportation pickup locations.
The two data sets have nothing in common but can be combined if the goal is to relate sales of a product to a demographic that uses this form of transportation. These data sets together fall into the category of data management, and can then be used for further mining or data integration.
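The sales and transportation example can be sketched as follows. It assumes both data sets can be keyed by postal code, which the example above does not specify; the field names zip and units and the function name sales_vs_stops are hypothetical.

```python
from collections import Counter

def sales_vs_stops(sales, stops):
    """Map each postal code to (units sold, number of pickup locations)."""
    units_by_zip = Counter()
    for sale in sales:
        units_by_zip[sale["zip"]] += sale["units"]
    stops_by_zip = Counter(stop["zip"] for stop in stops)
    all_zips = sorted(set(units_by_zip) | set(stops_by_zip))
    # A Counter returns 0 for a postal code it has never seen.
    return {z: (units_by_zip[z], stops_by_zip[z]) for z in all_zips}

# Hypothetical company sales records and public transit records.
sales = [
    {"zip": "10001", "units": 3},
    {"zip": "10001", "units": 2},
    {"zip": "10002", "units": 1},
]
stops = [{"zip": "10001"}, {"zip": "10003"}]
print(sales_vs_stops(sales, stops))
```

The paired figures can then be mined like any other data set, for instance by checking whether sales correlate with the number of pickup locations.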
How to Handle and Manage Big Data
Regardless of the size of your business, there are several principles necessary to get the most out of your data. A key challenge, depending on the size of your firm, is the exponential increase in data that is collected and processed.
Handling such large quantities of data and processing it efficiently can be a challenge, but following these rules will aid you in this process:
You must store data in a central location that can be accessed and processed through multiple sources.
Data must be sifted through to remove common duplicates, particularly after data integration.
Data must be protected and secured, wiping customer footprints off personal data when possible.
Depending on the amount of data, it may be best to use a third party such as Amazon, or to back up data yourself, to prevent the loss of information. Sensitive data must be disposed of at regular intervals, and customers must be made aware of data retention.
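The rule about removing duplicates after data integration can be illustrated with a short sketch. It assumes that two records describe the same customer when their email addresses match after trimming whitespace and ignoring case; the function name deduplicate and the choice of key field are hypothetical.

```python
def deduplicate(records, key_fields=("email",)):
    """Keep the first record seen for each normalized key."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(str(record.get(field, "")).strip().lower()
                    for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

# Hypothetical records after merging two customer lists.
merged = [
    {"name": "Ann", "email": "ann@example.com"},
    {"name": "Ann B.", "email": " ANN@example.com "},
    {"name": "Bob", "email": "bob@example.com"},
]
print(deduplicate(merged))
```

Keeping the first record is the simplest policy. A real cleanup may instead need to merge the fields of matching records.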
Depending on your business, each and every one of these principles will range in necessity and feasibility. It's most important that data be kept in a state where it can be processed by multiple programs, and the best way to do this is the proper categorization of data, as well as a standardization of data retrieval.
Disposing of customer data after a given time is necessary for all businesses, but equally important is wiping the customer footprint from said data. This means turning data with identifiable information into metadata that can be used, but not traced back to individuals.
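Both ideas, timed disposal and wiping the customer footprint, can be sketched together. This is an illustration only: replacing identifiers with a keyed hash is pseudonymization, which is weaker than full anonymization, and the field names, the 365-day retention period, and the function names are all hypothetical.

```python
import hashlib
import hmac
from datetime import date, timedelta

def strip_footprint(record, secret_key, identifying=("name", "email")):
    """Drop identifying fields and add an opaque token derived from the email."""
    email = record["email"].strip().lower().encode()
    token = hmac.new(secret_key, email, hashlib.sha256).hexdigest()[:16]
    cleaned = {k: v for k, v in record.items() if k not in identifying}
    cleaned["customer_token"] = token
    return cleaned

def within_retention(records, today, max_age_days=365):
    """Keep only records collected within the retention period."""
    cutoff = today - timedelta(days=max_age_days)
    return [r for r in records if r["collected"] >= cutoff]

# Hypothetical purchase records.
records = [
    {"name": "Ann", "email": "ann@example.com", "spend": 120,
     "collected": date(2024, 5, 1)},
    {"name": "Bob", "email": "bob@example.com", "spend": 80,
     "collected": date(2022, 1, 15)},
]
key = b"replace-with-a-real-secret"  # assumption: stored outside the data set
kept = within_retention(records, today=date(2024, 6, 1))
usable = [strip_footprint(r, key) for r in kept]
print(usable)
```

The same customer always receives the same token, so purchases can still be grouped per customer, while the token cannot be turned back into an email address without the key.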
The security of your data also depends greatly on the size of your firm. A very small business can store data locally, and even back up this data itself. Storage is cheap enough that this is a reasonable solution for many businesses. When going this route, access to where this data is stored must be handled responsibly.
Granting access to this data to multiple users across a network will likely be a necessity, but ensuring the data repository is kept on a secure computer is critical. Cloud services solve data storage and access for a great many larger firms. Backing up this data locally is an option, as is paying additional fees for duplicates in case of a cloud server malfunction. The options available will depend on the company from which you purchase cloud services.
Amazon backs up data up to a certain size, and additional backups can be created for a fee. Microsoft, as well as several other companies, has a competing service. The best option for your business will be a function of data quantity, security needs, and how many users need regular access to data.
Opinions expressed by DZone contributors are their own.