Principles of Data Analysis for Beginners
Looking to get into the data analysis field? Read on for an overview of the skills required to up your data game!
Data analysis is a huge undertaking, at times abstract and heavily dependent on experience. This article summarizes what I have learned from studying and practicing data science. I hope to offer a general approach to data analysis and to introduce the relevant algorithms and their application scenarios at each step. The algorithms themselves are covered only at a shallow level.
This article is intended for readers who are new to data analysis or who face a pile of data and don't know where to start. The ideas presented here are limited by my own experience and knowledge, so I hope readers will treat them as a reference and adapt them sensibly in their own analyses.
Before doing any data analysis, you should first prepare the following:
1. Be familiar with the business, understand the source of the data
This is the premise of data analysis. Beyond the data itself, data analysis deals with the various business services hidden behind the data. For example, a user's consumption record may reflect not only goods purchased through the cash register system, but also a spend-threshold ("full reduction") discount from the membership system, a launch-discount product from the activity management system, or a suggestion from the recommendation system. An in-depth understanding of the business helps you identify the dimensions of the analysis and quickly pinpoint problems and their causes.
2. Clarify the purpose of the analysis
Data analysis is not an accumulation of models, algorithms, and visualizations, but the purposeful discovery of phenomena that support specific decisions. Therefore, before starting, we must clearly define the purpose of the analysis. Avoid copying the analysis content of other projects or randomly combining whatever models and algorithms are on hand, since this tends to produce analysis for its own sake rather than results that serve a decision.
3. Multi-angle observation
To reach the goal of an analysis, you need to observe the data from multiple perspectives. This gives you a comprehensive understanding of the data as a whole and helps you discover potential new insights. For example, when we need to find potential members, the most direct way is, of course, to look at the people who use our services heavily but are not members. From the perspective of promotional activities, however, those who are keen to buy discounted goods are also potential members, because they would get more discounts after joining. Likewise, from the perspective of the recommendation system, those who are satisfied with the recommended products will be more likely to join the membership program.
After getting ready, let's get to the point and start analyzing.
1. What Is Data Analysis?
Data analysis must be targeted at certain objects and the first thing to do is to describe this object through data.
Statistics is the most straightforward method, and it is also very simple to apply. Common measures include the sum, average, maximum and minimum, median, variance, growth rate, category proportions, distribution, and frequency. They need little introduction here.
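As a minimal sketch of the measures listed above, the following uses Python's standard library on a made-up list of daily order amounts. The variable names and the numbers are assumptions for illustration only.

```python
import statistics
from collections import Counter

# Hypothetical daily order amounts for one store (made-up numbers).
amounts = [120.0, 80.0, 150.0, 80.0, 200.0, 95.0, 110.0]

summary = {
    "sum": sum(amounts),
    "mean": statistics.mean(amounts),
    "median": statistics.median(amounts),
    "min": min(amounts),
    "max": max(amounts),
    "variance": statistics.pvariance(amounts),  # population variance
}

# Growth rate of the last day relative to the first day.
growth_rate = (amounts[-1] - amounts[0]) / amounts[0]

# Frequency of each distinct amount.
frequency = Counter(amounts)

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
print(f"growth rate: {growth_rate:.2%}")
print(f"most common amount: {frequency.most_common(1)[0][0]}")
```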
"Objects are clustered, people are grouped." Clustering is a form of unsupervised learning that divides a set of data into multiple categories. The data inside each category is similar, while different categories differ from one another. Clustering helps to discover the characteristics of the data distribution and can greatly reduce the amount of data to analyze. For example, in trajectory analysis and prediction, clustering may show that a person mainly appears in three places: around the dormitory, around the canteen, and around the teaching building. When we predict where that person is, the analysis of latitude and longitude coordinates then becomes an analysis of three locations.
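The trajectory example above can be sketched with k-means, one common clustering algorithm (the article does not name a specific one, so this choice is an assumption). The coordinates are made up, and the farthest-point initialization is used here only to keep the sketch deterministic.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means on tuples of coordinates; returns (centroids, labels)."""
    # Farthest-point initialisation: start from the first point, then keep
    # adding the point that is farthest from all centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(farthest)

    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(axis) / len(cluster) for axis in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break
        centroids = new_centroids

    labels = [min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points]
    return centroids, labels

# Hypothetical position samples near a dormitory, a canteen, and a teaching building.
points = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.2),     # around the dormitory
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),     # around the canteen
    (10.0, 0.1), (10.1, 0.0), (9.9, 0.2),   # around the teaching building
]
centroids, labels = kmeans(points, k=3)
print(centroids)
print(labels)
```

After clustering, each sample can be replaced by its cluster label, so nine coordinate pairs reduce to three locations.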
Feature engineering is a very large topic. As the saying goes, data and features determine the upper limit of machine learning, and models and algorithms can only approach that limit. Feature engineering includes feature extraction and feature selection; because the algorithms are numerous and complex, they are not covered in detail here. Feature analysis begins by fixing the unit of analysis in time, space, and type. In trajectory prediction, for instance, analyzing a location every ten minutes is far more practical than analyzing latitude and longitude coordinates every second, while analyzing a location every hour is too coarse. Then comes feature extraction, for which there are many algorithms: linear methods such as PCA (principal component analysis), LDA (linear discriminant analysis), and ICA (independent component analysis); TF-IDF and expected cross entropy for text; and HOG and LBP for images. The main purpose of feature analysis is to reduce dimensionality and redundancy, and with them the cost of storage and computation.
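To illustrate the choice of analysis unit, here is a small sketch that collapses per-second position samples into ten-minute units by keeping the most frequent place in each window. The function name, the `(timestamp, place)` sample format, and the data are assumptions made for this example.

```python
from collections import Counter, defaultdict

def to_ten_minute_units(samples, window_seconds=600):
    """Map each time window to the place seen most often within it."""
    buckets = defaultdict(list)
    for timestamp, place in samples:
        buckets[timestamp // window_seconds].append(place)
    return {
        window: Counter(places).most_common(1)[0][0]
        for window, places in sorted(buckets.items())
    }

# Hypothetical per-second samples: 700 seconds at the dormitory, then the canteen.
samples = [(t, "dormitory" if t < 700 else "canteen") for t in range(1200)]

units = to_ten_minute_units(samples)
print(units)  # 1200 per-second rows become 2 ten-minute rows
```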
2. What Happened To The Data?
What happens to data is either normal or abnormal. We usually pay more attention to the exceptions, so I will focus on anomaly analysis. Analyzing what happened uses the same ideas and methods as describing the data; only the periods compared differ, such as the current month versus last month. Anomaly analysis has two main parts: discovering anomalies and pushing warnings. Pushing warnings is relatively simple, as long as you pay attention to the warning level and to who receives the warning. Discovering anomalies is harder: beyond the abnormalities that can be observed directly, you may need to pay more attention to the 'dark matter,' that is, phenomena and correlations that cannot be observed directly.
To judge whether something is abnormal, we usually define some coefficients according to the specific business and discover potential anomalies through sudden changes in those coefficients. These coefficients are especially important in trajectory analysis. For example, to analyze whether a person's trajectory is abnormal, we first check whether the person appears somewhere they have never been seen before. If not, the second step compares trajectory vectors. Suppose clustering shows that a schoolmaster mainly appears in the classroom, the library, and at home, and that the time spent at each place is assumed to be 8 hours a day, which forms the vector (8, 8, 8). If we then observe another vector, (2, 2, 20), we can detect the anomaly by calculating the distance between the two vectors, usually the Euclidean distance or the cosine distance.
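The vector comparison above can be sketched as follows. The vectors (8, 8, 8) and (2, 2, 20) come from the example; the function names are illustrative, and the threshold at which a distance counts as abnormal is a business decision that is not fixed here.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

usual = (8, 8, 8)      # hours in classroom, library, home on a typical day
observed = (2, 2, 20)  # hours on the day being checked

print(f"Euclidean distance: {euclidean_distance(usual, observed):.2f}")  # about 14.70
print(f"Cosine distance:    {cosine_distance(usual, observed):.2f}")     # about 0.31
```

Euclidean distance reacts to how many hours moved, while cosine distance reacts only to the change in proportions between places, so the two can be used together.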
3. Why Did It Happen?
Whenever something happens, we will ask why. Deep mining and the diagnosis of data are how we explore the why of a problem, and accurate problem diagnosis is conducive to making the right decision. Generally, the following methods can be used:
Year-Over-Year Trend Analysis
This is a very simple method: compare the current data with the same period in past cycles. It needs no further introduction here.
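A year-over-year comparison reduces to one line of arithmetic per period. The sketch below assumes monthly sales stored in dictionaries; the years, months, and figures are made up.

```python
def year_over_year(current, previous):
    """Relative change per period: (current - previous) / previous."""
    return {
        period: (current[period] - previous[period]) / previous[period]
        for period in current
        if period in previous and previous[period] != 0
    }

# Hypothetical monthly sales for two consecutive years.
sales_last_year = {"Jan": 100.0, "Feb": 120.0, "Mar": 90.0}
sales_this_year = {"Jan": 110.0, "Feb": 108.0, "Mar": 99.0}

growth = year_over_year(sales_this_year, sales_last_year)
for month, rate in growth.items():
    print(f"{month}: {rate:+.1%}")
```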
Drilling
Drilling is definitely the most common and effective way to find causes: peel back layer after layer until the root cause is found. While drilling down, we must pay attention to where and in which direction we drill, just as when digging a well; you do not get water by digging in an arbitrary direction. Take a decline in a mall's sales. To find the reasons for the decline, I would first look for the products whose sales fell the most. If, for example, coffee fell the most, we should then ask why coffee sales fell.
If we need a different strategy, such as looking for products that sold well in the past but now have very low sales, we can drill down through multiple levels: start by looking only at changes in broad categories, such as clothing and food, and then continue drilling down from the categories that changed the most.
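A two-level drill-down like the mall example can be sketched as follows: find the category with the largest drop, then repeat the same step inside that category. The record layout, the helper name `biggest_drop`, and all figures are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (category, product, last_month_sales, this_month_sales).
records = [
    ("diet", "coffee", 500, 200),
    ("diet", "tea", 300, 290),
    ("clothing", "shirts", 400, 380),
    ("clothing", "shoes", 350, 360),
]

def biggest_drop(rows, key):
    """Group rows by key(row) and return the (group, change) with the largest decline."""
    change = defaultdict(float)
    for row in rows:
        change[key(row)] += row[3] - row[2]
    return min(change.items(), key=lambda item: item[1])

# Level 1: which broad category lost the most sales?
top_category, category_change = biggest_drop(records, key=lambda r: r[0])

# Level 2: within that category, which product lost the most?
in_category = [r for r in records if r[0] == top_category]
top_product, product_change = biggest_drop(in_category, key=lambda r: r[1])

print(top_category, category_change)
print(top_product, product_change)
```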
Correlation Analysis
Correlation analysis examines the relationships between different features or data in order to discover the key influences and drivers of the business. Commonly used methods are covariance, correlation coefficients, regression, and information entropy. Correlation coefficients and regression can also be used for the predictions discussed below. Correlation is the premise of regression: the correlation coefficient indicates whether, and how strongly, two variables are related, while regression describes the form of that relationship. Both can be extended to the multivariate case, as canonical correlation analysis and multiple regression. A classic example is the "beer and diaper problem": if you want to know why beer sales increase, you can analyze their correlation with diaper sales.
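As a rough sketch of the first two methods, the code below computes the sample covariance and the Pearson correlation coefficient by hand. The weekly beer and diaper figures are invented so that the relationship is exactly linear; real sales data would give a coefficient below 1.

```python
import math

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def pearson(xs, ys):
    """Pearson correlation coefficient, between -1 and 1."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical weekly unit sales.
beer = [10, 12, 15, 18, 20]
diapers = [21, 25, 31, 37, 41]

print(f"covariance:  {covariance(beer, diapers):.1f}")
print(f"correlation: {pearson(beer, diapers):.3f}")
```

A high coefficient only says the two series move together; it does not by itself show that one causes the other.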
4. What Else Will Happen To The Data?
We then use our data to make predictions. There are many prediction algorithms, but not every prediction problem needs to be solved with an incomprehensible algorithm. Industry trends, growth rates, year-on-year ratios, basic probability, and the like are sometimes enough to explain the problem. Here, though, I will introduce some common prediction methods:
For predictions with low real-time and continuity requirements, this is definitely the most worry-free method, but it is tied to the specific business, so you must know the business well and observe it from multiple perspectives.
Classification and Regression
Both classification and regression construct and validate a function from known data such that y = f(x). For an unknown x, y is predicted by f. The difference is that the output of regression is continuous and the output of classification is discrete. For example, predicting tomorrow's temperature is regression, while predicting whether tomorrow will be rainy or sunny is classification. Classification methods include logistic regression, decision trees, and support vector machines, while regression analyses generally use linear regression.
Of course, there are many other prediction algorithms, such as hidden Markov models (HMM), maximum entropy, and CRF. What matters is choosing the right method for the specifics of the data being predicted. Algorithm engineers can give very good suggestions here, provided we can describe accurately the characteristics of the data and what needs to be predicted.
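The contrast between a continuous and a discrete output can be sketched with two very small models: a least-squares line for regression and a one-split decision tree (a "stump") for classification. The weather figures and function names are assumptions, and the stump is used only because it is the simplest member of the decision-tree family mentioned above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x

def fit_stump(xs, labels, positive):
    """Pick the threshold with the fewest training errors; x > threshold means `positive`."""
    values = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum((x > threshold) != (label == positive) for x, label in zip(xs, labels))

    return min(thresholds, key=errors)

# Regression: hypothetical (today's temperature, tomorrow's temperature) pairs.
slope, intercept = fit_line([10, 12, 14, 16], [11, 13, 15, 17])
predicted_temperature = slope * 20 + intercept  # continuous output

# Classification: hypothetical (humidity in percent, next-day weather) pairs.
humidity = [30, 40, 50, 70, 80, 90]
weather = ["sunny", "sunny", "sunny", "rainy", "rainy", "rainy"]
threshold = fit_stump(humidity, weather, positive="rainy")
predicted_weather = "rainy" if 65 > threshold else "sunny"  # discrete output

print(predicted_temperature, predicted_weather)
```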
5. What Should I Do?
Deciding what to do is the ultimate goal of data analysis. Here are some methods that can be used when you know what the problem is but don't know what to do about it:
Fitting and Graph Theory
These methods are most commonly used in route planning. For example, when a store suffers frequent thefts, we can identify where goods are most easily stolen, then connect those places and fit them into the security guard's patrol route. Similarly, you can build a patrol path by constructing a graph and applying a shortest-path algorithm (Dijkstra, Floyd, etc.).
Collaborative Filtering
Collaborative filtering is a way of using collective intelligence. It is like the classic interview question: what should you do when you encounter a problem you have never faced before? The answer is to ask those with more experience what they would do. Collaborative filtering is used mostly in recommendation engines. The general idea is either to find the n users most similar to a particular user and recommend the products those users like, or to find the top n items the current user likes and recommend the m items most similar to them.
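For the graph approach, a minimal Dijkstra sketch using Python's `heapq` is shown below. The store layout, the walking distances, and the function name are hypothetical, and the code finds only the shortest route between two spots; chaining such routes into a full patrol is left out.

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra's algorithm on a dict-of-dicts graph with non-negative weights."""
    distances = {start: 0}
    previous = {}
    queue = [(0, start)]
    visited = set()

    while queue:
        dist, node = heapq.heappop(queue)
        if node in visited:
            continue  # stale queue entry
        visited.add(node)
        if node == goal:
            break
        for neighbour, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < distances.get(neighbour, float("inf")):
                distances[neighbour] = candidate
                previous[neighbour] = node
                heapq.heappush(queue, (candidate, neighbour))

    if goal not in distances:
        return None, []
    # Walk backwards from the goal to rebuild the route.
    path = [goal]
    while path[-1] != start:
        path.append(previous[path[-1]])
    path.reverse()
    return distances[goal], path

# Hypothetical theft hot spots and walking distances between them (one-way for brevity).
store = {
    "entrance": {"electronics": 4, "cosmetics": 2},
    "cosmetics": {"electronics": 1, "liquor": 7},
    "electronics": {"liquor": 3},
    "liquor": {},
}

distance, route = shortest_path(store, "entrance", "liquor")
print(distance, route)
```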
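The user-based variant of this idea can be sketched in a few lines: score each other user by cosine similarity over commonly rated items, keep the n most similar, and rank the items they rated that the current user has not seen. The ratings, user names, and the similarity-weighted scoring rule are assumptions for illustration; production recommenders are considerably more involved.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity over the items both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[item] * b[item] for item in common)
    norm_a = math.sqrt(sum(a[item] ** 2 for item in common))
    norm_b = math.sqrt(sum(b[item] ** 2 for item in common))
    return dot / (norm_a * norm_b)

def recommend(ratings, user, n_neighbours=2, top_m=3):
    """Recommend up to top_m unseen items from the n most similar users."""
    others = [
        (cosine_similarity(ratings[user], ratings[other]), other)
        for other in ratings
        if other != user
    ]
    neighbours = sorted(others, reverse=True)[:n_neighbours]

    scores = {}
    for similarity, other in neighbours:
        for item, rating in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_m]

# Hypothetical ratings on a 1 to 5 scale.
ratings = {
    "alice": {"coffee": 5, "tea": 3, "milk": 4},
    "bob": {"coffee": 5, "tea": 3, "milk": 4, "bread": 5},
    "carol": {"coffee": 1, "tea": 5, "juice": 5},
}

print(recommend(ratings, "alice", n_neighbours=1))
print(recommend(ratings, "alice", n_neighbours=2))
```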
Another situation is very common for data analysts: you get the data, but there is no set purpose. This is called exploratory analysis. In this case, with the help of data analysis tools, we can do some general exploratory analysis, look at the data trends, and gradually deepen our insights.
For companies, the main tools for exploratory analysis are reporting and BI. One example is FineReport, which can produce a variety of complex reports as well as large-screen data visualizations. On top of reports and BI, an early warning system can be added that alerts on abnormal indicators, so that leaders need only pay attention to those indicators instead of reviewing all of them, which saves time and improves efficiency. When necessary, they can then open the corresponding report or BI view. This is one way to apply exploratory analysis in an enterprise.
Opinions expressed by DZone contributors are their own.