Small Data Meets Big Data
There are tremendous learning opportunities in the big data and analytics space for anyone interested in using data to solve business problems.
I've been using data to solve business problems since I got out of business school a couple of lifetimes ago. I've been able to help:
- Procter & Gamble know the optimal price differential between Bounce and generic fabric softeners.
- Rolaids know that their advertising budget had a strong positive correlation with market share.
- Wachovia Bank see the correlation between "heart" advertising and "switching preference."
- Blue Cross and Blue Shield see the positive correlation between their advertising and the reduction in negative perceptions of the brand, increase in positive perceptions of the brand, and increase in inbound leads.
I went to TDWI's big data and analytics conference to see what else I needed to learn to leverage my love of data and my desire to help solve business problems.
Following are a few key takeaways from each class.
Data Science Best Practices
- The purpose of analytics is to turn data into insights that guide positive business action.
- The single biggest issue is cultural and not technological — Uber, Lyft, Netflix, and Amazon are all driven by data and have an inherent data-driven culture.
- Insight means different things to different people in the organization.
- Data analytics is about finding patterns, computer analysis, understanding, and insight.
- Business analytics is about finding meaning, human analysis, decisions, and action.
- Start by thinking about the problem you are trying to solve, then strategy, tactics, and operations. Don't chase nickels with quarters; solve $100 problems with nickels.
- The Cross Industry Standard Process for Data Mining (CRISP-DM) is the most widely used process for data mining.
- "Data mining" is a misnomer since you are really mining for insights — insights are gold.
Ask the Right Questions
- Get the data in a raw form and become comfortable with manipulating it.
- Start by sorting on every variable to see what you have and where the mess is.
- Glean insights from skills based on logic, math, IT, and business knowledge of the company.
- Big data is just one source of information. Look at the entire world and direct further data collection based on needs and how you see the data coming together.
- Know how the data is collected, know what the business need is, be transparent, and document your work. Someone is going to ask questions months after you've finished the project.
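One low-tech way to "sort on every variable," as suggested above, is to scan each field's extremes, where missing values and outliers tend to surface. A minimal pure-Python sketch (the `records` data is invented for illustration):

```python
# Sort a small dataset on every variable to surface messy values:
# missing entries and outliers show up at the extremes of each sort.
records = [
    {"customer": "A", "age": 34, "spend": 120.5},
    {"customer": "B", "age": None, "spend": 89.0},    # missing age
    {"customer": "C", "age": 29, "spend": 99999.0},   # suspicious outlier
    {"customer": "D", "age": 41, "spend": 57.25},
]

for field in ("age", "spend"):
    # Sort key pushes missing values to the front so they can't be missed.
    ordered = sorted(records, key=lambda r: (r[field] is not None, r[field] or 0))
    lowest, highest = ordered[0], ordered[-1]
    print(f"{field}: lowest={lowest}, highest={highest}")
```

Scanning the printed extremes immediately flags customer B's missing age and customer C's implausible spend.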
An Overview of Data Science
- There are seven types of data scientists: 1) R number crunchers; 2) data engineers; 3) old-style modelers; 4) linear domain experts; 5) math modelers/scientists; 6) modern machine learners (XGBoost); 7) deep learning geeks.
- Everyone needs good SQL skills to pull out the data they need to analyze.
- Intellectual curiosity is more important than education — the ability to notice weird things in the data and then dig in to figure them out.
- Go beyond the question that has been asked to ask why the question has been asked.
- Be willing to work with uncertainty and noise.
- Define the problem you're trying to solve up front. Most projects fail because they're trying to solve the wrong problem.
- Learn several algorithmic techniques so you have the breadth to match the technique to the problem.
- Build a lot of models and aggregate them to get the best performing model.
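"Build a lot of models and aggregate them" is the core idea behind ensembling. A toy sketch, where the three "models" are just hypothetical prediction lists, shows why averaging helps: individual errors partially cancel out.

```python
# Averaging the predictions of several imperfect models often beats
# any single one of them -- their errors partially cancel.
actual = [10.0, 20.0, 30.0, 40.0]

# Predictions from three hypothetical models, each biased differently.
model_a = [12.0, 19.0, 33.0, 38.0]
model_b = [9.0, 22.0, 28.0, 43.0]
model_c = [10.5, 18.5, 31.0, 39.5]

ensemble = [sum(p) / 3 for p in zip(model_a, model_b, model_c)]

def mae(pred):
    """Mean absolute error against the actual values."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(actual)

print("model A MAE:  ", mae(model_a))
print("ensemble MAE: ", mae(ensemble))
```

Here the ensemble's mean absolute error (0.375) beats every individual model's.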
Preparing Data for Predictive Modeling
- Nine skills for data analysts: 1) education; 2) SAS and/or R; 3) Python coding; 4) Hadoop platform; 5) SQL database/coding; 6) unstructured data; 7) intellectual curiosity; 8) business acumen; 9) communication skills.
- The way you prepare the data will vary depending on the algorithm you are going to use.
- Data preparation is 80% of the job.
- Do not assume the data you prepared for one job is prepared for another.
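As an example of algorithm-dependent preparation: distance-based methods (k-NN, k-means) generally need standardized features so one large-scale variable doesn't dominate, while tree-based models are insensitive to scale. A minimal z-score standardization sketch with invented numbers:

```python
# z-score standardization: (x - mean) / std. Useful for distance-based
# algorithms so a large-scale feature (income) doesn't swamp a
# small-scale one (age); tree-based models can skip this step.
import statistics

incomes = [30000.0, 45000.0, 60000.0, 75000.0]
mean = statistics.mean(incomes)
std = statistics.pstdev(incomes)  # population standard deviation

standardized = [(x - mean) / std for x in incomes]
print(standardized)
```

After standardization the values are centered at zero with unit variance, so the same data is now "prepared" for a different family of algorithms.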
Modeling Your Data: Building and Assessing Models
- Recognize the difference between correlation and causation.
- Score your model by putting it into production and comparing the prediction to the actual results.
- 10-bucket testing (10-fold cross-validation) checks the accuracy of the model: train on nine-tenths of the data and test on the remaining tenth, repeating ten times with a different tenth held out each time.
- The frequency of the model change depends on the volatility of the data and the industry.
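The "10-bucket testing" described above is commonly known as 10-fold cross-validation. A pure-Python sketch of the splitting logic (the data is a stand-in; a real workflow would fit and score a model inside the loop):

```python
# 10-fold cross-validation: split into 10 folds and hold one out per round,
# so every record is used for testing exactly once.
data = list(range(100))   # stand-in for 100 labeled records
k = 10
fold = len(data) // k

seen_in_test = []
for i in range(k):
    test_set = data[i * fold:(i + 1) * fold]
    train_set = data[:i * fold] + data[(i + 1) * fold:]
    # A real workflow fits the model on train_set and scores it on test_set;
    # the key property is that the two are disjoint in every round.
    assert not set(test_set) & set(train_set)
    seen_in_test.extend(test_set)

print(len(seen_in_test))  # every record appeared in a test fold once
```

Averaging the ten per-fold scores gives a more stable accuracy estimate than a single train/test split.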
Effectively Visualizing and Communicating Data
- The goal of data visualization is good communication.
- Understand the intended purpose of the data as well as the audience that will be viewing the visualization.
- Eliminate chart junk: non-informative or information-obscuring elements of quantitative information displays.
- Educate those involved in the data science process on the value of effective visualization.
- A confusion matrix is a great way to evaluate the accuracy of a classification model.
- Gain (or lift) measures the effectiveness of a classification model as the ratio between the results you get with the model and the results you get without it.
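Both of the last two takeaways can be made concrete: a confusion matrix tabulates predicted vs. actual classes, and lift compares the model's hit rate to the baseline rate. A small sketch with invented labels:

```python
# Build a 2x2 confusion matrix and compute accuracy and lift.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)

accuracy = (tp + tn) / len(actual)

# Lift: precision among the cases the model flags, relative to the
# base rate you'd get by picking cases at random (i.e. without the model).
base_rate = sum(actual) / len(actual)
precision = tp / (tp + fp)
lift = precision / base_rate

print(f"accuracy={accuracy}, lift={lift:.3f}")
```

A lift of 1.875 means the model's positive calls are nearly twice as likely to be correct as random selection.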
Data Mining With R
- More easily understood by people with a statistics background.
- RStudio is an IDE for R.
- R has evolved as the standard for statistical computing for many disciplines and industries.
- Machine learning covers:
- Statistical learning: Linear and logistic regression.
- Supervised learning: A data scientist is required to engineer features.
- Unsupervised learning: Algorithms find patterns without requiring anyone to engineer features.
- Deep learning: Builds on machine learning, pairing neural networks with strong computing power to train models faster with more data.
- Python is better for deep learning.
- Boost or bag your models — bagging via random forests is a tried-and-true method.
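In R this would typically be done with the `randomForest` package; keeping the examples in one language, here is a pure-Python sketch of the bagging idea behind random forests: train many models on bootstrap resamples and let them vote. The "model" is a deliberately trivial threshold rule and the 1-D data is invented.

```python
import random

random.seed(42)

# Invented 1-D data: class 1 tends to have larger x values.
data = [(x / 10, 0) for x in range(10)] + [(1 + x / 10, 1) for x in range(10)]

def fit_stump(sample):
    """Trivial 'model': threshold at the midpoint between the class means."""
    mean0 = sum(x for x, y in sample if y == 0) / max(1, sum(1 for _, y in sample if y == 0))
    mean1 = sum(x for x, y in sample if y == 1) / max(1, sum(1 for _, y in sample if y == 1))
    threshold = (mean0 + mean1) / 2
    return lambda x: 1 if x >= threshold else 0

# Bagging: bootstrap-resample the data, fit one model per sample,
# then classify by majority vote across all the models.
models = [fit_stump([random.choice(data) for _ in data]) for _ in range(25)]

def predict(x):
    votes = sum(m(x) for m in models)
    return 1 if votes > len(models) / 2 else 0

print(predict(0.2), predict(1.7))
```

Real random forests add a second source of randomness (sampling features at each split), but the bootstrap-and-vote structure is the same.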
Python for Data Analysis
- More popular with developers and programmers.
- Popular choice for scripting, rapid prototyping, and data science/machine learning.
- Jupyter Notebook is a web application that allows you to create and share "scientific notebooks" with interactive, runnable code, making them perfect for exploratory analysis.
- Pandas (the name derives from "panel data") is the most popular Python library for data analysis.
- CRISP-DM is the most popular predictive analytics methodology.
- If a project doesn't pay out, it's an expense, not an investment.
- What's the ROI for true positives, true negatives, false positives, and false negatives?
- If it's not greater than $1 million, it's not worth pursuing.
- Most organizations have one project concept based on their biggest pain point. They need five or six to evaluate the best fit for predictive analytics. Many proposed projects turn out to be BI rather than predictive analytics.
- Assess, plan, prepare, model, validate, deploy, and monitor every project.
- Companies need an analytics manager to manage the process and serve as the interface between the C-level and data scientists/analysts.
- The most accurate model is not necessarily the best.
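The ROI question above can be answered by attaching a dollar value to each cell of the confusion matrix, which also explains why the most accurate model is not necessarily the best: accuracy weights every cell equally, while dollars do not. A sketch with invented counts and payoffs for a hypothetical targeted-marketing campaign:

```python
# Expected value of a model = sum over confusion-matrix cells of
# (count in cell) * (dollar payoff of that outcome). All numbers invented.
counts = {"tp": 5000, "fp": 5000, "tn": 88000, "fn": 1000}

payoff = {
    "tp": 300.0,   # responder contacted: revenue minus contact cost
    "fp": -5.0,    # non-responder contacted: wasted contact cost
    "tn": 0.0,     # non-responder skipped: nothing spent, nothing earned
    "fn": -300.0,  # responder missed: forgone profit
}

roi = sum(counts[c] * payoff[c] for c in counts)
print(f"expected value of the model: ${roi:,.0f}")
```

With these assumed numbers the project clears the $1 million bar; a cheaper false positive or a costlier false negative can flip that conclusion without the model's accuracy changing at all.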
"The value of data mining results is not determined by the accuracy or stability of predictive models." — Tom Khabaza
Give it another year or two and "big data" will just be "data," and it will be driving all significant business decisions for forward-thinking organizations.