Data is fueling the world’s most valuable companies. Today the list is topped by Apple, Google, Microsoft, Amazon, and Facebook. These top companies harness data to drive outsized value. While each company is unique, their approaches to analytics have more in common than you might expect.
The Rapid Rise of Data Capture for Analytics
In a short span, entire industries have emerged that did not exist before. Each of these areas is supported by one or more of the world’s largest companies:
- App stores from Apple and Google.
- Online music, videos, and books from Apple, Google, and Amazon.
- Seller marketplaces from Amazon.com.
- Social networks from Facebook.
These areas have common characteristics driving the data workloads:
- Incredibly large end user bases numbering hundreds of millions.
- A smaller (but still large) base of creators or sellers.
The platform providers (Apple, Google, Amazon, and Facebook) seek analytics for:
- The content producers or sellers.
- Often, all the way to the end users.
All of these characteristics culminate in a stack that starts with the platform provider, extends up to the creators or sellers, and ends with consumers. At each level, there is a unique analytics requirement.
The App Store Example
Let’s use the App Store example to explore analytics architectures across this type of stack. App Stores are also an ideal example of new workloads that require a fresh approach to data engineering.
App Store Characteristics
The largest App Stores have the following characteristics:
- Hundreds of millions of end users.
- Millions of application developers.
- Dozens of app segments.
- One primary platform provider (e.g. Apple, Google).
App Stores also represent a large, fast-growing segment of the economy. According to a recent article in the San Francisco Chronicle based on data from analytics firm App Annie, both Apple and Google are growing, with Android taking a recent lead.
This year, things are changing: Android app distributors will leap ahead of the App Store, according to projections by analytics firm App Annie. In 2017, the App Store will generate $40 billion in revenue, while Android app stores run by Google and other parties will generate $41 billion, App Annie said. That gap is expected to widen in 2021, with Android app stores generating $78 billion and Apple’s store at $60 billion, according to the analytics firm’s report, which was released on Wednesday.
App Store revenue and projections.
Data Workloads From App Stores
App Store workloads produce and collect information on:
- The distribution of apps to end users.
- App data generated by each app for each end user.
Desired Data Engineering Capabilities
To meet the needs for comprehensive and multilevel App Store analytics, data solutions need to provide:
- Fast data capture, including the ability to ingest data in real time.
- Low-latency query capability to support sophisticated queries with sub-second responses.
- High concurrency, enabling many users to access the system simultaneously without slowdown.
Desired Analytical Capabilities
To serve all levels of requirements, App Stores (and many other areas with similar characteristics) need to deliver:
- Analytics for the platform: Real-time analytics to understand operationally what is happening at any moment, and ad hoc analytics for impromptu drill-downs on specific queries.
- Analytics for app developers: Including ad hoc queries so developers can segment the data any way they want. With traditional solutions, serving many groups of analytics users often required pre-computing results, but this negated the option for ad hoc analytics.
- Analytics for end users: Responsive, lightweight analytics for hundreds of millions of users, such as which apps are installed and up to date.
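To make the trade-off between pre-computed results and ad hoc queries concrete, here is a minimal sketch in Python. The event records, field names, and the `segment` helper are hypothetical and exist only for illustration; a real system would run this kind of grouping inside the database rather than in application code.

```python
from collections import Counter

# Hypothetical install events; the field names are illustrative only.
EVENTS = [
    {"app": "chess", "country": "US", "os_version": "10"},
    {"app": "chess", "country": "DE", "os_version": "11"},
    {"app": "chess", "country": "US", "os_version": "11"},
    {"app": "notes", "country": "US", "os_version": "10"},
]

# Pre-computed approach: one fixed rollup, built ahead of time.
# It answers "installs per app" quickly, but nothing else.
INSTALLS_PER_APP = Counter(event["app"] for event in EVENTS)


def segment(events, *dimensions, **filters):
    """Ad hoc approach: count events grouped by any chosen dimensions.

    `filters` keeps only events whose fields equal the given values.
    """
    counts = Counter()
    for event in events:
        if all(event.get(key) == value for key, value in filters.items()):
            counts[tuple(event[dim] for dim in dimensions)] += 1
    return dict(counts)


if __name__ == "__main__":
    print(INSTALLS_PER_APP["chess"])
    # A question the fixed rollup cannot answer: one app, by country and OS.
    print(segment(EVENTS, "country", "os_version", app="chess"))
```

The fixed rollup is fast to read but answers only the question it was built for, while the ad hoc function can group by any combination of fields at the cost of scanning the raw events at query time.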
Analytics Architecture Strategies
For App Stores or any other large data-driven business, the following goals and implementation approaches can make analytics at scale easier to achieve.
Multilevel: Provides analytics across the platform, developers, and end consumers. Using the appropriate indexing and sharding approaches, the platform provider can architect a solution to meet the needs of all three constituents.
Self-service: Empowering self-service analytics ensures that results are instant and up to date without the cost and complexity of pre-computing.
Use a scale-out distributed system: A distributed system can support both the speed and volume required for large-scale analytics. Further, the right indexing and sharding allow queries to be segmented appropriately. For example, if thousands of developers are each issuing queries about their own applications, those queries can be directed to the data partitions specific to those developers rather than to the entire distributed system. This approach allows a high degree of concurrent access.
Ensure a modern query execution system.
Newer systems include features such as:
- Code compilation to facilitate sub-second responses on repetitive queries.
- Distributed joins for efficient operations across multiple tables.
- Vectorization to take advantage of the latest CPU capabilities such as Single Instruction Multiple Data (SIMD).
Bundle transaction support to enable real-time analytics.
In a real-time world, there is no time to wait for lengthy extract, transform, and load processes. Using a system that supports transactions as well as analytics allows data to be analyzed in place.
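Two of the strategies above, routing each developer's queries to a developer-specific partition and analyzing transactional data in place, can be sketched together in a few lines of Python. This is a toy illustration under stated assumptions, not a description of any particular product: the shard count, the `installs` table, and the helper names (`shard_for`, `record_installs`, `installs_by_country`) are all hypothetical, and in-memory SQLite databases stand in for the nodes of a real distributed system.

```python
import sqlite3
import zlib

NUM_SHARDS = 4  # assumed shard count for this sketch


def shard_for(developer_id: str) -> int:
    """Map a developer to a shard using a stable hash (crc32)."""
    return zlib.crc32(developer_id.encode("utf-8")) % NUM_SHARDS


def make_shards():
    """Create one in-memory SQLite database per shard (stand-ins for nodes)."""
    shards = []
    for _ in range(NUM_SHARDS):
        conn = sqlite3.connect(":memory:")
        conn.execute(
            "CREATE TABLE installs (developer_id TEXT, app TEXT, country TEXT)"
        )
        conn.execute("CREATE INDEX idx_developer ON installs (developer_id)")
        shards.append(conn)
    return shards


def record_installs(shards, developer_id, rows):
    """Write install rows for one developer in a single transaction."""
    conn = shards[shard_for(developer_id)]
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT INTO installs VALUES (?, ?, ?)",
            [(developer_id, app, country) for app, country in rows],
        )


def installs_by_country(shards, developer_id):
    """Query only the shard that holds this developer's data."""
    conn = shards[shard_for(developer_id)]
    cursor = conn.execute(
        "SELECT country, COUNT(*) FROM installs "
        "WHERE developer_id = ? GROUP BY country ORDER BY country",
        (developer_id,),
    )
    return dict(cursor.fetchall())


if __name__ == "__main__":
    shards = make_shards()
    record_installs(
        shards, "dev-42", [("chess", "US"), ("chess", "DE"), ("notes", "US")]
    )
    record_installs(shards, "dev-7", [("maps", "US")])
    # The committed rows are queryable right away, with no separate ETL step.
    print(installs_by_country(shards, "dev-42"))
```

Because each query touches only one shard, thousands of developers can query concurrently without every request fanning out across the whole system. And because the same store accepts the transactional writes and serves the analytical reads, there is no extract, transform, and load step between capturing data and analyzing it.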