Effective Self-Service Analytics Depends on Database Preparation
Effective Self-Service Analytics Depends on Database Preparation
The world of BI relies on making sure your database is ready to scale and perform to meet the needs of your incoming data. These guidelines will lend a hand.
Join the DZone community and get the full member experience.Join For Free
New whitepaper: Database DevOps – 6 Tips for Achieving Continuous Delivery. Discover 6 tips for continuous delivery with Database DevOps in this new whitepaper from Redgate. In 9 pages, it covers version control for databases and configurations, branching and testing, automation, using NuGet packages, and advice for how to start a pioneering Database DevOps project. Also includes further research on the industry-wide state of Database DevOps, how application and database development compare, plus practical steps for bringing DevOps to your database. Read it now free.
The promise of self-service business intelligence (BI) is that it allows application end users to create their own reports, dashboards, and data visualizations by running queries on the fly.
However, if organizations undertaking a project to deliver self-service analytics do not consider the infrastructure that will support the new solution, even the best BI and analytics project will fail to meet end user expectations. This failure can often be attributed to a lack of database preparation.
But developers typically have little time to consider database performance issues when launching an application. Unfortunately, database resources are often further taxed by the increased load from visualizations and self-service capabilities offered by modern BI.
This article covers database and data-related issues that organizations undertaking a project to deliver improved self-service analytics to their users should consider. These are general guidelines designed to help frame requirements for a business intelligence project — stemming from lessons learned thanks to 1,500 successful implementations of the Izenda platform. But every database is unique. Your database administrator will be able to help determine the best path to prepare your data for reporting.
Self-service analytics in near real time is a primary objective of modern BI. If an analytics solution cannot support end-user expectations for performance, or if the reporting infrastructure requires too much IT support, then the economics of the new solution might not make sense. Before implementing self-service reporting, it's important to determine whether the current database system can handle the increased load, or if investment in other hardware or software is necessary.
Database and data-related issues that organizations should consider to ensure improved self-service analytics vary. In general, when it comes to decisions about reporting databases, teams should understand:
End user needs: Do they require ad hoc reports, dashboards containing key performance indicators (KPIs), or visualizations? These needs generate varying levels of demands on the relational database management system (RDBMS) beyond those of basic ad hoc reports.
Data needs: What data sources users need, whether they involve more than one database and whether there are data quality issues that need attention before users try to do self-service reporting.
Performance expectations: How quickly a query need to be returned, how live the data must be and whether users expect instant, real-time results.
Project scale: Number of clients, users, and roles. Given the scale and scope of the project, infrastructure changes may move from "nice to have" to "necessary."
Legacy reporting: What existing reports must be delivered when migrating from a legacy reporting platform, which reports will need to be run in parallel, and for how long post-launch. There are three basic options for data sources for self-service reporting:
Traditional data warehouse or online analytical processing (OLAP) solutions
Proprietary third party analytics databases
An existing database, optimized for reporting
Each of these options has pros and cons for managing data for analytics, and should be evaluated prior to deciding on an analytics solution.
One option is to offload reporting data to an analytics database, such as a data warehouse or OLAP cube. Although they significantly improve query speed, reporting databases require regular (usually nightly) ETL processing, as well as an OLAP engine that performs analytic operations, and related database tools.
If data is coming from more than one data source, the analytics database will require a staging layer to hold the data extracted from each source and an integration layer to combine it before passing it to the data warehouse.
Not surprisingly, implementing an analytics database is in itself a significant project. Once implemented, it will continue to require maintenance in the form of nightly ETL updates, as well as ongoing investments in hardware, software and database admin personnel. Nightly batch updates can be slow, leaving a large window of time during which the system is inaccessible.
Queries against OLAP databases are not SQL-based, and so may require additional technical skills to implement. Finally, the nature of the ETL process means that data in the warehouse is only as fresh as the last ETL process, not real time. End users will need to understand the limitations on the freshness of their data.
Analytics can also be performed on a relational database by implementing a variant of OLAP known as relational OLAP, or ROLAP. Although it uses a standard relational database, a ROLAP implementation does not run against an OLTP database; it requires a specialized star or snowflake schema design.
There are benefits to ROLAP. It is built on stable relational database technology. It’s scalable and fast: ROLAP cubes can be built on the fly and cached to improve reporting speeds. Queries are SQL-based, so proprietary query tools are not necessary.
But a ROLAP implementation is still a significant undertaking. It requires creation of an additional database to hold aggregated data and (possibly custom coded) ETL to maintain.
Many third party BI tools have solved the issue by implementing their own analytics databases. Most solutions require loading data into a proprietary database that is unique to the tool. Some provide an interface for managing data sources that lets a business analyst, rather than a developer, define the data model. These platforms are designed to rapidly return results from queries of large data volumes, as well as to return very granular query results.
Proprietary databases tend to work well as enterprise tools, but because they are developed and deployed primarily for enterprises they are a poor match for small and medium businesses (SMBs). They may:
be resource-heavy, requiring additional investments in RAM memory or other third party tools
require a high investment in IT resources, both for initial deployment and in daily processing, since these databases are often still loaded using batch ETL processes
require ongoing IT support – and because some of these use proprietary query languages instead of SQL, it may be difficult to hire resources with these skills
Software vendors who are accustomed to agile development cycles may have a hard time adapting to the sometimes glacial pace of enterprise software products. Finally, most third party products are not designed to be embedded in, and therefore may not mesh well with, existing applications.
A third option is optimizing existing database(s) for reporting. It is a viable option that should not be overlooked. Delivering reporting against an optimized database, instead of a data warehouse or third party solution, can help speed time to market and control costs.
Optimization may work as a phase one deliverable; or, if the data doesn’t scale beyond the tens or hundreds of thousands of rows, it may be effective enough to allow reporting against live data. This will depend on the application’s database model.
Options to enhance database performance for reporting often seem counterintuitive, but they can be effective. Consider these key workarounds to enhance database performance:
Denormalize data: Selectively add data redundancy to improve read speed by reducing the number of JOINs required by a query.
More indexes: Add additional foreign and primary keys.
Preaggregate data for reporting: Use existing reports and KPIs to identify the most common queries, which are the most likely candidates for preaggregation.
Caching: Use an in-memory database caching system like Memcached or Redis.
Views: Set up reporting friendly views and procedures that preaggregate the data that end users interact with most often. Using views for reports also simplifies the end user’s experience and improves query performance.
Specify NOLOCK in selected queries: Not always the best workaround when running queries against transactional databases, NOLOCK can end up returning dirty data if updates are rolled back after a query is returned.
Infrastructure upgrades: Ensure that the database server has adequate CPU, memory, and disk space. Consider clustering databases, updating the database version, and/or implementing load balancing. Also consider upgrading hardware, like moving from disk storage to solid state storage drives.
Many software companies can enhance their applications with self-service reporting to increase adoption, improve customer retention and free scarce technical resources to focus on innovation. However, if the organization doesn’t consider what infrastructure will support the new reporting solution, even the best BI and analytics project will fail to meet end-user expectations.
Opinions expressed by DZone contributors are their own.