As a type of front-end data service intended for end-users, reports in applications have received a lot of concerns about their performances. Users want to put their parameters in, perform a query, and get their results in no time. Though less than twenty seconds’ waiting time is within a tolerant range, three to five minutes will threaten to extend beyond a user’s patience with the worst product experience.
But why is a report so slow? And how shall we optimize its performance?
Usually, a report in an application is developed with a reporting tool. Each time the reporting process is slow, users tend to point their fingers at report developers or reporting tool vendors. In fact, in most cases, slowness is a symptom, the real disease lies in the data preparation stage. This means data handling is already slow before the reporting stage. So it’s of no use to simply optimize the report development or to push the reporting tool too hard.
Reports are for viewing. Since the ability of our naked eye is very limited, there’s no need to let a report present a large amount of data. The descriptive computing mode the reporting tool adopts, though intuitive, is not suitable for complex computations. It’s inappropriate and pushy to let a reporting tool deal with big data computing and complex computations. Reporting tools are only good at handling the arrangement of small data sets, and simple computations. It doesn't take long to perform these types of reports.
About 80% of slow downs in reporting are caused by terrible data preparation. Though a report presents only a very small amount of data, the original data for generating it could be huge. It takes very long to filter and aggregate a massive volume of data to get just a small volume. A similar thing happens with a complex computation, for which the most time-consuming job is the data preparation. So data preparation is the bottleneck in trying to facilitate report building. Ways to optimize the data preparation process are:
1. Optimizing data preparation code. Generally, it is SQL code or stored procedure code to optimize. Sometimes it is the application code (when involving non-database data sources or multiple databases).
2. Scaling up/out the database. We can also scale up/out the database, like clustering, if there is a huge amount of data and there is nothing more to be done in terms of code optimization.
3. Employing a high-performance computing engine. We can replace traditional databases that are poor in performance or costly for certain operations with a better computing mechanism.
Report Presentation Stage
The computational performance problem might also happen during the report presentation stage.
A typical scenario is building a multi-source report, for which joining and aligning multiple two-dimensional data sets by the primary key, and sometimes grouping and aggregation, are needed. As the reporting tool requires that all calculations be written in the cells, the relationship between a cell and other data sets can be only expressed by dataset filtering, such as the formula ds2.select(ID=.ID). The calculation has quadratic complexity. This has little impact on performance when the data amount is very small but has a serious effect if there is a bigger amount of data (say, thousands of rows) and a lot of data sets. The time it takes in performing the calculation could range from seconds to minutes or hours.
But the time can be significantly reduced if the calculations are handled during the data preparation stage outside of the reporting tool. With data sets coming from the same database, we just write the SQL JOIN statement; if data sets come from different databases or we want to save the database from high computational pressure, we can use the HASH JOIN algorithm outside of the database. The HASH JOIN algorithm treats multiple databases as a whole and brings a much higher efficiency than the filtering join used by the reporting tool does. It achieves an almost instant response for a several-thousand-row data set.
On the surface, the poor performance is a report-presentation-stage issue, but its solution is often found beyond the reporting tool.
Another scenario is a grouped and aggregated report with detail rows. Its problem is the slow computation caused by the handling of a big amount of data during the report presentation stage. The solution is the same – moving the data computing out of the reporting tool.
Data transmission is another bottleneck of reporting performance.
Currently, many applications are J2EE-based, reporting tools are developed in Java, and JDBC is used for database access. The problem is that, since the performance of some popular databases’ JDBC is bad (we’ll omit the brands here), it takes a fairly long time for users to retrieve data, even in a slightly large amount (tens of thousands of rows). Here’s a paradox: the database has light pressure on it and delivers quick computations, and the report tool handles a relatively simple computation, yet the overall reporting process is still slow.
Neither application vendors nor reporting tool vendors are capable of improving the database JDBC performance. What they can do is to think of ways to fix it from the outside. According to tests, data retrieval performance can be increased by several times using the multi-threaded processing (a precondition is a light-loaded database). No reporting tools, however, have ever provided the parallel retrieval functionality directly (it’s difficult to make it a reporting tool function because the data segmentation plan varies with database products as well as data retrieval syntax, and thus needs to be controlled by code). Still, the solution is implementing the parallel retrieval during the data preparation stage beyond the reporting tool.
If we can cache the results of recent computations for generating a report, then when other users perform queries using the same parameters, there will be no need to re-compute the data and they can just return the results. Obviously, this will enhance user experience. This kind of cache functionality is offered by many reporting tools, but not for intermediate results. Only the final result of the computations for generating a report can be cached, and different reports have separated and unconnected caches.
1. Partial caching. For certain types of reports, particularly the commonly seen multi-source reports, most of the involved original data is static (historical data) and only a small part is dynamic (current data). The lifetime of the whole-report cache data is measured by the shortest data history, which results in repeated computations. To cache a part of the data, it can extend its term of cache validity, and thus reduce the number of necessary computations.
2. Reusability of caches. Different reports may reference the same data. But a fragmented cache mechanism for reporting will result in replicated computations of the same data. If a cached computational result can be made available across all reports, the number of necessary computations may be considerably reduced.
Controlling cache is complicated and requires coding. It’s difficult to provide such a function within a reporting tool, but it’s relatively easy to implement it during the programmable data preparation process.
As previously mentioned, the relationship between reporting and big data isn’t as strong as many reporting tool vendors advertising big data report claim it to be.
There’s indeed a type of list reports that can be called big data reports. A list report, which lists transaction details in a certain period of time, is frequently seen in the financial sector. It features a huge amount of data across thousands of, even ten thousand, pages. The computations, however, are relatively simple, such as listing data, most of the time, or the summarization by groups or pages, at most.
Generally, reporting tools adopt a full memory storage strategy in order to manage inter-cell computations with flexibility. It is very likely that the strategy will lead to memory overflow when loading a big list into the reporting tool. Besides, it takes an intolerably long time to retrieve and load the list.
It’s easy to think about a way of presenting data while retrieving that data, page by page, causing that will present no danger of memory overflow and reduce the time users have to wait. All databases provide cursors to retrieve data out, part by part. Users may flip pages on the front-end, which demands the ability to retrieve data by random page numbers (i.e. by rows), but databases don’t offer this kind of API. Retrieving data with conditional filtering not only is quite slow but can’t ensure data consistency.
There’s no better way than to get it done during the data preparation stage. We can devise two asynchronous threads: one for retrieving data from the database and buffer it to the external storage (suppose there’s too much data for the memory); and the other for receiving a front-end request and loading data from the buffer by random page numbers (i.e. by rows) to return.