Postgres Query Execution: JDBC Prepared Statements
JDBC provides client side prepared statements, but not immediately server side ones.
Join the DZone community and get the full member experience.Join For Free
This post is about Postgres JDBC (PGJDBC) prepared statements, the setting prepareThreshold, and the performance implication of it.
A prepared statement is a statement that is defined so that its variables can be set, and executed. An example:
PreparedStatement p = c.prepareStatement("select i from t where i = ?"); p.setInt(1,99); rs=p.executeQuery();
1: Definition of a prepared statement p.
2: set ('bind') the first variable of p as 99.
3: execute prepared statement p.
In order to execute prepared statement p again, all that needs to be done is 2 for a new value, and 3 to execute the prepared statement again.
The obvious advantage of a prepared statement is that the statement doesn't need to be created in java again, a new value can be bound, and executed. The above construction is known as a 'client-side prepared statement'.
A separate advantage of a client-side prepared statement is that it can create a prepared statement in the database. This is known as a 'server-side prepared statement'. Despite looking quite alike, this distinction is important, as you will see. A prepared statement in the database has the advantage that it can skip the parse phase, and execute the bind and execute phases only, meaning a reduction of work. Despite the difference between a client-side and server-side prepared statement, they do work closely together.
Once a prepared statement exists in Postgres, Postgres can choose a generic plan, which -if chosen- means the planning step in Postgres will be skipped, which is a further reduction of work, and thus a potential improvement in performance.
But there are some details that need to be aligned to make this work predictably. Also, mind this is not a solution for every case.
On the client/java side there are a few settings with default values that dictate how the client uses prepared statements. The client also dictates how the server-side of this is handled.
The following settings can found on the Postgres jdbc documentation website:
- prepareThreshold, default: 5
- preparedStatementCacheQueries, default: 256
- preparedStatementCacheMiB, default: 5
When a prepared statement is defined and executed in Java, it doesn't create a server-side prepared statement. Instead, it waits for 4 client-side executions before a server-side prepared statement is created. It is my personal opinion this is not obvious.
A prepared statement is automatically cached. It's logical a prepared statement is kept available as long as the prepared statement is not closed. It's less obvious that the prepared statement is cached even if it's closed. The reason for the cache is to keep a potential server-side prepared statement in existence.
The prepared statement cache described with preparedStatementCacheQueries also has a limit in size, which is 5 MB. Please mind this is the client-side prepared statement itself, not the data.
What I don't know for the cache's number of queries and size is how that relates to open/used prepared statements, in other words: is the number the total (open+closed) number or the number of closed prepared statements, and does this limit open cursors.
On the database side, there are two mechanisms in play that need to be understood. These mechanisms have no settings that can change the behavior in Postgres version 11. Starting from Postgres version 12 there's the setting plan_cache_mode for changing the planner's generic/custom plan choice.
As described earlier, a client-side prepared statement that is executed will result in a regular execution for the first four times by default. So the database executes the parse, plan and execute steps, exactly like a regular statement; the database/backend doesn't know that the executed statement is a client-side prepared statement at that time.
Once the client reaches the fifth (prepareThreshold) execution, JDBC will create a server-side prepared statement and execute that.
After the creation of the server-side prepared statement, so with a sixth execution, the client can take advantage of the server-side prepared statement and only bind and execute the server-side prepared statement. The obvious advantage here is the parse step is skipped.
A server-side prepared statement means the relevant specifics of a query are stored on the server-side. That can only be possible if that is stored in memory in the server-side session that the client is communicating with. This ultimately means that server-side prepared statements take some extra memory. The cases that I've seen and investigated show that it's not extremely memory hungry, but it still takes memory.
As mentioned above, a server-side prepared statement means the relevant specifics of a query are stored on the server-side. This means for server-side prepared statements to work, the client must be able to reuse the specific server-side session that it has been communicating with in order to re-execute the prepared statement that was stored.
If a connection pool cannot guarantee it can keep the client session linked with the same server session, such as statement and transaction pool mode with pgbouncer, then prepared statements cannot work successfully.
During the binding phase of the server-side execution, the bind values are taken and used by the planner to generate a custom plan. A prepared statement by default takes 5 custom plans before it is considering a generic plan. A custom plan means the planner computes the plan. A generic plan needs to be computed too, but if it is chosen, it sticks with the generic plan and skips plan computing, which is a further optimization. The plan choice can be influenced starting from Postgres version 12 with the plan_cache_mode setting.
Please mind the following specifics:
- Only once a statement is a server-side prepared statement, it can start gathering custom plan costs.
- It takes 5 (hardcoded value) custom plans before a generic plan is considered.
- A generic plan does not take the bind value specific properties into account, it assumes average (generic) data distribution. This means that there are cases where you don't want a generic plan.
- Starting from version 12 you can set plan_cache_mode.
- A prepared statement's logical and default path is going towards a generic plan. So if you don't want a generic plan, you either need to take care of that yourself somehow or don't use prepared statements in the first place.
- A non-performance reason for using a prepared statement is using the value binding for avoiding SQL injection. In such a case, you have to make up your mind if you want the performance advantage and the planner consequences. If not, set prepareThreshold to 0, which disables server-side prepared statement generation, and thus cannot switch to a generic plan.
A thing that I think is important to highlight is the persistent myth of prepared statement creation generating an extra (network?) roundtrip. When using a current JDBC version (which is what I have tested with), the server-side prepared statement generation is simply a slightly different version of parsing for the Postgres server. If the request size allows it, in a regular request and the request of creating a prepared statement, the parse, bind and execute steps are inside a single packet.
When using the simple query protocol, which is possible for JDBC but not defaults, but is used when the 'psql' utility is used, prepared statement creation can only be done using the prepare command, after which a prepared statement can be used using the execute command. So in that case, an extra command must be executed. But for high performance, the extended query protocol is really the only option.
Now that I've listed all the details, the obvious next question is: how much do I actually benefit from it? The short answer is that this depends on a lot of factors. However, to still be able to answer it, I created a java program that allows you to run it in your own environment so you can see and measure the effects.
Below is a bar chart of the results from running the tests on my machine in a VirtualBox virtual machine.
- The purpose of the test is to show prepared statement execution latencies. So this is a pure microbenchmark.
- Therefore I created a simple table with a thousand entries and a primary key index so that the index fetch would be sufficient.
- I performed a warmup run before the tests to make sure all/most of the data is cached in Postgres, so no disk IO is required to obtain the value for the select.
- The Y-axis is nanoseconds.
- The X-axis is grouped per prepareThreshold value (0-5), each bar represents the average time the #th execution took (so the first bar is the average time the first execution took, the second bar is the average time of the second execution, and so on until the tenth execution.
- I found that I cannot get all runs to have latencies really close to each other, there seem to be to unknown factors that can make a run perform differently. I do think the general tendency is clear.
- 0 means no server-side prepared statements at all. The latencies seem to be consistent. This is logical because outside of the client-side prepared statement creation, it's doing the same all the time.
- 1 means the first execution creates the client-side prepared statement, but it also creates the server-side prepared statement. The latency is significantly lower between the 1st and 2nd execution (25% latency reduction), which is the server-side prepared statement execution-only requiring bind and execute. Exactly after 5 executions, there is another drop in latency (33% reduction of previous execution), which the generic plan having been chosen and thus the plan generation is skipped.
- 2 means the first execution is done regularly and the second execution creates the server-side prepared statement. The latency drops with the third execution, and after 5 executions the latency drops again because the generic plan is chosen and thus plan generation skipped.
- 3 means the first two executions are done regularly and the third execution creates the server-side prepared statement, reduces the latency, and after 5 times the latency is further reduced. What I cannot explain is why the overall latency here is higher. However, the pattern of reduction is consistent.
- 4 means the first three executions are done regularly and the fourth execution creates the server-side prepared statement. This round of executions has latencies close to the ones with 3.
- 5 means the first four executions are done regular and the fifth execution creates the server-side prepared statement. This too shows the drop in latency when the server-side prepared statements are used. The drop in latency because of the choice for a generic plan is not visible because the test didn't get beyond running 5 times with a server-side prepared statement.
Prepared statements with JDBC are divided between client-side prepared statements and database server-side prepared statements. By default, you only get server-side prepared statements after five times executing the client-side prepared statement.
Once a server-side prepared statement is created, successive executions have reduced latency because the parse phase is skipped. The rate depending on the architecture of your application, this gives lower latency and reduced CPU usage.
After again five executions a prepared statement gathered enough custom plan costs to consider a generic plan. If a generic plan is chosen, the plan generation step is omitted, which gives lower latency and reduces CPU usage, ratio depending on application architecture.
Essentially, a prepared statement leads to a generic plan. If a generic plan is not appropriate, your choices are to not use a prepared statement, set prepareThreshold to zero so a server-side prepared statement is never created, or set plan_cache_mode to custom (version 12 and higher).
So far, in a (simple!) laboratory environment, I have shown how prepareThreshold and the generic planner threshold, and have proven there are no negative performance side-effects of setting the value of prepareThreshold to 1 (default: 5) to get server-side prepared statements as soon as possible, which leads to crossing the generic planner threshold as soon as possible, leading to potentially the most efficient way in the least amount of executions (6).
Published at DZone with permission of Frits Hoogland. See the original article here.
Opinions expressed by DZone contributors are their own.