
The Best Way To Batch Inserts Via saveAll(Iterable<S> entities)

Updated: May 19, 2020

Motivation:

This article is useful for implementing an optimal insert batching mechanism via the saveAll() method. The best way to do it is via a custom implementation (see the end of this article), but using the Spring Data built-in saveAll() for batching inserts is a solution that requires less code. However, this solution raises some gotchas.



Description:

This article shows you the best way of batching inserts via the Spring Data built-in saveAll(Iterable<S> entities) method.


Key points:


In entities that are involved in batching you must add the @Version property

The Hibernate IDENTITY generator (e.g., MySQL AUTO_INCREMENT, PostgreSQL (BIG)SERIAL) disables JDBC insert batching.


Adding @Version applies if you use the assigned generator (manual assignment of IDs).

For example, this is useful for MySQL (don't use the TABLE generator because it has serious performance penalties). In other databases, such as PostgreSQL, you can use the SEQUENCE generator and adding @Version is not needed.
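
Here is a minimal sketch for databases that support sequences, such as PostgreSQL (the Book entity, generator name, and allocation size are illustrative assumptions, not the article's code):

import java.io.Serializable;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.SequenceGenerator;

@Entity
public class Book implements Serializable {

    private static final long serialVersionUID = 1L;

    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "book_seq")
    @SequenceGenerator(name = "book_seq", sequenceName = "book_seq", allocationSize = 30)
    private Long id; // sequence-based IDs keep JDBC insert batching enabled, so no @Version is needed here

    private String title;

    // getters and setters omitted for brevity
}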


But, why is @Version needed for the assigned generator? It is needed to avoid the additional SELECT statements fired before batching. For N inserts there would be N additional SELECT statements, therefore a performance penalty that should be eliminated. These additional SELECT statements are caused by the saveAll() implementation. Behind the scenes, saveAll() calls save(), which, in case of non-new entities (entities that have assigned IDs), calls merge(). Calling merge() instructs Hibernate (the default Spring Data persistence provider) to trigger a SELECT statement before the INSERT. This way, Hibernate ensures that there is no record in the database having the same identifier (ID). By adding a @Version field in the entities that are involved in batching inserts, you eliminate these additional SELECT statements (they will not be triggered):
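
Below is a minimal sketch of such an entity (the Author name and its fields are assumptions for illustration). The ID is manually assigned, and the @Version field is a wrapper type so that a null version marks the entity as new, which makes Spring Data call persist() instead of merge():

import java.io.Serializable;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Version;

@Entity
public class Author implements Serializable {

    private static final long serialVersionUID = 1L;

    @Id
    private Long id; // assigned manually before calling saveAll()

    @Version
    private Long version; // null version => treated as a new entity, no extra SELECT before the INSERT

    private String name;

    // getters and setters omitted for brevity
}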

Don't overwhelm the Persistence Context

Pay attention to the amount of data passed to saveAll() so that you don't overwhelm the Persistence Context (1st Level Cache). The 1st Level Cache should be flushed and cleared periodically to avoid slow flushes or, even worse, memory errors (OOM). But during the saveAll() execution you simply cannot force a flush/clear of the 1st Level Cache, so if you pass a huge iterable (e.g., 1,000,000 items) to saveAll(), then all those items will hit the Persistence Context (1st Level Cache) and will remain in memory until the flush triggered before the transaction commit. They will be cleared from memory only after the transaction is committed. This may cause slow flushes or memory errors.

Using a relatively small amount of data (e.g., 100 items) should be fine. Otherwise, execute each batch in a separate transaction: in the following example, we batch 1,000 entities. Each batch contains 30 entities (instances of Author) and runs in a separate transaction, so flush and clear operations take place after each batch. This is exactly what we want, because this way we avoid long-running transactions and, in case of a failure, we roll back only the last batch:
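
A sketch of this approach, assuming an AuthorRepository that extends JpaRepository<Author, Long> (the class and method names below are illustrative, not the article's code):

import java.util.List;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class AuthorBatchService {

    private static final int BATCH_SIZE = 30;

    private final AuthorWriter authorWriter;

    public AuthorBatchService(AuthorWriter authorWriter) {
        this.authorWriter = authorWriter;
    }

    // intentionally NOT @Transactional: each chunk below runs in its own transaction
    public void saveInBatches(List<Author> authors) {
        for (int start = 0; start < authors.size(); start += BATCH_SIZE) {
            int end = Math.min(start + BATCH_SIZE, authors.size());
            authorWriter.saveChunk(authors.subList(start, end));
        }
    }
}

@Service
class AuthorWriter {

    private final AuthorRepository authorRepository;

    AuthorWriter(AuthorRepository authorRepository) {
        this.authorRepository = authorRepository;
    }

    @Transactional
    public void saveChunk(List<Author> chunk) {
        // only 30 entities hit the Persistence Context per transaction;
        // a failure rolls back just this chunk
        authorRepository.saveAll(chunk);
    }
}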

So, don't write the code as below unless you want to batch all 1,000 entities in a single transaction, with a single flush and clear at the end:
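
A sketch of this anti-pattern (same illustrative names as above); all entities accumulate in the Persistence Context and are flushed only once, before the commit:

@Transactional
public void saveAllInOneTransaction(List<Author> authors) {
    // 1,000 entities pile up in the 1st Level Cache of a single long-running transaction
    authorRepository.saveAll(authors);
}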

Potential useless List objects

The saveAll() method returns a List<S> containing the persisted entities. Each persisted entity is added to this list. If you don't need this List then it is created for nothing; but, as long as we rely on the default saveAll() implementation, there is nothing we can do about it.


Prepare batching:

  • In application.properties set spring.jpa.properties.hibernate.jdbc.batch_size (the recommended value is between 5 and 30)

  • In application.properties set spring.jpa.properties.hibernate.generate_statistics (just to check that batching is working)

  • Additionally, if it is not needed, ensure that the Second Level Cache is disabled via spring.jpa.properties.hibernate.cache.use_second_level_cache=false (a combined sketch of these settings follows this list)
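
A minimal application.properties sketch combining these settings (30 is just one value within the recommended range):

spring.jpa.properties.hibernate.jdbc.batch_size=30
spring.jpa.properties.hibernate.generate_statistics=true
spring.jpa.properties.hibernate.cache.use_second_level_cache=false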


Optimize parent-child batching:

The following setting adds a huge boost if you need to batch parent-child associations (this aspect is detailed in Spring Boot Persistence Best Practices).


  • In case of using a parent-child relationship with cascade persist (e.g., one-to-many, many-to-many), consider setting spring.jpa.properties.hibernate.order_inserts to true to optimize the batching by ordering inserts, as shown below.
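
For example, in application.properties:

spring.jpa.properties.hibernate.order_inserts=true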


For MySQL only:


This is mandatory: in entities, use the assigned generator, since the MySQL IDENTITY generator will cause insert batching to be disabled.


The settings below are needed only for MySQL; their goal is to provide optimizations that speed up the batching process.


  • In application.properties set the JDBC URL with rewriteBatchedStatements=true (SQL statements get rewritten into a single string buffer and sent in a single request to the database; supported in MySQL and PostgreSQL)

  • In application.properties set the JDBC URL with cachePrepStmts=true (this enables caching and is useful if you also decide to set prepStmtCacheSize, prepStmtCacheSqlLimit, etc.; without this setting the cache is disabled)

  • In application.properties set the JDBC URL with useServerPrepStmts=true (this way you switch to server-side prepared statements, which may lead to a significant performance boost), as in the sketch below
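
A JDBC URL sketch combining these flags (host, port, and database name are placeholders):

spring.datasource.url=jdbc:mysql://localhost:3306/mydatabase?cachePrepStmts=true&useServerPrepStmts=true&rewriteBatchedStatements=true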

Tam Ta Da Dam! :) The complete application is available on GitHub.


If you want to avoid the shortcomings of saveAll() listed above, then this is what you're looking for: a custom implementation that

  • uses persist() instead of merge()

  • doesn't require the presence of @Version in entities for avoiding extra SELECTs

  • doesn't return a List of the persisted entities

  • exposes batching in Spring style via a method named saveInBatch(Iterable<S>)


If you need a deep dive into the performance recipes exposed in this repository then I am sure that you will love my book "Spring Boot Persistence Best Practices".


