Caching in Extracts and Transforms
Caching can be useful if an extract or transform is accessed several times during the run time of a job, including all of its sub-jobs and loads. This can occur if the job is used in different transforms, lookup functions, or loads, or when looping over it in a TableLoop transform or a Loop job.
If caching is activated, the complete output of the extract or transform is temporarily stored during the first call, using an internal H2 database. Subsequent calls of the extract/transform read directly from the cache without connecting to the underlying source system. If the extract/transform or the underlying connection contain variables, a separate cache is build for different values of these variables.
Caching can be enabled in settings of many extracts and transforms. For example, the screenshot below shows caching enabled at the bottom of Settings for an Excel extract:
Some advantages to using caching include:
- Performance gain if the retrieval of data from the source system is time consuming, e.g. in case of complex SQL queries or remote systems with a slow network connection
- Improved consistency if the data in the source system can change during the execution of an Integrator job
One disadvantage to caching is increased overhead due to storage of the cache, especially in case of huge result sets for the extract.
Implicit caching of sources
In some cases, caching is required to calculate the result of the source. This implicit caching occurs independently of the manual setting "Use caching." In these cases, an additional manual setting has no effect.
TableAggregation transform TableNormalization transform TableDenormalization transform |
If an aggregate is set that is different than "none" (e.g. "sum"). |
TableView transform | If sorting is defined. |
TableJoin transform | If "Persisted Join" is set. |
File extract | If a SQL query is defined. |
You can see in the execution log whenever a source is cached. An information message starting with "Explicit caching of source ..." or "Caching of source..." (in case of implicit caching) is raised in the log.
Note
The notation of numeric source values with fixed scale will change if caching is used in Relational or RelationalTable extracts or in transforms that have a relational extract as their direct or indirect source.
For example, if the SQL datatype has a numerical type with fixed scale NUMERIC(30,5) in the relational database, this fixed scale is not kept in the source, and trailing zeros are removed with caching. A value of 1000.12300 read from the relational database will become 1000.123 (with the trailing 0 omitted). See results below.
SQL datatype NUMERIC(30,5) with fixed scale of 5
Value in Relational extract: 1000.12300
- without caching: 1000.12300
- with caching: 1000.123 --> no fixed scale, trailing 00 are omitted
Value in Relational extract: 1000.00123
- without caching: 1000.00123
- with caching: 1000.00123 --> no trailing 0, value unchanged
Best Practices for "Use Caching" Option
When you have a large project, the ETL Monitor helps you find the most time-consuming tasks in the Integrator job. You can then use caching on this task to quicken the process.
To find out the duration of each task, look at the "Execution Runtime Details" Table. To view this table, right-click on the project name in the Integrator Monitor and select Show execution runtime details, as shown below.
This table provides the complete detail of the current Job in a list format. You can sort the items by "Duration" or "Input calls".
If a long-running transform or extract has multiple input calls and output calls, it is recommended to set up the use caching option for this item. For example, if in a project both the input calls and output calls of an item are larger or equal to five ( >=5) and its "Duration" is one of the most time-consuming items of the list, set up the "use caching" option for this item. With the next job run, you can notice the optimization effects with the new figures.
Note that caching is beneficial if both the "Input calls" and "Output calls" in the monitor show great numbers. If "input calls" is showing 1 and "output calls" is showing a large number, caching of the transform only reduces the multiple processing of the transform but not its underlying sources.
When to use caching
When you check the "Execution Runtime Results" Table and the output call number of a job with an extract or transform is smaller or equal to one ( <=1), you do not need to select the "use caching" option.
Furthermore, it is important to know that the extract/transform is always expected to output the same data (except when there are different variable values). If there is a load between those calls that alters that component's output, caching is not feasible.
Also, note that not every caching results in better performance. If the extract/transform contains different project variables, the "use caching" flag can drastically slow down the execution time. However, you can immediately detect this in the "Execution Runtime Detail" Table. Furthermore, it is important to note that the extract/transform is expected to always output the same data (except when there are different variable values.) If there is a load between those calls that alters that component's output, caching is not feasible.
Factors that show you need to use caching:
-
High number of output calls
-
High duration of the component
-
A low number of processed output rows (if the previous two factors apply)
-
The output data is the same in all output calls (except for different variable values)
Further information
-
If the different output calls of the source are processed with different variable settings, then manual caching does not help. Even if you set the cache, those calls will not use it, as this may lead to different results.
-
For extracts from a remote system, when the processing of the source system takes a long time, it is better to use manual caching. For example, use manual caching in a relational database with complex SQL statements and/ or a slow network.
-
When an extract is used in a transform, and they both have the same number of output calls, it is better to set the manual caching on the transform. If the transform itself is used in another transform with the same number of output calls as the extract and the first transform, then set the manual caching on the second transform.
-
Sources that are already automatically cached should not additionally be cached manually.
-
After the end of each execution, all caches are cleared.
Updated January 30, 2025