Caching in Extracts and Transforms
Caching can be useful if an extract is accessed several times during the run time of a job, including all of its sub-jobs and loads. This can occur if the job is used in different transforms, lookup functions, or loads, or when looping over it in a TableLoop transform or a Loop job.
If caching is activated, the complete output of the extract is temporarily stored during the first call of the extract, using an internal H2 database. Subsequent calls of the extract read directly from the cache without connecting to the underlying source system of the extract. If the extract or the underlying connection contain variables, a separate cache is build for different values of these variables.
Caching can be enabled in settings of many extracts and transforms. For example, the screenshot below shows caching enabled at the bottom of Settings for an Excel extract:
Some advantages to using caching include:
- Performance gain if the retrieval of data from the source system is time consuming, e.g. in case of complex SQL queries or remote systems with a slow network connection
- Improved consistency if the data in the source system can change during the execution of an Integrator job
One disadvantage to caching is increased overhead due to storage of the cache, especially in case of huge result sets for the extract.
See also Best Practices for "Use Caching" Option
Implicit caching of sources
In some cases, caching is required to calculate the result of the source. This implicit caching occurs independently of the manual setting "Use caching." In these cases, an additional manual setting has no effect.
TableAggregation transform TableNormalization transform TableDenormalization transform |
If an aggregate is set that is different than "none" (e.g. "sum"). |
TableView transform | If sorting is defined. |
TableJoin transform | If "Persisted Join" is set. |
File extract | If a SQL query is defined. |
You can see in the execution log whenever a source is cached. An information message starting with "Explicit caching of source ..." or "Caching of source..." (in case of implicit caching) is raised in the log.
Note
The notation of numeric source values with fixed scale will change if caching is used in Relational or RelationalTable extracts or in transforms that have a relational extract as their direct or indirect source.
For example, if the SQL datatype has a numerical type with fixed scale NUMERIC(30,5) in the relational database, this fixed scale is not kept in the source, and trailing zeros are removed with caching. A value of 1000.12300 read from the relational database will become 1000.123 (with the trailing 0 omitted). See results below.
SQL datatype NUMERIC(30,5) with fixed scale of 5
Value in Relational extract: 1000.12300
- without caching: 1000.12300
- with caching: 1000.123 --> no fixed scale, trailing 00 are omitted
Value in Relational extract: 1000.00123
- without caching: 1000.00123
- with caching: 1000.00123 --> no trailing 0, value unchanged
Updated December 24, 2024