Caching in Extracts


Most of the tabular-based extract types have the parameter “Use Caching” with the following possible options:

  • None
  • Memory
  • Disk

Caching can be used if an extract is accessed several times during the runtime of a job, including all of its subjobs and loads. This can occur if the job is used in different transforms, lookup functions, or loads, or when looping over it in a TableLoop transform or a Loop job.

If caching is activated, the complete output of the extract is temporarily stored during the first call of the extract, using an internal H2 database. Subsequent calls of the extract read directly from the cache without connecting to the underlying source system of the extract. If the extract or the underlying connection contain variables, a separate cache is build for different values of these variables.

Some advantages to using caching include:

  • Performance gain if the retrieval of data from the source system is time consuming, e.g. in case of complex SQL queries or remote systems with a slow network connection
  • Improved consistency if the data in the source system can change during the execution of an ETL job

One disadvantage to caching is increased overhead due to storage of the cache, especially in case of huge result sets for the extract.