Intervals and Windows
When defining a new job, you need to define the interval (cron) at which it will be triggered. This parameter can give you a precise value when the job is scheduled for execution but only a rough estimate exactly when the job is executing. It is very common in a ETL pipeline to know when the job is exactly executing as well as for what time window the current transformation will consume the data.
For example, assume there is a job that querying from a table using below statement:
SELECT * FROM table WHERE
created_at >= DATE('{{.DSTART}}') AND
created_at < DATE('{{.DEND}}')
DSTART and DEND could be replaced at the time of compilation with based on its window configuration. Without the provided filter, we will have to consume all the records which are created till date inside the table even though the previous rows might already been processed.
These DSTART and DEND values of the input window could vary depending on the ETL job requirement.
- For a simple transformation job executing daily, it would need to consume full day work of yesterday’s data.
- A job might be consuming data for a week/month for an aggregation job, but the data boundaries should be complete, not consuming any partial data of a day.
Window Configuration
Optimus allows user to define the amount of data window to consume through window configurations. The configurations act on the scheduletime of the job and applied in order to compute _DSTART and DEND.
- Truncate_to: The data window on most of the scenarios needs to be aligned to a well-defined time window like month start to month end, or week start to weekend with week start being monday, or a complete day. Inorder to achieve that the truncate_to option is provided which can be configured with either of these values "h", "d", "w", "M" through which for a given schedule_time the end_time will be the end of last hour, day, week, month respectively.
- Offset: Offset is time duration configuration which enables user to move the
end_time
post truncation. User can define the duration like "24h", "2h45m", "60s", "-45m24h", "0", "", "2M", "45M24h", "45M24h30m" where "h","m","s","M" means hour, month, seconds, Month respectively. - Size: Size enables user to define the amount of data to consume from the
end_time
again defined through the duration same as offset.
For example, previous-mentioned job has 0 2 * * *
schedule interval and is scheduled to run on
2023-03-07 at 02.00 UTC with following details:
Configuration | Value | Description |
---|---|---|
Truncate_to | d | Even though it is scheduled at 02.00 AM, data window will be day-truncated (00.00 AM). |
Offset | -24h | Shifts the window to be 1 day earlier. |
Size | 24h | Gap between DSTART and DEND is 24h. |
Above configuration will produce below window:
- DSTART: 2023-04-05T00:00:00Z
- DEND: 2023-04-06T00:00:00Z
This means, the query will be compiled to the following query
SELECT * FROM table WHERE
created_at >= DATE('2023-04-05T00:00:00Z') AND
created_at < DATE('2023-04-06T00:00:00Z')
Assume the table content is as the following:
name | created_at |
---|---|
Rick | 2023-03-05 |
Sanchez | 2023-03-06 |
Serious | 2023-03-07 |
Sam | 2023-03-07 |
When the job that scheduled at 2023-03-07 runs, the job will consume Rick
as the input of the table.
The above expectation of windowing is properly handled in job spec version 2, version 1 has some limitations in some of these expectations. You can verify these configurations by trying out in below command:
$ optimus playground