# Incremental syncs
The file-based connectors support the following sync modes:
| Feature | Supported? |
|---|---|
| Full Refresh Sync | Yes |
| Incremental Sync | Yes |
| Replicate Incremental Deletes | No |
| Replicate Multiple Files (pattern matching) | Yes |
| Replicate Multiple Streams (distinct tables) | Yes |
| Namespaces | No |
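To use incremental sync, the stream must be selected with that mode in the configured catalog. A minimal sketch of such an entry, rendered as a Python dict (the stream name is hypothetical; field names follow the Airbyte protocol):

```python
# Illustrative configured-catalog entry selecting incremental sync.
configured_stream = {
    "stream": {
        "name": "my_csv_stream",  # hypothetical stream name
        "supported_sync_modes": ["full_refresh", "incremental"],
    },
    "sync_mode": "incremental",       # pull only files modified since the last sync
    "destination_sync_mode": "append",
}
```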
## Incremental sync
After the initial sync, the connector only pulls files that were modified since the last sync.
The connector checkpoints the connection state once it has finished syncing all files for a given timestamp. The connection's state only keeps track of the last 10,000 files synced. If more than 10,000 files are synced, the connector can no longer rely on the connection state to deduplicate files. In that case, it initializes its cursor to the earlier of the timestamp of the earliest file in the history and 3 days before the current sync.
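The cursor logic described above can be sketched as follows. This is a simplified illustration, not the CDK's actual implementation: the state's file history is modeled as a plain dict mapping file URIs to last-modified timestamps, and the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_HISTORY_SIZE = 10_000            # files remembered in the connection state
DAYS_TO_SYNC_IF_HISTORY_IS_FULL = 3  # fallback time buffer when history overflows


def compute_start_time(history: dict[str, datetime]) -> datetime:
    """Return the timestamp from which files should be considered for syncing."""
    if not history:
        # No prior state: sync everything.
        return datetime.min.replace(tzinfo=timezone.utc)
    earliest = min(history.values())
    if len(history) < MAX_HISTORY_SIZE:
        # History is complete, so it can be trusted for deduplication.
        return earliest
    # History is full: older unseen files may have been evicted, so fall back
    # to the earlier of the oldest remembered file and 3 days ago.
    window_start = datetime.now(timezone.utc) - timedelta(days=DAYS_TO_SYNC_IF_HISTORY_IS_FULL)
    return min(earliest, window_start)


def files_to_sync(all_files: dict[str, datetime], history: dict[str, datetime]) -> list[str]:
    """Select files modified on or after the start time that were not already synced."""
    start = compute_start_time(history)
    return sorted(
        uri
        for uri, modified in all_files.items()
        if modified >= start and uri not in history
    )
```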
Both the maximum number of files and the time buffer can be configured by connector developers.
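One way a connector developer might surface these knobs is as a small policy object; the names below are illustrative assumptions, not the CDK's actual API:

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class CursorPolicy:
    """Hypothetical configuration for the incremental cursor's limits."""
    max_history_size: int = 10_000
    time_window_if_history_is_full: timedelta = timedelta(days=3)


# A connector syncing very large buckets might widen both limits:
policy = CursorPolicy(
    max_history_size=50_000,
    time_window_if_history_is_full=timedelta(days=7),
)
```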