airbyte/airbyte-cdk/python/airbyte_cdk/sources/file_based/remote_file.py at 6ebabdc2fa89b24583c7e7f10aee8ef5329a0da8

jprdonnelly/airbyte

Alexandre Girard 6ebabdc2fa File-based CDK: Support for incremental syncs (#27382)

* New file-based CDK module scaffolding

* Address code review comments

* Formatting

* Automated Commit - Formatting Changes

* Apply suggestions from code review

Co-authored-by: Sherif A. Nada <snadalive@gmail.com>
Co-authored-by: Alexandre Girard <alexandre@airbyte.io>

* Automated Commit - Formatting Changes

* address CR comments

* Update tests to use builder pattern

* Move max files for schema inference onto the discovery policy

* Reorganize stream & its dependencies

* File CDK: error handling for CSV parser (#27176)

* file url and updated_at timestamp is added to state's history field

* Address CR comments

* Address CR comments

* Use stream_slice to determine which files to sync

* fix

* test with no input state

* test with multiple files

* filter out older files

* group by timestamp

* Add another test

* comment

* use min time

* skip files that are already in the history

* move the code around

* include files that are not in the history

* remove start_timestamp

* cleanup

* sync missing recent files even if history is more recent

* remove old files if history is full

* resync files if history is incomplete

* sync recent files

* comment

* configurable history size

* configurable days to sync if history is full

* move to a stateful object

* Only update state once per file

* two unit tests

* Unit tests

* missing files

* remove inner state

* fix tests

* fix interface

* fix constructor

* Update interface

* cleanup

* format

* Update

* cleanup

* Add timestamp and source file to schema

* set file uri on record

* format

* comment

* reset

* notes

* delete dead code

* format

* remove dead code

* remove dead code

* warning if history is not complete

* always set is_history_partial in the state

* rename

* Add a readme

* format

* Update

* rename

* rename

* missing files

* get instead of compute

* sort alphabetically, and sync everything if the history is not partial

* unit tests

* Update airbyte-cdk/python/airbyte_cdk/sources/file_based/README.md

Co-authored-by: Catherine Noll <clnoll@users.noreply.github.com>

* Update docs

* reset

* Test to verify we remove files sorted (datetime, alphabetically)

* comment

* Update scenario

* Rename method to get_state

* If the file's ts is equal to the earliest ts, only sync it if it's alphabetically greater than the file

* add missing test

* rename

* rename and update comments

* Update comment for clarity

* inject the cursor

* add interface

* comment

* Handle the case where the file has been modified since it was synced

* Only inject from AbstractFileSource

* keep the remote files in the stream slices

* Use file_based typedefs

* format

* Update the comment

* simplify the logic, update comment, and add a test

* Add a comment

* slightly cleaner

* clean up

* typing

* comment

* I think this is simpler to reason about

* create the cursor in the source

* update

* Remove methods from FiledBasedStreamReader and AbstractFileBasedStream interface (#27736)

* update the interface

* Add a comment

* rename

---------

Co-authored-by: Catherine Noll <noll.catherine@gmail.com>
Co-authored-by: clnoll <clnoll@users.noreply.github.com>
Co-authored-by: Sherif A. Nada <snadalive@gmail.com>

2023-06-27 15:58:26 -07:00

19 lines

312 B

Python

#
# Copyright (c) 2023 Airbyte, Inc., all rights reserved.
#

from datetime import datetime
from typing import Optional

from pydantic import BaseModel


class RemoteFile(BaseModel):
    """
    A file in a file-based stream.
    """

    uri: str
    last_modified: datetime
    file_type: Optional[str] = None
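
The commit notes above mention skipping files already in the history, resyncing files modified since they were synced, and ordering by (datetime, alphabetically). The following is a minimal sketch of that selection idea, not the CDK's actual cursor. It assumes a history shaped as a mapping from uri to the last_modified value seen at sync time; `RemoteFileSketch` and `files_to_sync` are hypothetical names, and the dataclass is a stdlib stand-in that mirrors the fields of `RemoteFile` so the example runs without pydantic.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RemoteFileSketch:
    """Hypothetical stdlib stand-in with the same fields as RemoteFile."""

    uri: str
    last_modified: datetime
    file_type: Optional[str] = None


def files_to_sync(
    files: List[RemoteFileSketch], history: Dict[str, datetime]
) -> List[RemoteFileSketch]:
    """Select files that are absent from the history or newer than their entry.

    The history shape (uri -> last_modified at sync time) is an assumption
    made for this sketch.
    """
    selected = [
        f
        for f in files
        if f.uri not in history or f.last_modified > history[f.uri]
    ]
    # Order by timestamp first, then alphabetically by uri to break ties.
    return sorted(selected, key=lambda f: (f.last_modified, f.uri))


history = {"a.csv": datetime(2023, 6, 1), "b.csv": datetime(2023, 6, 2)}
files = [
    RemoteFileSketch("c.csv", datetime(2023, 6, 3), "csv"),  # not in history
    RemoteFileSketch("a.csv", datetime(2023, 6, 1), "csv"),  # unchanged, skipped
    RemoteFileSketch("b.csv", datetime(2023, 6, 5), "csv"),  # modified, resynced
    RemoteFileSketch("d.csv", datetime(2023, 6, 3)),  # file_type defaults to None
]
result = [f.uri for f in files_to_sync(files, history)]
print(result)
```

Only uri and last_modified drive the selection here; file_type is carried along unchanged, which matches its optional status in the model.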