* New file-based CDK module scaffolding * Address code review comments * Formatting * Automated Commit - Formatting Changes * Apply suggestions from code review Co-authored-by: Sherif A. Nada <snadalive@gmail.com> Co-authored-by: Alexandre Girard <alexandre@airbyte.io> * Automated Commit - Formatting Changes * address CR comments * Update tests to use builder pattern * Move max files for schema inference onto the discovery policy * Reorganize stream & its dependencies * File CDK: error handling for CSV parser (#27176) * file url and updated_at timestamp is added to state's history field * Address CR comments * Address CR comments * Use stream_slice to determine which files to sync * fix * test with no input state * test with multiple files * filter out older files * group by timestamp * Add another test * comment * use min time * skip files that are already in the history * move the code around * include files that are not in the history * remove start_timestamp * cleanup * sync misisng recent files even if history is more recent * remove old files if history is full * resync files if history is incomplete * sync recent files * comment * configurable history size * configurable days to sync if history is full * move to a stateful object * Only update state once per file * two unit tests * Unit tests * missing files * remove inner state * fix tests * fix interface * fix constructor * Update interface * cleanup * format * Update * cleanup * Add timestamp and source file to schema * set file uri on record * format * comment * reset * notes * delete dead code * format * remove dead code * remove dead code * warning if history is not complete * always set is_history_partial in the state * rename * Add a readme * format * Update * rename * rename * missing files * get instead of compute * sort alphabetically, and sync everthing if the history is not partial * unit tests * Update airbyte-cdk/python/airbyte_cdk/sources/file_based/README.md Co-authored-by: Catherine Noll <clnoll@users.noreply.github.com> * Update docs * reset * Test to verify we remove files sorted (datetime, alphabetically) * comment * Update scenario * Rename method to get_state * If the file's ts is equal to the earliest ts, only sync it if its alphabetically greater than the file * add missing test * rename * rename and update comments * Update comment for clarity * inject the cursor * add interface * comment * Handle the case where the file has been modified since it was synced * Only inject from AbstractFileSource * keep the remote files in the stream slices * Use file_based typedefs * format * Update the comment * simplify the logic, update comment, and add a test * Add a comment * slightly cleaner * clean up * typing * comment * I think this is simpler to reason about * create the cursor in the source * update * Remove methods from FiledBasedStreamReader and AbstractFileBasedStream interface (#27736) * update the interface * Add a comment * rename --------- Co-authored-by: Catherine Noll <noll.catherine@gmail.com> Co-authored-by: clnoll <clnoll@users.noreply.github.com> Co-authored-by: Sherif A. Nada <snadalive@gmail.com>
19 lines
312 B
Python
19 lines
312 B
Python
#
|
|
# Copyright (c) 2023 Airbyte, Inc., all rights reserved.
|
|
#
|
|
|
|
from datetime import datetime
|
|
from typing import Optional
|
|
|
|
from pydantic import BaseModel
|
|
|
|
|
|
class RemoteFile(BaseModel):
|
|
"""
|
|
A file in a file-based stream.
|
|
"""
|
|
|
|
uri: str
|
|
last_modified: datetime
|
|
file_type: Optional[str] = None
|