Data integration platform for ELT pipelines from APIs, databases & files to databases, warehouses & lakes
We believe that only an open-source solution to data movement can cover the long tail of data sources while empowering data engineers to customize existing connectors. Our ultimate vision is to help you move data from any source to any destination. Airbyte already provides the largest catalog of 300+ connectors for APIs, databases, data warehouses, and data lakes.
[Screenshot taken from Airbyte Cloud.]
Getting Started
- Deploy Airbyte Open Source or set up Airbyte Cloud to start centralizing your data.
- Create connectors in minutes with our no-code Connector Builder or low-code CDK (see the connector sketch after this list).
- Explore popular use cases in our tutorials.
- Orchestrate Airbyte syncs with Airflow, Prefect, Dagster, or the Airbyte API (see the Airflow sketch after this list).
- Easily transform loaded data with SQL or dbt.
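As a rough illustration of what building a connector involves, here is a minimal sketch of a stream built on the Python CDK's HttpStream base class. The class name, the PokeAPI endpoint, and the pagination handling are assumptions made for the example, not an official connector; the low-code CDK expresses the same ideas declaratively in a YAML manifest instead.

```python
# Illustrative sketch only: a minimal stream built on the Python CDK.
# The class name, endpoint, and primary key are assumptions for this example.
from typing import Any, Iterable, Mapping, Optional

import requests
from airbyte_cdk.sources.streams.http import HttpStream


class PokemonStream(HttpStream):
    url_base = "https://pokeapi.co/api/v2/"
    primary_key = "name"

    def path(self, **kwargs) -> str:
        # Endpoint appended to url_base for each request.
        return "pokemon"

    def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
        # PokeAPI responses include a "next" URL when more pages exist;
        # returning None stops pagination. A full connector would also use
        # this token in request_params or path to fetch the next page.
        next_url = response.json().get("next")
        return {"next_url": next_url} if next_url else None

    def parse_response(self, response: requests.Response, **kwargs) -> Iterable[Mapping[str, Any]]:
        # Each record yielded here becomes an Airbyte record message.
        yield from response.json().get("results", [])
```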
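If you orchestrate with Airflow, triggering a sync looks roughly like the sketch below. It assumes the apache-airflow-providers-airbyte package is installed, an Airflow connection named "airbyte_default" points at your Airbyte instance, and the connection_id is a placeholder for one of your Airbyte connections.

```python
# Sketch of an Airflow DAG that triggers an Airbyte sync.
from datetime import datetime

from airflow import DAG
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

with DAG(
    dag_id="trigger_airbyte_sync",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    sync = AirbyteTriggerSyncOperator(
        task_id="airbyte_sync",
        airbyte_conn_id="airbyte_default",          # Airflow connection to the Airbyte instance
        connection_id="<your-airbyte-connection-id>",  # placeholder: the Airbyte connection to run
        asynchronous=False,  # block until the sync job finishes
    )
```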
Try it out yourself with our demo app, visit our full documentation, and learn more about recent announcements. See our registry for a full list of connectors already available in Airbyte or Airbyte Cloud.
Join the Airbyte Community
The Airbyte community can be found in the Airbyte Community Slack, where you can ask questions and voice ideas. You can also ask for help in our Discourse forum, or join our office hours. Airbyte's roadmap is publicly viewable on GitHub.
For videos and blogs on data engineering and building your data stack, check out Airbyte's Content Hub and YouTube channel, and sign up for our newsletter.
Dedicated support with direct access to our team is also available for Open Source users. If you are interested, please fill out this form.
Contributing
If you've found a problem with Airbyte, please open a GitHub issue. To contribute to Airbyte and see our Code of Conduct, please see the contributing guide. We have a list of good first issues containing bugs with a relatively limited scope. This is a great place to get started, gain experience, and get familiar with our contribution process.
Security
Airbyte takes security issues very seriously. Please do not file GitHub issues or post on our public forum for security vulnerabilities. Email security@airbyte.io if you believe you have uncovered a vulnerability. In the message, try to provide a description of the issue and ideally a way of reproducing it. The security team will get back to you as soon as possible.
Airbyte Enterprise also offers additional security features, among other capabilities, on top of Airbyte Open Source.
License
See the LICENSE file for licensing information, and our FAQ for any questions you may have on that topic.
Thank You
Airbyte would not be possible without the support and assistance of other open-source tools and companies. Visit our thank you page to learn more about how we build Airbyte.