# Building a Python Source
## Summary
This article provides a checklist for how to create a Python source. Each step in the checklist has
a link to a more detailed explanation below.
## Requirements
Docker, Python, and Java with the versions listed in the
[tech stack section](../../understanding-airbyte/tech-stack.md).
:::info
All the commands below assume that `python` points to a version of Python >3.7. On some systems,
`python` points to a Python 2 installation and `python3` points to Python 3. If this is the case on
your machine, substitute all `python` commands in this guide with `python3`. Otherwise, make sure
to install Python 3 before beginning.
:::
## Checklist
### Creating a Source
- Step 1: Create the source using template
- Step 2: Install the newly generated source
- Step 3: Set up your Airbyte development environment
- Step 4: Implement `spec` \(and define the specification for the source
`airbyte-integrations/connectors/source-<source-name>/spec.yaml`\)
- Step 5: Implement `check`
- Step 6: Implement `discover`
- Step 7: Implement `read`
- Step 8: Set up Connector Acceptance Tests
- Step 9: Write unit tests or integration tests
- Step 10: Update the `README.md` \(If API credentials are required to run the integration, please
document how they can be obtained or link to a how-to guide.\)
- Step 11: Update the `metadata.yaml` file with accurate information about your connector. This
  metadata will be used to add the connector to Airbyte's connector registry.
- Step 12: Add docs \(in `docs/integrations/sources/<source-name>.md`\)
:::info
Each step of the Creating a Source checklist is explained in more detail below.
:::
:::info
All `./gradlew` commands must be run from the root of the airbyte project.
:::
### Submitting a Source to Airbyte
- If you need help with any step of the process, feel free to submit a PR with your progress and any
questions you have.
- Submit a PR.
- To run integration tests, Airbyte needs access to a test account/environment. Coordinate with an
Airbyte engineer \(via the PR\) to add test credentials so that we can run tests for the
integration in the CI. \(We will create our own test account once you let us know what source we
need to create it for.\)
- Once the config is stored in Github Secrets, edit `.github/workflows/test-command.yml` and
`.github/workflows/publish-command.yml` to inject the config into the build environment.
- Edit the `airbyte/tools/bin/ci_credentials.sh` script to pull the config from the build
  environment and write it to `secrets/config.json` during the build.
:::info
If you have a question about a step in the Submitting a Source to Airbyte checklist, include it
in your PR or ask it on the
[#help-connector-development channel on Slack](https://airbytehq.slack.com/archives/C027KKE4BCZ).
:::
## Explaining Each Step
### Step 1: Create the source using template
Airbyte provides a code generator which bootstraps the scaffolding for our connector.
```bash
$ cd airbyte-integrations/connector-templates/generator # assumes you are starting from the root of the Airbyte project.
$ ./generate.sh
```
Select the `python` template and then input the name of your connector. For this walkthrough, we
will refer to our source as `example-python`.
### Step 2: Install the newly generated source
Install the source by running:
```bash
cd airbyte-integrations/connectors/source-<name>
poetry install
```
### Step 3: Set up your Airbyte development environment
The generator creates a file `source_<source_name>/source.py`. This will be where you implement the
logic for your source. The templated `source.py` contains extensive comments explaining each method
that needs to be implemented. Briefly, here is an overview of each of these methods.
1. `spec`: declares the user-provided credentials or configuration needed to run the connector
2. `check`: tests whether the connector can connect to the underlying data source with the user-provided configuration
3. `discover`: declares the different streams of data that this connector can output
4. `read`: reads data from the underlying data source
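For orientation, here is a minimal sketch of the shape the generated `source.py` takes. The class name `SourceExamplePython` matches the `example-python` name used in this walkthrough; the real template includes full type hints and docstrings. (`spec` is not shown because the generated code implements it for you by reading `spec.yaml`, as described in Step 4.)
```python
# A minimal, hypothetical sketch of the generated source.py skeleton.
from typing import Generator

from airbyte_cdk.models import AirbyteCatalog, AirbyteConnectionStatus, AirbyteMessage
from airbyte_cdk.sources import Source


class SourceExamplePython(Source):
    def check(self, logger, config) -> AirbyteConnectionStatus:
        ...  # Step 5: verify that the provided config can reach the data source

    def discover(self, logger, config) -> AirbyteCatalog:
        ...  # Step 6: describe the streams this source can output

    def read(self, logger, config, catalog, state) -> Generator[AirbyteMessage, None, None]:
        ...  # Step 7: emit records for the streams selected in the configured catalog
```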
#### Dependencies
Python dependencies for your source should be declared in
`airbyte-integrations/connectors/source-<source-name>/setup.py` in the `install_requires` field. You
will notice that a couple of Airbyte dependencies are already declared there. Do not remove these;
they give your source access to the helper interface that is provided by the generator.
You may notice that there is a `requirements.txt` in your source's directory as well. Do not touch
this. It is autogenerated and used to provide Airbyte dependencies. All your dependencies should be
declared in `setup.py`.
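For example (an illustrative sketch only; keep whatever the generator already put in your `setup.py`), adding a dependency on `requests` could look like this. The package name `source_example_python` is the hypothetical name from this walkthrough:
```python
# setup.py (illustrative) -- declare extra runtime dependencies in install_requires.
from setuptools import find_packages, setup

setup(
    name="source_example_python",  # hypothetical package name for this walkthrough
    packages=find_packages(),
    install_requires=[
        "airbyte-cdk",  # Airbyte dependency added by the generator; do not remove
        "requests",     # example of an additional dependency your source needs
    ],
)
```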
#### Development Environment
The commands we ran above created a virtual environment for your source. If you want your IDE to
autocomplete and resolve dependencies properly, point it at the virtual env
`airbyte-integrations/connectors/source-<source-name>/.venv`. Also, anytime you change the
dependencies in `setup.py`, make sure to re-run the build command. The build system will handle
installing all dependencies declared in `setup.py` into the virtual environment.
Pretty much all it takes to create a source is to implement the `Source` interface. The template
fills in a lot of information for you and has extensive docstrings describing what you need to do to
implement each method. The next 4 steps are just implementing that interface.
:::info
All logging should be done through the `logger` object passed into each method. Otherwise,
logs will not be shown in the Airbyte UI.
:::
#### Iterating on your implementation
Everyone develops differently, but here are three ways we recommend iterating on a source. Consider
using whichever one matches your style.
**Run the source using python**
You'll notice in your source's directory that there is a Python file called `main.py`. This file
exists as a convenience for development. You can run it from within the virtual environment mentioned
above \(activate it with `. ./.venv/bin/activate`\) to test that your source works.
```bash
# from airbyte-integrations/connectors/source-<source-name>
poetry run source-<source-name> spec
poetry run source-<source-name> check --config secrets/config.json
poetry run source-<source-name> discover --config secrets/config.json
poetry run source-<source-name> read --config secrets/config.json --catalog sample_files/configured_catalog.json
```
The nice thing about this approach is that you can iterate entirely within Python. The downside
is that you are not quite running your source as it will actually be run by Airbyte. Specifically,
you're not running it from within the Docker container that will house it.
**Build the source docker image**
You have to build a docker image for your connector if you want to run your source exactly as it
will be run by Airbyte.
**Option A: Building the docker image with `airbyte-ci`**
This is the preferred method for building and testing connectors.
If you want to open source your connector we encourage you to use our
[`airbyte-ci`](https://github.com/airbytehq/airbyte/blob/master/airbyte-ci/connectors/pipelines/README.md)
tool to build your connector. It will not use a Dockerfile but will build the connector image from
our
[base image](https://github.com/airbytehq/airbyte/blob/master/airbyte-ci/connectors/base_images/README.md)
and use our internal build logic to build an image from your Python connector code.
Running `airbyte-ci connectors --name source-<source-name> build` will build your connector image.
Once the command is done, you will find your connector image in your local docker host:
`airbyte/source-<source-name>:dev`.
**Option B: Building the docker image with a Dockerfile**
If you don't want to rely on `airbyte-ci` to build your connector, you can build the docker image
using your own Dockerfile. This method is not preferred, and is not supported for certified
connectors.
Create a `Dockerfile` in the root of your connector directory. The `Dockerfile` should look
something like this:
```Dockerfile
FROM airbyte/python-connector-base:1.1.0
COPY . ./airbyte/integration_code
RUN pip install ./airbyte/integration_code
# The entrypoint and default env vars are already set in the base image
# ENV AIRBYTE_ENTRYPOINT "python /airbyte/integration_code/main.py"
# ENTRYPOINT ["python", "/airbyte/integration_code/main.py"]
```
Please use this as an example. This is not optimized.
Build your image:
```bash
docker build . -t airbyte/source-example-python:dev
```
**Run the source docker image**
```bash
docker run --rm airbyte/source-example-python:dev spec
docker run --rm -v $(pwd)/secrets:/secrets airbyte/source-example-python:dev check --config /secrets/config.json
docker run --rm -v $(pwd)/secrets:/secrets airbyte/source-example-python:dev discover --config /secrets/config.json
docker run --rm -v $(pwd)/secrets:/secrets -v $(pwd)/sample_files:/sample_files airbyte/source-example-python:dev read --config /secrets/config.json --catalog /sample_files/configured_catalog.json
```
:::info
Each time you make a change to your implementation you need to re-build the connector image.
This ensures the new python code is added into the docker container.
:::
The nice thing about this approach is that you are running your source exactly as it will be run by
Airbyte. The tradeoff is that iteration is slightly slower, because you need to re-build the
connector image after each change.
**Detailed Debug Messages**
During development of your connector, you can enable the printing of detailed debug information
during a sync by specifying the `--debug` flag. This will allow you to get a better picture of what
is happening during each step of your sync.
```bash
poetry run source-<source-name> read --config secrets/config.json --catalog sample_files/configured_catalog.json --debug
```
In addition to the preset CDK debug statements, you can also emit custom debug information from your
connector by introducing your own debug statements:
```python
self.logger.debug(
    "your debug message here",
    extra={
        "debug_field": self.value,
        "custom_field": your_object.field
    }
)
```
**TDD using acceptance tests & integration tests**
Airbyte provides an acceptance test suite that is run against every source. The objective of these
tests is to provide some "free" tests that can sanity check that the basic functionality of the
source works. One approach to developing your connector is to simply run the tests between each
change and use the feedback from them to guide your development.
If you want to try out this approach, check out Step 8, which describes what you need to do to set up
the standard tests for your source.
The nice thing about this approach is that you are running your source exactly as Airbyte will run
it in the CI. The downside is that the tests do not run very quickly.
### Step 4: Implement `spec`
Each source contains a specification that describes what inputs it needs in order for it to pull
data. This file can be found in `airbyte-integrations/connectors/source-<source-name>/spec.yaml`.
This is a good place to start when developing your source. Using JsonSchema define what the inputs
are \(e.g. username and password\). Here's
[an example](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-stripe/source_stripe/spec.yaml)
of what the `spec.yaml` looks like for the Stripe source.
For more details on what the spec is, you can read about the Airbyte Protocol
[here](../../understanding-airbyte/airbyte-protocol.md).
The generated code that Airbyte provides handles implementing the `spec` method for you. It assumes
that there will be a file called `spec.yaml` in the same directory as `source.py`. If you have
declared the necessary JsonSchema in `spec.yaml`, you should be done with this step.
### Step 5: Implement `check`
As described in the template code, this method takes in a json object called config that has the
values described in the `spec.yaml` filled in. In other words, if the `spec.yaml` said that the
source requires a `username` and `password`, the config object might be
`{ "username": "airbyte", "password": "password123" }`. It returns a json object that reports, given
the credentials in the config, whether we were able to connect to the source. For example, with the
given credentials, could the source connect to the database server?
While developing, we recommend storing this object in `secrets/config.json`. The `secrets` directory
is gitignored by default.
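As an illustration only, a `check` implementation for a hypothetical API-backed source might look like the sketch below. The endpoint `https://api.example.com/v1/me` and the `api_key` config field are made up for the example; substitute whatever cheap request proves the user's credentials actually work.
```python
import requests

from airbyte_cdk.models import AirbyteConnectionStatus, Status
from airbyte_cdk.sources import Source


class SourceExamplePython(Source):
    def check(self, logger, config) -> AirbyteConnectionStatus:
        try:
            # Hypothetical endpoint and config key, purely for illustration.
            response = requests.get(
                "https://api.example.com/v1/me",
                headers={"Authorization": f"Bearer {config['api_key']}"},
                timeout=30,
            )
            response.raise_for_status()
            return AirbyteConnectionStatus(status=Status.SUCCEEDED)
        except Exception as error:
            return AirbyteConnectionStatus(
                status=Status.FAILED, message=f"Unable to connect: {repr(error)}"
            )
```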
### Step 6: Implement `discover`
As described in the template code, this method takes in the same config object as `check`. It then
returns a json object called a `catalog` that describes what data is available and metadata on what
options are available for how to replicate it.
For a brief overview on the catalog check out
[Beginner's Guide to the Airbyte Catalog](../../understanding-airbyte/beginners-guide-to-catalog.md).
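For example, a `discover` implementation for a hypothetical source exposing a single `customers` stream might look like this sketch (the stream name and schema are made up for the example):
```python
from airbyte_cdk.models import AirbyteCatalog, AirbyteStream, SyncMode
from airbyte_cdk.sources import Source


class SourceExamplePython(Source):
    def discover(self, logger, config) -> AirbyteCatalog:
        # Hypothetical stream with a hard-coded JsonSchema; real connectors often
        # build this dynamically or load it from schema files.
        customers = AirbyteStream(
            name="customers",
            json_schema={
                "$schema": "http://json-schema.org/draft-07/schema#",
                "type": "object",
                "properties": {"id": {"type": "string"}, "email": {"type": "string"}},
            },
            supported_sync_modes=[SyncMode.full_refresh],
        )
        return AirbyteCatalog(streams=[customers])
```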
### Step 7: Implement `read`
As described in the template code, this method takes in the same config object as the previous
methods. It also takes in a "configured catalog". This object wraps the catalog emitted by the
`discover` step and includes configuration on how the data should be replicated. For a brief
overview on the configured catalog check out
[Beginner's Guide to the Airbyte Catalog](../../understanding-airbyte/beginners-guide-to-catalog.md).
It then returns a generator that yields each record in the stream.
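Here is a sketch of a `read` implementation, continuing the hypothetical `customers` stream from Step 6 (the hard-coded row stands in for real calls against your data source):
```python
from datetime import datetime, timezone
from typing import Generator

from airbyte_cdk.models import AirbyteMessage, AirbyteRecordMessage, Type
from airbyte_cdk.sources import Source


class SourceExamplePython(Source):
    def read(self, logger, config, catalog, state) -> Generator[AirbyteMessage, None, None]:
        for configured_stream in catalog.streams:
            stream_name = configured_stream.stream.name
            # Hypothetical data; replace with real requests/queries against your source.
            for row in [{"id": "1", "email": "jane@example.com"}]:
                yield AirbyteMessage(
                    type=Type.RECORD,
                    record=AirbyteRecordMessage(
                        stream=stream_name,
                        data=row,
                        emitted_at=int(datetime.now(timezone.utc).timestamp() * 1000),
                    ),
                )
```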
### Step 8: Set up Connector Acceptance Tests (CATs)
The Connector Acceptance Tests are a set of tests that run against all sources. These tests are run
in the Airbyte CI to prevent regressions. They also can help you sanity check that your source works
as expected. The following [article](../testing-connectors/connector-acceptance-tests-reference.md)
explains Connector Acceptance Tests and how to run them.
You can run the tests using [`airbyte-ci`](https://github.com/airbytehq/airbyte/blob/master/airbyte-ci/connectors/pipelines/README.md):
`airbyte-ci connectors --name source-<source-name> test --only-step=acceptance`
:::info
In some rare cases, we make exceptions and allow a source to skip some of the standard tests. If for
some reason you think your source cannot reasonably pass one of the test cases, reach out to us on
GitHub or Slack, and we can determine whether there's a change we can make so that the test will pass
or whether we should skip that test for your source.
:::
### Step 9: Write unit tests and/or integration tests
The connector acceptance tests are meant to cover the basic functionality of a source. Think of them
as the bare minimum required for us to add a source to Airbyte. If you need to test additional
functionality of your source, write unit or integration tests.
#### Unit Tests
Add any relevant unit tests to the `tests/unit_tests` directory. Unit tests should _not_ depend on any secrets.
You can run the tests using `poetry run pytest tests/unit_tests`
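For example, a unit test might exercise `check` with obviously bad credentials and mock out the logger. The module and class names below assume the `example-python` source from this walkthrough; adjust them to your connector.
```python
# tests/unit_tests/test_source.py -- a minimal sketch, not a complete test suite.
from unittest.mock import MagicMock

from airbyte_cdk.models import Status

from source_example_python.source import SourceExamplePython  # hypothetical module/class names


def test_check_fails_with_bad_credentials():
    source = SourceExamplePython()
    outcome = source.check(logger=MagicMock(), config={"api_key": "not-a-real-key"})
    assert outcome.status == Status.FAILED
```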
#### Integration Tests
Place any integration tests in the `integration_tests` directory such that they can be
[discovered by pytest](https://docs.pytest.org/en/6.2.x/goodpractices.html#conventions-for-python-test-discovery).
You can run the tests using `poetry run pytest tests/integration_tests`
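As a sketch, an integration test could load the real credentials from the gitignored `secrets/config.json` and assert that `check` succeeds (again, the module and class names are assumptions from this walkthrough):
```python
# integration_tests/test_check.py -- a minimal sketch of an integration test.
import json
import logging

from airbyte_cdk.models import Status

from source_example_python.source import SourceExamplePython  # hypothetical module/class names


def test_check_succeeds_with_real_credentials():
    with open("secrets/config.json") as config_file:
        config = json.load(config_file)
    outcome = SourceExamplePython().check(logger=logging.getLogger("airbyte"), config=config)
    assert outcome.status == Status.SUCCEEDED
```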
### Step 10: Update the `README.md`
The template fills in most of the information in the README for you. Unless there is a special
case, the only piece of information you need to add is how one can obtain the credentials required to
run the source, e.g. where to find the relevant API key.
### Step 11: Add the connector to the API/UI
There are multiple ways to use the connector you have built.
If you are self-hosting Airbyte (OSS), you can use the Custom Connector feature. This feature
allows you to run any Docker container that implements the Airbyte protocol. You can read more about
it [here](https://docs.airbyte.com/integrations/custom-connectors/).
If you are using Airbyte Cloud (or OSS), you can submit a PR to add your connector to the Airbyte
repository. Once the PR is merged, the connector will be available to all Airbyte Cloud users. You
can read more about it
[here](https://docs.airbyte.com/contributing-to-airbyte/submit-new-connector).
Note that when submitting an Airbyte connector, you will need to ensure that
1. The connector passes the CAT suite. See
[Set up Connector Acceptance Tests](<#step-8-set-up-connector-acceptance-tests-(cats)>).
2. The metadata.yaml file (created by our generator) is filled out and valid. See
[Connector Metadata File](https://docs.airbyte.com/connector-development/connector-metadata-file).
3. You have created appropriate documentation for the connector. See [Add docs](#step-12-add-docs).
### Step 12: Add docs
Each connector has its own documentation page. By convention, that page should have the following
path: `docs/integrations/sources/<source-name>.md`. For the documentation to get packaged with
the docs, make sure to add a link to it in `docs/SUMMARY.md`. You can pattern-match against existing
connectors to see how this is done.
## Related tutorials
For additional examples of how to use the Python CDK to build an Airbyte source connector, see the
following tutorials:
- [Python CDK Speedrun: Creating a Source](https://docs.airbyte.com/connector-development/tutorials/cdk-speedrun)
- [Build a connector to extract data from the Webflow API](https://airbyte.com/tutorials/extract-data-from-the-webflow-api)