# Airbyte Databases Data Catalog
## Config Database
- `workspace`
  - Each record represents a logical workspace for an Airbyte user. In the open-source version of the product, only one workspace is allowed.
- `actor_definition`
  - Each record represents a connector that Airbyte supports, e.g. Postgres. This table represents all the connectors that are supported by the currently running platform.
  - The `actor_type` column tells us whether the record represents a Source or a Destination.
  - The `spec` column is a JSON blob. The schema of this JSON blob matches the spec model in the Airbyte Protocol. Because the protocol object is JSON, this has to be a JSON blob.
  - The `support_level` column describes the support level of the connector (e.g. `community`, `certified`, or `archived`).
    - In the product UI, the Marketplace tab contains connectors with the `community` support level, and the Airbyte Connectors tab contains `certified` connectors. `support_level: archived` signals that the connector is no longer supported and is not available to install for any new connections.
  - The `docker_repository` field is the name of the docker image associated with the connector definition. `docker_image_tag` is the tag of the docker image and the version of the connector definition.
  - The `source_type` field is only used for Sources, and represents the category of the connector definition (e.g. API, Database).
  - The `resource_requirements` field sets a default resource requirement for any connector of this type. This overrides the default we set for all connector definitions, and it can be overridden by a connection-specific resource requirement. The column is a JSON blob with the schema defined in ActorDefinitionResourceRequirements.yaml.
  - The `public` boolean column describes whether a connector is available to all workspaces. Non-`public` connector definitions can be provisioned to a workspace using the `actor_definition_workspace_grant` table. `custom` means that the connector is written by a user of the platform (and not packaged into the Airbyte product).
  - Each record contains additional metadata and display data about a connector (e.g. `name` and `icon`), and we should add additional metadata here over time.
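The layered override described for `resource_requirements` (a platform-wide default, overridden by the definition-level default, overridden by a connection-specific requirement) can be sketched as follows. This is an illustration only: the field names are made up and do not match the exact schema in ActorDefinitionResourceRequirements.yaml.

```python
# Sketch of resource-requirement precedence: a connection-specific value beats
# the definition-level default, which beats the platform-wide default.
# Field names are illustrative, not the actual YAML schema.
def resolve_resource_requirements(platform_default, definition_reqs=None, connection_reqs=None):
    resolved = dict(platform_default)
    for override in (definition_reqs, connection_reqs):  # later layers win
        if override:
            resolved.update({k: v for k, v in override.items() if v is not None})
    return resolved

platform_default = {"cpu_request": "0.5", "memory_request": "1Gi"}
definition_reqs = {"memory_request": "2Gi"}  # from actor_definition.resource_requirements
connection_reqs = {"cpu_request": "1"}       # from connection.resource_requirements
print(resolve_resource_requirements(platform_default, definition_reqs, connection_reqs))
# {'cpu_request': '1', 'memory_request': '2Gi'}
```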
- `actor_definition_workspace_grant`
  - Each record represents provisioning a non-`public` connector definition to a workspace.
  - todo (cgardens) - should this table have a `created_at` column?
- `actor`
  - Each record represents a configured connector, e.g. a Postgres connector configured to pull data from my database.
  - The `actor_type` column tells us whether the record represents a Source or a Destination.
  - The `actor_definition_id` column is a foreign key to the connector definition that this record is implementing.
  - The `configuration` column is a JSON blob. The schema of this JSON blob matches the schema specified in the `connectionSpecification` field of the JSON blob in the `spec` column. Keep in mind this schema is specific to each connector (e.g. the schemas of Postgres and Salesforce are different), which is why this column has to be a JSON blob.
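A minimal sketch of the relationship between an actor's `configuration` and its definition's `spec`: the spec's `connectionSpecification` is a JSON schema, and the configuration blob must conform to it. For illustration we check only the schema's `required` keys; the spec content below is hypothetical, and real validation uses a full JSON-schema validator.

```python
import json

# Hypothetical fragment of an actor_definition's `spec` blob. Its
# connectionSpecification declares the JSON schema that every actor's
# `configuration` blob for this connector must satisfy.
postgres_spec = {
    "connectionSpecification": {
        "type": "object",
        "required": ["host", "port", "database", "username"],
    }
}

# A hypothetical actor's `configuration` JSON blob for that definition.
actor_configuration = json.loads(
    '{"host": "db.example.com", "port": 5432, "database": "mydb", "username": "airbyte"}'
)

# Illustrative check: every required key must be present in the configuration.
missing = [
    key
    for key in postgres_spec["connectionSpecification"]["required"]
    if key not in actor_configuration
]
print("valid" if not missing else f"missing keys: {missing}")  # valid
```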
- `actor_catalog`
  - Each record contains a catalog for an actor. The records in this table are meant to be immutable.
  - The `catalog` column is a JSON blob. The schema of this JSON blob matches the catalog model in the Airbyte Protocol. Because the protocol object is JSON, this has to be a JSON blob. The `catalog_hash` column is a 32-bit murmur3 hash (x86 variant) of the `catalog` field to make comparisons easier.
  - todo (cgardens) - should we remove the `modified_at` column? These records should be immutable.
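To make the `catalog_hash` (and the `config_hash` in `actor_catalog_fetch_event` below) concrete, here is a pure-Python sketch of the 32-bit murmur3 x86 hash applied to a serialized catalog blob. The platform itself computes this in Java, and the exact byte serialization of the catalog JSON is an implementation detail, so treat the serialization choice here as an assumption.

```python
import json

def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """Pure-Python 32-bit murmur3 hash (x86 variant)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed
    n = len(data) - (len(data) % 4)
    for i in range(0, n, 4):  # body: 4-byte little-endian blocks
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF  # rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF  # rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    k = 0
    tail = data[n:]  # remaining 0-3 bytes
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if tail:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    h ^= len(data)  # finalization mix
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

# Hypothetical catalog blob; canonical serialization (sorted keys, no
# whitespace) so that equal catalogs always produce equal hashes.
catalog = {"streams": [{"name": "users", "json_schema": {"type": "object"}}]}
blob = json.dumps(catalog, sort_keys=True, separators=(",", ":")).encode()
print(f"catalog_hash: {murmur3_x86_32(blob):08x}")
```

The point of the hash is cheap equality comparison: two fetch events that produce byte-identical catalogs hash to the same 32-bit value, so the platform can dedupe without comparing large JSON blobs.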
- `actor_catalog_fetch_event`
  - Each record represents an attempt to fetch the catalog for an actor. The records in this table are meant to be immutable.
  - The `actor_id` column represents the actor that the catalog is being fetched for. The `config_hash` represents a hash (32-bit murmur3 hash - x86 variant) of the `configuration` column of that actor at the time the fetch attempt occurred.
  - The `catalog_id` is a foreign key to the `actor_catalog` table. It represents the catalog fetched by this attempt. We use a foreign key because catalogs are often large, and multiple fetch events often retrieve the same catalog. Understanding how often the same catalog is fetched is also interesting from a product analytics point of view.
  - The `actor_version` column represents the `actor_definition` version that was in use when the fetch event happened. This column is needed because, while we can infer the `actor_definition` from the foreign key relationship with the `actor` table, we cannot do the same for the version, as that can change over time.
  - todo (cgardens) - should we remove the `modified_at` column? These records should be immutable.
- `connection`
  - Each record in this table configures a connection (`source_id`, `destination_id`, and relevant configuration).
  - The `resource_requirements` field sets a default resource requirement for the connection. This overrides the default we set for all connector definitions and the defaults set for the individual connector definitions. The column is a JSON blob with the schema defined in ResourceRequirements.yaml.
  - The `source_catalog_id` column is a foreign key that refers to the `id` column in the `actor_catalog` table and represents the catalog that was used to configure the connection. This should not be confused with the `catalog` column, which contains the ConfiguredCatalog for the connection.
  - The `schedule_type` column defines what type of schedule is being used. If the type is manual, then `schedule_data` will be null. Otherwise, the `schedule_data` column is a JSON blob with the schema of StandardSync#scheduleData that defines the actual schedule. The columns `manual` and `schedule` are deprecated and should be ignored (they will be dropped soon).
  - The `namespace_type` column configures whether the namespace for the connection should be the one defined by the source, the one defined by the destination, or a user-defined format (`custom`). If `custom`, the `namespace_format` column defines the string that will be used as the namespace.
  - The `status` column describes the activity level of the connection: `active` - the current schedule is respected; `inactive` - the current schedule is ignored (the connection does not run), but it could be switched back to active; `deprecated` - the connection is permanently off (it cannot be moved back to active or inactive).
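A sketch of how the `namespace_type` / `namespace_format` pair could be resolved. The `${SOURCE_NAMESPACE}` placeholder follows Airbyte's documented convention for custom formats, but the function itself and the enum spellings are illustrative, not the platform's actual code.

```python
# Illustrative resolution of a connection's namespace from namespace_type /
# namespace_format. Enum spellings and the helper itself are assumptions;
# the ${SOURCE_NAMESPACE} placeholder follows Airbyte's documented convention.
def resolve_namespace(namespace_type, source_namespace, destination_default, namespace_format=None):
    if namespace_type == "source":
        return source_namespace
    if namespace_type == "destination":
        return destination_default
    if namespace_type == "custom":
        return namespace_format.replace("${SOURCE_NAMESPACE}", source_namespace or "")
    raise ValueError(f"unknown namespace_type: {namespace_type}")

print(resolve_namespace("source", "public", "airbyte_raw"))  # public
print(resolve_namespace("custom", "public", "airbyte_raw", "${SOURCE_NAMESPACE}_mirror"))  # public_mirror
```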
- `state`
  - The `state` table represents the current (last) state for a connection. For a connection with `stream` state, there will be a record per stream. For a connection with `global` state, there will be a record per stream and an additional record to store the shared (global) state. For a connection with `legacy` state, there will be one record per connection.
  - In the `stream` and `global` state cases, the `stream_name` and `namespace` columns contain the name of the stream whose state is represented by that record. For the shared state in `global`, `stream_name` and `namespace` will be null.
  - The `state` column contains the state JSON blob. Depending on the type of the connection, the schema of the blob will be different.
    - `stream` - for this type, the column is a JSON blob that is a black box to the platform and known only to the connector that generated it.
    - `global` - for this type, the column is a JSON blob that is a black box to the platform and known only to the connector that generated it. This is true both for the per-stream states and for the shared state.
    - `legacy` - for this type, the column is a JSON blob with a top-level key called `state`. The contents of that `state` key are a black box to the platform and known only to the connector that generated it.
  - The `type` column describes the type of the state of the row. The type can be `STREAM`, `GLOBAL`, or `LEGACY`.
  - The `connection_id` column is a foreign key to the connection for which we are tracking state.
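The record-count rules above (one row per stream for `stream` state, one per stream plus one shared row for `global`, one row per connection for `legacy`) can be sketched as:

```python
# Sketch of how many rows the state table holds for one connection, per state
# type. Stream names, namespaces, and ids are illustrative.
def state_records(state_type, streams, connection_id="c1"):
    if state_type == "legacy":
        return [{"connection_id": connection_id, "type": "LEGACY",
                 "stream_name": None, "namespace": None}]
    records = [{"connection_id": connection_id, "type": state_type.upper(),
                "stream_name": name, "namespace": "public"} for name in streams]
    if state_type == "global":
        # One extra row for the shared (global) state; stream_name/namespace are null.
        records.append({"connection_id": connection_id, "type": "GLOBAL",
                        "stream_name": None, "namespace": None})
    return records

streams = ["users", "orders"]
print(len(state_records("stream", streams)))  # 2
print(len(state_records("global", streams)))  # 3
print(len(state_records("legacy", streams)))  # 1
```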
- `stream_reset`
  - Each record in this table represents a stream in a connection that is enqueued to be reset or is currently being reset. It can be thought of as a queue. Once the stream is reset, the record is removed from the table.
- `operation`
  - The `operation` table configures transformations for a connection beyond the raw output produced by the destination. The two options are `normalization`, which outputs Airbyte's basic normalization, and `dbt`, which allows a user to configure their own custom dbt transformation. A connection can have multiple operations (e.g. it can do `normalization` and `dbt`).
  - If the `operation` is `dbt`, then the `operator_dbt` column will be populated with a JSON blob with the schema from OperatorDbt.
  - If the `operation` is `normalization`, then the `operator_normalization` column will be populated with a JSON blob with the schema from OperatorNormalization.
  - Operations are scoped by workspace, using the `workspace_id` column.
- `connection_operation`
  - This table joins the `operation` table to the `connection` for which it is configured.
- `workspace_service_account`
  - This table is a WIP for an unfinished feature.
- `actor_oauth_parameter`
  - The name of this table is misleading. It refers to parameters to be used for any instance of an `actor_definition` (not an `actor`) within a given workspace. For OAuth, the model is that a user is provisioning access to their data to a third-party tool (in this case the Airbyte Platform). Each record represents information (e.g. client id, client secret) for the third party that is getting access.
  - These parameters can be scoped by workspace. If `workspace_id` is not present, then the scope of the parameters is the whole deployment of the platform (e.g. all workspaces).
  - The `actor_type` column tells us whether the record represents a Source or a Destination.
  - The `configuration` column is a JSON blob. The schema of this JSON blob matches the schema specified in the `advanced_auth` field of the JSON blob in the `spec` column. Keep in mind this schema is specific to each connector (e.g. the schemas of Hubspot and Salesforce are different), which is why this column has to be a JSON blob.
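The workspace scoping described above implies a simple resolution rule: prefer a row scoped to the workspace, and fall back to a deployment-wide row (one with no `workspace_id`). A sketch, with hypothetical rows:

```python
# Sketch of oauth-parameter resolution: a workspace-scoped row wins over an
# instance-wide row (workspace_id is None) for the same connector definition.
# The rows and configuration payloads are illustrative.
def resolve_oauth_params(rows, actor_definition_id, workspace_id):
    scoped = [r for r in rows if r["actor_definition_id"] == actor_definition_id]
    for row in scoped:
        if row["workspace_id"] == workspace_id:
            return row["configuration"]
    for row in scoped:
        if row["workspace_id"] is None:  # deployment-wide default
            return row["configuration"]
    return None

rows = [
    {"actor_definition_id": "hubspot", "workspace_id": None, "configuration": {"client_id": "global-id"}},
    {"actor_definition_id": "hubspot", "workspace_id": "ws-1", "configuration": {"client_id": "ws1-id"}},
]
print(resolve_oauth_params(rows, "hubspot", "ws-1"))  # {'client_id': 'ws1-id'}
print(resolve_oauth_params(rows, "hubspot", "ws-2"))  # {'client_id': 'global-id'}
```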
- `secrets`
  - This table is used to store secrets in open-source versions of the platform that have not set some other secrets store. It allows us to use the same code path for secrets handling regardless of whether an external secrets store is configured. This table is used by default for the open-source product.
- `airbyte_configs_migrations`
  - This is a metadata table used by Flyway (our database migration tool). It is not used for any application use cases.
- `airbyte_configs`
  - Legacy table for config storage. Should be dropped.
## Jobs Database
- `jobs`
  - Each record in this table represents a job.
  - The `config_type` column captures the type of job. We only make jobs for `sync` and `reset` (we do not use them for `spec`, `check`, or `discover`).
  - A job represents an attempt to use a connector (or a pair of connectors). The goal of this model is to capture the input of that run. A job can have multiple attempts (see the `attempts` table). The guarantee across all attempts is that the input into each attempt will be the same.
  - That input is captured in the `config` column. This column is a JSON blob with the schema of a JobConfig. Only `sync` and `resetConnection` are ever used in that model.
    - The other top-level fields are vestigial from when `spec`, `check`, and `discover` were used in this model (we will eventually remove them).
  - The `scope` column contains the `connection_id` for the relevant connection of the job.
    - Context: It is called `scope` and not `connection_id` because this table was originally used for `spec`, `check`, and `discover`, and in those cases the `scope` referred to the relevant actor or actor definition. At this point the scope is always a `connection_id`.
  - The `status` column contains the job status. The lifecycle of a job is explained in detail in the Jobs & Workers documentation.
- `attempts`
  - Each record in this table represents an attempt.
  - Each attempt belongs to a job; this is captured by the `job_id` column. All attempts for a job will run on the same input.
  - The `id` column is a unique id across all attempts, while `attempt_number` is an ascending number of the attempts for a job.
  - The output of each attempt, however, can be different. The `output` column is a JSON blob with the schema of a JobOutput. Only `sync` is used in that model. Reset jobs also use the `sync` field, because under the hood `reset` jobs end up just doing a `sync` with special inputs. This object contains all the output info for a sync, including stats on how much data was moved.
    - The other top-level fields are vestigial from when `spec`, `check`, and `discover` were used in this model (we will eventually remove them).
  - The `status` column contains the attempt status. The lifecycle of a job / attempt is explained in detail in the Jobs & Workers documentation.
  - If the attempt fails, the `failure_summary` column will be populated. The column is a JSON blob with the schema of AttemptFailureReason.
  - The `log_path` column captures where logs for the attempt will be written. `created_at`, `started_at`, and `ended_at` track the run time.
  - The `temporal_workflow_id` column keeps track of which temporal execution is associated with the attempt.
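The `id` / `attempt_number` distinction above can be sketched as: `id` is unique across the whole table, while `attempt_number` ascends within each job. The counter scheme and 0-based numbering here are illustrative assumptions.

```python
import itertools

# Sketch: `id` is unique across all attempts, while `attempt_number` ascends
# within each job. Job ids and the 0-based numbering are illustrative.
next_id = itertools.count(1)
attempts = []

def new_attempt(job_id):
    number = sum(1 for a in attempts if a["job_id"] == job_id)  # per-job counter
    attempt = {"id": next(next_id), "job_id": job_id, "attempt_number": number}
    attempts.append(attempt)
    return attempt

new_attempt("job-A"); new_attempt("job-A"); new_attempt("job-B")
print([(a["id"], a["job_id"], a["attempt_number"]) for a in attempts])
# [(1, 'job-A', 0), (2, 'job-A', 1), (3, 'job-B', 0)]
```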
- `airbyte_metadata`
  - This table is a key-value store for various metadata about the platform. It is used to track information about what version the platform is currently on, as well as to track the upgrade history.
  - Logically it does not make a lot of sense for it to be in the jobs db. It would make more sense in its own db or in the config db.
  - The only two columns are `key` and `value`. It is truly just a key-value store.
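Since the table is just `key`/`value`, the access pattern can be sketched with an in-memory SQLite table. The key name used below is hypothetical, not one of the platform's actual metadata keys.

```python
import sqlite3

# Sketch of the two-column key-value pattern, using in-memory SQLite.
# The "airbyte_version" key is illustrative, not the platform's actual key.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE airbyte_metadata (key TEXT PRIMARY KEY, value TEXT)")

def set_metadata(key, value):
    db.execute(
        "INSERT INTO airbyte_metadata (key, value) VALUES (?, ?) "
        "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
        (key, value),
    )

def get_metadata(key):
    row = db.execute("SELECT value FROM airbyte_metadata WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None

set_metadata("airbyte_version", "0.50.0")
set_metadata("airbyte_version", "0.50.1")  # upsert overwrites the old value
print(get_metadata("airbyte_version"))  # 0.50.1
```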
- `airbyte_jobs_migrations`
  - This is a metadata table used by Flyway (our database migration tool). It is not used for any application use cases.