---
products: oss-enterprise
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Scaling Airbyte After Installation
Once you've completed the initial installation of Airbyte Self-Managed Enterprise, the next step is scaling your setup to maintain performance and reliability as your data integration needs grow. This guide walks you through best practices and strategies for scaling Airbyte in an enterprise environment.
## Concurrent Syncs
The primary driver of increased resource usage in Airbyte is the number of concurrent syncs running at any given time. Each concurrent sync requires at least 3 additional connector pods to be running at once (`orchestrator`, `read`, `write`). For example, 10 concurrent syncs require 30 additional pods in your namespace. Connector pods last only for the duration of a sync, and their names are suffixed with the ID of the ongoing job.
If your deployment of Airbyte is intended to run many concurrent syncs at once (e.g. an overnight backfill), you are likely to need more nodes to run all syncs.
### Connector CPU & Memory Settings
Some connectors are memory and CPU intensive, while others are not. We recommend using an infrastructure monitoring tool to measure the following at all times:
* Requested CPU %
* CPU Usage %
* Requested Memory %
* Memory Usage %
If your nodes are under high CPU or memory usage, we recommend scaling your Airbyte deployment to a larger number of nodes, or reducing the maximum resource usage of any given connector pod. If high _requested_ CPU or memory usage is blocking new pods from being scheduled, while _used_ CPU or memory is low, you may modify connector pod provisioning defaults in your `values.yaml` file:
```yaml title="values.yaml"
global:
  edition: "enterprise"
  # ...
  jobs:
    resources:
      limits:
        cpu: ## e.g. 250m
        memory: ## e.g. 500Mi
      requests:
        cpu: ## e.g. 75m
        memory: ## e.g. 150Mi
```
If your Airbyte deployment is under-provisioned, you may notice occasional 'stuck jobs' that remain in-progress for long periods, with eventual failures related to unavailable pods. Increasing job CPU and memory limits may also allow for increased sync speeds. For help and best practices, see [Configuring connector resources](../operator-guides/configuring-connector-resources).
### Concurrent Sync Limits
To help rightsize Airbyte deployments and reduce the likelihood of stuck syncs, there are configurable limits to the number of syncs that can be run at once:
<Tabs groupId="helm-chart-version">
<TabItem value='helm-1' label='Helm chart V1' default>
```yaml title="values.yaml"
worker:
  extraEnv: ## We recommend setting both environment variables to the same value.
    - name: MAX_SYNC_WORKERS
      value: ## e.g. 5
    - name: MAX_CHECK_WORKERS
      value: ## e.g. 5
```
</TabItem>
<TabItem value='helm-2' label='Helm chart V2'>
```yaml title="values.yaml"
worker:
  maxSyncWorkers: ## e.g. 5
  maxCheckWorkers: ## e.g. 5
```
</TabItem>
</Tabs>
If you intend to run many syncs at the same time, you may also want to increase the number of worker replicas that run in your Airbyte instance:
<Tabs groupId="helm-chart-version">
<TabItem value='helm-1' label='Helm chart V1' default>
```yaml title="values.yaml"
worker:
  replicaCount: ## e.g. 2
```
</TabItem>
<TabItem value='helm-2' label='Helm chart V2'>
```yaml title="values.yaml"
worker:
  replicaCount: ## e.g. 2
```
</TabItem>
</Tabs>
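For example, combining the two settings (sketched here in Helm chart V1 style, and assuming the sync limit applies per worker replica rather than across the whole deployment), two replicas with `MAX_SYNC_WORKERS` set to 5 would allow up to 10 concurrent syncs:
```yaml title="values.yaml"
worker:
  replicaCount: 2
  extraEnv:
    - name: MAX_SYNC_WORKERS
      value: "5" ## assuming a per-replica limit: 2 replicas x 5 = up to 10 concurrent syncs
    - name: MAX_CHECK_WORKERS
      value: "5"
```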
## Multiple Node Groups
To reduce the blast radius of an underprovisioned Airbyte deployment, place 'static' workloads (`server`, etc.) on one Kubernetes node group, while placing job-related workloads (connector pods) on a different Kubernetes node group. This ensures that UI or API availability is unlikely to be impacted by the number of concurrent syncs.
<details>
<summary>Configure Airbyte Self-Managed Enterprise to run in two node groups</summary>
<Tabs groupId="helm-chart-version">
<TabItem value='helm-1' label='Helm chart V1' default>
```yaml title="values.yaml"
airbyte-bootloader:
  nodeSelector:
    type: static
server:
  nodeSelector:
    type: static
keycloak:
  nodeSelector:
    type: static
keycloak-setup:
  nodeSelector:
    type: static
temporal:
  nodeSelector:
    type: static
worker:
  nodeSelector:
    type: jobs
workload-launcher:
  nodeSelector:
    type: static
  ## Pods spun up by the workload launcher will run in the 'jobs' node group.
  extraEnv:
    - name: JOB_KUBE_NODE_SELECTORS
      value: type=jobs
    - name: SPEC_JOB_KUBE_NODE_SELECTORS
      value: type=jobs
    - name: CHECK_JOB_KUBE_NODE_SELECTORS
      value: type=jobs
    - name: DISCOVER_JOB_KUBE_NODE_SELECTORS
      value: type=jobs
orchestrator:
  nodeSelector:
    type: jobs
workload-api-server:
  nodeSelector:
    type: jobs
```
</TabItem>
<TabItem value='helm-2' label='Helm chart V2'>
```yaml title="values.yaml"
global:
  jobs:
    kube:
      nodeSelector:
        type: jobs
      scheduling:
        check:
          nodeSelectors:
            type: jobs
        discover:
          nodeSelectors:
            type: jobs
        spec:
          nodeSelectors:
            type: jobs
airbyteBootloader:
  nodeSelector:
    type: static
server:
  nodeSelector:
    type: static
keycloak:
  nodeSelector:
    type: static
keycloakSetup:
  nodeSelector:
    type: static
temporal:
  nodeSelector:
    type: static
workloadLauncher:
  nodeSelector:
    type: static
worker:
  nodeSelector:
    type: jobs
workloadApiServer:
  nodeSelector:
    type: jobs
```
</TabItem>
</Tabs>
</details>
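If you also taint the `jobs` node group so that nothing else schedules there, connector pods need a matching toleration, which you can pass through the `JOB_KUBE_TOLERATIONS` environment variable. The sketch below assumes a Helm chart V1 layout and a hypothetical `type=jobs:NoSchedule` taint; verify the toleration string format against your Airbyte version:
```yaml title="values.yaml"
workload-launcher:
  extraEnv:
    ## Assumes the jobs node group is tainted with `type=jobs:NoSchedule`.
    ## Multiple tolerations are separated by ';'; each is a comma-separated list of k=v pairs.
    - name: JOB_KUBE_TOLERATIONS
      value: key=type,operator=Equal,value=jobs,effect=NoSchedule
```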
## High Availability
You may wish to implement high availability (HA) to minimize downtime and ensure continuous data integration. Note that this requires provisioning Airbyte on a larger number of nodes, which may increase your licensing fees. For a typical HA deployment, you will want a VPC with subnets in at least two (and preferably three) availability zones (AZs).
We particularly recommend having multiple instances of `worker` and `server` pods:
<Tabs groupId="helm-chart-version">
<TabItem value='helm-1' label='Helm chart V1' default>
```yaml title="values.yaml"
worker:
  replicaCount: 2
server:
  replicaCount: 2
```
</TabItem>
<TabItem value='helm-2' label='Helm chart V2'>
```yaml title="values.yaml"
worker:
  replicaCount: 2
server:
  replicaCount: 2
```
</TabItem>
</Tabs>
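Replicas only improve availability if they don't share a failure domain. Assuming your chart version passes a standard `affinity` value through to the `server` Deployment, and that server pods carry an `app.kubernetes.io/name: server` label (check with `kubectl get pods --show-labels`), a sketch like this asks the scheduler to spread replicas across AZs:
```yaml title="values.yaml"
server:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            topologyKey: topology.kubernetes.io/zone ## prefer a different AZ for each replica
            labelSelector:
              matchLabels:
                app.kubernetes.io/name: server ## assumed pod label; verify against your deployment
```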
Furthermore, you may want to implement a primary-replica setup for the database (e.g., PostgreSQL) used by Airbyte. The primary database handles write operations, while replicas handle read operations, ensuring data availability even if the primary fails.
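If you use an external database, point Airbyte at the primary (writer) endpoint so that failover is handled at the database layer rather than in Airbyte. A minimal sketch, assuming the standard external-database values; the secret name, host, and credentials below are illustrative placeholders:
```yaml title="values.yaml"
global:
  database:
    secretName: "airbyte-config-secrets" ## illustrative secret containing the password
    host: "airbyte-db.example.com" ## writer endpoint, not a read replica
    port: "5432"
    database: "airbyte"
    user: "airbyte"
    passwordSecretKey: "database-password"
```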
## Disaster Recovery (DR) Regions
For business-critical applications of Airbyte, you may want to configure a Disaster Recovery (DR) cluster. We do not support assisting customers with DR deployments at this time. However, we offer a few high-level suggestions:
1. Airbyte strongly recommends configuring an external database, external log storage, and external connector secret management (see the storage sketch below).
2. Airbyte strongly recommends that your DR cluster is also an instance of Self-Managed Enterprise, kept at the same version as your production instance.
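As a concrete starting point for the first suggestion, here is a minimal external log storage sketch using S3. The bucket, secret, and region values are illustrative, and exact key names vary by chart version (e.g. `storageSecretName` in some releases), so check the state and logging storage documentation for your release:
```yaml title="values.yaml"
global:
  storage:
    type: "S3"
    secretName: "airbyte-config-secrets" ## illustrative; key name may differ by chart version
    bucket:
      log: "airbyte-bucket"
      state: "airbyte-bucket"
      workloadOutput: "airbyte-bucket"
    s3:
      region: "us-east-1"
      authenticationType: credentials
```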
## DEBUG Logs
We recommend turning off `DEBUG` logs for any non-testing use of Self-Managed Airbyte. Failing to do so while running at-scale syncs may result in the `server` pod being overloaded, preventing much of the deployment from operating as normal.
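Log verbosity is controlled by the `LOG_LEVEL` environment variable. A sketch that pins the `server` pod back to `INFO`, in Helm chart V1 style (the same variable can be set on other components):
```yaml title="values.yaml"
server:
  extraEnv:
    - name: LOG_LEVEL
      value: "INFO" ## any level other than DEBUG; keep DEBUG for testing only
```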
## Schema Discovery Timeouts
When configuring a database source connector with hundreds to thousands of tables, each with many columns, the one-time `discover` operation, by which Airbyte maps the topology of your source, may run for a long time and exceed Airbyte's timeout duration. If this happens, you can increase Airbyte's timeout limits as follows:
<Tabs groupId="helm-chart-version">
<TabItem value='helm-1' label='Helm chart V1' default>
```yaml title="values.yaml"
server:
  extraEnv:
    - name: HTTP_IDLE_TIMEOUT
      value: 20m
    - name: READ_TIMEOUT
      value: 30m
```
</TabItem>
<TabItem value='helm-2' label='Helm chart V2'>
```yaml title="values.yaml"
server:
  httpIdleTimeout: 20m
  extraEnv:
    - name: READ_TIMEOUT
      value: 30m
```
</TabItem>
</Tabs>