On Kubernetes (Beta)

Airbyte allows scaling sync workloads horizontally using Kubernetes. The core components (API server, scheduler, etc.) run as deployments, while the scheduler launches connector-related pods on different nodes.

If you don't want to configure your own K8s cluster and Airbyte instance, you can use the free, open-source project Plural to bring up a K8s cluster and Airbyte for you. Use this guide to get started.

For local testing, we recommend following one of these setup guides:
For testing on EKS, you can install eksctl and run eksctl create cluster to create an EKS cluster, VPC, subnets, and related resources. This process should take 10-15 minutes.
For production, Airbyte should function on most Kubernetes clusters running v1.19 or later. We have tested support on GKE and EKS. If you run into a problem starting Airbyte, please reach out on the #troubleshooting channel on our Slack or create an issue on GitHub.

If you do not already have the CLI tool kubectl installed, please follow these instructions to install.

Configure kubectl to connect to your cluster by running kubectl config use-context my-cluster-name.
  • For GKE
    • Configure gcloud with gcloud auth login.
    • On the Google Cloud Console, the cluster page will have a Connect button, which will give a command to run locally that looks like
      gcloud container clusters get-credentials CLUSTER_NAME --zone ZONE_NAME --project PROJECT_NAME.
    • Use kubectl config get-contexts to show the contexts available.
    • Run kubectl config use-context <gke context> to access the cluster from kubectl.
  • For EKS
    • Configure your AWS CLI to connect to your project.
    • Install eksctl
    • Run eksctl utils write-kubeconfig --cluster=<CLUSTER NAME> to make the context available to kubectl
    • Use kubectl config get-contexts to show the contexts available.
    • Run kubectl config use-context <eks context> to access the cluster with kubectl.
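Before applying any manifests, it can help to confirm that kubectl is pointing at the cluster you intend. The following is a minimal sketch, not part of Airbyte: require_context is a hypothetical helper name, and my-cluster-name is a placeholder for your own context.

```shell
# Succeed only if the active kubectl context matches the expected name.
# require_context is a hypothetical helper; it is not shipped with Airbyte.
require_context() {
  rc_expected="$1"
  rc_current="$(kubectl config current-context)" || return 1
  if [ "$rc_current" != "$rc_expected" ]; then
    echo "kubectl is using '$rc_current', expected '$rc_expected'" >&2
    return 1
  fi
  echo "kubectl is using context '$rc_current'"
}

# Usage (my-cluster-name is a placeholder):
#   require_context my-cluster-name && kubectl get nodes
```

Chaining the check with && keeps later commands from running against the wrong cluster.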

Both dev and stable versions of Airbyte include a stand-alone Minio deployment. Airbyte publishes logs to this Minio deployment by default. This means Airbyte comes as a self-contained Kubernetes deployment - no other configuration is required.
So if you just want logs to be sent to the local Minio deployment, you do not need to change the values of any environment variables from what is currently on master.

Alternatively, if you want logs to be sent to a custom location, Airbyte currently supports logging to Minio, S3 or GCS. The following instructions are for users wishing to log to their own Minio layer, S3 bucket or GCS bucket.
The provided credentials require both read and write permissions. The logger attempts to create the log bucket if it does not exist.
Configuring Custom Minio Log Location
To write to a custom minio log location, replace the following variables in the .env file in the kube/overlays/stable directory:
The S3_PATH_STYLE_ACCESS variable should remain true. The S3_LOG_BUCKET_REGION variable should remain empty.
Configuring Custom S3 Log Location
To write to a custom S3 log location, replace the following variables in the .env file in the kube/overlays/stable directory:
# Set this to empty.
# Set this to empty.
See here for instructions on creating an S3 bucket and here for instructions on creating AWS credentials.
Configuring Custom GCS Log Location
Create the GCP service account with read/write permission to the GCS log bucket.
1) Base64-encode the GCP JSON secret.
# The output of this command will be a Base64 string.
$ cat gcp.json | base64
2) Populate the gcs-log-creds secret with the Base64-encoded credential. This is as simple as taking the encoded credential from the previous step and adding it to the secret-gcs-log-creds.yaml file.
apiVersion: v1
kind: Secret
metadata:
  name: gcs-log-creds
  namespace: default
data:
  gcp.json: <base64-encoded-string>
3) Replace the following variables in the .env file in the kube/overlays/stable directory:
4) Modify the .secrets file in the kube/overlays/stable directory
# The path the GCS creds are written to. Unless you know what you are doing, use the below default value.
See here for instructions on creating a GCS bucket and here for instructions on creating GCP credentials.
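One detail worth checking in step 1: some base64 implementations (GNU coreutils, for example) wrap their output at 76 columns, and a wrapped value is awkward to paste into a single-line YAML field. The sketch below strips the line breaks. It assumes the key file is named gcp.json as above; encode_for_secret is a hypothetical helper name.

```shell
# Encode a file as one unbroken Base64 line for use in a Secret's data field.
# tr removes any line wrapping that the local base64 implementation adds.
encode_for_secret() {
  base64 < "$1" | tr -d '\n'
}

# Usage (gcp.json is the service account key from step 1):
#   encode_for_secret gcp.json
```

The output can then be pasted as the gcp.json value in the secret file from step 2.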

Run the following commands to launch Airbyte:
git clone https://github.com/airbytehq/airbyte.git
cd airbyte
kubectl apply -k kube/overlays/stable
After 2-5 minutes, kubectl get pods | grep airbyte should show Running as the status for all the core Airbyte pods. This may take longer on Kubernetes clusters with slow internet connections.
Run kubectl port-forward svc/airbyte-webapp-svc 8000:80 to allow access to the UI/API.
Now visit http://localhost:8000 in your browser and start moving some data!
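Rather than polling kubectl get pods by hand, the wait for the core pods can be scripted. This is a sketch under assumptions: wait_for_airbyte is a hypothetical helper, the deployment names are only the ones mentioned elsewhere on this page (your overlay may define more), and the 300 second timeout is an arbitrary value based on the 2-5 minute estimate above.

```shell
# Block until each listed deployment finishes rolling out, or fail on timeout.
# The deployment list is an assumption; adjust it to match your overlay.
wait_for_airbyte() {
  for wfa_deploy in airbyte-server airbyte-scheduler airbyte-worker; do
    kubectl rollout status "deployment/$wfa_deploy" --timeout=300s || return 1
  done
}

# Usage:
#   wait_for_airbyte && kubectl port-forward svc/airbyte-webapp-svc 8000:80
```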

  • Core container pods
    • Instead of launching Airbyte with kubectl apply -k kube/overlays/stable, you can run with kubectl apply -k kube/overlays/stable-with-resource-limits.
    • The kube/overlays/stable-with-resource-limits/set-resource-limits.yaml file can be modified to provide different resource requirements for core pods.
  • Connector pods
    • By default, connector pods launch without resource limits.
    • To add resource limits, configure the "Docker Resource Limits" section of the .env file in the overlay folder you're using.
  • Volume sizes
    • You can modify kube/resources/volume-* files to specify different volume sizes for the persistent volumes backing Airbyte.

The number of simultaneous jobs (getting specs, checking connections, discovering schemas, and performing syncs) is limited by a few factors. First, SUBMITTER_NUM_THREADS (set in the .env file for your Kustomization overlay) provides a global limit on the number of simultaneous jobs that can run across all worker pods.
The number of worker pods can be changed by increasing the number of replicas for the airbyte-worker deployment. An example of a Kustomization patch that increases this number can be seen in airbyte/kube/overlays/dev-integration-test/kustomization.yaml and airbyte/kube/overlays/dev-integration-test/parallelize-worker.yaml. The number of simultaneous jobs on a specific worker pod is also limited by the number of ports exposed by the worker deployment, which is set by TEMPORAL_WORKER_PORTS in your .env file. If no spare ports are available for communicating with connector pods, jobs will start to run but will hang until ports become available.
You can also cap the number of simultaneous jobs of each type on a worker pod by setting the MAX_SPEC_WORKERS, MAX_CHECK_WORKERS, MAX_DISCOVER_WORKERS, and MAX_SYNC_WORKERS environment variables on the worker pod deployment (not in the .env file). These values are useful if you want to create separate worker deployments for separate job types with different resource allocations.
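For a quick experiment, the per-type caps can be set on a running worker deployment with kubectl set env. This is only a sketch: set_worker_limits is a hypothetical helper, the numbers in the usage line are placeholders rather than recommendations, and a Kustomize patch in your own overlay is likely the more durable place for these values.

```shell
# Set the per-job-type concurrency caps on the airbyte-worker deployment.
# Arguments, in order: spec, check, discover, and sync worker limits.
set_worker_limits() {
  kubectl set env deployment/airbyte-worker \
    MAX_SPEC_WORKERS="$1" \
    MAX_CHECK_WORKERS="$2" \
    MAX_DISCOVER_WORKERS="$3" \
    MAX_SYNC_WORKERS="$4"
}

# Usage (placeholder values):
#   set_worker_limits 2 2 2 8
```

Changing the environment of a deployment triggers a rollout, so expect the worker pods to restart.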

Airbyte writes logs to two directories. App logs, including server and scheduler logs, are written to the app-logging directory. Job logs are written to the job-logging directory. Both directories live at the top level of the log bucket; for example, the app-logging directory lives at s3://log-bucket/app-logging. These paths can change, so we recommend using a dedicated log bucket and not using it for other purposes.
Airbyte publishes logs every minute, so minute-long log delays are normal. Each publish creates its own log file, since cloud storage services do not support append operations. This also means it is normal to see hundreds of files in your log bucket.
Each log file is named {yyyyMMddHH24mmss}_{podname}_{UUID} and is not compressed. Users can view logs simply by navigating to the relevant folder and downloading the file for the time period in question.
See the Known Issues section for planned logging improvements.
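Because every log file name begins with its timestamp, a listing of the bucket can be narrowed to a time window with a simple prefix match. A minimal sketch, assuming the timestamp is a run of digits in yyyyMMddHHmmss order as the naming pattern above suggests; logs_matching is a hypothetical helper and the file names in the usage example are made up:

```shell
# Read file names on stdin and print those whose timestamp starts with $1.
# $1 is a timestamp prefix, e.g. 2021061512 for the 12:00 hour on 2021-06-15.
logs_matching() {
  grep "^$1[0-9]*_"
}

# Usage with two made-up file names; only the first is printed:
#   printf '%s\n' 20210615123000_airbyte-server-abc_u1 \
#                 20210615133000_airbyte-server-abc_u2 | logs_matching 2021061512
```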

After Issue #3605 is completed, users will be able to configure a custom database instead of the simple Postgres container running directly in Kubernetes. A separate instance (preferably on a system like AWS RDS or Google Cloud SQL) should be easier and safer to maintain than Postgres on your cluster.

As we improve our Kubernetes offering, we would like to point out some common pain points. We are working on improving these. Please let us know if there are any other issues blocking your adoption of Airbyte or if you would like to contribute fixes to address any of these issues.
  • Some UI operations have higher latency on Kubernetes than Docker-Compose. (#4233)
  • Logging to Azure Storage is not supported. (#4200)
  • Large log files might take a while to load. (#4201)
  • UI does not include configured buckets in the displayed log path. (#4204)
  • Logs are not reset when Airbyte is re-deployed. (#4235)
  • File sources reading from and file destinations writing to local mounts are not supported on Kubernetes.

We use Kustomize to allow overrides for different environments. Our shared resources are in the kube/resources directory, and we define overlays for each environment. We recommend creating your own overlay if you want to customize your deployments. This overlay can live in your own VCS.
Example kustomization.yaml file:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
bases:
  - https://github.com/airbytehq/airbyte.git/kube/overlays/stable?ref=master

For a specific overlay, you can run kubectl kustomize kube/overlays/stable to view the manifests that Kustomize will apply to your Kubernetes cluster. This is useful for debugging because it will show the exact resources you are defining.

To deploy with Helm instead, check out the Helm Chart Readme.

Run kubectl logs deployments/airbyte-server to view the API server logs; add -f to follow them in real time. Logs can also be downloaded as a text file via the Admin tab in the UI.

Run kubectl logs deployments/airbyte-scheduler to view the scheduler and job logs; add -f to follow them in real time. Logs can also be downloaded as a text file via the Admin tab in the UI.

Although all logs can be accessed by viewing the scheduler logs, connector container logs may be easier to understand when viewed in isolation for a specific job attempt, through the Airbyte UI or the Airbyte API. Connector pods launched by Airbyte do not relay logs directly to Kubernetes logging, so you must access these logs through Airbyte.

To resize a volume, change the .spec.resources.requests.storage value of its PersistentVolumeClaim. After re-applying, the mount should be extended if that operation is supported for your type of mount. For a production deployment, it's useful to track the usage of volumes to ensure they don't run out of space.
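The same field can also be changed on a live claim with kubectl patch. This is a sketch rather than the documented workflow: resize_pvc is a hypothetical helper, 1Gi is a placeholder size, and the claim name is taken from the volume-configs example on this page. Kubernetes only expands a claim when its storage class sets allowVolumeExpansion: true, so check that first.

```shell
# Request a new size for a PersistentVolumeClaim by patching its spec.
# $1 is the claim name, $2 is the new size (for example 1Gi).
resize_pvc() {
  kubectl patch pvc "$1" --type merge \
    -p "{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"$2\"}}}}"
}

# Usage (placeholder size):
#   resize_pvc airbyte-volume-configs 1Gi
#   kubectl get pvc airbyte-volume-configs
```

If you patch a live claim, update the size in your manifests as well so that the two do not drift apart.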

To copy files to or from volumes, see the documentation for kubectl cp.

To list files in a workspace directory on the scheduler pod:
kubectl exec -it airbyte-scheduler-6b5747df5c-bj4fx -- ls /tmp/workspace/8

To read a file:
kubectl exec -it airbyte-scheduler-6b5747df5c-bj4fx -- cat /tmp/workspace/8/0/logs.log
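The pod name in these commands is specific to one cluster, since the suffix changes with every rollout. A minimal sketch for looking up the current scheduler pod, assuming its name contains airbyte-scheduler as in the example above; scheduler_pod is a hypothetical helper:

```shell
# Print the first pod whose name contains "airbyte-scheduler", as pod/<name>.
scheduler_pod() {
  kubectl get pods -o name | grep airbyte-scheduler | head -n 1
}

# Usage (kubectl exec accepts the pod/<name> form):
#   kubectl exec -it "$(scheduler_pod)" -- ls /tmp/workspace/8
```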

Running Airbyte on a GKE regional cluster requires enabling persistent regional storage. To do so, enable the CSI driver on GKE. After enabling it, add storageClassName: standard-rwo to the volume-configs YAML.
volume-configs.yaml example:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: airbyte-volume-configs
  labels:
    airbyte: volume-configs
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Mi
  storageClassName: standard-rwo

If you run into any problems operating Airbyte on Kubernetes, please reach out on the #issues channel on our Slack or create an issue on GitHub.
