Redshift

The Airbyte Redshift destination allows you to sync data to Redshift.
This Redshift destination connector has two replication strategies:
  1. 1.
    INSERT: Replicates data via SQL INSERT queries. This is built on top of the destination-jdbc code base and is configured to rely on JDBC 4.2 standard drivers provided by Amazon via Mulesoft here as described in Redshift documentation here. Not recommended for production workloads as this does not scale well.
  2. 2.
    COPY: Replicates data by first uploading data to an S3 bucket and issuing a COPY command. This is the recommended loading approach described by Redshift best practices. Requires an S3 bucket and credentials.
Airbyte automatically picks an approach depending on the given configuration - if S3 configuration is present, Airbyte will use the COPY strategy and vice versa.
We recommend users use INSERT for testing, to avoid any additional setup, and switch to COPY for production workloads.

Each stream will be output into its own raw table in Redshift. Each table will contain 3 columns:
  • _airbyte_ab_id: a uuid assigned by Airbyte to each event that is processed. The column type in Redshift is VARCHAR.
  • _airbyte_emitted_at: a timestamp representing when the event was pulled from the data source. The column type in Redshift is TIMESTAMP WITH TIME ZONE.
  • _airbyte_data: a json blob representing with the event data. The column type in Redshift is VARCHAR but can be be parsed with JSON functions.

Feature
Supported?(Yes/No)
Notes
Full Refresh Sync
Yes
Incremental - Append Sync
Yes
Incremental - Deduped History
Yes
Namespaces
Yes
SSL Support
Yes

You will need to choose an existing database or create a new database that will be used to store synced data from Airbyte.

  1. 1.
    Active Redshift cluster
  2. 2.
    Allow connections from Airbyte to your Redshift cluster (if they exist in separate VPCs)
  3. 3.
    A staging S3 bucket with credentials (for the COPY strategy).
Even if your Airbyte instance is running on a server in the same VPC as your Redshift cluster, you may need to place them in the same security group to allow connections between the two.

This is dependent on your networking setup. The easiest way to verify if Airbyte is able to connect to your Redshift cluster is via the check connection tool in the UI. You can check AWS Redshift documentation with a tutorial on how to properly configure your cluster's access here

Next is to provide the necessary information on how to connect to your cluster such as the host whcih is part of the connection string or Endpoint accessible here without the port and database name (it typically includes the cluster-id, region and end with .redshift.amazonaws.com).
You should have all the requirements needed to configure Redshift as a destination in the UI. You'll need the following information to configure the destination:
  • Host
  • Port
  • Username
  • Password
  • Schema
  • Database
    • This database needs to exist within the cluster provided.

Provide the required S3 info.
  • S3 Bucket Name
    • See this to create an S3 bucket.
  • S3 Bucket Region
    • Place the S3 bucket and the Redshift cluster in the same region to save on networking costs.
  • Access Key Id
    • See this on how to generate an access key.
    • We recommend creating an Airbyte-specific user. This user will require read and write permissions to objects in the staging bucket.
  • Secret Access Key
    • Corresponding key to the above key id.
  • Part Size
    • Affects the size limit of an individual Redshift table. Optional. Increase this if syncing tables larger than 100GB. Files are streamed to S3 in parts. This determines the size of each part, in MBs. As S3 has a limit of 10,000 parts per file, part size affects the table size. This is 10MB by default, resulting in a default table limit of 100GB. Note, a larger part size will result in larger memory requirements. A rule of thumb is to multiply the part size by 10 to get the memory requirement. Modify this with care.
Optional parameters:
  • Bucket Path
    • The directory within the S3 bucket to place the staging data. For example, if you set this to yourFavoriteSubdirectory, we will place the staging data inside s3://yourBucket/yourFavoriteSubdirectory. If not provided, defaults to the root directory.
  • Purge Staging Data
    • Whether to delete the staging files from S3 after completing the sync. Specifically, the connector will create CSV files named bucketPath/namespace/streamName/syncDate_epochMillis_randomUuid.csv containing three columns (ab_id, data, emitted_at). Normally these files are deleted after the COPY command completes; if you want to keep them for other purposes, set purge_staging_data to false.

  • Begin with an ASCII single-byte alphabetic character or underscore character, or a UTF-8 multibyte character two to four bytes long.
  • Subsequent characters can be ASCII single-byte alphanumeric characters, underscores, or dollar signs, or UTF-8 multibyte characters two to four bytes long.
  • Be between 1 and 127 bytes in length, not including quotation marks for delimited identifiers.
  • Contain no quotation marks and no spaces.

Delimited identifiers (also known as quoted identifiers) begin and end with double quotation marks ("). If you use a delimited identifier, you must use the double quotation marks for every reference to that object. The identifier can contain any standard UTF-8 printable characters other than the double quotation mark itself. Therefore, you can create column or table names that include otherwise illegal characters, such as spaces or the percent symbol. ASCII letters in delimited identifiers are case-insensitive and are folded to lowercase. To use a double quotation mark in a string, you must precede it with another double quotation mark character.
Therefore, Airbyte Redshift destination will create tables and schemas using the Unquoted identifiers when possible or fallback to Quoted Identifiers if the names are containing special characters.

Redshift specifies a maximum limit of 65535 bytes to store the raw JSON record data. Thus, when a row is too big to fit, the Redshift destination fails to load such data and currently ignores that record.

Redshift specifies a maximum limit of 65535 bytes to store the raw JSON record data. Thus, when a row is too big to fit, the Redshift destination fails to load such data and currently ignores that record. See docs

All Redshift connections are encrypted using SSL

Version
Date
Pull Request
Subject
0.3.28
2022-03-18
#11254
Fixed missing records during S3 staging
0.3.27
2022-02-25
10421
Refactor JDBC parameters handling
0.3.25
2022-02-14
#9920
Updated the size of staging files for S3 staging. Also, added closure of S3 writers to staging files when data has been written to an staging file.
0.3.24
2022-02-14
10256
Add -XX:+ExitOnOutOfMemoryError JVM option
0.3.23
2021-12-16
#8855
Add purgeStagingData option to enable/disable deleting the staging data
0.3.22
2021-12-15
#8607
Accept a path for the staging data
0.3.21
2021-12-10
#8562
Moving classes around for better dependency management
0.3.20
2021-11-08
#7719
Improve handling of wide rows by buffering records based on their byte size rather than their count
0.3.19
2021-10-21
7234
Allow SSL traffic only
0.3.17
2021-10-12
6965
Added SSL Support
0.3.16
2021-10-11
6949
Each stream was split into files of 10,000 records each for copying using S3 or GCS
0.3.14
2021-10-08
5924
Fixed AWS S3 Staging COPY is writing records from different table in the same raw table
0.3.13
2021-09-02
5745
Disable STATUPDATE flag when using S3 staging to speed up performance
0.3.12
2021-07-21
3555
Enable partial checkpointing for halfway syncs
0.3.11
2021-07-20
4874
allow additionalProperties in connector spec
Last modified 6mo ago
Copy link
On this page
Overview
Sync overview
Getting started
Requirements
Setup guide
Notes about Redshift Naming Conventions
Standard Identifiers
Delimited Identifiers
Data Size Limitations
Data Size Limitations
Encryption
Changelog