S3

| Feature | Support | Notes |
| :--- | :--- | :--- |
| Full Refresh Sync | ✅ | Warning: this mode deletes all previously synced data in the configured bucket path. |
| Incremental - Append Sync | ✅ |  |
| Incremental - Deduped History | ❌ | As this connector does not support dbt, we don't support this sync mode on this destination. |
| Namespaces | ✅ | Setting a specific bucket path is equivalent to having separate namespaces. |
The Airbyte S3 destination allows you to sync data to AWS S3 or Minio S3. Each stream is written to its own directory under the bucket.

Check out common troubleshooting issues for the S3 destination connector on our Discourse here.

| Parameter | Type | Notes |
| :--- | :--- | :--- |
| S3 Endpoint | string | URL to S3. If using AWS S3, just leave it blank. |
| S3 Bucket Name | string | Name of the bucket to sync data into. |
| S3 Bucket Path | string | Subdirectory under the above bucket to sync the data into. |
| S3 Bucket Format | string | Additional subdirectory format under S3 Bucket Path. The default value is ${NAMESPACE}/${STREAM_NAME}/ and it can be further customized with variables such as ${YEAR}, ${MONTH}, ${DAY}, ${HOUR}, etc., referring to the writing datetime. |
| S3 Region | string | See here for all region codes. |
| Access Key ID | string | AWS/Minio credential. |
| Secret Access Key | string | AWS/Minio credential. |
| Format | object | Format-specific configuration. See the spec for details. |
⚠️ Please note that under "Full Refresh Sync" mode, data in the configured bucket and path will be wiped out before each sync. We recommend provisioning a dedicated S3 resource for this sync to prevent unexpected data deletion caused by misconfiguration. ⚠️
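To see how the parameters above fit together, here is an illustrative configuration sketch written as a Python dict. The key names are assumptions made for the example only; the authoritative field names and allowed values come from the connector's spec.

```python
# Illustrative only: the key names below are assumed for this sketch,
# not taken from the connector's spec.
s3_destination_config = {
    "s3_endpoint": "",  # leave empty for AWS S3; set a URL for a Minio deployment
    "s3_bucket_name": "testing_bucket",
    "s3_bucket_path": "data_output_path",
    "s3_path_format": "${NAMESPACE}/${STREAM_NAME}/${YEAR}_${MONTH}_${DAY}",
    "s3_bucket_region": "us-east-1",
    "access_key_id": "<your-access-key-id>",
    "secret_access_key": "<your-secret-access-key>",
    "format": {"format_type": "CSV"},  # format-specific options go under this object
}
```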
The full path of the output data with S3 path format ${NAMESPACE}/${STREAM_NAME}/${YEAR}_${MONTH}_${DAY} is:
```
<bucket-name>/<source-namespace-if-exists>/<stream-name>/<upload-date>/<partition-uuid>.<format-extension>
```
For example:
```
testing_bucket/data_output_path/public/users/2021_01_01/123e4567-e89b-12d3-a456-426614174000.csv.gz
↑              ↑                ↑      ↑     ↑          ↑                                    ↑
|              |                |      |     |          |                                    format extension
|              |                |      |     |          uuid
|              |                |      |     upload date in YYYY_MM_DD
|              |                |      stream name
|              |                source namespace (if it exists)
|              bucket path
bucket name
```
Please note that the stream name may contain a prefix, if one is configured on the connection. A data sync may create multiple files, as the output files can be partitioned by size (targeting a size of 200 MB compressed or lower).
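As a rough illustration of how the path template resolves, here is a small Python sketch. It is not the connector's actual implementation; the helper name, argument order, and example values are made up for this example.

```python
from datetime import datetime, timezone
from uuid import uuid4

# Sketch: resolve an S3 path format such as
# "${NAMESPACE}/${STREAM_NAME}/${YEAR}_${MONTH}_${DAY}" into an object key.
def resolve_path(bucket_path: str, path_format: str, namespace: str, stream: str,
                 write_time: datetime, extension: str = "csv.gz") -> str:
    substitutions = {
        "${NAMESPACE}": namespace,
        "${STREAM_NAME}": stream,
        "${YEAR}": f"{write_time.year:04d}",
        "${MONTH}": f"{write_time.month:02d}",
        "${DAY}": f"{write_time.day:02d}",
        "${HOUR}": f"{write_time.hour:02d}",
    }
    resolved = path_format
    for token, value in substitutions.items():
        resolved = resolved.replace(token, value)
    # One file per partition; large syncs may produce several files per directory.
    return f"{bucket_path}/{resolved}/{uuid4()}.{extension}"

print(resolve_path("data_output_path",
                   "${NAMESPACE}/${STREAM_NAME}/${YEAR}_${MONTH}_${DAY}",
                   "public", "users", datetime(2021, 1, 1, tzinfo=timezone.utc)))
# -> data_output_path/public/users/2021_01_01/<uuid>.csv.gz
```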

Each stream is written to its own directory according to the configuration. The complete data set of each stream consists of all the output files under that directory. You can think of the directory as the equivalent of a table in the database world; a sketch of listing one stream's files follows the bullets below.
  • Under Full Refresh Sync mode, old output files will be purged before new files are created.
  • Under Incremental - Append Sync mode, new output files will be added that only contain the new data.
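For instance, here is a minimal sketch of enumerating all the files that make up one stream's "table" with boto3. The bucket name, prefix, and credential setup are placeholders for your own configuration.

```python
import boto3

# Assumes credentials are available via environment variables or an instance profile.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# <bucket-path>/<namespace>/<stream>/ -- placeholder values matching the example above.
stream_prefix = "data_output_path/public/users/"
for page in paginator.paginate(Bucket="testing_bucket", Prefix=stream_prefix):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])
```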

Apache Avro serializes data in a compact binary format. Currently, the Airbyte S3 Avro connector always uses the binary encoding, and assumes that all data records follow the same schema.

Here are the available compression codecs (a sketch of selecting a codec and compression level with a Python Avro library follows this list):
  • No compression
  • deflate
    • Compression level
      • Range [0, 9]. Defaults to 0.
      • Level 0: no compression & fastest.
      • Level 9: best compression & slowest.
  • bzip2
  • xz
    • Compression level
      • Range [0, 9]. Defaults to 6.
      • Level 0-3 are fast with medium compression.
      • Level 4-6 are fairly slow with high compression.
      • Levels 7-9 are like level 6 but use bigger dictionaries and have higher memory requirements. Unless the uncompressed size of the file exceeds 8 MiB, 16 MiB, or 32 MiB, it is a waste of memory to use presets 7, 8, or 9, respectively.
  • zstandard
    • Compression level
      • Range [-5, 22]. Defaults to 3.
      • Negative levels are 'fast' modes akin to lz4 or snappy.
      • Levels above 9 are generally for archival purposes.
      • Levels above 18 use a lot of memory.
    • Include checksum
      • If set to true, a checksum will be included in each data block.
  • snappy
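To make the codec and level options concrete, here is a small sketch using the Python fastavro library. The connector itself is not implemented in Python; this only illustrates what choosing a codec and compression level means, and the schema and records are placeholders.

```python
import io

import fastavro  # third-party library, used only to illustrate codec/level choices

# Placeholder schema and data; the connector derives its schema from the stream.
schema = {
    "type": "record",
    "name": "users",
    "fields": [
        {"name": "user_id", "type": "int"},
        {"name": "name", "type": "string"},
    ],
}
records = [{"user_id": 123, "name": "John Doe"}]

buffer = io.BytesIO()
# Codec options mirror the list above ("null", "deflate", "bzip2", "xz",
# "zstandard", "snappy"); the level applies to codecs that support one.
fastavro.writer(buffer, schema, records, codec="deflate", codec_compression_level=9)
print(f"wrote {buffer.tell()} bytes")
```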

Under the hood, an Airbyte data stream's JSON schema is first converted to an Avro schema, and then each JSON object is converted to an Avro record. Because the data stream can come from any data source, the JSON-to-Avro conversion process has arbitrary rules and limitations. Learn more about how source data is converted to Avro and the current limitations here.

Like most other Airbyte destination connectors, this connector usually outputs three columns: a UUID, an emission timestamp, and the data blob. With the CSV output, it is possible to normalize (flatten) the data blob into multiple columns.
| Column | Condition | Description |
| :--- | :--- | :--- |
| _airbyte_ab_id | Always exists | A UUID assigned by Airbyte to each processed record. |
| _airbyte_emitted_at | Always exists | A timestamp representing when the event was pulled from the data source. |
| _airbyte_data | When no normalization (flattening) is needed, all data reside under this column as a JSON blob. |  |
| root level fields | When root-level normalization (flattening) is selected, the root-level fields are expanded. |  |
For example, given the following JSON object from a source:
```json
{
  "user_id": 123,
  "name": {
    "first": "John",
    "last": "Doe"
  }
}
```
With no normalization, the output CSV is:

| _airbyte_ab_id | _airbyte_emitted_at | _airbyte_data |
| :--- | :--- | :--- |
| 26d73cde-7eb1-4e1e-b7db-a4c03b4cf206 | 1622135805000 | { "user_id": 123, "name": { "first": "John", "last": "Doe" } } |
With root level normalization, the output CSV is:

| _airbyte_ab_id | _airbyte_emitted_at | user_id | name |
| :--- | :--- | :--- | :--- |
| 26d73cde-7eb1-4e1e-b7db-a4c03b4cf206 | 1622135805000 | 123 | { "first": "John", "last": "Doe" } |
Output CSV files will always be compressed using GZIP compression.
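Here is a minimal sketch of reading one of these gzipped CSV files back with boto3 and pandas, assuming the placeholder bucket and object key from the path example above.

```python
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")  # credentials via environment variables or instance profile
obj = s3.get_object(
    Bucket="testing_bucket",
    Key="data_output_path/public/users/2021_01_01/123e4567-e89b-12d3-a456-426614174000.csv.gz",
)
# The object body is gzip-compressed CSV; pandas can decompress it directly.
df = pd.read_csv(io.BytesIO(obj["Body"].read()), compression="gzip")
print(df[["_airbyte_ab_id", "_airbyte_emitted_at"]].head())
```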

JSON Lines is a text format with one JSON object per line. Each line has the following structure:
```json
{
  "_airbyte_ab_id": "<uuid>",
  "_airbyte_emitted_at": "<timestamp-in-millis>",
  "_airbyte_data": "<json-data-from-source>"
}
```
For example, given the following two JSON objects from a source:
```json
[
  {
    "user_id": 123,
    "name": {
      "first": "John",
      "last": "Doe"
    }
  },
  {
    "user_id": 456,
    "name": {
      "first": "Jane",
      "last": "Roe"
    }
  }
]
```
They will appear in the output file as:
```
{ "_airbyte_ab_id": "26d73cde-7eb1-4e1e-b7db-a4c03b4cf206", "_airbyte_emitted_at": "1622135805000", "_airbyte_data": { "user_id": 123, "name": { "first": "John", "last": "Doe" } } }
{ "_airbyte_ab_id": "0a61de1b-9cdd-4455-a739-93572c9a5f20", "_airbyte_emitted_at": "1631948170000", "_airbyte_data": { "user_id": 456, "name": { "first": "Jane", "last": "Roe" } } }
```
Output JSONL files will always be compressed using GZIP compression.
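Here is a minimal sketch of iterating over the records in one of these gzipped JSONL files after downloading it locally; the filename is a placeholder.

```python
import gzip
import json

# Read the gzip-compressed JSONL output line by line.
with gzip.open("123e4567-e89b-12d3-a456-426614174000.jsonl.gz", "rt") as fh:
    for line in fh:
        record = json.loads(line)
        print(record["_airbyte_ab_id"], record["_airbyte_data"]["user_id"])
```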

The following configuration is available to configure the Parquet output:
| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| compression_codec | enum | UNCOMPRESSED | Compression algorithm. Available candidates are: UNCOMPRESSED, SNAPPY, GZIP, LZO, BROTLI, LZ4, and ZSTD. |
| block_size_mb | integer | 128 (MB) | Block size (row group size) in MB. This is the size of a row group being buffered in memory. It limits the memory usage when writing. Larger values improve the IO when reading but consume more memory when writing. |
| max_padding_size_mb | integer | 8 (MB) | Max padding size in MB. This is the maximum size allowed as padding to align row groups. This is also the minimum size of a row group. |
| page_size_kb | integer | 1024 (KB) | Page size in KB. The page size is for compression. A block is composed of pages. A page is the smallest unit that must be read fully to access a single record. If this value is too small, the compression will deteriorate. |
| dictionary_page_size_kb | integer | 1024 (KB) | Dictionary page size in KB. There is one dictionary page per column per row group when dictionary encoding is used. The dictionary page size works like the page size but for dictionaries. |
| dictionary_encoding | boolean | true | Dictionary encoding. This parameter controls whether dictionary encoding is turned on. |
These parameters are related to the ParquetOutputFormat. See the Java doc for more details. Also see Parquet documentation for their recommended configurations (512 - 1024 MB block size, 8 KB page size).
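As a rough analogy only (the connector uses the Java ParquetOutputFormat, not pyarrow), the following sketch shows pyarrow counterparts of a few of these options so you can see what they affect. The values are arbitrary, and dictionary_pagesize_limit assumes a recent pyarrow version.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder data; the connector writes Avro records converted from the stream.
table = pa.table({"user_id": [123, 456], "name": ["John Doe", "Jane Roe"]})

pq.write_table(
    table,
    "users.parquet",
    compression="zstd",                     # roughly compression_codec
    use_dictionary=True,                    # roughly dictionary_encoding
    data_page_size=1024 * 1024,             # roughly page_size_kb (bytes here)
    dictionary_pagesize_limit=1024 * 1024,  # roughly dictionary_page_size_kb (bytes here)
    row_group_size=100_000,                 # pyarrow sizes row groups in rows, not MB
)
```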

Under the hood, an Airbyte data stream's JSON schema is first converted to an Avro schema, then each JSON object is converted to an Avro record, and finally the Avro record is written out in the Parquet format. Because the data stream can come from any data source, the JSON-to-Avro conversion process has arbitrary rules and limitations. Learn more about how source data is converted to Avro and the current limitations here.

  1. Allow connections from the Airbyte server to your AWS S3/Minio S3 cluster (if they exist in separate VPCs).
  2. An S3 bucket with credentials or an instance profile with read/write permissions configured for the host (EC2, EKS).

  • Fill in the S3 info
    • S3 Endpoint
      • Leave empty if using AWS S3; fill in the S3 URL if using Minio S3.
    • S3 Bucket Name
      • See this to create an S3 bucket.
    • S3 Bucket Region
    • Access Key ID
      • See this on how to generate an access key.
      • See this on how to create an instance profile.
      • We recommend creating an Airbyte-specific user. This user will require read and write permissions to objects in the staging bucket.
      • If the Access Key and Secret Access Key are not provided, authentication will rely on the instance profile.
    • Secret Access Key
      • Corresponding key to the above key id.
  • Make sure your S3 bucket is accessible from the machine running Airbyte.
    • This depends on your networking setup.
    • You can check AWS S3 documentation with a tutorial on how to properly configure your S3's access here.
    • If you will use instance profile authentication, make sure the role has permission to read/write on the bucket.
    • The easiest way to verify that Airbyte is able to connect to your S3 bucket is via the check connection tool in the UI; a manual sketch of a similar check is shown after this list.
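If you prefer to verify the bucket and credentials manually before running a sync, here is a sketch of an equivalent check with boto3. The bucket name and path are placeholders, and the test simply lists, writes, reads, and deletes a small object under the configured path.

```python
import uuid

import boto3
from botocore.exceptions import ClientError

BUCKET = "testing_bucket"       # placeholder
PREFIX = "data_output_path/"    # placeholder bucket path

s3 = boto3.client("s3")  # uses env credentials or the instance profile
test_key = f"{PREFIX}_airbyte_connection_test_{uuid.uuid4()}"

try:
    s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX, MaxKeys=1)  # list permission
    s3.put_object(Bucket=BUCKET, Key=test_key, Body=b"check")    # write permission
    s3.get_object(Bucket=BUCKET, Key=test_key)                   # read permission
    s3.delete_object(Bucket=BUCKET, Key=test_key)                # delete permission
    print("bucket is reachable and the credentials can read/write under the path")
except ClientError as err:
    print(f"check failed: {err}")
```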

| Version | Date | Pull Request | Subject |
| :--- | :--- | :--- | :--- |
| 0.2.13 | 2022-03-29 | #11496 | Fix S3 bucket path to be included with S3 bucket format |
| 0.2.12 | 2022-03-28 | #11294 | Change to serialized buffering strategy to reduce memory consumption |
| 0.2.11 | 2022-03-23 | #11173 | Added support for AWS Glue crawler |
| 0.2.10 | 2022-03-07 | #10856 | check method now tests for listObjects permissions on the target bucket |
| 0.2.7 | 2022-02-14 | #10318 | Prevented double slashes in S3 destination path |
| 0.2.6 | 2022-02-14 | #10256 | Add -XX:+ExitOnOutOfMemoryError JVM option |
| 0.2.5 | 2022-01-13 | #9399 | Use instance profile authentication if credentials are not provided |
| 0.2.4 | 2022-01-12 | #9415 | BigQuery Destination: Fix GCS processing of Facebook data |
| 0.2.3 | 2022-01-11 | #9367 | Avro & Parquet: support array field with unknown item type; default any improperly typed field to string. |
| 0.2.2 | 2021-12-21 | #8574 | Added namespace to Avro and Parquet record types |
| 0.2.1 | 2021-12-20 | #8974 | Release a new version to ensure there is no excessive logging. |
| 0.2.0 | 2021-12-15 | #8607 | Change the output filename for CSV files - it's now bucketPath/namespace/streamName/timestamp_epochMillis_randomUuid.csv |
| 0.1.16 | 2021-12-10 | #8562 | Swap dependencies with destination-jdbc. |
| 0.1.15 | 2021-12-03 | #8501 | Remove excessive logging for Avro and Parquet invalid date strings. |
| 0.1.14 | 2021-11-09 | #7732 | Support timestamp in Avro and Parquet |
| 0.1.13 | 2021-11-03 | #7288 | Support Json additionalProperties. |
| 0.1.12 | 2021-09-13 | #5720 | Added configurable block size for stream. Each stream is limited to 10,000 by S3 |
| 0.1.11 | 2021-09-10 | #5729 | For field names that start with a digit, a _ will be appended at the beginning for the Parquet and Avro formats. |
| 0.1.10 | 2021-08-17 | #4699 | Added json config validator |
| 0.1.9 | 2021-07-12 | #4666 | Fix MinIO output for Parquet format. |
| 0.1.8 | 2021-07-07 | #4613 | Patched schema converter to support combined restrictions. |
| 0.1.7 | 2021-06-23 | #4227 | Added Avro and JSONL output. |
| 0.1.6 | 2021-06-16 | #4130 | Patched the check to verify prefix access instead of full-bucket access. |
| 0.1.5 | 2021-06-14 | #3908 | Fixed default max_padding_size_mb in spec.json. |
| 0.1.4 | 2021-06-14 | #3908 | Added Parquet output. |
| 0.1.3 | 2021-06-13 | #4038 | Added support for alternative S3. |
| 0.1.2 | 2021-06-10 | #4029 | Fixed _airbyte_emitted_at field to be a UTC instead of local timestamp for consistency. |
| 0.1.1 | 2021-06-09 | #3973 | Added AIRBYTE_ENTRYPOINT in base Docker image for Kubernetes support. |
| 0.1.0 | 2021-06-03 | #3672 | Initial release with CSV output. |