Json to Avro Conversion
When an Airbyte data stream is synced to the Avro or Parquet format (e.g. Parquet on S3), the source Json schema is converted to an Avro schema, then the Json object is converted to an Avro record based on the Avro schema (and further to Parquet if necessary). Because the data stream can come from any data source, the Json to Avro conversion process has the following rules and limitations.

Json schema types are mapped to Avro types as follows:
| Json Data Type | Avro Data Type |
| --- | --- |
| string | string |
| number | double |
| integer | int |
| boolean | boolean |
| null | null |
| object | record |
| array | array |
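
The mapping above can be expressed as a plain lookup table. The following is a minimal sketch, not the converter's actual code; the names `JSON_TO_AVRO_TYPES` and `avro_type_for` are hypothetical and exist only for illustration.

```python
# Hypothetical lookup table mirroring the type mapping listed above.
JSON_TO_AVRO_TYPES = {
    "string": "string",
    "number": "double",
    "integer": "int",
    "boolean": "boolean",
    "null": "null",
    "object": "record",
    "array": "array",
}


def avro_type_for(json_type: str) -> str:
    """Return the Avro type name for a single Json schema type name."""
    try:
        return JSON_TO_AVRO_TYPES[json_type]
    except KeyError:
        # Anything outside the table is rejected rather than guessed at.
        raise ValueError(f"unsupported Json schema type: {json_type!r}") from None
```

Raising on an unknown type is a design choice of this sketch; the real converter may handle such cases differently.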

The following built-in Json formats will be mapped to Avro logical types.
| Json Type | Json Built-in Format | Avro Type | Avro Logical Type | Meaning |
| --- | --- | --- | --- | --- |
| string | date | int | date | Number of epoch days from 1970-01-01 (reference). |
| string | time | long | time-micros | Number of microseconds after midnight (reference). |
| string | date-time | long | timestamp-micros | Number of microseconds from 1970-01-01T00:00:00Z (reference). |
Date
The date logical type represents a date within the calendar, with no reference to a particular time zone or time of day.
A date logical type annotates an Avro int, where the int stores the number of days from the unix epoch, 1 January 1970 (ISO calendar).
{
"type": "string",
"format": "date"
}
will become the following in the Avro schema:
{
"type": "int",
"logicalType": "date"
}
Time (microsecond precision)
The time-micros logical type represents a time of day, with no reference to a particular calendar, time zone or date, with a precision of one microsecond.
A time-micros logical type annotates an Avro long, where the long stores the number of microseconds after midnight, 00:00:00.000000.
{
"type": "string",
"format": "time"
}
will become the following in the Avro schema:
{
"type": "long",
"logicalType": "time-micros"
}
Timestamp (microsecond precision)
The timestamp-micros logical type represents an instant on the global timeline, independent of a particular time zone or calendar, with a precision of one microsecond.
A timestamp-micros logical type annotates an Avro long, where the long stores the number of microseconds from the unix epoch, 1 January 1970 00:00:00.000000 UTC.
{
"type": "string",
"format": "date-time"
}
will become the following in the Avro schema:
{
"type": "long",
"logicalType": "timestamp-micros"
}
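
To make the three logical types concrete, the sketch below computes the stored integer for each format using only the Python standard library. It illustrates the definitions above and is not the converter's implementation. The function names are hypothetical, and treating a date-time without an offset as UTC is an assumption of this example.

```python
from datetime import date, datetime, time, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_DATETIME = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_avro_date(value: str) -> int:
    """Days since 1970-01-01 for an ISO date such as '2021-01-01'."""
    return (date.fromisoformat(value) - EPOCH_DATE).days


def to_avro_time_micros(value: str) -> int:
    """Microseconds after midnight for an ISO time such as '12:30:00'."""
    t = time.fromisoformat(value)
    seconds = (t.hour * 60 + t.minute) * 60 + t.second
    return seconds * 1_000_000 + t.microsecond


def to_avro_timestamp_micros(value: str) -> int:
    """Microseconds since the Unix epoch for an ISO date-time."""
    # Assumption: a trailing 'Z' means UTC.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    dt = datetime.fromisoformat(value)
    # Assumption: a value without an offset is interpreted as UTC.
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    delta = dt - EPOCH_DATETIME
    # Integer arithmetic avoids the rounding that float timestamps introduce.
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
```

For instance, `to_avro_date("1970-01-02")` yields 1 and `to_avro_time_micros("00:00:01")` yields 1000000.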

Combined restrictions (allOf, anyOf, and oneOf) will be converted to type unions. The corresponding Avro schema can be less stringent. For example, the following Json schema
{
"oneOf": [
{"type": "string"},
{"type": "integer"}
]
}
will become this in Avro schema:
{
"type": ["null", "string", "int"]
}
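
A rough sketch of this flattening for primitive options is shown below. The helper `union_for_combined` and the table `PRIMITIVES` are hypothetical names, and the sketch deliberately ignores nested objects and arrays. It also shows why the result is less stringent: allOf, anyOf, and oneOf all collapse to the same union.

```python
# Hypothetical table of primitive Json types and their Avro counterparts.
PRIMITIVES = {
    "null": "null",
    "string": "string",
    "number": "double",
    "integer": "int",
    "boolean": "boolean",
}


def union_for_combined(schema: dict) -> list:
    """Collapse allOf/anyOf/oneOf options into one nullable Avro union."""
    union = ["null"]
    for keyword in ("allOf", "anyOf", "oneOf"):
        for option in schema.get(keyword, []):
            avro_type = PRIMITIVES[option["type"]]
            # An Avro union may not list the same type twice.
            if avro_type not in union:
                union.append(avro_type)
    return union
```

Calling it on the `oneOf` schema above returns `["null", "string", "int"]`.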

The not keyword is not supported, as there is no equivalent validation mechanism in Avro schema.

Only alphanumeric characters and underscores ([a-zA-Z0-9_]) are allowed in a stream or field name. Any other character will be converted to a letter or an underscore. For example, spécial:character_names will become special_character_names. The original name will be stored in the doc property in this format: _airbyte_original_name:<original-name>.
A field name cannot start with a digit, so an underscore will be prepended to any field name that does.
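
One way to approximate these naming rules in Python is sketched below. The functions `sanitize_name` and `named_field` are hypothetical, and stripping accents through Unicode decomposition is an assumption about how a character such as é becomes e; the actual converter may use a different transliteration.

```python
import re
import unicodedata


def sanitize_name(name: str) -> str:
    """Reduce a stream or field name to letters, digits, and underscores."""
    # Decompose accented characters (é becomes e plus a combining accent),
    # then drop the combining marks so only the base letter remains.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace every remaining character outside [A-Za-z0-9_] with an underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", stripped)
    # A name cannot start with a digit, so prepend an underscore.
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


def named_field(original_name: str, avro_type) -> dict:
    """Build a field definition, keeping the original name in `doc` if it changed."""
    field = {"name": sanitize_name(original_name), "type": avro_type, "default": None}
    if field["name"] != original_name:
        field["doc"] = f"_airbyte_original_name:{original_name}"
    return field
```

With this sketch, `sanitize_name("spécial:character_names")` returns `special_character_names`, matching the example above.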

All fields will be nullable. For example, a string Json field will be typed as ["null", "string"] in Avro. This is necessary because the incoming data stream may have optional fields.
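
As an illustration of the nullable rule, the hypothetical helper below wraps any Avro type in a union with null and sets a null default, which is the shape used by the field definitions on this page. Because an Avro union may not directly contain another union, an existing union is extended rather than nested.

```python
def nullable_field(name: str, avro_type) -> dict:
    """Return a field definition whose type is a union starting with null."""
    # Flatten an existing union instead of nesting it inside a new one.
    members = list(avro_type) if isinstance(avro_type, list) else [avro_type]
    # List null first and exactly once, so the null default matches the union.
    union = ["null"] + [m for m in members if m != "null"]
    return {"name": name, "type": union, "default": None}
```

For example, `nullable_field("username", "string")` produces a field typed as `["null", "string"]` with a null default.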

For array fields in Json schema, when the items property is an array, it means that each element in the array should follow its own schema sequentially. For example, the following specification means the first item in the array should be a string, and the second a number.
{
"array_field": {
"type": "array",
"items": [
{"type": "string"},
{"type": "number"}
]
}
}
This is not supported in Avro schema. As a compromise, the converter creates a union, ["null", "string", "double"] (the Json number type maps to Avro double), which is less stringent:
{
"name": "array_field",
"type": [
"null",
{
"type": "array",
"items": ["null", "string", "double"]
}
],
"default": null
}
If the Json array has multiple object items, these objects will be recursively merged into one Avro record. For example, the following Json array expects two different objects. The first object has an id field, and the second has an id field and a message field. Their id fields have slightly different types.
Json schema:
{
"array_field": {
"type": "array",
"items": [
{
"type": "object",
"properties": {
"id": {
"type": "object",
"properties": {
"id_part_1": { "type": "integer" },
"id_part_2": { "type": "string" }
}
}
}
},
{
"type": "object",
"properties": {
"id": {
"type": "object",
"properties": {
"id_part_1": { "type": "string" },
"id_part_2": { "type": "integer" }
}
},
"message": {
"type": "string"
}
}
}
]
}
}
Json object:
{
"array_field": [
{
"id": {
"id_part_1": 1000,
"id_part_2": "abcde"
}
}, {
"id": {
"id_part_1": "wxyz",
"id_part_2": 2000
},
"message": "test message"
}
]
}
After conversion, the two object schemas will be merged into one. Furthermore, the fields under the id record, id_part_1 and id_part_2, will also be merged. In this way, all possible valid elements from the Json array can be converted to Avro records.
Avro schema:
{
"name": "array_field",
"type": [
"null",
{
"type": "array",
"items": [
"null",
{
"type": "record",
"name": "array_field",
"fields": [
{
"name": "id",
"type": [
"null",
{
"type": "record",
"name": "id",
"fields": [
{
"name": "id_part_1",
"type": ["null", "int", "string"],
"default": null
},
{
"name": "id_part_2",
"type": ["null", "string", "int"],
"default": null
}
]
}
],
"default": null
},
{
"name": "message",
"type": ["null", "string"],
"default": null
}
]
}
]
}
],
"default": null
}
Note that id_part_1 is a union of int and string, which comes from the first and second id definitions, respectively, in the original Json items specification.
Avro object:
{
"array_field": [
{
"id": {
"id_part_1": 1000,
"id_part_2": "abcde"
},
"message": null
},
{
"id": {
"id_part_1": "wxyz",
"id_part_2": 2000
},
"message": "test message"
}
]
}
Note that the first object in array_field originally does not have a message field. However, because its schema is merged with the second object definition, it has a null message field in the Avro record.
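
The recursive merge described above can be sketched as follows. This is an illustrative approximation rather than the converter's code: `merge_objects` and `PRIMITIVES` are hypothetical names, only primitive and object properties are handled, and the Airbyte-specific fields are left out to mirror the simplified schema in this example.

```python
# Hypothetical table of primitive Json types and their Avro counterparts.
PRIMITIVES = {"string": "string", "number": "double", "integer": "int", "boolean": "boolean"}


def merge_objects(schemas: list, record_name: str) -> dict:
    """Merge several Json object schemas into a single Avro record schema."""
    # Group the property definitions by field name, in first-seen order.
    grouped = {}
    for schema in schemas:
        for name, prop in schema.get("properties", {}).items():
            grouped.setdefault(name, []).append(prop)

    fields = []
    for name, props in grouped.items():
        union = ["null"]
        nested = []
        for prop in props:
            if prop.get("type") == "object":
                nested.append(prop)  # merged recursively below
            else:
                avro_type = PRIMITIVES[prop["type"]]
                if avro_type not in union:
                    union.append(avro_type)
        if nested:
            # All object definitions of the same field collapse into one record.
            union.append(merge_objects(nested, name))
        fields.append({"name": name, "type": union, "default": None})
    return {"type": "record", "name": record_name, "fields": fields}
```

Passing the two object schemas from the `items` list above, with `record_name="array_field"`, yields a record whose `id_part_1` field is typed `["null", "int", "string"]`, because the integer definition is seen before the string one.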

When a Json array field has no items specification, the elements in that array field may have any type. However, Avro requires that each array has a clear type specification. To solve this problem, the elements in the array are forced to be strings.
For example, given the following Json schema and object:
{
"type": "object",
"properties": {
"identifier": {
"type": "array"
}
}
}
{
"identifier": ["151", 152, true, {"id": 153}, null]
}
the corresponding Avro schema and object will be:
{
"type": "record",
"fields": [
{
"name": "identifier",
"type": [
"null",
{
"type": "array",
"items": ["null", "string"]
}
],
"default": null
}
]
}
{
"identifier": ["151", "152", "true", "{\"id\": 153}", null]
}
Note that every non-null element inside the identifier array field is converted to string.
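
The stringification of untyped array elements can be approximated with the standard `json` module, as in the sketch below. The function name `stringify_elements` is hypothetical, and the exact whitespace the converter emits when serializing nested objects is an assumption here.

```python
import json


def stringify_elements(values: list) -> list:
    """Convert every non-null element of an untyped array to a string."""
    result = []
    for value in values:
        if value is None or isinstance(value, str):
            # Nulls stay null and strings are kept unchanged.
            result.append(value)
        else:
            # Numbers, booleans, and nested structures are serialized as Json text.
            result.append(json.dumps(value))
    return result
```

Applied to `["151", 152, True, {"id": 153}, None]`, it returns `["151", "152", "true", '{"id": 153}', None]`.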

Three Airbyte specific fields will be added to each Avro record:
| Field | Schema | Document |
| --- | --- | --- |
| _airbyte_ab_id | uuid | link |
| _airbyte_emitted_at | timestamp-millis | link |
| _airbyte_additional_properties | map of string | See explanation below. |

A Json object can have additional properties of unknown types, which is not compatible with the Avro schema. To solve this problem during Json to Avro object conversion, we introduce a special field: _airbyte_additional_properties typed as a nullable map from string to string:
{
"name": "_airbyte_additional_properties",
"type": ["null", {"type": "map", "values": "string"}],
"default": null
}
For example, given the following Json schema:
{
"type": "object",
"properties": {
"username": {
"type": ["null", "string"]
}
}
}
this Json object
{
"username": "admin",
"active": true,
"age": 21,
"auth": {
"auth_type": "ssl",
"api_key": "abcdefg/012345",
"admin": false,
"id": 1000
}
}
will be converted to the following Avro object:
{
"username": "admin",
"_airbyte_additional_properties": {
"active": "true",
"age": "21",
"auth": "{\"auth_type\":\"ssl\",\"api_key\":\"abcdefg/012345\",\"admin\":false,\"id\":1000}"
}
}
Note that all fields other than username are moved under _airbyte_additional_properties as serialized strings, including the original object auth.

If an object field has no properties specification, all fields within this object will be put into the aforementioned _airbyte_additional_properties field.
For example, given the following Json schema and object:
{
"type": "object"
}
{
"username": "343-guilty-spark",
"password": 1439,
"active": true
}
the corresponding Avro schema and record will be:
{
"type": "record",
"name": "record_without_properties",
"fields": [
{
"name": "_airbyte_additional_properties",
"type": ["null", {"type": "map", "values": "string"}],
"default": null
}
]
}
{
"_airbyte_additional_properties": {
"username": "343-guilty-spark",
"password": "1439",
"active": "true"
}
}
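
Both the additional-properties rule and the untyped-object rule come down to the same step: any key not declared under properties is serialized and moved into the map. The sketch below illustrates that step. The function `split_additional_properties` is hypothetical, and the compact separators passed to `json.dumps` are an assumption chosen to reproduce the serialized auth string shown earlier.

```python
import json

ADDITIONAL_PROPERTIES = "_airbyte_additional_properties"


def split_additional_properties(schema: dict, obj: dict) -> dict:
    """Keep declared fields as-is and move all other fields into a string map."""
    declared = schema.get("properties", {})
    record = {}
    extras = {}
    for key, value in obj.items():
        if key in declared:
            record[key] = value
        elif isinstance(value, str):
            extras[key] = value
        else:
            # Non-string values are stored as compact Json text.
            extras[key] = json.dumps(value, separators=(",", ":"))
    # The field is nullable, so an object without extras gets null.
    record[ADDITIONAL_PROPERTIES] = extras or None
    return record
```

With the schema `{"type": "object"}`, which declares no properties, every key of the input object ends up in the map, as in the untyped-object example above.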

Any field without a type specification will default to a string field, and its value will be serialized to a string.

Based on the above rules, here is an overall example. Given the following Json schema:
{
"type": "object",
"$schema": "http://json-schema.org/draft-07/schema#",
"properties": {
"id": {
"type": "integer"
},
"user": {
"type": ["null", "object"],
"properties": {
"id": {
"type": "integer"
},
"field_with_spécial_character": {
"type": "integer"
}
}
},
"created_at": {
"type": ["null", "string"],
"format": "date-time"
}
}
}
Its corresponding Avro schema will be:
{
"name": "stream_name",
"type": "record",
"fields": [
{
"name": "_airbyte_ab_id",
"type": {
"type": "string",
"logicalType": "uuid"
}
},
{
"name": "_airbyte_emitted_at",
"type": {
"type": "long",
"logicalType": "timestamp-millis"
}
},
{
"name": "id",
"type": ["null", "int"],
"default": null
},
{
"name": "user",
"type": [
"null",
{
"type": "record",
"name": "user",
"fields": [
{
"name": "id",
"type": ["null", "int"],
"default": null
},
{
"name": "field_with_special_character",
"type": ["null", "int"],
"doc": "_airbyte_original_name:field_with_spécial_character",
"default": null
},
{
"name": "_airbyte_additional_properties",
"type": ["null", {"type": "map", "values": "string"}],
"default": null
}
]
}
],
"default": null
},
{
"name": "created_at",
"type": [
"null",
{"type": "long", "logicalType": "timestamp-micros"},
"string"
],
"default": null
},
{
"name": "_airbyte_additional_properties",
"type": ["null", {"type": "map", "values": "string"}],
"default": null
}
]
}
More examples can be found in the Json to Avro conversion test cases.
