layout | title | permalink |
---|---|---|
docs |
Column Encodings |
/docs/encodings.html |
All of the 16, 32, and 64 bit integer column types use the same set of potential encodings, which is basically whether they use RLE v1 or v2. If the PRESENT stream is not included, all of the values are present. For values that have false bits in the present stream, no values are included in the data stream.
Encoding | Stream Kind | Optional | Contents |
---|---|---|---|
DIRECT | PRESENT | Yes | Boolean RLE |
| DATA | No | Signed Integer RLE v1
DIRECT_V2 | PRESENT | Yes | Boolean RLE | DATA | No | Signed Integer RLE v2
Floating point types are stored using IEEE 754 floating point bit layout. Float columns use 4 bytes per value and double columns use 8 bytes.
Encoding | Stream Kind | Optional | Contents |
---|---|---|---|
DIRECT | PRESENT | Yes | Boolean RLE |
| DATA | No | IEEE 754 floating point representation
String, char, and varchar columns may be encoded either using a dictionary encoding or a direct encoding. A direct encoding should be preferred when there are many distinct values. In all of the encodings, the PRESENT stream encodes whether the value is null. The Java ORC writer automatically picks the encoding after the first row group (10,000 rows).
For direct encoding the UTF-8 bytes are saved in the DATA stream and the length of each value is written into the LENGTH stream. In direct encoding, if the values were ["Nevada", "California"]; the DATA would be "NevadaCalifornia" and the LENGTH would be [6, 10].
For dictionary encodings the dictionary is sorted and UTF-8 bytes of each unique value are placed into DICTIONARY_DATA. The length of each item in the dictionary is put into the LENGTH stream. The DATA stream consists of the sequence of references to the dictionary elements.
In dictionary encoding, if the values were ["Nevada", "California", "Nevada", "California", and "Florida"]; the DICTIONARY_DATA would be "CaliforniaFloridaNevada" and LENGTH would be [10, 7, 6]. The DATA would be [2, 0, 2, 0, 1].
Encoding | Stream Kind | Optional | Contents |
---|---|---|---|
DIRECT | PRESENT | Yes | Boolean RLE |
| DATA | No | String contents
| LENGTH | No | Unsigned Integer RLE v1
DICTIONARY | PRESENT | Yes | Boolean RLE | DATA | No | Unsigned Integer RLE v1 | DICTIONARY_DATA | No | String contents | LENGTH | No | Unsigned Integer RLE v1 DIRECT_V2 | PRESENT | Yes | Boolean RLE | DATA | No | String contents | LENGTH | No | Unsigned Integer RLE v2 DICTIONARY_V2 | PRESENT | Yes | Boolean RLE | DATA | No | Unsigned Integer RLE v2 | DICTIONARY_DATA | No | String contents | LENGTH | No | Unsigned Integer RLE v2
Boolean columns are rare, but have a simple encoding.
Encoding | Stream Kind | Optional | Contents |
---|---|---|---|
DIRECT | PRESENT | Yes | Boolean RLE |
| DATA | No | Boolean RLE
TinyInt (byte) columns use byte run length encoding.
Encoding | Stream Kind | Optional | Contents |
---|---|---|---|
DIRECT | PRESENT | Yes | Boolean RLE |
| DATA | No | Byte RLE
Binary data is encoded with a PRESENT stream, a DATA stream that records the contents, and a LENGTH stream that records the number of bytes per a value.
Encoding | Stream Kind | Optional | Contents |
---|---|---|---|
DIRECT | PRESENT | Yes | Boolean RLE |
| DATA | No | String contents
| LENGTH | No | Unsigned Integer RLE v1
DIRECT_V2 | PRESENT | Yes | Boolean RLE | DATA | No | String contents | LENGTH | No | Unsigned Integer RLE v2
Decimal was introduced in Hive 0.11 with infinite precision (the total number of digits). In Hive 0.13, the definition was change to limit the precision to a maximum of 38 digits, which conveniently uses 127 bits plus a sign bit. The current encoding of decimal columns stores the integer representation of the value as an unbounded length zigzag encoded base 128 varint. The scale is stored in the SECONDARY stream as an signed integer.
Encoding | Stream Kind | Optional | Contents |
---|---|---|---|
DIRECT | PRESENT | Yes | Boolean RLE |
| DATA | No | Unbounded base 128 varints
| SECONDARY | No | Unsigned Integer RLE v1
DIRECT_V2 | PRESENT | Yes | Boolean RLE | DATA | No | Unbounded base 128 varints | SECONDARY | No | Unsigned Integer RLE v2
Date data is encoded with a PRESENT stream, a DATA stream that records the number of days after January 1, 1970 in UTC.
Encoding | Stream Kind | Optional | Contents |
---|---|---|---|
DIRECT | PRESENT | Yes | Boolean RLE |
| DATA | No | Signed Integer RLE v1
DIRECT_V2 | PRESENT | Yes | Boolean RLE | DATA | No | Signed Integer RLE v2
Timestamp records times down to nanoseconds as a PRESENT stream that records non-null values, a DATA stream that records the number of seconds after 1 January 2015, and a SECONDARY stream that records the number of nanoseconds.
Because the number of nanoseconds often has a large number of trailing zeros, the number has trailing decimal zero digits removed and the last three bits are used to record how many zeros were removed. Thus 1000 nanoseconds would be serialized as 0x0b and 100000 would be serialized as 0x0d.
Encoding | Stream Kind | Optional | Contents |
---|---|---|---|
DIRECT | PRESENT | Yes | Boolean RLE |
| DATA | No | Signed Integer RLE v1
| SECONDARY | No | Unsigned Integer RLE v1
DIRECT_V2 | PRESENT | Yes | Boolean RLE | DATA | No | Signed Integer RLE v2 | SECONDARY | No | Unsigned Integer RLE v2
Structs have no data themselves and delegate everything to their child columns except for their PRESENT stream. They have a child column for each of the fields.
Encoding | Stream Kind | Optional | Contents |
---|---|---|---|
DIRECT | PRESENT | Yes | Boolean RLE |
Lists are encoded as the PRESENT stream and a length stream with number of items in each list. They have a single child column for the element values.
Encoding | Stream Kind | Optional | Contents |
---|---|---|---|
DIRECT | PRESENT | Yes | Boolean RLE |
| LENGTH | No | Unsigned Integer RLE v1
DIRECT_V2 | PRESENT | Yes | Boolean RLE | LENGTH | No | Unsigned Integer RLE v2
Maps are encoded as the PRESENT stream and a length stream with number of items in each list. They have a child column for the key and another child column for the value.
Encoding | Stream Kind | Optional | Contents |
---|---|---|---|
DIRECT | PRESENT | Yes | Boolean RLE |
| LENGTH | No | Unsigned Integer RLE v1
DIRECT_V2 | PRESENT | Yes | Boolean RLE | LENGTH | No | Unsigned Integer RLE v2
Unions are encoded as the PRESENT stream and a tag stream that controls which potential variant is used. They have a child column for each variant of the union. Currently ORC union types are limited to 256 variants, which matches the Hive type model.
Encoding | Stream Kind | Optional | Contents |
---|---|---|---|
DIRECT | PRESENT | Yes | Boolean RLE |
| DIRECT | No | Byte RLE