Skip to content

Latest commit

 

History

History
206 lines (163 loc) · 8.74 KB

encodings.md

File metadata and controls

206 lines (163 loc) · 8.74 KB
layout title permalink
docs
Column Encodings
/docs/encodings.html

SmallInt, Int, and BigInt Columns

All of the 16, 32, and 64 bit integer column types use the same set of potential encodings, which is basically whether they use RLE v1 or v2. If the PRESENT stream is not included, all of the values are present. For values that have false bits in the present stream, no values are included in the data stream.

Encoding Stream Kind Optional Contents
DIRECT PRESENT Yes Boolean RLE
      | DATA        | No       | Signed Integer RLE v1

DIRECT_V2 | PRESENT | Yes | Boolean RLE | DATA | No | Signed Integer RLE v2

Float and Double Columns

Floating point types are stored using IEEE 754 floating point bit layout. Float columns use 4 bytes per value and double columns use 8 bytes.

Encoding Stream Kind Optional Contents
DIRECT PRESENT Yes Boolean RLE
      | DATA        | No       | IEEE 754 floating point representation

String, Char, and VarChar Columns

String, char, and varchar columns may be encoded either using a dictionary encoding or a direct encoding. A direct encoding should be preferred when there are many distinct values. In all of the encodings, the PRESENT stream encodes whether the value is null. The Java ORC writer automatically picks the encoding after the first row group (10,000 rows).

For direct encoding the UTF-8 bytes are saved in the DATA stream and the length of each value is written into the LENGTH stream. In direct encoding, if the values were ["Nevada", "California"]; the DATA would be "NevadaCalifornia" and the LENGTH would be [6, 10].

For dictionary encodings the dictionary is sorted and UTF-8 bytes of each unique value are placed into DICTIONARY_DATA. The length of each item in the dictionary is put into the LENGTH stream. The DATA stream consists of the sequence of references to the dictionary elements.

In dictionary encoding, if the values were ["Nevada", "California", "Nevada", "California", and "Florida"]; the DICTIONARY_DATA would be "CaliforniaFloridaNevada" and LENGTH would be [10, 7, 6]. The DATA would be [2, 0, 2, 0, 1].

Encoding Stream Kind Optional Contents
DIRECT PRESENT Yes Boolean RLE
          | DATA            | No       | String contents
          | LENGTH          | No       | Unsigned Integer RLE v1

DICTIONARY | PRESENT | Yes | Boolean RLE | DATA | No | Unsigned Integer RLE v1 | DICTIONARY_DATA | No | String contents | LENGTH | No | Unsigned Integer RLE v1 DIRECT_V2 | PRESENT | Yes | Boolean RLE | DATA | No | String contents | LENGTH | No | Unsigned Integer RLE v2 DICTIONARY_V2 | PRESENT | Yes | Boolean RLE | DATA | No | Unsigned Integer RLE v2 | DICTIONARY_DATA | No | String contents | LENGTH | No | Unsigned Integer RLE v2

Boolean Columns

Boolean columns are rare, but have a simple encoding.

Encoding Stream Kind Optional Contents
DIRECT PRESENT Yes Boolean RLE
          | DATA            | No       | Boolean RLE

TinyInt Columns

TinyInt (byte) columns use byte run length encoding.

Encoding Stream Kind Optional Contents
DIRECT PRESENT Yes Boolean RLE
          | DATA            | No       | Byte RLE

Binary Columns

Binary data is encoded with a PRESENT stream, a DATA stream that records the contents, and a LENGTH stream that records the number of bytes per a value.

Encoding Stream Kind Optional Contents
DIRECT PRESENT Yes Boolean RLE
          | DATA            | No       | String contents
          | LENGTH          | No       | Unsigned Integer RLE v1

DIRECT_V2 | PRESENT | Yes | Boolean RLE | DATA | No | String contents | LENGTH | No | Unsigned Integer RLE v2

Decimal Columns

Decimal was introduced in Hive 0.11 with infinite precision (the total number of digits). In Hive 0.13, the definition was change to limit the precision to a maximum of 38 digits, which conveniently uses 127 bits plus a sign bit. The current encoding of decimal columns stores the integer representation of the value as an unbounded length zigzag encoded base 128 varint. The scale is stored in the SECONDARY stream as an signed integer.

Encoding Stream Kind Optional Contents
DIRECT PRESENT Yes Boolean RLE
          | DATA            | No       | Unbounded base 128 varints
          | SECONDARY       | No       | Unsigned Integer RLE v1

DIRECT_V2 | PRESENT | Yes | Boolean RLE | DATA | No | Unbounded base 128 varints | SECONDARY | No | Unsigned Integer RLE v2

Date Columns

Date data is encoded with a PRESENT stream, a DATA stream that records the number of days after January 1, 1970 in UTC.

Encoding Stream Kind Optional Contents
DIRECT PRESENT Yes Boolean RLE
          | DATA            | No       | Signed Integer RLE v1

DIRECT_V2 | PRESENT | Yes | Boolean RLE | DATA | No | Signed Integer RLE v2

Timestamp Columns

Timestamp records times down to nanoseconds as a PRESENT stream that records non-null values, a DATA stream that records the number of seconds after 1 January 2015, and a SECONDARY stream that records the number of nanoseconds.

Because the number of nanoseconds often has a large number of trailing zeros, the number has trailing decimal zero digits removed and the last three bits are used to record how many zeros were removed. Thus 1000 nanoseconds would be serialized as 0x0b and 100000 would be serialized as 0x0d.

Encoding Stream Kind Optional Contents
DIRECT PRESENT Yes Boolean RLE
          | DATA            | No       | Signed Integer RLE v1
          | SECONDARY       | No       | Unsigned Integer RLE v1

DIRECT_V2 | PRESENT | Yes | Boolean RLE | DATA | No | Signed Integer RLE v2 | SECONDARY | No | Unsigned Integer RLE v2

Struct Columns

Structs have no data themselves and delegate everything to their child columns except for their PRESENT stream. They have a child column for each of the fields.

Encoding Stream Kind Optional Contents
DIRECT PRESENT Yes Boolean RLE

List Columns

Lists are encoded as the PRESENT stream and a length stream with number of items in each list. They have a single child column for the element values.

Encoding Stream Kind Optional Contents
DIRECT PRESENT Yes Boolean RLE
          | LENGTH          | No       | Unsigned Integer RLE v1

DIRECT_V2 | PRESENT | Yes | Boolean RLE | LENGTH | No | Unsigned Integer RLE v2

Map Columns

Maps are encoded as the PRESENT stream and a length stream with number of items in each list. They have a child column for the key and another child column for the value.

Encoding Stream Kind Optional Contents
DIRECT PRESENT Yes Boolean RLE
          | LENGTH          | No       | Unsigned Integer RLE v1

DIRECT_V2 | PRESENT | Yes | Boolean RLE | LENGTH | No | Unsigned Integer RLE v2

Union Columns

Unions are encoded as the PRESENT stream and a tag stream that controls which potential variant is used. They have a child column for each variant of the union. Currently ORC union types are limited to 256 variants, which matches the Hive type model.

Encoding Stream Kind Optional Contents
DIRECT PRESENT Yes Boolean RLE
          | DIRECT          | No       | Byte RLE