Parquet Update/UDFs in Impala
Nong Li, Software Engineer, Cloudera

Agenda
• Parquet
• File format description
• Benchmark Results in Impala
• Parquet 2.0
• UDF/UDAs

Parquet

Data Pages
• Values are stored in data pages as a triple: Definition Level, Repetition Level and Value.
• These are stored contiguously on disk => 1 seek to read a column regardless of nesting.
• Data pages are stored with different encodings:
  • Bit packing and Run Length Encoding (RLE)
  • Dictionary for strings
    • Extended to all types in Parquet 1.1
  • Plain (little endian encoding) for native types.

Parquet 2.0
• Additional Encodings
  • Group VarInt (for small ints)
  • Improved string storage format
  • Delta Encoding (for strings and ints)
• Additional Metadata
  • Sorted files
  • Page/Column/File Statistics
• Expected to further reduce on-disk size and a