IMPALA-9515: Full ACID Milestone 3: Read support for "original files"

"Original files" are files that don't have full ACID schema. We can see
such files if we upgrade a non-ACID table to full ACID. Also, the LOAD
DATA statement can load non-ACID files into full ACID tables. So such
files don't store special ACID columns, that means we need
to auto-generate their values. These are (operation,
originalTransaction, bucket, rowid, and currentTransaction).

With the exception of 'rowid', all of them can be calculated based on
the file path, so I add their values to the scanner's template tuple.

'rowid' is the ordinal number of the row inside a bucket inside a
directory. For now Impala only allows one file per bucket per
directory. Therefore we can generate row ids for each file
independently.

Multiple files in a single bucket in a directory can only be present if
the table was non-transactional earlier and we upgraded it to full ACID
table. After the first compaction we should only see one original file
per bucket per directory.

In HdfsOrcScanner we calculate the first row id for our split then
the OrcStructReader fills the rowid slot with the proper values.

Testing:
 * added e2e tests to check if the generated values are correct
 * added e2e test to reject tables that have multiple files per bucket
 * added unit tests to the new auxiliary functions

Change-Id: I176497ef9873ed7589bd3dee07d048a42dfad953
Reviewed-on: http://gerrit.cloudera.org:8080/16001
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit is contained in:
Zoltan Borok-Nagy
2020-05-19 11:47:08 +02:00
committed by Impala Public Jenkins
parent 94c1b6354b
commit 930264afbd
16 changed files with 715 additions and 80 deletions

View File

@@ -374,6 +374,31 @@ TBLPROPERTIES("hbase.table.name" = "functional_hbase.hbasealltypeserror");
---- DATASET
functional
---- BASE_TABLE_NAME
alltypes_promoted
---- PARTITION_COLUMNS
year int
month int
---- COLUMNS
id int COMMENT 'Add a comment'
bool_col boolean
tinyint_col tinyint
smallint_col smallint
int_col int
bigint_col bigint
float_col float
double_col double
date_string_col string
string_col string
timestamp_col timestamp
---- DEPENDENT_LOAD_HIVE
INSERT INTO TABLE {db_name}{db_suffix}.{table_name} SELECT * FROM {db_name}{db_suffix}.alltypes;
ALTER TABLE {db_name}{db_suffix}.{table_name} SET tblproperties('EXTERNAL'='FALSE','transactional'='true');
---- TABLE_PROPERTIES
transactional=false
====
---- DATASET
functional
---- BASE_TABLE_NAME
hbasecolumnfamilies
---- HBASE_COLUMN_FAMILIES
0