Recent Hive releases seem to enforce that the data for a managed table
is stored under the path set by the hive.metastore.warehouse.dir
property, in a directory of the form databasename.db/tablename - see
https://cwiki.apache.org/confluence/display/Hive/Managed+vs.+External+Tables
Use the form /test-warehouse/managed/databasename.db in
generate-schema-statements.py when creating transactional tables.
Testing:
- A few small changes to tests that verify filesystem changes for acid
tables.
- Exhaustive tests pass.
Change-Id: Ib870ca802c9fa180e6be7a6f65bef35b227772db
Reviewed-on: http://gerrit.cloudera.org:8080/18046
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A minor compaction merges several delta directories into a single
delta directory. The directory filtering algorithm had to be modified
to handle minor-compacted directories and prefer them over the plain
delta directories they cover. This happens in the frontend, mostly in
AcidUtils.java.
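To illustrate the preference, here is a minimal sketch assuming a
simplified ParsedDelta type; the real logic lives in AcidUtils.java and
handles more cases:

  import java.util.ArrayList;
  import java.util.Comparator;
  import java.util.List;

  class ParsedDelta {
    final long minWriteId;
    final long maxWriteId;
    final String path;
    ParsedDelta(long minWriteId, long maxWriteId, String path) {
      this.minWriteId = minWriteId;
      this.maxWriteId = maxWriteId;
      this.path = path;
    }
  }

  class DeltaFilter {
    // Keeps only deltas that are not fully covered by a wider, i.e.
    // minor-compacted, delta that was already selected.
    static List<ParsedDelta> preferCompacted(List<ParsedDelta> deltas) {
      // Sort by minWriteId ascending, then maxWriteId descending so that a
      // minor-compacted delta comes before the plain deltas it covers.
      deltas.sort(Comparator.comparingLong((ParsedDelta d) -> d.minWriteId)
          .thenComparing(
              Comparator.comparingLong((ParsedDelta d) -> d.maxWriteId).reversed()));
      List<ParsedDelta> result = new ArrayList<>();
      long coveredUpTo = -1;
      for (ParsedDelta d : deltas) {
        if (d.maxWriteId <= coveredUpTo) continue;  // covered by a compacted delta
        result.add(d);
        coveredUpTo = Math.max(coveredUpTo, d.maxWriteId);
      }
      return result;
    }
  }

E.g. with delta_0000005_0000005, delta_0000006_0000006 and the
minor-compacted delta_0000005_0000006 on disk, only delta_0000005_0000006
is kept.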
Hive Streaming Ingestion writes similar delta directories, but they
might contain rows Impala cannot see based on its valid write id list.
E.g. we can have the following delta directory:
full_acid/delta_0000001_0000010/0000 # minWriteId: 1
# maxWriteId: 10
This delta dir contains rows with write ids between 1 and 10. But maybe
we are only allowed to see write ids less than 5. Therefore we need to
check the ACID write id column (named originalTransaction) to determine
which rows are valid.
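As a rough sketch (in Impala the actual validation happens in the
backend ORC scanner, not in Java code like this), the per-row check
boils down to asking the query's valid write id list whether the row's
write id is visible, via Hive's ValidWriteIdList interface:

  import org.apache.hadoop.hive.common.ValidWriteIdList;

  class RowVisibilityCheck {
    private final ValidWriteIdList validWriteIds;

    RowVisibilityCheck(ValidWriteIdList validWriteIds) {
      this.validWriteIds = validWriteIds;
    }

    // 'originalTransaction' is the ACID write id column of the row.
    boolean isRowVisible(long originalTransaction) {
      return validWriteIds.isWriteIdValid(originalTransaction);
    }
  }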
Delta directories written by Hive Streaming don't have a visibility txn
id, so we can recognize them based on the directory name. If there's
a visibilityTxnId and it is committed => every row is valid:
full_acid/delta_0000001_0000010_v01234 # has visibilityTxnId
# every row is valid
If there's no visibilityTxnId then the directory was created via Hive
Streaming, therefore we need to validate rows. Fortunately Hive
Streaming writes rows with different write ids into different ORC
stripes, so we don't need to validate the write id per row. If we had
statistics we could validate per stripe, but since Hive Streaming
doesn't write statistics we validate the write id per ORC row batch.
(An alternative would be a 2-pass read: first read a single value from
each stripe's 'currentTransaction' field, then read the stripe only if
that write id is valid.)
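A minimal sketch of recognizing streaming-written deltas from the
directory name alone, based on the naming scheme above; the class and
regex are illustrative, not the actual AcidUtils parsing code:

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  class DeltaDirName {
    // delta_<minWriteId>_<maxWriteId>[_v<visibilityTxnId>]
    private static final Pattern DELTA_PATTERN =
        Pattern.compile("delta_(\\d+)_(\\d+)(?:_v(\\d+))?");

    final long minWriteId;
    final long maxWriteId;
    final Long visibilityTxnId;  // null => written by Hive Streaming

    private DeltaDirName(long minWriteId, long maxWriteId, Long visibilityTxnId) {
      this.minWriteId = minWriteId;
      this.maxWriteId = maxWriteId;
      this.visibilityTxnId = visibilityTxnId;
    }

    static DeltaDirName parse(String dirName) {
      Matcher m = DELTA_PATTERN.matcher(dirName);
      if (!m.matches()) {
        throw new IllegalArgumentException("Not a delta directory: " + dirName);
      }
      Long visTxnId = m.group(3) == null ? null : Long.valueOf(m.group(3));
      return new DeltaDirName(
          Long.parseLong(m.group(1)), Long.parseLong(m.group(2)), visTxnId);
    }

    // Per-batch write id validation is only needed for streaming-written
    // deltas, i.e. when there is no visibilityTxnId in the directory name.
    boolean needsRowValidation() { return visibilityTxnId == null; }
  }

E.g. parse("delta_0000001_0000010").needsRowValidation() is true, while
parse("delta_0000001_0000010_v01234").needsRowValidation() is false.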
Testing:
* the frontend logic is tested in AcidUtilsTest
* the backend row validation is tested in test_acid_row_validation
Change-Id: I5ed74585a2d73ebbcee763b0545be4412926299d
Reviewed-on: http://gerrit.cloudera.org:8080/15818
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Full ACID row format looks like this:
{
  "operation": 0,
  "originalTransaction": 1,
  "bucket": 536870912,
  "rowId": 0,
  "currentTransaction": 1,
  "row": {"i": 1}
}
User columns are nested under "row". In the frontend we need to create
slot descriptors that correspond to the file schema. In the catalog we
could mimic the file schema, but that would introduce several
complexities and corner cases in column resolution. Also, in query
results the heading of the above user column would be "row.i", star
expansion would need to be modified, and so on.
Because of that, in the Catalog I create the exact inverse of the above
schema:
{
  "row__id":
  {
    "operation": 0,
    "originalTransaction": 1,
    "bucket": 536870912,
    "rowId": 0,
    "currentTransaction": 1
  },
  "i": 1
}
This way very little modification is needed in the frontend. And the
hidden columns can easily be retrieved via 'SELECT row__id.*' when we
need them for debugging/testing.
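For illustration only, this is roughly how the catalog-side column list
could be assembled; Field is a simplified stand-in here, not Impala's
actual catalog type classes:

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.Collections;
  import java.util.List;

  class Field {
    final String name;
    final String type;          // primitive type name, or "struct"
    final List<Field> children; // struct members, empty for primitives
    Field(String name, String type, List<Field> children) {
      this.name = name; this.type = type; this.children = children;
    }
    static Field primitive(String name, String type) {
      return new Field(name, type, Collections.emptyList());
    }
  }

  class FullAcidSchema {
    // Prepends the synthetic 'row__id' struct to the user columns.
    static List<Field> catalogColumns(List<Field> userColumns) {
      Field rowId = new Field("row__id", "struct", Arrays.asList(
          Field.primitive("operation", "int"),
          Field.primitive("originalTransaction", "bigint"),
          Field.primitive("bucket", "int"),
          Field.primitive("rowId", "bigint"),
          Field.primitive("currentTransaction", "bigint")));
      List<Field> cols = new ArrayList<>();
      cols.add(rowId);
      cols.addAll(userColumns);
      return cols;
    }
  }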
We only need to change Path.getAbsolutePath() to return a schema path
that corresponds to the file schema. Also in the backend we need some
extra juggling in OrcSchemaResolver::ResolveColumn() to retrieve the
table schema path from the file schema path.
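The translation between the two layouts can be sketched as an
index-path mapping (illustrative only; the real work happens in
Path.getAbsolutePath() and OrcSchemaResolver::ResolveColumn()):

  import java.util.ArrayList;
  import java.util.List;

  class AcidPathMapper {
    // Table schema: row__id (index 0, holding the 5 ACID fields) followed by
    //               the user columns.
    // File schema:  the 5 ACID fields at the top level, user columns nested
    //               under the "row" struct at index 5.
    static List<Integer> tableToFilePath(List<Integer> tablePath) {
      List<Integer> filePath = new ArrayList<>();
      if (tablePath.isEmpty()) return filePath;
      int first = tablePath.get(0);
      if (first == 0) {
        // row__id.<acid field k> maps to <acid field k> at the file top level.
        filePath.addAll(tablePath.subList(1, tablePath.size()));
      } else {
        // User column i maps to row.<i - 1> in the file schema.
        filePath.add(5);
        filePath.add(first - 1);
        filePath.addAll(tablePath.subList(1, tablePath.size()));
      }
      return filePath;
    }
  }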
Testing:
I changed data loading to load ORC files in full ACID format by default.
With this change we should be able to scan full ACID tables that are
not minor-compacted, don't have deleted rows, and don't have original
files.
Newly added Tests:
* specific queries about hidden columns (full-acid-rowid.test)
* SHOW CREATE TABLE (show-create-table-full-acid.test)
* DESCRIBE [FORMATTED] TABLE (describe-path.test)
* INSERT should be forbidden (acid-negative.test)
* added tests for column masking (
ranger_column_masking_complex_types.test)
Change-Id: Ic2e2afec00c9a5cf87f1d61b5fe52b0085844bcb
Reviewed-on: http://gerrit.cloudera.org:8080/15395
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Due to HIVE-22158 all non-ACID tables are treated as external tables
instead of being managed tables. The ACID tests occasionally upgrade
non-ACID tables to ACID tables, but that is not allowed for external
tables. Since all non-ACID tables are external due to HIVE-22158, some
of the ACID tests started to fail after a CDP_BUILD_NUMBER bump that
brought in a Hive version containing the mentioned change.
The fix is to set the 'EXTERNAL' table property to false in the same
step as upgrading the table to ACID. Also, in the tests this step is
now executed from Hive instead of Impala.
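For illustration, the kind of statement the tests can run from Hive to
do the upgrade in one step looks roughly like the following (the helper
and the exact property set are hypothetical; the tests may use
different properties):

  class UpgradeToAcidStatement {
    // Builds a Hive DDL statement that clears the EXTERNAL property and makes
    // the table insert-only transactional in a single ALTER TABLE.
    static String forTable(String tableName) {
      return "ALTER TABLE " + tableName + " SET TBLPROPERTIES ("
          + "'EXTERNAL'='FALSE', "
          + "'transactional'='true', "
          + "'transactional_properties'='insert_only')";
    }
  }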
Tested with the original CDP_BUILD_NUMBER in bin/impala-config.sh and
also tested after bumping that number to 1579022.
Change-Id: I796403e04b3f06c99131db593473d5438446d5fd
Reviewed-on: http://gerrit.cloudera.org:8080/14633
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Enhances Impala to be able to drop insert-only transactional tables.
In order to do this Impala acquires an exclusive table lock in HMS
before performing the drop operation and releases the lock once the
table has been dropped.
The INSERT statement does the locking and heartbeating on the
coordinator side, but for DROP TABLE all of this is done on the Catalog
side. This means that, alongside the Impala coordinators, the Catalog
now also heartbeats towards HMS.
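A rough sketch of the lock-drop-unlock sequence using Hive's metastore
client API; the builder and client calls below come from Hive's
metastore library, but the way catalogd actually acquires, heartbeats
and releases the lock may differ:

  import org.apache.hadoop.hive.metastore.IMetaStoreClient;
  import org.apache.hadoop.hive.metastore.LockComponentBuilder;
  import org.apache.hadoop.hive.metastore.LockRequestBuilder;
  import org.apache.hadoop.hive.metastore.api.DataOperationType;
  import org.apache.hadoop.hive.metastore.api.LockRequest;
  import org.apache.hadoop.hive.metastore.api.LockResponse;

  class DropTableWithLock {
    static void dropTable(IMetaStoreClient client, String db, String tbl)
        throws Exception {
      LockRequest req = new LockRequestBuilder()
          .setUser(System.getProperty("user.name"))
          .addLockComponent(new LockComponentBuilder()
              .setDbName(db)
              .setTableName(tbl)
              .setExclusive()
              .setOperationType(DataOperationType.NO_TXN)
              .build())
          .build();
      LockResponse lock = client.lock(req);
      try {
        // While the drop runs, a separate thread would periodically call
        // client.heartbeat(0, lock.getLockid()) so HMS doesn't time the lock out.
        client.dropTable(db, tbl);
      } finally {
        client.unlock(lock.getLockid());
      }
    }
  }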
Testing:
- E2E test: Dropped a table, re-created it and dropped it again to
  check that no locks remained in HMS.
- E2E test: After dropping a table from Impala, checked that Hive also
  sees it as dropped.
- Manual test: With a hacked Impala that makes DROP TABLE run long
  enough, I checked that a table lock entry exists in HMS during
  execution and disappears once the query finishes.
Change-Id: Ic41ca73268c4b75af5a08fe3dd1ada1df3f6fd34
Reviewed-on: http://gerrit.cloudera.org:8080/14038
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Refreshing a subset of partitions in a transactional table might leave
that table in an inconsistent state. As a fix, user-initiated partition
refreshes are no longer allowed on ACID tables. Additionally, a refresh
partition Metastore event now triggers a refresh of the whole ACID
table.
An optimisation is implemented to compare the locally cached table
level writeId with the latest one fetched from HMS, and do a refresh
only if they don't match.
This couldn't be done for partitioned tables as apparently Hive doesn't
update the table level writeId if the transactional table is
partitioned. Similarly, checking the writeId for each partition and
refreshing only the ones where the writeId is not up to date is not
feasible either, as there is no writeId update when Hive makes schema
changes like adding a column, either on table level or on partition
level. So after adding a column in Hive to a partitioned ACID table and
refreshing that table in Impala, Impala would still not see the new
column. Hence, I unconditionally refresh the whole table if it's ACID
and partitioned. Note that for non-partitioned ACID tables Hive updates
the table level writeId even for schema changes.
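A simplified sketch of the shortcut, using a hypothetical AcidTable
interface just to make the example self-contained; partitioned ACID
tables always take the full refresh path for the reasons above:

  // Hypothetical interface, only to make the sketch below self-contained.
  interface AcidTable {
    boolean isPartitioned();
    long cachedWriteId();                     // -1 if not known yet
    long fetchLatestWriteIdFromHms() throws Exception;
    void fullRefresh() throws Exception;
  }

  class AcidRefreshHelper {
    static void refreshIfNeeded(AcidTable table) throws Exception {
      if (table.isPartitioned() || table.cachedWriteId() < 0) {
        table.fullRefresh();  // no reliable table-level writeId to compare
        return;
      }
      long hmsWriteId = table.fetchLatestWriteIdFromHms();
      if (hmsWriteId != table.cachedWriteId()) {
        table.fullRefresh();  // the table changed since it was last loaded
      }
      // Otherwise the cached metadata is already up to date; skip the refresh.
    }
  }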
Change-Id: I1851da22452074dbe253bcdd97145e06c7552cd3
Reviewed-on: http://gerrit.cloudera.org:8080/13938
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tables that already had data before being altered to an ACID table
keep the old data in their root table/partition directory if
hive.mm.allow.originals == true. These files should be merged into the
base during the first compaction, so they should be read only if there
is no valid base yet.
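As a small sketch with made-up helper names: files sitting in the
table/partition root ("original files") are only scanned while no valid
base exists:

  import java.util.ArrayList;
  import java.util.List;

  class OriginalFileSelection {
    static List<String> filesToScan(List<String> originalFiles,
        List<String> baseFiles, List<String> deltaFiles, boolean hasValidBase) {
      List<String> result = new ArrayList<>();
      if (hasValidBase) {
        // The first compaction folded the original files into the base.
        result.addAll(baseFiles);
      } else {
        // No base yet: the pre-upgrade files still hold the old data.
        result.addAll(originalFiles);
      }
      result.addAll(deltaFiles);
      return result;
    }
  }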
Also added EE tests for upgraded tables.
Change-Id: I062d8e76f90e0da1b954bf156208c0afb424deb1
Reviewed-on: http://gerrit.cloudera.org:8080/13427
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
- Added new functionality in AcidUtils to filter out files in
  uncommitted directories, and to find the latest valid base data and
  filter out files corresponding to older deltas or bases (a simplified
  sketch follows this list).
- Changed Table loading to only load writeIds for transactional tables,
and enabled a previously-ignored unit test.
- Modified Hive configuration to enable support for compactions:
  -- Tez needs to be on the HMS classpath, since HMS (rather than HS2)
     actually schedules compactions.
  -- Had to configure a worker thread for the compactor, or else
     compactions wouldn't proceed even when manually triggered.
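A simplified sketch of the base/delta selection; the directory
representation and validity check are illustrative stand-ins for what
AcidUtils actually does with the valid write id / transaction lists:

  import java.util.ArrayList;
  import java.util.List;
  import java.util.function.LongPredicate;

  class AcidDir {
    final String path;
    final boolean isBase;   // base_<writeId> vs delta_<min>_<max>
    final long writeId;     // a base's writeId, or a delta's maxWriteId
    AcidDir(String path, boolean isBase, long writeId) {
      this.path = path; this.isBase = isBase; this.writeId = writeId;
    }
  }

  class BaseSelection {
    // Returns the directories to scan: the newest valid base plus the deltas
    // that contain writes newer than that base.
    static List<AcidDir> select(List<AcidDir> dirs, LongPredicate isValidBase) {
      long bestBase = -1;
      for (AcidDir d : dirs) {
        if (d.isBase && d.writeId > bestBase && isValidBase.test(d.writeId)) {
          bestBase = d.writeId;
        }
      }
      List<AcidDir> result = new ArrayList<>();
      for (AcidDir d : dirs) {
        if (d.isBase) {
          if (d.writeId == bestBase) result.add(d);  // latest valid base only
        } else if (d.writeId > bestBase) {
          result.add(d);  // deltas not yet folded into the base
        }
      }
      return result;
    }
  }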
Testing:
- New unit tests (AcidUtilsTest) for filtering logic.
- New e2e test to read data written by Hive in an insert-only table,
with INSERT, INSERT OVERWRITE, and compaction. Also tests negative
cases e2e.
To enable the e2e test, this adds support for a 'HIVE_QUERY' section to
the test script files. To make it reasonably fast, this uses Thrift to
connect to HS2 rather than shelling out to beeline. In order for this to
work properly, a bit of extra special-casing had to be added to the test
utility.
This commit was co-authored by Sudhanshu Arora and Todd Lipcon.
Change-Id: Icf0aeb36e10c827ead59ed7f67e731199394fe8e
Reviewed-on: http://gerrit.cloudera.org:8080/13334
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>