Commit Graph

160 Commits

Author SHA1 Message Date
skyyws
fb6d96e001 IMPALA-9741: Support querying Iceberg table by impala
This patch implements querying of Iceberg tables through Impala.
We can use the following SQL to create an external Iceberg table:
    CREATE EXTERNAL TABLE default.iceberg_test (
        level string,
        event_time timestamp,
        message string
    )
    STORED AS ICEBERG
    LOCATION 'hdfs://xxx'
    TBLPROPERTIES ('iceberg_file_format'='parquet');
Or just include the table name and location, like this:
    CREATE EXTERNAL TABLE default.iceberg_test
    STORED AS ICEBERG
    LOCATION 'hdfs://xxx'
    TBLPROPERTIES ('iceberg_file_format'='parquet');
'iceberg_file_format' is the file format used by Iceberg; currently only
PARQUET is supported, and other formats will be supported in the future.
If you don't specify this property in your SQL, the default file format
is PARQUET.

We achieve this by treating the Iceberg table as a normal unpartitioned
HDFS table. When querying an Iceberg table, we push down partition
column predicates to Iceberg to decide which data files need to be
scanned, and then transfer this information to the BE to do the real
scan operation.
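For example, a filter on a partition column (here assuming 'event_time'
is a partition column of the Iceberg table; this is an illustration
only) lets Iceberg prune data files before the scan starts:

    -- only data files whose partition values can satisfy the predicate
    -- are sent to the BE scanners
    SELECT message
    FROM default.iceberg_test
    WHERE event_time >= '2020-01-01';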

Testing:
- Unit test for Iceberg in FileMetadataLoaderTest
- Create table tests in functional_schema_template.sql
- Iceberg table query test in test_scanners.py

Change-Id: I856cfee4f3397d1a89cf17650e8d4fbfe1f2b006
Reviewed-on: http://gerrit.cloudera.org:8080/16143
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-09-06 02:12:07 +00:00
Aman Sinha
5e9f10d34c IMPALA-10064: Support constant propagation for eligible range predicates
This patch adds support for constant propagation of range predicates
involving date and timestamp constants. Previously, only equality
predicates were considered for propagation. The new type of propagation
is shown by the following example:

Before constant propagation:
 WHERE date_col = CAST(timestamp_col as DATE)
  AND timestamp_col BETWEEN '2019-01-01' AND '2020-01-01'
After constant propagation:
 WHERE date_col >= '2019-01-01' AND date_col <= '2020-01-01'
  AND timestamp_col >= '2019-01-01' AND timestamp_col <= '2020-01-01'
  AND date_col = CAST(timestamp_col as DATE)

As a consequence, since Impala supports table partitioning by date
columns but not timestamp columns, the above propagation enables
partition pruning based on timestamp ranges.
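As a hedged sketch (hypothetical table and column names), a query like
the one below can now be pruned to the partitions covering 2019 instead
of scanning every partition of the table:

    -- assuming 'events' is partitioned by date_col
    SELECT count(*)
    FROM events
    WHERE date_col = CAST(ts_col AS DATE)
      AND ts_col BETWEEN '2019-01-01' AND '2020-01-01';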

Existing code for equality based constant propagation was refactored
and consolidated into a new class which handles both equality and
range based constant propagation. Range based propagation is only
applied to date and timestamp columns.

Testing:
 - Added new range constant propagation tests to PlannerTest.
 - Added e2e test for range constant propagation based on a newly
   added date partitioned table.
 - Ran precommit tests.

Change-Id: I811a1f8d605c27c7704d7fc759a91510c6db3c2b
Reviewed-on: http://gerrit.cloudera.org:8080/16346
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-09-02 22:57:55 +00:00
Zoltan Borok-Nagy
da34d34a42 IMPALA-9859: Full ACID Milestone 4: Part 2 Reading modified tables (complex types)
This implements scanning full ACID tables that contain complex types.
The same technique that we use for primitive types works here as well,
i.e. we add a LEFT ANTI JOIN on top of the Hdfs scan node in order to
subtract the deleted rows from the inserted rows.

However, there were some types of queries where we couldn't do that.
These are the queries that scan the nested collection items directly.

E.g.: SELECT item FROM complextypestbl.int_array;

The above query only creates a single tuple descriptor that holds the
collection items. Since this tuple descriptor is not at the table-level,
we cannot add slot references to the hidden ACID columns, which are at
the top level of the table schema.

To resolve this I added a statement rewriter that rewrites the above
statement to the following:

  SELECT item FROM complextypestbl $a$1, $a$1.int_array;

Now in this example we'll have two tuple descriptors, one for the
table-level, and one for the collection item. So we can add the ACID
slot refs to the table-level tuple descriptor. The rewrite is
implemented by the new AcidRewriter class.

Performance

I executed the following query with num_nodes=1 on a non-transactional
table (without the rewrite), and on an ACID table (with the rewrite):

  select count(*) from customer_nested.c_orders.o_lineitems;

Without the rewrite:
Fetched 1 row(s) in 0.41s
+--------------+--------+-------+----------+----------+-------+------------+----------+---------------+---------------------------------------------------+
| Operator     | #Hosts | #Inst | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail                                            |
+--------------+--------+-------+----------+----------+-------+------------+----------+---------------+---------------------------------------------------+
| F00:ROOT     | 1      | 1     | 13.61us  | 13.61us  |       |            | 0 B      | 0 B           |                                                   |
| 01:AGGREGATE | 1      | 1     | 3.68ms   | 3.68ms   | 1     | 1          | 16.00 KB | 10.00 MB      | FINALIZE                                          |
| 00:SCAN HDFS | 1      | 1     | 280.47ms | 280.47ms | 6.00M | 15.00M     | 56.98 MB | 8.00 MB       | tpch_nested_orc_def.customer.c_orders.o_lineitems |
+--------------+--------+-------+----------+----------+-------+------------+----------+---------------+---------------------------------------------------+

With the rewrite:
Fetched 1 row(s) in 0.42s
+---------------------------+--------+-------+----------+----------+---------+------------+----------+---------------+---------------------------------------+
| Operator                  | #Hosts | #Inst | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem | Est. Peak Mem | Detail                                |
+---------------------------+--------+-------+----------+----------+---------+------------+----------+---------------+---------------------------------------+
| F00:ROOT                  | 1      | 1     | 25.16us  | 25.16us  |         |            | 0 B      | 0 B           |                                       |
| 05:AGGREGATE              | 1      | 1     | 3.44ms   | 3.44ms   | 1       | 1          | 63.00 KB | 10.00 MB      | FINALIZE                              |
| 01:SUBPLAN                | 1      | 1     | 16.52ms  | 16.52ms  | 6.00M   | 125.92M    | 47.00 KB | 0 B           |                                       |
| |--04:NESTED LOOP JOIN    | 1      | 1     | 188.47ms | 188.47ms | 0       | 10         | 24.00 KB | 12 B          | CROSS JOIN                            |
| |  |--02:SINGULAR ROW SRC | 1      | 1     | 0ns      | 0ns      | 0       | 1          | 0 B      | 0 B           |                                       |
| |  03:UNNEST              | 1      | 1     | 25.37ms  | 25.37ms  | 0       | 10         | 0 B      | 0 B           | $a$1.c_orders.o_lineitems o_lineitems |
| 00:SCAN HDFS              | 1      | 1     | 96.26ms  | 96.26ms  | 100.00K | 12.59M     | 38.19 MB | 72.00 MB      | default.customer_nested $a$1          |
+---------------------------+--------+-------+----------+----------+---------+------------+----------+---------------+---------------------------------------+

So the overhead is very small.

Testing
* Added planner tests to PlannerTest/acid-scans.test
* E2E query tests to QueryTest/full-acid-complex-type-scans.test
* E2E tests for rowid-generation: QueryTest/full-acid-rowid.test

Change-Id: I8b2c6cd3d87c452c5b96a913b14c90ada78d4c6f
Reviewed-on: http://gerrit.cloudera.org:8080/16228
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-08-12 17:45:50 +00:00
Zoltan Borok-Nagy
f602c3f80f IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Hive ACID supports row-level DELETE and UPDATE operations on a table.
It achieves this by assigning a unique row-id to each row and
maintaining two sets of files in a table. The first set is in the
base/delta directories; they contain the INSERTed rows. The second set
of files is in the delete-delta directories; they contain the DELETEd
rows.

(UPDATE operations are implemented via DELETE+INSERT.)

In the filesystem it looks like e.g.:
 * full_acid/delta_0000001_0000001_0000/0000_0
 * full_acid/delta_0000002_0000002_0000/0000_0
 * full_acid/delete_delta_0000003_0000003_0000/0000_0

During scanning we need to return INSERTed rows minus DELETEd rows.
This patch implements it by creating an ANTI JOIN between the INSERT and
DELETE sets. It is a planner-only modification. Every HDFS SCAN
that scans a full ACID table (that also has deleted rows) is converted
to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE
deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is
created above them.
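Conceptually (a sketch only; the planner builds this join directly
rather than rewriting SQL, and the table names below are illustrative),
the scan result is equivalent to:

    -- inserted rows minus deleted rows, matched on the ACID row id
    SELECT ins.*
    FROM insert_deltas ins
    LEFT ANTI JOIN delete_deltas del
      ON  ins.row__id.originaltransaction = del.row__id.originaltransaction
      AND ins.row__id.bucket = del.row__id.bucket
      AND ins.row__id.rowid = del.row__id.rowid;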

Later we can add support for other distribution modes if the performance
requires it. E.g. if we have too many deleted rows then probably we are
better off with PARTITIONED distribution mode. We could estimate the
number of deleted rows by sampling the delete delta files.

The current patch only works for primitive types. I.e. we cannot select
nested data if the table has deleted rows.

Testing:
 * added planner test
 * added e2e tests

Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659
Reviewed-on: http://gerrit.cloudera.org:8080/16082
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-07-14 12:53:51 +00:00
Zoltan Borok-Nagy
930264afbd IMPALA-9515: Full ACID Milestone 3: Read support for "original files"
"Original files" are files that don't have full ACID schema. We can see
such files if we upgrade a non-ACID table to full ACID. Also, the LOAD
DATA statement can load non-ACID files into full ACID tables. Such
files don't store the special ACID columns, which means we need
to auto-generate their values. These are operation,
originalTransaction, bucket, rowid, and currentTransaction.

With the exception of 'rowid', all of them can be calculated based on
the file path, so I add their values to the scanner's template tuple.

'rowid' is the ordinal number of the row inside a bucket inside a
directory. For now Impala only allows one file per bucket per
directory. Therefore we can generate row ids for each file
independently.

Multiple files in a single bucket in a directory can only be present if
the table was non-transactional earlier and we upgraded it to a full
ACID table. After the first compaction we should only see one original file
per bucket per directory.

In HdfsOrcScanner we calculate the first row id for our split, and then
the OrcStructReader fills the rowid slot with the proper values.

Testing:
 * added e2e tests to check if the generated values are correct
 * added e2e test to reject tables that have multiple files per bucket
 * added unit tests to the new auxiliary functions

Change-Id: I176497ef9873ed7589bd3dee07d048a42dfad953
Reviewed-on: http://gerrit.cloudera.org:8080/16001
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-06-29 21:00:05 +00:00
wzhou-code
c7ce4fa109 IMPALA-9691: Support Kudu Timestamp and Date bloom filter
Impala saves timestamps as a 12-byte TimestampValue structure with
nanosecond precision. Kudu stores timestamps as 8-byte Unix-time
microseconds. To avoid data truncation issues in the bloom filter, we
add a FunctionCallExpr with 'utc_to_unix_micros' as the root of the
bloom filter's source expression, converting timestamp values to
microseconds when building a timestamp bloom filter for Kudu.
Generated the functional date_tbl table in Kudu format for unit tests.
Added new test cases for Kudu Timestamp and Date bloom filters.
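A minimal sketch of that conversion (hypothetical column and table
names; the planner applies the wrapping automatically when it builds
the filter's source expression):

    -- nanosecond-precision TIMESTAMP converted to the 8-byte Unix-time
    -- microseconds representation that Kudu uses
    SELECT utc_to_unix_micros(ts_col) FROM probe_side_tbl;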

Testing:
Passed all core tests.

Change-Id: I3c1e9bcc9fd6d79a39f25eaa3396188fc0a52a48
Reviewed-on: http://gerrit.cloudera.org:8080/16094
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-06-26 06:56:16 +00:00
Joe McDonnell
f15a311065 IMPALA-9709: Remove Impala-lzo from the development environment
This removes Impala-lzo from the Impala development environment.
Impala-lzo is not built as part of the Impala build. The LZO plugin
is no longer loaded. LZO tables are not loaded during dataload,
and LZO is no longer tested.

This removes some obsolete scan APIs that were only used by Impala-lzo.
With this commit, Impala-lzo would require code changes to build
against Impala.

The plugin infrastructure is not removed, and this leaves some
LZO support code in place. If someone were to decide to revive
Impala-lzo, they would still be able to load it as a plugin
and get the same functionality as before. This plugin support
may be removed later.

Testing:
 - Dryrun of GVO
 - Modified TestPartitionMetadataUncompressedTextOnly's
   test_unsupported_text_compression() to add LZO case

Change-Id: I3a4f12247d8872b7e14c9feb4b2c58cfd60d4c0e
Reviewed-on: http://gerrit.cloudera.org:8080/15814
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2020-06-15 23:42:12 +00:00
xiaomeng
d45e3a50b0 IMPALA-9673: Add external warehouse dir variable in E2E test
Updated CDP build to 7.2.1.0-57 to include new Hive features such as
HIVE-22995.
In the minicluster, hive.create.as.acid and hive.create.as.insert.only
default to false, so by default Hive creates external-type tables
located in the external warehouse directory.
Due to HIVE-22995, DESCRIBE DATABASE returns the external warehouse directory.

For the above reasons, we need to use the external warehouse dir in
some tests. Also added a new test for "CREATE DATABASE ... LOCATION".

Tested:
Re-run failed test in minicluster.
Run exhaustive tests.

Change-Id: I57926babf4caebfd365e6be65a399f12ea68687f
Reviewed-on: http://gerrit.cloudera.org:8080/15990
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-06-05 23:48:53 +00:00
Zoltan Borok-Nagy
f8015ff68d IMPALA-9512: Full ACID Milestone 2: Validate rows against the valid write id list
Minor compactions can compact several delta directories into a single
delta directory. The current directory filtering algorithm had to be
modified to handle minor compacted directories and prefer those over
plain delta directories. This happens in the Frontend, mostly in
AcidUtils.java.

Hive Streaming Ingestion writes similar delta directories, but they
might contain rows Impala cannot see based on its valid write id list.

E.g. we can have the following delta directory:

full_acid/delta_0000001_0000010/0000 # minWriteId: 1
                                     # maxWriteId: 10

This delta dir contains rows with write ids between 1 and 10. But maybe
we are only allowed to see write ids less than 5. Therefore we need to
check the ACID write id column (named originalTransaction) to determine
which rows are valid.

Delta directories written by Hive Streaming don't have a visibility txn
id, so we can recognize them based on the directory name. If there's
a visibilityTxnId and it is committed => every row is valid:

full_acid/delta_0000001_0000010_v01234 # has visibilityTxnId
                                       # every row is valid

If there's no visibilityTxnId then the directory was created via Hive
Streaming, so we need to validate rows. Fortunately, Hive Streaming
writes rows with different write ids into different ORC stripes, so we
don't need to validate the write id per row. If we had statistics,
we could validate per stripe, but since Hive Streaming doesn't write
statistics we validate the write id per ORC row batch. (An alternative
would be a 2-pass read: first we'd read a single value from each
stripe's 'currentTransaction' field, then we'd read the stripe only if
that write id is valid.)

Testing
 * the frontend logic is tested in AcidUtilsTest
 * the backend row validation is tested in test_acid_row_validation

Change-Id: I5ed74585a2d73ebbcee763b0545be4412926299d
Reviewed-on: http://gerrit.cloudera.org:8080/15818
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-20 21:00:44 +00:00
Adam Tamas
c32849a391 IMPALA-8980: Remove functional*.alltypesinsert from EE tests
-Modified 'test_insert.py' so the tests can run in parallel.
  -Every test will create its own temporary tables for insert testing.
-Swapped out the SETUP tags for TRUNCATE TABLE QUERY statements.
  -Because the SETUP tag is not used anymore, the corresponding
  code was removed.
-Fixed a test query in 'insert.test'. The test was incorrect, so it was
modified to test for the right behavior.

Testing:
-tests/run-tests.py query_test/test_insert.py
-impala-py.test tests/query_test/test_insert.py
-the same for test_insert_permutation.py and test_load.py

Change-Id: I257e936868917a2fcc6c030f6c855b247e8a0eea
Reviewed-on: http://gerrit.cloudera.org:8080/15529
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-14 12:18:21 +00:00
Zoltan Borok-Nagy
b770d2d378 Put transactional tables into 'managed' directory
HIVE-22794 disallows ACID tables outside of the 'managed' warehouse
directory. This change updates data loading to make it conform to
the new rules.

The following tests had to be modified to use the new paths:
* AnalyzeDDLTest.TestCreateTableLikeFileOrc()
* create-table-like-file-orc.test

Change-Id: Id3b65f56bf7f225b1d29aa397f987fdd7eb7176c
Reviewed-on: http://gerrit.cloudera.org:8080/15708
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-11 00:36:56 +00:00
Zoltan Borok-Nagy
8aa0652871 IMPALA-9484: Full ACID Milestone 1: properly scan files that have full ACID schema
Full ACID row format looks like this:

{
  "operation": 0,
  "originalTransaction": 1,
  "bucket": 536870912,
  "rowId": 0,
  "currentTransaction": 1,
  "row": {"i": 1}
}

User columns are nested under "row". In the frontend we need to create
slot descriptors that correspond to the file schema. In the catalog we
could mimic the file schema but that would introduce several
complexities and corner cases in column resolution. Also in query
results the heading of the above user column would be "row.i". Star
expansion should also be modified, etc.

Because of that in the Catalog I create the exact opposite of the above
schema:

{
  "row__id":
  {
    "operation": 0,
    "originalTransaction": 1,
    "bucket": 536870912,
    "rowId": 0,
    "currentTransaction": 1
  },
  "i": 1
}

This way very little modification is needed in the frontend, and the
hidden columns can be easily retrieved via 'SELECT row__id.*' when we
need them for debugging/testing.
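For example (a sketch; 'full_acid_tbl' is a hypothetical table name),
the hidden ACID fields can be selected next to the user column 'i' from
the schema above:

    -- exposes operation, originalTransaction, bucket, rowId and
    -- currentTransaction alongside the user columns
    SELECT row__id.*, i FROM full_acid_tbl;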

We only need to change Path.getAbsolutePath() to return a schema path
that corresponds to the file schema. Also in the backend we need some
extra juggling in OrcSchemaResolver::ResolveColumn() to retrieve the
table schema path from the file schema path.

Testing:
I changed data loading to load ORC files in full ACID format by default.
With this change we should be able to scan full ACID tables that are
not minor-compacted, don't have deleted rows, and don't have original
files.

Newly added Tests:
 * specific queries about hidden columns (full-acid-rowid.test)
 * SHOW CREATE TABLE (show-create-table-full-acid.test)
 * DESCRIBE [FORMATTED] TABLE (describe-path.test)
 * INSERT should be forbidden (acid-negative.test)
 * added tests for column masking (
   ranger_column_masking_complex_types.test)

Change-Id: Ic2e2afec00c9a5cf87f1d61b5fe52b0085844bcb
Reviewed-on: http://gerrit.cloudera.org:8080/15395
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-02 12:01:41 +00:00
stiga-huang
9672d94596 IMPALA-7784: Use unescaped strings in partition pruning + fix duplicated unescaping of strings
String values from external systems (HDFS, Hive, Kudu, etc.) are already
unescaped, the same as string values in Thrift objects deserialized in
coordinators. We should mark needsUnescaping_ as false when creating
StringLiterals for these values (in LiteralExpr#create()).

When comparing StringLiterals in partition pruning, we should also use
the unescaped values if needsUnescaping_ is true.
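For illustration only (hypothetical partitioned table, partition column
and value), pruning now compares the unescaped literal against the
partition key values, so a predicate like this matches the partition
whose key contains a single quote:

    -- the escaped literal unescapes to: it's
    SELECT * FROM part_tbl WHERE string_part_col = 'it\'s';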

Tests:
 - Add tests for partition pruning on unescaped strings.
 - Add test coverage for all existing code paths using
   LiteralExpr#create().
 - Run core tests

Change-Id: Iea8070f16a74f9aeade294504f2834abb8b3b38f
Reviewed-on: http://gerrit.cloudera.org:8080/15278
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-09 06:29:35 +00:00
Joe McDonnell
90ab610d34 Convert dataload hdfs copy commands to LOAD DATA statements
The schema file allows specifying a commandline command in
several of the sections (LOAD, DEPENDENT_LOAD, etc). These
are executed by testdata/bin/generate-schema-statements.py
when it is creating the SQL files that are later executed
for dataload. A fair number of tables use this flexibility
to execute hdfs mkdir and copy commands via the command line.

Unfortunately, this is very inefficient. HDFS command line
commands require spinning up a JVM and can take over one
second per command. These commands are executed during a
serial part of dataload, and they can be executed multiple
times. In short, these commands are a significant slowdown
for loading the functional tables.

This converts the hdfs command line statements to equivalent
Hive LOAD DATA LOCAL statements. These do the copy
from an already-running JVM, so they do not need JVM startup.
They also run in the parallel part of dataload, speeding up
the SQL generation part.
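As a hedged example (the path and table name are placeholders), an hdfs
copy step becomes a Hive statement of the form:

    -- runs inside Hive's already-running JVM instead of spawning an
    -- hdfs CLI process
    LOAD DATA LOCAL INPATH '/tmp/sometable/sometable.txt'
    OVERWRITE INTO TABLE functional.sometable;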

This speeds up generate-schema-statements.py significantly.
On the functional dataset, it saves 7 minutes.
Before:
time testdata/bin/generate-schema-statements.py -w functional-query -e exhaustive -f
real    8m8.068s
user    10m11.218s
sys     0m44.932s

After:
time testdata/bin/generate-schema-statements.py -w functional-query -e exhaustive -f
real    0m35.800s
user    0m42.536s
sys     0m5.210s

This is currently a long-pole in dataload, so it translates directly to
an overall speedup of about 7 minutes.

Testing:
 - Ran debug tests

Change-Id: Icf17b85ff85618933716a80f1ccd6701b07f464c
Reviewed-on: http://gerrit.cloudera.org:8080/15228
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-02-24 21:22:18 +00:00
Yanjia Li
ea0e1def61 IMPALA-8778: Support Apache Hudi Read Optimized Table
A Hudi Read Optimized Table contains multiple versions of parquet files.
In order to load the table correctly, Impala needs to recognize a Hudi
Read Optimized Table as an HdfsTable and load the latest version of the
files using HoodieROTablePathFilter.

Tests
 - Unit test for Hudi in FileMetadataLoader
 - Create table tests in functional_schema_template.sql
 - Query tests in hudi-parquet.test

Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Reviewed-on: http://gerrit.cloudera.org:8080/14711
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-02-11 15:08:39 +00:00
stiga-huang
443da2172c IMPALA-6772: Enable test_scanners_fuzz for ORC
Add test coverage for randomly corrupted ORC files by adding ORC to the
tests in test_scanners_fuzz.py. Also add two additional queries for nested
types.

Tests:
 - Ran test_scanners_fuzz.py 780 rounds (took 43h).
 - Ran test_scanners_fuzz.py for orc/def/block 1081 rounds (took 24h).

Change-Id: I3233e5d9f555029d954b5ddd5858ea194afc06bf
Reviewed-on: http://gerrit.cloudera.org:8080/15062
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-01-31 18:39:02 +00:00
Anurag Mantripragada
0c0671e04e IMPALA-9104: Support retrieval of PK/FK information through impala-hs2-server.
The goal is to let JDBC clients get constraint information
from Impala tables. We implement two new metadata operations in
impala-hs2-server, GetPrimaryKeys and GetCrossReference, which are
already implemented in Hive's HS2. The thrift
definitions are copied from Hive's TCLIService.thrift. In FE, these
two operations are implemented to get the information from tables
in the catalog.

Much like GetColumns(), tables need to be loaded in order to be able to get
PK/FK information. We wait for the PK table/FK table to load.
In the implementation, PK/FK information is returned
ONLY if the user has access to ALL the columns involved in the PK/FK
relationship.

Testing:
- Added three test tables to our test datasets since most of our FE tests
  relied on dummy tables or testdata. It was difficult to test PK/FK with
  these methods. Also, we can build on this testdata in future when we make
  optimizer improvements.
- Added unit tests in AuthorizationTest and JDBCtest.
- Added e2e test in test_hs2.py
- This patch modifies AnalyzeDDLTests and ToSqlTests to rely on the newly
  added dataset instead of dummy tables for pk/fk tests.

Caveats:
- Ranger needs OWNER user information for authorization. Since this is HMS
  metadata that we do not aggressively load, this information is not available
  for IncompleteTables. Some foreign key tables (fact tables, for example)
  might have FK/PK relationships with several PK tables, some of which might
  not be loaded in the catalog. Currently we have no way to check column
  privileges for tables without owner user information, so we do not return
  keys involving such columns. Therefore, when Ranger is used, there may be
  missing PK/FK relationships for parent tables that are not loaded. This can
  be tracked in IMPALA-9172.
- Retrieval of constraints is not yet supported in LocalCatalog mode. See
  IMPALA-9158.

Change-Id: I8942dfbbd4a3be244eed1c61ac2ce17069960477
Reviewed-on: http://gerrit.cloudera.org:8080/14720
Reviewed-by: Vihang Karajgaonkar <vihang@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-11-21 22:25:22 +00:00
Attila Jeges
27fa27e808 IMPALA-8198: DATE: Read from avro.
This change is a follow-up to IMPALA-7368 and adds support for DATE
type to the avro scanner.

Similarly to Parquet, Avro uses the DATE logical type for dates. The
DATE logical type annotates an INT32 that stores the number of days
since the Unix epoch, 1 January 1970 (so, for example, DATE '1970-01-02'
is stored as 1 and DATE '1969-12-31' as -1).

This representation introduces an avro interoperability issue between
Impala and older versions of Hive:
- Before version 3.1, Hive used Julian calendar to represent dates
  up to 1582-10-05 and Gregorian calendar for dates starting with
  1582-10-15. Dates between 1582-10-05 and 1582-10-15 were lost.
- Impala uses proleptic Gregorian calendar, extending the Gregorian
  calendar backward to dates preceding its official introduction in
  1582-10-15.
This means that pre-1582-10-15 dates written to an avro table by Hive
will be read back incorrectly by Impala.

Note that Hive 3.1 switched to proleptic Gregorian calendar too, so
for Hive 3.1+ this is no longer an issue.

Dependency changes:
- BE uses avro 1.7.4-p5 from native-toolchain.

Change-Id: I7a9d5b93a22cf3a00244037e187f8c145cacc959
Reviewed-on: http://gerrit.cloudera.org:8080/13944
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-09-27 17:18:35 +00:00
Zoltan Borok-Nagy
3e9cac0cac IMPALA-8854: fix acid insert tests
test_acid_nonacid_insert has been failing lately. HMS became more
strict about checking the capabilities of its clients. It seems the
Python client doesn't set any capabilities for itself, so HMS rejects
its attempts to create and drop tables.

Now instead of using the RESET utility from the e2e test framework
(to drop and re-create tables), the test is using a unique database
and creates the tables through Impala. Different file formats are
exercised with the help of the DEFAULT_FILE_FORMAT query option.

Change-Id: I3a82338a7820d0ee748c961c8656fa3319c3929c
Reviewed-on: http://gerrit.cloudera.org:8080/14064
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-08-15 13:02:55 +00:00
Zoltan Borok-Nagy
6360657cb4 IMPALA-8636: Implement INSERT for insert-only ACID tables
This commit adds INSERT support for insert-only ACID tables.

The Frontend opens a transaction for INSERT statements when the target
table is transactional. It also allocates a write ID for the target
table. The Frontend aborts the transaction if an error occurs during
analysis/planning.

The Backend gets the transaction id and the write id in TFinalizeParams.
The write id is also set for the HDFS table sinks. The sinks write
the files at their final destination, which is an ACID base or delta
directory. There is no need for finalization of transactional INSERTs.

When the sinks have finished writing the data, the Coordinator invokes
updateCatalog() on catalogd, which also commits the transaction if
everything went well; otherwise the Coordinator aborts the transaction.
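A minimal sketch of the new capability (hypothetical table name; the
table properties shown are the standard Hive insert-only ACID
properties and are an assumption here):

    CREATE TABLE acid_insert_only_tbl (i INT)
    TBLPROPERTIES ('transactional'='true',
                   'transactional_properties'='insert_only');

    -- runs inside an HMS transaction; files land in a base/delta
    -- directory under the table's 'managed' location
    INSERT INTO acid_insert_only_tbl VALUES (1), (2), (3);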

Testing:
* added new tables during dataload
* added acid-insert.test file with INSERT statements against the new
  tables
* test insertions between ACID and non-ACID tables
* test error scenarios via debug actions
* added integration test with Hive to test_hms_integration.py. The test
  inserts data with Impala and reads with Hive. (These integration
  tests only run with exhaustive exploration strategy)

TODO in following commits:
* add locks and heartbeats (without heartbeats long-running transactions
  might be aborted by HMS)
* implement TRUNCATE
* CTAS creates files in the 'root' directory of the table/partition. It
  is handled correctly during SELECT, but it would be better to create a
  base directory from the beginning. Hive creates a delta directory
  for CTAS.

Change-Id: Id6c36fa6902676f06b4e38730f737becfc7c06ad
Reviewed-on: http://gerrit.cloudera.org:8080/13559
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-07-27 13:45:51 +00:00
Yongzhi Chen
23855c8e62 IMPALA-8593: Support table capabilities handling with Hive 3
This patch adds a method to check if a table is bucketed.
For Hive 3, it integrates with the HMS translation layer for
capabilities checks.
It implements the methods ensureTableWriteSupported and
ensureTableReadSupported.
It sets default capabilities for tables.

Tests:
Added unit tests to ParserTest and AnalyzerTest.
Added bucketed tables which are required by IMPALA-8439.
Ran core tests (Hive 2 and Hive 3).

ToDo:
Integrate checking bucketed table capabilities and creating
error messages with HMS translation after Hive provides the
required functions.
Enable capabilities checking for Kudu tables.
When upgrading tables from non-ACID to ACID, the default
capabilities should be changed too. Currently, we work around
this by explicitly setting the OBJCAPABILITIES tblproperties
with the ACID properties.

Change-Id: Ia08d01168660830b6e0d08b55a95eac129889cec
Reviewed-on: http://gerrit.cloudera.org:8080/13558
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-07-16 15:32:02 +00:00
arorasudhanshu
fcaec278cc IMPALA-8436: Prohibit write/alter operations on materialized view
Instead of creating an in-memory instance of View, we were
creating an instance of HdfsTable. Modified the code to create an
instance of View for materialized views.

Testing Done:
- Added tests in AnalyzerTest.

Change-Id: Idcd619303e19b5a2551876a63d67569c76bd22f0
Reviewed-on: http://gerrit.cloudera.org:8080/13503
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
2019-06-12 13:21:58 +00:00
arorasudhanshu
67015004cf IMPALA-8435. Prohibit operations on transactional table.
Copied some code from Hive to identify whether a table is a
transactional, insert-only table.

Also modified code to prohibit write operations on insert-only tables.
That code will be reverted once we add support for write operations on
insert-only tables.

Testing Done:
- Added a new test in AnalyzerTest

Change-Id: I740dc4ce0dbbc0c2e042b01832e606cc1ac4132a
Reviewed-on: http://gerrit.cloudera.org:8080/13311
Tested-by: Todd Lipcon <todd@apache.org>
Reviewed-by: Sudhanshu Arora <sudhanshu@cloudera.com>
Reviewed-by: Todd Lipcon <todd@apache.org>
2019-05-22 18:09:42 +00:00
Todd Lipcon
3567a2b5d4 IMPALA-8369 (part 4): Hive 3: fixes for functional dataset loading
This fixes three issues for functional dataset loading:

- works around HIVE-21675, a bug in which 'CREATE VIEW IF NOT EXISTS'
  does not function correctly in our current Hive build. This has been
  fixed already, but the workaround is pretty simple, and actually the
  'drop and recreate' pattern is used more widely for data-loading than
  the 'create if not exists' one.

- Moves the creation of the 'hive_index' table from
  load-dependent-tables.sql to a new load-dependent-tables-hive2.sql
  file which is only executed on Hive 2.

- Moving from MR to Tez execution changed the behavior of data loading
  by disabling the auto-merging of small files. With Hive-on-MR, this
  behavior defaulted to true, but with Hive-on-Tez it defaults to false.
  The change is likely motivated by the fact that Tez automatically
  groups small splits on the _input_ side and thus is less likely to
  produce lots of small files. However, that grouping functionality
  doesn't work properly in localhost clusters (TEZ-3310) so we aren't
  seeing the benefit. So, this patch enables the post-process merging of
  small files.

  Prior to this change, the 'alltypesaggmultifilesnopart' test table was
  getting 40+ files inside it, which broke various planner tests. With
  the change, it gets the expected 4 files.

Change-Id: Ic34930dc064da3136dde4e01a011d14db6a74ecd
Reviewed-on: http://gerrit.cloudera.org:8080/13251
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-05-15 11:00:45 +00:00
Attila Jeges
b5805de3e6 IMPALA-7368: Add initial support for DATE type
DATE values describe a particular year/month/day in the form
yyyy-MM-dd. For example: DATE '2019-02-15'. DATE values do not have a
time of day component. The range of values supported for the DATE type
is 0000-01-01 to 9999-12-31.

This initial DATE type support covers TEXT and HBASE fileformats only.
'DateValue' is used as the internal type to represent DATE values.

The changes are as follows:
- Support for DATE literal syntax.

- Explicit casting between DATE and other types (note that invalid
  casts will fail with an error just like invalid DECIMAL_V2 casts,
  while failed casts to other types do not lead to a warning or error;
  a short example of these cast rules appears below, after this list):
    - from STRING to DATE. The string value must be formatted as
      yyyy-MM-dd HH:mm:ss.SSSSSSSSS. The date component is mandatory,
      the time component is optional. If the time component is
      present, it will be truncated silently.
    - from DATE to STRING. The resulting string value is formatted as
      yyyy-MM-dd.
    - from TIMESTAMP to DATE. The source timestamp's time of day
      component is ignored.
    - from DATE to TIMESTAMP. The target timestamp's time of day
      component is set to 00:00:00.

- Implicit casting between DATE and other types:
    - from STRING to DATE if the source string value is used in a
      context where a DATE value is expected.
    - from DATE to TIMESTAMP if the source date value is used in a
      context where a TIMESTAMP value is expected.

- Since STRING -> DATE, STRING -> TIMESTAMP and DATE -> TIMESTAMP
  implicit conversions are now all possible, the existing function
  overload resolution logic is not adequate anymore.
  For example, it resolves the
  if(false, '2011-01-01', DATE '1499-02-02') function call to the
  if(BOOLEAN, TIMESTAMP, TIMESTAMP) version of the overloaded
  function, instead of the if(BOOLEAN, DATE, DATE) version.

  This is clearly wrong, so the function overload resolution logic had
  to be changed to resolve function calls to the best-fit overloaded
  function definition if there are multiple applicable candidates.

  An overloaded function definition is an applicable candidate for a
  function call if each actual parameter in the function call either
  matches the corresponding formal parameter's type (without casting)
  or is implicitly castable to that type.

  When looking for the best-fit applicable candidate, a parameter
  match score (i.e. the number of actual parameters in the function
  call that match their corresponding formal parameter's type without
  casting) is calculated and the applicable candidate with the highest
  parameter match score is chosen.

  There's one more issue that the new resolution logic has to address:
  if two applicable candidates have the same parameter match score and
  the only difference between the two is that the first one requires a
  STRING -> TIMESTAMP implicit cast for some of its parameters while
  the second one requires a STRING -> DATE implicit cast for the same
  parameters then the first candidate has to be chosen not to break
  backward compatibility.
  E.g: year('2019-02-15') function call must resolve to
  year(TIMESTAMP) instead of year(DATE). Note that year(DATE) is not
  implemented yet, so this is not an issue at the moment but it will
  be in the future.
  When the resolution algorithm considers overloaded function
  definitions, first it orders them lexicographically by the types in
  their parameter lists. To ensure the backward-compatible behavior,
  the PrimitiveType.DATE enum value has to come after
  PrimitiveType.TIMESTAMP.

- Codegen infrastructure changes for expression evaluation.
- 'IS [NOT] NULL' and '[NOT] IN' predicates.
- Common comparison operators (including the 'BETWEEN' operator).
- Infrastructure changes for built-in functions.
- Some built-in functions: conditional, aggregate, analytical and
  math functions.
- C++ UDF/UDA support.
- Support partitioning and grouping by DATE.
- Beeswax, HiveServer2 support.

These items are tightly coupled and it makes sense to implement them
in one change-set.
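As a brief illustration of the cast rules listed above (ad-hoc example
statements, not queries taken from the patch):

    SELECT CAST('2019-02-15' AS DATE);            -- STRING -> DATE
    SELECT CAST('2019-02-15 11:30:00' AS DATE);   -- time part truncated silently
    SELECT CAST(DATE '2019-02-15' AS TIMESTAMP);  -- time of day set to 00:00:00
    SELECT CAST(DATE '2019-02-15' AS STRING);     -- formatted as yyyy-MM-dd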

Testing:
- A new partitioned TEXT table 'functional.date_tbl' (and the
  corresponding HBASE table 'functional_hbase.date_tbl') was
  introduced for DATE-related tests.
- BE and FE tests were extended to cover DATE type.
- E2E tests:
    - since DATE type is supported for TEXT and HBASE fileformats
      only, most DATE tests were implemented separately in
      tests/query_test/test_date_queries.py.

Note that this change-set is not a complete DATE type implementation,
but it lays the foundation for future work:
- Add date support to the random query generator.
- Implement a complete set of built-in functions.
- Add Parquet support.
- Add Kudu support.
- Optionally support Avro and ORC.
For further details, see IMPALA-6169.

Change-Id: Iea8155ef09557e0afa2f8b2d0b2dc9d0896dc30f
Reviewed-on: http://gerrit.cloudera.org:8080/12481
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-04-23 13:33:57 +00:00
stiga-huang
9686545bfd IMPALA-6503: Support reading complex types from ORC
We've supported reading primitive types from ORC files (IMPALA-5717).
In this patch we add support for complex types (struct/array/map).

In IMPALA-5717, we leverage the ORC lib to parse ORC binaries (data in
io buffer read from DiskIoMgr). The ORC lib can materialize ORC column
binaries into its representation (orc::ColumnVectorBatch). Then we
transform values in orc::ColumnVectorBatch into impala::Tuples in
hdfs-orc-scanner. We don't need to do anything about decoding/decompression
since they are handled by the ORC lib. Fortunately, the ORC lib already
supports complex types, so we can leverage it for complex types as well.

What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need in the form required by the ORC
  lib (Get list of ORC type ids from tuple descriptors)
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
  Impala's representation (Slots/Tuples/RowBatches)

To perform the materialization, we implement several ORC column readers
in hdfs-orc-scanner. Each kind of reader handles a column type and
transforms outputs of the ORC lib into tuple/slot values.
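For instance (the database name is an assumption; complextypestbl and
its int_array column are the nested-types test table used elsewhere in
this history), a collection scan on ORC now works the same way it does
on Parquet:

    -- each array item is transformed from orc::ColumnVectorBatch into
    -- a slot value in an Impala row batch
    SELECT arr.item
    FROM functional_orc_def.complextypestbl t, t.int_array arr;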

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.
* Run exhaustive tests in DEBUG and RELEASE builds.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Reviewed-on: http://gerrit.cloudera.org:8080/12168
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-03-08 04:39:08 +00:00
paul-rogers
4ce689e58a IMPALA-8095: Detailed expression cardinality tests
Cardinality is a critical input to the query planning process,
especially join planning. Impala has many high-level end-to-end tests
that implicitly test cardinality at the "wholesale" level: A test will
produce a wrong result if the cardinality is badly wrong.

This patch adds detailed unit tests for cardinality:

* Table cardinality, NDV values and null count in metadata retrieved from
  HMS.
* Table cardinality, NDV values and null counts in metadata presented to
  the query.
* Expression NDV and selectivity values (which derive from table
  cardinality and column NDV.)

The tests illustrate a number of bugs. This patch simply identifies the
bugs, comments out the tests that fail because of the bugs, and
substitutes tests that pass with the current, incorrect, behavior.
Future patches will fix the bugs. Reviewers can note the difference
between the original, incorrect behavior shown here, and the revised
behavior in those additional patches.

Since none of the existing "functional" tables provide the level of
detail needed for these tests, added a new test table specifically for
this task.

This set of tests was a good time to extend the test "fixture" framework
created earlier. The FrontendTestBase class was refactored to use a new
FrontendFixture which represents a (simulated) Impala and HMS cluster.
The previous SessionFixture represents a single user session (with
session options) and the QueryFixture represents a single query.

As part of this refactoring, the fixture classes moved into "common"
alongside FrontendTestBase.

Testing: This patch includes only tests: no "production" code was
changed.

Change-Id: I3da58ee9b0beebeffb170b9430bd36d20dcd2401
Reviewed-on: http://gerrit.cloudera.org:8080/12248
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-02-09 02:56:52 +00:00
Janaki Lahorani
aacd5c35d3 IMPALA-6533: Add min-max filter for decimal types on kudu tables.
The code mimics the code written for other min-max filters.  Decimal data
can be stored using 4 bytes, 8 bytes and 16 bytes.  The code respectively
handles these 3 storage configurations.  The column definition states the
precision and the precision determines the storage size.

The minimum and maximum values are stored in a union.  The precision from
the column will come in as an input.  Based on the precision the size will be
found, and depending on the size the appropriate variable will be used.

The code in min-max-filter* follows the general convention of the file, hence
uses macros.

The test includes 24 decimal columns (as listed below) with the following joins:
1.  Inner Join with broadcast (2 tables)
  1a. 1 predicate
  1b. 4 predicates - all results in decimal min-max filter
  1c. 4 predicates - 3 results in decimal min-max filter; 1 doesn't
2.  Inner Join with Shuffle (3 tables)
3.  Right outer join (2 tables)
4.  Left Semi join (2 tables)
5.  Right Semi join (2 tables)

Decimal Columns:
4bytes:
(5,0), (5,1), (5,3), (5,5)
(9,0), (9,1), (9,5), (9,9)
8 bytes:
(14,0), (14,1), (14,7), (14,14)
(18,0), (18,1), (18,9), (18,18)
16 bytes:
(28,0), (28,1), (28,14), (28,28)
(38,0), (38,1), (38,19), (38,38)

The test aggregates the count of probe rows.  This shows that the min-max filter
is exercised, because the number of probe rows is less than the total number
of rows in the probe side table.  The count of probe rows is considered to be
deterministic.  But, it will be beneficial to look out for changes in Kudu that can
change the way data is partitioned.  Such a change could change the probe row count
and in that case, the test will have to be updated.

impala_test_suite.py and test_result_verifier.py are enhanced to support saving
of aggregation using update_results.

Change-Id: Ib7e7278e902160d7060f8097290bc172d9031f94
Reviewed-on: http://gerrit.cloudera.org:8080/12113
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-01-10 03:32:25 +00:00
Tim Armstrong
153663c22f IMPALA-4123: Columnar decoding in Parquet
The idea is to optimise the common case where there are long runs of
NULL or non-NULL values (i.e. the def level is repeated). We can
detect this cheaply by keying the decoding loop in the column reader
off the state of the def level RLE decoder - if there's a long run
of repeated levels, we can skip checking the def level for every
value. We still fall back to decoding, caching and reading
value-by-value a batch of def levels whenever the next def level is not
in a repeated run. We still use the old approach for decoding rep
levels. There might be some benefit to using the same approach for rep
levels *if* repeated def and rep level runs line up.

These changes should unlock further optimizations because more time is
spent in simple kernel functions, e.g. UnpackAndDecode32Values() for
dictionary decompression, which is very optimisable using SIMD etc.

Snappy decompression now seems to be the main CPU bottleneck for
decoding snappy-compressed Parquet.

Perf:
Running TPC-H scale factor 60 on uncompressed and snappy parquet
both showed a ~4% speedup overall.

Microbenchmarks show that scans doing only dictionary decoding on
uncompressed Parquet are ~75% faster:

   set mt_dop=1;
   select min(l_returnflag) from lineitem;

Testing:
We have alltypesagg with a mix of null and non-null values.

Many tables have long runs of non-null values.

Added new test data and coverage:
* a test table manynulls with long runs of null values.
* a large CHAR test table
* missing coverage for materialising pos slot in flattened nested types
  scan.
* Extended dict test to test longer runs.
* A larger version of complextypestbl with interesting collection
  shapes - NULL collections, empty collections, etc, particularly runs
  of collections with the same shape.
* Test interaction of timestamp validation with conversion
* Ran code coverage build to confirm all code paths are tested
* ASAN and exhaustive runs.

Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Reviewed-on: http://gerrit.cloudera.org:8080/8319
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-11-17 01:48:05 +00:00
Tim Armstrong
95b56d0e2d IMPALA-7586: fix predicate pushdown of escaped strings
This fixes a class of bugs where the planner incorrectly uses the raw
string from the parser instead of the unescaped string. This occurs in
several places that push predicates down to the storage layer:
* Kudu scans
* HBase scans
* Data source scans

There are some more complex issues with escapes and the LIKE predicate
that are tracked separately by IMPALA-2422.

This also uncovered a different issue with RCFiles that is tracked by
IMPALA-7778 and is worked around by the tests added.

In order to make bugs like this more obvious in future, I renamed
getValue() to getValueWithOriginalEscapes().

Testing:
Added regression test that tests handling of backslash escapes on all
file formats. I did not add a regression test for the data source bug
since it seems to require some major modification of the data source
test infrastructure.

Change-Id: I53d6e20dd48ab6837ddd325db8a9d49ee04fed28
Reviewed-on: http://gerrit.cloudera.org:8080/11814
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-11-01 21:27:13 +00:00
Thomas Tauber-Marshall
bf2124bf30 IMPALA-6929: Support multi-column range partitions for Kudu
Kudu allows specifying range partitions over multiple columns. Impala
already has support for doing this when the partitions are specified
with '=', but if the partitions are specified with '<' or '<=', the
parser would return an error.

This patch modifies the parser to allow for creating Kudu tables like:
create table kudu_test (a int, b int, primary key(a, b))
  partition by range(a, b) (partition (0, 0) <= values < (1, 1));
and similarly to alter partitions like:
alter table kudu_test add range partition (1, 1) <= values < (2, 2);

Testing:
- Modified functional_kudu.jointbl's schema so that we have a table
  in functional with a multi-column range partition to test things
  against.
- Added FE and E2E tests for CREATE and ALTER.

Change-Id: I0141dd3344a4f22b186f513b7406f286668ef1e7
Reviewed-on: http://gerrit.cloudera.org:8080/10441
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-13 00:10:13 +00:00
Joe McDonnell
9a5410570e IMPALA-7061: Rework HBase splitting and assignment
Some frontend PlannerTests rely on HBase tables being
arranged in a deterministic way. Specifically, the
HBase tables need to be split with specific region
boundaries and those regions need to be assigned to
specific HBase region servers.

Currently, the tables are created without splits and
testdata/bin/split-hbase.sh runs Java code in
HBaseTestDataRegionAssignment to split and assign
the tables. This runs during dataload via
testdata/bin/create-load-data.sh and during tests
with bin/run-all-tests.sh. There are problems with
both parts of this process. The table splitting is
flaky. Since significant time can pass between the
assignments and the tests, rebalancing means the
assignments are not always stable.

This changes the process so that the HBase tables are
created with the splits already specified via the
HBase shell. The splits remain stable over time.
PlannerTestBase runs the assignment code in
HBaseTestDataRegionAssignment at the start of
the PlannerTests. This makes the assignments
deterministic. No other test depends on the
exact assignments, so this does not regress anything.

Testing:
 - Local testing
 - Ran gerrit-verify-dryrun-external
 - Verified minicluster profile 2 compiles

Change-Id: I3d639128a856254a6ccb93d6750f531974b5f897
Reviewed-on: http://gerrit.cloudera.org:8080/10447
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-25 00:28:18 +00:00
Joe McDonnell
d481cd4842 IMPALA-6372: Go parallel for Hive dataload
This changes generate-schema-statements.py to produce
separate SQL files for different file formats for Hive.
This changes load-data.py to go parallel on these
separate Hive SQL files. For correctness, the text
version of all tables must be loaded before any
of the other file formats.

load-data.py runs DDLs to create the tables in Impala
and goes parallel. Currently, there are some minor
dependencies so that text tables must be created
prior to creating the other table formats. This
changes the definitions of some tables in
testdata/datasets/functional/functional_schema_template.sql
to remove these dependencies. Now, the DDLs for the
text tables can run in parallel to the other file formats.

To unify the parallelism for Impala and Hive, load-data.py
now uses a single fixed-size pool of processes to run all
SQL files rather than spawning a thread per SQL file.

This also modifies the locations that do invalidate to
use refresh where possible and eliminate global
invalidates.

For debuggability, different SQL executions output to
different log files rather than to standard out. If an
error occurs, this will point out the relevant log
file.

This saves about 10-15 minutes on dataload (including
for GVO).

Change-Id: I34b71e6df3c8f23a5a31451280e35f4dc015a2fd
Reviewed-on: http://gerrit.cloudera.org:8080/8894
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-14 00:16:26 +00:00
stiga-huang
818cd8fa27 IMPALA-5717: Support for reading ORC data files
This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies input needed from the orc-reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.

A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.

Currently, we only support reading primitive types. Writing into ORC
tables is not supported yet either.

Tests
 - Most of the end-to-end tests can run on ORC format.
 - Add tpcds, tpch tests for ORC.
 - Add some ORC specific tests.
 - Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
   is not robust for corrupt files (ORC-315).

Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Reviewed-on: http://gerrit.cloudera.org:8080/9134
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-11 05:13:02 +00:00
Joe McDonnell
0f33370b8b IMPALA-6580: Use LOAD DATA LOCAL for decimal tables
IMPALA-5752 added support for Kudu decimal. As a part
of that, it added Kudu versions of decimal_tbl and
decimal_tiny. Kudu tables are created and loaded even
on local tests, so these tables are loaded when they
previously weren't. The LOAD sections for these tables
rely on executing HDFS commmands to copy data to
appropriate locations. These HDFS commands cannot work
on local tests, causing this failure.

Untangling when to execute LOAD sections is complicated,
so this simply switches the decimal_tbl and decimal_tiny
to do LOAD DATA LOCAL calls, which do not rely on HDFS
commands.

Change-Id: I1f717917269d116c07a6f17944583f5e8faf2932
Reviewed-on: http://gerrit.cloudera.org:8080/9438
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-02-24 03:59:18 +00:00
Grant Henke
0c8eba076c IMPALA-5752: Add support for DECIMAL on Kudu tables
Adds support for the Kudu DECIMAL type introduced in Kudu 1.7.0.

Note: Adding support for Kudu decimal min/max filters is
tracked in IMPALA-6533.

Tests:
* Added Kudu create with decimal test to AnalyzeDDLTest.java
* Added Kudu table_format to test_decimal_queries.py
** Both decimal.test and decimal-exprs.test workloads
* Added decimal queries to the following Kudu workloads:
** kudu_create.test
** kudu_delete.test
** kudu_insert.test
** kudu_update.test
** kudu_upsert.test

Change-Id: I3a9fe5acadc53ec198585d765a8cfb0abe56e199
Reviewed-on: http://gerrit.cloudera.org:8080/9368
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Impala Public Jenkins
2018-02-23 00:03:54 +00:00
Vuk Ercegovac
db98dc6504 IMPALA-4993: extend dictionary filtering to collections
Currently, top-level scalar columns in parquet files can
be used at runtime to prune row-groups by evaluating certain
conjuncts over the column's dictionary (if available).

This change extends such pruning to scalar values that are
stored in collection type columns. Currently, dictionary
pruning works by finding eligible conjuncts for top-level
slots. Since only top-level slots are supported, the slots
are implicitly part of the scan node's tuple descriptor.
With this change, we track eligible conjuncts by slot as well
as the tuple that contains the slot (either top-level or
nested collection). Since collection conjuncts are already
managed by a map that associates tuple descriptors to a list
of their conjuncts, this extension follows the existing
representation.

The frontend builds the mapping of SlotId to conjuncts that
are dictionary filterable. This mapping now includes SlotId's
that reference nested tuples. The backend is adjusted to
use the same representation. In addition, collection
readers are decomposed into scalar filterable columns and
other, non-dictionary filterable readers. When filtering
a row group using a conjunct associated with a (possibly)
nested collection type, an additional tuple buffer is
allocated per tuple descriptor.
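
As an illustrative example (table and column names taken from the
nested TPC-H test data; the exact query is not part of this change),
a predicate on a scalar stored inside a collection can now prune row
groups via the column dictionary:
    SELECT c.c_custkey
    FROM tpch_nested_parquet.customer c, c.c_orders o
    WHERE o.o_orderpriority = '1-URGENT';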

Testing:
- e2e test extended to illustrate row-groups that are pruned
  by nested collection dictionary filters.

Change-Id: If3a2abcfc3d0f7d18756816659fed77ce12668dd
Reviewed-on: http://gerrit.cloudera.org:8080/8775
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-19 20:37:25 +00:00
Zachary Amsden
66704f915e IMPALA-6068: Scale back fixing functional-types
I re-created the original patch for IMPALA-6068, but only
performed what I believe to be the limited legal transformation
of data load: DEPENDENT_LOAD -> DEPENDENT_LOAD_HIVE.

Any place that directly uploads via hadoop or hdfs commands
was left alone as changing it can't be proven to be correct.

Change-Id: I6c242cca209a7138b10ad517076707709b5cd204
Testing: Doing a full data load.  I mistakenly changed a variable
name causing the first two dry-runs to fail.
Reviewed-on: http://gerrit.cloudera.org:8080/8690
Reviewed-by: Zach Amsden <zamsden@cloudera.com>
Tested-by: Zach Amsden <zamsden@cloudera.com>
2017-12-04 23:46:44 +00:00
David Knupp
d1c9510001 Revert "IMPALA-6068: Fix dataload for complextypes_fileformat"
This reverts commit e4f585240a.

Among other things, that commit replaced hdfs command line calls
with "LOAD DATA LOCAL INPATH" using Hive. However, doing so
presumes that the minicluster is the only test environment.
Sometimes though, the data load script is run against a remote
cluster, and in those cases the data load process is now broken.

Change-Id: I6dc419934d2953eb950b14d090d7895ec57aa9f2
Reviewed-on: http://gerrit.cloudera.org:8080/8653
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Reviewed-by: Zach Amsden <zamsden@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-28 02:57:04 +00:00
Joe McDonnell
e4f585240a IMPALA-6068: Fix dataload for complextypes_fileformat
Dataload typically follows a pattern of loading data into
a text version of a table, and then using an insert
overwrite from the text table to populate the table for
other file formats. This insert is always done in Impala
for Parquet and Kudu. Otherwise it runs in Hive.

Since Impala doesn't support writing nested data, the
population of complextypes_fileformat tries to hack
the insert to run in Hive by including it in the ALTER
part of the table definition. ALTER runs immediately
after CREATE and always runs in Hive. The problem is
that ALTER also runs before the base table
(functional.complextypes_fileformat) is populated.
The insert succeeds, but it is inserting zero rows.

This code change introduces a way to force the Parquet
load to run using Hive. This lets complextypes_fileformat
specify that the insert should happen in Hive and fixes
the ordering so that the table is populated correctly.

This is also useful for loading custom Parquet files
into Parquet tables. Hive supports the LOAD DATA LOCAL
syntax, which can read a file from the local filesystem.
This means that several locations that currently use
the hdfs commandline can be modified to use this SQL.
This change speeds up dataload by a few minutes, as it
avoids the overhead of the hdfs commandline.

Any other location that could use LOAD DATA LOCAL is
also switched over to use it. This includes the
testescape* tables which now print the appropriate
LOAD DATA commands as a result of text_delims_table.py.
Any location that already uses LOAD DATA LOCAL is also
switched to indicate that it must run in Hive. Any
location that was doing an HDFS command in the LOAD
section is moved to the DEPENDENT_LOAD_HIVE section.

Testing: Ran dataload and core tests. Also verified that
functional_parquet.complextypes_fileformat has rows.

Change-Id: I7152306b2907198204a6d8d282a0bad561129b82
Reviewed-on: http://gerrit.cloudera.org:8080/8350
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
2017-10-25 03:43:26 +00:00
Matthew Jacobs
922ee70317 IMPALA-5336: Fix partition pruning when column is cast
Partition pruning has two mechanisms:
1) Simple predicates (e.g. binary predicates of the form
   <SlotRef> <op> <LiteralExpr>) can be used to derive lists
   of matching partition ids directly from the
   partition key values. This is handled directly in the FE
   and is very efficient for supported simple predicates.
2) General expr evaluation of predicates using the BE (via
   FeSupport). This works for all predicates, so is the
   mechanism used for predicates not supported by (1).

The issue was that (1) was being used when a binary
predicate contained an implicit cast on the SlotRef. While
this is OK when being evaluated by the BE, the simple
mechanism in (1) would not be able to match the partition
key values with the predicate literal because the partition
key values cannot be cast in the FE.

The fix is to force binary predicates involving a cast to be
evaluated in the BE.
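
For example (illustrative column name), a predicate of the following
shape implicitly casts the partition key SlotRef and is therefore now
evaluated via mechanism (2):
    SELECT * FROM functional.stringpartitionkey
    WHERE string_col = CAST('2009-01-01 00:00:00' AS TIMESTAMP);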

Testing: A planner test was added to demonstrate the
expected partition pruning occurs.

Some modifications were made to the functional schema table
stringpartitionkey, so it will be necessary to reload that table:

load-data.py -w functional-query --table_names=stringpartitionkey

Change-Id: I94f597a6589f5e34d2b74abcd29be77c4161cd99
Reviewed-on: http://gerrit.cloudera.org:8080/7521
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
2017-07-31 21:49:17 +00:00
Matthew Jacobs
7c368999f8 IMPALA-5319: Fix test_hdfs_scan_node_errors failures
The recent Kudu TIMESTAMP patch (IMPALA-5137) made an
inadvertent change [1] to alltypeserror_tmp and
alltypeserrornonulls_tmp, changing 'timestamp_col' from
STRING to TIMESTAMP.

This seems to cause failures on exhaustive jobs which run
test_hdfs_scan_node_errors against all file-formats.
I haven't been able to reproduce this failure myself, so I cannot
verify whether this fixes the failing jobs, but reverting these
tables seems warranted given that they were changed inadvertently.

1: https://gerrit.cloudera.org/#/c/6526/11/testdata/datasets/functional/functional_schema_template.sql

Change-Id: I533f1921662802ea6e076eefac973f50c014fcb5
Reviewed-on: http://gerrit.cloudera.org:8080/6891
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Matthew Jacobs <mj@cloudera.com>
2017-05-17 16:34:14 +00:00
Matthew Jacobs
a16a0fa84d IMPALA-5137: Support Kudu UNIXTIME_MICROS as Impala TIMESTAMP
Adds Impala support for TIMESTAMP types stored in Kudu.

Impala stores TIMESTAMP values in 96-bits and has nanosecond
precision. Kudu's timestamp is a 64-bit microsecond delta
from the Unix epoch (called UNIXTIME_MICROS), so a conversion
is necessary.

When writing to Kudu, TIMESTAMP values with nanosecond precision are
rounded to the nearest microsecond.

When reading from Kudu, the KuduScanner returns
UNIXTIME_MICROS with 8 bytes of padding so Impala can convert
the value to a TimestampValue in-line and copy the entire
row.
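
As a sketch (assuming a hypothetical Kudu table kudu_ts_tbl with an
INT key column id and a TIMESTAMP column ts), the sub-microsecond
part of a written value is rounded away:
    INSERT INTO kudu_ts_tbl
    VALUES (1, CAST('2017-05-01 10:00:00.123456789' AS TIMESTAMP));
    -- Reading the row back yields 2017-05-01 10:00:00.123457000,
    -- i.e. the value rounded to microsecond precision.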

Testing:
Updated the functional_kudu schema to use TIMESTAMPs instead
of converting to STRING, so this provides some decent
coverage. Some BE tests were added, and some EE tests as
well.

TODO: Support pushing down TIMESTAMP predicates
TODO: Support TIMESTAMPs in range partitioning expressions

Change-Id: Iae6ccfffb79118a9036fb2227dba3a55356c896d
Reviewed-on: http://gerrit.cloudera.org:8080/6526
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
2017-05-11 20:55:51 +00:00
Lars Volker
12f3ecceab IMPALA-5287: Test skip.header.line.count on gzip
This change fixed IMPALA-4873 by adding the capability to supply a dict
'test_file_vars' to run_test_case(). Keys in this dict will be replaced
with their values inside test queries before they are executed.

Change-Id: Ie3f3c29a42501cfb2751f7ad0af166eb88f63b70
Reviewed-on: http://gerrit.cloudera.org:8080/6817
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
2017-05-09 01:36:46 +00:00
Dan Hecht
bf2e897209 IMPALA-4810: add DECIMAL test case to strict_mode tests
The string parsing code already errors if the decimal column either
overflows or underflows (i.e. loses scale). Let's just add a test
case.

Change-Id: Idd66c0fb5a4d201919d39f73dea08b87339d6469
Reviewed-on: http://gerrit.cloudera.org:8080/6150
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2017-03-03 01:43:42 +00:00
Dan Burkert
f83652c1da Replace INTO N BUCKETS with PARTITIONS N in CREATE TABLE
This commit also removes the `DISTRIBUTE`, `SPLIT`, and `BUCKETS`
keywords, which were going to be newly released in Impala 2.6 but are
now unused. Additionally, a few remaining uses of the `DISTRIBUTE BY`
syntax have been switched to `PARTITION BY`.
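
As a rough illustration (hypothetical table name), the unreleased
2.6 development syntax and its replacement look like:
    -- Unreleased 2.6 development syntax, now removed:
    --   DISTRIBUTE BY HASH (id) INTO 16 BUCKETS
    -- Replacement syntax:
    CREATE TABLE t (id BIGINT PRIMARY KEY, s STRING)
    PARTITION BY HASH (id) PARTITIONS 16
    STORED AS KUDU;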

Change-Id: I32fdd5ef26c532f7a30220db52bdfbf228165922
Reviewed-on: http://gerrit.cloudera.org:8080/5382
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-12-07 07:31:16 +00:00
Dimitris Tsirogiannis
cba93f1ac3 IMPALA-4561: Replace DISTRIBUTE BY with PARTITION BY in CREATE TABLE
Change-Id: I0e07c41eabb4c8cb95754cf04293cbd9e03d6ab2
Reviewed-on: http://gerrit.cloudera.org:8080/5317
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-12-06 10:41:53 +00:00
Dimitris Tsirogiannis
af67b2fef4 IMPALA-4514: Fix broken exhaustive builds caused by non-nullable columns
This commit fixes the broken exhaustive Impala builds. The issue was
caused by a Kudu table that didn't properly specify nullability
constraints. Hence, some rows were rejected during data loading, causing
some tests to fail.

Change-Id: Ib6f4b4c88ef18b1731b7c9789aad602880e18035
Reviewed-on: http://gerrit.cloudera.org:8080/5157
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-21 22:24:25 +00:00
Dimitris Tsirogiannis
3db5ced4ce IMPALA-3726: Add support for Kudu-specific column options
This commit adds support for Kudu-specific column options in CREATE
TABLE statements. The syntax is:
CREATE TABLE tbl_name ([col_name type [PRIMARY KEY] [option [...]]] [, ....])
where option is:
| NULL
| NOT NULL
| ENCODING encoding_val
| COMPRESSION compression_algorithm
| DEFAULT expr
| BLOCK_SIZE num
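
For instance (illustrative names and option values; shown with the
current PARTITION BY partitioning syntax), a statement exercising
several of these options:
    CREATE TABLE kudu_options_demo (
      id BIGINT PRIMARY KEY ENCODING BIT_SHUFFLE COMPRESSION LZ4,
      name STRING NOT NULL DEFAULT 'unknown' BLOCK_SIZE 4096
    )
    PARTITION BY HASH (id) PARTITIONS 3
    STORED AS KUDU;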

The output of the SHOW CREATE TABLE statement was altered to include all the specified
column options for Kudu tables.

Change-Id: I727b9ae1b7b2387db752b58081398dd3f3449c02
Reviewed-on: http://gerrit.cloudera.org:8080/5026
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-11-18 11:41:01 +00:00
Taras Bobrovytsky
eb8120d218 IMPALA-3812: Fix error message for unsupported types
Before this patch, an unclear error message was returned if DATE or
DATETIME appeared in the select list after a star expansion. This was
because the PrimitiveType for DATE and DATETIME was serialized as
INVALID_TYPE. This is fixed by serializing it correctly.

Change-Id: I9019b4bfd219f94e554c795befd3ff5e39706ea9
Reviewed-on: http://gerrit.cloudera.org:8080/4859
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-17 05:31:34 +00:00