Commit Graph

1382 Commits

Author SHA1 Message Date
Jim Apple
57fcbf7a28 IMPALA-4171: Remove JAR from repo.
By ASF rules, we can't have JARs in releases. The releases are just
tarballs of the repo.

This patch removes from the repo the single JAR there, which was a
version of a JAR that is built during data load, with one string
changed. The JAR is used only for testing.

Instead of building that jar with the different string and saving the
result in git, data loading will now build the jar twice, with one Java
source file slightly changed.

Change-Id: Icee7b8c32b08e064dea4a14624acff6021ef5ce1
Reviewed-on: http://gerrit.cloudera.org:8080/4499
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-09-22 02:00:50 +00:00
Alex Behm
3aa4351625 IMPALA-4170: Fix identifier quoting in COMPUTE INCREMENTAL STATS.
The SQL statements generated from COMPUTE INCREMENTAL STATS
did not properly quote identifiers when incrementally updating
the stats for newly added partitions.

Our existing tests did not catch this case because the code paths
for doing the initial stats computation and the incremental stats
computation are different, in particular, the code for generating
the SQL statements.
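The fix amounts to quoting every identifier spliced into the generated SQL. A minimal sketch of the idea (helper name and statement shape are illustrative, not Impala's actual code):

```python
def quote_identifier(name: str) -> str:
    # Hypothetical helper mirroring the fix: wrap an identifier in
    # backquotes, doubling any embedded backquote, so reserved words
    # and unusual names survive statement generation.
    return "`" + name.replace("`", "``") + "`"

# A generated statement then quotes both database and table name:
stmt = "COMPUTE INCREMENTAL STATS %s.%s" % (
    quote_identifier("functional"), quote_identifier("select"))
```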

Change-Id: I63adcc45dc964ce769107bf4139fc4566937bb96
Reviewed-on: http://gerrit.cloudera.org:8080/4479
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
2016-09-21 01:24:53 +00:00
Henry Robinson
19de09ab7d IMPALA-4160: Remove Llama support.
Alas, poor Llama! I knew him, Impala: a system
of infinite jest, of most excellent fancy: we hath
borne him on our back a thousand times; and now, how
abhorred in my imagination it is!

Done:

* Removed QueryResourceMgr, ResourceBroker, CGroupsMgr
* Removed untested 'offline' mode and NM failure detection from
  ImpalaServer
* Removed all Llama-related Thrift files
* Removed RM-related arguments to MemTracker constructors
* Deprecated all RM-related flags, printing a warning if enable_rm is
  set
* Removed expansion logic from MemTracker
* Removed VCore logic from QuerySchedule
* Removed all reservation-related logic from Scheduler
* Removed RM metric descriptions
* Various misc. small class changes

Not done:

* Remove RM flags (--enable_rm etc.)
* Remove RM query options
* Changes to RequestPoolService (see IMPALA-4159)
* Remove estimates of VCores / memory from plan

Change-Id: Icfb14209e31f6608bb7b8a33789e00411a6447ef
Reviewed-on: http://gerrit.cloudera.org:8080/4445
Tested-by: Internal Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
2016-09-20 23:50:43 +00:00
Yuanhao Luo
6e4064d942 IMPALA-4074: Configuration items duplicate in template of YARN
Remove duplicate configuration items "yarn.nodemanager.local-dirs"
and "yarn.nodemanager.log-dirs" from template configuration of YARN.

Change-Id: I81d6d019d4982cb35932b1d45c376b215ec5bcc6
Reviewed-on: http://gerrit.cloudera.org:8080/4311
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-09-16 00:38:16 +00:00
Matthew Jacobs
32199105f7 Bump Kudu version to 1.0-RC1 and add support for new OSes
Change-Id: Ibbe554d6782212f91db07757f429c5571a7a44da
Reviewed-on: http://gerrit.cloudera.org:8080/4420
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-09-16 00:14:15 +00:00
David Knupp
a42d18dcc3 IMPALA-2013: Reintroduce steps for checking HBase health in run-hbase.sh
We used to include a step in run-hbase.sh for calling a python
script that queried Zookeeper to see if the HBase master was up.
The original script was problematic, so we stopped using it during
our mini-cluster HBase start up procedure.

HBase start up issues continue to plague us, however. This patch
reintroduces a Zookeeper check, with the following updates:

- replace the original script with check-hbase-nodes.py
- query the correct node /hbase/master, not just /hbase/rs
- use the python Zookeeper library kazoo, rather than calling
  out to the shell and parsing the return string
- since we are moving toward testing on a remote cluster, also
  add the capability to pass in the address for the host that
  provides the Zookeeper and HBase services
- add an additional check that the HDFS service is running,
  because of an edge case where the HBase master can briefly
  start without a cluster running.
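The core of such a check is asking ZooKeeper whether the expected znodes exist. A minimal sketch, assuming a started kazoo `KazooClient` (or any object exposing its `exists()` method) is passed in; the znode paths match those named above, everything else is illustrative:

```python
def nodes_healthy(zk, znodes=("/hbase/master", "/hbase/rs")):
    # 'zk' is assumed to behave like a started kazoo.client.KazooClient.
    # The real check-hbase-nodes.py also verifies HDFS; this sketch
    # covers only the znode existence checks.
    return all(zk.exists(z) is not None for z in znodes)
```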

In addition to the expected tests, this script was also tested
under the conditions of IMPALA-4088, whereby the HBase RegionServer
is running, but the master fails because another listening process
has already taken its TCP port (60010) during startup.

Change-Id: I9b81f3cfb6ea0ba7b18ce5fcd5d268f515c8b0c3
Reviewed-on: http://gerrit.cloudera.org:8080/4348
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-09-15 00:02:22 +00:00
Matthew Jacobs
c7fa03286b IMPALA-3718: Support subset of functional-query for Kudu
Adds initial support for the functional-query test workload
for Kudu tables.

There are a few issues that make loading the functional
schema difficult on Kudu:
 1) Kudu tables must have one or more columns that together
    constitute a unique primary key.
   a) Primary key columns must currently be the first columns
      in the table definition (KUDU-1271).
   b) Primary key columns cannot be nullable (KUDU-1570).
 2) Kudu tables must be specified with distribution
    parameters.

(1) limits the tables that can be loaded without ugly
workarounds. This patch only includes important tables that
are used for relevant tests, most notably the alltypes*
family. In particular, alltypesagg is important but it does
not have a set of columns that are non-nullable and form a unique
primary key. As a result, that table is created in Kudu with
a different name and an additional BIGINT column for a PK
that is a unique index and is generated at data loading time
using the ROW_NUMBER analytic function. A view is then
wrapped around the underlying table that matches the
alltypesagg schema exactly. When KUDU-1570 is resolved, this
can be simplified.
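The workaround's general shape can be sketched as follows; the table, view, and column names here are hypothetical, not the actual DDL/DML from the patch:

```python
# Load into a Kudu table that has an extra generated BIGINT key column,
# produced with ROW_NUMBER at load time:
load_stmt = """
INSERT INTO alltypesagg_kudu
SELECT ROW_NUMBER() OVER (ORDER BY id) AS generated_pk, t.*
FROM alltypesagg_source t
"""

# Then expose the original schema (without the synthetic key) via a view;
# column list shown is illustrative only:
view_stmt = """
CREATE VIEW alltypesagg AS
SELECT id, int_col, bigint_col
FROM alltypesagg_kudu
"""
```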

(2) requires some additional considerations and custom
syntax. As a result, the DDL to create the tables is
explicitly specified in CREATE_KUDU sections in the
functional_schema_constraints.csv, and an additional
DEPENDENT_LOAD_KUDU section was added to specify custom data
loading DML that differs from the existing DEPENDENT_LOAD.

TODO: IMPALA-4005: generate_schema_statements.py needs refactoring

Tests that are not relevant or not yet supported have been
marked with xfail and a skip where appropriate.

TODO: Support remaining functional tables/tests when possible.

Change-Id: Iada88e078352e4462745d9a9a1b5111260d21acc
Reviewed-on: http://gerrit.cloudera.org:8080/4175
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-09-14 22:11:04 +00:00
Jim Apple
bd2947329e IMPALA-4110: Clean up issues found by Apache RAT.
Change-Id: I5bfe77f9a871018e7a67553ed270e2df53006962
Reviewed-on: http://gerrit.cloudera.org:8080/4361
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-09-14 22:09:24 +00:00
Alex Behm
d379875806 IMPALA-3491: Use unique db in test_scanners.py and test_aggregation.py
Testing: Ran the tests locally in a loop on exhaustive.
Did a private debug/exhaustive run.

Change-Id: Ided0848c138bdc1d43694a12222010c48e23ee1c
Reviewed-on: http://gerrit.cloudera.org:8080/4339
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-09-13 21:57:36 +00:00
Alex Behm
8c37bf3543 IMPALA-3491: Use unique database fixture in test_partitioning.py
Testing: Ran the test locally in a loop on exhaustive. Did a private
debug/exhaustive/hdfs test run.

Change-Id: Ib1b33d9977a98894288662a711805e9a54329ec8
Reviewed-on: http://gerrit.cloudera.org:8080/4316
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-09-08 04:31:27 +00:00
Alex Behm
f0ffbca2c3 IMPALA-3491: Use unique database fixture in test_insert_parquet.py
Testing: Ran the test locally in a loop.
Did a private debug/core/hdfs build.

Change-Id: I790b2ed5236640c7263826d1d2a74b64d43ac6f7
Reviewed-on: http://gerrit.cloudera.org:8080/4317
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-09-08 03:25:29 +00:00
Matthew Jacobs
157c80056c IMPALA-3481: Use Kudu ScanToken API for scan ranges
Switches the planner and KuduScanNode to use Kudu's new
ScanToken API instead of explicitly constructing scan ranges
for all tablets of a table, regardless of whether they were
needed. The ScanToken API allows Impala to specify the
projected columns and predicates during planning, and Kudu
returns a set of 'scan tokens' that represent a scanner for
each tablet that needs to be scanned. The scan tokens can
be serialized and distributed to the scan nodes, which can
then deserialize them into Kudu scanner objects. Upon
deserialization, the scan token has all scan parameters
already, including the 'pushed down' predicates. Impala no
longer needs to send the Kudu predicates to the BE and
convert them at the scan node.
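The flow described above can be modeled abstractly: the planner obtains one opaque token per tablet that must be scanned, and each scan node deserializes its token into complete scanner parameters. This is a toy model only, not the Kudu client API:

```python
import json

def create_scan_tokens(tablets, projected_cols, predicates):
    # One serialized token per tablet that actually needs scanning;
    # each token already carries the projection and the pushed-down
    # predicates chosen at planning time.
    return [json.dumps({"tablet": t,
                        "columns": projected_cols,
                        "predicates": predicates})
            for t in tablets]

def open_scanner(token):
    # Deserializing a token yields all scan parameters; no separate
    # predicate-conversion step is needed at the scan node.
    return json.loads(token)
```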

This change also fixes:
1) IMPALA-4016: Avoid materializing slots only referenced
                by Kudu conjuncts
2) IMPALA-3874: Predicates are not always pushed to Kudu

TODO: Consider additional planning improvements.

Testing: Updated the existing tests, verified everything
works as expected. Some BE tests no longer make sense and
they were removed.

TODO: When KUDU-1065 is resolved, add tests that demonstrate pruning.

Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220
Reviewed-on: http://gerrit.cloudera.org:8080/4120
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-09-08 01:50:51 +00:00
Zoltan Ivanfi
b35689d7d9 Minor enhancements to helper scripts.
- run-all-tests.sh: survive non-fatal failures when calling ulimit.
- copy-udfs-udas.sh: respect $MAKE_CMD instead of blindly using make.

Change-Id: Ic90bd0048786c799a8ac435de4303ed399ac1223
Reviewed-on: http://gerrit.cloudera.org:8080/4304
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-09-05 15:17:22 +00:00
Alex Behm
c8f3d40efc IMPALA-3491: Use unique database fixture in test_nested_types.py
Testing: Ran the tests locally in a loop. Did a core/debug/hdfs
private build.

Change-Id: I0c56df0c6a5f771222dedb69353f8bebe01d5a90
Reviewed-on: http://gerrit.cloudera.org:8080/4302
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-09-03 00:39:07 +00:00
Yuanhao Luo
052d3cc8dd IMPALA-4056: Fix toSql() of DistributeParam
This commit fixes two issues in toSql() of DistributeParam:
1. string literals were not quoted
2. range partition split rows were not printed.
In addition, this commit fixes a small issue in run-hive-server.sh.

Change-Id: I984a63a24f02670347b0e1efceb864d265d1f931
Reviewed-on: http://gerrit.cloudera.org:8080/4195
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-09-02 20:11:27 +00:00
Alex Behm
ab9e54bc42 IMPALA-3491: Use unique database fixture in test_ddl.py.
Adds new parametrization to the unique database fixture:
- num_dbs: allows creating multiple unique databases at once;
  the 2nd, 3rd, etc. database name is generated by appending
  "2", "3", etc., to the first database name
- sync_ddl: allows creating the database(s) with sync_ddl
  which is needed by most tests in test_ddl.py

Testing: I ran debug/core and debug/exhaustive on HDFS and
core/debug on S3. Also ran the test locally in a loop on
exhaustive.

Change-Id: Idf667dd5e960768879c019e2037cf48ad4e4241b
Reviewed-on: http://gerrit.cloudera.org:8080/4155
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-09-02 02:47:02 +00:00
Alex Behm
16f1c8d8de IMPALA-4054: Remove serial test workarounds for IMPALA-2479.
The underlying issue IMPALA-2479 has been fixed, so it
should be safe to execute these tests in parallel again:
- test_runtime_filters.py (all tests)
- test_scanners.py::TestParquet::test_multiple_blocks
- test_scanners.py::TestParquet::test_multiple_blocks_one_row_group

Testing: Ran the tests locally in a loop. Did a private core/hdfs run.

Change-Id: I8f046e67eb1de1c6ff87980f906870ec9816f551
Reviewed-on: http://gerrit.cloudera.org:8080/4291
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Internal Jenkins
2016-09-02 02:19:52 +00:00
Jim Apple
20ef3b016e IMPALA-4058: ByteSwap256 assumed memory was 16-byte aligned.
This changes the code to use the lddqu and movdqu instructions (via
Intel intrinsics) to allow unaligned memory access.

Change-Id: I39b2b47bb717d5ac9727512a24fcf8a8a6a8dcc6
Reviewed-on: http://gerrit.cloudera.org:8080/4205
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-09-02 01:47:08 +00:00
Henry Robinson
24869d40fd IMPALA-3610: Account for memory used by filters in the coordinator
Before this patch, Impala would not account for the memory used to
aggregate runtime filters together in the coordinator. Impala's memory
could therefore be silently overcommitted.

This patch accounts for aggregated filter memory in a new filter
memtracker that is attached to the coordinator's query_mem_tracker(). If
the query memory limit is exceeded when a filter update arrives, that
update is discarded. If the filter is from a partitioned join, the
entire filter can therefore be discarded immediately (to alleviate
memory pressure) and a dummy 'always true' filter is sent to backends to
unblock them.

If the filter is from a broadcast join, no aggregation is done, so there
is no tracking. The Thrift input and output filter data structures are
not tracked (as we generally don't track RPC objects, but plan to in the
future). The filter payload is moved from the input request structure to
the output broadcast structure without copying.

Memory that is added to a memtracker must always be released. To do
this, we need to signal to the coordinator that it is finished, and that
there is no point trying to process any future updates that might arrive
concurrently. This patch adds Coordinator::Done() which is called from
QueryExecState::Done(), and which releases memory from all in-process
runtime filters.
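The accounting policy described above (consume against a limit, discard updates that would exceed it, release everything on completion) can be sketched as follows. This is an illustrative toy, not Impala's MemTracker API:

```python
class FilterMemTracker:
    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.consumed = 0

    def try_consume(self, nbytes):
        # A filter update that would push usage past the limit is
        # rejected; the caller then discards that update.
        if self.consumed + nbytes > self.limit:
            return False
        self.consumed += nbytes
        return True

    def release_all(self):
        # Called when the coordinator is done: all in-process filter
        # memory must be handed back.
        self.consumed = 0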

Finally, this patch increases the upper limit for runtime filters to
512MB. This allows testing on very large datasets. The default maximum
is still 16MB, per RUNTIME_FILTER_MAX_SIZE.

Testing: Added a new test that triggers the OOM condition on the
coordinator. All existing runtime filter tests pass.

Change-Id: I3c52c8a1c2e79ef370c77bf264885fc859678d1b
Reviewed-on: http://gerrit.cloudera.org:8080/4066
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2016-09-01 02:35:41 +00:00
Tim Armstrong
1350c34763 IMPALA-4049: fix empty batch handling NLJ build side
Memory from the build side of a nested loop join is
referenced by its output batches, so accumulated memory
build side resources must be transferred to the caller.
Special-cased handling of empty batches did not transfer
the memory. The fix is to accumulate empty batches and
transfer their resources in the same way as non-empty
batches. The iterator required changes to handle empty
batches in the list.

Testing:
Added a unit test that exercises the bug in RowBatchList.
Added a query test that causes a crash in the ASAN build
and incorrect results in the debug build.

Change-Id: I3cb19e536b87bbb4d4ae82d1636ba1463a422789
Reviewed-on: http://gerrit.cloudera.org:8080/4182
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-08-31 21:20:29 +00:00
Alex Behm
5adedc6a1a IMPALA-3930,IMPALA-2570: Fix shuffle insert hint with constant partition exprs.
Fixes inserts into partitioned tables that have a shuffle hint and only constant
partition exprs. The rows to be inserted are merged at the coordinator where
the table sink is executed. There is no need to hash exchange rows.

Now accepts insert hints when inserting into unpartitioned tables. The shuffle
hint leads to a plan where all rows are merged at the coordinator where
the table sink is executed.

Change-Id: I1084d49c95b7d867eeac3297fd2016daff0ab687
Reviewed-on: http://gerrit.cloudera.org:8080/4162
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
2016-08-31 09:59:00 +00:00
Alex Behm
df830901de IMPALA-3491: Use unique database fixture in test_join_queries.py.
Testing: Ran the core/exhaustive on hdfs.

Change-Id: Ib639ff8a37dbf64840606f88badff8f2590587b6
Reviewed-on: http://gerrit.cloudera.org:8080/4169
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-08-31 03:12:30 +00:00
Alex Behm
12496c7fbf IMPALA-1657: Rework detection and reporting of corrupt table stats.
1. Minor fixes for cardinality estimation of unpartitioned tables.
2. Reworks handling of corrupt table stats as follows:
   The stats of a table or partition are reported as corrupt if the
   numRows < -1, or if numRows == 0 but the table size is positive.
3. Removes the Preconditions check reported in IMPALA-1657 in favor
   of issuing a corrupt table stats warning.
4. Fixes a few tests to set numRows together with
   STATS_GENERATED_VIA_STATS_TASK so that the numRows is definitely
   set in the HMS.

Change-Id: I1d3305791d96e1c23a901af7b7c109af9352bb44
Reviewed-on: http://gerrit.cloudera.org:8080/4166
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-08-31 00:58:03 +00:00
Thomas Tauber-Marshall
d72353d0c9 IMPALA-2932: Extend DistributedPlanner to account for hash table build cost
When deciding between a broadcast or repartition join, Impala calculates
the cost of each join as the total amount of data that is sent over the
network. This ignores some relevant costs, and can lead to bad plans.

One such relevant cost is the work to create the hash table used in the
join. This patch accounts for this by adding the amount of data inserted
into the hash table (the size of the right side of the join) to the
previous cost.

This generally increases the estimated cost of broadcast joins relative
to repartitioning joins, as the broadcast join must build the hash table
on each node the data was broadcast to, so its effect will be to make
repartitioning joins more likely to be chosen, especially in large
clusters.
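The shift in relative cost can be illustrated with a simplified model (these formulas are illustrative only, not the planner's exact arithmetic): broadcast ships and builds the right side on every node, while repartitioning exchanges both sides once and builds the hash table once in total:

```python
def broadcast_cost(rhs_bytes, num_nodes):
    # Old model counted only the network term; the patch adds the
    # per-node hash-table build of the broadcast right side.
    network = rhs_bytes * num_nodes
    build = rhs_bytes * num_nodes
    return network + build

def partitioned_cost(lhs_bytes, rhs_bytes):
    # Both sides are hash-exchanged once; the right side is inserted
    # into a hash table once in total across the cluster.
    network = lhs_bytes + rhs_bytes
    build = rhs_bytes
    return network + build
```

With many nodes the per-node build term makes broadcast comparatively more expensive, which is why repartitioning becomes more likely on large clusters.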

This patch has not yet been performance tested.

Change-Id: I03a0f56f69c8deae68d48dfdb9dc95b71aec11f1
Reviewed-on: http://gerrit.cloudera.org:8080/4098
Tested-by: Internal Jenkins
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
2016-08-29 16:44:22 +00:00
Lars Volker
0e886618e2 IMPALA-3776: fix 'describe formatted' for Avro tables
For Avro tables the column information in the underlying database of the
Hive metastore can be different from what is specified in the avro
schema. HIVE-6308 aimed to improve upon this, but for older tables the
two don't necessarily align.

There are two possible cases:

1) Hive's underlying database contains a column which is not present in
the Avro schema file. In this case we encounter a NullPointerException
in DescribeResultFactory.java#L189 when trying to look up the column in
the internal table object.

2) The Avro schema contains a column, which is not present in the
underlying database. In this case the column will not be displayed in
describe formatted.

In addition to the automatic tests I verified this manually by creating
an Avro table with an external schema file in Hive. This populated the
underlying database with the column information. I then either removed
a column from the Avro schema file (case 1) or cleared the column
information from the "COLUMNS_V2" table in the underlying database
(case 2) and verified that the change fixed both cases.

Change-Id: Ieb69d3678e662465d40aee80ba23132ea13871a0
Reviewed-on: http://gerrit.cloudera.org:8080/4126
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Internal Jenkins
Reviewed-by: Jim Apple <jbapple@cloudera.com>
2016-08-26 17:20:10 +00:00
Henry Robinson
34b5f1c416 IMPALA-(3895,3859): Don't log file data on parse errors
Logging file or table data is a bad idea, and doing it by default is
particularly bad. This patch changes HdfsScanNode::LogRowParseError() to
log a file and offset only.

Testing: See rewritten tests.

To support testing this change, we also fix IMPALA-3895, by introducing
a canonical string __HDFS_FILENAME__ that all Hadoop filenames in the ERROR
output are replaced with before comparing with the expected
results. This fixes a number of issues with the old way of matching
filenames which purported to be a regex, but really wasn't. In
particular, we can now match the rest of an ERROR line after the
filename, which was not possible before.

In some cases, we don't want to substitute filenames because the ERROR
output is looking for a very specific output. In that case we can write:

$NAMENODE/<filename>

and this patch will not perform _any_ filename substitutions on ERROR
sections that contain the $NAMENODE string.

Finally, this patch fixes a bug where a test that had an ERRORS section
but no RESULTS section would silently pass without testing anything.

Change-Id: I5a604f8784a9ff7b4bf878f82ee7f56697df3272
Reviewed-on: http://gerrit.cloudera.org:8080/4020
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
2016-08-25 10:20:36 +00:00
Alex Behm
1a430fe40c IMPALA-2540: Add regression test.
The bug was also fixed by the fix for IMPALA-3063:
532b1fe118

This patch adds a regression test for IMPALA-2540.

Change-Id: I7c7dececfee90540fe7d5f8a606381ec50a3b241
Reviewed-on: http://gerrit.cloudera.org:8080/4071
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-08-25 00:43:05 +00:00
Attila Jeges
211f60d831 IMPALA-1731,IMPALA-3868: Float values are not parsed correctly
Fixed StringToFloatInternal() not to parse strings like "1.23inf"
and "infinite" with leading/trailing garbage as Infinity. These
strings are now rejected with PARSE_FAILURE.
Only "inf" and "infinity" are accepted; parsing is case-insensitive.

"NaN" values are handled similarly: strings with leading/trailing
garbage like "nana" are rejected; parsing is case-insensitive.

Other changes:
- StringToFloatInternal() was cleaned up a bit. Parsing inf and NaN
strings was moved out of the main loop.
- Use std::numeric_limits<T>::infinity() instead of INFINITY macro
and std::numeric_limits<T>::quiet_NaN() instead of NAN macro.
- Fixed another minor bug: multiple dots were allowed when parsing
float values (e.g. "2.1..6" was interpreted as 2.16).
- New BE and E2E tests were added.
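The accepted grammar after the fix can be sketched as a regex-gated parse: a plain decimal number, or exactly "inf"/"infinity"/"nan" (optional sign, case-insensitive), with no leading/trailing garbage and no repeated dots. This Python sketch only mirrors the behavior described above, not the C++ StringToFloatInternal() code:

```python
import re

_FLOAT_RE = re.compile(
    r"^[+-]?(?:inf(?:inity)?|nan"
    r"|\d+(?:\.\d*)?(?:[eE][+-]?\d+)?|\.\d+)$",
    re.IGNORECASE)

def strict_parse(s):
    # Reject anything with extra characters around the number or the
    # inf/nan keywords, mirroring the PARSE_FAILURE behavior.
    if not _FLOAT_RE.match(s.strip()):
        raise ValueError("PARSE_FAILURE: %r" % s)
    return float(s)
```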

Change-Id: I9e17d0f051b300a22a520ce34e276c2d4460d35e
Reviewed-on: http://gerrit.cloudera.org:8080/3791
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-08-24 03:34:01 +00:00
Tim Armstrong
f613dcd02d Add functional and targeted perf tests for joins with empty builds
I wrote these tests for my IMPALA-3987 patch, but other issues block
that optimisation. These tests exercise an interesting corner case,
so I split them out into a separate patch.

The functional tests exercise every join mode for nested loop join and
hash join with an empty build side. The perf test exercises hash join
with an empty build side.

Testing:
Made sure the tests passed with both partitioned and non-partitioned
hash join implementations. Ran the targeted perf query through the
single node perf run script to make sure it worked.

Change-Id: I0a68cafec32011a47c569b254979601237e7f2a5
Reviewed-on: http://gerrit.cloudera.org:8080/4051
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-08-19 06:04:18 +00:00
Alex Behm
1bbd667fd3 IMPALA-3828: Enable inversion for inner joins.
Testing: Ran the FE planner tests. Examined all the changed plans
to verify that the changes are beneficial according to our
cardinality estimates. Still need to do a real perf run.

Change-Id: I8ba903f1df2446350cca7e71fdb13f550bf9de72
Reviewed-on: http://gerrit.cloudera.org:8080/4035
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-08-19 05:40:01 +00:00
Matthew Jacobs
d113205cee IMPALA-3650: DISTRIBUTE BY required for managed Kudu tables
As of Kudu 0.9, DISTRIBUTE BY is now required when creating
a new Kudu table. Create table analysis, data loading, and
tests are updated to reflect this.

This also bumps the Kudu version to 0.10.0.

Change-Id: Ieb15110b10b28ef6dd8ec136c2522b5f44dca43e
Reviewed-on: http://gerrit.cloudera.org:8080/3987
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-08-19 02:14:39 +00:00
Matthew Jacobs
0983da92ba IMPALA-3856,IMPALA-3871: Fix BinaryPredicate normalization for Kudu
Change-Id: Iae7612433a2e27f8887abe6624f9ee0f4867b934
Reviewed-on: http://gerrit.cloudera.org:8080/3986
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-08-18 04:03:00 +00:00
Alex Behm
532b1fe118 IMPALA-3063: Separate join inversion from join ordering.
Before this change joins were inverted while doing join ordering.
That approach was unnecessarily complex because it required
modifying the global analysis state for correct conjunct
placement, etc. However, join inversion is independent of join
ordering, and the existing approach could lead to generating
invalid plans with distributed non-equi right outer/semi joins,
which we cannot execute in the backend.

After this change joins are inverted in a separate pass over
the single-node plan. This simplifies the inversion
logic and allows us to avoid generating those invalid plans.

Note that this change is not only a separation of functionality
for the following reasons:
1. Our join cardinality estimation is not symmetric, i.e., A JOIN B
may not give the same estimate as B JOIN A due to our FK/PK detection
heuristic. In the context of this patch this means that an inverted
join may have a different cardinality estimate, so plans may change
depending on whether the inversion is done during join ordering or after.
2. We currently only invert outer/semi/anti joins based on the rhs table
ref join op. In this patch I want to preserve the existing behavior as
much as possible, but when doing the join ordering in a separate pass we
may see a join op in a JoinNode that is different from the rhs table ref.
So in some situations the inversion behavior based on the join op could be
different and there are some examples in this patch.

This patch also moves the logic of converting hash joins to
nested-loop joins into a separate pass over the single-node plan.

Change-Id: If86db7753fc585bb4c69612745ec0103278888a4
Reviewed-on: http://gerrit.cloudera.org:8080/3846
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-08-18 03:25:16 +00:00
Christopher Channing
90a6b3206e IMPALA-3964: Fix crash when a count(*) is performed on a nested collection.
The Bug: Prior to this patch, a DCHECK was used to verify that the
underlying memory pool for the scratch batch was empty in a count based
scenario. For IMPALA-3964 (where a count(*) is performed on a nested
collection), if a Parquet column chunk is compressed, upon reading each
new data page it would be decompressed and eventually placed in to the
underlying scratch batch memory pool causing the aforementioned DCHECK
to fail. This was not picked up in the test suite as the TPCH nested
Parquet data is not compressed.

The Fix: Removed the erroneous DCHECK. Added logic to determine if any
memory in the scratch batch needs to be freed (due to the transfer that
occurs from the decompressed data pool), if so, it will be done.
Augmented the load_nested.py script to snappy compress each of the
tables within the 'tpch_nested_parquet' database. This is consistent with
how the flat TPCH Parquet data set is stored. Regarding test coverage,
there are already a number of tests that will perform nested collection
counts against the tables in the 'tpch_nested_parquet' database. For
uncompressed nested Parquet, the 'test_nested_types.py' test suite
leverages the 'ComplexTypesTbl' table to provide good coverage.

Change-Id: Id0955c85d18dfba4bd29a35ec95d0355da050607
Reviewed-on: http://gerrit.cloudera.org:8080/3940
Reviewed-by: Michael Ho <kwho@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-08-16 11:54:04 +00:00
Tim Armstrong
5afd9f7df7 IMPALA-3764,3914: fuzz test HDFS scanners and fix parquet bugs found
This adds a test that performs some simple fuzz testing of HDFS
scanners. It creates a copy of a given HDFS table, with each
file in the table corrupted in a random way: either a single
byte is set to a random value, or the file is truncated to a
random length. It then runs a query that scans the whole table
with several different batch_size settings. I made some effort
to make the failures reproducible by explicitly seeding the
random number generator, and providing a mechanism to override
the seed.
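The two mutations and the seeding scheme can be sketched like this (an illustrative Python reduction, not the actual test harness code):

```python
import random

def corrupt(data: bytes, seed: int) -> bytes:
    # Apply one of the fuzzer's two mutations to a copied file:
    # overwrite a single byte with a random value, or truncate the
    # file at a random offset. Seeding the RNG explicitly keeps
    # failures reproducible.
    rng = random.Random(seed)
    if rng.random() < 0.5 and data:
        i = rng.randrange(len(data))
        return data[:i] + bytes([rng.randrange(256)]) + data[i + 1:]
    return data[:rng.randrange(len(data) + 1)]
```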

The fuzzer has found crashes resulting from corrupted or truncated
input files for RCFile, SequenceFile, Parquet, and Text LZO so far.
Avro only had a small buffer read overrun detected by ASAN.

Includes fixes for Parquet crashes found by the fuzzer, a small
buffer overrun in Avro, and a DCHECK in MemPool.

Initially it is only enabled for Avro, Parquet, and uncompressed
text. As follow-up work we should fix the bugs in the other scanners
and enable the test for them.

We also don't implement abort_on_error=0 correctly in Parquet:
for some file formats, corrupt headers result in the query being
aborted, so an exception will xfail the test.

Testing:
Ran the test with exploration_strategy=exhaustive in a loop locally
with both DEBUG and ASAN builds for a couple of days over a weekend.
Also ran exhaustive private build.

Change-Id: I50cf43195a7c582caa02c85ae400ea2256fa3a3b
Reviewed-on: http://gerrit.cloudera.org:8080/3833
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-08-11 08:42:41 +00:00
Alex Behm
286da59219 IMPALA-3940: Fix getting column stats through views.
The bug: During join ordering we rely on the column stats of
join predicates for estimating the join cardinality. We have code
that tries to find the stats of a column through views but there
was a bug in identifying slots that belong to base table scans.
The bug lead us to incorrectly accept slots of view references
which do not have stats.

This patch fixes the above issue and adds new test infrastructure
for creating test-local views. It adds a TPCH-equivalent database that
contains views of the form "select * from tpch_basetbl" for all TPCH
tables and add tests the plans of all TPCH queries on the view database.

Change-Id: Ie3b62a5e7e7d0e84850749108c13991647cedce6
Reviewed-on: http://gerrit.cloudera.org:8080/3865
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-08-11 08:22:30 +00:00
Dan Hecht
ffa7829b70 IMPALA-3918: Remove Cloudera copyrights and add ASF license header
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:

http://www.apache.org/legal/src-headers.html#headers

Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
   http://www.apache.org/legal/src-headers.html#notice
   to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
   on the website.

Much of this change was automatically generated via:

git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]

Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.

[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
    modification to ORIG_LICENSE to match Impala's license text.

Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-08-09 08:19:41 +00:00
Alex Behm
ac1215fd31 IMPALA-3861: Replace BetweenPredicates with their equivalent CompoundPredicate.
The bug: Our BetweenPredicate has a complex object structure that is unlike
most other Exprs because we generate an equivalent CompoundPredicate during
analysis and replace the original children. Keeping the various members in
sync and preserving the object structure during clone() and substitute() is
very difficult and error prone. In particular, subquery rewriting is
difficult because we extract and replace correlated BinaryPredicates.
Substituting BinaryPredicates in a BetweenPredicate's children is not
equivalent to a substitution on the BetweenPredicate's original children,
so keeping the various redundant members in sync is quite difficult.

The fix is to replace BetweenPredicates with their equivalent CompoundPredicates
before performing subquery rewrites.

We ultimately still want to fix clone() and substitute() for BetweenPredicates,
but an elegant solution is likely to be more involved.
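The equivalence used by the rewrite can be sketched in a few lines of Python. This is an illustrative stand-in only: the node classes and function names below are assumptions, not Impala's actual Java Expr classes.

```python
# Sketch of rewriting "x BETWEEN a AND b" into its compound equivalent.
# BinaryPred/CompoundPred are hypothetical stand-ins for Impala's Exprs.
from dataclasses import dataclass


@dataclass
class BinaryPred:
    op: str      # comparison operator, e.g. ">="
    left: str    # operand expressions, as placeholder strings here
    right: str


@dataclass
class CompoundPred:
    op: str              # "AND" or "OR"
    children: tuple


def rewrite_between(val, lower, upper, negated=False):
    """Replace a (NOT) BETWEEN predicate with an equivalent CompoundPredicate."""
    if negated:
        # x NOT BETWEEN a AND b  ==>  x < a OR x > b
        return CompoundPred("OR", (BinaryPred("<", val, lower),
                                   BinaryPred(">", val, upper)))
    # x BETWEEN a AND b  ==>  x >= a AND x <= b
    return CompoundPred("AND", (BinaryPred(">=", val, lower),
                                BinaryPred("<=", val, upper)))
```

Because the result is an ordinary compound predicate tree, clone() and substitute() need no special-casing for it.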

Change-Id: I0838b30444ed9704ce6a058d30718a24caa7444a
Reviewed-on: http://gerrit.cloudera.org:8080/3804
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-07-30 07:23:52 +00:00
Matthew Jacobs
b9f6392e84 IMPALA-3939: Data loading may fail on tpch kudu
Change-Id: I7f229a9f74fa4dceb14335914b0dde9bf607264e
Reviewed-on: http://gerrit.cloudera.org:8080/3818
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-07-30 03:59:29 +00:00
Bikramjeet Vig
36b4ea6f65 IMPALA-1683: Allow REFRESH on a single partition
Previously, the only way to refresh metadata for a partition was to refresh
the whole table. This is a relatively time-consuming process, especially if
there are many partitions and only one needs to be refreshed.
This patch allows the client to REFRESH on a single partition by using the
following syntax:
REFRESH [database_name.]table_name PARTITION (partition_spec)

Testing:
Added parsing and authorization tests in ParserTest.java and
AuthorizationTest.java respectively. A new test file
"test_refresh_partition.py" was added for testing functionality.

Performance:
For a table with 10000 partitions and 1 file per partition

                     execResetMetadata()       Total Execution Time

Refresh Table              3795 ms                   4630 ms

Refresh Partition            42 ms                    680 ms

We see that the time to refresh improves by a factor of about 90x, but due to
a significant overhead of about 640ms in this case, the effective improvement
is about 7x. As the size of the table and the number of partitions increase,
this improvement becomes more significant.

Change-Id: Ia9aa25d190ada367fbebaca47ae8b2cafbea16fb
Reviewed-on: http://gerrit.cloudera.org:8080/3813
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-07-29 23:57:50 +00:00
Alex Behm
c77fb628f7 IMPALA-3915: Register privilege and audit requests when analyzing resolved table refs.
The bug: We used to register privilege requests for table refs in TableRef.analyze()
which only got called for unresolved TableRefs. As a result, a reference to a view that
contains a subquery did not get properly authorized, explained as follows.
1. In the first analysis pass the view is replaced by an InlineViewRef and we
   correctly register an authorization request.
2. We rewrite the subquery via the StmtRewriter and wipe the analysis state, but
   preserve the InlineViewRef that replaces the view reference.
3. The rewritten statement is analyzed again, but since an InlineViewRef is
   considered to be resolved, we never call TableRef.analyze(), and hence
   never register an authorization event for the view.

The fix: We now register authorization and auditing events when calling analyze() on a
resolved TableRef (BaseTableRef, InlineViewRef, CollectionTableRef).

Change-Id: I18fa8af9a94ce190c5a3c29c3221c659a2ace659
Reviewed-on: http://gerrit.cloudera.org:8080/3783
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-07-29 05:16:23 +00:00
Dimitris Tsirogiannis
0e88f0d7aa Add TPC-H based planner tests for Kudu tables
This commit adds a set of planner tests for Kudu tables based on the 22
TPC-H queries.

Change-Id: I6c40534b72b9aa1ee582b9679c2a63cad52df703
Reviewed-on: http://gerrit.cloudera.org:8080/3790
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-07-29 02:49:50 +00:00
Tim Armstrong
c1d70f814e IMPALA-3227: generate test TPC data sets during data load
The generated data is identical to the pregenerated tpch.tar.gz
and tpcds.tar.gz data that were used previously and were not
publicly accessible.

This adds a "preload" hook to bin/load-data.py that can execute custom
logic for each data set. This is used to call the TPC-H and TPC-DS data
generation utilities that are already available in the Impala toolchain.

Testing:
Ran private test job with loading from snapshot disabled and without
the tpch/tpcds tarballs available.

Change-Id: Ieccfbd7d8d4a91bffddbe35abb7f5572e71a71cf
Reviewed-on: http://gerrit.cloudera.org:8080/3761
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-07-28 04:56:57 +00:00
Dimitris Tsirogiannis
6fbd35fa87 Enable TPC-H workload for Kudu tables
With this commit we enable loading of TPC-H data in Kudu tables and
running the 22 TPC-H queries against Kudu. Since Kudu doesn't support
the decimal data type, we had to modify the queries by using the round()
function and update the test results.

Change-Id: I3a5de71fefa92a78970226d8f49ef445d28f9289
Reviewed-on: http://gerrit.cloudera.org:8080/3789
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-07-28 04:35:11 +00:00
Alex Behm
55b43ba8c4 IMPALA-3084: Cache the sequence of table ref and materialized tuple ids during analysis.
The bug: For correct predicate assignment we rely on TableRef.getAllTupleIds()
and TableRef.getMaterializedTupleIds(). The implementation of those functions
used to traverse the chain of table refs and collect the appropriate ids.
However, during plan generation we alter the chain of table refs, in particular,
for dealing with nested collections, so those altered TableRefs do not return the
expected list of ids, leading to wrong decisions in predicate assignment.

The fix: Cache the lists of ids during analysis, so we are free to alter the
chain of TableRefs during plan generation.
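The fix's compute-once-at-analysis pattern can be sketched as follows. This is a simplified, hypothetical Python model, not Impala's real Java TableRef API.

```python
# Sketch: cache the tuple-id list while the table-ref chain is still intact,
# so later mutations of the chain during plan generation cannot change the
# answer. All names here are illustrative assumptions.
class TableRef:
    def __init__(self, tuple_id, left=None):
        self.tuple_id = tuple_id
        self.left = left              # previous table ref in the chain
        self._all_tuple_ids = None    # filled in by analyze()

    def analyze(self):
        # Traverse the (still intact) chain once and cache the result.
        ids = [] if self.left is None else list(self.left.all_tuple_ids())
        ids.append(self.tuple_id)
        self._all_tuple_ids = ids

    def all_tuple_ids(self):
        if self._all_tuple_ids is not None:
            return self._all_tuple_ids    # cached at analysis time
        # Fallback traversal: the old, fragile behavior.
        ids = [] if self.left is None else list(self.left.all_tuple_ids())
        return ids + [self.tuple_id]
```

With the cache in place, detaching or rewiring a ref after analysis no longer affects the ids reported for predicate assignment.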

Change-Id: I298b8695c9f26644a395ca9f0e86040e3f5f3846
Reviewed-on: http://gerrit.cloudera.org:8080/2415
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-07-22 22:35:18 -07:00
Michael Ho
276376acac IMPALA-3674: Lazy materialization of LLVM module bitcode.
Previously, each fragment using dynamic code generation would
parse the bitcode module and populate the LLVM data structures
for all the functions and their bodies in the bitcode module.
This is wasteful as we may only use a few functions out of all
the functions parsed. We rely on dead code elimination to
delete most of the unused functions so we won't waste time
compiling them.

This change implements lazy materialization of the functions'
bodies. On the initial parse of the bitcode module, we just
create the Function objects for each function in the module.
The functions' bodies will be materialized on demand from the
bitcode module when they are actually referenced in the query.
This ensures that the prepare time during codegen is proportional
to the number of IR functions referenced by the query instead
of being proportional to the total number of IR functions in
the module.
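The on-demand pattern can be roughly sketched in Python. The actual implementation operates on LLVM's C++ APIs; the class and method names below are assumptions for illustration only.

```python
# Sketch: create lightweight stubs for every function up front, but only
# load ("materialize") a body the first time the function is requested.
class LazyModule:
    def __init__(self, bitcode):
        # bitcode: dict of name -> body, a stand-in for a parsed LLVM module.
        self._bitcode = bitcode
        self._materialized = {}

    def get_function(self, name):
        if name not in self._materialized:
            # Materialize the body on first use only.
            self._materialized[name] = self._bitcode[name]
        return self._materialized[name]

    def num_materialized(self):
        # Proportional to functions the query referenced, not module size.
        return len(self._materialized)
```

A query touching 9 of the module's functions thus pays only for those 9 at prepare time.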

This change also stops cross-compiling BufferedTupleStream::GetTupleRow()
as there isn't much benefit for doing it. In addition, move the ctors
and dtors of LikePredicate to the header file to avoid an unnecessary
alias in the IR module.

For TPCH-Q2, a fragment which codegens only 9 functions used to spend
146ms in codegen. It now goes down to 35ms, a 76% reduction.

      CodeGen:(Total: 146.041ms, non-child: 146.041ms, % non-child: 100.00%)
         - CodegenTime: 0.000ns
         - CompileTime: 2.003ms
         - LoadTime: 0.000ns
         - ModuleBitcodeSize: 2.12 MB (2225304)
         - NumFunctions: 9 (9)
         - NumInstructions: 129 (129)
         - OptimizationTime: 29.019ms
         - PrepareTime: 114.651ms

      CodeGen:(Total: 35.288ms, non-child: 35.288ms, % non-child: 100.00%)
         - CodegenTime: 0.000ns
         - CompileTime: 1.880ms
         - LoadTime: 0.000ns
         - ModuleBitcodeSize: 2.12 MB (2221276)
         - NumFunctions: 9 (9)
         - NumInstructions: 129 (129)
         - OptimizationTime: 5.101ms
         - PrepareTime: 28.044ms

Change-Id: I6ed7862fc5e86005ecea83fa2ceb489e737d66b2
Reviewed-on: http://gerrit.cloudera.org:8080/3220
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-07-20 18:30:25 -07:00
Tim Armstrong
bc8c55afcd IMPALA-3729: batch_size=1 coverage for avro scanner
Also fix a stale comment in the avro scanner header.

The main work here is to fix the handling of empty result sets in the
test result verifier. This is a problem because we wanted to verify
that the results in the test file were a superset of the rows
returned, and this was thrown off by superflous '' rows in the expected
and actual result sets.

The basic problem is that the way test file sections
were parsed conflated an empty result section with a non-empty result
section that had a single empty string. I.e.:

---- RESULTS
====

vs
---- RESULTS

====

both got resolved to [''].
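A minimal sketch of a parser that keeps the two cases apart (a hypothetical helper, not the actual test-framework code):

```python
def split_results_section(text):
    """Return the rows between '---- RESULTS' and '===='.

    Returns [] when nothing appears between the markers, and [''] when a
    single empty line does, instead of conflating both cases as [''].
    Sketch only; the real verifier lives in Impala's Python test framework.
    """
    lines = text.splitlines()
    start = lines.index("---- RESULTS") + 1
    end = lines.index("====")
    return lines[start:end]
```

With this distinction, an empty expected section no longer matches a result set containing one blank row.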

Change-Id: Ia007e558d92c7e4ce30be90446fdbb1f50a0ebc4
Reviewed-on: http://gerrit.cloudera.org:8080/3413
Tested-by: Internal Jenkins
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2016-07-19 23:30:02 -07:00
Thomas Tauber-Marshall
343bdad866 IMPALA-3210: last/first_value() support for IGNORE NULLS
Added support for the 'ignore nulls' keyword to the last_value and
first_value analytic functions, e.g. 'last_value(col ignore nulls)',
which would return the last value from the window that is not null,
or null if all of the values in the window are null.

We handle 'ignore nulls' in the FE in the same way that we handle
'distinct' - by adding isIgnoreNulls as a field in FunctionParams.

To avoid affecting performance when 'ignore nulls' is not used, and
to avoid having to special case 'ignore nulls' on the backend, this
patch adds 'last_value_ignore_nulls' and 'first_value_ignore_nulls'
builtin analytic functions that wrap 'last_value' and 'first_value'
respectively.

Change-Id: Ic27525e2237fb54318549d2674f1610884208e9b
Reviewed-on: http://gerrit.cloudera.org:8080/3328
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Internal Jenkins
2016-07-18 08:28:09 -07:00
Alex Behm
45740c8bcc IMPALA-3678: Fix migration of predicates into union operands with an order by + limit.
There were two separate issues:

First, the SortNode incorrectly picked up unassigned conjuncts, and expected those to
be empty. In this case where predicates are migrated into union operands, there could
actually be unassigned conjuncts bound by the SortNode's tuple id (and so would be
incorrectly picked up). The fix is to not pick up unassigned conjuncts in the SortNode,
and allow them to be picked up later (into a SelectNode).

Second, when generating the plan for union operands we were missing a call to
graft a SelectNode on top of the operand plan to capture unassigned conjuncts.

Change-Id: I95d105ac15a3dc975e52dfd418890e13f912dfce
Reviewed-on: http://gerrit.cloudera.org:8080/3600
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2016-07-15 18:27:05 +00:00
Michael Ho
f129dfd202 IMPALA-3018: Don't return NULL on zero length allocations.
FunctionContext::Allocate() and FunctionContext::AllocateLocal()
used to return NULL for zero length allocations. This makes
it hard to distinguish between allocation failures and zero-length
allocations. Such confusion may lead to a DCHECK failure
in the macro RETURN_IF_NULL() in debug builds or access to NULL
pointers in non-debug builds.

This change fixes the problem above by returning NULL only if
there is an allocation failure. Zero-length allocations will always
return a dummy non-NULL pointer.
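The resulting contract can be illustrated with a short Python sketch. The real API is the C++ FunctionContext::Allocate(); the names and budget mechanism below are assumptions for illustration.

```python
# A shared dummy object handed out for zero-length requests, so callers
# can treat None unambiguously as "allocation failed".
_ZERO_LEN_SENTINEL = bytearray(0)


def allocate(pool, size, budget):
    """Sketch: return None only on failure, never for a 0-byte request."""
    if size == 0:
        return _ZERO_LEN_SENTINEL    # dummy non-None "pointer"
    if size > budget:
        return None                  # genuine allocation failure
    buf = bytearray(size)
    pool.append(buf)                 # track the allocation
    return buf
```

A caller's NULL check (here, an `is None` check) now fires only on real failures, matching what RETURN_IF_NULL() expects.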

Change-Id: Id8c3211f4d9417f44b8018ccc58ae182682693da
Reviewed-on: http://gerrit.cloudera.org:8080/3601
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-07-14 19:04:45 +00:00