Enforces that the planner treats IS NOT DISTINCT FROM as eligible for
hash joins, but does not find the minimum spanning tree of
equivalences for use in optimizing query plans; this is left as future
work.
Change-Id: I62c5300b1fbd764796116f95efe36573eed4c8d0
Reviewed-on: http://gerrit.cloudera.org:8080/710
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
Various test scripts operating on postgres databases output
unhelpful log messages, including "ERROR" messages that
aren't actual errors when trying to drop a database that doesn't exist.
Send useless output to /dev/null and consistently use || true to
ignore errors from dropdb.
Change-Id: I95f123a8e8cc083bf4eb81fe1199be74a64180f5
Reviewed-on: http://gerrit.cloudera.org:8080/1753
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
input_stream_ is now set to NULL on eos, but execution can
continue through the function to l753 which touches
input_stream_ without checking for NULL. This occurs
occasionally when the transfer of memory from the mem pool
to the output row batch occurs at eos.
Change-Id: I8ca88ef10d48e19cfde7f3c6de9512eefcae561e
Reviewed-on: http://gerrit.cloudera.org:8080/1757
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
Maven was complaining that the source encoding was not set, and that the
version of a plugin was not specified.
Change-Id: I2bc6bbe95fc71575aeec5b6969cc869794309a49
Reviewed-on: http://gerrit.cloudera.org:8080/1741
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
Fixes tests in show.test that execute 'show files' on the
'insert_string_partitioned' table in our functional db. The expected
output relied on modifications to 'insert_string_partitioned' made
in test_insert.py. Tests should not rely on the overall ordering
of test execution.
This patch also fixes 'show files' to produce a consistently ordered
output.
Change-Id: Ic736b94b70677b0e3f4f8a9838ffdfdde2ba17ab
Reviewed-on: http://gerrit.cloudera.org:8080/1748
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Some of the tests rely on the hdfs trash mechanism being enabled and poll
the paths in the trash directory during test runs. These tests are
failing intermittently due to a race with the hdfs trash checkpointing
mechanism which moves all the trash contents to another directory.
This checkpointing runs every fs.trash.checkpoint.interval minutes
and defaults to fs.trash.interval (when set to 0). Currently there
seems to be no way to disable this checkpointing. This patch increases
the fs.trash.interval from the current value of 30 minutes to 24 hours
so that the test runs never hit this race condition.
Change-Id: I42fcaee70a461712f1df6bac23c71f915718b015
Reviewed-on: http://gerrit.cloudera.org:8080/1703
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Internal Jenkins
The problem was that a recently added planner test was assuming stats
were computed for functional_hbase.alltypes, but we purposely do not compute
stats for that table. The fix is to use the alltypessmall table instead.
The modified test still covers the same issue.
Change-Id: I043c485489c7868b4f320048eb383943627f620b
Reviewed-on: http://gerrit.cloudera.org:8080/1705
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
The original purpose of the escapechartesttable was to test
Impala's behavior on text tables that have the same character
as line terminator and escape character. Recent changes in
Hive have made creating such a table impossible because
1) Only newline is allowed as the line terminator
2) Newline is forbidden as the escape character
See HIVE-11785 for details on the Hive changes.
This commit removes escapechartesttable and all associated
tests, but does not add the same enforcement rules as Hive.
These enforcement rules should be added in a follow-on change.
Change-Id: I2bd9755f4c2cc3d7dfd8d67c3759885951550f08
Reviewed-on: http://gerrit.cloudera.org:8080/1690
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
This patch fixes the comment style in the queries to work properly with
the limitations of the run-workload.py script. This includes removing
quotes and + from comments that otherwise get interpreted.
Change-Id: I791e7bd4145717aa0628c56b93582cd207195039
Reviewed-on: http://gerrit.cloudera.org:8080/1689
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
The change to the start script for OSX used "find" with the "-perm
+0111" option as an "executables only" filter but that doesn't work
with newer versions of "find". "-perm +" has been deprecated or removed
(depending on the version) in Linux. I couldn't find an OSX+Linux
compatible filter.
The variable IS_OSX was added and used to choose the appropriate filter.
Change-Id: I0c49f78e816147c820ec539cfc398fb77b83307a
Reviewed-on: http://gerrit.cloudera.org:8080/1630
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
This commit fixes a bug where catalog incorrectly converts all sentry
authorizables to lower case. Since hdfs URIs are case sensitive,
this bug can result in incorrect grants when the URIs have uppercase
letters.
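A minimal sketch of the intended behavior, assuming (this is not stated
above) that non-URI authorizables such as database and table names are
still folded to lower case. The function name and the kind strings are
hypothetical and are not Sentry or catalog APIs.

```python
def normalize_authorizable(kind, name):
    """Return the form of an authorizable name used when comparing grants.

    Assumption: only URI authorizables are case sensitive; every other
    kind is still folded to lower case.
    """
    if kind.upper() == "URI":
        return name  # HDFS paths are case sensitive, so keep them verbatim
    return name.lower()
```

With this shape, a grant on hdfs://nn/Data/Part stays distinct from a
grant on hdfs://nn/data/part.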
Change-Id: I642c34d9046729dc904cc45871c7d7959ae828bc
Reviewed-on: http://gerrit.cloudera.org:8080/1675
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Internal Jenkins
Tests were tuned to run on a 9x node cluster with 64GB RAM against TPC-H 300GB database.
Change-Id: Ib421bcd463d370f795a235b755aeb24a6a70f705
Reviewed-on: http://gerrit.cloudera.org:8080/1394
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
This commit fixes the issue where queries with outer joins and case expressions
in predicates can fail with an AnalysisException. This is due to the method
isTrueWithNullSlots() not preserving the return types of sub-expressions during
substitution with null literals and the resulting predicate can fail to analyze
even though the original predicate succeeds. The fix is to preserve the type of
each slot in the predicate that we substitute with the null literal.
Change-Id: I8cd827b460620355db6fd518464418e701a724f1
Reviewed-on: http://gerrit.cloudera.org:8080/1656
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Internal Jenkins
FunctionContext::Allocate(), FunctionContextImpl::AllocateLocal()
and FunctionContext::Reallocate() allocate memory without taking
memory limits into account. The problem is that these functions
invoke FreePool::Allocate() which may call MemPool::Allocate()
that doesn't check against the memory limits. This patch fixes
the problem by making these FunctionContext functions check for
memory limits and set an error in the FunctionContext object if
memory limits are exceeded.
An alternative would be for these functions to call
MemPool::TryAllocate() instead and return NULL if memory limits
are exceeded. However, this may break some existing external
UDAs which don't check for allocation failures, leading to
unexpected crashes of Impala. Therefore, we stick with this
ad hoc approach until the UDF/UDA interfaces are updated in
the future releases.
Callers of these FunctionContext functions are also updated to
handle potential failed allocations instead of operating on
NULL pointers. The query status is polled at various
locations so that the query is terminated when an error is set.
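The approach described above can be sketched roughly as follows. This is
a toy model, not the Impala code: the class, its fields, and run_rows()
are hypothetical, and it assumes the allocation itself still succeeds
when the limit is exceeded, so that only the error flag signals the
problem.

```python
class FunctionContext:
    """Toy stand-in for a UDF context that tracks memory against a limit."""

    def __init__(self, mem_limit):
        self.mem_limit = mem_limit
        self.allocated = 0
        self.error = None

    def allocate(self, nbytes):
        # The buffer is still returned, so callers that never check for
        # failure do not end up operating on a null pointer.
        buf = bytearray(nbytes)
        self.allocated += nbytes
        if self.allocated > self.mem_limit and self.error is None:
            self.error = "memory limit exceeded"
        return buf


def run_rows(ctx, row_sizes):
    """Process rows, polling the context for an error after each allocation."""
    processed = 0
    for size in row_sizes:
        ctx.allocate(size)
        if ctx.error is not None:
            return processed, ctx.error  # stop the query here
        processed += 1
    return processed, None
```

The caller, not the allocator, decides when to stop, which mirrors the
polling of the query status described above.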
This patch also fixes MemPool to handle the case in which malloc
may return NULL. It propagates the failure to the callers instead
of continuing to run with NULL pointers. In addition, errors during
aggregate functions' initialization are now properly propagated.
Change-Id: Icefda795cd685e5d0d8a518cbadd37f02ea5e733
Reviewed-on: http://gerrit.cloudera.org:8080/1445
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
Until now, our YARN configuration was broken so that we weren't able to
run local Map Reduce jobs. The jobs would fail with a class not found
exception of the LZO codec. This patch fixes this issue and corrects
the classpath.
Change-Id: I689cca7a079dbd269d4bd96f1b4e3d91147d527c
Reviewed-on: http://gerrit.cloudera.org:8080/1667
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
Changes:
1) Consistently use "set -euo pipefail".
2) When an error happens, print the file and line.
3) Consolidated some of the kill scripts.
4) Added better error messages to the load data script.
5) Changed use of #!/bin/sh to bash.
Change-Id: I14fef66c46c1b4461859382ba3fd0dee0fbcdce1
Reviewed-on: http://gerrit.cloudera.org:8080/1620
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
This is for compatibility with docker containers. Before this patch,
when the scripts were run on the docker host, the scripts would try
to kill the mini-cluster in the docker containers and fail because they
didn't have permissions (the user is different). Now the scripts will
only try to kill mini-cluster processes that were started by the current
user.
Also some psutil availability checks were removed because psutil is now
provided by the python virtualenv.
Change-Id: Ida371797bbaffd0a3bd84ab353cb9f466ca510fd
Reviewed-on: http://gerrit.cloudera.org:8080/1541
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
SHOW CREATE TABLE now supports views. It returns a CREATE VIEW statement
with column names and the original sql statement.
Authorization allows SHOW CREATE TABLE to be run on a view if the user has
VIEW_METADATA privilege on the view and SELECT privilege on all
underlying views and tables.
E.g. "SHOW CREATE TABLE some_view" returns output of form:
CREATE VIEW a_database.some_view (id, bool_col, tinyint_col) AS
SELECT id, bool_col, tinyint_col FROM functional.alltypes
Change-Id: Id633af2f5c1f5b0e01c13ed85c4bf9c045dc0666
Reviewed-on: http://gerrit.cloudera.org:8080/713
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Remove all use of deprecated HBase APIs and bring everything up to
1.0. This work is done to enable use of the Google Cloud Bigtable
HBase client, which does not support the deprecated APIs. However,
nothing in this change depends on Cloud Bigtable, and should work
fine for HBase 1.x and greater.
This involves two major changes.
1) HBase is trying to move away from unmanaged connections. Thus,
catalogd and the backend are updated to create a Connection object
which is used to create the Table client objects. Since the
Connection object owns a threadpool, there is no need to create a
separate ExecutorService and pass it to the HTables on creation.
2) Instead of reading the size on disk of the different Region
servers, we use a single call to the HBase master to get the
ClusterStatus, which contains the storefile sizes for all the region
servers the master knows about. This enables the HBase table
implementation to work with Cloud Bigtable, which of course does not
keep its data in HDFS.
To use the Cloud Bigtable driver, simply update hbase-site.xml to set
the "hbase.client.connection.impl" property to the appropriate Cloud
Bigtable client implementation class. Further details can be found
here:
https://cloud.google.com/bigtable/docs/connecting-hbase
Tested by running:
py.test tests/query_test/test_hbase_queries.py \
--exploration_strategy=exhaustive
Change-Id: I6c758502126884670bb6dd3153aea5aa5b41aab6
Reviewed-on: http://gerrit.cloudera.org:8080/775
Readability: Alex Behm <alex.behm@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
admin was using the -executable flag of find that is not available on
Mac. This patch replaces it with "-perm +0111 -type f", which has similar
semantics. In addition, there seem to be differences in which shell
builtins are available so some changes have been made to fix that issue.
Change-Id: I9b2ecbd5bf6a9b1610e7ca9f15b1a4d1407b94c1
Reviewed-on: http://gerrit.cloudera.org:8080/1612
Reviewed-by: Casey Ching <casey@cloudera.com>
Readability: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
Allow Impala to start only with a running HMS (and no additional services like HDFS,
HBase, Hive, YARN) and use the local file system.
Skip all tests that need these services, use HDFS caching or assume that multiple impalads
are running.
To run Impala with the local filesystem, set TARGET_FILESYSTEM to 'local' and
WAREHOUSE_LOCATION_PREFIX to a location on the local filesystem where the current user has
permissions since this is the location where the test data will be extracted.
Test coverage (with core strategy) in comparison with HDFS and S3:
HDFS 1348 tests passed
S3 1157 tests passed
Local Filesystem 1161 tests passed
Change-Id: Ic9718c7e0307273382b1cc6baf203ff2fb2acd03
Reviewed-on: http://gerrit.cloudera.org:8080/1352
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Readability: Alex Behm <alex.behm@cloudera.com>
As part of this change, refactor catalog and frontend functions to return
TDatabase/Db objects instead of just the string names of databases -
this required a lot of method/variable renamings.
Add test for creating database with comment. Modify existing tests
that assumed only a single column in SHOW DATABASES results.
Change-Id: I400e99b0aa60df24e7f051040074e2ab184163bf
Reviewed-on: http://gerrit.cloudera.org:8080/620
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
When building hash tables for the build side in partitioned
hash join or aggregation, we will evaluate the build or probe
side expressions to compute the hash values for each TupleRow.
Evaluation of certain expressions (e.g. CastToChar) requires
"local" memory allocation. "Local" memory allocation is supposed
to be freed after processing each row batch.
However, the calls to free local allocations are missing in
PartitionedHashJoinNode::BuildHashTableInternal() and
PartitionedAggregationNode::ProcessStream(). This causes all
"local" memory allocation to accumulate potentially for the
entire duration of the query or until GetNext() is called.
This may lead to unnecessary memory allocation failures when the
memory limit is exceeded.
This patch calls ExecNode::FreeLocalAllocations() at least once
per row-batch when building hash tables. It also adds the missing
checks for the query status in the loop building hash tables.
Please note that QueryMaintenance() isn't called due to its
overhead in memory limit checks.
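A rough sketch of the loop shape described above: free "local"
allocations once per row batch and check the query status in the same
place. LocalAllocator, build_hash_table() and the query_status callback
are hypothetical names for illustration, not Impala interfaces.

```python
class LocalAllocator:
    """Counts outstanding "local" allocations made during expr evaluation."""

    def __init__(self):
        self.outstanding = 0
        self.peak = 0

    def allocate(self):
        self.outstanding += 1
        self.peak = max(self.peak, self.outstanding)

    def free_local_allocations(self):
        self.outstanding = 0


def build_hash_table(batches, allocator, query_status=lambda: None):
    """Insert (key, row) pairs batch by batch, freeing scratch memory per batch."""
    table = {}
    for batch in batches:
        for key, row in batch:
            allocator.allocate()  # e.g. a cast that needs scratch memory
            table.setdefault(key, []).append(row)
        # Without this call the scratch memory would pile up across batches.
        allocator.free_local_allocations()
        status = query_status()
        if status is not None:
            return table, status  # propagate the error to the caller
    return table, None
```

In this model the peak number of outstanding allocations is bounded by
the largest batch rather than by the whole build side.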
Change-Id: Idbeab043a45b0aaf6b6a8c560882bd1474a1216d
Reviewed-on: http://gerrit.cloudera.org:8080/1448
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
The bug: Our predicate assignment logic used to rely on a flag isWhereClauseConjunct
set in Exprs to determine whether assigning a predicate at a certain plan node
was correct or not. This flag was intended to distinguish between predicates from
the On-clause of a join and other predicates, and the bug was that we used
!isWhereClauseConjunct to imply that a predicate originated from an On-clause.
For example, predicates migrated into inline views apply to the post-join,
post-aggregation, post-analytic result of the inline view, but the existing
flag and logic were insufficient to correctly assign predicates coming
from the On-clause of an enclosing block.
This patch removes the isWhereClauseConjunct flag in favor of an isOnClauseConjunct
flag which can more directly capture the originally intended logic.
This patch also adds a test to cover the same issue reported in IMPALA-2665.
Change-Id: I9ff9086b8a4e0bd2090bf88319cb917245ae73d2
Reviewed-on: http://gerrit.cloudera.org:8080/1453
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
This patch reduces memory usage of scanners by adjusting how batch
capacity is checked and handled and by freeing unneeded memory.
Change RowBatch::AtCapacity(MemPool) so that batches with no rows
cannot hold onto an unbounded amount of memory. Instead, such
batches are passed up the operator tree so that the resources
can be freed.
The Parquet scanner also only checked capacity every 1024 rows.
With large rows (e.g. nested collections), it can overrun the
intended 8mb limit. It also didn't include the MemPool usage
in its checks. After the change the scanner will produce smaller
batches if rows contain large nested collections or strings.
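The capacity check can be pictured with the following sketch. It is an
illustration only: the field names are hypothetical, the 8 MB default is
taken from the limit mentioned above, and the exact condition used in
RowBatch::AtCapacity() may differ.

```python
class RowBatch:
    """Toy row batch that tracks row count and attached memory."""

    def __init__(self, capacity, mem_limit_bytes=8 * 1024 * 1024):
        self.capacity = capacity
        self.mem_limit_bytes = mem_limit_bytes
        self.num_rows = 0
        self.attached_bytes = 0

    def at_capacity(self, pool_bytes=0):
        """True when the batch should be handed to the parent operator."""
        if self.num_rows >= self.capacity:
            return True
        # Memory is checked even when num_rows == 0, so an empty batch
        # cannot keep accumulating memory without bound.
        return self.attached_bytes + pool_bytes >= self.mem_limit_bytes
```

Passing the MemPool usage in as pool_bytes corresponds to including the
pool in the check, as described for the Parquet scanner.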
I benchmarked this with a scan of the nested TPC-H customers
tables. The row batch size decreased from ~16MB to ~8MB. If the
nested collections were larger this would be more drastic.
Also pass at capacity up the tree if no rows passed the conjuncts in
the DataSourceScanNode and Parquet scanner so that resources can be
freed.
HdfsTableSink is modified to avoid the incorrect assumption that a batch
only has 0 rows at eos. It is also refactored to pass a related flag as
an argument to make the semantics clearer.
Two simple benchmarks (one column and many columns) show no change
in scanner performance:
> set num_scanner_threads=1;
> select count(l_orderkey) from biglineitem;
> select count(l_orderkey), count(l_partkey), count(l_suppkey),
count(l_returnflag), count(l_quantity), count(l_linenumber),
count(l_extendedprice), count(l_linestatus), count(l_shipdate),
count(l_commitdate) from biglineitem;
Change-Id: I3b79671ffd3af50a2dc20c643b06cc353ba13503
Reviewed-on: http://gerrit.cloudera.org:8080/1239
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
This commit implements DESCRIBE DATABASE [FORMATTED|EXTENDED] <db_name>.
Without FORMATTED|EXTENDED this statement only prints the database's location and
comment. With FORMATTED|EXTENDED it will output all the properties of the database
(e.g. OWNER and PARAMETERS). Currently we only retrieve privileges stored in
hive metastore.
This commit also implements DESCRIBE EXTENDED <table>, which is the same
as DESCRIBE FORMATTED <table>, for consistency.
Change-Id: I2a101ec0e3d27b344fcb521eb00e5bdbcbac8986
Reviewed-on: http://gerrit.cloudera.org:8080/804
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
Status was ignored by the callers of CreateMerger(), which caused the
sorter to continue in a bad state, leading to this DCHECK.
Change-Id: I1aae11d142cdbf3923289768e296dbf340f049e9
Reviewed-on: http://gerrit.cloudera.org:8080/1367
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
This change adds more analysis checks to verify the location
of the table or partition specified in a "CREATE/ALTER TABLE
... CACHED IN ..." statement can actually be cached. Caching
is only supported for HDFS locations.
If table-wide caching is enabled for a table, adding a partition
at an uncacheable location will be disallowed for that table
unless the attribute "UNCACHED" is explicitly specified.
Enabling table-wide caching for a table at an uncacheable
location or a table with partitions at uncacheable locations
will also be disallowed. However, caching can still be enabled
for individual partitions whose underlying locations
support caching.
Change-Id: I2299c9285126f4b035360f2ef902147188ccd5f1
Reviewed-on: http://gerrit.cloudera.org:8080/1373
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
This commit adds query endtime to impalad's lineage log entries consumed
by navigator. The lineage graph is constructed in the frontend and is then
passed to the backend as a serialized thrift object. When the query terminates
(includes cancellations and aborts), the backend appends the query endtime
("endTime") to the lineage graph and generates the lineage log entry in
JSON format.
Change-Id: I2236e98895ae9a159ad6e78b0e18e3622fdc3306
Reviewed-on: http://gerrit.cloudera.org:8080/934
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Internal Jenkins
There was an incorrect DCHECK in the parquet scanner. If abort_on_error
is false, the intended behaviour is to skip to the next row group, but
the DCHECK assumed that execution should have aborted if a parse error
was encountered.
This also:
- Fixes a DCHECK after an empty row group. InitColumns() would try to
create empty scan ranges for the column readers.
- Uses metadata_range_->file() instead of stream_->filename() in the
scanner. InitColumns() was using stream_->filename() in error
messages, which used to work but now stream_ is set to NULL before
calling InitColumns().
Change-Id: I8e29e4c0c268c119e1583f16bd6cf7cd59591701
Reviewed-on: http://gerrit.cloudera.org:8080/1257
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
This only enforces the defaults in Hive. Users who manually choose to
change the schema in Hive may unnecessarily trigger the new analysis
exceptions added in this commit. The Hive issue tracking the length
restrictions is
https://issues.apache.org/jira/browse/HIVE-9815
Change-Id: Ia30f286193fe63e51a10f0c19f12b848c4b02f34
Reviewed-on: http://gerrit.cloudera.org:8080/721
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
This patch implements a new built-in function
regexp_match_count. This function returns the number of
matching occurrences in input.
The regexp_match_count() function has the following syntax:
int = regexp_match_count(string input, string pattern)
int = regexp_match_count(string input, string pattern,
int start_pos, string flags)
The input value specifies the string on which the regular
expression is processed.
The pattern value specifies the regular expression.
The start_pos value specifies the character position
at which to start the search for a match. It is set
to 1 by default if it's not specified.
The flags value (if specified) dictates the behavior of
the regular expression matcher:
m: Specifies that the input data might contain more than
one line so that the '^' and the '$' matches should take
that into account.
i: Specifies that the regex matcher is case insensitive.
c: Specifies that the regex matcher is case sensitive.
n: Specifies that the '.' character matches newlines.
By default, the flag value is set to 'c'. Note that the
flags are consistent with other existing built-in functions
(e.g. regexp_like) so certain flags in IBM netezza such as
's' are not supported to avoid confusion.
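The semantics above can be approximated with the sketch below. It uses
Python's re module rather than the regex library used by the backend, so
pattern syntax may differ in corner cases. It also assumes that matches
are counted without overlap and that the later of 'i' and 'c' wins when
both are given; neither detail is stated above.

```python
import re


def regexp_match_count(text, pattern, start_pos=1, flags="c"):
    """Count non-overlapping matches of pattern in text.

    start_pos is 1-based, as in the description above.
    """
    if start_pos < 1:
        raise ValueError("start_pos is 1-based and must be >= 1")
    ignore_case = False
    re_flags = 0
    for flag in flags:
        if flag == "m":
            re_flags |= re.MULTILINE  # '^' and '$' match at line boundaries
        elif flag == "n":
            re_flags |= re.DOTALL  # '.' also matches newlines
        elif flag == "i":
            ignore_case = True
        elif flag == "c":
            ignore_case = False
        else:
            raise ValueError("unsupported flag: %r" % flag)
    if ignore_case:
        re_flags |= re.IGNORECASE
    compiled = re.compile(pattern, re_flags)
    # finditer() takes a 0-based position at which to start searching.
    return sum(1 for _ in compiled.finditer(text, start_pos - 1))
```

For example, regexp_match_count("aXbxcX", "x", 1, "i") counts all three
letters, while the default case-sensitive call counts only the lower
case one.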
Change-Id: Ib33ece0448f78e6a60bf215640f11b5049e47bb5
Reviewed-on: http://gerrit.cloudera.org:8080/1248
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
This patch extends the SHOW statement to also support
user-defined functions and user-defined aggregate functions.
The syntax of the new SHOW statements is as follows:
SHOW CREATE [AGGREGATE] FUNCTION [<db_name>.]<func_name>;
<db_name> and <func_name> are the names of the database
and udf/uda respectively.
Sample outputs of the new SHOW statements are as follows:
Query: show create function fn
+------------------------------------------------------------------+
| result |
+------------------------------------------------------------------+
| CREATE FUNCTION default.fn() |
| RETURNS INT |
| LOCATION 'hdfs://localhost:20500/test-warehouse/libTestUdfs.so' |
| SYMBOL='_Z2FnPN10impala_udf15FunctionContextE' |
| |
+------------------------------------------------------------------+
Query: show create aggregate function agg_fn
+------------------------------------------------------------------------------------------+
| result |
+------------------------------------------------------------------------------------------+
| CREATE AGGREGATE FUNCTION default.agg_fn(INT) |
| RETURNS BIGINT |
| LOCATION 'hdfs://localhost:20500/test-warehouse/libudasample.so' |
| UPDATE_FN='_Z11CountUpdatePN10impala_udf15FunctionContextERKNS_6IntValEPNS_9BigIntValE' |
| INIT_FN='_Z9CountInitPN10impala_udf15FunctionContextEPNS_9BigIntValE' |
| MERGE_FN='_Z10CountMergePN10impala_udf15FunctionContextERKNS_9BigIntValEPS2_' |
| FINALIZE_FN='_Z13CountFinalizePN10impala_udf15FunctionContextERKNS_9BigIntValE' |
| |
+------------------------------------------------------------------------------------------+
Please note that all the overloaded functions which match
the given function name and category will be printed.
This patch also extends the python test infrastructure to
support expected results which include newline characters.
A new subsection comment called 'MULTI_LINE' has been added
for the 'RESULT' section. With this comment, a test can
include its multi-line output inside [ ] and the content
inside [ ] will be treated as a single line, including the
newline character.
Change-Id: Idbe433eeaf5e24ed55c31d905fea2a6160c46011
Reviewed-on: http://gerrit.cloudera.org:8080/1271
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
Loading functional data may fail if table_no_newline_part
partitions already exist. We should only ADD with IF NOT
EXISTS to handle this case as we do with other tables.
Change-Id: I5fe5c318d2cbbd5b5419394212b94a2fe7d386ce
Reviewed-on: http://gerrit.cloudera.org:8080/1261
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
When scanning tables with collection types the limit applied to the scan
node is not sufficient to safely enable the small-query
optimization. This patch adds an additional check to the
MaxRowsProcessedVisitor that will abort checking the number of processed
rows once a scan node accesses a collection type.
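A simplified sketch of the idea, not the planner's actual visitor: the
classes and fields below are hypothetical, and treating a scan without a
limit as "unknown" is an assumption made to keep the example small.

```python
class ScanNode:
    """Toy scan node with an optional row limit."""

    def __init__(self, limit=None, has_collections=False):
        self.limit = limit
        self.has_collections = has_collections


class MaxRowsVisitorSketch:
    """Tracks the largest per-node row bound, or gives up."""

    def __init__(self):
        self.abort = False
        self.max_rows = 0

    def visit(self, node):
        if self.abort:
            return
        # A limit on the scan does not bound the number of nested
        # collection items, so stop estimating for such nodes.
        if node.has_collections or node.limit is None:
            self.abort = True
            return
        self.max_rows = max(self.max_rows, node.limit)

    def result(self):
        """Return -1 when no safe bound is known."""
        return -1 if self.abort else self.max_rows
```

A caller would enable the small-query optimization only when result()
is not -1 and is below its threshold.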
Change-Id: Ic43baf3f97acfb8d7b53b0591c215046179d18b3
Reviewed-on: http://gerrit.cloudera.org:8080/1235
Reviewed-by: Silvius Rus <srus@cloudera.com>
Tested-by: Internal Jenkins
This commit adds PURGE option to DROP TABLE/ALTER TABLE DROP
PARTITION statements. Following is the usage:
1. DROP TABLE <tablename> takes an optional argument PURGE. Adding
purge purges the table data by skipping trash, if configured.
DROP TABLE [<database>.]<tablename> [IF EXISTS] [PURGE]
2. PURGE is also supported with alter table drop partition query
with the following syntax. If specified, impala purges the partition
data by skipping trash.
ALTER TABLE [<database>.]<tablename> DROP PARTITION [IF EXISTS] [PURGE]
This patch also helps the use case where Trash and the data directories
are in different encryption zones, in which case we cannot move the data
during ALTER/DROP. The PURGE option can then be used to skip the trash and
make sure data is actually deleted.
Change-Id: I64bf71d660b719896c32e0f3a7ab768f30ec7b3b
(cherry picked from commit 585d4f8d9e809f3bf194018dd161a22d3f144270)
Reviewed-on: http://gerrit.cloudera.org:8080/1244
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
multi-line text
This patch fixes the option passed to the RE2 regex matcher
so that it will count the newline character '\n' as a valid
candidate for '.'. Previously, that option was set to false
by default, causing multi-line text to fail to match against
patterns with wildcard characters in it. This patch also adds
some tests to address these cases and fixes some typos in
like-predicate.h.
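The effect of the option can be illustrated with Python's re module,
where re.DOTALL plays the role of the regex matcher's "dot matches
newline" setting. The helper below is a hypothetical, simplified LIKE
translator (it ignores escape characters) and is not the code in
like-predicate.h.

```python
import re


def like_matches(value, like_pattern, dot_matches_newline=True):
    """Evaluate a LIKE pattern by translating it to a regular expression."""
    parts = []
    for ch in like_pattern:
        if ch == "%":
            parts.append(".*")  # any run of characters
        elif ch == "_":
            parts.append(".")  # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    flags = re.DOTALL if dot_matches_newline else 0
    return re.fullmatch("".join(parts), value, flags) is not None
```

With dot_matches_newline=False, like_matches("line1\nline2", "line1%")
is False because '.' stops at the newline, which is the kind of failure
described above.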
Change-Id: I25367623f587bf151e4c87cc7cb6aec3cd57e41a
Reviewed-on: http://gerrit.cloudera.org:8080/1172
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
This fixes a regression introduced with:
IMPALA-1621,2241,2271,2330,2352: Lazy switch to IO buffers to reduce
min mem needed for PAGG/PHJ
Prior to that change, as soon as any partition's stream overflowed
its small buffers, all partitions' streams would be switched
immediately to IO-buffers, which would be satisfied by the initial
buffer "reservation".
After that change, individual streams are switched to IO-buffers on
demand as they overflow their small buffers. However, that change
also made it so that Partition::Spill() would eagerly switch that
partition's streams to IO-buffers, and fail the query if the buffer
is not available. The buffer may not be available because the
reserved buffers may be in use by other partition's streams.
We don't need to fail the query if the switch to IO-buffers in
Partition::Spill() fails. Instead, we should just let the streams
switch on demand as they fill up the small buffers. When that
happens, if the IO buffer is not available, then we already have a
mechanism to pick partitions to spill until we can get the IO-buffer
(in the worst case it means working our way back down to the initial
reservation). See AppendRowStreamFull() and BuildHashTables().
The symptom of this regression was that some queries would fail at a
lower memory limit than before.
Also revert the max_block_mgr_memory values back to their originals.
Additional testing: loop custom_cluster/spilling.py. We should also
remeasure minimum memory required by queries after this change.
Change-Id: I11add15540606d42cd64f2af99f4e96140ae8bb5
Reviewed-on: http://gerrit.cloudera.org:8080/1228
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
Failure due to an issue with NULL tuples (IMPALA-2375)
where NULL tuples come from the right side of a left
outer join where the right side comes from an inline view
which produces 0 slots (e.g. the view selects a constant).
The HJ doesn't handle them correctly because the planner
inserts an IsTupleNull expr. This isn't an issue for the
PHJ because the BufferedTupleStream returns non-NULL Tuple*
ptrs even for tuples with no slots.
Per IMPALA-2375, we're going to address this after 2.3, so
moving this test case into joins-partitioned.test which only
runs on the PHJ.
Change-Id: I64cb7e8ffd60f3379aa8860135db5af8e66d686f
Reviewed-on: http://gerrit.cloudera.org:8080/1231
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
Commit: IMPALA-2265: Sorter was not checking the returned Status of
PrepareRead
added a new test case that sets a mem_limit. The mem_limit was
calibrated against functional_parquet tables, but during exhaustive
this test is run against other formats. Other scanners will use
different amounts of memory causing this test to fail in exhaustive.
Fix the query to always run exactly as it was tuned.
Change-Id: I8140653825cb4f303ad569f087757148c756e42d
Reviewed-on: http://gerrit.cloudera.org:8080/1230
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
If an error occurs while processing an Expr, only that ExprContext
is closed and the others are still left open. This leads to a DCHECK
during teardown of the query because the destructor of ExprContext
expects it to be closed.
In this patch, we close all the ExprContexts if an error occurs in any
one.
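The cleanup pattern can be sketched as follows. The class and function
are toy stand-ins with hypothetical names; the real ExprContext
lifecycle has more states than this.

```python
class ExprContext:
    """Toy context that must be closed before it is destroyed."""

    def __init__(self, fail_on_open=False):
        self.fail_on_open = fail_on_open
        self.closed = False

    def open(self):
        if self.fail_on_open:
            raise RuntimeError("error while processing expr")

    def close(self):
        self.closed = True


def open_all(contexts):
    """Open every context; on any failure close all of them, then re-raise."""
    try:
        for ctx in contexts:
            ctx.open()
    except RuntimeError:
        # Close every context, not only the one that failed, so that
        # teardown never sees an unclosed context.
        for ctx in contexts:
            ctx.close()
        raise
```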
Change-Id: Ic748bfd213511c314c59594d075048f6c6d82073
Reviewed-on: http://gerrit.cloudera.org:8080/1222
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
This query was occasionally succeeding. This patch lowers the mem
limit so it is less likely to succeed. It also adds the specific
"Failed to allocate buffer" message to the expected error so we don't
accidentally lose coverage. This test could still become flaky in the
future (I'm not sure 4mb is guaranteed to work every time, plus it
could change in the future), but I can't think of a better solution
than to continue adjusting it or only rely on more targeted unit
tests.
Change-Id: Id585f0e3b2c77a0279efffe9dd8b8b4472225730
Reviewed-on: http://gerrit.cloudera.org:8080/1207
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
Before, Expr::IsConstant() manually specified the constant Expr
classes, but TupleIsNullPredicate and AnalyticExpr overrode
IsConstant() to always return false (which Expr::IsConstant() didn't
specify). This meant that unless the TupleIsNullPredicate was the root
expr, TupleIsNullPredicate::IsConstant() would never be called and
Expr::IsConstant() would return true. This patch changes
Expr::IsConstant() to recurse on its children, rather than having it
contain the constant logic for all expr types.
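A minimal sketch of the recursive approach, written in Python for
brevity. Literal and SlotRef are illustrative classes added for the
example; only the overall shape (the base class recurses, non-constant
exprs override) is taken from the description above.

```python
class Expr:
    """Base expr: constant only if every child is constant."""

    def __init__(self, *children):
        self.children = list(children)

    def is_constant(self):
        return all(child.is_constant() for child in self.children)


class Literal(Expr):
    """No children, so the base implementation reports it as constant."""


class SlotRef(Expr):
    def is_constant(self):
        return False  # depends on the input row


class TupleIsNullPredicate(Expr):
    def is_constant(self):
        return False  # depends on the input tuple
```

Because the base class asks its children, a TupleIsNullPredicate nested
anywhere in the tree makes the whole expr non-constant, not only when it
is the root.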
Change-Id: I756eb945e04c791eff39c33305fe78d957ec29f4
Reviewed-on: http://gerrit.cloudera.org:8080/1214
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
The bug: While separating the parent table refs from the subplan refs, we immediately
changed the left table link of a chosen ref. However, we rely on the link structure
to correctly determine the required table ref ids, so in some cases we missed
a required table ref id due to a broken table ref chain.
The fix: Preserve the original chain of table refs until the end of
computeParentAndSubplanRefs().
TODO:
This fix has the unfortunate consequence of making the plan for nested TPCH Q21
worse. It should be possible to change the planning logic to be correct and
also get the original better plan for Q21, but this needs some more thought.
Change-Id: Ib8dc13c950f7783b62ce6ab7c8a6f534f9a9bb31
Reviewed-on: http://gerrit.cloudera.org:8080/1177
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
Conflicts:
testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
The sorter was dropping on the floor the returned Status of the
PrepareRead() calls. PrepareRead() tries to Pin() blocks. In some
queries with large sorts, those Pin() calls could fail with OOM,
but because the sorter was ignoring the returned Status it would
happily put the unpinned block in the vector of blocks and eventually
seg fault, because the buffer_desc_ of that block was NULL.
This patch fixes this problem and adds a test that eventually we may
want to move to the exhaustive build because it takes quite some time.
It also changes the comments of the sorter class to the doxygen style.
Change-Id: Icad48bcfbb97a68f2d51b015a37a7345ebb5e479
Reviewed-on: http://gerrit.cloudera.org:8080/1156
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
(part2)
With this commit we use regex when comparing the file sizes of table
'tpch_nested_parquet.region' in the PlannerTest.
Change-Id: I03fa177c9d36d60bcb5ce7eece8a5a7c98bb7985
Reviewed-on: http://gerrit.cloudera.org:8080/1216
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
Conflicts:
testdata/workloads/functional-planner/queries/PlannerTest/nested-collections.test
testdata/workloads/functional-planner/queries/PlannerTest/tpch-nested.test