The problem: varlen data (e.g. strings) produced by aggregations is
freed by FreeLocalAllocations() after passing up the output
batch. This works for streaming operators or blocking operators that
copy their input, but results in memory corruption when the output
reaches non-copying blocking operators, e.g. SubplanNode and
NestedLoopJoinNode.
The fix: this patch makes the PartitionedAggregationNode copy out
produced string data if the node is in a subplan. Otherwise it calls
MarkNeedsToReturn() on the output batch. Marking the batch would work
in the subplan case as well, but would likely be less efficient since
it would result in many small batches coming out of the subplan.
The patch includes a test case. However, this test only exposes the
problem with an ASAN build and the --disable_mem_pools flag, which we
don't currently have automated testing for.
Change-Id: Iada891504c261ba54f4eb8c9d7e4e5223668d7b9
Reviewed-on: http://gerrit.cloudera.org:8080/2929
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
This patch adds two query options for runtime filters:
RUNTIME_FILTER_MAX_SIZE
RUNTIME_FILTER_MIN_SIZE
These options bound the minimum and maximum size of a filter, regardless
of the estimates produced by the planner. Filter sizes are rounded up to
the nearest power of two.
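For illustration (values are arbitrary and assumed to be in bytes), the
bounds can be set per session from the shell:
set RUNTIME_FILTER_MIN_SIZE=8192;
set RUNTIME_FILTER_MAX_SIZE=1048576;
A planner estimate of, say, 600KB would then be rounded up to 1MB and
clamped to these bounds.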
Change-Id: I5c13c200a0f1855f38a5da50ca34a737e741868b
Reviewed-on: http://gerrit.cloudera.org:8080/2966
Tested-by: Internal Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
INSERTs on S3 are slower because of double buffering: we buffer once
locally and once in a staging directory in S3 before moving the file(s)
to the final location. Also, moving a file from the staging directory
to the final location on HDFS is a quick rename, which is only a
metadata operation; on S3, renames are not supported, so the move
becomes a full file copy instead.
This patch introduces a boolean query option "s3_skip_insert_staging"
which avoids the staging step on S3 and allows the sinks to write to
the final location directly.
This trades consistency for performance: if a node fails during the
query, we can end up with inconsistent results in the final location.
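A minimal usage sketch (the option name is from this patch; the table
names are hypothetical):
set s3_skip_insert_staging=true;
insert into my_s3_table select * from src_table;
With the option enabled, the table sink writes directly to the table's
final S3 location rather than to a staging directory.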
P.S: This option is disabled for INSERT OVERWRITE queries as that
would require cleaning the destination directory before moving the
final files there. However, the coordinator is responsible for the
cleaning which takes place only after the table sinks have moved
the files to the final location. Thus, INSERT OVERWRITE queries must
still have their files moved to a staging location by the table sinks.
Performance gains:
- For non-partitioned tables, the INSERT queries run 4-4.5x faster on
S3. (Tested on a 63GB INSERT to a table)
- For heavily partitioned tables, there is considerable improvement
in the order of 4-5 minutes on queries that take ~27 minutes but
queries are still slow because of IMPALA-3482 where the catalog
takes too long to update all the metadata. (Tested with a query
that creates 2.4K partitions in a table totalling ~19GB).
Change-Id: Iff9620d41ba0d5fb1aa0c9f4abb48866fc2b0698
Reviewed-on: http://gerrit.cloudera.org:8080/2905
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
Testing: Tested the changes locally by running them in a loop
10 times. Also did a private core/hdfs run.
Change-Id: I37e1528c02e598f3fb2d673b6559d55a34bf79b4
Reviewed-on: http://gerrit.cloudera.org:8080/3002
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Internal Jenkins
This commit fixes an issue where a GRANT ALL ON SERVER to role_name statement
followed by a REVOKE ALL ON SERVER from role_name statement would not revoke all
privileges from role_name. The problem was triggered by a specific
combination of Sentry client API calls used in Impala during
grant/revoke statements at server scope. In particular, during GRANT, Impala was using
an API call that didn't explicitly specify the privilege action (Sentry uses '*' if
no action is specified). In contrast, the corresponding REVOKE call was explicitly
specifying the privilege action to be 'ALL'. Sentry doesn't seem to
handle this case correctly, thereby failing to remove all the privileges
after a REVOKE ALL ON SERVER call. The fix on the Impala side, which
results in the correct behavior, is to always specify the privilege
action by using the appropriate API calls.
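For reference, the statement sequence that exposed the bug looks like
this (role name hypothetical):
GRANT ALL ON SERVER TO ROLE test_role;
REVOKE ALL ON SERVER FROM ROLE test_role;
Before this fix, some privileges granted by the first statement
survived the second.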
Change-Id: I6b3a0d10f5e88c6a0a10bd20f620562d2de7ab25
Reviewed-on: http://gerrit.cloudera.org:8080/2979
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
Add missing coverage for sorting by CHAR and VARCHAR.
Add more coverage for spilling sorts.
Fix spilling tests: ensure that they actually reliably spill (many of
them had memory limits high enough that they could run entirely in
memory).
I ran this in a loop for a while to flush out flaky tests. The tests
should be fairly predictable given that they're not run concurrently
with other tests and we allocate enough block manager memory so that
each operator can obtain its reservation.
Change-Id: Ia2d2627a2c327dcdf269ea3216385b1af9dfa305
Reviewed-on: http://gerrit.cloudera.org:8080/2877
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
Now that we functionally support writes to S3 via Impala,
test_grant_revoke should not have a special case for S3; until this
patch, that special case ran the test without INSERTs.
Change-Id: Id981e7f83bf86b32d1a5b267ad3781db02337e86
Reviewed-on: http://gerrit.cloudera.org:8080/2949
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
This patch makes it a little easier to use the unique_database fixture
with .test files. The RESULTS section can now contain $DATABASE which
is replaced with the current database by the test framework.
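A rough sketch of how a .test case can use this (layout abbreviated;
the query and expected value are illustrative):
---- QUERY
select current_database()
---- RESULTS
'$DATABASE'
---- TYPES
STRING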
Testing:
- ran the test locally on exhaustive
- ran the test on hdfs and the local filesystem on Jenkins
Change-Id: I8655eb769003f88c0e1ec1b254118e4ec3353b48
Reviewed-on: http://gerrit.cloudera.org:8080/2947
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.
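For example (paths and table names hypothetical), statements like the
following now work when the data or the target table lives on S3:
load data inpath 's3a://bucket/staging/data.txt' into table s3_db.t1;
insert into s3_db.t1 select * from hdfs_db.t1;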
S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.
The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.
Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which wraps the boto3
functions and has the same function signatures as PyWebHdfsClient by
deriving from an abstract base class BaseFileSystem, so that the two
can be used interchangeably through a 'generic_client'. test_load.py is
refactored to use this generic client. The ImpalaTestSuite setup
creates a client according to the TARGET_FILESYSTEM environment
variable and assigns it to the 'generic_client'.
P.S: Currently, test_load.py runs 4x slower on S3 than on HDFS.
Performance needs to be improved in future patches. INSERT performance
is slower than on HDFS too. This is mainly because of an extra copy
that happens between staging and the final location of a file. However,
larger INSERTs come closer to HDFS performance than smaller inserts.
ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.
Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Reviewed-on: http://gerrit.cloudera.org:8080/2574
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
This change is a first step towards a more efficient Parquet scanner.
The focus is on presenting the new code flow that materializes
the table-level slots in a column-wise fashion, without going deep
into actually improving scan efficiency.
After these changes there are several obvious places that should
be optimized to realize efficiency gains.
Summary of changes
- the table-level tuples are materialized in a column-wise fashion
with new ColumnReader::ReadValueBatch() functions
- this is done by materializing a 'scratch' batch, and transferring
scratch tuples that survive filters/conjuncts to the output batch
- the tuples of nested collections are still materialized in
a row-wise fashion using the ColumnReader::ReadValue() function,
just as before
Mini benchmark
I ran the following queries on a single impalad before and after my
change using a synthetic 'huge_lineitem' table.
I modified hdfs-scan-node.cc to set the number of rows of any row
batch to 0 to focus the measurement on the scan time.
Query options:
set num_scanner_threads=1;
set disable_codegen=true;
set num_nodes=1;
select * from huge_lineitem;
Before: 22.39s
After: 18.50s
select * from huge_lineitem where l_linenumber < 0;
Before: 25.11s
After: 20.56s
select * from huge_lineitem where l_linenumber % 2 = 0;
Before: 26.32s
After: 21.82s
Change-Id: I72a613fa805c542e39df20588fb25c57b5f139aa
Reviewed-on: http://gerrit.cloudera.org:8080/2779
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
For IMPALA-1740 we added a test to insert.test, which creates a table and
inserts data. The table was created on HDFS by default and thus inserts with
compression enabled did not work. This change adds the required table to the
functional schema in the same way we do it for the other insert tests.
Change-Id: Ie68e7067b7a16218d27935820d5d1ce7035d2e6c
Reviewed-on: http://gerrit.cloudera.org:8080/2919
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
Instead of using a single default Bloom filter size for all runtime
filters, adjust the filter size according to the desired FP rate and
the expected NDV from the join's build side. The filter size is still
clipped to the 4KB < N < 16MB range. If the NDV estimate from the
planner is -1 (i.e. no stats), the default filter size is used.
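For illustration only (the planner's exact formula may differ from this
standard Bloom filter sizing heuristic): for an expected NDV of n and a
target FP rate p, a Bloom filter needs roughly
m = -n * ln(p) / (ln 2)^2 bits; with n = 1 million and p = 0.05 that is
about 6.2 million bits, i.e. roughly 0.75MB, well within the 4KB-16MB
clipping range.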
The NDV of all filters produced by the same join is currently the same
because the NDV is estimated from the cardinality of the input. In the
future, the NDV should be estimated for each filter source expr. The BE
changes anticipate this and can enable or disable individual filters if
they have differing FP rates.
Change-Id: I1fe37b8d4cfb3c52bb8e8cf0ca55e92665b87803
Reviewed-on: http://gerrit.cloudera.org:8080/2812
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
HIVE-5795 introduced a parameter skip.header.line.count to skip header
lines from input files. This change introduces the capability to skip
an arbitrary number of header lines from csv input files on hdfs. The
size of the total file header must be smaller than
max_scan_range_length, otherwise an error will be reported. This is
necessary because scan ranges are not read in disk order, so there is
no way of identifying header lines except by counting from the start
of the first scan range.
[localhost:21000] > alter table t1 set
tblproperties('skip.header.line.count'='1');
Query: alter table t1 set tblproperties('skip.header.line.count'='1')
[localhost:21000] > select * from t1;
Query: select * from t1
+----+----+
| c1 | c2 |
+----+----+
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
+----+----+
Fetched 3 row(s) in 0.32s
[localhost:21000] > alter table t1 set
tblproperties('skip.header.line.count'='0');
Query: alter table t1 set tblproperties('skip.header.line.count'='0')
[localhost:21000] > select * from t1;
Query: select * from t1
+------+------+
| c1 | c2 |
+------+------+
| NULL | NULL |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
+------+------+
WARNINGS: Error converting column: 0 TO INT (Data is: num1)
Error converting column: 1 TO DOUBLE (Data is: num2)
file: hdfs://localhost:20500/test-warehouse/t1/test.txt
record: num1,num2
Fetched 4 row(s) in 0.41s
Change-Id: I595f01a165d41499ca1956fe748ba3840a6eb543
Reviewed-on: http://gerrit.cloudera.org:8080/2110
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Internal Jenkins
Attempting to codegen a sort where the sort expr has a CHAR type as an
intermediate result fails completely.
The problem is that ScalarFnCall checked whether its input arguments
were CHAR to disable codegen, but didn't check its output.
This patch also replaces some incorrect codegen CHAR logic, which
should never be executed, with DCHECKs.
Testing:
The test is a minimal reproduction of the issue. The test is executed
by both the sorter and top-n nodes, so it covers both cases.
Change-Id: I189073d46a10988803d572928a38f4a718690fa3
Reviewed-on: http://gerrit.cloudera.org:8080/2876
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
Filter locality was not correctly set when NUM_NODES=1 (and therefore no
distributed plan was created). The default locality should be 'local'.
This patch also fixes a bug where the initial filter routing table
wasn't printed when NUM_NODES=1.
Change-Id: I7b9a6bcc64ca6ec5fd51d63815cea25de866ef93
Reviewed-on: http://gerrit.cloudera.org:8080/2721
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
This patch:
1) Removes JniUtil::Cleanup() and JniUtil::global_refs_. We never
called Cleanup(), and all the jobjects in global_refs_ are meant to
have the lifetime of the impalad process. This makes
JniUtil::GetGlobalClassRef() and JniUtil::LocalToGlobalRef()
thread-safe (which fixes IMPALA-3379).
2) Introduces a new JniUtil::FreeGlobalRef() method, which is a
wrapper around the JNI DeleteGlobalRef() method.
3) Changes JNI users to use the JniUtil methods instead of the JNI
methods directly where appropriate. This makes error checking more
consistent, and makes it easier to find all JNI uses. This is possible
since GetGlobalClassRef() and LocalToGlobalRef() are now thread-safe
and don't leak jobjects.
4) Removes HiveUdfCall::JniContext::cl, as well as other JNI
constants, and replaces them with process-wide static singletons. It
then moves the initialization to a new HiveUdfCall::Init() method
which is called once in the main thread at the beginning of the
process. This fixes IMPALA-3378.
5) Deletes the JniContext created for each HiveUdfCall.
Unfortunately I am not able to repro IMPALA-3378 so there is no test
case (I didn't attempt IMPALA-3379 but it's similar).
Change-Id: I8cd089e355d2ee2d5ace81f05b214272c05cf941
Reviewed-on: http://gerrit.cloudera.org:8080/2820
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
Impala has a hardcoded limit of 1GB in size for StringVal.
If the length of the string exceeds 1GB, Impala will simply
mark the StringVal as NULL (i.e. is_null = true). It's important
that string functions or built-in UDFs check this field before
accessing the pointer or Impala may end up doing null pointer
access, leading to crashes.
Change-Id: I55777487fff15a521818e39b4f93a8a242770ec2
Reviewed-on: http://gerrit.cloudera.org:8080/2786
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
The regex didn't match cases where the 'set' statement had whitespace
between the preceding semicolon and the 'set'. E.g. if it is not the
first statement in the block and is preceded by a newline.
The resolution by name test implicitly relied on the bug, so it needed
to be updated.
Change-Id: Ic810b31c1ad7b2bcfd29413181bb81d1a0dbcb90
Reviewed-on: http://gerrit.cloudera.org:8080/2823
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
The sorter converts string pointers to block offsets when spilling.
There was a subtle bug in the logic that assumed if the offset was
past the end of the current block, the data must necessarily be in
the next block. This is not true for zero-length strings, because
there is no backing storage so the pointer can point to the byte
after the end of the block.
This patch fixes the bug by using a simpler offset encoding scheme
that packs the block number into the upper 32 bits and the offset
within the block into the lower 32 bits.
It also slightly refactors the functions so that the method signatures
and types are more consistent with the rest of the Impala codebase.
Also fix a bug with handling of multiple query options in tests.
Change-Id: I5f64593e94d367d6b6efb61a8b86e35516f18839
Reviewed-on: http://gerrit.cloudera.org:8080/2780
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
This patch changes when runtime filters are produced in the partitioned
hash-join node to allow filters to be produced even when the PHJ
spills. Filters are now produced during the level0 processing of the
PHJ's build-side input in ProcessBuildBatch().
Since this function is codegen'ed, filter production is now codegen'ed
as well. We use
constant-propagation via constant argument injection to disable filter
production at no cost when it is not needed (including in level1+
repartitioning). I inspected the IR to confirm that the constant
propagation works as expected.
This change also allows us to send filters earlier during build-side
processing. A tradeoff is that filters are still built even if the
expected FP rate is too high, although any too-permissive filters are
still not sent to the scan (see 'Performance impact' below).
The restriction that prevented filters from being computed inside a
sub-plan is removed as part of this cleanup (since the FE handles
assigning filters correctly in subplans), and a test is added to confirm
that one of the correct cases for filters in subplans works.
This patch also fixes a bug where re-partitioning beyond level0 would
not use the codegen'ed implementation of ProcessBuildBatch().
A new test is added to test_runtime_row_filters, for Parquet only, which
spills and confirms that filtering still occurs.
Finally, the legacy --enable_phj_probe_side_filtering /
--enable_probe_side_filtering flags have been deprecated, as runtime
filtering can be permanently disabled via setting
RUNTIME_FILTER_MODE=OFF. The implementation that the old flags referred
to has been removed.
Performance impact
------------------
We benchmark the performance loss due to always computing runtime
filters even when the FP-rate will turn out to be too high as follows:
select STRAIGHT_JOIN count(*) from (select id from functional.alltypes
LIMIT 1) a JOIN [BROADCAST] (select * FROM p LIMIT 100000000) b on a.id
= -b.id and b.part_col > 0
('p' is a two-column Parquet table with 1B rows).
This builds a 100M row build table (benchmarks run on one node). When
filtering is enabled, the filter is built but selects all rows from the
probe side (so that there's no benefit to having the filter, to
emphasise the cost of building the filter in the first place).
RUNTIME_FILTER_MODE    Avg. time (s) over 5 runs
OFF                    18.95
GLOBAL                 19.55
------------------------------------------------
Change                 +3%
Change-Id: I59a2d9ee03ccea6b674392584e4c7f272233571e
Reviewed-on: http://gerrit.cloudera.org:8080/2783
Tested-by: Internal Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
Computing stats caused the Kudu table to be reloaded in the catalog and
the column definitions ended up getting appended to the existing ones.
There was already a method to clear the column state, so now that is
called during load().
Change-Id: I9ad42338750e9d8873a3584bc22a7cd7bd465c5d
Reviewed-on: http://gerrit.cloudera.org:8080/2813
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
This change fixes two problems:
1. The query options OPTIMIZE_PARTITION_KEY_SCANS and
DISABLE_STREAMING_PREAGGREGATIONS are both boolean
so they should accept 'true' and '1' as input values.
Previously, these two options were treated as ints, so a value
such as 'true' did not work with them (see the example below).
2. The break statement in the case statement of the option
SCAN_NODE_CODEGEN_THRESHOLD was 'stolen' by the option
DISABLE_STREAMING_PREAGGREGATIONS when it was added.
This change adds the missing break statement back for
SCAN_NODE_CODEGEN_THRESHOLD.
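With the fix, both value forms work for the boolean options,
e.g. (illustrative):
set OPTIMIZE_PARTITION_KEY_SCANS=true;
set DISABLE_STREAMING_PREAGGREGATIONS=1;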
Change-Id: I5c74a1e5c49e3bda15a91b40740fc7310303207b
Reviewed-on: http://gerrit.cloudera.org:8080/2776
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
For a table with location "ABC", most partitions will have locations
like "ABC/DEF=2". The "ABC" part of the location does not need to be
stored in Catalog for each partition; we can compress it down to one
int in the common case.
This is done by stripping from each partition location the last N
directories (where N is the number of clustering columns) and storing
the resulting string in a cache of partition location prefixes. In the
cache, this location prefix string is mapped to an int. Partition
locations are then stored as a tuple consisting of that int and a
suffix string; the partition location can be reconstructed as the
concatenation of the prefix string (from the cache) and the suffix.
Though this scheme was designed in the expectation that most
partitions will be stored in directories like
"/part_col_1=1.23/part_col_2=234/", it works even when that is not the
case.
TODO: Since each partition stores the literal values for the
partitioning columns, we could also elide the column names and values
when partitions are placed in directories like
"/part_col_1=1.23/part_col_2=234/"
Change-Id: I8c67b6ce0f83de2f5277a528a9ce67e47d638adb
Reviewed-on: http://gerrit.cloudera.org:8080/2355
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
There was an observed race between TestUdfs.test_java_udfs and
TestUdfs.test_libs_with_same_filenames, because they both used the
same database name.
This patch just changes the name of the database used by
test_libs_with_same_filenames.
Change-Id: Icc38cbe720a3b9d864935416eb10612171132e17
Reviewed-on: http://gerrit.cloudera.org:8080/2767
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
This patch introduces a new query option,
PARQUET_FALLBACK_SCHEMA_RESOLUTION which allows Parquet files' schemas
to be resolved by either name or position. It's "fallback" because
eventually field IDs will be the primary schema resolution scheme, and
we don't want to create an option that we will have to change the name
of later. The default is still by position. I chose a query option
because it makes testing easier and also makes it easier to diagnose
resolution problems quickly in the field. If users want to
switch the default behavior to be by name (like Hive), they can use
the --default_query_options flag.
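A usage sketch (the option name and by-name behavior are from this
patch; the table name is hypothetical and the accepted value spellings
may differ):
set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;
select * from some_parquet_table;  -- columns matched to Parquet fields by name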
This patch also introduces a new test section, SHELL, which can be
used to execute shell commands in a .test file. This is useful for
copying files into test tables.
Change-Id: Id0c715ea23792b2a6872610839a40532aabbb5a6
Reviewed-on: http://gerrit.cloudera.org:8080/2384
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
This field was included in the schema and data files, but the
checked-in generated parquet files didn't include it. It's not
referenced in any tests so we didn't catch it.
Change-Id: I5d394f074e7082fa12fafb7e57a144a83b3099a6
Reviewed-on: http://gerrit.cloudera.org:8080/2562
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
The stubs in Impala broke during the merge commit. This commit removes
the stubs in hopes of improving robustness of the build. The original
problem (Kudu clients are only available for some OSs) is now addressed
by moving the stubbing into a dummy Kudu client. The dummy client only
allows linking to succeed; if any client method is called, Impala will
crash. Before calling any such method, Kudu availability must be
checked.
Change-Id: I4bf1c964faf21722137adc4f7ba7f78654f0f712
Reviewed-on: http://gerrit.cloudera.org:8080/2585
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
These tests functionally verify that the following types of files
can be scanned properly:
1) Add a parquet file with multiple blocks such that each node has to
scan multiple blocks.
2) Add a parquet file with multiple blocks but only one row group
that spans the entire file. Only one scan range should do any work
in this case.
Change-Id: I4faccd9ce3fad42402652c8f17d4e7aa3d593368
Reviewed-on: http://gerrit.cloudera.org:8080/1500
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
The PHJ may disable runtime filter production for one of several
reasons, including a predicted high false-positive rate. If the filters
are not produced, any scans will wait for their entire timeout before
continuing.
This patch changes the filter logic to always send a filter, even if one
wasn't actually produced by the PHJ. To preserve correctness, that
filter must contain every element of the set. Such a filter is
represented by (BloomFilter*)NULL. This allows us to make no changes to
RuntimeFilter::Eval(), which already returns true if the member Bloom
filter is NULL.
In RPCs, a new field is added to TBloomFilter to identify filters that
are always true.
The HdfsParquetScanner checks to see if filters would always return true
for any element, and disables them if so.
There is some miscellaneous cleanup in this patch, particularly the
removal of unused members in BloomFilter.
This patch has been manually tested on queries that would otherwise take
a long time to time-out. A unit test was added to ensure that queries do
not wait.
Change-Id: I04b3e6542651c1e7b77a9bab01d0e3d9506af42f
Reviewed-on: http://gerrit.cloudera.org:8080/2475
Tested-by: Internal Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
When running with ASAN enabled, runtime filters may take a lot longer to
be produced, triggering timeouts in the filter tests. This patch triples
the timeout time.
We still want the timeout to be reasonable as protection against
excessive regressions in filter production time, which is why I've not
set the timeout to a very large value. In addition, if the test fails
and filters aren't produced, we don't want to hang the build for a long
timeout delay.
Change-Id: Ife1d36a78d6ad587462fe112afda573f6e480441
Reviewed-on: http://gerrit.cloudera.org:8080/2609
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
This patch adds functional tests for runtime filters. It relies on
setting RUNTIME_FILTER_WAIT_TIME_MS high enough to ensure that filters
are received.
To make the test files more readable, this patch also adds a new COMMENT
section to the test syntax, and allows blank spaces between queries so
that the separation of different test cases can be made more obvious.
Currently missing is a test for disabling probe-side filters based on
selectivity, as we lack suitable tables to trigger the disable condition.
Change-Id: I94d617c6d23ffa394a6eb7ead56f1cfb701e0d90
Reviewed-on: http://gerrit.cloudera.org:8080/2603
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
Added the ability for the "GRANT/REVOKE ALL ON SERVER TO ROLE <role>"
statement to optionally take a server name parameter as:
"GRANT/REVOKE ALL ON SERVER <server> TO ROLE <role>" since Hive
allows this.
The specified server name is checked against the expected server
name from the config during analysis, and an exception is thrown if
they do not match.
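Example of the new optional server name (server and role names
hypothetical; the server must match the configured server name):
GRANT ALL ON SERVER server1 TO ROLE analyst;
REVOKE ALL ON SERVER server1 FROM ROLE analyst;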
Change-Id: Id6c136d9a171ec062d4ff803682d026422497e8b
Reviewed-on: http://gerrit.cloudera.org:8080/2296
Tested-by: Internal Jenkins
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
This commit fixes the following two issues:
- A DROP TABLE statement can drop a view with the same name when "IF EXISTS" is specified.
- A DROP VIEW statement can drop a table with the same name when "IF EXISTS" is specified.
This happens due to a lack of checks in the Catalog before the drop executes.
Change-Id: I0d35cd1f50d9b8d50223660f753c56529cbbc311
Reviewed-on: http://gerrit.cloudera.org:8080/2458
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Internal Jenkins
This merges the 'feature/kudu' branch with cdh5-trunk as of commit:
055500cc753f87f6d1c70627321fcc825044e183
This patch is not a pure merge patch, in the sense that it goes beyond
conflict resolution to also address review comments on the
'feature/kudu' branch as a whole.
The review items and their resolution can be inspected at:
http://gerrit.cloudera.org:8080/#/c/1403/
Change-Id: I6dd4270cd17a4f5c02811c343726db3504275a92
Previously, the thread resource manager supported only a single
callback for each resource pool. The callback is invoked when a thread
token is available. This mostly worked because the scan node is the
only consumer and there is usually one scan node in a plan fragment. As
shown in IMPALA-3064 and IMPALA-561, it's possible to generate a plan
fragment with more than one scan node, in which case one of the scan
nodes may end up running with a single thread and, in debug builds, a
DCHECK will be hit.
This change fixes the problem by allowing more than one callback in
a given resource pool. The thread resource manager will go through all
the registered callbacks in round-robin fashion.
This change also adds a missing thread token release call in
HdfsScanNode::ThreadTokenAvailableCb().
Change-Id: Iddfff1feef0b59d407994ad3bc560166acbfa623
Reviewed-on: http://gerrit.cloudera.org:8080/2430
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
The bug:
Evaluating !empty() predicates at non-scan nodes interacts
poorly with our BE projection of collection slots. For example,
rows could incorrectly be filtered if a !empty() predicate is
assigned to a plan node that comes after the unnest of the
collection that also performs the projection.
The fix:
This patch reworks the generation of !empty() predicates
introduced in IMPALA-2663 for correctness purposes.
The predicates are generated in cases where we can ensure that
they will be assigned only to the parent scan, and to no other
plan node.
The conditions are as follows:
- collection table ref is relative and non-correlated
- collection table ref represents the rhs of an inner/cross/semi join
- collection table ref's parent tuple is not outer joined
Change-Id: Ie975ce139a103285c4e9f93c59ce1f1d2aa71767
Reviewed-on: http://gerrit.cloudera.org:8080/2399
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-by: Silvius Rus <srus@cloudera.com>
Tested-by: Internal Jenkins
Add a custom cluster test that tests for delays in registering data
stream receivers. We add a stress option to artificially delay this
registration to ensure that it can be handled correctly.
Change-Id: Id5f5746b6023c301bacfa305c525846cdde822c9
Reviewed-on: http://gerrit.cloudera.org:8080/2306
Tested-by: Internal Jenkins
Reviewed-by: Silvius Rus <srus@cloudera.com>
Fix a bug in which Impala only reads the first stream
of a multi-stream bz2/gzip file.
Changes the bz2 decoder to read the file in a streaming
fashion rather than reading the entire file into memory
before it can be decompressed.
Change-Id: Icbe617d03a69953f0bf3aa0f7c30d34bc612f9f8
(cherry picked from commit b6d0b4e059329633dc50f1f73ebe35b7ac317a8e)
Reviewed-on: http://gerrit.cloudera.org:8080/2219
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
We need to skip queries that select from tables with nested types when
running with the old aggs and joins. To achieve this, move the failing
test to a separate test and use the skip decorator.
Change-Id: Iaf1351c711b524be66a99084657926909425cbff
Reviewed-on: http://gerrit.cloudera.org:8080/2272
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
After this patch structs can be parsed/created with field names
that are regular identifiers or keywords, even if unquoted.
This fix is needed for parsing type strings stored in the
Hive Metastore which could contain unquoted identifiers that
correspond to Impala keywords.
The parser changes required an upgrade of Cup and its Maven plugin.
In the old version, the generated parser would not compile because
of a giant method that exceeded the JVM maximum allowed size for a
single method.
Change-Id: Ic989c7afd034216f6db4c8f9f3901c025cceb524
Reviewed-on: http://gerrit.cloudera.org:8080/2249
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
We were missing handling for booleans in the Kudu scanner, even though
we handled them in the sink.
This patch fixes this issue and adds some tests.
Change-Id: If8edbe85ae257c6374eddf757845c1ec917b1693
In the frontend we support creating Kudu tables with VARCHAR
but in the backend we don't handle it. Moreover we were swallowing
the error in release mode, causing inserts to just skip values when
this type is used.
This patch adds support for VARCHAR, along with the corresponding
tests.
Change-Id: Ic734890e1b3aae2eef1e0a3d45a7561d02eeb917
This commit adds a new feature to persist hive/java udfs across
catalog restarts. IMPALA-1748 already added this for non-java
udfs by storing them in parameters map of the Db object and
reading them back at catalog startup. However we follow a
different approach for hive udfs by converting them to Hive's
function format and adding them as hive functions to the metastore.
This makes it possible to share udfs between Hive and Impala, as udfs
added from one service are accessible to the other. This commit takes
care of format conversions between Hive and Impala, and the user can
just add the function once in either service.
Background: Hive and Impala treat udfs differently. Hive resolves the
evaluate function in the udf class at runtime depending on the data
types of the input arguments. So the user can add one function by name
and pass any arguments to it as long as there is a compatible evaluate
function in the udf class. However, Impala takes the input types of the
udf as part of the function definition (which maps to only one evaluate
function) and loads the function only for that set of input argument
types. If there are multiple 'evaluate' methods, we need to add
multiple functions, one for each of them.
This commit adds new variants of CREATE | DROP FUNCTION to Impala which
let the user create and drop hive/java udfs without input argument
types or return types. The Catalog takes care of loading/dropping the udf
signatures corresponding to each "evaluate" method in the udf symbol
class. The syntax is as follows,
CREATE FUNCTION [IF NOT EXISTS] <function name> <function_opts>
DROP FUNCTION [IF EXISTS] <function name>
Examples:
CREATE FUNCTION IF NOT EXISTS foo location '/path/to/jar' SYMBOL='TestUdf';
CREATE FUNCTION bar location '/path/to/jar' SYMBOL='TestUdf2';
DROP FUNCTION foo;
DROP FUNCTION IF EXISTS bar;
The older way of creating hive/java udfs with a specific signature is still supported;
however, they are *not* persisted across restarts, so a restart of the catalog can
wipe them out. Additionally, this commit also loads all the compatible java udfs
added outside of Impala, so they needn't be separately loaded. One thing
to note here is that the functions added using the new CREATE FUNCTION
can only be dropped using the new DROP FUNCTION syntax (without
signature). The same rule applies for the java udfs added using the old
CREATE FUNCTION syntax (with signature).
Change-Id: If31ed3d5ac4192e3bc2d57610a9a0bbe1f62b42d
Reviewed-on: http://gerrit.cloudera.org:8080/2250
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Internal Jenkins
We do not have exceptions enabled for codegen'd code, so exceptions
thrown by functions called by codegen'd functions cannot be caught by
the codegen'd functions. TimestampValue::UnixTimeToPtime() has a
try/catch around boost::posix_time::ptime_from_tm(), but since it was
inlined into the TimestampFunctions::FromUnix() IR the try/catch
didn't work. This patch moves the UnixTimeToPtime() implementation to
the .cc file so it doesn't get included in the IR. It does the same
for TimestampParser::Parse() in case it gets inlined into IR code as
well.
Change-Id: Ic0af73629e1e3b6bf18cbf5d832973712b068527
Reviewed-on: http://gerrit.cloudera.org:8080/2210
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
Hive allows udfs with primitive data types as return values (along
with Writables) and input arguments. This commit adds this support
to Impala.
Change-Id: I2ec24eab5a824772a8618d7fb97ae5c7ea2a0e39
Reviewed-on: http://gerrit.cloudera.org:8080/2207
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins