Commit Graph

1368 Commits

Author SHA1 Message Date
Yuanhao Luo
052d3cc8dd IMPALA-4056: Fix toSql() of DistributeParam
This commit fixes two issues in toSql() of DistributeParam:
1. string literals were not quoted
2. range partition split rows were not printed.
In addition, this commit fixes a small issue in run-hive-server.sh.

Change-Id: I984a63a24f02670347b0e1efceb864d265d1f931
Reviewed-on: http://gerrit.cloudera.org:8080/4195
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-09-02 20:11:27 +00:00
Alex Behm
ab9e54bc42 IMPALA-3491: Use unique database fixture in test_ddl.py.
Adds new parametrization to the unique database fixture:
- num_dbs: allows creating multiple unique databases at once;
  the 2nd, 3rd, etc. database name is generated by appending
  "2", "3", etc., to the first database name
- sync_ddl: allows creating the database(s) with sync_ddl,
  which is needed by most tests in test_ddl.py

Testing: I ran debug/core and debug/exhaustive on HDFS and
core/debug on S3. Also ran the test locally in a loop on
exhaustive.

Change-Id: Idf667dd5e960768879c019e2037cf48ad4e4241b
Reviewed-on: http://gerrit.cloudera.org:8080/4155
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-09-02 02:47:02 +00:00
Alex Behm
16f1c8d8de IMPALA-4054: Remove serial test workarounds for IMPALA-2479.
The underlying issue IMPALA-2479 has been fixed, so it
should be safe to execute these tests in parallel again:
- test_runtime_filters.py (all tests)
- test_scanners.py::TestParquet::test_multiple_blocks
- test_scanners.py::TestParquet::test_multiple_blocks_one_row_group

Testing: Ran the tests locally in a loop. Did a private core/hdfs run.

Change-Id: I8f046e67eb1de1c6ff87980f906870ec9816f551
Reviewed-on: http://gerrit.cloudera.org:8080/4291
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Internal Jenkins
2016-09-02 02:19:52 +00:00
Jim Apple
20ef3b016e IMPALA-4058: ByteSwap256 assumed memory was 16-byte aligned.
This changes the code to use the lddqu and movdqu instructions (via
Intel intrinsics) to allow unaligned memory access.

Change-Id: I39b2b47bb717d5ac9727512a24fcf8a8a6a8dcc6
Reviewed-on: http://gerrit.cloudera.org:8080/4205
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-09-02 01:47:08 +00:00
Henry Robinson
24869d40fd IMPALA-3610: Account for memory used by filters in the coordinator
Before this patch, Impala would not account for the memory used to
aggregate runtime filters together in the coordinator. Impala's memory
could therefore be silently overcommitted.

This patch accounts for aggregated filter memory in a new filter
memtracker that is attached to the coordinator's query_mem_tracker(). If
the query memory limit is exceeded when a filter update arrives, that
update is discarded. If the filter is from a partitioned join, the
entire filter can therefore be discarded immediately (to alleviate
memory pressure) and a dummy 'always true' filter is sent to backends to
unblock them.

If the filter is from a broadcast join, no aggregation is done, so there
is no tracking. The Thrift input and output filter data structures are
not tracked (as we generally don't track RPC objects, but plan to in the
future). The filter payload is moved from the input request structure to
the output broadcast structure without copying.

Memory that is added to a memtracker must always be released. To do
this, we need to signal to the coordinator that it is finished, and that
there is no point trying to process any future updates that might arrive
concurrently. This patch adds Coordinator::Done() which is called from
QueryExecState::Done(), and which releases memory from all in-process
runtime filters.

Finally, this patch increases the upper limit for runtime filters to
512MB. This allows testing on very large datasets. The default maximum
is still 16MB, per RUNTIME_FILTER_MAX_SIZE.
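
For illustration, the cap mentioned above is controlled per query via the
RUNTIME_FILTER_MAX_SIZE option; the value below is hypothetical:

  set RUNTIME_FILTER_MAX_SIZE=67108864;  -- raise the per-filter cap to 64MB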

Testing: Added a new test that triggers the OOM condition on the
coordinator. All existing runtime filter tests pass.

Change-Id: I3c52c8a1c2e79ef370c77bf264885fc859678d1b
Reviewed-on: http://gerrit.cloudera.org:8080/4066
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2016-09-01 02:35:41 +00:00
Tim Armstrong
1350c34763 IMPALA-4049: fix empty batch handling NLJ build side
Memory from the build side of a nested loop join is
referenced by its output batches, so the accumulated
build-side resources must be transferred to the caller.
The special-cased handling of empty batches did not
transfer this memory. The fix is to accumulate empty
batches and transfer their resources in the same way as
non-empty batches. The iterator required changes to
handle empty batches in the list.

Testing:
Added a unit test that exercises the bug in RowBatchList.
Added a query test that causes a crash in the ASAN build
and incorrect results in the debug build.

Change-Id: I3cb19e536b87bbb4d4ae82d1636ba1463a422789
Reviewed-on: http://gerrit.cloudera.org:8080/4182
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-08-31 21:20:29 +00:00
Alex Behm
5adedc6a1a IMPALA-3930,IMPALA-2570: Fix shuffle insert hint with constant partition exprs.
Fixes inserts into partitioned tables that have a shuffle hint and only constant
partition exprs. The rows to be inserted are merged at the coordinator where
the table sink is executed, so there is no need to hash-exchange rows.

Now accepts insert hints when inserting into unpartitioned tables. The shuffle
hint leads to a plan where all rows are merged at the coordinator where
the table sink is executed.
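
As a sketch, an insert of the kind this patch fixes might look like the
following (table and column names are hypothetical; the bracketed hint is
the shuffle hint referred to above):

  insert into t partition (year=2016) [shuffle] select c1, c2 from src;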

Change-Id: I1084d49c95b7d867eeac3297fd2016daff0ab687
Reviewed-on: http://gerrit.cloudera.org:8080/4162
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
2016-08-31 09:59:00 +00:00
Alex Behm
df830901de IMPALA-3491: Use unique database fixture in test_join_queries.py.
Testing: Ran the core/exhaustive on hdfs.

Change-Id: Ib639ff8a37dbf64840606f88badff8f2590587b6
Reviewed-on: http://gerrit.cloudera.org:8080/4169
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-08-31 03:12:30 +00:00
Alex Behm
12496c7fbf IMPALA-1657: Rework detection and reporting of corrupt table stats.
1. Minor fixes for cardinality estimation of unpartitioned tables.
2. Reworks handling of corrupt table stats as follows:
   The stats of a table or partition are reported as corrupt if the
   numRows < -1, or if numRows == 0 but the table size is positive.
3. Removes the Preconditions check reported in IMPALA-1657 in favor
   of issuing a corrupt table stats warning.
4. Fixes a few tests to set numRows together with
   STATS_GENERATED_VIA_STATS_TASK so that numRows is definitely
   set in the HMS.

Change-Id: I1d3305791d96e1c23a901af7b7c109af9352bb44
Reviewed-on: http://gerrit.cloudera.org:8080/4166
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-08-31 00:58:03 +00:00
Thomas Tauber-Marshall
d72353d0c9 IMPALA-2932: Extend DistributedPlanner to account for hash table build cost
When deciding between a broadcast or repartition join, Impala calculates
the cost of each join as the total amount of data that is sent over the
network. This ignores some relevant costs, and can lead to bad plans.

One such relevant cost is the work to create the hash table used in the
join. This patch accounts for this by adding the amount of data inserted
into the hash table (the size of the right side of the join) to the
previous cost.

This generally increases the estimated cost of broadcast joins relative
to repartitioning joins, as the broadcast join must build the hash table
on each node the data was broadcast to, so its effect will be to make
repartitioning joins more likely to be chosen, especially in large
clusters.

This patch has not yet been performance tested.

Change-Id: I03a0f56f69c8deae68d48dfdb9dc95b71aec11f1
Reviewed-on: http://gerrit.cloudera.org:8080/4098
Tested-by: Internal Jenkins
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
2016-08-29 16:44:22 +00:00
Lars Volker
0e886618e2 IMPALA-3776: fix 'describe formatted' for Avro tables
For Avro tables the column information in the underlying database of the
Hive metastore can be different from what is specified in the avro
schema. HIVE-6308 aimed to improve upon this, but for older tables the
two don't necessarily align.

There are two possible cases:

1) Hive's underlying database contains a column which is not present in
the Avro schema file. In this case we encounter a NullPointerException
in DescribeResultFactory.java#L189 when trying to look up the column in
the internal table object.

2) The Avro schema contains a column which is not present in the
underlying database. In this case the column will not be displayed in
describe formatted.

In addition to the automatic tests I verified this manually by creating
an Avro table with an external schema file in Hive. This populated the
underlying database with the column information. I then either removed
a column from the Avro schema file (case 1) or cleared the column
information from the "COLUMNS_V2" table in the underlying database
(case 2) and verified that the change fixed both cases.

Change-Id: Ieb69d3678e662465d40aee80ba23132ea13871a0
Reviewed-on: http://gerrit.cloudera.org:8080/4126
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Internal Jenkins
Reviewed-by: Jim Apple <jbapple@cloudera.com>
2016-08-26 17:20:10 +00:00
Henry Robinson
34b5f1c416 IMPALA-(3895,3859): Don't log file data on parse errors
Logging file or table data is a bad idea, and doing it by default is
particularly bad. This patch changes HdfsScanNode::LogRowParseError() to
log a file and offset only.

Testing: See rewritten tests.

To support testing this change, we also fix IMPALA-3895, by introducing
a canonical string __HDFS_FILENAME__ that all Hadoop filenames in the ERROR
output are replaced with before comparing with the expected
results. This fixes a number of issues with the old way of matching
filenames, which purported to be a regex but really wasn't. In
particular, we can now match the rest of an ERROR line after the
filename, which was not possible before.

In some cases, we don't want to substitute filenames because the ERROR
output is looking for a very specific output. In that case we can write:

$NAMENODE/<filename>

and this patch will not perform _any_ filename substitutions on ERROR
sections that contain the $NAMENODE string.

Finally, this patch fixes a bug where a test that had an ERRORS section
but no RESULTS section would silently pass without testing anything.

Change-Id: I5a604f8784a9ff7b4bf878f82ee7f56697df3272
Reviewed-on: http://gerrit.cloudera.org:8080/4020
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
2016-08-25 10:20:36 +00:00
Alex Behm
1a430fe40c IMPALA-2540: Add regression test.
The bug was also fixed by the fix for IMPALA-3063:
532b1fe118

This patch adds a regression test for IMPALA-2540.

Change-Id: I7c7dececfee90540fe7d5f8a606381ec50a3b241
Reviewed-on: http://gerrit.cloudera.org:8080/4071
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-08-25 00:43:05 +00:00
Attila Jeges
211f60d831 IMPALA-1731,IMPALA-3868: Float values are not parsed correctly
Fixed StringToFloatInternal() not to parse strings like "1.23inf"
and "infinite" with leading/trailing garbage as Infinity. These
strings are now rejected with PARSE_FAILURE.
Only "inf" and "infinity" are accepted, parsing is case-insensitive.

"NaN" values are handled similarly: strings with leading/trailing
garbage like "nana" are rejected, parsing is case-insensitive.

Other changes:
- StringToFloatInternal() was cleaned up a bit. Parsing inf and NaN
strings was moved out of the main loop.
- Use std::numeric_limits<T>::infinity() instead of INFINITY macro
and std::numeric_limits<T>::quiet_NaN() instead of NAN macro.
- Fixed another minor bug: multiple dots were allowed when parsing
float values (e.g. "2.1..6" was interpreted as 2.16).
- New BE and E2E tests were added.

Change-Id: I9e17d0f051b300a22a520ce34e276c2d4460d35e
Reviewed-on: http://gerrit.cloudera.org:8080/3791
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-08-24 03:34:01 +00:00
Tim Armstrong
f613dcd02d Add functional and targeted perf tests for joins with empty builds
I wrote these tests for my IMPALA-3987 patch, but other issues block
that optimisation. These tests exercise an interesting corner case,
so I split them out into a separate patch.

The functional tests exercise every join mode for nested loop join and
hash join with an empty build side. The perf test exercises hash join
with an empty build side.

Testing:
Made sure the tests passed with both partitioned and non-partitioned
hash join implementations. Ran the targeted perf query through the
single node perf run script to make sure it worked.

Change-Id: I0a68cafec32011a47c569b254979601237e7f2a5
Reviewed-on: http://gerrit.cloudera.org:8080/4051
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-08-19 06:04:18 +00:00
Alex Behm
1bbd667fd3 IMPALA-3828: Enable inversion for inner joins.
Testing: Ran the FE planner tests. Examined all the changed plans
to verify that the changes are beneficial according to our
cardinality estimates. Still need to do a real perf run.

Change-Id: I8ba903f1df2446350cca7e71fdb13f550bf9de72
Reviewed-on: http://gerrit.cloudera.org:8080/4035
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-08-19 05:40:01 +00:00
Matthew Jacobs
d113205cee IMPALA-3650: DISTRIBUTE BY required for managed Kudu tables
As of Kudu 0.9, DISTRIBUTE BY is now required when creating
a new Kudu table. Create table analysis, data loading, and
tests are updated to reflect this.

This also bumps the Kudu version to 0.10.0.

Change-Id: Ieb15110b10b28ef6dd8ec136c2522b5f44dca43e
Reviewed-on: http://gerrit.cloudera.org:8080/3987
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-08-19 02:14:39 +00:00
Matthew Jacobs
0983da92ba IMPALA-3856,IMPALA-3871: Fix BinaryPredicate normalization for Kudu
Change-Id: Iae7612433a2e27f8887abe6624f9ee0f4867b934
Reviewed-on: http://gerrit.cloudera.org:8080/3986
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-08-18 04:03:00 +00:00
Alex Behm
532b1fe118 IMPALA-3063: Separate join inversion from join ordering.
Before this change joins were inverted while doing join ordering.
That approach was unnecessarily complex because it required
modifying the global analysis state for correct conjunct
placement, etc. However, join inversion is independent of join
ordering, and the existing approach could lead to generating
invalid plans with distributed non-equi right outer/semi joins,
which we cannot execute in the backend.

After this change joins are inverted in a separate pass over
the single-node plan. This simplifies the inversion
logic and allows us to avoid generating those invalid plans.

Note that this change is not only a separation of functionality
for the following reasons:
1. Our join cardinality estimation is not symmetric, i.e., A JOIN B
may not give the same estimate as B JOIN A due to our FK/PK detection
heuristic. In the context of this patch this means that an inverted
join may have a different cardinality estimate, so plans may change
depending on whether the inversion is done during join ordering or after.
2. We currently only invert outer/semi/anti joins based on the rhs table
ref join op. In this patch I want to preserve the existing behavior as
much as possible, but when doing the join inversion in a separate pass we
may see a join op in a JoinNode that is different from the rhs table ref.
So in some situations the inversion behavior based on the join op could be
different, and there are some examples in this patch.

This patch also moves the logic of converting hash joins to
nested-loop joins into a separate pass over the single-node plan.

Change-Id: If86db7753fc585bb4c69612745ec0103278888a4
Reviewed-on: http://gerrit.cloudera.org:8080/3846
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-08-18 03:25:16 +00:00
Christopher Channing
90a6b3206e IMPALA-3964: Fix crash when a count(*) is performed on a nested collection.
The Bug: Prior to this patch, a DCHECK was used to verify that the
underlying memory pool for the scratch batch was empty in a count-based
scenario. For IMPALA-3964 (where a count(*) is performed on a nested
collection), if a Parquet column chunk is compressed, each new data
page would be decompressed upon reading and eventually placed into the
underlying scratch batch memory pool, causing the aforementioned DCHECK
to fail. This was not picked up in the test suite because the TPCH nested
Parquet data is not compressed.

The Fix: Removed the erroneous DCHECK. Added logic to determine if any
memory in the scratch batch needs to be freed (due to the transfer that
occurs from the decompressed data pool) and, if so, to free it.
Augmented the load_nested.py script to snappy compress each of the
tables within the 'tpch_nested_parquet' database. This is consistent with
how the flat TPCH Parquet data set is stored. Regarding test coverage,
there are already a number of tests that will perform nested collection
counts against the tables in the 'tpch_nested_parquet' database. For
uncompressed nested Parquet, the 'test_nested_types.py' test suite
leverages the 'ComplexTypesTbl' table to provide good coverage.

Change-Id: Id0955c85d18dfba4bd29a35ec95d0355da050607
Reviewed-on: http://gerrit.cloudera.org:8080/3940
Reviewed-by: Michael Ho <kwho@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-08-16 11:54:04 +00:00
Tim Armstrong
5afd9f7df7 IMPALA-3764,3914: fuzz test HDFS scanners and fix parquet bugs found
This adds a test that performs some simple fuzz testing of HDFS
scanners. It creates a copy of a given HDFS table, with each
file in the table corrupted in a random way: either a single
byte is set to a random value, or the file is truncated to a
random length. It then runs a query that scans the whole table
with several different batch_size settings. I made some effort
to make the failures reproducible by explicitly seeding the
random number generator, and providing a mechanism to override
the seed.

The fuzzer has found crashes resulting from corrupted or truncated
input files for RCFile, SequenceFile, Parquet, and Text LZO so far.
Avro only had a small buffer read overrun detected by ASAN.

Includes fixes for Parquet crashes found by the fuzzer, a small
buffer overrun in Avro, and a DCHECK in MemPool.

Initially it is only enabled for Avro, Parquet, and uncompressed
text. As follow-up work we should fix the bugs in the other scanners
and enable the test for them.

We also don't implement abort_on_error=0 correctly in Parquet:
for some file formats, corrupt headers result in the query being
aborted, so an exception will xfail the test.

Testing:
Ran the test with exploration_strategy=exhaustive in a loop locally
with both DEBUG and ASAN builds for a couple of days over a weekend.
Also ran exhaustive private build.

Change-Id: I50cf43195a7c582caa02c85ae400ea2256fa3a3b
Reviewed-on: http://gerrit.cloudera.org:8080/3833
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-08-11 08:42:41 +00:00
Alex Behm
286da59219 IMPALA-3940: Fix getting column stats through views.
The bug: During join ordering we rely on the column stats of
join predicates for estimating the join cardinality. We have code
that tries to find the stats of a column through views but there
was a bug in identifying slots that belong to base table scans.
The bug led us to incorrectly accept slots of view references,
which do not have stats.

This patch fixes the above issue and adds new test infrastructure
for creating test-local views. It adds a TPCH-equivalent database that
contains views of the form "select * from tpch_basetbl" for all TPCH
tables, and adds tests for the plans of all TPCH queries on the view database.

Change-Id: Ie3b62a5e7e7d0e84850749108c13991647cedce6
Reviewed-on: http://gerrit.cloudera.org:8080/3865
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-08-11 08:22:30 +00:00
Dan Hecht
ffa7829b70 IMPALA-3918: Remove Cloudera copyrights and add ASF license header
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:

http://www.apache.org/legal/src-headers.html#headers

Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
   http://www.apache.org/legal/src-headers.html#notice
   to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
   on the website.

Much of this change was automatically generated via:

git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]

Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.

[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
    modification to ORIG_LICENSE to match Impala's license text.

Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-08-09 08:19:41 +00:00
Alex Behm
ac1215fd31 IMPALA-3861: Replace BetweenPredicates with their equivalent CompoundPredicate.
The bug: Our BetweenPredicate has a complex object structure that is unlike
most other Exprs because we generate an equivalent CompoundPredicate during
analysis and replace the original children. Keeping the various members in
sync and preserving the object structure during clone() and substitute() is
very difficult and error prone. In particular, subquery rewriting is
difficult because we extract and replace correlated BinaryPredicates.
Substituting BinaryPredicates in a BetweenPredicate's children is not
equivalent to a substitution on the BetweenPredicate's original children,
so keeping the various redundant members in sync is quite difficult.

The fix is to replace BetweenPredicates with their equivalent CompoundPredicates
before performing subquery rewrites.

We ultimately still want to fix clone() and substitute() for BetweenPredicates,
but an elegant solution is likely to be more involved.

Change-Id: I0838b30444ed9704ce6a058d30718a24caa7444a
Reviewed-on: http://gerrit.cloudera.org:8080/3804
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-07-30 07:23:52 +00:00
Matthew Jacobs
b9f6392e84 IMPALA-3939: Data loading may fail on tpch kudu
Change-Id: I7f229a9f74fa4dceb14335914b0dde9bf607264e
Reviewed-on: http://gerrit.cloudera.org:8080/3818
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-07-30 03:59:29 +00:00
Bikramjeet Vig
36b4ea6f65 IMPALA-1683: Allow REFRESH on a single partition
Previously, the only way to refresh metadata for a partition was to refresh
the whole table. This is a relatively time-consuming process, especially if
there are many partitions and only one is to be refreshed.
This patch allows the client to REFRESH a single partition by using the
following syntax:
REFRESH [database_name.]table_name PARTITION (partition_spec)
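
For example (hypothetical database, table, and partition spec):

  REFRESH sales_db.transactions PARTITION (year=2016, month=7)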

Testing:
Added parsing and authorization tests in ParserTest.java and
AuthorizationTest.java respectively. A new test file
"test_refresh_partition.py" was added for testing functionality.

Performance:
For a table with 10000 partitions and 1 file per partition

                     execResetMetadata()       Total Execution Time

Refresh Table              3795 ms                   4630 ms

Refresh Partition            42 ms                    680 ms

We see that the time to refresh improves by a factor of 90x, but due to
a significant overhead of about 640 ms in this case, the effective improvement
is about 7x. As the size of the table and the number of partitions increase,
this improvement would become more significant.

Change-Id: Ia9aa25d190ada367fbebaca47ae8b2cafbea16fb
Reviewed-on: http://gerrit.cloudera.org:8080/3813
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-07-29 23:57:50 +00:00
Alex Behm
c77fb628f7 IMPALA-3915: Register privilege and audit requests when analyzing resolved table refs.
The bug: We used to register privilege requests for table refs in TableRef.analyze(),
which only got called for unresolved TableRefs. As a result, a reference to a view that
contains a subquery did not get properly authorized, explained as follows.
1. In the first analysis pass the view is replaced by an InlineViewRef and we
   correctly register an authorization request.
2. We rewrite the subquery via the StmtRewriter and wipe the analysis state, but
   preserve the InlineViewRef that replaces the view reference.
3. The rewritten statement is analyzed again, but since an InlineViewRef is
   considered to be resolved, we never call TableRef.analyze(), and hence
   never register an authorization event for the view.

The fix: We now register authorization and auditing events when calling analyze() on a
resolved TableRef (BaseTableRef, InlineViewRef, CollectionTableRef).

Change-Id: I18fa8af9a94ce190c5a3c29c3221c659a2ace659
Reviewed-on: http://gerrit.cloudera.org:8080/3783
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-07-29 05:16:23 +00:00
Dimitris Tsirogiannis
0e88f0d7aa Add TPC-H based planner tests for Kudu tables
This commit adds a set of planner tests for Kudu tables based on the 22
TPC-H queries.

Change-Id: I6c40534b72b9aa1ee582b9679c2a63cad52df703
Reviewed-on: http://gerrit.cloudera.org:8080/3790
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-07-29 02:49:50 +00:00
Tim Armstrong
c1d70f814e IMPALA-3227: generate test TPC data sets during data load
The generated data is identical to the pregenerated tpch.tar.gz
and tpcds.tar.gz data that was used previously and was not
publicly accessible.

This adds a "preload" hook to bin/load-data.py that can execute custom
logic for each data set. This is used to call the TPC-H and TPC-DS data
generation utilities that are already available in the Impala toolchain.

Testing:
Ran private test job with loading from snapshot disabled and without
the tpch/tpcds tarballs available.

Change-Id: Ieccfbd7d8d4a91bffddbe35abb7f5572e71a71cf
Reviewed-on: http://gerrit.cloudera.org:8080/3761
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-07-28 04:56:57 +00:00
Dimitris Tsirogiannis
6fbd35fa87 Enable TPC-H workload for Kudu tables
With this commit we enable loading of TPC-H data into Kudu tables and
running the 22 TPC-H queries against Kudu. Since Kudu doesn't support
the decimal data type, we had to modify the queries by using the round()
function and updating the test results.
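
As a sketch of the kind of rewrite described (the round() placement here is
illustrative, not copied from the modified test files):

  -- original, relies on decimal semantics:
  select sum(l_extendedprice * (1 - l_discount)) from lineitem;
  -- modified for Kudu:
  select round(sum(l_extendedprice * (1 - l_discount)), 2) from lineitem;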

Change-Id: I3a5de71fefa92a78970226d8f49ef445d28f9289
Reviewed-on: http://gerrit.cloudera.org:8080/3789
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-07-28 04:35:11 +00:00
Alex Behm
55b43ba8c4 IMPALA-3084: Cache the sequence of table ref and materialized tuple ids during analysis.
The bug: For correct predicate assignment we rely on TableRef.getAllTupleIds()
and TableRef.getMaterializedTupleIds(). The implementation of those functions
used to traverse the chain of table refs and collect the appropriate ids.
However, during plan generation we alter the chain of table refs, in particular,
for dealing with nested collections, so those altered TableRefs do not return the
expected list of ids, leading to wrong decisions in predicate assignment.

The fix: Cache the lists of ids during analysis, so we are free to alter the
chain of TableRefs during plan generation.

Change-Id: I298b8695c9f26644a395ca9f0e86040e3f5f3846
Reviewed-on: http://gerrit.cloudera.org:8080/2415
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-07-22 22:35:18 -07:00
Michael Ho
276376acac IMPALA-3674: Lazy materialization of LLVM module bitcode.
Previously, each fragment using dynamic code generation would
parse the bitcode module and populate the LLVM data structures
for all the functions and their bodies in the bitcode module.
This is wasteful as we may only use a few functions out of all
the functions parsed. We rely on dead code elimination to
delete most of the unused functions so we won't waste time
compiling them.

This change implements lazy materialization of the functions'
bodies. On the initial parse of the bitcode module, we just
create the Function objects for each function in the module.
The functions' bodies will be materialized on demand from the
bitcode module when they are actually referenced in the query.
This ensures that the prepare time during codegen is proportional
to the number of IR functions referenced by the query instead
of being proportional to the total number of IR functions in
the module.

This change also stops cross-compiling BufferedTupleStream::GetTupleRow()
as there isn't much benefit for doing it. In addition, move the ctors
and dtors of LikePredicate to the header file to avoid an unnecessary
alias in the IR module.

For TPCH-Q2, a fragment which only codegens 9 functions used to spend
146ms in codegen. It now goes down to 35ms, a 76% reduction.

      CodeGen:(Total: 146.041ms, non-child: 146.041ms, % non-child: 100.00%)
         - CodegenTime: 0.000ns
         - CompileTime: 2.003ms
         - LoadTime: 0.000ns
         - ModuleBitcodeSize: 2.12 MB (2225304)
         - NumFunctions: 9 (9)
         - NumInstructions: 129 (129)
         - OptimizationTime: 29.019ms
         - PrepareTime: 114.651ms

      CodeGen:(Total: 35.288ms, non-child: 35.288ms, % non-child: 100.00%)
         - CodegenTime: 0.000ns
         - CompileTime: 1.880ms
         - LoadTime: 0.000ns
         - ModuleBitcodeSize: 2.12 MB (2221276)
         - NumFunctions: 9 (9)
         - NumInstructions: 129 (129)
         - OptimizationTime: 5.101ms
         - PrepareTime: 28.044ms

Change-Id: I6ed7862fc5e86005ecea83fa2ceb489e737d66b2
Reviewed-on: http://gerrit.cloudera.org:8080/3220
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-07-20 18:30:25 -07:00
Tim Armstrong
bc8c55afcd IMPALA-3729: batch_size=1 coverage for avro scanner
Also fix a stale comment in the avro scanner header.

The main work here is to fix the handling of empty result sets in the
test result verifier. This is a problem because we wanted to verify
that the results in the test file were a superset of the rows
returned, and this was thrown off by superfluous '' rows in the expected
and actual result sets.

The basic problem is that the way test file sections
were parsed conflated an empty result section with a non-empty result
section that had a single empty string. I.e.:

---- RESULTS
====

vs
---- RESULTS

====

both got resolved to [''].

Change-Id: Ia007e558d92c7e4ce30be90446fdbb1f50a0ebc4
Reviewed-on: http://gerrit.cloudera.org:8080/3413
Tested-by: Internal Jenkins
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2016-07-19 23:30:02 -07:00
Thomas Tauber-Marshall
343bdad866 IMPALA-3210: last/first_value() support for IGNORE NULLS
Added support for the 'ignore nulls' keyword to the last_value and
first_value analytic functions, e.g. 'last_value(col ignore nulls)',
which would return the last value from the window that is not null,
or null if all of the values in the window are null.

We handle 'ignore nulls' in the FE in the same way that we handle
'distinct' - by adding isIgnoreNulls as a field in FunctionParams.

To avoid affecting performance when 'ignore nulls' is not used, and
to avoid having to special case 'ignore nulls' on the backend, this
patch adds 'last_value_ignore_nulls' and 'first_value_ignore_nulls'
builtin analytic functions that wrap 'last_value' and 'first_value'
respectively.
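
For example (hypothetical table and columns):

  select last_value(price ignore nulls)
    over (partition by item order by ts) from sales;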

Change-Id: Ic27525e2237fb54318549d2674f1610884208e9b
Reviewed-on: http://gerrit.cloudera.org:8080/3328
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Internal Jenkins
2016-07-18 08:28:09 -07:00
Alex Behm
45740c8bcc IMPALA-3678: Fix migration of predicates into union operands with an order by + limit.
There were two separate issues:

First, the SortNode incorrectly picked up unassigned conjuncts, which were expected to
be empty. In this case where predicates are migrated into union operands, there could
actually be unassigned conjuncts bound by the SortNode's tuple id (and so they would be
incorrectly picked up). The fix is to not pick up unassigned conjuncts in the SortNode,
and to allow them to be picked up later (into a SelectNode).

Second, when generating the plan for union operands we were missing a call to
graft a SelectNode on top of the operand plan to capture unassigned conjuncts.

Change-Id: I95d105ac15a3dc975e52dfd418890e13f912dfce
Reviewed-on: http://gerrit.cloudera.org:8080/3600
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2016-07-15 18:27:05 +00:00
Michael Ho
f129dfd202 IMPALA-3018: Don't return NULL on zero length allocations.
FunctionContext::Allocate() and FunctionContext::AllocateLocal()
used to return NULL for zero length allocations. This makes
it hard to distinguish between allocation failures and zero
length allocations. Such confusion may lead to DCHECK failure
in the macro RETURN_IF_NULL() in debug builds or access to NULL
pointers in non-debug builds.

This change fixes the problem above by returning NULL only if
there is an allocation failure. Zero-length allocations will always
return a dummy non-NULL pointer.

Change-Id: Id8c3211f4d9417f44b8018ccc58ae182682693da
Reviewed-on: http://gerrit.cloudera.org:8080/3601
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-07-14 19:04:45 +00:00
Michael Ho
df2fc08d22 IMPALA-3206: Enable codegen for AVRO_DECIMAL
This change adds the missing switch case in
CodegenReadScalar() for AVRO_DECIMAL so that we will
also codegen if an Avro table contains AVRO_DECIMAL.
With this change, the following query improves by 37.5%,
going from 8s to 5s:

select count(distinct l_linenumber), avg(l_extendedprice), max(l_discount), min(l_tax) from tpch15_avro.lineitem;

This change also un-inlines BitUtil::ByteSwap() as the
third argument 'len' is not a compile-time constant for
all call sites.

Change-Id: I51adf0c1ba76e055f31ccb0034a0d23ea2afb30e
Reviewed-on: http://gerrit.cloudera.org:8080/3489
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-07-14 19:04:44 +00:00
Huaisi Xu
c6ce32b3b6 IMPALA-3687: Prefer Avro field name during schema reconciliation
Since it is possible to create an Avro table with both column
definitions and an Avro schema, Impala attempts to reconcile
inconsistencies in the two schema definitions, generally preferring the
Avro schema. The only exception to this rule was with
CHAR/VARCHAR/STRING columns, where the column definition was preferred
in order to support tables with CHAR/VARCHAR columns although Avro only
supports STRING. This exception is confusing because the name for such a
column will be taken from the column definition (and not from the Avro
schema).

This patch prefers the name and comment from the Avro schema definition
and uses the column type from the column definition for CHAR/VARCHAR/STRING
columns.

Change-Id: Ia3e43b2885853c2b4f207a45a873c9d7f31379cd
Reviewed-on: http://gerrit.cloudera.org:8080/3331
Reviewed-by: Huaisi Xu <hxu@cloudera.com>
Tested-by: Internal Jenkins
2016-07-14 19:04:43 +00:00
Michael Ho
ed5ec6772f IMPALA-1619: Support 64-bit allocations.
This change extends MemPool, FreePool and StringBuffer to support
64-bit allocations, fixes a bug in decompressor and extends various
places in the code to support 64-bit allocation sizes. With this
change, the text scanner can now decompress compressed files larger
than 1GB.

Note that the UDF interfaces FunctionContext::Allocate() and
FunctionContext::Reallocate() still use 32-bit for the input
argument to avoid breaking compatibility.

In addition, the byte size of a tuple is still assumed to fit
within 32 bits. If it needs to be upgraded to 64 bits, it will be
done in a separate change.

A new test has been added to test the decompression of a 2GB
snappy block compressed text file.

Change-Id: Ic1af1564953ac02aca2728646973199381c86e5f
Reviewed-on: http://gerrit.cloudera.org:8080/3575
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-07-08 15:42:09 -07:00
Huaisi Xu
c1da1409ba IMPALA-3711: Remove unnecessary privilege checks in getDbsMetadata()
Previously all code paths using getDbsMetadata() suffered
unnecessary privilege checks:
1. Impala checked privileges of all databases and tables before
applying user-provided JDBC pattern filters.
2. Impala passed a null pattern to getDbsMetadata() when the
user did not provide one. However, a null pattern is treated
as "%", which matches everything, thereby causing unnecessary
privilege checks for catalog objects that are not in the
result set.

This patch creates the PatternMatcher early so that a user-specified
null pattern is respected when calling getDbsMetadata().

Change-Id: I17d8c5b9fb12483e4b01b819fba48b6849311a14
Reviewed-on: http://gerrit.cloudera.org:8080/3371
Reviewed-by: Huaisi Xu <hxu@cloudera.com>
Tested-by: Huaisi Xu <hxu@cloudera.com>
2016-07-07 10:41:29 -07:00
Bharath Vissapragada
d19751669a IMPALA-3680: Cleanup the scan range state after failed hdfs cache reads
Currently we don't reset the file read offset if ZCR fails. Due to
this, when we switch to the normal read path, we hit the eosr of
the scan-range even before reading the expected data length. If both
the ReadFromCache() and ReadRange() calls fail without reading any
data, we end up creating a whole list of scan-ranges, each with size
1KB (DEFAULT_READ_PAST_SIZE) assuming we are reading past the scan
range. This gives a huge performance hit. This patch just calls
ScanRange::Close() after the failed cache reads to clean up the
file system state so that the re-reads start from the beginning of
the scan range.

This was hit as a part of debugging IMPALA-3679, where the queries
on 1gb cached data were running ~20x slower compared to non-cached
runs.

Change-Id: I0a9ea19dd8571b01d2cd5b87da1c259219f6297a
Reviewed-on: http://gerrit.cloudera.org:8080/3313
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Bharath Vissapragada <bharathv@cloudera.com>
2016-07-05 13:37:26 -07:00
Michael Brown
08e8de73b2 IMPALA-3806: remove a few modern shell idioms to improve RHEL5 support
Both `find -executable` and the Bash "&>>" operator are too new to be
supported on RHEL5. Both have reasonable workarounds, so those are used instead.
Note that this may not be the exhaustive list of such "modern"
conventions, but RHEL5 isn't working end-to-end, so we can't identify
all of them in a single commit yet.

Testing:

Before, the RHEL5 build would fail quite early here. Now, data load
succeeds and most of the backend tests successfully run.

Change-Id: I7438bed908d8026327923607238808122212d2d8
Reviewed-on: http://gerrit.cloudera.org:8080/3531
Reviewed-by: David Knupp <dknupp@cloudera.com>
Tested-by: Internal Jenkins
2016-07-05 13:37:26 -07:00
Michael Ho
6e71e903ff IMPALA-3223: Supports download of CDH components from S3.
This change updates the toolchain bootstrapping script
to download the CDH components (hadoop, hbase, hive, llama,
llama-minikdc and sentry) from the toolchain S3 bucket to
the toolchain directory if the environment variable
$DOWNLOAD_CDH_COMPONENTS is true. By default, it is false,
which means the CDH components in the thirdparty directory
will be used instead.

To build the ASF tree (https://git-wip-us.apache.org/repos/asf?p=incubator-impala.git),
set $DOWNLOAD_CDH_COMPONENTS to true. Currently, the CDH
components in S3 are snapshots from the thirdparty directory
at 688d0efcd38731e8e27a8236dbdca21c8fd571a1. Once the integration
jenkins job (impala-cdh5-trunk-core-integration) is modified
to upload the latest stable builds to the S3 buckets, we can
remove the thirdparty directory and always use the CDH components
in the toolchain directory.

Note that bootstrap_toolchain.py will not overwrite existing
directories in the toolchain directory. To force a refresh of
components in the toolchain directory, a user should delete the
cached copy in the toolchain directory and execute
bootstrap_toolchain.py again. This behavior allows users to
develop locally without network connection once the toolchain
has been bootstrapped.

Change-Id: I16fa79db0005554cc0a116e74775647ba99f8dda
Reviewed-on: http://gerrit.cloudera.org:8080/3333
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-06-21 00:37:53 -07:00
Tim Armstrong
8a04b170d2 IMPALA-3754: fix TestParquet.test_corrupt_rle_counts flakiness
The test could hit one of two similar errors, depending on the order in
which it read the files. This patch fixes the ERROR/CATCH blocks to be more
permissive, so that either error is accepted.

Change-Id: I785048eda36552981b6ba9c739517f83ac8715f4
Reviewed-on: http://gerrit.cloudera.org:8080/3402
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-06-20 15:37:18 -07:00
Sailesh Mukil
73595d8f40 IMPALA-3737: Local filesystem build failed loading custom schemas
When a LOCATION that does not have the scheme specified is used,
the default FS is used as the filesystem scheme.
The default FS is set as 'file:/tmp' for localFS runs, however the
Hadoop library seems to ignore the '/tmp' part of the defaultFS for
locations without schemes and just uses 'file:'.
So the test warehouse is in: 'file:/tmp/test-warehouse'
However, the scripts access '/test-warehouse' without the scheme which
hadoop translates to: 'file:/test-warehouse' which does not exist.

This change disables metadata loading on the local filesystem if a
scheme mismatch is detected, just as is done on S3 and Isilon.

Change-Id: Ie404079aeb2f837ac8b03244b2019e2c8ee9f221
Reviewed-on: http://gerrit.cloudera.org:8080/3384
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Sailesh Mukil <sailesh@cloudera.com>
2016-06-16 17:34:34 -07:00
Tim Armstrong
547be27e77 IMPALA-3745: parquet invalid data handling
Added checks/error handling:
* Negative string lengths while decoding dictionary or data page.
* Buffer overruns while decoding dictionary or data page.
* Some metadata FILECHECKs were converted to statuses.

Testing:
Unit tests for:
* decoding of strings with negative lengths
* truncation of all parquet types
* dictionary creation correctly handling error returns from Decode().

End-to-end tests for handling of negative string lengths in
dictionary- and plain-encoded data in corrupt files, and for
handling of buffer overruns for string data. The corrupted
parquet files were generated by hacking Impala's parquet
writer to write invalid lengths, and by hacking it to
write plain-encoded data instead of dictionary-encoded
data by default.

Performance:
set num_nodes=1;
set num_scanner_threads=1;
select * from biglineitem where l_orderkey = -1;

I inspected MaterializeTupleTime. Before the average was 8.24s and after
was 8.36s (a 1.4% slowdown, within the standard deviation of 1.8%).

Change-Id: Id565a2ccb7b82f9f92cc3b07f05642a3a835bece
Reviewed-on: http://gerrit.cloudera.org:8080/3387
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-06-15 21:33:39 -07:00
Skye Wanderman-Milne
01287a3ba9 IMPALA-3441, IMPALA-3659: check for malformed Avro data
This patch adds error checking to the Avro scanner (both the codegen'd
and interepted paths), including out-of-bounds checks and data
validity checks.

I ran a local benchmark using the following queries:
  set num_scanner_threads=1;
  select count(i) from default.avro_bigints_big; # file contains only longs
  select max(l_orderkey) from biglineitem_avro; # file has tpch.lineitem schema

Both benchmark queries see negligible or no performance impact.

This patch adds a new Avro scanner unit test and an end-to-end test
that queries several corrupted files, as well as updates the zig-zag
varlen int unit test.

Change-Id: I801a11c496a128e02c564c2a9c44baa5a97be132
Reviewed-on: http://gerrit.cloudera.org:8080/3072
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-06-13 18:32:32 -07:00
Taras Bobrovytsky
bee5375502 Add kill cluster marker to KMS
If the PID files of the mini cluster processes get deleted for some
reason, it should still be possible to kill them because each process is
marked with "-DIBelongToTheMiniCluster". It turns out that the KMS
process was not being marked. This patch fixes that.

Change-Id: I0398dec94be3ae91548d11a79c1d5eec0ad3dadb
Reviewed-on: http://gerrit.cloudera.org:8080/3354
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
2016-06-13 15:32:17 -07:00
Alex Behm
19ff47091c IMPALA-3530: Clean up test_ddl.py. Part 1.
This is the first in a series of patches to clean up test_ddl.py

Summary of changes:
  - Break up test_create() and corresponding .test files into:
    * test_create_database()
    * test_create_table()
    * test_create_table_like_table()
    * test_create_table_like_file()
    * test_create_table_as_select()
  - Merge test_nested() into the tests above
  - Move a test into test_hms_integration.py
  - Add a new test_ddl_base.py as base class for DDL tests.
    The plan is to split up test_ddl.py into several smaller
    .py files in subsequent patches.

Testing: I tested test_ddl.py and test_hms_integration.py on
exhaustive locally as well as in private builds on all filesystems.

Change-Id: I5f4c044d39e165c2535961b8d0a765c8dbbd051c
Reviewed-on: http://gerrit.cloudera.org:8080/3044
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2016-06-10 10:31:15 -07:00
Alex Behm
e57fd2d831 IMPALA-3491: Use unique_database fixture in test_local_fs.py
Testing: Ran hdfs/core and localfs/core private builds.

Change-Id: I0720458882ac3b1138deccf9af0ee57bf2eed7dc
Reviewed-on: http://gerrit.cloudera.org:8080/3334
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2016-06-08 16:30:32 -07:00