1.) IMPALA-4134: Use Kudu AUTO FLUSH
Improves performance of writes to Kudu up to 4.2x in
bulk data loading tests (load 200 million rows from
lineitem).
2.) IMPALA-3704: Improve errors on PK conflicts
The Kudu client reports an error for every PK conflict,
and all errors were being returned in the error status.
As a result, inserts/updates/deletes could return errors
with thousands errors reported. This changes the error
handling to log all reported errors as warnings and
return only the first error in the query error status.
3.) Improve the DataSink reporting of the insert stats.
The per-partition stats returned by the data sink weren't
useful for Kudu sinks. Firstly, the number of appended rows
was not being displayed in the profile. Secondly, the
'stats' field isn't populated for Kudu tables and thus was
confusing in the profile, so it is no longer printed if it
is not set in the thrift struct.
Testing: Ran local tests, including new tests to verify
the query profile insert stats. Manual cluster testing was
conducted of the AUTO FLUSH functionality, and that testing
informed the default mutation buffer value of 100MB which
was found to provide good results.
Change-Id: I5542b9a061b01c543a139e8722560b1365f06595
Reviewed-on: http://gerrit.cloudera.org:8080/4728
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
This has been working for several months, and it it was written mainly
by Casey Ching while he was at Cloudera working on Impala.
Change-Id: Ia4bc78ad46dda13e4533183195af632f46377cae
Reviewed-on: http://gerrit.cloudera.org:8080/4820
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
* Change run-step to output full log path
* Change text to say "Computing table stats" rather than "Computing
HBase stats" when running compute-table-stats.sh
Change-Id: I326f4c370fda8d5e388af8e2395623185c06bc07
Reviewed-on: http://gerrit.cloudera.org:8080/4825
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
The previous code was not written in a way that GCC could
auto-vectorize it. Manually vectorizing speeds up BloomFilter::Or by
up to 184x.
Change-Id: I840799d9cfb81285c796e2abfe2029bb869b0f67
Reviewed-on: http://gerrit.cloudera.org:8080/4813
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
These methods and code paths have been made obsolete by the switch to
Bloom filters.
Change-Id: I95fcaaa40243999800c2ec2ead5b3479d66a63e7
Reviewed-on: http://gerrit.cloudera.org:8080/4801
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
IMPALA-4258: The problem was that there was a reference to
HdfsScanner::batch_ hidden inside WriteEmptyTuples(). The batch_
reference is NULL when the scanner is run with MT_DOP > 1.
IMPALA-4286: When there are no scan ranges HdfsScanNodeBase::Open()
exits early without initializing the reader context. This lead to
a DCHECK in IoMgr::GetNextRange() called from HdfsScanNodeMt.
The fix is to remove that unnecessary short-circuit Open().
I combined these two bugfixes because the new basic test covers
both cases.
Testing: Added a new test_mt_dop.py test. A private code/hdfs
run passed.
Change-Id: I79c0f6fd2aeb4bc6fa5f87219a485194fef2db1b
Reviewed-on: http://gerrit.cloudera.org:8080/4767
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
This change fixes a memory management problem with LEAD()/LAG()
analytic functions which led to incorrect result. In particular,
the update functions specified for these analytic functions only
make a shallow copy of StringVal (i.e. copying only the pointer
and the length of the string) without copying the string itself.
This may lead to problem if the string is created from some UDFs
which do local allocations whose buffer may be freed and reused
before the result tuple is copied out. This change fixes the problem
above by allocating a buffer at the Init() functions of these
analytic functions to track the intermediate value. In addition,
when the value is copied out in GetValue(), it will be copied into
the MemPool belonging to the AnalyticEvalNode and attached to the
outgoing row batches. This change also fixes a missing free of
local allocations in QueryMaintenance().
Change-Id: I85bb1745232d8dd383a6047c86019c6378ab571f
Reviewed-on: http://gerrit.cloudera.org:8080/4740
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
This patch restores some behaviour from pre-IMPALA-2905 where we would
not send 0-row batches to the client. Although 0-row batches are legal,
they're not very useful for clients to receive (and clients may not
correctly process them).
No query was found which reliably produced 0-row batches, so no test is
added.
Change-Id: I7d339c1f9a55d9d75fb0e97d16b3176cc34f2171
Reviewed-on: http://gerrit.cloudera.org:8080/4787
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
commit 9f61397fc4 exposed a bug (one
that was latent before the commit). I am XFAILing this now just to
green the build; IMPALA-4295 can be resolved when this issue is fixed
and not just XFAILed.
Change-Id: Ie809c6c6c967447d527927ebbc6b110095e7320a
Reviewed-on: http://gerrit.cloudera.org:8080/4784
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
"IMPALA-4037,IMPALA-4038: fix locking during query
cancellation" accidentally added the "Child queries
finished" event unconditionally. We should only do
this if there are actually child queries.
Change-Id: I3881d032622750444d750f161ad6843bdbd16c30
Reviewed-on: http://gerrit.cloudera.org:8080/4768
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
This commit modifies the stress test framework to run TPC-H and TPC-DS
workloads against Kudu. The follwing changes are included in this
commit:
1. Created template files with DDL and DML statements for loading TPC-H and
TPC-DS data in Kudu
2. Created a script (load-tpc-kudu.py) to load data in Kudu. The
script is invoked by the stress test runner to load test data in an
existing Impala/Kudu cluster (both local and CM-managed clusters are
supported).
3. Created SQL files with TPC-DS queries to be executed in Kudu. SQL
files with TPC-H queries for Kudu were added in a previous patch.
4. Modified the stress test runner to take additional parameters
specific to Kudu (e.g. kudu master addr)
The stress test runner for Kudu was tested on EC2 clusters for both TPC-H
and TPC-DS workloads.
Missing functionality:
* No CRUD operations in the existing TPC-H/TPC-DS workloads for Kudu.
* Not all supported TPC-DS queries are included. Currently, only the
TPC-DS queries from the testdata/workloads/tpcds/queries directory
were modified to run against Kudu.
Change-Id: I3c9fc3dae24b761f031ee8e014bd611a49029d34
Reviewed-on: http://gerrit.cloudera.org:8080/4327
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
With this commit we simplify the syntax and handling of CREATE TABLE
statements for both managed and external Kudu tables.
Syntax example:
CREATE TABLE foo(a INT, b STRING, PRIMARY KEY (a, b))
DISTRIBUTE BY HASH (a) INTO 3 BUCKETS,
RANGE (b) SPLIT ROWS (('abc', 'def'))
STORED AS KUDU
Changes:
1) Remove the requirement to specify table properties such as key
columns in tblproperties.
2) Read table schema (column definitions, primary keys, and distribution
schemes) from Kudu instead of the HMS.
3) For external tables, the Kudu table is now required to exist at the
time of creation in Impala.
4) Disallow table properties that could conflict with an existing
table. Ex: key_columns cannot be specified.
5) Add KUDU as a file format.
6) Add a startup flag to impalad to specify the default Kudu master
addresses. The flag is used as the default value for the table
property kudu_master_addresses but it can still be overriden
using TBLPROPERTIES.
7) Fix a post merge issue (IMPALA-3178) where DROP DATABASE CASCADE
wasn't implemented for Kudu tables and silently ignored. The Kudu
tables wouldn't be removed in Kudu.
8) Remove DDL delegates. There was only one functional delegate (for
Kudu) the existence of the other delegate and the use of delegates in
general has led to confusion. The Kudu delegate only exists to provide
functionality missing from Hive.
9) Add PRIMARY KEY at the column and table level. This syntax is fairly
standard. When used at the column level, only one column can be
marked as a key. When used at the table level, multiple columns can
be used as a key. Only Kudu tables are allowed to use PRIMARY KEY.
The old "kudu.key_columns" table property is no longer accepted
though it is still used internally. "PRIMARY" is now a keyword.
The ident style declaration is used for "KEY" because it is also used
for nested map types.
10) For managed tables, infer a Kudu table name if none was given.
The table property "kudu.table_name" is optional for managed tables
and is required for external tables. If for a managed table a Kudu
table name is not provided, a table name will be generated based
on the HMS database and table name.
11) Use Kudu master as the source of truth for table metadata instead
of HMS when a table is loaded or refreshed. Table/column metadata
are cached in the catalog and are stored in HMS in order to be
able to use table and column statistics.
Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1
Reviewed-on: http://gerrit.cloudera.org:8080/4414
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
StmtRewrite lost parentheses of CompoundPredicate in pushNegationToOperands()
and leads to incorrect toSql() result. Even though this issue would not leads
to incorrect result of query, it makes user confuse of the logical operator
precedence of predicates shown in EXPLAIN statement.
Change-Id: I79bfc67605206e0e026293bf7032a88227a95623
Reviewed-on: http://gerrit.cloudera.org:8080/4753
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
The scheduler crashed with a segmentation fault when there were no
backends registered: After not being able to find a local backend (none
are configured at all) in ComputeScanRangeAssignment(), the previous
code would eventually try to return the top of
assignment_ctx.assignment_heap in SelectRemoteBackendHost(), but that
heap would be empty. Subsequently, when using the IP address of that
heap node, a segmentation fault would occur.
This change adds a check and aborts scheduling with an error. It also
contains a test.
Change-Id: I6d93158f34841ea66dc3682290266262c87ea7ff
Reviewed-on: http://gerrit.cloudera.org:8080/4776
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
If the table format is changed by the Alter Table statement, the
default partition in partitioned tables used to not get updated. This
caused a problem because Insert picks up the file format for new
partitions from the default partition. This patch fixes the problem by
calling addDefaultPartition().
Also removed "drop table if not exists" in tests in alter-table.test
because we already have the unique_database fixture.
Change-Id: I59bf21caa5c5e7867d07d87cda0c0a5b4b994859
Reviewed-on: http://gerrit.cloudera.org:8080/4750
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
AnayticExpr.analyze() replaces the original FIRST/LAST_VALUE
function with a FIRST/LAST_VALUE_IGNORE_NULLS function if
the IGNORE NULLS clause is specified.
The bug was that several places in AnalyticExpr.analyze() assumed
and asserted that only the original FIRST/LAST_VALUE function
could be encountered during analysis. However, with subquery
rewriting the IGNORE NULLS version of the function may also be
seen because the whole statement is re-analyzed after rewriting.
The fix is to unset the IGNORE NULLS flag of the function params
after changing the analytic function name.
Change-Id: I708de7925fe6aeef582fd7510da93d24c71229d9
Reviewed-on: http://gerrit.cloudera.org:8080/4732
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
This change enables codegen for the tuple row comparator
used in merging-exchange node.
With this change, merging-exchange operator improves by
40% and 50% respectively for primitive_orderby_bigint and
primitive_orderby_all on TPCH-300, speeding up the query by
6% and 11% respectively.
Change-Id: I944b8d52ea63ede58e4dc6fbe6e6953756394d41
Reviewed-on: http://gerrit.cloudera.org:8080/4759
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
Also pass the flag that enables ld.gold directly to the
compiler. This is understood by both gcc and clang
(if prefixed with -Wl, clang just forwards the flag to ld,
where it is ignored).
Testing:
Did ASAN and debug private builds to validate it works.
Tested shared library, release, ninja and distcc builds locally
as part of my normal workflow.
Change-Id: Ib05c944ced9cdfe54941f4b690574e45a25110a2
Reviewed-on: http://gerrit.cloudera.org:8080/4751
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
In our IPMC vote to release 2.7.0 rc3, Justing Mclean pointed out a
number of issues of compliance with ASF policy. He asked:
1. "Please place build instruction and supported platforms in the
README. The wiki may change over time and that may make it difficult
to build older versions."
2. Remove binary file llvm-ir/test-loop.bc
3. Add be/src/gutil/valgrind.h,
shell/ext-py/sqlparse-0.1.14/sqlparse/pipeline.py and
cmake_modules/FindJNI.cmake, normalize.css (embedded in bootstrap.css)
to LICENSE.txt
4. Fix be/src/thirdparty/squeasel/squeasel* in LICENSE.txt
5. Remove outdated copyright lines from HBase (see
https://issues.apache.org/jira/browse/HBASE-3870)
6. Remove duplicate jquery notice from LICENSE.txt
Change-Id: I30ff77d7ac28ce67511c200764fba19ae69922e0
Reviewed-on: http://gerrit.cloudera.org:8080/4582
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
with Kudu scan node
Currently we do not start the TotalStorageWaitTime timer in the
kudu-scanner. This patch replaces the kudu_read_timer with the
TotalStorageWaitTime which measures the intended time.
Change-Id: If0c793930799fdcaff53e705f94b52cadac2f53a
Reviewed-on: http://gerrit.cloudera.org:8080/4639
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
This patch is mostly mechanical move of codegen related logic
from each exec node's Prepare() to its Codegen() function.
After this change, code generation will no longer happen in
Prepare(). Instead, it will happen after Prepare() completes in
PlanFragmentExecutor. This is an intermediate step towards the
final goal of sharing compiled code among fragment instances in
multi-threading.
As part of the clean up, this change also removes the logic for
lazy codegen object creation. In other words, if codegen is enabled,
the codegen object will always be created. This simplifies some
of the logic in ScalarFnCall::Prepare() and various Codegen()
functions by reducing error checking needed. This change also
removes the logic added for tackling IMPALA-1755 as it's not
needed anymore after the clean up.
The clean up also rectifies a not so well documented situation.
Previously, even if a user explicitly sets DISABLE_CODEGEN to true,
we may still codegen a UDF if it was written in LLVM IR or if it
has more than 8 arguments. This patch enforces the query option
by failing the query in both cases. To run the query, the user
must enable codegen. This change also extends the number of
arguments supported in the interpretation path of ScalarFn to 20.
Change-Id: I207566bc9f4c6a159271ecdbc4bbdba3d78c6651
Reviewed-on: http://gerrit.cloudera.org:8080/4651
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
This change prevents us from depending on LLAMA to build.
Note that the LLAMA MiniKDC is left in - it is a test
utility that does not depend on LLAMA itself.
IMPALA-4292 tracks cleaning this up.
Testing:
Ran a private build to verify that all tests pass.
Change-Id: If2e5e21d8047097d56062ded11b0832a1d397fe0
Reviewed-on: http://gerrit.cloudera.org:8080/4739
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
This mostly mechanical change moves the definition and implementation of
the Beeswax and HS2-specific result sets into their own module. Result
sets are now uniformly created by one of two factory methods, so the
implementation is decoupled from the client.
Change-Id: I6ab883b62d3ec7012240edf8d56889349e7c0e32
Reviewed-on: http://gerrit.cloudera.org:8080/4736
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
Fixed double decrement in case a cached connection is broken
and cannot be re-created.
Change-Id: Ic9e28055cb232cdb543c4c9f05a558ab0f73f777
Reviewed-on: http://gerrit.cloudera.org:8080/4668
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
This is to help with IMPALA-4277 to make it easier to build against
Hadoop/Hive distributions where the directory layout doesn't exactly
match our current CDH dependencies, or where we may want to
temporarily override a version without making a source change.
Change-Id: I7da10e38f9c4309f2d193dc25f14a6ea308c9639
Reviewed-on: http://gerrit.cloudera.org:8080/4720
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
Adds utility functions for fast unpacking of batches of bit-packed
values. These support reading batches of any number of values provided
that the start of the batch is aligned to a byte boundary. Callers that
want to read smaller batches that don't align to byte boundaries will
need to implement their own buffering.
The unpacking code uses only portable C++ and no SIMD intrinsics, but is
fairly efficient because unpacking a full batch of 32 values compiles
down to 32-bit loads, shifts by constants, masks by constants, bitwise
ors when a value straddles 32-bit words and stores. Further speedups
should be possible using SIMD intrinsics.
Testing:
Added unit tests for unpacking, exhaustively covering different
bitwidths with additional test dimensions (memory alignment, various
input sizes, etc).
Tested under ASAN to ensure the bit unpacking doesn't read past the end
of buffers.
Perf:
Added microbenchmark that shows on average an 8-9x speedup over the
existing BitReader code.
Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622
Reviewed-on: http://gerrit.cloudera.org:8080/4494
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
A previous commit "IMPALA-4259: build Impala without any test
cluster setup" altered some undocumented side-effects of
buildall.sh.
Previously the following commands reconfigured and restarted the test
cluster. It worked because buildall.sh unconditionally regenerated
the test cluster configs.
./buildall.sh -notests && ./testdata/bin/run-all.sh
./buildall.sh -noclean -notests && ./testdata/bin/run-all.sh
Instead of restoring the old behaviour and continuing to encourage
mixing use of low and high level scripts like testdata/bin/run-all.sh
as part of the "standard" workflow, this commit adds another
high-level option to buildall.sh, -start_minicluster, that
accomplishes the high-level task of restarting a minicluster with
fresh configs. The above commands can be replaced with:
./buildall.sh -notests -start_minicluster
./buildall.sh -notests -noclean -start_minicluster
Change-Id: I0ab3461f8ff3de49b3f28a0dc22fa0a6d5569da5
Reviewed-on: http://gerrit.cloudera.org:8080/4734
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
MT_DOP > 0 is only supported for plans without distributed joins
or table sinks. Adds validation to fail unsupported queries
gracefully in planning.
For scans in queries that are executable with MT_DOP > 0 we either
use the optimized MT scan node BE implementation (only Parquet), or
we use the conventional scan node with num_scanner_threads=1.
TODO: Still need to add end-to-end tests.
Change-Id: I91a60ea7b6e3ae4ee44be856615ddd3cd0af476d
Reviewed-on: http://gerrit.cloudera.org:8080/4677
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
There are two motivations for this change:
1. Reduce memory consumption.
2. Pave the way for full memory layout compatibility between
Impala and Kudu to eventually enable zero-copy scans. This
patch is a only first step towards that goal.
New Memory Layout
Slots are placed in descending order by size with trailing bytes to
store null flags. Null flags are omitted for non-nullable slots. There
is no padding between tuples when stored back-to-back in a row batch.
Example: select bool_col, int_col, string_col, smallint_col
from functional.alltypes
Slots: string_col|int_col|smallint_col|bool_col|null_byte
Offsets: 0 16 20 22 23
The main change is to move the null indicators to the end of tuples.
The new memory layout is fully packed with no padding in between
slots or tuples.
Performance:
Our standard cluster perf tests showed no significant difference in
query response times as well as consumed cycles, and a slight
reduction in peak memory consumption.
Testing:
An exhaustive test run passed. Ran a few select tests like TPC-H/DS
with ASAN locally.
These follow-on changes are planned:
1. Planner needs to mark slots non-nullable if they correspond
to a non-nullable Kudu column.
2. Update Kudu scan node to copy tuples with memcpy.
3. Kudu client needs to support transferring ownership of the
tuple memory (maybe do direct and indirect buffers separately).
4. Update Kudu scan node to use memory transfer instead of copy
Change-Id: Ib6510c75d841bddafa6638f1bd2ac6731a7053f6
Reviewed-on: http://gerrit.cloudera.org:8080/4673
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
The plan-root fragment instance that runs on the coordinator should be
handled like all others: started via RPC and run asynchronously. Without
this, the fragment requires special-case code throughout the
coordinator, and does not show up in system metrics etc.
This patch adds a new sink type, PlanRootSink, to the root fragment
instance so that the coordinator can pull row batches that are pushed by
the root instance. The coordinator signals completion to the fragment
instance via closing the consumer side of the sink, whereupon the
instance is free to complete.
Since the root instance now runs asynchronously wrt to the coordinator,
we add several coordination methods to allow the coordinator to wait for
a point in the instance's execution to be hit - e.g. to wait until the
instance has been opened.
Done in this patch:
* Add PlanRootSink
* Add coordination to PFE to allow coordinator to observe lifecycle
* Make FragmentMgr a singleton
* Removed dead code from Coordinator::Wait() and elsewhere.
* Moved result output exprs out of QES and into PlanRootSink.
* Remove special-case limit-based teardown of coordinator fragment, and
supporting functions in PlanFragmentExecutor.
* Simplified lifecycle of PlanFragmentExecutor by separating Open() into
Open() and Exec(), the latter of which drives the sink by reading
rows from the plan tree.
* Add child profile to PlanFragmentExecutor to measure time spent in
each lifecycle phase.
* Removed dependency between InitExecProfiles() and starting root
fragment.
* Removed mostly dead-code handling of LIMIT 0 queries.
* Ensured that SET returns a result set in all cases.
* Fix test_get_log() HS2 test. Errors are only guaranteed to be visible
after fetch calls return EOS, but test was assuming this would happen
after first fetch.
Change-Id: Ibb0064ec2f085fa3a5598ea80894fb489a01e4df
Reviewed-on: http://gerrit.cloudera.org:8080/4402
Tested-by: Internal Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
Fixes a regression in the data load process that had been introduced
by commit 75a857c. To making check-schema-diff.sh work from anywhere.
we need to specify the git-dir and work-tree arguments everywhere we
call git.
Change-Id: I32e0dce2c10c443763a038aa3b64b1c123ed62ad
Reviewed-on: http://gerrit.cloudera.org:8080/4726
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
Fixes a small perf issue by avoiding extra calls to check
a vector size on every slot.
Testing: Ran EE tests.
Change-Id: Ie76d33c3d00e3be6d238226d28c4100bb65aac58
Reviewed-on: http://gerrit.cloudera.org:8080/4688
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
IMPALA-3002:
The shell prints an incorrect value for '#Rows' in the exec
summary for broadcast nodes due to incorrect logic around
whether to use max or agg stats. This patch makes the behavior
consistent with the way the be treats exec summaries in
summary-util.cc. This incorrect logic was also duplicated in
the impala_beeswax test framework.
IMPALA-1473:
When there is a merging exchange with a limit, we may copy rows
into the output batch beyond the limit. In this case, we currently
update the output batch's size to reflect the limit, but we also
need to update ExecNode::num_rows_returned_ or the exec summary
may show that the exchange node returned more rows than it really
did.
Additionally, PlanFragmentExecutor::GetNext does not update
rows_produced_counter_ in some cases, leading the runtime profile
to display an incorrect value for 'RowsProduced'.
Change-Id: I386719370386c9cff09b8b35d15dc712dc6480aa
Reviewed-on: http://gerrit.cloudera.org:8080/4679
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
Adds a profile counter for the number of kudu scan tokens
(ranges) that are "expected" to be remote.
Testing: Manual; Have been running with this on the Kudu
cluster. Cannot easily simulate this in the minicluster
because the scheduler considers multiple impalads on the
same host to be local for the purposes of determining
locality. See BackendConfig::LookUpBackendIp().
Change-Id: I74fd5773c4ae10267de80b6572d93197a4131696
Reviewed-on: http://gerrit.cloudera.org:8080/4687
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
This adds a tie-break to make sure that we sort predicates in a
deterministic order on Java 7 and 8. This was suggested by Alex in
IMPALA-3644.
There are still three broken tests when run in Java 8, but it seems best
to address them in a subsequent change.
Change-Id: Id11010bfeaff368869e6d430eeb4773ddf41faff
Reviewed-on: http://gerrit.cloudera.org:8080/4671
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
To be able to run the Random Query Generator with Impala and Kudu, we
need to mount an external Docker volume as a workaround to KUDU-1419.
This patch introduces a series of environment variables a user may tweak
in order to help with that purpose. The patch assumes a viable,
reasonable Docker container based on a standard Linux distribution like
Ubuntu 14.
To assist users, I've updated the Leopard README with instructions on
the environment variables' meanings.
The gist here is that the container is the source of truth, which means
to create an external volume, we need to copy the testdata off the
container onto the host running Docker Engine. To do that we suggest a
strategy using rsync via passwordless SSH key.
Testing:
I used a Cloudera Docker container that has Impala in /home/dev/Impala.
Before, Kudu would fail to start due to KUDU-1419. Now, we load testdata
into an external volume, build Impala, run the minicluster including
Kudu, and can access the tpch_kudu data.
I made flake8 fixes as well. flake8 on this file is now clean.
Change-Id: Ia7d9d9253fcd7e3905e389ddeb1438cee3e24480
Reviewed-on: http://gerrit.cloudera.org:8080/4678
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
This script bootstraps an Impala dev environment on Ubuntu 14.04. It
is not hermetic -- it changes some config files for the user and for
the OS.
It is green on Jenkins, and it runs in about 6.5 hours. The intention
is to have this script run in a CI tool for post-commit testing, with
the hope that this will make it easier for new developers to get a
working development environment. Previously, the new developer
workflow lived on wiki pages and tended to bit-rot.
Still left to do: migrating the install script into the official
Impala repo.
Change-Id: If166a8a286d7559af547da39f6cc09e723f34c7e
Reviewed-on: http://gerrit.cloudera.org:8080/4674
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
Adds code comments and issues a warning for Parquet files
with num_rows=0 but at least one non-empty row group.
Change-Id: I72ccf00191afddb8583ac961f1eaf11e5eb28791
Reviewed-on: http://gerrit.cloudera.org:8080/4696
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Previously, when creating a LlvmCodeGen object, we
run an O(mn) algorithm to map the IRFunction::Type
to the actual LLVM::Function object in the module.
m is the size of IRFunction::Type enum and n is
the total number of functions in the module. This
is a waste of time if we only use few functions
from the module.
This change reduces the preparation time of a simple
query from 23ms to 10ms.
select count(*) from tpch100_parquet.lineitem where l_orderkey > 20;
Change-Id: I61ab9fa8cca5a0909bb716c3c62819da3e3b3041
Reviewed-on: http://gerrit.cloudera.org:8080/4691
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
The commit "IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder"
slightly increased codegen time, which caused TPC-H Q2 to sometimes
regress significantly because of races in runtime filter arrival.
This patch attempts to fix the regression by improving codegen time in a
few places.
* Revert to using the old bool/Status return pattern. The regular Status
return pattern results in significantly more complex IR because it has
to emit code to copy and free statuses. I spent some time trying to
convince it to optimise the extra code out, but didn't have much success.
* Remove some code that cannot be specialized from cross-compilation.
* Add noexcept to some functions that are used from the IR to ensure
exception-handling IR is not emitted. This is less important after the
first change but still should help produce cleaner IR.
Performance:
I was able to reproduce a regression locally, which is fixed by this
patch. I'm in the process of trying to verify the fix on a cluster.
Change-Id: Idf0fdedabd488550b6db90167a30c582949d608d
Reviewed-on: http://gerrit.cloudera.org:8080/4623
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
In order to attempt to get code like
double VeryLongFunctionNames(double x1, double x2, double x3,
double x4) {
return 1.0;
}
rather than
double VeryLongFunctionNames(
double x1, double x2, double x3, double x4) {
return 1.0;
}
I wrote a small set of programs to infer which .clang-format params
fit the current Impala codebase most closely; this patch is the
result.
This patch is the best the inferencer found (while maintaining certain
enforced parameters, like 90-character lines). It is about 10% closer
to Impala's current code base than the .clang-format that is checked
in at the moment, as measured by number of lines in the diff.
Change-Id: Iccaec6c1673c3e08d2c39200b0c84437af629aed
Reviewed-on: http://gerrit.cloudera.org:8080/4590
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Jim Apple <jbapple@cloudera.com>
This picks up the latest toolchain version. The only change is that
some symlinks in the previous version were broken.
Change-Id: I0c5e9ef10984fc8c6840acf285a04e472fc8b304
Reviewed-on: http://gerrit.cloudera.org:8080/4716
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Internal Jenkins
The main outcome of this change is to avoid making unnecessary
modification to the Impala or other source trees when we don't need the
test cluster.
To achieve that, this refactors the script to make the flow easier
to understand and makes it more consistent which build steps are
executed in which modes.
Change-Id: I429da7bc6681b16c07fe58bb3efac6d1a8579137
Reviewed-on: http://gerrit.cloudera.org:8080/4685
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
This change might give a minor speedup to COMPUTE STATS
and COMPUTE INCREMENTAL STATS. In any case, marking the slots
non-nullable seems strictly better than leaving them nullable.
Testing: I ran our local testdata/compute-table-stats.sh and
it succeeded.
Change-Id: I1c05b8dfb797b2a42ee1a7bf14ad56bb83d2b1c5
Reviewed-on: http://gerrit.cloudera.org:8080/4707
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins