IMPALA-7307 disabled column index writing for floating point columns
until PARQUET-1222 is resolved. However, only NaN values are
problematic, so we can write the column index when the data contains
no NaNs.
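For illustration, a small Python sketch of why NaN breaks min/max based
filtering, and of the check this patch effectively performs (the helper
name is hypothetical; the real writer is C++):
  import math

  nan = float("nan")
  # Every comparison against NaN is false, so page min/max values that
  # involve NaN give no usable ordering for filtering (PARQUET-1222).
  print(nan < 1.0, nan > 1.0, nan == nan)   # False False False

  def can_write_column_index(values):
      # Hypothetical helper mirroring the patch: suppress the column
      # index if any NaN was seen while writing the column's pages.
      return not any(isinstance(v, float) and math.isnan(v) for v in values)

  print(can_write_column_index([1.0, 2.5]))           # True  -> write index
  print(can_write_column_index([1.0, float("nan")]))  # False -> skip index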
Testing:
* Added tests that should fail if a column index is
written for a column that contains NaN values.
Change-Id: Ic9d367500243c8ca142a16ebfeef6c841f013434
Reviewed-on: http://gerrit.cloudera.org:8080/14264
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The main refactoring is to move result expressions into the
DataSink implementations, which is where they are used
in the backend. This will make it easier to explicitly collect
all the expressions in the plan tree for the purposes of
projection. Previously the expressions were owned by
the PlanFragment and passed into the DataSink.
Show result exprs in the explain plan of table sinks
at higher verbosity.
Change-Id: I163a393b5ce6b8a926b3fee9b4b920e31d6846b2
Reviewed-on: http://gerrit.cloudera.org:8080/14270
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Added a feature that estimates the number of rows in an HDFS table
when cardinality statistics for that table are unavailable.
Also added a query option to revert the change in case of regression.
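A minimal sketch of the estimate, assuming it divides the table's
on-disk size by an estimated row width derived from the column types
(hypothetical names, not the planner's Java code):
  def estimate_scan_cardinality(total_file_bytes, est_row_size_bytes):
      # Without stats, approximate rows as bytes-on-disk / bytes-per-row.
      if est_row_size_bytes <= 0:
          return -1                      # still unknown
      return max(1, total_file_bytes // est_row_size_bytes)

  # A 1 MB table with ~512-byte rows is estimated at ~2048 rows.
  print(estimate_scan_cardinality(1 << 20, 512))   # 2048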
Testing:
(1) In CardinalityTest.java, replaced the original statement
"verifyCardinality("SELECT a FROM functional.tinytable", -1);" in
the method testBasicsWithoutStats() with
"verifyCardinality("SELECT a FROM functional.tinytable", 2);".
(2) In CardinalityTest.java, added more tests to check the cardinality
of most PlanNode implementations. For each tested PlanNode, the behavior
is tested both with the feature enabled and disabled.
(3) In set.test, modified three related test cases to make sure that
the added query option is included after executing "set all" in various
scenarios.
(4) There are 8 JUnit tests in PlannerTest.java that would produce different
distributed query plans when this feature is enabled. For 6 of those 8
affected JUnit tests, added an additional JUnit test that runs with the
feature enabled. Specifically, each tested query in the newly added test
files involves at least one HDFS table without available statistics.
We did not add test cases for the other 2 affected JUnit tests,
testResourceRequirements() and testSpillableBufferSizing(), since doing
so results in flaky tests. In this patch we only test them with the
feature disabled.
(5) There are 5 Python end-to-end tests containing queries that would
produce different results. For each affected query, added an additional
query that runs with this feature disabled.
Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a
Reviewed-on: http://gerrit.cloudera.org:8080/12974
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit implements page filtering based on the Parquet page index.
Reading and evaluating the page index is done by the
HdfsParquetScanner. First, we determine the row ranges we are
interested in; based on those row ranges we determine the candidate
pages for each column that we are reading.
We still issue one ScanRange per column chunk, but we specify
sub-ranges that store the candidate pages, i.e. we don't read
the whole column chunk, but only fractions of it.
Pages are not aligned across column chunks, i.e. page #2 of column A
might store completely different rows than page #2 of column B.
This means we need row-skipping logic when reading the data
pages. This logic is implemented in
BaseScalarColumnReader and ScalarColumnReader. Collection column
readers know nothing about page filtering.
Page filtering can be turned off by setting the query option
'read_parquet_page_index' to false.
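A minimal sketch of the candidate-page selection (Python with
illustrative names; the real logic lives in the C++ HdfsParquetScanner):
  def candidate_pages(first_row_indices, num_rows_in_chunk, row_ranges):
      # first_row_indices[i] is the first row stored in page i (from the
      # Parquet offset index); row_ranges holds (start_row, end_row)
      # pairs we want to read. A page is a candidate if its row span
      # overlaps any wanted range.
      pages = []
      for i, first in enumerate(first_row_indices):
          last = (first_row_indices[i + 1] - 1
                  if i + 1 < len(first_row_indices)
                  else num_rows_in_chunk - 1)
          if any(lo <= last and first <= hi for lo, hi in row_ranges):
              pages.append(i)
      return pages

  # Four pages of 100 rows each; only rows 150..180 are wanted -> page 1.
  print(candidate_pages([0, 100, 200, 300], 400, [(150, 180)]))   # [1]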
Testing:
* added some unit tests for the row range and
page selection logic
* generated various Parquet files with Parquet-MR
* enabled page index writing and ran selective queries against
tables written by Impala. Existing tests are likely to exercise
page filtering transparently.
Performance:
* Measured locally, observed 3x to 20x speedup for selective queries.
The speedup was proportional to the I/O operations that needed to be
done.
* The TPC-H benchmark didn't show a significant performance change. This
is not a surprise since the data is not sorted in any useful way, so
the main goal was to not introduce a perf regression.
TODO:
* measure performance for remote reads
Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Reviewed-on: http://gerrit.cloudera.org:8080/12065
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Storage layer information was added to the query profile by
IMPALA-6050. This broke some tests on exhaustive and s3 runs
due to changes in formatting.
This fixes the issues:
1. Replace HDFS SCAN with $FILESYSTEM_NAME SCAN in some test files
2. Add $FILESYSTEM_NAME to partition information string
Testing:
- Ran exhaustive HDFS tests
- Ran s3 tests
Change-Id: I11c6ab9c888464a0f0daaf8a7a6f565d25731872
Reviewed-on: http://gerrit.cloudera.org:8080/13025
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A recent fix added node cardinality to the standard EXPLAIN output,
displaying a large number like 123456780 as 123.46M. This patch applies
the same fix to the remaining row count numbers: metadata, extrapolated
rows, etc.
Tests:
* Rebased PlannerTest .test files as needed for the new row count
format.
* Reran all tests to check for dependencies on the old format.
Change-Id: I08faaa9ad7b5ed42dcd7a15a333e8734bb45f10c
Reviewed-on: http://gerrit.cloudera.org:8080/12438
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala often generates SQL for statements using the toSql() call.
Generated SQL is often used during testing or when writing the query
plan. Impala keywords such as "create", when used as identifiers,
must be quoted:
SELECT `select`, `from` FROM `order` ...
The code in ToSqlUtils.getIdentSql() quotes the identifier if it is
an Impala or Hive keyword, or if it does not follow the identifier
pattern. The code uses the Hive lexer to detect keywords, but it
contained a flaw: the lexer expects case-normalized (upper-cased) input,
while we provided the identifier in its original case. As a result,
"MONTH" was caught as a Hive keyword and quoted, but "month" was not.
This patch fixes that flaw.
This patch also fixes:
IMPALA-8051: Compute stats fails on a column with comment character in
name
The code uses the Hive lexical analyzer to check names. Since "#" and
"--" start comments, a name like "foo#" is parsed as just "foo", which
does not need quotes; hence "foo#" was left unquoted, causing issues.
Added a special check for "#" and "--" to resolve this issue.
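A sketch of the combined fix (a hypothetical Python helper, not the
Java implementation):
  COMMENT_CHARS = ("#", "--")

  def needs_quotes(ident, is_keyword):
      # Comment characters would be silently swallowed by the lexer, so
      # check for them explicitly; then normalize case before the keyword
      # check so "month" and "MONTH" are treated identically.
      if any(c in ident for c in COMMENT_CHARS):
          return True
      return is_keyword(ident.upper())

  keywords = {"MONTH", "SELECT", "FROM"}
  print(needs_quotes("month", lambda s: s in keywords))   # True
  print(needs_quotes("foo#", lambda s: s in keywords))    # True
  print(needs_quotes("foo", lambda s: s in keywords))     # False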
Testing:
* Refactored getIdentSql() for easier testing.
* Added tests to the recently added ToSqlUtilsTest for this case and
several others.
* Making this change caused the columns `month`, `year`, and `key` to be
quoted when before they were not. Updated many tests as a result.
* Added a new identSql() function, for use in tests, that matches
Impala's quoting and handles wildcards and multi-part names. Used
this in ToSqlTest to handle the quoted names.
* PlannerTest emits statement SQL to the output file wrapped to 80
columns and sometimes leaves trailing spaces at the end of the line.
Some tools remove that trailing space, resulting in trivial file
differences. Fixed this to remove trailing spaces in order to simplify
file comparisons.
* Tweaked the "In pipelines" output to avoid trailing spaces when no
pipelines are listed.
* Reran all FE tests.
Change-Id: I06cc20b052a3a66535a171c36b4b31477c0ba6d0
Reviewed-on: http://gerrit.cloudera.org:8080/12009
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Fill the LogicalType field in Parquet schemas for columns
that have an associated logical type. ConvertedType still
has to be filled to remain compatible with older readers.
Testing:
- added new tests to check both logical and converted types
to test_insert_parquet.py
Change-Id: I6f377950845683ab9c6dea79f4c54db0359d0b91
Reviewed-on: http://gerrit.cloudera.org:8080/12004
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Cardinality is vital to understanding why a plan has the form it does,
yet the planner normally emits cardinality information only for the
detailed levels. Unfortunately, most query profiles we see are at the
standard level without this information (except in the summary table),
making it hard to understand what happened.
This patch adds cardinality to the standard EXPLAIN output. It also
changes the displayed cardinality value to be in abbreviated "metric"
form: 1.23K instead of 1234, etc.
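For illustration, a sketch of such a "metric" formatter (the suffix
choices and rounding shown are illustrative):
  def abbreviate(n):
      # 1234 -> 1.23K, 123456780 -> 123.46M; small values unchanged.
      for divisor, suffix in ((10**9, "B"), (10**6, "M"), (10**3, "K")):
          if n >= divisor:
              return "%.2f%s" % (n / float(divisor), suffix)
      return str(n)

  print(abbreviate(1234))        # 1.23K
  print(abbreviate(123456780))   # 123.46M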
Changing the EXPLAIN output has a huge impact on PlannerTest: all the
"golden" test files must change. To avoid doing this twice, this patch
also includes:
IMPALA-7919: Add predicates line in plan output for partition key
predicates
This is also a good time to include:
IMPALA-8022: Add cardinality checks to PlannerTest
The comparison code was changed to allow a set of validators, one of
which compares cardinality to ensure it is within 5% of the expected
value. This should ensure we don't change estimates unintentionally.
While many planner tests are concerned with cardinality, many others
are not. Testing showed that cardinality is actually unstable in some
tests. For such tests, added filters to ignore cardinality. The filter
is enabled by default (for backward compatibility) but disabled (to
allow cardinality verification) for the critical tests.
Rebasing the tests was complicated by a bug in the error-matching code,
so this patch also fixes:
IMPALA-8023: Fix PlannerTest to handle error lines consistently
Now, the error output written to the output "save results" file matches
that expected in the "golden" file -- no more handling these specially.
Testing:
* Added cardinality verification.
* Reran all FE tests.
* Rebased all PlannerTest .test files.
* Adjusted the metadata/test_explain.py test to handle the changed
EXPLAIN output.
Change-Id: Ie9aa2d715b04cbb279aaffec8c5692686562d986
Reviewed-on: http://gerrit.cloudera.org:8080/12136
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
If explain_level is at 'extended' level or higher, then enhance the
output from the explain command. (1) Show the analyzed SQL in the
explain header; this is the rewritten SQL, which includes implicit
casts, and literals are printed with a cast so that their type is
visible. (2) When predicates are shown in the plan, they are shown in
the same format.
The toSql() method can be called on a ParseNode tree to return
the SQL corresponding to the tree. In the past toSql() has been
enhanced to print rewritten SQL by partially overloading toSql() [with
toSql(boolean)]. The current change would require changing toSql() in
many places, as NumericLiteral can appear at different points in a
parse tree. To avoid many new fragile overloads of toSql() I added
toSql(ToSqlOptions), where ToSqlOptions is an enum which controls the
form of the SQL that is returned. This changes many files but is safer
and means that any future options to toSql() can be added painlessly.
If SHOW_IMPLICIT_CASTS is passed to toSql() then
- in CastExpr print the implicit cast
- in NumericLiteral print the literal with a cast to show the type
Add a PlannerTestOption directive that will force the query text showing
implicit casts to be included in the PLAN section of a .test file.
The analyzed query text is wrapped at 80 characters. Note that the
analyzed query cannot always be executed, as queries rewritten to use
LEFT SEMI JOIN are not legal SQL. In addition, some space characters may
be removed from the query for prettier display.
Documentation of this change will be done as IMPALA-7718
EXAMPLE OUTPUT:
[localhost:21000] default> set explain_level=2;
EXPLAIN_LEVEL set to 2
[localhost:21000] default> explain select * from functional_kudu.alltypestiny where bigint_col < 1000 / 100;
Query: explain select * from functional_kudu.alltypestiny where bigint_col < 1000 / 100
Max Per-Host Resource Reservation: Memory=0B Threads=2
Per-Host Resource Estimates: Memory=10MB
Codegen disabled by planner
Analyzed query: SELECT * FROM functional_kudu.alltypestiny WHERE CAST(bigint_col
AS DOUBLE) < CAST(10 AS DOUBLE)
""
F00:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
| Per-Host Resources: mem-estimate=4.88MB mem-reservation=0B thread-reservation=2
PLAN-ROOT SINK
| mem-estimate=0B mem-reservation=0B thread-reservation=0
|
00:SCAN KUDU [functional_kudu.alltypestiny]
predicates: CAST(bigint_col AS DOUBLE) < CAST(10 AS DOUBLE)
mem-estimate=4.88MB mem-reservation=0B thread-reservation=1
tuple-ids=0 row-size=97B cardinality=1
in pipelines: 00(GETNEXT)
Fetched 16 row(s) in 0.03s
TESTING:
All end-to-end tests pass.
Added a new test in ExprRewriterTest which prints SQL with implicit
casts for some interesting queries.
Add a unit test for the code which wraps text at 60 characters.
The output of some planner tests in .test files has been updated to
include the analyzed SQL that is printed when explain_level is at
least the 'extended' level.
Change-Id: I55c3bdacc295137f66b2316a912fc347da30d6b0
Reviewed-on: http://gerrit.cloudera.org:8080/11719
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Thomas Marshall <thomasmarshall@cmu.edu>
The .test file parser implemented an unconventional method for parsing
single-quoted strings in comma-separated value format, which didn't
handle trailing commas in strings correctly.
This commit switches to using a conventional method for parsing
comma-separated value format:
* Commas enclosed by single quotes are not treated as field separators
* Single quotes can be escaped within a string by doubling them.
I looked into using Python's csv module for this, but it wouldn't
work without further changes to the test file format, because it
automatically discards the quotes during parsing, and the quotes are
actually semantically important in .test files. E.g. without the quotes
we can't distinguish between the literal string 'regex:...' and the
regex regex:....
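A minimal sketch of the conventional rules described above (Python, but
not the actual test-framework parser):
  def split_csv_line(line):
      # Commas are separators only outside single quotes; a doubled ''
      # inside quotes is an escaped quote. Quotes are kept in the output
      # because they are semantically meaningful in .test files.
      fields, buf, in_quotes, i = [], "", False, 0
      while i < len(line):
          c = line[i]
          if c == "," and not in_quotes:
              fields.append(buf)
              buf = ""
          elif c == "'" and in_quotes and line[i + 1:i + 2] == "'":
              buf += "''"          # escaped quote stays inside the field
              i += 1
          else:
              if c == "'":
                  in_quotes = not in_quotes
              buf += c
          i += 1
      fields.append(buf)
      return fields

  print(split_csv_line("'a,b',c"))    # ["'a,b'", 'c']
  print(split_csv_line("'it''s',"))   # ["'it''s'", ''] - trailing comma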
Testing:
Ran exhaustive tests and fixed .test files that required modifications.
Will rerun before merging.
Added a couple of tests to exercise edge cases in the test file parser.
Change-Id: I18ddcb0440490ddf8184be66d3681038a1615dd9
Reviewed-on: http://gerrit.cloudera.org:8080/11800
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
This commit adds the command line flag enable_parquet_page_index_writing
to the Impala daemon, which toggles Impala's ability to write the
Parquet page index. By default the flag is false, i.e. Impala doesn't
write the page index.
This flag is only temporary; we plan to remove it once Impala is able to
read the page index and has better testing around it.
Because of this change I had to move test_parquet_page_index.py to the
custom_cluster test suite since I need to set this command line flag
in order to test the functionality. I also merged most of the test cases
because we don't want to restart the cluster too many times.
I removed 'num_data_pages_' from BaseColumnWriter since it was rather
confusing and didn't provide any measurable performance improvement.
This commit fixes the ASAN error produced by the first IMPALA-7644
commit which was reverted later.
Change-Id: Ib4a9098a2085a385351477c715ae245d83bf1c72
Reviewed-on: http://gerrit.cloudera.org:8080/11694
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit adds the command line flag enable_parquet_page_index_writing
to the Impala daemon, which toggles Impala's ability to write the
Parquet page index. By default the flag is false, i.e. Impala doesn't
write the page index.
This flag is only temporary; we plan to remove it once Impala is able to
read the page index and has better testing around it.
Because of this change I had to move test_parquet_page_index.py to the
custom_cluster test suite since I need to set this command line flag
in order to test the functionality. I also merged most of the test cases
because we don't want to restart the cluster too many times.
Change-Id: If9994882aa59cbaf3ae464100caa8211598287bc
Reviewed-on: http://gerrit.cloudera.org:8080/11563
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This adds some informational output to explain plans and
sends the information to the backend.
The idea is that this will make it easier to explain how Impala's
pipelined execution works and also enable future work on profile
analysis that can more intelligently group plan nodes.
Tests:
* Updated planner tests to include new output.
Change-Id: I1d10eb14d997242f445e5c5fc5362d5410370721
Reviewed-on: http://gerrit.cloudera.org:8080/10848
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Adds the resource estimates for key benchmark workloads:
TPC-H, TPC-DS, TPC-H Nested and TPC-H Kudu to the planner
test so that we can track changes in resource requirements
and estimates for these queries.
Also, don't show decimal places for MB and KB estimates. The
estimates are not accurate to that level, and displaying
extra precision has some disadvantages:
* It suggests to readers that the estimates have a high level of
precision.
* It increases the odds of small variations in file sizes, etc.,
causing test failures.
Also fixed a regex in the stress test that didn't escape the decimal
point correctly.
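The regex problem, for illustration (the pattern shown is illustrative,
not the exact stress-test regex):
  import re

  # An unescaped "." matches any character, so the first pattern wrongly
  # accepts "12x34MB"; escaping the dot requires a literal decimal point.
  bad = re.compile(r"\d+.\d+MB")
  good = re.compile(r"\d+\.\d+MB")
  print(bool(bad.match("12x34MB")), bool(good.match("12x34MB")))  # True False
  print(bool(good.match("12.34MB")))                              # True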
Testing:
Ran core tests.
Change-Id: I6a9f836699200ea87fb03bf36abad0e23949ac26
Reviewed-on: http://gerrit.cloudera.org:8080/11087
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit builds on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/
The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.
I updated Pooja's python tests as well.
Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Reviewed-on: http://gerrit.cloudera.org:8080/9693
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This squashes the following patches, which were previously reverted.
I will fix the known issues with some follow-on patches.
======================================================================
IMPALA-4835: Part 1: simplify I/O mgr mem mgmt and cancellation
In preparation for switching the I/O mgr to the buffer pool, this
removes and cleans up a lot of code so that the switchover patch starts
from a cleaner slate.
* Remove the free buffer cache (which will be replaced by buffer pool's
own caching).
* Make memory limit exceeded error checking synchronous (in anticipation
of having to propagate buffer pool errors synchronously).
* Simplify error propagation - remove the (ineffectual) code that
enqueued BufferDescriptors containing error statuses.
* Document locking scheme better in a few places, make it part of the
function signature when it seemed reasonable.
* Move ReturnBuffer() to ScanRange, because it is intrinsically
connected with the lifecycle of a scan range.
* Separate external ReturnBuffer() and internal CleanUpBuffer()
interfaces - previously callers of ReturnBuffer() were fudging
the num_buffers_in_reader accounting to make the external interface work.
* Eliminate redundant state in ScanRange: 'eosr_returned_' and
'is_cancelled_'.
* Clarify the logic around calling Close() for the last
BufferDescriptor.
-> There appeared to be an implicit assumption that buffers would be
freed in the order they were returned from the scan range, so that
the "eos" buffer was returned last. Instead just count the number
of outstanding buffers to detect the last one.
-> Touching the is_cancelled_ field without holding a lock was hard to
reason about - it violated locking rules and it was unclear whether
it was race-free.
* Remove DiskIoMgr::Read() to simplify the interface. It is trivial to
inline at the callsites.
This will probably regress performance somewhat because of the cache
removal, so my plan is to merge it around the same time as switching
the I/O mgr to allocate from the buffer pool. I'm keeping the patches
separate to make reviewing easier.
Testing:
* Ran exhaustive tests
* Ran the disk-io-mgr-stress-test overnight
======================================================================
IMPALA-4835: Part 2: Allocate scan range buffers upfront
This change is a step towards reserving memory for buffers from the
buffer pool and constraining per-scanner memory requirements. This
change restructures the DiskIoMgr code so that each ScanRange operates
with a fixed set of buffers that are allocated upfront and recycled as
the I/O mgr works through the ScanRange.
One major change is that ScanRanges get blocked when a buffer is not
available and get unblocked when a client returns a buffer via
ReturnBuffer(). I was able to remove the logic to maintain the
blocked_ranges_ list by instead adding a separate set with all ranges
that are active.
There is also some miscellaneous cleanup included - e.g. reducing the
amount of code devoted to maintaining counters and metrics.
One tricky part of the existing code was that it called
IssueInitialRanges() with empty lists of files and depended on
DiskIoMgr::AddScanRanges() not checking for cancellation in that case.
See IMPALA-6564/IMPALA-6588. I changed the logic to not try to issue
ranges for empty lists of files.
I plan to merge this along with the actual buffer pool switch, but
separated it out to allow review of the DiskIoMgr changes separate from
other aspects of the buffer pool switchover.
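A toy model of the fixed-buffer scheme (illustrative names, not the
DiskIoMgr API):
  from collections import deque

  class ScanRangeBuffers(object):
      """Each ScanRange works through a fixed set of buffers allocated up
      front. When none are free the range is blocked; returning a buffer
      unblocks it."""

      def __init__(self, num_buffers, buffer_size=8192):
          self.free = deque(bytearray(buffer_size)
                            for _ in range(num_buffers))

      def get_buffer(self):
          # None means the range is blocked until a buffer is returned.
          return self.free.popleft() if self.free else None

      def return_buffer(self, buf):
          self.free.append(buf)       # recycled, unblocking the range

  r = ScanRangeBuffers(1)
  b = r.get_buffer()
  print(r.get_buffer() is None)   # True: blocked
  r.return_buffer(b)
  print(r.get_buffer() is None)   # False: unblocked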
Testing:
* Ran core and exhaustive tests.
======================================================================
IMPALA-4835: Part 3: switch I/O buffers to buffer pool
This is the final patch to switch the Disk I/O manager to allocate all
buffers from the buffer pool and to reserve the buffers required for
a query upfront.
* The planner reserves enough memory to run a single scanner per
scan node.
* The multi-threaded scan node must increase reservation before
spinning up more threads.
* The scanner implementations must be careful to stay within their
assigned reservation.
The row-oriented scanners were the most straightforward, since they only
have a single scan range active at a time. A single I/O buffer is
sufficient to scan the whole file but more I/O buffers can improve I/O
throughput.
Parquet is more complex because it issues a scan range per column and
the sizes of the columns on disk are not known during planning. To
deal with this, the reservation in the frontend is based on a
heuristic involving the file size and # columns. The Parquet scanner
can then divvy up reservation to columns based on the size of column
data on disk.
I adjusted how the 'mem_limit' is divided between buffer pool and non
buffer pool memory for low mem_limits to account for the increase in
buffer pool memory.
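A sketch of the per-column reservation split under that heuristic
(simplified; the scanner's actual algorithm is more involved):
  def divide_reservation(total_reservation, col_bytes_on_disk, min_buffer):
      # Each column gets a share proportional to its on-disk size, but
      # never less than one minimum-sized buffer.
      total = sum(col_bytes_on_disk)
      return [max(min_buffer, total_reservation * size // total)
              for size in col_bytes_on_disk]

  # A wide column gets most of the reservation; a tiny one gets the floor.
  print(divide_reservation(32 << 20, [100 << 20, 8 << 10], 8192))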
Testing:
* Added more planner tests to cover reservation calcs for scan node.
* Test scanners for all file formats with the reservation denial debug
action, to test behaviour when the scanners hit reservation limits.
* Updated memory and buffer pool limits for tests.
* Added unit tests for dividing reservation between columns in parquet,
since the algorithm is non-trivial.
Perf:
I ran TPC-H and targeted perf locally comparing with master. Both
showed small improvements of a few percent and no regressions of
note. Cluster perf tests showed no significant change.
Change-Id: I3ef471dc0746f0ab93b572c34024fc7343161f00
Reviewed-on: http://gerrit.cloudera.org:8080/9679
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
Support row_regex and other special lines in the subset and superset
verifiers, which previously assumed that lines in the actual and
expected output had to match exactly.
Use this in test_stats_extrapolation to make the test more robust to
irrelevant changes in the explain plan.
Testing:
Manually modified a superset and a subset test to check that tests fail
as expected.
Change-Id: Ia7a28d421c8e7cd84b14d07fcb71b76449156409
Reviewed-on: http://gerrit.cloudera.org:8080/10155
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Revert "IMPALA-6585: increase test_low_mem_limit_q21 limit"
This reverts commit 25bcb258df.
Revert "IMPALA-6588: don't add empty list of ranges in text scan"
This reverts commit d57fbec6f6.
Revert "IMPALA-4835: Part 3: switch I/O buffers to buffer pool"
This reverts commit 24b4ed0b29.
Revert "IMPALA-4835: Part 2: Allocate scan range buffers upfront"
This reverts commit 5699b59d0c.
Revert "IMPALA-4835: Part 1: simplify I/O mgr mem mgmt and cancellation"
This reverts commit 65680dc421.
Change-Id: Ie5ca451cd96602886b0a8ecaa846957df0269cbb
Reviewed-on: http://gerrit.cloudera.org:8080/9480
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
This is the final patch to switch the Disk I/O manager to allocate all
buffers from the buffer pool and to reserve the buffers required for
a query upfront.
* The planner reserves enough memory to run a single scanner per
scan node.
* The multi-threaded scan node must increase reservation before
spinning up more threads.
* The scanner implementations must be careful to stay within their
assigned reservation.
The row-oriented scanners were the most straightforward, since they only
have a single scan range active at a time. A single I/O buffer is
sufficient to scan the whole file but more I/O buffers can improve I/O
throughput.
Parquet is more complex because it issues a scan range per column and
the sizes of the columns on disk are not known during planning. To
deal with this, the reservation in the frontend is based on a
heuristic involving the file size and # columns. The Parquet scanner
can then divvy up reservation to columns based on the size of column
data on disk.
I adjusted how the 'mem_limit' is divided between buffer pool and non
buffer pool memory for low mem_limits to account for the increase in
buffer pool memory.
Testing:
* Added more planner tests to cover reservation calcs for scan node.
* Test scanners for all file formats with the reservation denial debug
action, to test behaviour when the scanners hit reservation limits.
* Updated memory and buffer pool limits for tests.
* Added unit tests for dividing reservation between columns in parquet,
since the algorithm is non-trivial.
Perf:
I ran TPC-H and targeted perf locally comparing with master. Both
showed small improvements of a few percent and no regressions of
note. Cluster perf tests showed no significant change.
Change-Id: Ic09c6196b31e55b301df45cc56d0b72cfece6786
Reviewed-on: http://gerrit.cloudera.org:8080/8966
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Introduces a new TBLPROPERTY for controlling stats
extrapolation on a per-table basis:
impala.enable.stats.extrapolation=true/false
The property key was chosen to be consistent with
the impalad startup flag --enable_stats_extrapolation
and to indicate that the property was set and is used
by Impala.
Behavior:
- If the property is not set, then the extrapolation
behavior is determined by the impalad startup flag.
- If the property is set, it overrides the impalad
startup flag, i.e., extrapolation can be explicitly
enabled or disabled regardless of the startup flag.
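The resulting precedence amounts to the following sketch (the helper
name is hypothetical):
  def extrapolation_enabled(tbl_properties, startup_flag):
      # The table property, if set, overrides the impalad startup flag.
      value = tbl_properties.get("impala.enable.stats.extrapolation")
      if value is None:
          return startup_flag
      return value.lower() == "true"

  print(extrapolation_enabled({}, False))                      # False
  print(extrapolation_enabled(
      {"impala.enable.stats.extrapolation": "true"}, False))   # True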
Testing:
- added new unit tests
- core/hdfs run passed
Change-Id: Ie49597bf1b93b7572106abc620d91f199cba0cfd
Reviewed-on: http://gerrit.cloudera.org:8080/9139
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
This change enables clustering by default. IMPALA-2521 introduced the
'clustered' hint which inserts a local sort by the partitioning columns
to a query plan. The hint is only effective for HDFS and Kudu tables.
Like before, the 'noclustered' hint prevents clustering. If a table has
ordering columns defined, the 'noclustered' hint is ignored and we
issue a warning.
This change removes some tests that were added specifically to test
that clustering can be enabled using the 'clustered' hint. It changes
some tests to use the 'noclustered' hint to make sure that clustering
can be disabled. It also adds tests to make sure that we cover the
'noclustered' case properly.
Cherry-picks: not for 2.x.
Change-Id: Idbf2368cf4415e6ecfa65058daf6ff87ef62f9d9
Reviewed-on: http://gerrit.cloudera.org:8080/9153
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
Adds the TABLESAMPLE clause for COMPUTE STATS.
Syntax:
COMPUTE STATS <table> TABLESAMPLE SYSTEM(<number>) [REPEATABLE(<number>)]
Computes and replaces the table-level row count and total file size,
as well as all table-level column statistics. Existing partition-level
row counts are not modified.
The TABLESAMPLE clause can be used to limit the scanned data volume to
a desired percentage. When sampling, the unmodified results of the
COMPUTE STATS queries are sent to the CatalogServer. There, the stats
are extrapolated before storing them into the HMS so as not to confuse
other engines like Hive/SparkSQL which may rely on the shared HMS
fields being accurate.
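The extrapolation step amounts to roughly the following (a simplified
sketch with illustrative names):
  def extrapolate_row_count(sampled_rows, sampled_bytes, total_bytes):
      # Scale the row count observed on the sample by the ratio of total
      # to sampled bytes before the stats are stored in the HMS.
      if sampled_bytes <= 0:
          return -1
      return int(round(sampled_rows * (float(total_bytes) / sampled_bytes)))

  # Sampling ~10% of the bytes and counting 730 rows suggests ~7300 rows.
  print(extrapolate_row_count(730, 47845, 478450))   # 7300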
Limitations
- Only works for HDFS tables
- TABLESAMPLE is not supported for COMPUTE INCREMENTAL STATS
- TABLESAMPLE requires --enable_stats_extrapolation=true
Changes to EXPLAIN
The stored statistics from the HMS are more clearly displayed under
a 'stored statistics' section. Example:
00:SCAN HDFS [functional.alltypes, RANDOM]
partitions=24/24 files=24 size=478.45KB
stored statistics:
table: rows=7300 size=478.45KB
partitions: 24/24 rows=7300
columns: all
Testing:
- added new functional tests
- core/hdfs run passed
Change-Id: I7f3e72471ac563adada4a4156033a85852b7c8b7
Reviewed-on: http://gerrit.cloudera.org:8080/8136
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Before this change, the per-host reservation size was computed
by the Planner. However, scheduling happens after planning,
so the Planner must assume that all fragments run on all
hosts, and the reservation size is likely much larger than
it needs to be.
This moves the computation of the per-host reservation size
to the BE where it can be computed more precisely. This also
includes a number of plan/profile changes.
Change-Id: Idbcd1e9b1be14edc4017b4907e83f9d56059fbac
Reviewed-on: http://gerrit.cloudera.org:8080/7630
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
This moves away from the PipelinedPlanNodeSet approach of enumerating
sets of concurrently-executing nodes because unions would force
creating many overlapping sets of nodes. The new approach computes
the peak resources during Open() and the peak resources between Open()
and Close() (i.e. while calling GetNext()) bottom-up for each plan node
in a fragment. The fragment resources are then combined to produce the
query resources.
The basic assumptions for the new resource estimates are:
* resources are acquired during or after the first call to Open()
and released in Close().
* Blocking nodes call Open() on their child before acquiring
their own resources (this required some backend changes).
* Blocking nodes call Close() on their children before returning
from Open().
* The peak resource consumption of the query is the sum of the
independent fragments (except for the parallel join build plans
where we can assume there will be synchronisation). This is
conservative but we don't synchronise fragment Open() and Close()
across exchanges so can't make stronger assumptions in general.
Also compute the sum of minimum reservations. This will be useful
in the backend to determine exactly when all of the initial
reservations have been claimed from a shared pool of initial reservations.
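A toy model of the bottom-up combination under these assumptions (not
the planner's code):
  def peaks(node):
      # Returns (peak during Open(), peak while calling GetNext()).
      # Children are open while this node opens; a blocking node closes
      # its children before returning from Open(), so they drop out of
      # the GetNext() figure.
      child = [peaks(c) for c in node["children"]]
      during_open = node["mem"] + sum(g for _, g in child)
      during_get_next = node["mem"] + (
          0 if node["blocking"] else sum(g for _, g in child))
      return during_open, during_get_next

  scan = {"mem": 16, "blocking": False, "children": []}
  agg = {"mem": 128, "blocking": True, "children": [scan]}
  print(peaks(agg))   # (144, 128): the scan closes once agg's Open() returns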
Testing:
* Updated planner tests to reflect behavioural changes.
* Added extra resource requirement planner tests for unions, subplans,
pipelines of blocking operators, and bushy join plans.
* Added single-node plans to resource-requirements tests. These have
more complex plan trees inside a single fragment, which is useful
for testing the peak resource requirement logic.
Change-Id: I492cf5052bb27e4e335395e2a8f8a3b07248ec9d
Reviewed-on: http://gerrit.cloudera.org:8080/7223
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This is similar to the single-node execution optimisation, but applies
to slightly larger queries that should run in a distributed manner but
won't benefit from codegen.
This adds a new query option disable_codegen_rows_threshold that
defaults to 50,000. If fewer than this number of rows are processed
by a plan node per impalad, the cost of codegen almost certainly
outweighs the benefit.
Using rows processed as a threshold is justified by a simple
model that assumes the cost of codegen and execution per row for
the same operation are proportional. E.g. if x is the complexity
of the operation, n is the number of rows processed, C is a
constant factor giving the cost of codegen and Ec/Ei are constant
factor giving the cost of codegen'd and interpreted execution and
d, then the cost of the codegen'd operator is C * x + Ec * x * n
and the cost of the interpreted operator is Ei * x * n. Rearranging
means that interpretation is cheaper if n < C / (Ei - Ec), i.e. that
(at least with the simplified model) it makes sense to choose
interpretation or codegen based on a constant threshold. The
model also implies that it is somewhat safer to choose codegen
because the additional cost of codegen is O(1) but the additional
cost of interpretation is O(n).
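Plugging illustrative constants into the model shows the constant
crossover:
  def interpretation_cheaper(n, C, Ec, Ei):
      # Interpreted cost Ei*x*n beats codegen'd cost C*x + Ec*x*n exactly
      # when n < C / (Ei - Ec); the complexity factor x cancels out.
      return n < C / (Ei - Ec)

  C, Ec, Ei = 50000.0, 1.0, 2.0          # crossover at n = 50,000 rows
  print(interpretation_cheaper(40000, C, Ec, Ei))   # True  -> skip codegen
  print(interpretation_cheaper(60000, C, Ec, Ei))   # False -> codegen wins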
I ran some experiments with TPC-H Q1, varying the input table size, to
determine what the cut-over point where codegen was beneficial was.
The cutover was around 150k rows per node for both text and parquet.
At 50k rows per node disabling codegen was very beneficial - around
0.12s versus 0.24s. To be somewhat conservative I set the default
threshold to 50k rows. On more complex queries, e.g. TPC-H Q10, the
cutover tends to be higher because there are plan nodes that process
many fewer than the max rows.
Fix a couple of minor issues in the frontend: the numNodes_
calculation could return 0 for Kudu, and the single node optimization
didn't correctly handle the case of a scan node with conjuncts, a limit,
and missing stats (it considered the estimate still valid).
Testing:
Updated e2e tests that set disable_codegen to set
disable_codegen_rows_threshold to 0, so that those tests run both
with and without codegen still.
Added an e2e test to make sure that the optimisation is applied in
the backend.
Added planner tests for various cases where codegen should and shouldn't
be disabled.
Perf:
Added a targeted perf test for a join+agg over a small input, which
benefits from this change.
Change-Id: I273bcee58641f5b97de52c0b2caab043c914b32e
Reviewed-on: http://gerrit.cloudera.org:8080/7153
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
The main idea of this patch is to use table stats to
extrapolate the row counts for new/modified partitions.
Existing behavior:
- Partitions that lack the row count stat are ignored
when estimating the cardinality of HDFS scans. Such
partitions effectively have an estimated row count
of zero.
- We always use the row count stats for partitions that
have one. The row count may be innaccurate if data in
such partitions has changed significantly.
Summary of changes:
- Enhance COMPUTE STATS to also store the total number
of file bytes in the table.
- Use the table-level row count and file bytes stats
to estimate the number of rows in a scan (see the sketch
after this list).
- A new impalad startup flag is added to enable/disable
the extrapolation behavior. The feature is disabled by
default. Note that even with the feature disabled,
COMPUTE STATS stores the file bytes so you can enable
the feature without having to run COMPUTE STATS again.
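As referenced in the summary above, a simplified sketch of the
extrapolation (illustrative names, not the Java planner code):
  def extrapolate_scan_rows(stats_rows, stats_bytes, scan_bytes):
      # Assume rows-per-byte is stable: scale the row count recorded by
      # COMPUTE STATS by the ratio of the bytes being scanned to the
      # bytes present when stats were computed.
      if stats_rows < 0 or stats_bytes <= 0:
          return -1                     # stats missing; cannot extrapolate
      return int(round(stats_rows * (float(scan_bytes) / stats_bytes)))

  # Stats recorded 1000 rows over 1 MB; the table has grown to 1.5 MB.
  print(extrapolate_scan_rows(1000, 1 << 20, 3 << 19))   # 1500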
Testing:
- Added new FE unit test
- Added new EE test
Change-Id: I972c8a03ed70211734631a7dc9085cb33622ebc4
Reviewed-on: http://gerrit.cloudera.org:8080/6840
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins