Adds the TABLESAMPLE clause for COMPUTE STATS.
Syntax:
COMPUTE STATS <table> TABLESAMPLE SYSTEM(<number>) [REPEATABLE(<number>)]
Computes and replaces the table-level row count and total file size,
as well as all table-level column statistics. Existing partition-level
row counts are not modified.
The TABLESAMPLE clause can be used to limit the scanned data volume to
a desired percentage. When sampling, the unmodified results of the
COMPUTE STATS queries are sent to the CatalogServer. There, the stats
are extrapolated before storing them into the HMS so as not to confuse
other engines like Hive/SparkSQL which may rely on the shared HMS
fields being accurate.
Limitations
- Only works for HDFS tables
- TABLESAMPLE is not supported for COMPUTE INCREMENTAL STATS
- TABLESAMPLE requires --enable_stats_extrapolation=true
Changes to EXPLAIN
The stored statistics from the HMS are more clearly displayed under
a 'stored statistics' section. Example:
00:SCAN HDFS [functional.alltypes, RANDOM]
partitions=24/24 files=24 size=478.45KB
stored statistics:
table: rows=7300 size=478.45KB
partitions: 24/24 rows=7300
columns: all
Testing:
- added new functional tests
- core/hdfs run passed
Change-Id: I7f3e72471ac563adada4a4156033a85852b7c8b7
Reviewed-on: http://gerrit.cloudera.org:8080/8136
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Impala incorrectly returned NULLs in the "Max Size" column of the SHOW
COLUMN STATS result when executed through the HS2 interface. The issue
was that the column was specified to be type INT in the result schema,
but the actual type of the contents that we inserted into it was
"long". The reason why this is not an issue in Impala shell is because
we stringify the contents without inspecting the metadata for beeswax
results.
The issue was fixed by changing the type from INT to BIGINT.
Change-Id: I419657744635dfdc2e1562fe60a597617fff446e
Reviewed-on: http://gerrit.cloudera.org:8080/6109
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
- Moves the test into compute_stats.py
- Changes some test classes in compute_stats.py to inherit from
ImpalaTestSuite and not from TestComputeStats because that
will cause all tests in TestComputeStats to be run in the
subclasses again (redundantly).
- Clean up and add more coverage to testing incremental stats on
HBase which was probably broken in this commit 6b32ff06.
- Fixes a side effect that the original test had for testing
incremental stats on HBase. It computes stats on a functional
table which was not supposed to have stats.
Testing: Ran compute_stats.py on exhaustive locally in a loop 10 times.
Did a private hdfs/core run.
Change-Id: Iee8b84e30948c3c98166e08cae2666574777730c
Reviewed-on: http://gerrit.cloudera.org:8080/3074
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Test results can be verified using regular expressions. The extraction
of the regular expression substring from the expected test results had a
bug where only the first character of an expression was considered. This
lead to wrong but undetected test results.
Change-Id: Ia670da6e0758455a86dc44744b96b9465d890af3
Reviewed-on: http://gerrit.cloudera.org:8080/1818
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Internal Jenkins
Based on Google's HyperLogLog++ paper. Uses a bias correcting
interpolation as a sub algorithm for Hll estimates within a specific
range.
Change-Id: If4fe692b4308f6a57aea6167e9bc00db11eaaab9
Reviewed-on: http://gerrit.cloudera.org:8080/415
Tested-by: Internal Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
Additionally, this patch also disabled the hbase/none test dimension if the
TARGET_FILESYSTEM environment variable is set to either s3 of isilon.
Change-Id: I63aecaa478d2ba9eb68de729e9640071359a2eeb
Reviewed-on: http://gerrit.cloudera.org:8080/74
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins