Commit Graph

6 Commits

Author SHA1 Message Date
Alex Behm
b3d8a507cb IMPALA-5310: Add COMPUTE STATS TABLESAMPLE.
Adds the TABLESAMPLE clause for COMPUTE STATS.

Syntax:
COMPUTE STATS <table> TABLESAMPLE SYSTEM(<number>) [REPEATABLE(<number>)]

Computes and replaces the table-level row count and total file size,
as well as all table-level column statistics. Existing partition-level
row counts are not modified.
The TABLESAMPLE clause can be used to limit the scanned data volume to
a desired percentage. When sampling, the unmodified results of the
COMPUTE STATS queries are sent to the CatalogServer. There, the stats
are extrapolated before storing them into the HMS so as not to confuse
other engines like Hive/SparkSQL which may rely on the shared HMS
fields being accurate.

Limitations
- Only works for HDFS tables
- TABLESAMPLE is not supported for COMPUTE INCREMENTAL STATS
- TABLESAMPLE requires --enable_stats_extrapolation=true

Changes to EXPLAIN
The stored statistics from the HMS are more clearly displayed under
a 'stored statistics' section. Example:

00:SCAN HDFS [functional.alltypes, RANDOM]
   partitions=24/24 files=24 size=478.45KB
   stored statistics:
     table: rows=7300 size=478.45KB
     partitions: 24/24 rows=7300
     columns: all

Testing:
- added new functional tests
- core/hdfs run passed

Change-Id: I7f3e72471ac563adada4a4156033a85852b7c8b7
Reviewed-on: http://gerrit.cloudera.org:8080/8136
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-29 22:37:01 +00:00
Taras Bobrovytsky
d07580c171 IMPALA-4962: Fix SHOW COLUMN STATS for HS2
Impala incorrectly returned NULLs in the "Max Size" column of the SHOW
COLUMN STATS result when executed through the HS2 interface. The issue
was that the column was specified to be type INT in the result schema,
but the actual type of the contents that we inserted into it was
"long". The reason why this is not an issue in Impala shell is because
we stringify the contents without inspecting the metadata for beeswax
results.

The issue was fixed by changing the type from INT to BIGINT.

Change-Id: I419657744635dfdc2e1562fe60a597617fff446e
Reviewed-on: http://gerrit.cloudera.org:8080/6109
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-02-22 23:10:34 +00:00
Alex Behm
ea45de84f4 IMPALA-3491: Merge test_hbase_metadata.py into compute_stats.py. Use unique db fixture.
- Moves the test into compute_stats.py
- Changes some test classes in compute_stats.py to inherit from
  ImpalaTestSuite and not from TestComputeStats because that
  will cause all tests in TestComputeStats to be run in the
  subclasses again (redundantly).
- Clean up and add more coverage to testing incremental stats on
  HBase which was probably broken in this commit 6b32ff06.
- Fixes a side effect that the original test had for testing
  incremental stats on HBase. It computes stats on a functional
  table which was not supposed to have stats.

Testing: Ran compute_stats.py on exhaustive locally in a loop 10 times.
Did a private hdfs/core run.

Change-Id: Iee8b84e30948c3c98166e08cae2666574777730c
Reviewed-on: http://gerrit.cloudera.org:8080/3074
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-05-23 08:40:19 -07:00
Lars Volker
4fc7f15376 IMPALA-2862: Fix regex parsing in test result verifier
Test results can be verified using regular expressions. The extraction
of the regular expression substring from the expected test results had a
bug where only the first character of an expression was considered. This
lead to wrong but undetected test results.

Change-Id: Ia670da6e0758455a86dc44744b96b9465d890af3
Reviewed-on: http://gerrit.cloudera.org:8080/1818
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Internal Jenkins
2016-02-02 21:55:57 +00:00
Shant Hovsepian
6d87fe090c Improve Hll estimate for small cardinalities.
Based on Google's HyperLogLog++ paper. Uses a bias correcting
interpolation as a sub algorithm for Hll estimates within a specific
range.

Change-Id: If4fe692b4308f6a57aea6167e9bc00db11eaaab9
Reviewed-on: http://gerrit.cloudera.org:8080/415
Tested-by: Internal Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
2015-07-16 19:38:17 +00:00
ishaan
8369c3b13b Remove explicit references to functional_hbase tables from .test files.
Additionally, this patch also disabled the hbase/none test dimension if the
TARGET_FILESYSTEM environment variable is set to either s3 of isilon.

Change-Id: I63aecaa478d2ba9eb68de729e9640071359a2eeb
Reviewed-on: http://gerrit.cloudera.org:8080/74
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-02-23 23:32:41 +00:00