16 Commits

Author SHA1 Message Date
Alex Behm
b3d8a507cb IMPALA-5310: Add COMPUTE STATS TABLESAMPLE.
Adds the TABLESAMPLE clause for COMPUTE STATS.

Syntax:
COMPUTE STATS <table> TABLESAMPLE SYSTEM(<number>) [REPEATABLE(<number>)]

Computes and replaces the table-level row count and total file size,
as well as all table-level column statistics. Existing partition-level
row counts are not modified.
The TABLESAMPLE clause can be used to limit the scanned data volume to
a desired percentage. When sampling, the unmodified results of the
COMPUTE STATS queries are sent to the CatalogServer. There, the stats
are extrapolated before storing them into the HMS so as not to confuse
other engines like Hive/SparkSQL which may rely on the shared HMS
fields being accurate.
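
As a rough illustration of the extrapolation step described above, the sketch below scales a sampled row count up to a table-level estimate using the ratio of total to sampled file bytes. The function name and the proportional-scaling rule are assumptions made for illustration; this message does not spell out the formula the CatalogServer applies.

```python
def extrapolate_row_count(sampled_rows, sampled_bytes, total_bytes):
    """Scale a sampled row count to the whole table.

    Hypothetical rule: assume rows are spread evenly over the file bytes,
    so the estimate grows in proportion to total_bytes / sampled_bytes.
    """
    if sampled_bytes <= 0:
        # Nothing was scanned, so there is nothing to extrapolate from.
        return 0
    return round(sampled_rows * total_bytes / sampled_bytes)


if __name__ == "__main__":
    # A 10% sample that saw 730 rows.
    print(extrapolate_row_count(730, 100, 1000))
```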

Limitations
- Only works for HDFS tables
- TABLESAMPLE is not supported for COMPUTE INCREMENTAL STATS
- TABLESAMPLE requires --enable_stats_extrapolation=true

Changes to EXPLAIN
The stored statistics from the HMS are more clearly displayed under
a 'stored statistics' section. Example:

00:SCAN HDFS [functional.alltypes, RANDOM]
   partitions=24/24 files=24 size=478.45KB
   stored statistics:
     table: rows=7300 size=478.45KB
     partitions: 24/24 rows=7300
     columns: all

Testing:
- added new functional tests
- core/hdfs run passed

Change-Id: I7f3e72471ac563adada4a4156033a85852b7c8b7
Reviewed-on: http://gerrit.cloudera.org:8080/8136
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-29 22:37:01 +00:00
Bharath Vissapragada
2c0fc30628 IMPALA-5615: Fix compute incremental stats for general partition exprs
The fix for IMPALA-1654 has broken the compute incremental stats child
query generation logic for general partition expressions. This commit
fixes it and also adds new queries to fix the test gap. These tests
fail consistently without the patch.

Change-Id: I227fc06f580eb9174f60ad0f515a3641cec19268
Reviewed-on: http://gerrit.cloudera.org:8080/7379
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Impala Public Jenkins
2017-07-25 03:31:19 +00:00
Taras Bobrovytsky
d07580c171 IMPALA-4962: Fix SHOW COLUMN STATS for HS2
Impala incorrectly returned NULLs in the "Max Size" column of the SHOW
COLUMN STATS result when executed through the HS2 interface. The issue
was that the column was specified to be type INT in the result schema,
but the actual type of the contents that we inserted into it was
"long". This is not an issue in the Impala shell because we stringify
the contents without inspecting the metadata for beeswax results.

The issue was fixed by changing the type from INT to BIGINT.

Change-Id: I419657744635dfdc2e1562fe60a597617fff446e
Reviewed-on: http://gerrit.cloudera.org:8080/6109
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-02-22 23:10:34 +00:00
Alex Behm
d845413ab8 IMPALA-4854: Fix incremental stats with complex types.
The bug: Compute incremental stats used to always do a
full stats recomputation for tables with complex types.
The logic for detecting schema changes (e.g. an added
column) did not take into consideration that columns
with complex types are ignored in the stats computation,
and should therefore not be recognized as a new column
that does not yet have stats.
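
The corrected check can be pictured roughly as follows. This is a minimal sketch, not the frontend code: the function name, the (name, type) pair representation, and the set of complex type names are assumptions chosen for illustration.

```python
# Assumed set of complex type names that stats computation skips.
COMPLEX_TYPES = {"ARRAY", "MAP", "STRUCT"}


def columns_missing_stats(table_columns, columns_with_stats):
    """Return names of columns that need stats but do not have them yet.

    table_columns: list of (name, type_name) pairs (hypothetical shape).
    columns_with_stats: set of column names that already have stats.
    Complex-typed columns are skipped, so they never look like a newly
    added column that would force a full recomputation.
    """
    missing = []
    for name, type_name in table_columns:
        base_type = type_name.upper().split("<")[0]
        if base_type in COMPLEX_TYPES:
            continue
        if name not in columns_with_stats:
            missing.append(name)
    return missing
```

With this check, a table whose only stats-less column is an ARRAY column reports no missing columns.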

Testing:
- Added a new regression test
- Locally ran test_compute_stats.py and the FE tests

Change-Id: I6e0335048d688ee25ff55c6628d0f6f8ecc1dd8a
Reviewed-on: http://gerrit.cloudera.org:8080/6033
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-02-17 06:02:48 +00:00
Alex Behm
3aa4351625 IMPALA-4170: Fix identifier quoting in COMPUTE INCREMENTAL STATS.
The SQL statements generated from COMPUTE INCREMENTAL STATS
did not properly quote identifiers when incrementally updating
the stats for newly added partitions.

Our existing tests did not catch this case because the code paths
for doing the initial stats computation and the incremental stats
computation are different, in particular, the code for generating
the SQL statements.
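
A sketch of what quoting identifiers in a generated statement looks like is given below. The helper names and the shape of the generated query are hypothetical; only the idea of wrapping every identifier in backticks is taken from the description above.

```python
def quote_ident(name):
    """Wrap an identifier in backticks.

    Embedded backticks are rejected rather than escaped, because this
    sketch makes no assumption about the escaping rules.
    """
    if "`" in name:
        raise ValueError("backtick in identifier is not handled by this sketch")
    return "`" + name + "`"


def partition_count_query(db, table, partition_cols):
    """Build a hypothetical per-partition row count query with quoted names."""
    cols = ", ".join(quote_ident(c) for c in partition_cols)
    target = quote_ident(db) + "." + quote_ident(table)
    return "SELECT COUNT(*), {0} FROM {1} GROUP BY {0}".format(cols, target)
```

Quoting matters when a table or column name collides with a keyword or contains unusual characters, which is the kind of case an unquoted code path would mishandle.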

Change-Id: I63adcc45dc964ce769107bf4139fc4566937bb96
Reviewed-on: http://gerrit.cloudera.org:8080/4479
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
2016-09-21 01:24:53 +00:00
Alex Behm
a41710a0c8 Use unique_database fixture in test_compute_stats.py.
This patch makes it a little easier to use the unique_database fixture
with .test files. The RESULTS section can now contain $DATABASE which
is replaced with the current database by the test framework.
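
The substitution can be sketched as below. The function name is hypothetical and the real test framework may perform the replacement differently; only the $DATABASE placeholder comes from the description above.

```python
def expand_database(expected_lines, database):
    """Replace the $DATABASE placeholder in expected result lines."""
    return [line.replace("$DATABASE", database) for line in expected_lines]
```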

Testing:
- ran the test locally on exhaustive
- ran the test on hdfs and the local filesystem on Jenkins

Change-Id: I8655eb769003f88c0e1ec1b254118e4ec3353b48
Reviewed-on: http://gerrit.cloudera.org:8080/2947
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:50 -07:00
Christopher Channing
9ea5caf0ef IMPALA-2199: Row count not set for empty partition when spec is used with compute incremental stats
This patch resolves an issue where the row count is not set to 0 when a partition spec is
used with 'compute incremental stats' on a partition that contains no data. The fix is
to populate the partition 'expected list' in the frontend with the partition spec; the
backend keeps track of which partitions had statistics generated. In the scenario where
no statistics are generated for a partition, the backend will fall back to the
'expected list' to zero out the statistics.
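
The fallback can be sketched as follows. The function and argument names are hypothetical, and the real backend works on Thrift structures rather than dictionaries.

```python
def finalize_row_counts(expected_partitions, computed_counts):
    """Return a row count for every expected partition.

    expected_partitions: partition names supplied by the frontend.
    computed_counts: dict of partition name -> row count for partitions
        that actually produced statistics.
    Partitions with no computed statistics fall back to a count of 0.
    """
    return {p: computed_counts.get(p, 0) for p in expected_partitions}
```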

Change-Id: If4aac131dbe44e14a0477afa58e980da9e235d6b
Reviewed-on: http://gerrit.cloudera.org:8080/627
Reviewed-by: Christopher Channing <cchanning@cloudera.com>
Tested-by: Internal Jenkins
2015-08-13 09:38:30 +00:00
Shant Hovsepian
6d87fe090c Improve Hll estimate for small cardinalities.
Based on Google's HyperLogLog++ paper. Uses a bias correcting
interpolation as a sub algorithm for Hll estimates within a specific
range.
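
To make the idea of a bias-correcting interpolation concrete, the sketch below subtracts an interpolated bias from a raw estimate. It is a simplification under stated assumptions: it interpolates linearly between the two neighbouring sample points, the function names are hypothetical, and any (raw estimate, bias) table passed in would have to come from empirical measurements that are not reproduced here.

```python
from bisect import bisect_left


def interpolate_bias(raw_estimate, raw_points, bias_points):
    """Linearly interpolate the bias for raw_estimate.

    raw_points must be sorted ascending and strictly increasing;
    bias_points[i] is the measured bias at raw_points[i].
    Estimates outside the table are clamped to the nearest end point.
    """
    if raw_estimate <= raw_points[0]:
        return bias_points[0]
    if raw_estimate >= raw_points[-1]:
        return bias_points[-1]
    hi = bisect_left(raw_points, raw_estimate)
    lo = hi - 1
    frac = (raw_estimate - raw_points[lo]) / (raw_points[hi] - raw_points[lo])
    return bias_points[lo] + frac * (bias_points[hi] - bias_points[lo])


def corrected_estimate(raw_estimate, raw_points, bias_points):
    """Subtract the interpolated bias from the raw estimate."""
    return raw_estimate - interpolate_bias(raw_estimate, raw_points, bias_points)
```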

Change-Id: If4fe692b4308f6a57aea6167e9bc00db11eaaab9
Reviewed-on: http://gerrit.cloudera.org:8080/415
Tested-by: Internal Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
2015-07-16 19:38:17 +00:00
Henry Robinson
f22b8659fd IMPALA-1595: Add 'location' to SHOW [TABLE STATS|PARTITIONS] for HDFS tables
This patch adds a 'location' column to the output of SHOW TABLE STATS /
SHOW PARTITIONS. This helps users understand the effects of ALTER TABLE
SET LOCATION commands, particularly for partitions, and is easier to
identify than the output of DESCRIBE FORMATTED.

Some existing tests in alter-table.test have been updated to include
checking the location output before and after a SET LOCATION
command. The tests in show.test have also been updated to check for the
location; all other tests that use SHOW [TABLE STATS|PARTITIONS] use a
generic regex to avoid overly verbose tests.

Change-Id: I9d276f7b133c38c9319e0906397ca1c31cec95bb
Reviewed-on: http://gerrit.cloudera.org:8080/316
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
2015-04-21 19:27:50 +00:00
Henry Robinson
146fe64a26 IMPALA-1615: Don't drop row count during DROP INCREMENTAL STATS
Change-Id: I1ae23ca9d70eeb58a3c7c8c59fb633832edcff58
Reviewed-on: http://gerrit.cloudera.org:8080/148
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-03-05 20:22:49 +00:00
ishaan
8369c3b13b Remove explicit references to functional_hbase tables from .test files.
Additionally, this patch disables the hbase/none test dimension if the
TARGET_FILESYSTEM environment variable is set to either s3 or isilon.

Change-Id: I63aecaa478d2ba9eb68de729e9640071359a2eeb
Reviewed-on: http://gerrit.cloudera.org:8080/74
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-02-23 23:32:41 +00:00
Alex Behm
ad6b9364c0 IMPALA-1629: Compute stats properly updates CHAR/VARCHAR column stats.
The problem was that VARCHAR was not present in a few switch statements
for updating/populating column stats in various places.

Change-Id: I0b2b316b734d27a7ff08701b0986014be2473443
Reviewed-on: http://gerrit.cloudera.org:8080/65
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-02-19 22:53:42 +00:00
Martin Grund
cee1e84c1e IMPALA-1587: Extending caching directives for multiple replicas
This patch adds the possibility to specify the number of replicas that
should be cached in main memory. This can be useful in high QPS
scenarios because the load is no longer concentrated on a single cached
replica, but spread across a set of cached replicas. While the cache replication
factor can be larger than the block replication factor on disk, the
difference will be ignored by HDFS until more replicas become
available.

This extends the current syntax for specifying the cache pool in the
following way:

   cached in 'poolName'

is extended with the optional replication factor

   cached in 'poolName' with replication = XX

By default, the cache replication factor is set to 1. As this value is
not yet configurable in HDFS it's defined as a constant in the JniCatalog
thrift specification. If a partitioned table is cached, all its child
partitions inherit this cache replication factor. If child partitions
have a custom cache replication factor, changing the cache replication
factor on the partitioned table afterwards will overwrite this custom
value. If a new partition is added to the table, it will again inherit
the cache replication factor of the parent independent of the cache pool
that is used to cache the partition.

To let users review changes to, and the status of, the replication
factor for tables and partitions, the replication factor is included in
the output of the "show partitions" command.

Change-Id: I2aee63258d6da14fb5ce68574c6b070cf948fb4d
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5533
Tested-by: jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
2015-01-26 20:30:59 -08:00
Henry Robinson
13b9cdd6b0 More test coverage for incremental stats
Change-Id: I17778dcf019c2a219baa678211f221b7e04813bb
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5446
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 2f5ec47e0b5dd26bc9dfe884481d3a316201be2d)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5460
Reviewed-by: Henry Robinson <henry@cloudera.com>
2014-12-01 17:26:39 -08:00
Henry Robinson
98064c4da8 Fix crash when columns are dropped between compute stats calls
FinalizePartitionedColumnStats() should have iterated over the list of
columns present in the table, rather than the list in the existing stats
data structure. If a column was dropped, but still persisted in the old
structure, it was possible that we could index off the end of an array.
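
The shape of the fix can be sketched as follows. This is an illustration with hypothetical names and dictionary-based data, not the code in FinalizePartitionedColumnStats(): the point is only that the loop is driven by the table's current columns, so stats left over for a dropped column are never visited.

```python
def finalize_column_stats(table_columns, existing_stats):
    """Collect stats for the columns currently present in the table.

    table_columns: names of the columns the table has now.
    existing_stats: dict of column name -> stats from a previous run,
        which may still contain entries for dropped columns.
    Columns without earlier stats map to None.
    """
    return {col: existing_stats.get(col) for col in table_columns}
```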

Change-Id: Ib1ab7690ffae05afff826b9d1a15871337691739
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5437
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
(cherry picked from commit cee8305cd2878c8f00622d39ddd43b7a5dfbbc0d)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5447
Reviewed-by: Henry Robinson <henry@cloudera.com>
2014-11-26 23:25:11 -08:00
Henry Robinson
44f57e5fb6 IMPALA-1122: Compute stats with partition granularity
This patch adds the ability to compute and drop column and table
statistics at partition granularity.

The following commands are added. Detail about the implementation
follows.

COMPUTE INCREMENTAL STATS <tbl_name> [PARTITION <partition_spec>]

This variant of COMPUTE STATS will, ultimately, do the same thing as the
traditional COMPUTE STATS statement, but does so by caching the
intermediate state of the computation for each partition in the Hive
MetaStore. If the PARTITION clause is added, the computation is
performed for only that partition. If the PARTITION clause is omitted,
incremental stats are updated only for those partitions with missing
incremental stats (e.g. one column does not have stats, or incremental
stats were never computed for this partition). In this patch, incremental
stats are only invalidated when a DROP STATS variant is executed. Future
patches can automatically invalidate the statistics after REFRESH or
INSERT queries, etc.
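
The partition selection rule described above can be sketched as below. The function name and the dictionary of validity flags are hypothetical stand-ins for the catalog metadata.

```python
def partitions_to_compute(all_partitions, has_valid_stats, requested=None):
    """Pick the partitions a COMPUTE INCREMENTAL STATS run would process.

    all_partitions: names of every partition in the table.
    has_valid_stats: dict of partition name -> True if valid incremental
        stats are already stored for it.
    requested: partition named in a PARTITION clause, or None if omitted.
    """
    if requested is not None:
        # With a PARTITION clause only that partition is computed.
        return [requested]
    # Without it, only partitions with missing incremental stats are computed.
    return [p for p in all_partitions if not has_valid_stats.get(p, False)]
```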

DROP INCREMENTAL STATS <tbl_name> PARTITION <part_spec>

This variant of DROP STATS removes the incremental statistics for the
given partition. It does *not* recalculate the statistics for the whole
table, so this should be used only to invalidate the intermediate state
for a partition which will shortly be subject to COMPUTE INCREMENTAL
STATS. The point of this variant is to allow users to notify Impala when
they believe a partition has changed significantly enough to warrant
recomputation of its statistics. It is not necessary for new partitions;
Impala will detect that they do not have any valid statistics.

--------

This is achieved by adapting the existing HLL UDA via swapping its
finalize method for a new one which returns the intermediate HLL
buckets, rather than aggregating and then disposing of them. This
intermediate state is then returned to Impala's catalog-op-executor.cc,
which then passes the intermediate state back to the frontend to be
ultimately stored in the HMS.

This intermediate state is computed on a per-partition basis by grouping
the input to the UDA by partition. Thus, the incremental computation
produces one row for each partition selected (the set of which might be
quite small, if there are few partitions without valid incremental
stats: this is the point of the new commands).

At the same time, the query coordinator aggregates the output of the UDA
to produce table-level statistics. This computation incorporates any
existing (and not re-computed) intermediate partition state which is
passed to the coordinator by the frontend. The resulting statistics are
saved to the table as normal.
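
Combining per-partition HLL state into a table-level value relies on the standard HyperLogLog merge, which takes the element-wise maximum of the buckets. The sketch below shows that merge on plain lists; the function name and the list representation are assumptions, and the final cardinality estimate computed from the merged buckets is omitted.

```python
def merge_hll_buckets(partition_buckets):
    """Merge per-partition HLL buckets by element-wise maximum.

    partition_buckets: iterable of equal-length lists of bucket values,
        one list per partition (freshly computed or previously stored).
    Returns the merged bucket list, or an empty list if there is no input.
    """
    merged = None
    for buckets in partition_buckets:
        if merged is None:
            merged = list(buckets)
            continue
        if len(buckets) != len(merged):
            raise ValueError("all bucket lists must have the same length")
        merged = [max(a, b) for a, b in zip(merged, buckets)]
    return merged if merged is not None else []
```

Because the maximum is associative and commutative, stored state for untouched partitions can be merged with newly computed state in any order.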

Intermediate statistics are serialised to the HMS by writing a Thrift
structure's serialised form to the partition's 'parameters' map. There
is a schema-imposed limit of 4000 characters to the serialised string,
which is exacerbated by the fact that the Thrift representation must
first be base-64 encoded to avoid type errors in the HMS. The current
patch breaks the encoded structure into 4k chunks, and then recombines
them on read. The alltypes table (11 columns) takes about three of these
chunks. This may mean that incremental stats are not suitable for
particularly wide tables: these structures could be zipped before
encoding for some space savings. In the meantime, the NDV estimates are
run-length encoded (since they are generally sparse); this can result in
substantial space savings.
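
The storage scheme above can be sketched as follows. This is an illustration only: the function names are hypothetical, the payload is an arbitrary byte string rather than a serialised Thrift structure, and the run-length encoding shown is a generic (value, count) form that is not claimed to match the patch's on-disk layout.

```python
import base64

# Assumed to match the 4000-character limit mentioned above.
CHUNK_SIZE = 4000


def encode_chunks(payload, chunk_size=CHUNK_SIZE):
    """Base64-encode payload bytes and split the text into fixed-size chunks."""
    encoded = base64.b64encode(payload).decode("ascii")
    return [encoded[i:i + chunk_size] for i in range(0, len(encoded), chunk_size)]


def decode_chunks(chunks):
    """Recombine chunks in order and decode back to the original bytes."""
    return base64.b64decode("".join(chunks))


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]
```

Sparse bucket arrays, which are mostly zeros, shrink considerably under run-length encoding, which reduces the number of chunks needed per partition.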

Change-Id: If82cf4753d19eb532265acb556f798b95fbb0f34
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4475
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5408
2014-11-25 09:13:37 -08:00