This patch makes it a little easier to use the unique_database fixture
with .test files. The RESULTS section can now contain $DATABASE which
is replaced with the current database by the test framework.
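For illustration, a minimal Python sketch of the substitution (the helper name and
the result snippet are hypothetical, not the framework's actual code):
  def substitute_database(expected_results, unique_database):
      # Expand $DATABASE placeholders before comparing expected vs. actual rows.
      return expected_results.replace("$DATABASE", unique_database)

  expected = "'$DATABASE.alltypes_clone','table'"
  print(substitute_database(expected, "test_compute_stats_abc123"))
  # -> "'test_compute_stats_abc123.alltypes_clone','table'"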
Testing:
- ran the test locally on exhaustive
- ran the test on hdfs and the local filesystem on Jenkins
Change-Id: I8655eb769003f88c0e1ec1b254118e4ec3353b48
Reviewed-on: http://gerrit.cloudera.org:8080/2947
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Since 0.14.0, Hive has allowed creating Avro tables without an explicit Avro schema.
For such tables, the Avro schema is inferred from the column definitions,
and not stored in the metadata at all (no Avro schema literal or Avro schema file).
This patch adds support for loading the metadata of such tables, although Impala
currently cannot create such tables (expect a follow-on patch).
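For a rough illustration of the inference, a hedged Python sketch (the type mapping
is deliberately incomplete and not Hive's or Impala's actual code):
  HIVE_TO_AVRO = {"int": "int", "bigint": "long", "string": "string",
                  "double": "double", "boolean": "boolean"}

  def infer_avro_schema(table_name, columns):
      # columns: list of (name, hive_type) pairs taken from the column definitions.
      # Fields are modelled as nullable unions for illustration.
      return {"type": "record", "name": table_name,
              "fields": [{"name": n, "type": ["null", HIVE_TO_AVRO[t]]}
                         for n, t in columns]}

  print(infer_avro_schema("t", [("id", "int"), ("name", "string")]))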
Change-Id: I9e66921ffbeff7ce6db9619bcfb30278b571cd95
Reviewed-on: http://gerrit.cloudera.org:8080/538
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Based on Google's HyperLogLog++ paper. Uses a bias-correcting
interpolation as a sub-algorithm for HLL estimates within a specific
range.
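For intuition only, a hedged Python sketch of the bias correction (the lookup values
are placeholders; the paper ships per-precision tables and interpolates over the
k nearest entries):
  import bisect

  RAW_ESTIMATES = [740.0, 750.0, 760.0, 770.0]  # placeholder data points
  BIASES        = [ 11.0,  10.2,   9.5,   8.9]  # measured bias at each point

  def bias_corrected(raw_estimate):
      # Linearly interpolate the bias for raw_estimate, then subtract it.
      i = bisect.bisect_left(RAW_ESTIMATES, raw_estimate)
      if i == 0:
          return raw_estimate - BIASES[0]
      if i == len(RAW_ESTIMATES):
          return raw_estimate - BIASES[-1]
      x0, x1 = RAW_ESTIMATES[i - 1], RAW_ESTIMATES[i]
      b0, b1 = BIASES[i - 1], BIASES[i]
      frac = (raw_estimate - x0) / (x1 - x0)
      return raw_estimate - (b0 + frac * (b1 - b0))

  print(bias_corrected(755.0))  # ~745.15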
Change-Id: If4fe692b4308f6a57aea6167e9bc00db11eaaab9
Reviewed-on: http://gerrit.cloudera.org:8080/415
Tested-by: Internal Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
This patch adds a 'location' column to the output of SHOW TABLE STATS /
SHOW PARTITIONS. This helps users understand the effects of ALTER TABLE
SET LOCATION commands, particularly for partitions, and makes the location
easier to find than in the output of DESCRIBE FORMATTED.
Some existing tests in alter-table.test have been updated to include
checking the location output before and after a SET LOCATION
command. The tests in show.test have also been updated to check for the
location; all other tests that use SHOW [TABLE STATS|PARTITIONS] use a
generic regex to avoid overly verbose tests.
Change-Id: I9d276f7b133c38c9319e0906397ca1c31cec95bb
Reviewed-on: http://gerrit.cloudera.org:8080/316
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
The following commits disabled tests to unblock the full data load:
a00a9a5e53f7a8e7a1e3c931ea0e4b7db21c6f00
bf29d06f2e53bb924d250275d51f5ccd1213531d
This patch re-enables those tests and adds new tests to guard against
regressions related to HIVE-6308.
Unfortunately, we cannot completely remove the analysis check for HIVE-6308
in our code, because there is still one case where COMPUTE STATS will fail on
a Hive-created Avro table: if there is a mismatch in column names between
the Avro schema and the column defs given to a CREATE TABLE in Hive.
Change-Id: I81ae6b526db02fdfc634e09eeb9d12036e2adfdd
Reviewed-on: http://gerrit.cloudera.org:8080/180
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Add skip markers for S3 that can be used to categorize the tests that
are skipped against S3, to make it easier to see what coverage is missing.
Soon we'll be reworking some tests and/or adding new tests to close the
important gaps.
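A hedged sketch of what a categorized skip marker can look like (marker names and
reasons are illustrative, not the test suite's actual definitions):
  import os
  import pytest

  class SkipIfS3:
      _on_s3 = os.environ.get("TARGET_FILESYSTEM") == "s3"
      insert = pytest.mark.skipif(_on_s3, reason="INSERT not yet supported on S3")
      hdfs_caching = pytest.mark.skipif(_on_s3, reason="HDFS caching requires HDFS")

  @SkipIfS3.hdfs_caching
  def test_cached_table_scan():
      pass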
Also, add a mechanism to parameterize paths in the .test files, and
start using these new variables. This is a step toward enabling some
more tests against S3.
Finally, a fix for buildall.sh to stop the minicluster before applying
the metastore snapshot. Otherwise, applying the snapshot fails because
the metastore DB is in use.
Change-Id: I142434ed67bed407e61d7b2c90f825734fc0dce0
Reviewed-on: http://gerrit.cloudera.org:8080/127
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Additionally, this patch disables the hbase/none test dimension if the
TARGET_FILESYSTEM environment variable is set to either s3 or isilon.
Change-Id: I63aecaa478d2ba9eb68de729e9640071359a2eeb
Reviewed-on: http://gerrit.cloudera.org:8080/74
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
The problem was that VARCHAR was not present in a few switch statements
for updating/populating column stats in various places.
Change-Id: I0b2b316b734d27a7ff08701b0986014be2473443
Reviewed-on: http://gerrit.cloudera.org:8080/65
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
This patch adds the ability to specify the number of replicas that
should be cached in main memory. This can be useful in high-QPS
scenarios, as the load is no longer concentrated on a single cached
replica but spread across a set of cached replicas. While the cache
replication factor can be larger than the block replication factor on
disk, the difference will be ignored by HDFS until more replicas become
available.
This extends the current syntax for specifying the cache pool:
cached in 'poolName'
with an optional replication factor:
cached in 'poolName' with replication = XX
By default, the cache replication factor is set to 1. As this value is
not yet configurable in HDFS it's defined as a constant in the JniCatalog
thrift specification. If a partitioned table is cached, all its child
partitions inherit this cache replication factor. If child partitions
have a custom cache replication factor, changing the cache replication
factor on the partitioned table afterwards will overwrite this custom
value. If a new partition is added to the table, it will again inherit
the cache replication factor of the parent, regardless of the cache pool
used to cache the partition.
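A hedged Python sketch of the inheritance behaviour described above (class and field
names are illustrative, not the catalog's actual code):
  DEFAULT_CACHE_REPLICATION = 1  # constant for now; not yet configurable in HDFS

  class CachedTableSketch:
      def __init__(self, replication=DEFAULT_CACHE_REPLICATION):
          self.replication = replication
          self.partition_replication = {}

      def add_partition(self, name):
          # New partitions inherit the table's current replication factor.
          self.partition_replication[name] = self.replication

      def set_table_replication(self, replication):
          # Re-setting the table-level factor overwrites any custom partition value.
          self.replication = replication
          for name in self.partition_replication:
              self.partition_replication[name] = replication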
To make the replication factor for tables and partitions easy to review,
it is included in the output of the SHOW PARTITIONS command.
Change-Id: I2aee63258d6da14fb5ce68574c6b070cf948fb4d
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5533
Tested-by: jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
This patch adds the ability to compute and drop column and table
statistics at partition granularity.
The following commands are added. Detail about the implementation
follows.
COMPUTE INCREMENTAL STATS <tbl_name> [PARTITION <partition_spec>]
This variant of COMPUTE STATS will, ultimately, do the same thing as the
traditional COMPUTE STATS statement, but does so by caching the
intermediate state of the computation for each partition in the Hive
MetaStore. If the PARTITION clause is added, the computation is
performed for only that partition. If the PARTITION clause is omitted,
incremental stats are updated only for those partitions with missing
incremental stats (e.g. one column does not have stats, or incremental
stats was never computed for this partition). In this patch, incremental
stats are only invalidated when a DROP STATS variant is executed. Future
patches can automatically invalidate the statistics after REFRESH or
INSERT queries, etc.
DROP INCREMENTAL STATS <tbl_name> PARTITION <part_spec>
This variant of DROP STATS removes the incremental statistics for the
given partition. It does *not* recalculate the statistics for the whole
table, so this should be used only to invalidate the intermediate state
for a partition which will shortly be subject to COMPUTE INCREMENTAL
STATS. The point of this variant is to allow users to notify Impala when
they believe a partition has changed significantly enough to warrant
recomputation of its statistics. It is not necessary for new partitions;
Impala will detect that they do not have any valid statistics.
--------
This is achieved by adapting the existing HLL UDA: its finalize method
is swapped for a new one which returns the intermediate HLL buckets
rather than aggregating and then disposing of them. This intermediate
state is returned to Impala's catalog-op-executor.cc, which passes it
back to the frontend to be ultimately stored in the HMS.
This intermediate state is computed on a per-partition basis by grouping
the input to the UDA by partition. Thus, the incremental computation
produces one row for each partition selected (the set of which might be
quite small, if there are few partitions without valid incremental
stats: this is the point of the new commands).
At the same time, the query coordinator aggregates the output of the UDA
to produce table-level statistics. This computation incorporates any
existing (and not re-computed) intermediate partition state which is
passed to the coordinator by the frontend. The resulting statistics are
saved to the table as normal.
Intermediate statistics are serialised to the HMS by writing a Thrift
structure's serialised form to the partition's 'parameters' map. There
is a schema-imposed limit of 4000 characters to the serialised string,
which is exacerbated by the fact that the Thrift representation must
first be base-64 encoded to avoid type errors in the HMS. The current
patch breaks the encoded structure into 4k chunks, and then recombines
them on read. The alltypes table (11 columns) takes about three of these
chunks. This may mean that incremental stats are not suitable for
particularly wide tables: these structures could be zipped before
encoding for some space savings. In the meantime, the NDV estimates are
run-length encoded (since they are generally sparse); this can result in
substantial space savings.
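A hedged Python sketch of the chunked round trip (parameter key names and the chunk
size are illustrative; the run-length encoding of the NDV buckets is omitted):
  import base64

  CHUNK_LEN = 4000  # schema-imposed limit on a single 'parameters' value

  def write_chunks(params, serialised_thrift):
      encoded = base64.b64encode(serialised_thrift).decode("ascii")
      chunks = [encoded[i:i + CHUNK_LEN] for i in range(0, len(encoded), CHUNK_LEN)]
      params["impala_intermediate_stats_num_chunks"] = str(len(chunks))
      for i, chunk in enumerate(chunks):
          params["impala_intermediate_stats_chunk%d" % i] = chunk

  def read_chunks(params):
      n = int(params["impala_intermediate_stats_num_chunks"])
      encoded = "".join(params["impala_intermediate_stats_chunk%d" % i]
                        for i in range(n))
      return base64.b64decode(encoded)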
Change-Id: If82cf4753d19eb532265acb556f798b95fbb0f34
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4475
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5408
Because we add a 'total' row as the last row of SHOW PARTITIONS output,
we set the partition key columns to be STRING. At least, that's what the
comment said, but we didn't actually do that.
This patch also corrects the column type for max width, which should be INT.
Change-Id: I787ab17be27f45107340119017e528c58a3daad3
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4678
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins
This patch addresses the following issues:
1. Allow creating Avro tables without col defs in Impala. Compute stats works on them.
2. Handle table creation with inconsistent col defs and Avro schema as follows:
The table creation will succeed and ignore the col defs in favor of the Avro schema.
A warning is issued that the col defs and the Avro schema are inconsistent.
Compute stats works on such tables.
This patch does not address the issue of compute stats after Avro schema evolution.
Change-Id: Iea6b737d238d81491dc2097012ebc149a89d03ba
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4182
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4250
Tested-by: jenkins
The query used to generate the stats does a GROUP BY on the partition keys,
and so empty partitions will not get any results. Detect the empty partition
case and set the number of rows to 0.
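The fix amounts to something like this hedged sketch (names are illustrative):
  def per_partition_row_counts(all_partitions, counted):
      # counted: {partition_key: num_rows} from the GROUP BY stats query; empty
      # partitions are absent from it, so record 0 rows for them explicitly.
      return {p: counted.get(p, 0) for p in all_partitions}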
Change-Id: I1ccb7d2016f35026aa1b418155c4534024f3cee5
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4029
Reviewed-by: Daniel Hecht <dhecht@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 128a02f508cdb280b53b8a8429e6b90491e43956)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4042
Semi- or anti-joined table references are now only visible inside the
ON clause of the corresponding join.
Change-Id: Id93e53ecdf2a74baf9736aa427fa7af15358ca27
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3789
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
This patch improves the performance of the DDL queries
"Alter table add partition" and "Alter table drop partition"
as the number of partitions is scaled up.
The issue was that every time a partition was added or dropped,
the entire block metadata for that table was reloaded. This
operation was highly expensive, especially as the number
of partitions became larger.
This patch addresses this by adding/dropping only the affected
partition's metadata in the HdfsTable (inserting it into or removing it
from the internal partition list) and incrementally updating the
corresponding data structures instead of refreshing them from scratch.
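A hedged Python sketch of the incremental bookkeeping (class and method names are
illustrative, not the real HdfsTable code):
  class HdfsTableSketch:
      def __init__(self):
          self.partitions = {}  # partition spec -> per-partition file/block metadata

      def add_partition(self, spec, block_metadata):
          # Attach metadata for just this partition; other entries stay untouched.
          self.partitions[spec] = block_metadata

      def drop_partition(self, spec):
          # Remove only this partition's entry; no full table metadata reload.
          self.partitions.pop(spec, None)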
The following are the observed time improvements:
Number of existing    Time to add/drop a        Time to add/drop a
partitions            new partition (before)    new partition (now)
     1                      1.02s                     1.02s
    10                      0.27s                     0.27s
   100                      0.14s                     0.14s
   500                      0.35s                     0.35s
  1000                      0.91s                     0.51s
 10000                     11.72s                     0.85s
 20000                     21.92s                     0.87s
Of this total time (in the worst case), around 0.50s is spent adding
and dropping the partition in the Hive Metastore, and the rest of the
time is spent updating the catalog.
Change-Id: I359ab0af921543c0fdcb975c14b05f80f93fe803
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3291
Reviewed-by: Anusha Dasarakothapalli <anusha.dasarakothapalli@cloudera.com>
Tested-by: jenkins
Adds support for dropping all table and column stats from a table. Once incremental
stats are supported, this will provide the user a way to force a recompute of all
stats.
Change-Id: I27e03d5986b64eb91852bfc3417ffa971d432d6b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3533
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
(cherry picked from commit f1f074f24bfdc77c4cef147fe9d26f27df80ab81)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3551
With IMPALA-1033 we disabled the counting of the number of NULLs in each column,
and that gave a 2x speed-up in the computation. But the value 0 was erroneously
being stored as the number of NULLs, instead of the correct -1 that indicates
'unknown'.
Change-Id: Ib882eb2a87e7e2469f606081cb2881461b441a45
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3377
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3378
The compute stats statement was not quoting the DB and table names. If those names
collided with keywords, the COMPUTE STATS statement would fail to execute due to a
syntax error.
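A hedged sketch of the kind of quoting the generated query needs (assumes the names
themselves contain no backticks):
  def quote_identifier(name):
      # Wrap in backticks so names that collide with keywords still parse.
      return "`" + name + "`"

  db, tbl = "default", "update"  # 'update' collides with a keyword
  print("SELECT COUNT(*) FROM %s.%s" % (quote_identifier(db), quote_identifier(tbl)))
  # SELECT COUNT(*) FROM `default`.`update`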
Change-Id: Ie08421246bb54a63a44eaf19d0d835da780b7033
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3170
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3198
This change adds DDL support for HDFS caching. The DDL allows the user to indicate a
table or partition should be cached and which pool to cache the data into:
* Create a cached table: CREATE TABLE ... CACHED IN 'poolName'
* Cache a table/partition: ALTER TABLE ... [partitionSpec] SET CACHED IN 'poolName'
* Uncache a table/partition: ALTER TABLE ... [partitionSpec] SET UNCACHED
When a table/partition is marked as cached, a new HDFS caching request is submitted
to cache the location (HDFS path) of the table/partition, and the ID of that request
is stored in the table metadata (in the table properties). This is stored as:
'cache_directive_id'='<requestId>'. The cache requests and IDs are managed by HDFS
and persisted across HDFS restarts.
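A hedged sketch of that bookkeeping (the HDFS client calls below are stand-ins, not
a real API):
  def cache_location(hdfs_admin, table_props, path, pool_name):
      # Submit a cache directive for the table/partition location and remember its ID.
      directive_id = hdfs_admin.add_cache_directive(path=path, pool=pool_name)
      table_props["cache_directive_id"] = str(directive_id)

  def uncache_location(hdfs_admin, table_props):
      # Drop the associated cache request, if any, when the table/partition is uncached.
      directive_id = table_props.pop("cache_directive_id", None)
      if directive_id is not None:
          hdfs_admin.remove_cache_directive(int(directive_id))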
When a cached table or partition is dropped it is important to uncache the cached data
(drop the associated cache request). For partitioned tables, this means dropping all
cache requests from all cached partitions in the table.
Likewise, if a partitioned table is created as cached, new partitions should be marked
as cached by default.
It is desirable to know which cache pools exist early on (in analysis) so the query
will fail without hitting HDFS/CatalogServer if a non-existent pool is specified. To
support this, a new cache pool catalog object type was introduced. The catalog server
caches the known pools (periodically refreshing the cache) and sends the known pools out
in catalog updates. This allows impalads to perform analysis checks on cache pool
existence without going to HDFS. It would be easy to use this to add basic cache pool
management in the future (ADD/DROP/SHOW CACHE POOL).
Waiting for the table/partition to become cached may take a long time. Instead of
blocking the user from accessing the table during this period, we wait for the cache
requests to complete in the background, and once they have finished the table metadata
is automatically refreshed.
Change-Id: I1de9c6e25b2a3bdc09edebda5510206eda3dd89b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2310
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Avro tables that were not created with a column-definition list do not have
their columns properly populated in the Metastore backend DB (HIVE-6308).
For such tables COMPUTE STATS and Hive's ANALYZE TABLE cannot succeed.
This patch fails COMPUTE STATS in analysis for such broken Avro tables
and adds tests for Avro tables with a mismatched column-definition list
and Avro schema.
Change-Id: I561ecea944ae2f83d69950b7a1ab9edaa89bdcea
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1892
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1920
The parquet file stores the application version that wrote it, so it
differs between our c4 and c5 branches.
HBase storage is also not guaranteed to be identical across versions.
Change-Id: I02984a55e0678756e50c1fff6db22c43788d3916
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1028
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
A compute stats command computes the table and column stats for a given
table and persists them in the metastore.
The table stats consist of the per-partition and per-table row count.
The column stats are computed on a per-table basis and consist of the
number of distinct values and the number of NULLs per column.
This patch introduces a new 'child query' concept that
compute stats utilizes. Child queries are cancelled
if the parent query is cancelled. A compute stats stmt is
executed by the following query hierarchy:
parent: compute stats query (DDL)
- child: compute table stats query (QUERY)
- child: compute column stats query (QUERY)
The new child query concept is necessary to decouple child query fetches
from parent query fetches, i.e., we could not execute a child query as
part of the original compute stats query, because then a client could
fetch the results we need for updating the Metastore statistics. The
reason why our existing CTAS works without this decoupling
is that its insert 'child query' is not fetchable.
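A hedged Python sketch of the parent/child cancellation relationship (classes and
query text are illustrative):
  class QuerySketch:
      def __init__(self, stmt):
          self.stmt = stmt
          self.children = []
          self.cancelled = False

      def spawn_child(self, stmt):
          child = QuerySketch(stmt)
          self.children.append(child)
          return child

      def cancel(self):
          # Cancelling the parent propagates to all child queries.
          self.cancelled = True
          for child in self.children:
              child.cancel()

  parent = QuerySketch("COMPUTE STATS t")                    # DDL
  parent.spawn_child("SELECT COUNT(*) FROM t GROUP BY ...")  # table stats child
  parent.spawn_child("SELECT NDV(c1), ... FROM t")           # column stats child
  parent.cancel()  # cancels both children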
Change-Id: I560533e3cb09bcbbdb3eea7fcf0b460bc6b36dcd
Reviewed-on: http://gerrit.ent.cloudera.com:8080/873
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins