This patch makes it a little easier to use the unique_database fixture
with .test files. The RESULTS section can now contain $DATABASE which
is replaced with the current database by the test framework.
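For illustration, a minimal Python sketch of the substitution (the helper name and
the result snippet are hypothetical, not the framework's actual code):
  def substitute_database(expected_results, unique_database):
      # Expand $DATABASE placeholders before comparing expected vs. actual rows.
      return expected_results.replace("$DATABASE", unique_database)

  expected = "'$DATABASE.alltypes_clone','table'"
  print(substitute_database(expected, "test_compute_stats_abc123"))
  # -> "'test_compute_stats_abc123.alltypes_clone','table'"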
Testing:
- ran the test locally on exhaustive
- ran the test on hdfs and the local filesystem on Jenkins
Change-Id: I8655eb769003f88c0e1ec1b254118e4ec3353b48
Reviewed-on: http://gerrit.cloudera.org:8080/2947
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Since 0.14.0, Hive has allowed creating Avro tables without an explicit Avro schema.
For such tables, the Avro schema is inferred from the column definitions,
and not stored in the metadata at all (no Avro schema literal or Avro schema file).
This patch adds support for loading the metadata of such tables, although Impala
currently cannot create such tables (expect a follow-on patch).
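For a rough illustration of the inference, a hedged Python sketch (the type mapping
is deliberately incomplete and not Hive's or Impala's actual code):
  HIVE_TO_AVRO = {"int": "int", "bigint": "long", "string": "string",
                  "double": "double", "boolean": "boolean"}

  def infer_avro_schema(table_name, columns):
      # columns: list of (name, hive_type) pairs taken from the column definitions.
      # Fields are modelled as nullable unions for illustration.
      return {"type": "record", "name": table_name,
              "fields": [{"name": n, "type": ["null", HIVE_TO_AVRO[t]]}
                         for n, t in columns]}

  print(infer_avro_schema("t", [("id", "int"), ("name", "string")]))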
Change-Id: I9e66921ffbeff7ce6db9619bcfb30278b571cd95
Reviewed-on: http://gerrit.cloudera.org:8080/538
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Based on Google's HyperLogLog++ paper. Uses a bias-correcting
interpolation as a sub-algorithm for HLL estimates within a specific
range.
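For intuition only, a hedged Python sketch of the bias correction (the lookup values
are placeholders; the paper ships per-precision tables and interpolates over the
k nearest entries):
  import bisect

  RAW_ESTIMATES = [740.0, 750.0, 760.0, 770.0]  # placeholder data points
  BIASES        = [ 11.0,  10.2,   9.5,   8.9]  # measured bias at each point

  def bias_corrected(raw_estimate):
      # Linearly interpolate the bias for raw_estimate, then subtract it.
      i = bisect.bisect_left(RAW_ESTIMATES, raw_estimate)
      if i == 0:
          return raw_estimate - BIASES[0]
      if i == len(RAW_ESTIMATES):
          return raw_estimate - BIASES[-1]
      x0, x1 = RAW_ESTIMATES[i - 1], RAW_ESTIMATES[i]
      b0, b1 = BIASES[i - 1], BIASES[i]
      frac = (raw_estimate - x0) / (x1 - x0)
      return raw_estimate - (b0 + frac * (b1 - b0))

  print(bias_corrected(755.0))  # ~745.15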
Change-Id: If4fe692b4308f6a57aea6167e9bc00db11eaaab9
Reviewed-on: http://gerrit.cloudera.org:8080/415
Tested-by: Internal Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
This patch adds a 'location' column to the output of SHOW TABLE STATS /
SHOW PARTITIONS. This helps users understand the effects of ALTER TABLE
SET LOCATION commands, particularly for partitions, and makes the location
easier to find than in the output of DESCRIBE FORMATTED.
Some existing tests in alter-table.test have been updated to include
checking the location output before and after a SET LOCATION
command. The tests in show.test have also been updated to check for the
location; all other tests that use SHOW [TABLE STATS|PARTITIONS] use a
generic regex to avoid overly verbose tests.
Change-Id: I9d276f7b133c38c9319e0906397ca1c31cec95bb
Reviewed-on: http://gerrit.cloudera.org:8080/316
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
The following commits disabled tests to unblock the full data load:
a00a9a5e53f7a8e7a1e3c931ea0e4b7db21c6f00
bf29d06f2e53bb924d250275d51f5ccd1213531d
This patch re-enables those tests and adds new tests to guard against
regressions related to HIVE-6308.
Unfortunately, we cannot completely remove the analysis check for HIVE-6308
in our code, because there is still one case where COMPUTE STATS will fail on
a Hive-created Avro table: if there is a mismatch in column names between
the Avro schema and the column defs given to a CREATE TABLE in Hive.
Change-Id: I81ae6b526db02fdfc634e09eeb9d12036e2adfdd
Reviewed-on: http://gerrit.cloudera.org:8080/180
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Add skip markers for S3 that can be used to categorize the tests that
are skipped against S3, to make it easier to see what coverage is missing.
Soon we'll be reworking some tests and/or adding new tests to close the
important gaps.
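A hedged sketch of what a categorized skip marker can look like (marker names and
reasons are illustrative, not the test suite's actual definitions):
  import os
  import pytest

  class SkipIfS3:
      _on_s3 = os.environ.get("TARGET_FILESYSTEM") == "s3"
      insert = pytest.mark.skipif(_on_s3, reason="INSERT not yet supported on S3")
      hdfs_caching = pytest.mark.skipif(_on_s3, reason="HDFS caching requires HDFS")

  @SkipIfS3.hdfs_caching
  def test_cached_table_scan():
      pass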
Also, add a mechanism to parameterize paths in the .test files, and
start using these new variables. This is a step toward enabling some
more tests against S3.
Finally, a fix for buildall.sh to stop the minicluster before applying
the metastore snapshot. Otherwise, applying the snapshot fails because
the metastore DB is in use.
Change-Id: I142434ed67bed407e61d7b2c90f825734fc0dce0
Reviewed-on: http://gerrit.cloudera.org:8080/127
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Additionally, this patch disables the hbase/none test dimension if the
TARGET_FILESYSTEM environment variable is set to either s3 or isilon.
Change-Id: I63aecaa478d2ba9eb68de729e9640071359a2eeb
Reviewed-on: http://gerrit.cloudera.org:8080/74
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
The problem was that VARCHAR was not present in a few switch statements
for updating/populating column stats in various places.
Change-Id: I0b2b316b734d27a7ff08701b0986014be2473443
Reviewed-on: http://gerrit.cloudera.org:8080/65
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
This patch adds the ability to specify the number of replicas that
should be cached in main memory. This can be useful in high-QPS
scenarios, as the load is no longer concentrated on a single cached
replica but spread across a set of cached replicas. While the cache
replication factor can be larger than the block replication factor on
disk, the difference will be ignored by HDFS until more replicas become
available.
This extends the current syntax for specifying the cache pool:
cached in 'poolName'
with an optional replication factor:
cached in 'poolName' with replication = XX
By default, the cache replication factor is set to 1. As this value is
not yet configurable in HDFS it's defined as a constant in the JniCatalog
thrift specification. If a partitioned table is cached, all its child
partitions inherit this cache replication factor. If child partitions
have a custom cache replication factor, changing the cache replication
factor on the partitioned table afterwards will overwrite this custom
value. If a new partition is added to the table, it will again inherit
the cache replication factor of the parent, regardless of the cache pool
used to cache the partition.
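A hedged Python sketch of the inheritance behaviour described above (class and field
names are illustrative, not the catalog's actual code):
  DEFAULT_CACHE_REPLICATION = 1  # constant for now; not yet configurable in HDFS

  class CachedTableSketch:
      def __init__(self, replication=DEFAULT_CACHE_REPLICATION):
          self.replication = replication
          self.partition_replication = {}

      def add_partition(self, name):
          # New partitions inherit the table's current replication factor.
          self.partition_replication[name] = self.replication

      def set_table_replication(self, replication):
          # Re-setting the table-level factor overwrites any custom partition value.
          self.replication = replication
          for name in self.partition_replication:
              self.partition_replication[name] = replication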
To make the replication factor for tables and partitions easy to review,
it is included in the output of the SHOW PARTITIONS command.
Change-Id: I2aee63258d6da14fb5ce68574c6b070cf948fb4d
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5533
Tested-by: jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
This patch adds the ability to compute and drop column and table
statistics at partition granularity.
The following commands are added. Detail about the implementation
follows.
COMPUTE INCREMENTAL STATS <tbl_name> [PARTITION <partition_spec>]
This variant of COMPUTE STATS will, ultimately, do the same thing as the
traditional COMPUTE STATS statement, but does so by caching the
intermediate state of the computation for each partition in the Hive
MetaStore. If the PARTITION clause is added, the computation is
performed for only that partition. If the PARTITION clause is omitted,
incremental stats are updated only for those partitions with missing
incremental stats (e.g. one column does not have stats, or incremental
stats was never computed for this partition). In this patch, incremental
stats are only invalidated when a DROP STATS variant is executed. Future
patches can automatically invalidate the statistics after REFRESH or
INSERT queries, etc.
DROP INCREMENTAL STATS <tbl_name> PARTITION <part_spec>
This variant of DROP STATS removes the incremental statistics for the
given partition. It does *not* recalculate the statistics for the whole
table, so this should be used only to invalidate the intermediate state
for a partition which will shortly be subject to COMPUTE INCREMENTAL
STATS. The point of this variant is to allow users to notify Impala when
they believe a partition has changed significantly enough to warrant
recomputation of its statistics. It is not necessary for new partitions;
Impala will detect that they do not have any valid statistics.
--------
This is achieved by adapting the existing HLL UDA: its finalize method
is swapped for a new one which returns the intermediate HLL buckets
rather than aggregating and then disposing of them. This intermediate
state is returned to Impala's catalog-op-executor.cc, which passes it
back to the frontend to be ultimately stored in the HMS.
This intermediate state is computed on a per-partition basis by grouping
the input to the UDA by partition. Thus, the incremental computation
produces one row for each partition selected (the set of which might be
quite small, if there are few partitions without valid incremental
stats: this is the point of the new commands).
At the same time, the query coordinator aggregates the output of the UDA
to produce table-level statistics. This computation incorporates any
existing (and not re-computed) intermediate partition state which is
passed to the coordinator by the frontend. The resulting statistics are
saved to the table as normal.
Intermediate statistics are serialised to the HMS by writing a Thrift
structure's serialised form to the partition's 'parameters' map. There
is a schema-imposed limit of 4000 characters to the serialised string,
which is exacerbated by the fact that the Thrift representation must
first be base-64 encoded to avoid type errors in the HMS. The current
patch breaks the encoded structure into 4k chunks, and then recombines
them on read. The alltypes table (11 columns) takes about three of these
chunks. This may mean that incremental stats are not suitable for
particularly wide tables: these structures could be zipped before
encoding for some space savings. In the meantime, the NDV estimates are
run-length encoded (since they are generally sparse); this can result in
substantial space savings.
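A hedged Python sketch of the chunked round trip (parameter key names and the chunk
size are illustrative; the run-length encoding of the NDV buckets is omitted):
  import base64

  CHUNK_LEN = 4000  # schema-imposed limit on a single 'parameters' value

  def write_chunks(params, serialised_thrift):
      encoded = base64.b64encode(serialised_thrift).decode("ascii")
      chunks = [encoded[i:i + CHUNK_LEN] for i in range(0, len(encoded), CHUNK_LEN)]
      params["impala_intermediate_stats_num_chunks"] = str(len(chunks))
      for i, chunk in enumerate(chunks):
          params["impala_intermediate_stats_chunk%d" % i] = chunk

  def read_chunks(params):
      n = int(params["impala_intermediate_stats_num_chunks"])
      encoded = "".join(params["impala_intermediate_stats_chunk%d" % i]
                        for i in range(n))
      return base64.b64decode(encoded)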
Change-Id: If82cf4753d19eb532265acb556f798b95fbb0f34
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4475
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5408
Because we add a 'total' row as the last row of SHOW PARTITIONS output,
we set the partition key columns to be STRING. At least, that's what the
comment said, but we didn't actually do that.
This patch also corrects the column type for max width, which should be INT.
Change-Id: I787ab17be27f45107340119017e528c58a3daad3
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4678
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins
This patch addresses the following issues:
1. Allow creating Avro tables without col defs in Impala. Compute stats works on them.
2. Handle table creation with inconsistent col defs and Avro schema as follows:
The table creation will succeed and ignore the col defs in favor of the Avro schema.
A warning is issued that the col defs and the Avro schema are inconsistent.
Compute stats works on such tables.
This patch does not address the issue of compute stats after Avro schema evolution.
Change-Id: Iea6b737d238d81491dc2097012ebc149a89d03ba
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4182
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4250
Tested-by: jenkins
The query used to generate the stats does a GROUP BY on the partition keys,
and so empty partitions will not get any results. Detect the empty partition
case and set the number of rows to 0.
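The fix amounts to something like this hedged sketch (names are illustrative):
  def per_partition_row_counts(all_partitions, counted):
      # counted: {partition_key: num_rows} from the GROUP BY stats query; empty
      # partitions are absent from it, so record 0 rows for them explicitly.
      return {p: counted.get(p, 0) for p in all_partitions}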
Change-Id: I1ccb7d2016f35026aa1b418155c4534024f3cee5
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4029
Reviewed-by: Daniel Hecht <dhecht@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 128a02f508cdb280b53b8a8429e6b90491e43956)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4042
Semi- or anti-joined table references are now only visible inside the
ON clause of the corresponding join.
Change-Id: Id93e53ecdf2a74baf9736aa427fa7af15358ca27
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3789
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
This patch improves the performance of the DDL queries
"Alter table add partition" and "Alter table drop partition"
as the number of partitions is scaled up.
The issue was that every time a partition was added or dropped,
the entire block metadata for that table was reloaded. This
operation was highly expensive, especially as the number
of partitions became larger.
This patch addresses this by adding/dropping only the affected
partition's metadata in the HdfsTable (inserting it into or removing it
from the internal partition list) and incrementally updating the
corresponding data structures instead of refreshing them from scratch.
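A hedged Python sketch of the incremental bookkeeping (class and method names are
illustrative, not the real HdfsTable code):
  class HdfsTableSketch:
      def __init__(self):
          self.partitions = {}  # partition spec -> per-partition file/block metadata

      def add_partition(self, spec, block_metadata):
          # Attach metadata for just this partition; other entries stay untouched.
          self.partitions[spec] = block_metadata

      def drop_partition(self, spec):
          # Remove only this partition's entry; no full table metadata reload.
          self.partitions.pop(spec, None)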
The following are the observed time improvements:
Number of existing    Time to add/drop a        Time to add/drop a
partitions            new partition (before)    new partition (now)
     1                      1.02s                     1.02s
    10                      0.27s                     0.27s
   100                      0.14s                     0.14s
   500                      0.35s                     0.35s
  1000                      0.91s                     0.51s
 10000                     11.72s                     0.85s
 20000                     21.92s                     0.87s
Of this total time (in the worst case), around 0.50s is spent adding
and dropping the partition in the Hive Metastore, and the rest of the
time is spent updating the catalog.
Change-Id: I359ab0af921543c0fdcb975c14b05f80f93fe803
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3291
Reviewed-by: Anusha Dasarakothapalli <anusha.dasarakothapalli@cloudera.com>
Tested-by: jenkins
Adds support for dropping all table and column stats from a table. Once incremental
stats are supported, this will provide the user a way to force a recompute of all
stats.
Change-Id: I27e03d5986b64eb91852bfc3417ffa971d432d6b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3533
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
(cherry picked from commit f1f074f24bfdc77c4cef147fe9d26f27df80ab81)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3551
With IMPALA-1033 we disabled the counting of the number of NULLs in each column,
and that gave a 2x speed-up in the computation. But the value 0 was erroneously
being stored as the number of NULLs, instead of the correct -1 that indicates
'unknown'.
Change-Id: Ib882eb2a87e7e2469f606081cb2881461b441a45
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3377
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3378
The compute stats statement was not quoting the DB and table names. If those names
collided with keywords, the COMPUTE STATS statement would fail to execute due to a
syntax error.
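A hedged sketch of the kind of quoting the generated query needs (assumes the names
themselves contain no backticks):
  def quote_identifier(name):
      # Wrap in backticks so names that collide with keywords still parse.
      return "`" + name + "`"

  db, tbl = "default", "update"  # 'update' collides with a keyword
  print("SELECT COUNT(*) FROM %s.%s" % (quote_identifier(db), quote_identifier(tbl)))
  # SELECT COUNT(*) FROM `default`.`update`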
Change-Id: Ie08421246bb54a63a44eaf19d0d835da780b7033
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3170
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3198
This change adds DDL support for HDFS caching. The DDL allows the user to indicate a
table or partition should be cached and which pool to cache the data into:
* Create a cached table: CREATE TABLE ... CACHED IN 'poolName'
* Cache a table/partition: ALTER TABLE ... [partitionSpec] SET CACHED IN 'poolName'
* Uncache a table/partition: ALTER TABLE ... [partitionSpec] SET UNCACHED
When a table/partition is marked as cached, a new HDFS caching request is submitted
to cache the location (HDFS path) of the table/partition, and the ID of that request
is stored in the table metadata (in the table properties). This is stored as:
'cache_directive_id'='<requestId>'. The cache requests and IDs are managed by HDFS
and persisted across HDFS restarts.
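A hedged sketch of that bookkeeping (the HDFS client calls below are stand-ins, not
a real API):
  def cache_location(hdfs_admin, table_props, path, pool_name):
      # Submit a cache directive for the table/partition location and remember its ID.
      directive_id = hdfs_admin.add_cache_directive(path=path, pool=pool_name)
      table_props["cache_directive_id"] = str(directive_id)

  def uncache_location(hdfs_admin, table_props):
      # Drop the associated cache request, if any, when the table/partition is uncached.
      directive_id = table_props.pop("cache_directive_id", None)
      if directive_id is not None:
          hdfs_admin.remove_cache_directive(int(directive_id))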
When a cached table or partition is dropped it is important to uncache the cached data
(drop the associated cache request). For partitioned tables, this means dropping all
cache requests from all cached partitions in the table.
Likewise, if a partitioned table is created as cached, new partitions should be marked
as cached by default.
It is desirable to know which cache pools exist early on (in analysis) so the query
will fail without hitting HDFS/CatalogServer if a non-existent pool is specified. To
support this, a new cache pool catalog object type was introduced. The catalog server
caches the known pools (periodically refreshing the cache) and sends the known pools out
in catalog updates. This allows impalads to perform analysis checks on cache pool
existence without going to HDFS. It would be easy to use this to add basic cache pool
management in the future (ADD/DROP/SHOW CACHE POOL).
Waiting for the table/partition to become cached may take a long time. Instead of
blocking the user from accessing the table during this period, we wait for the cache
requests to complete in the background, and once they have finished the table metadata
is automatically refreshed.
Change-Id: I1de9c6e25b2a3bdc09edebda5510206eda3dd89b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2310
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Avro tables that were not created with a column-definition list do not have
their columns properly populated in the Metastore backend DB (HIVE-6308).
For such tables COMPUTE STATS and Hive's ANALYZE TABLE cannot succeed.
This patch fails COMPUTE STATS in analysis for such broken Avro tables
and adds tests for Avro tables with a mismatched column-definition list
and Avro schema.
Change-Id: I561ecea944ae2f83d69950b7a1ab9edaa89bdcea
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1892
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1920
The parquet file stores the application version that wrote it, so it
differs between our c4 and c5 branches.
HBase storage is also not guaranteed to be identical across versions.
Change-Id: I02984a55e0678756e50c1fff6db22c43788d3916
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1028
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
A compute stats command computes the table and column stats for a given
table and persists them in the metastore.
The table stats consist of the per-partition and per-table row count.
The column stats are computed on a per-table basis and consist of the
number of distinct values and the number of NULLs per column.
This patch introduces a new 'child query' concept that
compute stats utilizes. Child queries are cancelled
if the parent query is cancelled. A compute stats stmt is
executed by the following query hierarchy:
parent: compute stats query (DDL)
- child: compute table stats query (QUERY)
- child: compute column stats query (QUERY)
The new child query concept is necessary to decouple child query fetches
from parent query fetches, i.e., we could not execute a child query as
part of the original compute stats query, because then a client could
fetch the results we need for updating the Metastore statistics. The
reason why our existing CTAS works without this decoupling
is that its insert 'child query' is not fetchable.
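A hedged Python sketch of the parent/child cancellation relationship (classes and
query text are illustrative):
  class QuerySketch:
      def __init__(self, stmt):
          self.stmt = stmt
          self.children = []
          self.cancelled = False

      def spawn_child(self, stmt):
          child = QuerySketch(stmt)
          self.children.append(child)
          return child

      def cancel(self):
          # Cancelling the parent propagates to all child queries.
          self.cancelled = True
          for child in self.children:
              child.cancel()

  parent = QuerySketch("COMPUTE STATS t")                    # DDL
  parent.spawn_child("SELECT COUNT(*) FROM t GROUP BY ...")  # table stats child
  parent.spawn_child("SELECT NDV(c1), ... FROM t")           # column stats child
  parent.cancel()  # cancels both children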
Change-Id: I560533e3cb09bcbbdb3eea7fcf0b460bc6b36dcd
Reviewed-on: http://gerrit.ent.cloudera.com:8080/873
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins