Add skip markers for S3 that can be used to categorize the tests that
are skipped against S3 to help see what coverage is missing. Soon
we'll be reworking some tests and/or adding new tests to close the
important coverage gaps.
Also, add a mechanism to parameterize paths in the .test files, and
start using these new variables. This is a step toward enabling some
more tests against S3.
Finally, a fix for buildall.sh to stop the minicluster before applying
the metastore snapshot. Otherwise, this step fails because the metastore
db is in use.
Change-Id: I142434ed67bed407e61d7b2c90f825734fc0dce0
Reviewed-on: http://gerrit.cloudera.org:8080/127
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
When an analytic fn does not contain partition or order by exprs (i.e. an
empty OVER() clause), we should not compare the previous and current rows.
It is not necessary because the analytic fn is applied to the entire input,
and attempting to access the child tuples could reference invalid memory
because there might be nullable tuples. When there are partition or
order by exprs, there is a sort node preceding the analytic node, and the
sort node always produces non-null tuples (though the tuples may have
all-null slots).
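A minimal sketch of the guard, with hypothetical names and simplified types
(not the actual exec-node code):

  #include <cstddef>

  // Simplified stand-in for the analytic node's state (names hypothetical).
  struct AnalyticState {
    std::size_t num_partition_exprs;
    std::size_t num_order_by_exprs;

    // With an empty OVER() clause both counts are zero: the fn is applied to
    // the entire input, so no prev/current row comparison is needed, and
    // skipping it avoids touching child tuples that may be NULL (there is no
    // preceding sort node to guarantee non-null tuples).
    bool NeedsPrevRowCompare() const {
      return num_partition_exprs > 0 || num_order_by_exprs > 0;
    }
  };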
Change-Id: I5788295682b4c9a1dd8a3078e11da5767f12214c
Reviewed-on: http://gerrit.cloudera.org:8080/129
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
The change of the PARTITION_FANOUT from 32 to 16 exposed a pathological case due to
the lack of coordination across concurrently executing spilling nodes of the same query.
In particular, when we repartition a partition we try to initialize hash tables for the
new partitions. But each hash table needs a block (for the nodes). If no
IO-sized blocks were available, because they had been consumed by other
nodes, we would get into a loop trying to repartition those smaller
partitions that couldn't initialize their hash tables. Each additional
repartition would, among other things, need additional blocks for the new
streams. These partitions would end up being very small, yet we would still
fail the query upon reaching the MAX_PARTITION_DEPTH limit, which was fixed
at 4.
This patch fixes the problem by initializing the hash tables during repartitions with
small pages. That is, the hash tables always first use a 64KB and a 512KB block for their
nodes before switching to IO-sized blocks. This helps the partitioning algorithm to
finish when we end up with partitions that can fit in those small pages. The performance
may not be optimal, but memory consumption is lower and the algorithm
finishes. For example, without this patch and with PARTITION_FANOUT == 16,
running TPC-H Q18 and Q20 needed 3.4GB and 3.1GB respectively. With this
patch, Q18 needs ~1GB and Q20 needs 975MB.
This patch also removes the restriction that stopped repartitioning once we
reached 4 levels of repartitioning. Instead, whenever we repartition we
compare the size of the input partition to the size of the largest new
partition. If there is no reduction in size we stop the algorithm;
otherwise, we keep on repartitioning. That should help in cases of skew
(e.g. due to bad hashing). There is a new MAX_PARTITION_DEPTH limit of 16,
which it is very unlikely we will ever hit.
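A sketch of the two changes; the block sizes and the depth limit come from
this patch, everything else is hypothetical:

  #include <cstdint>

  // Hash tables first use small blocks for their nodes before switching to
  // IO-sized blocks, so repartitioning can finish even when all IO-sized
  // blocks are held by other nodes.
  constexpr int64_t kHashTableBlockSizes[] = {64 * 1024, 512 * 1024};
  constexpr int kMaxPartitionDepth = 16;

  // New stop condition: repartitioning must actually shrink the input. If
  // the largest new partition is as big as its input, hashing is not
  // separating the rows (e.g. bad hashing / heavy skew), so stop rather
  // than recurse further.
  bool ShouldStopRepartitioning(int level, int64_t input_partition_size,
                                int64_t largest_new_partition_size) {
    if (level >= kMaxPartitionDepth) return true;
    return largest_new_partition_size >= input_partition_size;
  }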
Change-Id: Ib33fece10585448bc2d07bb39d0535d78b168ccc
Reviewed-on: http://gerrit.cloudera.org:8080/119
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
The S3 work really enabled any Hadoop FileSystem to work with Impala,
but a small tweak is needed for LocalFileSystem due to how the Hadoop
Path code deals with URIs that don't have an authority component.
While we aren't claiming support for arbitrary FileSystems at this
time, it is useful to test this. Since the S3 testing is done as a
nightly test rather than pre-checkin, we can use the LocalFileSystem to
regression test that:
1) Impala can access table data living on a secondary filesystem,
i.e. not the filesystem specified by fs.defaultFS.
2) Impala does not make assumptions that the filesystem has type
DistributedFileSystem.
Change-Id: Ie9b858ea440c9b3b332602e034c8052b168c57da
Reviewed-on: http://gerrit.cloudera.org:8080/121
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
Currently, we only consider files hidden if they have the special
prefixes "." or "_". However, some tools use special suffixes
to indicate that a file is still being operated on and should be
considered invisible.
This patch adds the following hidden suffixes:
'.tmp' - Flume's default for temp files
'.copying' - hdfs put may produce these
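An illustrative version of the check (the real logic lives in the catalog;
names here are hypothetical):

  #include <string>

  // A file is hidden if it starts with a temp/metadata prefix or ends with
  // an in-progress suffix (prefixes and suffixes as listed above).
  bool IsHiddenFile(const std::string& name) {
    if (name.empty()) return true;
    if (name.front() == '.' || name.front() == '_') return true;
    const std::string suffixes[] = {".tmp", ".copying"};
    for (const std::string& s : suffixes) {
      if (name.size() >= s.size() &&
          name.compare(name.size() - s.size(), s.size(), s) == 0) {
        return true;
      }
    }
    return false;
  }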
Change-Id: I151eafd0286fa91e062407e12dd71cfddd442430
Reviewed-on: http://gerrit.cloudera.org:8080/80
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
The HMS does not properly migrate column stats when moving a table across
databases (HIVE-9720). Apart from losing the stats, this issue also prevents
the newly renamed table from being dropped.
To work around the issue, this patch manually drops+adds the column stats
when renaming a table across databases.
Change-Id: If901c5d1e9a6b2cedc35034a537f18c361c8ffa1
Reviewed-on: http://gerrit.cloudera.org:8080/72
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Additionally, this patch disables the hbase/none test dimension if the
TARGET_FILESYSTEM environment variable is set to either s3 or isilon.
Change-Id: I63aecaa478d2ba9eb68de729e9640071359a2eeb
Reviewed-on: http://gerrit.cloudera.org:8080/74
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
The problem was that VARCHAR was not present in a few switch statements
for updating/populating column stats in various places.
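The shape of the fix, sketched with simplified types (the actual switches
live elsewhere in the codebase):

  enum class ColumnType { BOOLEAN, INT, STRING, VARCHAR /* , ... */ };

  // Every switch over the column type must route VARCHAR like the other
  // string types; previously VARCHAR fell through to the default case.
  bool UsesStringStats(ColumnType type) {
    switch (type) {
      case ColumnType::STRING:
      case ColumnType::VARCHAR:  // previously missing
        return true;
      default:
        return false;
    }
  }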
Change-Id: I0b2b316b734d27a7ff08701b0986014be2473443
Reviewed-on: http://gerrit.cloudera.org:8080/65
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
This should fix the last y2k38 problem. Previously, calling
unix_timestamp() with an input of '2038-01-19 03:14:08' or later would
return a negative value due to a 32-bit int overflow. This patch
switches from 32-bit to 64-bit ints.
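A self-contained illustration of the overflow:

  #include <cstdint>
  #include <iostream>

  int main() {
    // Seconds since the epoch for 2038-01-19 03:14:08 UTC: INT32_MAX + 1.
    int64_t seconds = 2147483648LL;
    // Narrowing to 32 bits wraps to a negative value, which is what
    // unix_timestamp() used to return for such inputs.
    int32_t narrowed = static_cast<int32_t>(seconds);
    std::cout << narrowed << "\n";  // -2147483648
    std::cout << seconds << "\n";   // 2147483648, correct with 64-bit ints
    return 0;
  }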
Change-Id: Ic9180887d6c828f6ecd25435be86fd0bd52d3f0d
Reviewed-on: http://gerrit.cloudera.org:8080/61
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
If a cached table was created in a version of Impala that did not have
the property for the cache replication factor, loading the table will
fail until the table is un-cached and cached again.
This patch fixes this behavior by ignoring the missing parameter.
Change-Id: I118020dd5bd7fb203d91853d5ef946f2c4c8a695
Reviewed-on: http://gerrit.cloudera.org:8080/48
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
When a table is loaded in the catalog, we will now perform a check to
verify that the cache directive ID and cache replication factor are still
valid and the data is current.
If the cache directive no longer exists, we issue an error message
and mark the table / partition as uncached. Furthermore, the replication
factor is updated with the information from the actual cache directive.
Insert statements are a special case, as the catalog update happens
synchronously and will try to access cache directive information that
might be stale. Thus, in this insert path, we catch the possible
not-found exception and reset the caching information.
Change-Id: I882041ce5395b8a3d17e9fc2750053393340df65
Reviewed-on: http://gerrit.cloudera.org:8080/40
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
This is preparation for fixing IMPALA-97. These changes are mostly
non-functional to bring the code closer to styling standards.
The biggest functional changes should be:
1) IMPALA-1623 was caused by a misuse of a constructor and that code
didn't compile after the refactor so the bug was fixed.
2) TimestampValue.Hash() seems to have been hashing the time twice
instead of the time and date.
3) Timings using TimestampValue.time() would not be accurate when
crossing midnight (time and date are separate fields).
4) Timings using local time should use UTC to avoid daylight savings
problems.
5) Use system monotonic clock in util/time.h.
Some timings may still be affected by #3 and #4 above, but fixing those
isn't the purpose of this change.
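For point 5, a minimal sketch of a monotonic-clock timer (std::chrono shown
here for illustration; the helper's name is hypothetical):

  #include <chrono>
  #include <cstdint>

  // A monotonic clock only moves forward, so intervals measured with it are
  // immune to wall-clock adjustments (DST shifts, NTP steps) that corrupt
  // timings based on the system clock (see points 3 and 4 above).
  int64_t MonotonicMillis() {
    using namespace std::chrono;
    return duration_cast<milliseconds>(
        steady_clock::now().time_since_epoch()).count();
  }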
Change-Id: I26056c876c4361e898acc3656aa98abf6f153a6b
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5779
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: jenkins
This fixes the exhaustive compute stats test, which broke with the new
layout of the show partitions output: in the exhaustive test, one column
too many was added to the expected output.
Change-Id: Ie83c114cd8ac1a711da64de3c82578020eb332af
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5865
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: jenkins
This patch adds the possibility to specify the number of replicas that
should be cached in main memory. This can be useful in high QPS
scenarios, as the load is no longer borne by the single cached
replica but by a set of cached replicas. While the cache replication
factor can be larger than the block replication factor on disk, the
difference will be ignored by HDFS until more replicas become
available.
This extends the current syntax for specifying the cache pool:

  cached in 'poolName'

with an optional replication factor:

  cached in 'poolName' with replication = XX
By default, the cache replication factor is set to 1. As this value is
not yet configurable in HDFS it's defined as a constant in the JniCatalog
thrift specification. If a partitioned table is cached, all its child
partitions inherit this cache replication factor. If child partitions
have a custom cache replication factor, changing the cache replication
factor on the partitioned table afterwards will overwrite this custom
value. If a new partition is added to the table, it will again inherit
the cache replication factor of the parent independent of the cache pool
that is used to cache the partition.
To allow reviewing changes and the status of the replication factor for
tables and partitions, the replication factor is part of the output of
the "show partitions" command.
Change-Id: I2aee63258d6da14fb5ce68574c6b070cf948fb4d
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5533
Tested-by: jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
DistributedFileSystem is lenient about seeking past the end of the file.
Other FileSystem implementations, such as NativeS3FileSystem, return an
error on this condition. That leads to a scary looking message in the
query warnings.
So, when creating scan ranges, let's require that the ranges fall within
the file bounds (at least according to what the HdfsFileDesc indicates
is the length). There were a couple of kinds of AllocateScanRange()
callsites that needed to be fixed up:
1) When a stream wants to read past a scan range, be careful not to read
past the end of the file.
2) When Impala needs to "guess" at the length of a range, use the
file_length as an upper bound on the guess. We were already doing this
in some places but not everywhere.
3) When the scan range is derived from parquet metadata, validate the
metadata against file_length and issue appropriate errors. This will
give better diagnostics for corrupt files.
Note that we can't rely on this for safety (HdfsFileDesc file_length may
be stale), but it does mean that when metadata is up-to-date Impala will
no longer try to access beyond the end of files (and so we'll no longer
get false positive errors from the filesystem).
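A sketch of the clamping, with hypothetical names (callers ensure
0 <= offset <= file_length):

  #include <algorithm>
  #include <cstdint>

  // Clamp a scan range so it never extends past the file length recorded in
  // the HdfsFileDesc. With up-to-date metadata this prevents reads past EOF,
  // which some FileSystems (e.g. NativeS3FileSystem) report as errors.
  int64_t ClampedScanRangeLen(int64_t offset, int64_t requested_len,
                              int64_t file_length) {
    return std::min(requested_len, file_length - offset);
  }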
Additionally, this change revealed a pre-existing problem with files
that have multiple row-groups: the first time through InitColumns(),
stream_ was set to NULL, but stream_->filename could potentially be
accessed when constructing error statuses for subsequent row-groups.
Change-Id: Ia668fa8c261547f85a18a96422846edcea57043e
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5424
Reviewed-by: Daniel Hecht <dhecht@cloudera.com>
Tested-by: jenkins
Our .test file parser used to not abort tests when there
was a malformed test/section. This patch changes that behavior:
it now reports an error and treats the test as failed.
Quite a few tests were not well-formed and, as a result, were not
executed. This patch fixes those tests.
Arguably, the test file parser should be more flexible about where it
accepts comments, but this patch does not address that problem.
Change-Id: If53358eb0cb958b68e51940b071e64c1d6c3ec6f
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5468
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
A recent change (3ccee71) to fix resetAnalysisState() of NullLiterals
exposed another bug during exhaustive test runs.
For insert queries into Parquet, the types in the schema of the generated
Parquet files are based on the insert exprs, correctly assuming that
the FE handles all the necessary casting to make sure the Parquet file
schema and the table schema match.
Since we apply an smap on the output exprs towards the end of planning,
NullLiterals were reset to the NULL_TYPE, causing the Parquet schema
to incorrectly have BOOLEAN columns (we cast naked NULL_LITERALS to
BOOLEAN in toThrift()), leading to a mismatch of the Parquet schema
and the table schema. Subsequent queries on such a table failed,
correctly reporting a type mismatch.
The fix is to preserve types when doing the substitution on the output exprs.
Change-Id: I135f1b826b06a6a200df7b73343d2eb1fb4b7b80
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5453
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5455
FinalizePartitionedColumnStats() should have iterated over the list of
columns present in the table, rather than the list in the existing stats
data structure. If a column was dropped, but still persisted in the old
structure, it was possible that we could index off the end of an array.
Change-Id: Ib1ab7690ffae05afff826b9d1a15871337691739
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5437
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
(cherry picked from commit cee8305cd2878c8f00622d39ddd43b7a5dfbbc0d)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5447
Reviewed-by: Henry Robinson <henry@cloudera.com>
This commit fixes an issue where the tuple ids of semi-joined tuples
were falsely added to the list of materialized tuple ids of
IsNullPredicate exprs, causing Impala to crash. The fix is to exclude
semi-joined tuple ids from the list of materialized tuple ids of select
statements.
Change-Id: I93712be9d03dd54dc9172f51a5ba99e85aa05455
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5405
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5434
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Queries with arithmetic exprs containing a NullLiteral child failed (IMPALA-1419)
or crashed (IMPALA-1542) because re-analysis of these exprs was incorrect.
Change-Id: Ice3461aed53863123bcf8f38af123d89ad3b7d6a
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5429
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
We were treating it as constant before since it has no children and we
didn't override Expr::IsConstant(). However, it's not constant since
it depends on the input tuple, which caused it to blow up when we
tried to evaluate it as a constant expr.
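A sketch with simplified types of why the override matters:

  #include <vector>

  // Simplified Expr hierarchy (hypothetical shape). By default a childless
  // expr looks constant, which is wrong for an expr whose value depends on
  // the input tuple, so such an expr must override IsConstant().
  struct Expr {
    virtual ~Expr() {}
    virtual bool IsConstant() const {
      for (const Expr* child : children_) {
        if (!child->IsConstant()) return false;
      }
      return true;  // no children => constant, by default
    }
    std::vector<Expr*> children_;
  };

  struct RowDependentExpr : Expr {
    // Depends on the input tuple even though it has no children.
    bool IsConstant() const override { return false; }
  };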
Change-Id: Ic2c3489ba605f03a7644e6ac9107d4310dd0aa7b
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5399
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 10db8f1056e8887dc99b4a334283d4d37d5f757c)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5419
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
This patch adds the ability to compute and drop column and table
statistics at partition granularity.
The following commands are added. Detail about the implementation
follows.
COMPUTE INCREMENTAL STATS <tbl_name> [PARTITION <partition_spec>]
This variant of COMPUTE STATS will, ultimately, do the same thing as the
traditional COMPUTE STATS statement, but does so by caching the
intermediate state of the computation for each partition in the Hive
MetaStore. If the PARTITION clause is added, the computation is
performed for only that partition. If the PARTITION clause is omitted,
incremental stats are updated only for those partitions with missing
incremental stats (e.g. one column does not have stats, or incremental
stats were never computed for this partition). In this patch, incremental
stats are only invalidated when a DROP STATS variant is executed. Future
patches can automatically invalidate the statistics after REFRESH or
INSERT queries, etc.
DROP INCREMENTAL STATS <tbl_name> PARTITION <part_spec>
This variant of DROP STATS removes the incremental statistics for the
given table. It does *not* recalculate the statistics for the whole
table, so this should be used only to invalidate the intermediate state
for a partition which will shortly be subject to COMPUTE INCREMENTAL
STATS. The point of this variant is to allow users to notify Impala when
they believe a partition has changed significantly enough to warrant
recomputation of its statistics. It is not necessary for new partitions;
Impala will detect that they do not have any valid statistics.
--------
This is achieved by adapting the existing HLL UDA, swapping its
finalize method for a new one which returns the intermediate HLL
buckets rather than aggregating and then disposing of them. This
intermediate state is then returned to Impala's catalog-op-executor.cc,
which then passes the intermediate state back to the frontend to be
ultimately stored in the HMS.
This intermediate state is computed on a per-partition basis by grouping
the input to the UDA by partition. Thus, the incremental computation
produces one row for each partition selected (the set of which might be
quite small, if there are few partitions without valid incremental
stats: this is the point of the new commands).
At the same time, the query coordinator aggregates the output of the UDA
to produce table-level statistics. This computation incorporates any
existing (and not re-computed) intermediate partition state which is
passed to the coordinator by the frontend. The resulting statistics are
saved to the table as normal.
Intermediate statistics are serialised to the HMS by writing a Thrift
structure's serialised form to the partition's 'parameters' map. There
is a schema-imposed limit of 4000 characters to the serialised string,
which is exacerbated by the fact that the Thrift representation must
first be base-64 encoded to avoid type errors in the HMS. The current
patch breaks the encoded structure into 4k chunks, and then recombines
them on read. The alltypes table (11 columns) takes about three of these
chunks. This may mean that incremental stats are not suitable for
particularly wide tables: these structures could be zipped before
encoding for some space savings. In the meantime, the NDV estimates are
run-length encoded (since they are generally sparse); this can result in
substantial space savings.
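A sketch of the chunking (the 4000-character limit comes from the HMS
schema; the names are hypothetical):

  #include <cstddef>
  #include <string>
  #include <vector>

  constexpr std::size_t kHmsParamMaxLen = 4000;

  // Split the base-64 encoded intermediate stats into pieces that fit into
  // the partition's 'parameters' map; reads recombine the pieces in order.
  std::vector<std::string> ChunkForHms(const std::string& encoded) {
    std::vector<std::string> chunks;
    for (std::size_t pos = 0; pos < encoded.size(); pos += kHmsParamMaxLen) {
      chunks.push_back(encoded.substr(pos, kHmsParamMaxLen));
    }
    return chunks;
  }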
Change-Id: If82cf4753d19eb532265acb556f798b95fbb0f34
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4475
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5408
This commit fixes the issue where predicates are not applied to the
correct query tree nodes when the query contains full outer joins. To
address this issue, we register information about the tuple ids that
are outer joined by full outer joins and use that information to guide
the assignment of predicates.
Change-Id: I854c05c159d86c0aaabfc12b7dd5c5982c5ece4b
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5284
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
For some reason, the try/catch added to fix IMPALA-1493 doesn't work
when we JIT the function. Fixing this in the JIT'd code will take some
time, so for now just don't JIT the function.
Change-Id: I7b2801027db0a9deb19b477c1a4ca0bdad77a825
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5383
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
The bug: TupleIsNullPredicates generated when substituting exprs against
outer-joined inline views containing analytic functions refer to the logical
tuple id of the analytic output. These logical tuple ids are not materialized
and should not be
referenced by any expr during BE evaluation, including TupleIsNullPredicates.
The fix: Substitute TupleIsNullPredicates referring to the logical analytic
output with TupleIsNullPredicates referring to the physical output.
Change-Id: I10bbd869279f01f15a83deeadc7675352c7daaf9
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5317
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5362
Right joins (right outer, right semi, right anti, and full outer) depend on
the matched flag for the build side, information that is stored in the hash
tables. Thus, in right joins, if we spill a hash table that has matches, we
lose this information and return wrong results.
This patch adds a flag to the hash table which, for right joins, is set once
there has been at least one match. The SpillPartition() algorithm then won't
spill partitions whose hash tables had matches. If there are no partitions
to spill, the query will gracefully fail with OOM.
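A sketch of the new spill policy (structure hypothetical):

  #include <vector>

  // Minimal stand-ins for the join's partition state.
  struct Partition {
    bool is_spilled;
    bool hash_tbl_has_matches;  // the new flag added by this patch
  };

  // Never spill a partition whose hash table already has matched build
  // rows; the matched flags live in the hash table and would be lost.
  // Returns nullptr when nothing can be spilled, in which case the query
  // fails gracefully with OOM.
  Partition* PickPartitionToSpill(std::vector<Partition>& partitions) {
    for (Partition& p : partitions) {
      if (p.is_spilled || p.hash_tbl_has_matches) continue;
      return &p;
    }
    return nullptr;
  }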
Change-Id: I736400768529019bb10c2541de552d958eb90044
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5306
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5335
In cases where we had to spill the probe side of PHJs, we were not only appending
the probe row to the tuple stream to be spilled, but we were also getting into the
regular processing loop with the iterator set to End(). In the case of left anti
and left outer joins, the result was to incorrectly output this row, since it did not
have a match.
This bug had a small perf impact for all spilling joins because we were doing an
unnecessary loop for each probe row we had to spill.
This patch solves the problem by immediately going to the next probe row if the
current row is spilled. Additionally, it fixes a bug in the block mgr where
there was a code path in which we were not correctly counting the number of
pinned buffers.
It also adds tpch-q21 to the set of queries to run in the spilling test.
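The shape of the corrected probe loop described above, sketched with
hypothetical names:

  // Minimal stand-ins; the real types are the join node's internals.
  struct TupleRow;
  struct Stream { void AddRow(TupleRow*) {} };
  struct ProbePartition {
    bool is_spilled = false;
    Stream probe_stream;
  };

  void ProcessProbeRow(ProbePartition* partition, TupleRow* row) {
    if (partition->is_spilled) {
      // Append the row for later processing and move straight on to the
      // next probe row. Previously we still entered the match loop with the
      // iterator at End(), which made left outer/anti joins emit the row as
      // if it had no match.
      partition->probe_stream.AddRow(row);
      return;
    }
    // ... regular probe: look up matches in the in-memory hash table ...
  }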
Change-Id: I762f5c41fe468e4485a4b31dabe2e53f6b49ae24
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5313
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5334
Currently we do not support per record compression for SEQUENCEFILE; we do support no
compression and block compression. Per record compression is typically very slow
(since the compressor is invoked per record in the table) and not widely used.
We chose to add support for per record compression as part of our effort to use Impala
for all of our testdata loading infrastructure. We have per record compressed tables
in testdata, so even though there is no customer demand for per record compression,
we need it to migrate our data loading off of Hive.
Change-Id: I6ea98ae0d31cceff8236b4b006c3a9fc00f64131
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5302
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
(cherry picked from commit f62a76f8d00b8dbc2846deb36ee5f65031ad846e)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5322
This affects Java UDFs. Previously it was possible that the length of
the string returned from a Java UDF didn't match the actual data. Per the
Text.getBytes() documentation, "... only data up to getLength() is
valid." Impala just needs to use copyBytes(), which is a convenience
function for this situation. The same should be done for BytesWritable.
Before:
Query: select length(echo('12345678901234567890'))
+-------------------------------------------+
| length(java.echo('12345678901234567890')) |
+-------------------------------------------+
| 22                                        |
+-------------------------------------------+
After:
Query: select length(echo('12345678901234567890'))
+-------------------------------------------------+
| length(functional.echo('12345678901234567890')) |
+-------------------------------------------------+
| 20                                              |
+-------------------------------------------------+
Change-Id: If9671278df8abf7529d3bc470c5f9d037ac3da1b
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4897
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: jenkins
The sequence writer test had an issue with zlib on certain cluster machines,
making it flaky. The test has since passed several times locally and in
private builds, so this re-enables it, as the failures could not be
reproduced in private builds.
Change-Id: I0aeea3a2d000e711e5a84427a7b40592e1eef75b
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5077
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
This enables the existing subquery rewrite rules to rewrite UNION
statements. UNION rewriting is easily done by simply calling the
rewriter for each operand in the UNION. At least one TPC-DS query
requires this functionality (IMPALA-1365).
The more difficult case of a UNION within a subquery is still not
supported.
Change-Id: I7f83eed0eb8ae81565e629f09f6918a4ba86ee13
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4859
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: jenkins
Impala qualifies all paths stored in the metastore except for the
DataSource jar path. Use a qualified path here as well, which will
allow datasources to live on the non-default FS.
In CreateDataSrcStmt, use the post-analyzed qualified path rather than
the user passed string. Then, fix CreateTableDataSrcStmt so that it
doesn't strip out the scheme://authority portion of the URI, but instead
uses the qualified path string directly.
Note that the metastore may still contain unqualified paths in
DataSource tables' properties that were generated by previous versions.
That's okay, though, since the backend won't assume all paths are
qualified, in case other components generate (or have generated in the
past) metadata with unqualified paths.
Change-Id: I905d8f6a7bf1793cfccf720b6ab5dc845d7dd5fa
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5201
Reviewed-by: Daniel Hecht <dhecht@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 86c75be01d0f5654291acdbc1c68f5a76915028c)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5239
- Add number of files in table to query plan
- Add number of remote scan ranges to runtime profile
- Clean up logging in ClientCache
Change-Id: I0580fe435ac0a52548aedb4e0ffa875ce9b9dede
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5166
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
The analysis of a with clause should have its own global state so the
local view(s) can be analyzed without polluting the global state of the
parent QueryStmt. This might not always matter, but in a complex query
involving a with clause that contained a subquery, re-analysis of the
WithClause after the subquery rewrite resulted in an invalid Exists
conjunct being registered in the parent analyzer's global state. The
Exists conjunct was assigned to a scan node which then failed a
pre-condition check.
Change-Id: Ib020787b2e1ff202d96fe1b92bd9740897ab32a0
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4825
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 629a8652c5a290054a8e582cc5cb5768a3ee67a8)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5038
This patch modifies the abs() built-in function so that it
retains the type of the input argument for the return type
in the same way as Postgres does.
Change-Id: I1750237b85bedbc3ce9d52330ac4d458b0aada3a
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4980
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 424b359ab0a4f621f2865844c3293f2c80e0867f)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4996