This commit fixes an issue where a [NOT] EXISTS subquery that contains
an aggregate function will sometimes be incorrectly rewritten into a
join, thereby returning incorrect results.
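A sketch of the affected query shape (table and column names are hypothetical):

    -- An aggregate subquery without GROUP BY always returns exactly one
    -- row, so NOT EXISTS here is always false; rewriting the subquery
    -- into a join could lose that property and return rows incorrectly.
    SELECT a.id
    FROM customers a
    WHERE NOT EXISTS (SELECT count(*) FROM orders b WHERE b.cust_id = a.id);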
Change-Id: I18b211d76ee3de77d8061603ff5bb1fbceae2e60
Reviewed-on: http://gerrit.cloudera.org:8080/266
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
This commit fixes the issue where no lineage events are generated for
create and alter view statements.
Change-Id: Ib8c4513219569f62eb26a0eb09a8c2a762054b70
Reviewed-on: http://gerrit.cloudera.org:8080/265
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
I did a local benchmark and there's minimal performance impact (<1%).
Change-Id: I8d84a145acad886c52587258b27d33cff96ea399
(cherry picked from commit 7e750ad5d90007cc85ebe493af4dce7a537ad7c0)
Reviewed-on: http://gerrit.cloudera.org:8080/189
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
Fixes a bug where FIRST_VALUE may produce incorrect results (or a DCHECK
failure in debug) when there is a window like "ROWS X PRECEDING Y PRECEDING",
such that X < Y and X > the size of a partition.
For windows with an end boundary that is PRECEDING (i.e.
the entire window is before a row), there is some special handling between
partitions, and the logic was not correct in some corner cases for FIRST_VALUE.
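One possible shape of such a window, with hypothetical names and illustrative offsets:

    -- The window lies entirely before the current row; when a partition
    -- is smaller than the window, the per-partition handling could misfire.
    SELECT FIRST_VALUE(c) OVER (PARTITION BY p ORDER BY o
                                ROWS BETWEEN 5 PRECEDING AND 2 PRECEDING)
    FROM t;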
Change-Id: Ied5d440684e99dcaf60b47489c90300891f09b91
Reviewed-on: http://gerrit.cloudera.org:8080/236
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
The IN predicate wasn't using the decimal type when comparing decimal
values. I benchmarked this on a modified version of TPCDS-Q8 (i.e. a
query with a huge decimal IN predicate) and there is a ~5% performance
degradation with codegen enabled (surprisingly, there appears to be a
slight performance gain with codegen disabled). We should be able to
remove this penalty when we add constant injection via codegen.
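A sketch of the affected pattern (names hypothetical):

    -- Before the fix, the comparison was not performed in the decimal
    -- type, so precision could be lost on DECIMAL columns.
    SELECT count(*)
    FROM sales
    WHERE price IN (1.10, 2.25, 3.99);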
Change-Id: Ie1296fd50c68d06a343701442da49fe8d3cd16dd
Reviewed-on: http://gerrit.cloudera.org:8080/230
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
Implicit casting to decimals allows truncating digits from the left of the
decimal point (see TypesUtil). A literal that is implicitly cast to a decimal
with truncation is wrapped into a CastExpr so the BE can evaluate it and report
a warning. This behavior is consistent with casting/overflow of non-constant
exprs that return decimal.
IMPALA-1837: Without the CastExpr wrapping, such literals can exceed the max
expected byte size sent to the BE in toThrift().
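A hypothetical illustration, assuming a column d of type DECIMAL(4,1):

    -- 123456.5 does not fit in DECIMAL(4,1); the literal is wrapped in a
    -- CastExpr so the BE evaluates the cast and reports a warning instead
    -- of the FE truncating silently.
    INSERT INTO t (d) VALUES (123456.5);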
Change-Id: Icd7b8751b39b8031832eec04bd8eac7d7000ddf8
Reviewed-on: http://gerrit.cloudera.org:8080/195
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
Re-enables data error tests which were not being included in
run-tests.py. Broken tests were updated, with one exception which
is tracked by IMPALA-1862. Depends on a related change to
Impala-lzo.
Change-Id: I4c42498bdebf9155a8722695a3305b63ecc6e5f3
Reviewed-on: http://gerrit.cloudera.org:8080/194
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
These tests were added as part of the new hash table implementation, since we
didn't have tests covering right joins with duplicates (and other conjuncts),
or distinct aggregation queries with group bys on multiple columns. Adding
them as a separate patch to improve test coverage in the 2.2 release branch.
Change-Id: Id1b4f27fa6e587b2031635974ac9d2d39a1b015a
Reviewed-on: http://gerrit.cloudera.org:8080/193
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
This patch fixes a regression introduced in:
c6907e4c2eabf5d73f83cc8e16b7f35a13c3b59f
IMPALA-1376: Split up Planner into multiple classes.
The problem was that in single-node planning the root analyzer
was passed to generate the plan for an INSERT/CTAS' query stmt.
The fix is to instead pass the stmt's analyzer that contains
information about evaluated constant predicates.
Change-Id: I551f471c978bc1f6bdff0d98e4826856d1e4860f
Reviewed-on: http://gerrit.cloudera.org:8080/191
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
The filesizes changed slightly, causing the S3 CI build to fail.
Let's regex the file sizes in the compute stats expected results.
Change-Id: Ie95bdf3a253a28aa2b6f3deb281948780ca2cc6a
Reviewed-on: http://gerrit.cloudera.org:8080/200
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Dan Hecht <dhecht@cloudera.com>
As needed, fix up file paths and other misc things to get
more test cases running against S3.
Change-Id: If4eaf9200f2abd17074080a37cd0225d977200ad
Reviewed-on: http://gerrit.cloudera.org:8080/167
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
The following commits disabled tests to unblock the full data load:
a00a9a5e53f7a8e7a1e3c931ea0e4b7db21c6f00
bf29d06f2e53bb924d250275d51f5ccd1213531d
This patch re-enables those tests and adds new tests to guard against
regressions to HIVE-6308.
Unfortunately, we cannot completely remove the analysis check for HIVE-6308
in our code, because there is still one case where COMPUTE STATS will fail on
a Hive-created Avro table: If there is a mismatch in column names between
the Avro schema and the column defs given to a CREATE TABLE in Hive.
Change-Id: I81ae6b526db02fdfc634e09eeb9d12036e2adfdd
Reviewed-on: http://gerrit.cloudera.org:8080/180
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Verify DDL and queries both when a single table spans multiple filesystems
and when tables in the same query live on different filesystems.
Change-Id: I4258bebae4a5a2758666f5c2e283eb2d205c995e
Reviewed-on: http://gerrit.cloudera.org:8080/166
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Analytic function evaluation was broken when partition or order by exprs
evaluated to NaN (i.e. 0/0). We were relying on the comparison of the
current row with the previous row to be equal (i.e. x == x), but x != x
if x is NaN, and in the case of the very first row in the stream, some
logic breaks if x != x. The fix is to handle the very first row specially.
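A sketch of the failing shape (names hypothetical):

    -- Rows where a = 0 and b = 0 make a / b evaluate to NaN; since
    -- NaN != NaN, the previous-row comparison broke on the first row.
    SELECT max(c) OVER (PARTITION BY a / b) FROM t;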
Change-Id: I1c33445d55a70c7f107f05eeadef272b7973ee11
Reviewed-on: http://gerrit.cloudera.org:8080/179
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
This commit fixes the issue where, if a query contains a count(distinct)
expression in conjunction with a limit, the limit is applied at the wrong
place in the generated plan, producing incorrect results.
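An illustrative query of the affected shape (names hypothetical):

    -- The LIMIT must apply to the final grouped result, not to an
    -- intermediate step of the two-phase distinct aggregation plan.
    SELECT g, count(DISTINCT c) FROM t GROUP BY g LIMIT 10;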
Change-Id: I776e1b78461323e7ab72d491dcec7a9acd9e75f9
Reviewed-on: http://gerrit.cloudera.org:8080/196
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
This commit enables fast partition pruning for cases where constant
expressions appear in binary or IN predicates. During partition pruning,
the constant expressions are evaluated in the BE and are replaced by the
computed results as LiteralExprs.
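Hedged examples of predicates that now prune at planning time (names hypothetical):

    -- The constant expressions are folded to literals before pruning.
    SELECT * FROM t WHERE year_col = 2010 + 5;
    SELECT * FROM t WHERE month_col IN (1 + 1, 2 + 2);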
Change-Id: Ie8a2accf260391117559dc6c0a565f907c516478
Reviewed-on: http://gerrit.cloudera.org:8080/144
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
1) Fix up locations to take FILESYSTEM_PREFIX into account
so we can run the test against non-default FS.
2) Fix up results and catch sections.
3) Since S3 doesn't support INSERT, split the test into
another version that expects different results for the
INSERT part. The rest of the test is identical, and
we can remove this new .test file once INSERT is supported.
Change-Id: I50d21048b846aa985d1eefc50fc33bda05ebe509
Reviewed-on: http://gerrit.cloudera.org:8080/146
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
In some cases, the lookup for the first_value_rewrite function could
return a different fn with the wrong (yet technically compatible) type.
first_value_rewrite takes 2 parameters, the first is the parameter to
first_value, and the second is an integer explicitly added by the FE during
analysis. first_value_rewrite has signatures where the first parameter can
be of any type and the second is always a BIGINT. When the FE inserts the
second parameter, it may not actually be a BIGINT (e.g. it could be a
TINYINT, SMALLINT, etc.). In this case, the function arguments will not
match one of the registered signatures exactly and some compatible function
will be returned. For example, FIRST_VALUE(1.1) has a DECIMAL parameter
and if a TINYINT/SMALLINT/INT parameter is added for the rewrite, then
the first_value_rewrite fn lookup happened to match the fn taking a
FLOAT and BIGINT. (Ideally DECIMAL would not be implicitly castable to
a FLOAT/DOUBLE, but NumericLiterals do allow this casting.)
As a result, the agg fn actually returned a FLOAT while the AnalyticExpr
was of the type DECIMAL, and the analytic tuple contained a DECIMAL slot
which would throw a DCHECK in the BE (or perhaps crash in retail builds).
This fixes the issue by setting the NumericLiterals to be explicitlyCast,
i.e. the type will not change if re-analyzed. Then the correct fn signature
is found.
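A sketch of a query that hit the bad lookup (window shape illustrative):

    -- FIRST_VALUE with a preceding lower bound is rewritten by the FE to
    -- the internal first_value_rewrite fn; with the DECIMAL literal 1.1
    -- the lookup could land on the FLOAT/BIGINT signature.
    SELECT FIRST_VALUE(1.1) OVER (ORDER BY o
                                  ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
    FROM t;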
Change-Id: I1cefa3e29734ae647bd690263bb63f08f10ea8b9
Reviewed-on: http://gerrit.cloudera.org:8080/136
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Add skip markers for S3 that can be used to categorize the tests that
are skipped against S3 to help see what coverage is missing. Soon
we'll be reworking some tests and/or adding new tests to close the
important coverage gaps.
Also, add a mechanism to parameterize paths in the .test files, and
start using these new variables. This is a step toward enabling some
more tests against S3.
Finally, a fix for buildall.sh to stop the minicluster before applying
the metastore snapshot. Otherwise, this fails since the ms db is in
use.
Change-Id: I142434ed67bed407e61d7b2c90f825734fc0dce0
Reviewed-on: http://gerrit.cloudera.org:8080/127
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Allow values larger than 64KB to be written to Parquet files. This was
previously limited by a fixed data page size. This commit removes that
limitation by allowing the page size to grow when necessary. This occurs
when there are enough unique values to switch from dictionary encoding
to plain encoding, and then there are huge values larger than the
default 64KB page size. In this case, it may be possible to write files
larger than one HDFS block, but this is an edge case and not worth
introducing additional complexity to handle.
Change-Id: I165ef44ba48ff0c3c3203860157a61c45f77df8b
Reviewed-on: http://gerrit.cloudera.org:8080/120
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
When an analytic fn does not contain partition or order by exprs (i.e. empty
OVER() clause), we should not be comparing the previous and current rows.
It is not necessary because the analytic fn is applied to the entire input,
and attempting to access the child tuples could reference invalid memory
because there might be nullable tuples. When there are either partition or
order by exprs, then there is a sort node preceding the analytic node and the
sort node always produces non-null tuples (though tuples may have all null
slots).
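An illustrative case (names hypothetical): with an empty OVER() there is no
preceding sort, and an outer join, for example, can feed the analytic node
nullable tuples.

    -- The analytic fn applies to the entire input; no row-to-row
    -- comparison is needed or safe here.
    SELECT t.id, count(*) OVER ()
    FROM t LEFT OUTER JOIN u ON t.id = u.id;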
Change-Id: I5788295682b4c9a1dd8a3078e11da5767f12214c
Reviewed-on: http://gerrit.cloudera.org:8080/129
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
Changing the PARTITION_FANOUT from 32 to 16 exposed a pathological case due to
the lack of coordination across concurrently executing spilling nodes of the same query.
In particular, when we repartition a partition we try to initialize hash tables for the
new partitions, and each hash table needs a block for its nodes. If no IO-sized blocks
were available, because they had been consumed by other nodes, we would get into a loop
trying to repartition the smaller partitions that couldn't initialize their hash tables;
each additional repartition would in turn need additional blocks for the new streams.
These partitions would end up being very small, yet we would still fail the query upon
reaching the MAX_PARTITION_DEPTH limit, which was fixed at 4.
This patch fixes the problem by initializing the hash tables during repartitions with
small pages. That is, the hash tables always first use a 64KB and a 512KB block for their
nodes before switching to IO-sized blocks. This helps the partitioning algorithm to
finish when we end up with partitions that can fit in those small pages. The performance
may not be optimal, but the memory consumption is lower and the algorithm finishes. For
example, without this patch and with PARTITION_FANOUT == 16, running TPC-H Q18 and Q20
needed 3.4GB and 3.1GB respectively. With this patch, TPC-H Q18 needs ~1GB and Q20
975MB.
This patch also removes the restriction that stopped repartitioning upon reaching
4 levels of repartitioning. Instead, whenever we repartition, we compare the size of
the input partition to the size of the largest new partition. If there is no reduction
in size we stop the algorithm; otherwise, we keep on repartitioning. That should
help in cases of skew (e.g. due to bad hashing). There is a new MAX_PARTITION_DEPTH limit
of 16. It is very unlikely we will ever hit this limit.
Change-Id: Ib33fece10585448bc2d07bb39d0535d78b168ccc
Reviewed-on: http://gerrit.cloudera.org:8080/119
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
The S3 work really enabled any Hadoop FileSystem to work with Impala,
but a small tweak is needed for LocalFileSystem due to how the Hadoop
Path code deals with URIs that don't have an authority component.
While we aren't claiming support for arbitrary FileSystems at this
time, it is useful to test this. Since the S3 testing is done as a
nightly test rather than pre-checkin, we can use the LocalFileSystem to
regression test that:
1) Impala can access table data living on a secondary filesystem,
i.e. not the filesystem specified by fs.defaultFS.
2) Impala does not make assumptions that the filesystem has type
DistributedFileSystem.
Change-Id: Ie9b858ea440c9b3b332602e034c8052b168c57da
Reviewed-on: http://gerrit.cloudera.org:8080/121
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
This change adds support for column lineage logging in Impala to be
consumed by Navigator. This feature is disabled by default and is enabled
by setting the -lineage_event_log_dir flag. When lineage logging is
enabled, the serialized column lineage graph is computed for each query
and stored in a specialized log file in JSON format.
Change-Id: Ib8d69cdbcc435be1e9c9694998c1d33ec1245b10
Reviewed-on: http://gerrit.cloudera.org:8080/70
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
These are the backend changes necessary for reading structs in Parquet
files. I wrote this against Alex's preliminary frontend work, and
ad-hoc tables containing structs work. We won't be able to add
automated tests until the FE changes are in as well, but I'd like to
get these changes in so we can at least get coverage of our existing
workloads.
The bulk of the changes are in the Parquet scanner. The rest is around
changing the column index of a slot descriptor to a column path, in
order to support nested columns.
Change-Id: Ifbd865b52c2b4679d81643184b1f36bf539ffcfd
Reviewed-on: http://gerrit.cloudera.org:8080/62
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
Currently, we only consider files hidden if they have the special
prefixes "." or "_". However, some tools use special suffixes
to indicate that a file is being operated on, and such files should be
considered invisible.
This patch adds the following hidden suffixes:
'.tmp' - Flume's default for temp files
'.copying' - hdfs put may produce these
Change-Id: I151eafd0286fa91e062407e12dd71cfddd442430
Reviewed-on: http://gerrit.cloudera.org:8080/80
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
The HMS does not properly migrate column stats when moving a table across
databases (HIVE-9720). Apart from losing the stats, this issue also prevents
the newly renamed table from being dropped.
To work around the issue, this patch manually drops+adds the column stats
when renaming a table across databases.
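The workaround applies to cross-database renames such as (names hypothetical):

    -- Column stats are manually dropped and re-added around the rename.
    ALTER TABLE db1.t RENAME TO db2.t;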
Change-Id: If901c5d1e9a6b2cedc35034a537f18c361c8ffa1
Reviewed-on: http://gerrit.cloudera.org:8080/72
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Additionally, this patch disables the hbase/none test dimension if the
TARGET_FILESYSTEM environment variable is set to either s3 or isilon.
Change-Id: I63aecaa478d2ba9eb68de729e9640071359a2eeb
Reviewed-on: http://gerrit.cloudera.org:8080/74
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Some Hadoop filesystems, like the S3-based ones, are not block based.
Since Impala derives scan ranges from file blocks, synthesize file
blocks for these filesystems. Otherwise, files are always assigned a
single scan range, limiting parallelism.
An alternate approach would be to modify the planner's compute scan range
code. However, there would be some downsides to that approach: (a) we'd
need to plumb through more information from the catalog to the frontend,
increasing the catalog size, and (b) we'd be doing more work on each
query rather than once at metadata load time.
Change-Id: If53cba6e25506545eae78190601fbee0147547b3
Reviewed-on: http://gerrit.cloudera.org:8080/54
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
The problem was that VARCHAR was not present in a few switch statements
for updating/populating column stats in various places.
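A minimal repro sketch (table name hypothetical):

    CREATE TABLE t (v VARCHAR(32));
    COMPUTE STATS t;
    -- Before the fix, the VARCHAR column fell through the type switches
    -- and its stats were lost; now they appear here.
    SHOW COLUMN STATS t;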
Change-Id: I0b2b316b734d27a7ff08701b0986014be2473443
Reviewed-on: http://gerrit.cloudera.org:8080/65
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
This should fix the last y2k38 problem. Previously calling
unix_timestamp() with an input of '2038-01-19 03:14:08' or later would
return a negative value due to a 32 bit int overflow. This patch
switches from 32 to 64 bit ints.
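For example (expected value per the Unix epoch, assuming UTC):

    -- INT32_MAX seconds after the epoch is 2038-01-19 03:14:07 UTC, so
    -- one second later must be 2147483648, not a negative 32-bit wrap.
    SELECT unix_timestamp('2038-01-19 03:14:08');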
Change-Id: Ic9180887d6c828f6ecd25435be86fd0bd52d3f0d
Reviewed-on: http://gerrit.cloudera.org:8080/61
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
If a cached table was created in a version of Impala that did not have
the property for the cache replication factor, loading the table would
fail until the table was un-cached and cached again.
This patch fixes this behavior by ignoring the missing parameter.
Change-Id: I118020dd5bd7fb203d91853d5ef946f2c4c8a695
Reviewed-on: http://gerrit.cloudera.org:8080/48
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
When a table is loaded in the catalog, we will now perform a check to
verify that the cache directive ID and cache replication factor are still
valid and the data is current.
If the cache directive no longer exists, we issue an error message
and mark the table / partition as uncached. Furthermore, the replication
factor is updated with the information from the actual cache directive.
Insert statements are a special case, as the catalog update happens
synchronously and may try to access cache directive information that is
stale. Thus, in the insert path, we catch the possible not-found
exception and reset the caching information.
Change-Id: I882041ce5395b8a3d17e9fc2750053393340df65
Reviewed-on: http://gerrit.cloudera.org:8080/40
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
This is preparation for fixing IMPALA-97. These changes are mostly
non-functional to bring the code closer to styling standards.
The biggest functional changes should be:
1) IMPALA-1623 was caused by a misuse of a constructor and that code
didn't compile after the refactor so the bug was fixed.
2) TimestampValue.Hash() seems to have been hashing the time twice
instead of the time and date.
3) Timings using TimestampValue.time() would not be accurate when
crossing midnight (time and date are separate fields).
4) Timings using local time should use UTC to avoid daylight savings
problems.
5) Use system monotonic clock in util/time.h.
Some timings may still be affected by #3 and #4 above, but fixing those
isn't the purpose of this change.
Change-Id: I26056c876c4361e898acc3656aa98abf6f153a6b
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5779
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: jenkins
This is a fix for the exhaustive test of compute stats that was
introduced by the new layout of the show partitions output. In the
exhaustive test one column too many was added to the expected output.
Change-Id: Ie83c114cd8ac1a711da64de3c82578020eb332af
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5865
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: jenkins
This patch adds the ability to specify the number of replicas that
should be cached in main memory. This can be useful in high-QPS
scenarios, as the load is no longer carried by the single cached
replica but spread across a set of cached replicas. While the cache
replication factor can be larger than the block replication factor on
disk, the difference will be ignored by HDFS until more replicas become
available.
This extends the current syntax for specifying the cache pool in the
following way:
cached in 'poolName'
is extended with the optional replication factor
cached in 'poolName' with replication = XX
By default, the cache replication factor is set to 1. As this value is
not yet configurable in HDFS it's defined as a constant in the JniCatalog
thrift specification. If a partitioned table is cached, all its child
partitions inherit this cache replication factor. If child partitions
have a custom cache replication factor, changing the cache replication
factor on the partitioned table afterwards will overwrite this custom
value. If a new partition is added to the table, it will again inherit
the cache replication factor of the parent independent of the cache pool
that is used to cache the partition.
To allow reviewing changes and the status of the replication factor for
tables and partitions, the replication factor is included in the output
of the "show partitions" command.
Change-Id: I2aee63258d6da14fb5ce68574c6b070cf948fb4d
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5533
Tested-by: jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
DistributedFileSystem is lenient about seeking past the end of the file.
Other FileSystem implementations, such as NativeS3FileSystem, return an
error on this condition. That leads to a scary looking message in the
query warnings.
So, when creating scan ranges, let's require that the ranges fall within
the file bounds (at least according to what the HdfsFileDesc indicates
is the length). There were a couple of kinds of AllocateScanRange()
callsites that needed to be fixed up:
1) When a stream wants to read past a scan range, be careful not to read
past the end of the file.
2) When Impala needs to "guess" at the length of a range, use the
file_length as an upper bound on the guess. We were already doing this
in some places but not everywhere.
3) When the scan range is derived from parquet metadata, validate the
metadata against file_length and issue appropriate errors. This will
give better diagnostics for corrupt files.
Note that we can't rely on this for safety (HdfsFileDesc file_length may
be stale), but it does mean that when metadata is up-to-date Impala will
no longer try to access beyond the end of files (and so we'll no longer
get false positive errors from the filesystem).
Additionally, this change revealed a pre-existing problem with files
that have multiple row-groups. The first time through InitColumns(),
stream_ was set to NULL. But, stream_->filename could potentially be
accessed when constructing error statuses for subsequent row-groups.
Change-Id: Ia668fa8c261547f85a18a96422846edcea57043e
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5424
Reviewed-by: Daniel Hecht <dhecht@cloudera.com>
Tested-by: jenkins
These tests had been 'temporarily' disabled when moving the TPCDS schema
on CDH5 to a partitioned store_sales with DECIMAL. The intention was to
re-enable these tests shortly after the schema change, but it was never
actually done.
Process to restore this test:
I started off with the tpcds-all.test file from our CDH4 branch and ran
it on CDH5. I investigated the following plan differences:
- Table sizes are slightly different on CDH5
- Several plan changes, e.g., join order, analytic order. All plan differences
are due to the size difference between DECIMAL and FLOAT. The CDH4 tables
use FLOAT, and the CDH5 tables use DECIMAL. Some plans had aggregates on those
columns, and the size difference between, e.g., a DECIMAL(38,2) and a DOUBLE
was significant enough to change the plan choice in several instances.
I concluded that all the plan differences are legitimate and should be accepted.
Change-Id: I11f36a543e9a5041d569c6f633fdfd296b72d31e
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5672
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
This patch contains the following changes:
- Add a metastore_snapshot_file parameter to build.sh
- Enable skipping metadata loading.
- create-load-data.sh is refactored into functions.
- A lot of scripts source impala-config, which creates a lot of log spew. This has now
been muted.
- Unnecessary log spew from compute-table-stats has been muted.
- build_thirdparty.sh determines its parallelism from the system; it was
previously hard-coded to 4.
- Only force-load data for a particular dataset if a schema change is detected.
Change-Id: I909336451e5c1ca57d21f040eb94c0e831546837
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5540
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins