impala

mirror of https://github.com/apache/impala.git synced 2026-01-06 06:01:03 -05:00

Author	SHA1	Message	Date
Lenni Kuff	9d71dd3d0c	IMP-1158: Dropping a database in Impala does not cleanup the db's HDFS directory We need to pass a flag to the metastore for the cleanup to happen. Previously we were passing 'false' when we need to pass 'true' to get the same behavior as Hive when dropping databases. Added a test case to validate the cleanup when dropping databases and tables. Change-Id: I500a3d3ac52c1b2031fae842403a670cfe43fa98 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1035 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: jenkins	2014-01-08 10:54:18 -08:00
Alex Behm	93e5b262c2	Added COMPUTE STATS command for gathering table and column stats. A compute stats command computes the table and column stats for a given table and persists them in the metastore. The table stats consist of the per-partition and per-table row count. The column stats are computed on a per-table basis and consist of the number of distinct values and the number of NULLs per column. This patch introduces a new 'child query' concept that compute stats utilizes. Child queries are cancelled if the parent query is cancelled. A compute stats stmt is executed by the following query hirarchy: parent: compute stats query (DDL) - child: compute table stats query (QUERY) - child: compute column stats query (QUERY) The new child query concept is necessary to decouple child query fetches from parent query fetches, i.e., we could not execute a child query as part of the original compute stats query, because then a client could fetch the results we need for updating the Metastore statistics. The reason why our existing CTAS works without this decoupling is that its insert 'child query' is not fetchable. Change-Id: I560533e3cb09bcbbdb3eea7fcf0b460bc6b36dcd Reviewed-on: http://gerrit.ent.cloudera.com:8080/873 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins	2014-01-08 10:54:14 -08:00
Lenni Kuff	71df58de49	Fix bug in run-test.py causing tests to be run twice if passed a .py file The problem is that we were running the "verifiers" with the command line parameters, which means the custom .py file was passed to them as well causing the duplicate test runs. Change-Id: I36f87e9b71ad49a05246af8006d4096c04541c27 Reviewed-on: http://gerrit.ent.cloudera.com:8080/981 Tested-by: jenkins Reviewed-by: Lenni Kuff <lskuff@cloudera.com>	2014-01-08 10:54:13 -08:00
ishaan	17761f1f5e	IMPALA-571 The shell should be able to cancel a query during an rpc. This change makes the fetch rpc interruptable. If the user cancels the query in the middle of a fetch, the shell reconnects to the impalad and closes the query. It also includes some code consolidation. Change-Id: Iaaf0dfd4cba9ce2557e4a7d0447bc9c3ffda5e29 Reviewed-on: http://gerrit.ent.cloudera.com:8080/717 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: jenkins	2014-01-08 10:54:12 -08:00
Lenni Kuff	9717b7af28	Rename SYNCED_DDL query option to SYNC_DDL Change-Id: I0b5e08694a271c40ac55d8e695cf3a74a012ce06 Reviewed-on: http://gerrit.ent.cloudera.com:8080/972 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: jenkins	2014-01-08 10:54:11 -08:00
Lenni Kuff	6282d364a8	IMP-1134: DoAsUser and impersonator are reversed in audit logs The audit logs currently have the "impersonator" field set to what we call the doAsUser and the "user" field set as the connected user. They should be reversed. Added basic tests to validate the correct event gets audited. Change-Id: Idfa0aaa6c88debedc4993bd0489dbd3f696fcf17 Reviewed-on: http://gerrit.ent.cloudera.com:8080/958 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: jenkins	2014-01-08 10:54:03 -08:00
Skye Wanderman-Milne	9147cd7518	IMPALA-525: Adjust IO buffer size based on read length and other memory fixes We were previously wasting memory by always reading into 8MB IO buffers, even when the data read was much less than 8MB. With this patch, the IO manager picks a buffer size closer to the actual amount being read (we don't use the exact size so we can continue to recycle buffers). The minimum IO buffer size is determined via the --min_buffer_size flag, and the max IO buffer size via the --read_size flag. This technique also helps with IMPALA-652, since short columns will not use as much memory as before (we will not use considerably more memory than the size of the table). This patch also changes StringBuffer to use a doubling strategy so it doesn't end up allocating many large unused buffers, and has the scanner context use the requested length as the sync read size if it's larger than the size produced by read_past_size_cb(). These changes help prevent the boundary buffer in the scanner context from allocating excess memory. Change-Id: I0efb3b023ddfddb08bca22d5cb5f9511fb4d6c50 Reviewed-on: http://gerrit.ent.cloudera.com:8080/938 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: jenkins	2014-01-08 10:54:01 -08:00
Lenni Kuff	6bba0c8ffe	Fix bug cleaning up removed Functions and fix test_ddl to create all test dbs When dropping functions, we neeed to remove the function from the list of Functions with that name AND remove the list from the Function map if the list is empty. The second part wasn't happening. Also fixes the test_ddl to properly create all test databases. Change-Id: Id85af7d5db74a31161f48bea3816bdf734063133 Reviewed-on: http://gerrit.ent.cloudera.com:8080/952 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins	2014-01-08 10:54:00 -08:00
Lenni Kuff	39f77b8b8f	Add support for cluster-synchronized catalog operations This change adds support for cluster-synchronized catalog operations. This provides the guaranteethat after a catalog op completes, all other subscribers to the catalog topic have also processed that update. This is useful when load balancing, because a common workflow is to target a different impalad for each statement executed. For example if each of the following were executed sequentially, but targeting a different node: 1) CREATE TABLE Foo 2) INSERT INTO Foo 3) SELECT * FROM Foo 4) INSERT INTO Foo .... Since both the INSERT and the CREATE update the catalog, it would not work as expected without this patch. The user might either get a "table not found" error or would be missing partition information from the INSERT. The downside is that this approach to DDL takes a bit longer because we need to wait until all subscribers have processed an update. If all nodes are healthy, this overhead should not be significantly longer than the current DDL time. However, a single bad node might slow down or completely block the completion of all DDL operations. By default this feature is disabled, but it can be enabled using a new query option: SYNCED_DDL=1 To test this, the base test suite was updated to support selecting a random impalad to execute each query section in a query test file. This is currently only enabled for the insert and DDL tests, but could be leveraged by more tests in the future. TODO: Add additional failure tests around this functionality. TODO: Add an explicit "sync" statement so users do not need to run all their DDL in this mode (since it is slower). Change-Id: I45e757a931bf2a4740cc0cdd1e76ce49a1e22b83 Reviewed-on: http://gerrit.ent.cloudera.com:8080/899 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: jenkins	2014-01-08 10:53:58 -08:00
Lenni Kuff	35817f6a17	Support faster DDL operations via the CatalogServer This change adds support for faster DDL via the CatalogServer by directly returning the TCatalogObject from each catalog operation and using this result to update the local impalad's catalog cache directly, rather than waiting for a state store heartbeat that contains the change. Because the Impalad's catalog can now be updated in two ways, it means that we need to be careful when applying updates to ensure no work gets "undone". For example, consider the following sequence of events: t1: [Direct Update] - Add item A - (Catalog Version 9) t2: [Direct Update] - Drop item A - (Catalog Version 10) t3: [StateStore Update] - (From Catalog Version 9) In this case, we need to ensure that the state store update in t3 does not undo the drop in t2, even though that update will contain the change to "add item A". To support this, we now check the catalog versions before adding any item to ensure that an existing item does not overwrite an item with a newer catalog version. To handle the case of removals, a new CatalogUpdateLog is introduced. This log tracks the catalog version each item was removed from the catalog. When adding a new catalog object, it is checked to see if this object was removed in a catalog version > than the version of the current object. If so, the update is ignored. This covers most updates, but there is still one concurrency issue that is not covered with this change. If someone issues an "invalidate metadata" concurrently with a direct catalog operation, it may briefly set the catalog back in time. This seems like okay behavior to me (the command is invalidating the catalog metadata). If we want to address this the CatalogUpdateLog could be extended to track additions to the catalog and we could replay the log after invalidating the metadata (as one possible solution). Change-Id: Icc9bdecc3c32436708bf9e9e7974f91d40e514f2 Reviewed-on: http://gerrit.ent.cloudera.com:8080/864 Tested-by: jenkins Reviewed-by: Lenni Kuff <lskuff@cloudera.com>	2014-01-08 10:53:58 -08:00
Lenni Kuff	48c1efad8a	Fix test import error due to bad merge Change-Id: Ia08afdc1fed66469e8142da2d5929f2b7337eb8e Reviewed-on: http://gerrit.ent.cloudera.com:8080/948 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: jenkins	2014-01-08 10:53:57 -08:00
Henry Robinson	9bc840dc85	Support for custom cluster configurations in some tests Test suites that derive from common.CustomClusterTestSuite have a brand new cluster for every tests case, which they can configure as they wish with custom arguments using the @with_args() decorator. A future improvement is to optionally only have one cluster per test suite, to allow multiple tests to run more quickly if they share configuration options. Change-Id: I6abd5740e644996d7ca2800edf4ff11b839d1bc4 Reviewed-on: http://gerrit.ent.cloudera.com:8080/882 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: jenkins	2014-01-08 10:53:57 -08:00
Lenni Kuff	8bb0010415	IMPALA-597: Do not crash when multiple partitions have the same LOCATION This patch fixes an issue where Impala would crash if two partitions had the same HDFS location. This is now fixed in hdfs-scan-node. It also includes some cleanup and bug fixes to the FE partition related classes and adds tests. There is still a problem where partition location metadata is not sent to the BE for INSERT statements, but that will be resolved in a separate patch. Change-Id: I0f1c3113d654f7d2b410f00e793ff6b0cae1ae18 Reviewed-on: http://gerrit.ent.cloudera.com:8080/876 Reviewed-by: Alan Choi <alan@cloudera.com> Tested-by: jenkins	2014-01-08 10:53:57 -08:00
Henry Robinson	94c72936e0	Temporarily disable flakey statestore restart test This test has problems under HEAPCHECK / ASAN, but not under other test environments. Change-Id: I9868232725386f5ee4ff12531ba24878251920a9 Reviewed-on: http://gerrit.ent.cloudera.com:8080/942 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Ishaan Joshi <ishaan@cloudera.com>	2014-01-08 10:53:55 -08:00
Henry Robinson	f241782966	IMPALA-620: Fix re-registration starvation bug in statestore This patch fixes a slightly pathological state that occurs when the statestore is under heavy load. The result of the bug is that subscribers cannot successfully re-register because the statestore never marks them as failed. The exact sequence of events is as follows: 1. Subscriber registers with state-store. 2. Statestore does not send heartbeats in timely fashion to subscriber. Subscriber times-out. 3. Subscriber is restarted quickly. Statestore does not detect restart. 4. Subscriber's RegisterSubscriber() call fails, because statestore detects duplicate registration. 5. Subscriber restarts again. Since state-store is slow to send heartbeats, the state-store has not detected the restart and the subscriber receives a heartbeat message from the statestore and does not reject it. 6. Statestore continues to believe subscriber is alive, since the heartbeats are not being rejected. To fix this, we add a registration ID to each successfully registered subscriber that is known to both subscriber and statestore. If the subscriber should restart and re-register, it receives a new registration ID. Whenever a heartbeat arrives, it compares its registration ID to that sent by the statestore with the heartbeat, and rejects the heartbeat if they do not match. We also allow re-registration of existing subscribers (getting rid of the dreaded "Duplicate subscription" message). A new registration overwrites an old one. Change-Id: Ie32df3a586ccb375375ebfbcbec1aaeb930b6bfe Reviewed-on: http://gerrit.ent.cloudera.com:8080/778 Tested-by: jenkins Reviewed-by: Henry Robinson <henry@cloudera.com>	2014-01-08 10:53:53 -08:00
Matthew Jacobs	51bfc99c63	IMPALA-395: Impala "show create table" statement Adds support for "show create table", a DDL statement that outputs a DDL statement that creates the specified table. In general, the output DDL works in Impala, so a user can copy the output and execute it to create the same table. However, there are a few special cases that output Hive DDL because we do not support creating some tables in Impala: HBase tables and tables with LZO compressed text. When we do support creating these tables in Impala, users should be able to execute the DDL in Impala as well. Change-Id: I8c130297a657810dea5b994bf99d72b0e61b847b Reviewed-on: http://gerrit.ent.cloudera.com:8080/842 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Matthew Jacobs <mj@cloudera.com>	2014-01-08 10:53:53 -08:00
Lenni Kuff	2336ed99a4	Re-enable process failure tests + add simple failure tests for catalogd This brings back online the process failure tests and adds a basic failure test for the catalog service. The timeouts had to be adjusted to account for the extra time it takes to load the the catalog and also there is an additional state store subscriber. Note: the statestore 'live.backends' metric which is used in these tests needs to be renamed, it really means 'live.subscribers'. However, it requires some coordination with other teams to make the change. Also updated start-impala-cluster to check the catalog.ready flag to ensure the impalad catalog is ready to accept queries. Change-Id: If22e25dba7dc83aa40bec937b5f82b815bed4645 Reviewed-on: http://gerrit.ent.cloudera.com:8080/730 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: jenkins	2014-01-08 10:53:52 -08:00
Alex Behm	1497002013	Added SHOW TABLE/COLUMN STATS command. Fixed the following stats-related bugs: - Per-partition row count was not distributed properly via CatalogService - HBase column stats were not loaded and distributed properly Enhancements to test framework: - Allow regex specification of expected row or column values - Fixed expected results of some tests because the test framework did not catch that they were incorrect Change-Id: I1fa8e710bbcf0ddb62b961fdd26ecd9ce7b75d51 Reviewed-on: http://gerrit.ent.cloudera.com:8080/813 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins	2014-01-08 10:53:51 -08:00
ishaan	f6f8d9d19d	Fix query_executor to set user specified Impala query options. This is currently broken (query options do not get set via run-workload). If any query options are provided to run-workload, it exits with an error. This patch re-enables setting query options through run-workload and also moves their validation to impala_beeswax. Change-Id: I1df010990f9e57ebd4cf59ada5d9646a883df380 Reviewed-on: http://gerrit.ent.cloudera.com:8080/820 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Ishaan Joshi <ishaan@cloudera.com>	2014-01-08 10:53:49 -08:00
Skye Wanderman-Milne	9d05d6d03a	Allow UDF tests to run in parallel. Change-Id: I9512d4a6920c4a71383d9374eb5feb303c3db85d Reviewed-on: http://gerrit.ent.cloudera.com:8080/727 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Skye Wanderman-Milne <skye@cloudera.com>	2014-01-08 10:53:47 -08:00
Lenni Kuff	927c486e0c	Imply EOL match character for workload runner query name regex matches This adds the EOL match character '$' to the end of all query names regex string to make the matching behaviour a bit more user friendly. This way if the user inputs "TPCH-Q1" it will not match TPCH-Q11/Q12/etc which is probably what they want. The user can still do a wildcard match using "TPCH-Q1." or "TPCH-Q1.$" Change-Id: Icfb6a111aa464353387e9b631168c44127a7896f Reviewed-on: http://gerrit.ent.cloudera.com:8080/784 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: Lenni Kuff <lskuff@cloudera.com>	2014-01-08 10:53:46 -08:00
Lenni Kuff	8218c19528	Match query names in run-workload using regex This change allows for matching query names in run-workload using a regex strings. For example, the user can now pass run-workload a query name string like: --query_names=tpcds-q.,^tpch. Change-Id: I5b13858ec32cf10769a4c4f2afc49adfeb98ec93 Reviewed-on: http://gerrit.ent.cloudera.com:8080/777 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Lenni Kuff <lskuff@cloudera.com>	2014-01-08 10:53:46 -08:00
Lenni Kuff	d698881f71	Improve test run throughput by executing more tests in parallel This updates the tests to run more test cases in parallel and also removes some unneeded "invalidate metadata" calls. This cut down the 'serial' execution time for me by 10+ minutes. Change-Id: I04b4d6db508a26a1a2e4b972bcf74f4d8b9dde5a Reviewed-on: http://gerrit.ent.cloudera.com:8080/757 Tested-by: jenkins Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>	2014-01-08 10:53:46 -08:00
Henry Robinson	8ea9811f50	Close refresh / refresh_table queries correctly in Python Beeswax client This patch ensures that calls to ImpalaBeeswaxClient.refresh* are closed after completion, to ensure the query doesn't hang around for ever. Change-Id: I1ac126755678c30d274454615f2db26cc2df7322 Reviewed-on: http://gerrit.ent.cloudera.com:8080/734 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: Henry Robinson <henry@cloudera.com>	2014-01-08 10:53:43 -08:00
Lenni Kuff	af6d381401	IMPALA-565: Support user impersonation for HS2 authorization requests This change adds support for user impersonation for HS2 authorization requests. It adds a new flag (--authorized_proxy_user_config) that if set, allows users (ex. hue) to impersonate as another user. The user they wish to impersonate as is passed using the HS2 configuration property, 'impala.doas.user'. The configuration allows for specifying the list of users a proxy user can impersonate as well, or '' to allow the proxy user to impersonate any user. For example: hue=user1,user2,admin= Change-Id: I2a13e31e5bde2e6df47134458c803168415d0437 Reviewed-on: http://gerrit.ent.cloudera.com:8080/574 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: Lenni Kuff <lskuff@cloudera.com>	2014-01-08 10:53:39 -08:00
Henry Robinson	89a0beb56a	IMPALA-449: Better cleanup after an INSERT fails This patch goes some way to improving recovery after an INSERT fails. Inserts now write intermediate results to <table_dir>/.impala_insert_staging. After execution completes, either successfully or not, the query-specific directory under that directory is deleted. This doesn't complete the job for better cleanup (although this goes as far as IMPALA-449 suggests). Two things to do in the future: * Have each backend delete its own staging files on error. The difficulty getting there now is that backends don't know if they are cancelled in error or because a LIMIT was reached. * If the operation to move files to their final destinations should fail during FinalizeQuery(), the coordinator should perform compensation actions and delete the files that made it. Note: We also considered a query-wide and impalad-wide option to change the staging dir. There are advantages to this (all intermediate results go to a known location which is easy to clean up on failure), but also security and other operational concerns. Worth revisiting in the future. Change-Id: Ia54cf36db6a382e359877f87d7d40aad7fdb77be Reviewed-on: http://gerrit.ent.cloudera.com:8080/670 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins	2014-01-08 10:53:37 -08:00
Lenni Kuff	8b2acf5c22	IMPALA-425: Detect read-only tables and disable INSERT/LOAD operations on these tables With this change we now detect if a table is read-only and disable INSERT/LOAD operations on these tables. A table is read-only if Impala does not have write permission on the HDFS base directory of the table or any one of the partition directories (if the table is partitioned). Change-Id: I25515b2d0ffb7fe297359437fd937a3d6e0406a0 Reviewed-on: http://gerrit.ent.cloudera.com:8080/713 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: Lenni Kuff <lskuff@cloudera.com>	2014-01-08 10:53:37 -08:00
Lenni Kuff	34ccec8626	IMPALA-592: Improve performance of ALTER TABLE ADD PARTITION ALTER TABLE ADD PARTITION had performance problems as it scaled to a large number of partitions. This was because Impala was always The reason Impala is hit this is because after executing the ALTER DDL, Impala "refreshes" the table metadata. As part of the refresh() Impala tries to reuse any metadata it can that already is cached. In this case the lastDdlTime has changed so all partitions are reloaded from the metastore using the listPartitions() RPC. This call does not scale well as the number of partitions grows, for a table with 2000 partitions ADD PARTITION can take over 20 seconds. This change significantly improves the performance of ALTER TABLE ADD PARTITION by slightly changing how incremental refreshes work. We now check the list of partition names in the metastore and load only the delta of what is new (and remove any partitions that have been dropped) by checking a new "isDirty()" flag on HdfsPartition. Additionally, the lastDdlTime is now updated on the local (cached) copy after each ALTER operation so we can detect against external modification to the table. With these changes we are able to add partitions at a pretty much constant time (~1s / partition), even for tables that have a large number of partitions. Change-Id: Idc48618d4061ea3c56d9b6dae2c431a7ac49d5d9 Reviewed-on: http://gerrit.ent.cloudera.com:8080/495 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: Lenni Kuff <lskuff@cloudera.com>	2014-01-08 10:53:33 -08:00
Alex Behm	1d460474af	IMPALA-432: Impose a safe maximum expr-tree depth and a safe maximum number of expr children. Change-Id: Ib519ffd5cd069b676850598b0b30b50b368cb23b Reviewed-on: http://gerrit.ent.cloudera.com:8080/692 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: jenkins Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Alex Behm <alex.behm@cloudera.com>	2014-01-08 10:53:28 -08:00
Henry Robinson	80b4a3b306	IMPALA-619: Add full session metadata to HS2 metadata operations Change-Id: Ib1ed3710781a4f530b16272af66f5fb48520c628 Reviewed-on: http://gerrit.ent.cloudera.com:8080/679 Tested-by: jenkins Reviewed-by: Henry Robinson <henry@cloudera.com>	2014-01-08 10:53:27 -08:00
Lenni Kuff	72e211ca4a	Use Hive Metastore Service instead of HiveServer 1 in test infrastructure Change-Id: I4e2ba02b2101bae95d196ab13f9453e1b3a9d7be Reviewed-on: http://gerrit.ent.cloudera.com:8080/689 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: jenkins	2014-01-08 10:53:26 -08:00
Lenni Kuff	3ad10e4459	Fix build break due to missing udf_test database Change-Id: Iaa973088ef317f5e42c1550766fc6f99a43c4370 Reviewed-on: http://gerrit.ent.cloudera.com:8080/691 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: Lenni Kuff <lskuff@cloudera.com>	2014-01-08 10:53:25 -08:00
ishaan	0cb16863ee	run-workload should log a warning to console and not fail if abort_on_query_error is False and the query fails. This change also disables printing the runtime_profile to the console. Change-Id: Ic7bc3406d6eddb67a514ecfb4a27add8c40a8604 Reviewed-on: http://gerrit.ent.cloudera.com:8080/687 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Ishaan Joshi <ishaan@cloudera.com>	2014-01-08 10:53:25 -08:00
Nong Li	4800995d44	Add execution for Hive UDFs. Change-Id: I6a5ad96fed77e2b8a2701f21a917a8eb7a11d500 Reviewed-on: http://gerrit.ent.cloudera.com:8080/458 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: Nong Li <nong@cloudera.com>	2014-01-08 10:53:25 -08:00
Nong Li	904289d168	Add UDA execution. Change-Id: Ie5aab79742675fc62ed731c13abe83304df80991 Reviewed-on: http://gerrit.ent.cloudera.com:8080/642 Tested-by: jenkins Reviewed-by: Nong Li <nong@cloudera.com>	2014-01-08 10:53:24 -08:00
Nong Li	7f9b16c241	Remove library from lib-cache on DROP FUNCTION. Change-Id: I30089324a862660cc4dd2894478925df2f701a13 Reviewed-on: http://gerrit.ent.cloudera.com:8080/681 Tested-by: jenkins Reviewed-by: Nong Li <nong@cloudera.com>	2014-01-08 10:53:24 -08:00
ishaan	a5096cb150	Disable test_describe_formatted because hs2 occasionally does not start. Change-Id: Ib78737be3c2d165b72679bce4c604c484e979c68 Reviewed-on: http://gerrit.ent.cloudera.com:8080/656 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Ishaan Joshi <ishaan@cloudera.com>	2014-01-08 10:53:20 -08:00
Nong Li	6b9a7de02e	Add symbol resolution during analysis for create function stmts. Before this, we had to specify the entire mangled symbol. This can be quite long and quite tedious (take a look at some of the create UDA test cases that specify all the symbols). This patch adds some code to convert from the user function signature to the mangled name. This means the user can specify the unmangled name and we can do the symbol lookup. The mangling rules are pretty convoluted but if it is messed up, the user can always specify the full symbol. Some other minor cleanup in: - JNI from FE to BE - UDFs/UDAs that are loaded as test data Change-Id: I733dbf3a72cb7b06221c27e622d161bcca0d74a8 Reviewed-on: http://gerrit.ent.cloudera.com:8080/624 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: Nong Li <nong@cloudera.com>	2014-01-08 10:53:20 -08:00
Lenni Kuff	13605ad834	Support catalogd in ImpalaCluster test library Adds basic support for catalogd to our ImpalaCluster test library/object model. This will allow us to write more programatic tests targeting the catalogd process including process failure tests and metric check validators. Change-Id: I8e5f7bc73f999f105437c6d3d52c6d436a354d2d Reviewed-on: http://gerrit.ent.cloudera.com:8080/617 Tested-by: jenkins Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: Lenni Kuff <lskuff@cloudera.com>	2014-01-08 10:53:16 -08:00
Nong Li	15db34e356	AggregationNode refactoring This patch redoes how the aggregation node is implemented. The functionality is now split between aggregation-node, agg-expr and aggregate-functions. This is a working progress (there's still a lot of debug stuff I added that needs to be cleaned up) but it does pass the tests. Aggregation-node is now very simple and now only deals with the grouping part. Aggregate-expr serves as the glue between the agg node and the aggregate functions. The aggregation functions are implemented with the UDA interface. I've reimplemented our existing aggregate functions with this setup. For true UDAs, the binaries would be loaded in aggregate-expr. This also includes some preliminary changes in the FE. We now need to annotate each AggNode as executing the update vs. merge phase (root aggs execute update, others execute merge) and if it needs a finalize step (only the root does). This is more general than our builtins which are too simple to need this structure. There is a big TODO here to allow the intermediate types between agg nodes to change. For example, in distinct estimate, the input type is the column type and the output type is a bigint. We'd like the intermediate type to be CHAR(256). This is different since currently, the intermediate type and output type have always been the same. We've hacked around this by having both the intermediate and output type be TYPE_STRING. I've left this for another patch (changing the BE to support this is trivial). For aggregates that result in strings, we used to store some additional stuff past the end of the tuple. The layout was: <tuple> <length of 1st string buffer>,<length of 2nd string buffer>, etc The rationale for this is that we want to reuse the buffer for min/max and grow the buffer more quickly for group_concat. This breaks down the abstraction between agg-expr and agg-node and is not something UDAs can use in general. Rather than try to hack around this, I think the proper solution is to the intermediate type not be StringValue and to contain the buffer length itself. This patch also resurrects the distinct estimate code. The distinct estimate functions exercise all of the code paths. Change-Id: Ic152a2cd03bc1713967673681e1e6204dcd80346 Reviewed-on: http://gerrit.ent.cloudera.com:8080/564 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: Nong Li <nong@cloudera.com>	2014-01-08 10:53:13 -08:00
Lenni Kuff	a2cbd2820e	Add Catalog Service and support for automatic metadata refresh The Impala CatalogService manages the caching and dissemination of cluster-wide metadata. The CatalogService combines the metadata from the Hive Metastore, the NameNode, and potentially additional sources in the future. The CatalogService uses the StateStore to broadcast metadata updates across the cluster. The CatalogService also directly handles executing metadata updates request from impalad servers (DDL requests). It exposes a Thrift interface to allow impalads to directly connect execute their DDL operations. The CatalogService has two main components - a C++ server that implements StateStore integration, Thrift service implementiation, and exporting of the debug webpage/metrics. The other main component is the Java Catalog that manages caching and updating of of all the metadata. For each StateStore heartbeat, a delta of all metadata updates is broadcast to the rest of the cluster. Some Notes On the Changes --- * The metadata is all sent as thrift structs. To do this all catalog objects (Tables/Views, Databases, UDFs) have thrift struct to represent them. These are sent with each statestore delta update. * The existing Catalog class has been seperated into two seperate sub-classes. An ImpladCatalog and a CatalogServiceCatalog. See the comments on those classes for more details. What is working: * New CatalogService created * Working with statestore delta updates and latest UDF changes * DDL performed on Node 1 is now visible on all other nodes without a "refresh". * Each DDL operation against the Catalog Service will return the catalog version that contains the change. An impalad will wait for the statestore heartbeat that contains this version before returning from the DDL comment. * All table types (Hbase, Hdfs, Views) getting their metadata propagated properly * Block location information included in CS updates and used by Impalads * Column and table stats included in CS updates and used by Impalads * Query tests are all passing Still TODO: * Directly return catalog object metadata from DDL requests * Poll the Hive Metastore to detect new/dropped/modified tables * Reorganize the FE code for the Catalog Service. I don't think we want everything in the same JAR. Change-Id: I8c61296dac28fb98bcfdc17361f4f141d3977eda Reviewed-on: http://gerrit.ent.cloudera.com:8080/601 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: Lenni Kuff <lskuff@cloudera.com>	2014-01-08 10:53:11 -08:00
Nong Li	78539ee531	Allow insert cancellation test to fail due to IMPALA-551 Change-Id: I5d98be1cc503cc51206051a7c6a493bf884ab5b3 Reviewed-on: http://gerrit.ent.cloudera.com:8080/594 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: Nong Li <nong@cloudera.com>	2014-01-08 10:53:10 -08:00
ishaan	ee42aa8d36	Fix incorrect argument in the Impala test suite call to execute_using_jdbc execute_using_jdbc used to expect a query string. Its interface was recently changed to accept a query object. Additionally, change the interface of the Query() class to enable it to accept raw (qualified) query strings. Change-Id: I44693cd2cccf1041cab32a9821fb76b12d148375 Reviewed-on: http://gerrit.ent.cloudera.com:8080/577 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Ishaan Joshi <ishaan@cloudera.com>	2014-01-08 10:53:09 -08:00
ishaan	a33c795de3	Fix build failure because of a function signature change in the test file parser. Change-Id: I329eca710459910a743d682c21a625672096aec0 Reviewed-on: http://gerrit.ent.cloudera.com:8080/573 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Ishaan Joshi <ishaan@cloudera.com>	2014-01-08 10:53:08 -08:00
ishaan	565d15579c	Add the ability to use a workload as the unit of execution in the Impala benchmark runner. At the moment, a query is the default unit of execution and parallelism in the Impala performance suite. With this change, we now have the ability to treat a workload as the unit of execution. A workload is defined as a unique combination of the dataset, scale factor, a subset (or all) of the queries in the dataset, and a table format (file format, compression codec and compression scheme). It introduces two new command line options in bin/run-workload.py: * --execution_scope The default scope is 'query', and it maintains previous semantics. The new scope is 'workload', which toggles the unit of execution to a workload. * --shuffle_query_exec_order. Shuffles the order in which queries are executed (only applicable when the execution_scope if workload), defaults to False. Change-Id: I790d75f0896210cda8eb999015b0be04246e4c45 Reviewed-on: http://gerrit.ent.cloudera.com:8080/503 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Ishaan Joshi <ishaan@cloudera.com>	2014-01-08 10:53:07 -08:00
Skye Wanderman-Milne	b7f83bcd73	Add support for LLVM IR UDFs. This patch also adds a number of improvements to NativeUdfExpr. Highlights include: * Correctly handling the lowering of AnyVal struct types (required for ABI compatibility) * A rudimentary library cache for reusing handles produced by dlopen * More complicated test cases Change-Id: Iab9acdd7d7c4308e5d7ee3210f21b033fda5a195 Reviewed-on: http://gerrit.ent.cloudera.com:8080/540 Tested-by: jenkins Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Skye Wanderman-Milne <skye@cloudera.com>	2014-01-08 10:53:03 -08:00
Nong Li	e5ed8e4105	Move minicluster_xml_conf to HADOOP_CONF_DIR. The current location gets deleted if you rebuild, making you have to restart mini dfs. Change-Id: If71b144534255fa8df2bfa187c0814ffdf28463e Reviewed-on: http://gerrit.ent.cloudera.com:8080/550 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: Nong Li <nong@cloudera.com>	2014-01-08 10:53:03 -08:00
Alex Behm	e4a24c8c1d	Fixed the process failure test that was failing due to a race in reading/writing a query's profile web page. Change-Id: Ibf4a27aa17eb6439630d1616c2c719fc1ee2ba4e Reviewed-on: http://gerrit.ent.cloudera.com:8080/553 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Alex Behm <alex.behm@cloudera.com>	2014-01-08 10:53:03 -08:00
ishaan	c0129a1683	Improve the Impala shell's behavior when attempting to connect to a keberized impalad. This change has the following additions: - If the user's connecting to a kerberized impalad, the Impala shell will check whether a valid ticket exists by running 'klist -s'. If a valid ticket is not found, then the shell will exit with an appropriate error message on the commandline. - If the user's connecting to a kerberized impalad without the '-k' option, the Impala Shell will issue a 'klist -s' to check if there are valid kerberos tickets in the credentials cache. If a valid ticket is found, it will retry the connection with kerberos enabled. - The Impala shell encodes strings entered on the commandline as unicode. The sasl module expects ascii strings as arguments. Explcitly encode any string sent to the sasl module to ascii. Change-Id: I1799b1e7988a19fa513b683afe1e3b66b68c1ffc Reviewed-on: http://gerrit.ent.cloudera.com:8080/535 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Ishaan Joshi <ishaan@cloudera.com>	2014-01-08 10:53:02 -08:00
Henry Robinson	8cee9fa138	Fix failing test_shell_commandline Change-Id: Iea170885f740ceeb08e21e64ef88ab44584fa270 Reviewed-on: http://gerrit.ent.cloudera.com:8080/545 Tested-by: jenkins Reviewed-by: Lenni Kuff <lskuff@cloudera.com>	2014-01-08 10:53:01 -08:00

1 2 3 4 5

244 Commits