impala

mirror of https://github.com/apache/impala.git synced 2026-01-03 15:00:52 -05:00

Author	SHA1	Message	Date
Lenni Kuff	76fa3b2ded	Update DDL to support 'STORED AS PARQUET' and 'STORED AS AVRO' syntax This change updates our DDL syntax support to allow for using 'STORED AS PARQUET' as well as 'STORED AS PARQUETFILE'. Moving forward we should prefer the new syntax, but continue to support the old. I made the same change for 'AVROFILE', but since we have not yet documented the 'AVROFILE' syntax I left out support for the old syntax. Change-Id: I10c73a71a94ee488c9ae205485777b58ab8957c9 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1053 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: jenkins	2014-01-08 10:54:18 -08:00
Nong Li	c1a64d6863	Add kill-mini-llama to CDH4 branch. This makes it easier to switch between our branches and a no-op if for those of us staying on CDH4. Change-Id: Ic07eb8a7ba7e48db118c06c221aabe5e124f3bfb Reviewed-on: http://gerrit.ent.cloudera.com:8080/1033 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Nong Li <nong@cloudera.com>	2014-01-08 10:54:17 -08:00
Nong Li	ab21dde002	Update compute stats test to use regex for parquet/hbase file size. The parquet file stores the application version that wrote it so is different between our c4 and c5 branches. HBase storage is also not guaranteed to be identical across versions. Change-Id: I02984a55e0678756e50c1fff6db22c43788d3916 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1028 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins	2014-01-08 10:54:17 -08:00
Lenni Kuff	c3619d9581	Add targeted perf queries for columns materialized from inline-view The planner should see only c1 and c2 are being materialized from the inline-view in these queries. This will provide a significant performance improvement on Parquet format tables. Change-Id: If9a366000531a8383dc20ad6f40456ace2281b7d Reviewed-on: http://gerrit.ent.cloudera.com:8080/1017 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: jenkins	2014-01-08 10:54:17 -08:00
Lenni Kuff	01660374c6	Additional fe and testdata pom.xml cleanup This change cleans up our FE pom.xml file by removing unneeded dependencies and system dependencies (system dependencies are now pulled in from the Maven release repository). The upside is that our pom is cleaner and it will also help reduce the likelihood of broken dependencies since Maven will pull in the right versions. The downside is that we now pull in quite a few more JARs. Note: I was unable to find release artifacts for Sentry and Parquet so I leaving those as "system" for now. Change-Id: I0b917b09a02243d78d89747591ab6bccacf7cf38 Saving changes Change-Id: I3697a7b44884c40e077b3e354fef76625e1b881d Reviewed-on: http://gerrit.ent.cloudera.com:8080/1011 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: jenkins	2014-01-08 10:54:17 -08:00
Alex Behm	2325c8c923	Added [shuffle]/[noshuffle] plan hints for forcing/preventing repartitioning before an insert. Change-Id: I0647366815f4488cabbcb1fc7bc3cf851960c44e Reviewed-on: http://gerrit.ent.cloudera.com:8080/1007 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins	2014-01-08 10:54:16 -08:00
Alex Behm	93e5b262c2	Added COMPUTE STATS command for gathering table and column stats. A compute stats command computes the table and column stats for a given table and persists them in the metastore. The table stats consist of the per-partition and per-table row count. The column stats are computed on a per-table basis and consist of the number of distinct values and the number of NULLs per column. This patch introduces a new 'child query' concept that compute stats utilizes. Child queries are cancelled if the parent query is cancelled. A compute stats stmt is executed by the following query hirarchy: parent: compute stats query (DDL) - child: compute table stats query (QUERY) - child: compute column stats query (QUERY) The new child query concept is necessary to decouple child query fetches from parent query fetches, i.e., we could not execute a child query as part of the original compute stats query, because then a client could fetch the results we need for updating the Metastore statistics. The reason why our existing CTAS works without this decoupling is that its insert 'child query' is not fetchable. Change-Id: I560533e3cb09bcbbdb3eea7fcf0b460bc6b36dcd Reviewed-on: http://gerrit.ent.cloudera.com:8080/873 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins	2014-01-08 10:54:14 -08:00
Skye Wanderman-Milne	49f4bd285a	Change test_metadata_query_statements.py::test_show to ignore Parquet file sizes since they may fluctuate slightly. Change-Id: I3ddb6ceebe6dcc86cc1c58b35b0cd96986ec43e1 Reviewed-on: http://gerrit.ent.cloudera.com:8080/988 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: jenkins	2014-01-08 10:54:14 -08:00
Alex Behm	659382f334	Add proper CDH release repo to testdata pom. Change-Id: Ica709bc50b1f3b0124f95f4a89d8acde16c6e60f Reviewed-on: http://gerrit.ent.cloudera.com:8080/983 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins	2014-01-08 10:54:13 -08:00
Nong Li	7f08146b88	Add ndv (distinct estimate) as a builtin aggregate function. This is implemented in the BE using HLL (but we could change this in the future). These estimates usually work better than the other algorithm we have and we've not implemented all the improvements from the google paper. Change-Id: Ied715ddd0e1a7cbe7f5f90469f1ed3d4b9c537c7 Reviewed-on: http://gerrit.ent.cloudera.com:8080/956 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins	2014-01-08 10:54:03 -08:00
Matthew Jacobs	8a55982105	Add OFFSET to skip rows returned with a LIMIT Adds support for skipping a number of rows with an ORDER BY clause and a LIMIT. Hive does not support OFFSET so creating a view with an OFFSET will not work in Hive. For example, "SELECT * FROM T1 ORDER BY ID LIMIT 20 OFFSET 5" will do the sorting, skip 5 rows, then return the next 20. OFFSET requires an ORDER BY clause. Note this is not very efficient as we must actually keep (limit+offset) rows in memory in the topn-node, and all child sort nodes must as well. Users should be careful when using this feature. Change-Id: I4d7021c278296e7bdbfa0e6f2699cd6f23eef59d Reviewed-on: http://gerrit.ent.cloudera.com:8080/900 Tested-by: jenkins Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Matthew Jacobs <mj@cloudera.com>	2014-01-08 10:54:02 -08:00
Skye Wanderman-Milne	9147cd7518	IMPALA-525: Adjust IO buffer size based on read length and other memory fixes We were previously wasting memory by always reading into 8MB IO buffers, even when the data read was much less than 8MB. With this patch, the IO manager picks a buffer size closer to the actual amount being read (we don't use the exact size so we can continue to recycle buffers). The minimum IO buffer size is determined via the --min_buffer_size flag, and the max IO buffer size via the --read_size flag. This technique also helps with IMPALA-652, since short columns will not use as much memory as before (we will not use considerably more memory than the size of the table). This patch also changes StringBuffer to use a doubling strategy so it doesn't end up allocating many large unused buffers, and has the scanner context use the requested length as the sync read size if it's larger than the size produced by read_past_size_cb(). These changes help prevent the boundary buffer in the scanner context from allocating excess memory. Change-Id: I0efb3b023ddfddb08bca22d5cb5f9511fb4d6c50 Reviewed-on: http://gerrit.ent.cloudera.com:8080/938 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: jenkins	2014-01-08 10:54:01 -08:00
Lenni Kuff	6bba0c8ffe	Fix bug cleaning up removed Functions and fix test_ddl to create all test dbs When dropping functions, we neeed to remove the function from the list of Functions with that name AND remove the list from the Function map if the list is empty. The second part wasn't happening. Also fixes the test_ddl to properly create all test databases. Change-Id: Id85af7d5db74a31161f48bea3816bdf734063133 Reviewed-on: http://gerrit.ent.cloudera.com:8080/952 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins	2014-01-08 10:54:00 -08:00
Lenni Kuff	39f77b8b8f	Add support for cluster-synchronized catalog operations This change adds support for cluster-synchronized catalog operations. This provides the guaranteethat after a catalog op completes, all other subscribers to the catalog topic have also processed that update. This is useful when load balancing, because a common workflow is to target a different impalad for each statement executed. For example if each of the following were executed sequentially, but targeting a different node: 1) CREATE TABLE Foo 2) INSERT INTO Foo 3) SELECT * FROM Foo 4) INSERT INTO Foo .... Since both the INSERT and the CREATE update the catalog, it would not work as expected without this patch. The user might either get a "table not found" error or would be missing partition information from the INSERT. The downside is that this approach to DDL takes a bit longer because we need to wait until all subscribers have processed an update. If all nodes are healthy, this overhead should not be significantly longer than the current DDL time. However, a single bad node might slow down or completely block the completion of all DDL operations. By default this feature is disabled, but it can be enabled using a new query option: SYNCED_DDL=1 To test this, the base test suite was updated to support selecting a random impalad to execute each query section in a query test file. This is currently only enabled for the insert and DDL tests, but could be leveraged by more tests in the future. TODO: Add additional failure tests around this functionality. TODO: Add an explicit "sync" statement so users do not need to run all their DDL in this mode (since it is slower). Change-Id: I45e757a931bf2a4740cc0cdd1e76ce49a1e22b83 Reviewed-on: http://gerrit.ent.cloudera.com:8080/899 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: jenkins	2014-01-08 10:53:58 -08:00
Lenni Kuff	8bb0010415	IMPALA-597: Do not crash when multiple partitions have the same LOCATION This patch fixes an issue where Impala would crash if two partitions had the same HDFS location. This is now fixed in hdfs-scan-node. It also includes some cleanup and bug fixes to the FE partition related classes and adds tests. There is still a problem where partition location metadata is not sent to the BE for INSERT statements, but that will be resolved in a separate patch. Change-Id: I0f1c3113d654f7d2b410f00e793ff6b0cae1ae18 Reviewed-on: http://gerrit.ent.cloudera.com:8080/876 Reviewed-by: Alan Choi <alan@cloudera.com> Tested-by: jenkins	2014-01-08 10:53:57 -08:00
Matthew Jacobs	51bfc99c63	IMPALA-395: Impala "show create table" statement Adds support for "show create table", a DDL statement that outputs a DDL statement that creates the specified table. In general, the output DDL works in Impala, so a user can copy the output and execute it to create the same table. However, there are a few special cases that output Hive DDL because we do not support creating some tables in Impala: HBase tables and tables with LZO compressed text. When we do support creating these tables in Impala, users should be able to execute the DDL in Impala as well. Change-Id: I8c130297a657810dea5b994bf99d72b0e61b847b Reviewed-on: http://gerrit.ent.cloudera.com:8080/842 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Matthew Jacobs <mj@cloudera.com>	2014-01-08 10:53:53 -08:00
ishaan	fcdcf1a9d8	Parallelize data loaded through Impala to speed up data loading. Currently, we execute all the queries involved in data loading serially. This change creates a separate .sql file for each file format, compression codec and compression scheme combination, and executes all the files in parallel. Additionally, we now store all the .sql files (independent of workload) in $IMPALA_HOME/data_load_files/<dataset_name>. Note that only data loaded through Impala is parallelized, data loaded through hive and hbase remains serial. On our build machines, the time taken to load all the data from snapshot was on the order of 15 minutes. Change-Id: If8a862c43f0e75b506ca05d83eacdc05621cbbf8 Reviewed-on: http://gerrit.ent.cloudera.com:8080/804 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: jenkins	2014-01-08 10:53:53 -08:00
Alex Behm	1497002013	Added SHOW TABLE/COLUMN STATS command. Fixed the following stats-related bugs: - Per-partition row count was not distributed properly via CatalogService - HBase column stats were not loaded and distributed properly Enhancements to test framework: - Allow regex specification of expected row or column values - Fixed expected results of some tests because the test framework did not catch that they were incorrect Change-Id: I1fa8e710bbcf0ddb62b961fdd26ecd9ce7b75d51 Reviewed-on: http://gerrit.ent.cloudera.com:8080/813 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins	2014-01-08 10:53:51 -08:00
Lenni Kuff	498c2529d4	Test CR: Change spacing in run-all.sh Change-Id: I2362799213a7faca3892e38fb874bfbbd0c1718f Reviewed-on: http://gerrit.ent.cloudera.com:8080/803 Tested-by: jenkins Reviewed-by: Lenni Kuff <lskuff@cloudera.com>	2014-01-08 10:53:50 -08:00
Matthew Jacobs	00bc971d34	IMPALA-531: Allow arithmetic expressions for LIMIT Change-Id: Ic1901e9dbaeee5fb0aef72a278b4aa262a2abcd7 Reviewed-on: http://gerrit.ent.cloudera.com:8080/829 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Matthew Jacobs <mj@cloudera.com>	2014-01-08 10:53:49 -08:00
Matthew Jacobs	65353fd9fb	IMPALA-598: Order by behavior for NULLs should be revisited This change modifies that behavior of NULL ordering such that nulls always compare greater than other values, but "nulls first" or "nulls last" can be used to explicitly specify if nulls should be sorted first or last regardless of the asc/desc. Change-Id: I92feda1e7f42249de4009afd39f8395a0a32a2f8 Reviewed-on: http://gerrit.ent.cloudera.com:8080/812 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Matthew Jacobs <mj@cloudera.com>	2014-01-08 10:53:48 -08:00
Skye Wanderman-Milne	9d05d6d03a	Allow UDF tests to run in parallel. Change-Id: I9512d4a6920c4a71383d9374eb5feb303c3db85d Reviewed-on: http://gerrit.ent.cloudera.com:8080/727 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Skye Wanderman-Milne <skye@cloudera.com>	2014-01-08 10:53:47 -08:00
Lenni Kuff	fc7733d530	Fix resolution of mismatched column names that come from the deserializer (ex. Avro tables) Fixes a bug (regression) where the catalog server was not properly resolving column names when a table's column definition did not match its Avro schema definition. The expected behavior in this case is that the the Avro scehma definition should be used instead of the table columns. We had no test tables that were mismatched so this wasn't caught. This loading of the schema and columns happens when a table's metadata is loaded, so the fix is to just add a toThrift() to Column and not reference metastore.getSd().getCols() directly since it might be the "wrong" set of columns. Change-Id: I341a3a8834f5748f90c246d2093ddb983ecfdd4f Reviewed-on: http://gerrit.ent.cloudera.com:8080/770 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: Lenni Kuff <lskuff@cloudera.com>	2014-01-08 10:53:44 -08:00
Skye Wanderman-Milne	7e8e184acf	Allow UDFs in conjunct expressions. This patch refactors HDFSScanNode to copy and prepare all conjunct exprs in Prepare(), rather than in the scanner threads. This is necessary so the UDF exprs get codegen'd. Prepare() also only codegens the functions for the necessary file formats now, rather than for all file formats regardless of what's actually be scanned. Change-Id: Ic3220cbd0cba9a3baa138b1f50ecdc6889ed0cd1 Reviewed-on: http://gerrit.ent.cloudera.com:8080/710 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Skye Wanderman-Milne <skye@cloudera.com>	2014-01-08 10:53:39 -08:00
Skye Wanderman-Milne	97a6b12e37	Fix UDFs used in partition pruning exprs. Exprs used for partition pruning are prepared/evaluated with a separate RuntimeState. If these exprs use UDFs, the runtime state needs access to the process's ExecEnv so we can use the LibCache and the IR produced by the UDF exprs needs to be optimized and jit'd. Change-Id: If7c1d6ebc0015ef3c21a0421c1a36cad4be66625 Reviewed-on: http://gerrit.ent.cloudera.com:8080/695 Tested-by: jenkins Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Skye Wanderman-Milne <skye@cloudera.com>	2014-01-08 10:53:39 -08:00
Nong Li	601f24a198	UDA execution loose ends. Unfortunately, the BE does not have the codegen path to execute UDAs. This puts some restrictions on the UDAs we can run. - No IR UDAs - No varargs - Must have 8 arguments or less. The code to do this is almost all there for UDFs but I'm not sure I'll get to it. Change-Id: I8a06e635a9138397c8474a5704c3e588bb92347b Reviewed-on: http://gerrit.ent.cloudera.com:8080/703 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: Nong Li <nong@cloudera.com>	2014-01-08 10:53:38 -08:00
Lenni Kuff	8b2acf5c22	IMPALA-425: Detect read-only tables and disable INSERT/LOAD operations on these tables With this change we now detect if a table is read-only and disable INSERT/LOAD operations on these tables. A table is read-only if Impala does not have write permission on the HDFS base directory of the table or any one of the partition directories (if the table is partitioned). Change-Id: I25515b2d0ffb7fe297359437fd937a3d6e0406a0 Reviewed-on: http://gerrit.ent.cloudera.com:8080/713 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: Lenni Kuff <lskuff@cloudera.com>	2014-01-08 10:53:37 -08:00
Nong Li	a944a1fe52	'Invalidate metadata' no longer clears user functions. Change-Id: I36de18fefa1d515a7960c2bf8c116d5217c388d6 Reviewed-on: http://gerrit.ent.cloudera.com:8080/726 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: Nong Li <nong@cloudera.com>	2014-01-08 10:53:36 -08:00
Alex Behm	b670b9f4f9	Fix switch to hive-exec from hive-builtins in UDF test file. Change-Id: Ibb75e129ea6c3da5ede9e8e399e537e3e561e814 Reviewed-on: http://gerrit.ent.cloudera.com:8080/723 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins	2014-01-08 10:53:35 -08:00
Alex Behm	51e914e911	Use hive-exec instead of hive-builtin because hive-builtin does not exist in CDH5 Hive. Change-Id: I11993c7eebc9f5f07f112810d7e81d07ce157193 Reviewed-on: http://gerrit.ent.cloudera.com:8080/715 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Alex Behm <alex.behm@cloudera.com>	2014-01-08 10:53:33 -08:00
Lenni Kuff	01c8c43fec	Uniquify FUNCTION catalog topic entry keys by including parent database name Change-Id: I6aa49520f548ddfcd557e2f908a09be454765e8c Reviewed-on: http://gerrit.ent.cloudera.com:8080/698 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: Lenni Kuff <lskuff@cloudera.com>	2014-01-08 10:53:29 -08:00
Alex Behm	b82880738c	IMPALA-617: Cast NULLs in INSERT statement due to incomplete permutation list to expected column type. Inserting NULLs with NULL_TYPE into Parquet tables cases a crash. Change-Id: I350c7ee2789c017cee5c4b6a1292c9fae36087f1 Reviewed-on: http://gerrit.ent.cloudera.com:8080/696 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: jenkins	2014-01-08 10:53:29 -08:00
Skye Wanderman-Milne	b41ff0c8cd	Modify test-udfs.cc so there are no undefined symbols in shared library. AnalyzeDDLTest was failing because the fesupport binary couldn't resolve a function used in libTestUdfs.so (the function was defined in udf.cc, rather than udf.h). I couldn't figure out how to cleanly build udf.cc into the libTestUdfs.so, so instead I removed the use of the function in test-udfs.cc. Change-Id: I81243547584a5b49a5f9265d0d17e035e18d6110 Reviewed-on: http://gerrit.ent.cloudera.com:8080/694 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Skye Wanderman-Milne <skye@cloudera.com>	2014-01-08 10:53:27 -08:00
Nong Li	911cfc1bb9	Fix vararg UDFs. Change-Id: I0e202b984ece7de3d220b6ce89b0c0a4c9edcb45 Reviewed-on: http://gerrit.ent.cloudera.com:8080/688 Tested-by: jenkins Reviewed-by: Nong Li <nong@cloudera.com>	2014-01-08 10:53:26 -08:00
Lenni Kuff	72e211ca4a	Use Hive Metastore Service instead of HiveServer 1 in test infrastructure Change-Id: I4e2ba02b2101bae95d196ab13f9453e1b3a9d7be Reviewed-on: http://gerrit.ent.cloudera.com:8080/689 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: jenkins	2014-01-08 10:53:26 -08:00
Nong Li	4800995d44	Add execution for Hive UDFs. Change-Id: I6a5ad96fed77e2b8a2701f21a917a8eb7a11d500 Reviewed-on: http://gerrit.ent.cloudera.com:8080/458 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: Nong Li <nong@cloudera.com>	2014-01-08 10:53:25 -08:00
Nong Li	904289d168	Add UDA execution. Change-Id: Ie5aab79742675fc62ed731c13abe83304df80991 Reviewed-on: http://gerrit.ent.cloudera.com:8080/642 Tested-by: jenkins Reviewed-by: Nong Li <nong@cloudera.com>	2014-01-08 10:53:24 -08:00
Alex Behm	3f54240fed	PlannerTest uses explain level 'normal'. Only add stats and costs to explain output in 'verbose' mode. Change-Id: I827b4c7085b5aa2dc5521f8748d8973178f43f4c Reviewed-on: http://gerrit.ent.cloudera.com:8080/678 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: jenkins	2014-01-08 10:53:23 -08:00
Alex Behm	c5c2ccb56c	Fix build break due to machine-dependent explain output. Change-Id: I6b72e4e6cf2a7b38d4687c6f0f860e9744c9cedb Reviewed-on: http://gerrit.ent.cloudera.com:8080/675 Tested-by: jenkins Reviewed-by: Marcel Kornacker <marcel@cloudera.com>	2014-01-08 10:53:22 -08:00
Alex Behm	4bb8b38cde	Added stats and cost estimates to explain output. Change-Id: I1273745a439fd25cefa4e08ecc075c98cc8bfc45 Reviewed-on: http://gerrit.ent.cloudera.com:8080/602 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Alex Behm <alex.behm@cloudera.com>	2014-01-08 10:53:22 -08:00
Skye Wanderman-Milne	8692e7df8d	Add timestamp support to CodegenAnyVal Change-Id: I2bbeae16660709c2c15d545e6d1c791912e880db Reviewed-on: http://gerrit.ent.cloudera.com:8080/655 Tested-by: jenkins Reviewed-by: Nong Li <nong@cloudera.com>	2014-01-08 10:53:21 -08:00
Nong Li	6b9a7de02e	Add symbol resolution during analysis for create function stmts. Before this, we had to specify the entire mangled symbol. This can be quite long and quite tedious (take a look at some of the create UDA test cases that specify all the symbols). This patch adds some code to convert from the user function signature to the mangled name. This means the user can specify the unmangled name and we can do the symbol lookup. The mangling rules are pretty convoluted but if it is messed up, the user can always specify the full symbol. Some other minor cleanup in: - JNI from FE to BE - UDFs/UDAs that are loaded as test data Change-Id: I733dbf3a72cb7b06221c27e622d161bcca0d74a8 Reviewed-on: http://gerrit.ent.cloudera.com:8080/624 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: Nong Li <nong@cloudera.com>	2014-01-08 10:53:20 -08:00
Nong Li	c031cd4e96	Update RLE encoding to pad literal groups to 8. Change-Id: I77cb2b80b888b569ff715c583f16aea4e39fe680 Reviewed-on: http://gerrit.ent.cloudera.com:8080/644 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins	2014-01-08 10:53:17 -08:00
ishaan	8a43426879	Sleep after starting the hiveserver2 service to guards against it not starting on time. Change-Id: I9a0de1cc63089cba2f9b59942ee45abc44b8662e Reviewed-on: http://gerrit.ent.cloudera.com:8080/643 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Ishaan Joshi <ishaan@cloudera.com>	2014-01-08 10:53:17 -08:00
Lenni Kuff	b07a3ccfd6	Use an external Hive Metastore Service for local test runs Using an external Hive Metastore Service for local test runs has a number of benefits. Some of the benefits are that it helps separate the metastore logs from the impala logs, and that it is more representative of what is on real cluster environments. It also may help with some of the concurrency issues that we have been seeing when running directly against the backend database since we no longer spin up an in-process metastore server for each client connection. The metastore is started by running "run-hive-server.sh" which is invoked as part of "run-all.sh". Change-Id: If60fa97aa38e4ad5cf578b9b409eeea1e0e29375 Reviewed-on: http://gerrit.ent.cloudera.com:8080/628 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: jenkins	2014-01-08 10:53:15 -08:00
Lenni Kuff	92829b8400	IMPALA-587: Support implicit hbase column mapping keys The Hive HBase spec specifies that the key column mapping can either be defined explicitly (using the :key syntax) or left out completely in which case a mapping to the first table column is implied. This change updates Impala to support implicit key mappings and also adds some checks in our ALTER TABLE DDL to unsure we cannot get into this state by dropping a column from an Hbase table (a similar restriction that Hive puts in place) Change-Id: I920d642261659ee3e881da2553ffe83300923af8 Reviewed-on: http://gerrit.ent.cloudera.com:8080/554 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: Lenni Kuff <lskuff@cloudera.com>	2014-01-08 10:53:14 -08:00
Nong Li	15db34e356	AggregationNode refactoring This patch redoes how the aggregation node is implemented. The functionality is now split between aggregation-node, agg-expr and aggregate-functions. This is a working progress (there's still a lot of debug stuff I added that needs to be cleaned up) but it does pass the tests. Aggregation-node is now very simple and now only deals with the grouping part. Aggregate-expr serves as the glue between the agg node and the aggregate functions. The aggregation functions are implemented with the UDA interface. I've reimplemented our existing aggregate functions with this setup. For true UDAs, the binaries would be loaded in aggregate-expr. This also includes some preliminary changes in the FE. We now need to annotate each AggNode as executing the update vs. merge phase (root aggs execute update, others execute merge) and if it needs a finalize step (only the root does). This is more general than our builtins which are too simple to need this structure. There is a big TODO here to allow the intermediate types between agg nodes to change. For example, in distinct estimate, the input type is the column type and the output type is a bigint. We'd like the intermediate type to be CHAR(256). This is different since currently, the intermediate type and output type have always been the same. We've hacked around this by having both the intermediate and output type be TYPE_STRING. I've left this for another patch (changing the BE to support this is trivial). For aggregates that result in strings, we used to store some additional stuff past the end of the tuple. The layout was: <tuple> <length of 1st string buffer>,<length of 2nd string buffer>, etc The rationale for this is that we want to reuse the buffer for min/max and grow the buffer more quickly for group_concat. This breaks down the abstraction between agg-expr and agg-node and is not something UDAs can use in general. Rather than try to hack around this, I think the proper solution is to the intermediate type not be StringValue and to contain the buffer length itself. This patch also resurrects the distinct estimate code. The distinct estimate functions exercise all of the code paths. Change-Id: Ic152a2cd03bc1713967673681e1e6204dcd80346 Reviewed-on: http://gerrit.ent.cloudera.com:8080/564 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: Nong Li <nong@cloudera.com>	2014-01-08 10:53:13 -08:00
Lenni Kuff	a2cbd2820e	Add Catalog Service and support for automatic metadata refresh The Impala CatalogService manages the caching and dissemination of cluster-wide metadata. The CatalogService combines the metadata from the Hive Metastore, the NameNode, and potentially additional sources in the future. The CatalogService uses the StateStore to broadcast metadata updates across the cluster. The CatalogService also directly handles executing metadata updates request from impalad servers (DDL requests). It exposes a Thrift interface to allow impalads to directly connect execute their DDL operations. The CatalogService has two main components - a C++ server that implements StateStore integration, Thrift service implementiation, and exporting of the debug webpage/metrics. The other main component is the Java Catalog that manages caching and updating of of all the metadata. For each StateStore heartbeat, a delta of all metadata updates is broadcast to the rest of the cluster. Some Notes On the Changes --- * The metadata is all sent as thrift structs. To do this all catalog objects (Tables/Views, Databases, UDFs) have thrift struct to represent them. These are sent with each statestore delta update. * The existing Catalog class has been seperated into two seperate sub-classes. An ImpladCatalog and a CatalogServiceCatalog. See the comments on those classes for more details. What is working: * New CatalogService created * Working with statestore delta updates and latest UDF changes * DDL performed on Node 1 is now visible on all other nodes without a "refresh". * Each DDL operation against the Catalog Service will return the catalog version that contains the change. An impalad will wait for the statestore heartbeat that contains this version before returning from the DDL comment. * All table types (Hbase, Hdfs, Views) getting their metadata propagated properly * Block location information included in CS updates and used by Impalads * Column and table stats included in CS updates and used by Impalads * Query tests are all passing Still TODO: * Directly return catalog object metadata from DDL requests * Poll the Hive Metastore to detect new/dropped/modified tables * Reorganize the FE code for the Catalog Service. I don't think we want everything in the same JAR. Change-Id: I8c61296dac28fb98bcfdc17361f4f141d3977eda Reviewed-on: http://gerrit.ent.cloudera.com:8080/601 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: Lenni Kuff <lskuff@cloudera.com>	2014-01-08 10:53:11 -08:00
Nong Li	1eb2b7a964	Add execution for vararg UDFs. Change-Id: I46e5670c09ac0b8e62f39dfc832fe880dd1dc995 Reviewed-on: http://gerrit.ent.cloudera.com:8080/572 Tested-by: jenkins Reviewed-by: Nong Li <nong@cloudera.com>	2014-01-08 10:53:09 -08:00
Alex Behm	9065648d77	Improvements to cost estimation and explain output. Fixed cost estimation of union queries and exchange nodes. Fixed propagation of stats through cloning of exprs and plan nodes. Fixed propagation of expr stats to slots they are materialized into (e.g., grouping columns in multi-level aggs). Improved explain output for constant selects. Change-Id: I96d1652c00d48e4093b85ae7fc8bad28d74b8b81 Reviewed-on: http://gerrit.ent.cloudera.com:8080/547 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Alex Behm <alex.behm@cloudera.com>	2014-01-08 10:53:08 -08:00

1 2 3 4 5 ...

403 Commits