Now that IMPALA-4735 has been fixed, this suffix is no longer needed
to make test names prefix-free.
Change-Id: Ie63d145c94c02ec67e81d0c51a33d20685fba73e
Reviewed-on: http://gerrit.cloudera.org:8080/5886
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
This patch addresses warning messages from pytest about the imported
TestMatrix, TestVector, and TestDimension classes, which were being
collected as potential test classes. The fix is simply to prefix the
class names with 'Impala':
git grep -l 'TestDimension' | xargs \
sed -i 's/TestDimension/ImpalaTestDimension/g'
git grep -l 'TestMatrix' | xargs \
sed -i 's/TestMatrix/ImpalaTestMatrix/g'
git grep -l 'TestVector' | xargs \
sed -i 's/TestVector/ImpalaTestVector/g'
The tests all passed in an exhaustive run on the upstream Jenkins
server:
http://jenkins.impala.io:8080/view/Utility/job/pre-review-test/8/
Change-Id: I06b7bc6fd99fbb637a47ba376bf9830705c1fce1
Reviewed-on: http://gerrit.cloudera.org:8080/5794
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
IMPALA-2521 introduced clustering for insert statements. This change
makes the HdfsTableSink aware of clustered inputs, so that partitions
are opened, written, and closed one by one.
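A minimal sketch of the idea (Python pseudocode; the real sink is C++,
and the helper names are illustrative):

  # With clustered input, rows arrive grouped by partition, so at most
  # one partition writer needs to be open at any time.
  def consume_clustered(row_batches):
      current_key, writer = None, None
      for batch in row_batches:
          for row in batch:
              key = partition_key(row)  # hypothetical helper
              if key != current_key:
                  if writer is not None:
                      writer.close()  # partition is complete; close it
                  writer = open_partition_writer(key)  # hypothetical helper
                  current_key = key
              writer.write(row)
      if writer is not None:
          writer.close()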
This change also adds/modifies tests in several ways:
- clustered insert tests switch from selecting all rows from
alltypessmall to alltypes. Together with varying settings for
batch_size, this results in a larger number of row batches being
written.
- clustered insert tests select from alltypes instead of
functional.alltypes to make sure we also select from various input
formats.
- clustered insert tests have been added to select from alltypestiny to
create inserts with 1 and 2 rows per partition respectively.
- exhaustive insert tests now use different values for batch_size: 1,
16, and 0 (meaning the default, 1024). This is limited to uncompressed
parquet files to maintain a reasonable runtime. On my machine, execution
of test_insert took 1778 seconds, compared to 1002 seconds with just the
default row batch size.
- There is additional testing in test_insert_behaviour.py to make sure
that insertion over several row batches only creates one file per
partition.
- It renames the test_insert method to make it unique in the file and
allow for effective filtering with -k.
- It adds tests to the Analyzer test suite.
Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Reviewed-on: http://gerrit.cloudera.org:8080/4863
Reviewed-by: Lars Volker <lv@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace the existing license text with, or add, the ASF license
text given on the website.
Much of this change was automatically generated via:
git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Many of our test scripts have import statements of the form
"from xxx import *". It is better practice to explicitly name what
needs to be imported; this commit does so and also removes unused
import statements.
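An illustrative before/after of the style change (module and class
names are examples):

  # Before: pulls every public name into the module namespace.
  # from tests.common.impala_test_suite import *

  # After: imports are explicit, so provenance is clear and unused
  # names are easy to spot.
  from tests.common.impala_test_suite import ImpalaTestSuite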
Change-Id: I6a33bb66552ae657d1725f765842f648faeb26a8
Reviewed-on: http://gerrit.cloudera.org:8080/3444
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Internal Jenkins
The text scanner had some pathological behaviour when reading
significantly past the end of its scan range, e.g. reading a 256MB
string that's split across blocks. ScannerContext defaulted to issuing
1KB reads, even if the scan node requested significantly more data;
e.g. if the Parquet scanner called ReadBytes(16MB), this was chopped up
into 1KB reads, which were reassembled in boundary_buffer_.
Increase the minimum read size in this case to 64KB. Reading that
amount of data should not add any significant overhead even if we only
read a few bytes past the end of the scan range.
ScannerContext now implements a saner default algorithm that works
better when scanners make many small reads: it starts with 64KB reads
and doubles the size of each successive read past the end of the scan
range. We also correctly pass 'read_past_size' into GetNextBuffer(), so
that we always read the right amount of data.
Also save some time by pre-sizing the boundary buffer to the correct
size instead of reallocating it multiple times.
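A sketch of the resulting read-past-end policy (Python pseudocode; the
real logic lives in the C++ ScannerContext, and the function shown is
an illustration):

  DEFAULT_READ_PAST_SIZE = 64 * 1024  # 64KB floor for reads past the range end

  def read_past_size(scanner_hint, prev_size):
      # Honor an explicit hint from the scanner (e.g. Parquet's
      # ReadBytes(16MB)) so the whole request is read in one go;
      # otherwise start at 64KB and double on each successive read
      # past the end of the scan range.
      if scanner_hint > 0:
          return scanner_hint
      return DEFAULT_READ_PAST_SIZE if prev_size == 0 else prev_size * 2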
Testing:
Add a test case that exercises the code paths for very large strings.
Performance:
The queries in the test case are vastly faster than before. E.g. 0.6s
vs ~60s for the count(*) query.
Change-Id: Id90c5dea44f07dba5dd465cf325fbff28be34137
Reviewed-on: http://gerrit.cloudera.org:8080/3518
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.
S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.
The FinalizeSuccessfulInsert() function no longer makes assumptions
about the underlying filesystem and works across all supported
filesystems. This is done by adding a full URI field to the base
directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class no longer assumes a single filesystem and gets
connections to filesystems based on the URI of the file it is
operating on.
Added a python S3 client, 'boto3', to access S3 from the python
tests. A new class called S3Client is introduced which wraps the boto3
functions and has the same function signatures as PyWebHdfsClient, by
deriving from an abstract base class BaseFileSystem, so that the two
can be used interchangeably through a 'generic_client'. test_load.py is
refactored to use this generic client. The ImpalaTestSuite setup
creates a client according to the TARGET_FILESYSTEM environment
variable and assigns it to the 'generic_client'.
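A rough sketch of the abstraction (the method shown and its body are
assumptions modeled on the description, not the actual interface):

  class BaseFileSystem(object):
      """Common interface so tests can drive any filesystem through a
      'generic_client'."""
      def create_file(self, path, file_data):
          raise NotImplementedError

  class S3Client(BaseFileSystem):
      def __init__(self, bucket_name):
          import boto3  # the S3 library named above
          self.bucket = boto3.resource('s3').Bucket(bucket_name)

      def create_file(self, path, file_data):
          # Same signature as PyWebHdfsClient.create_file(path, file_data).
          self.bucket.put_object(Key=path, Body=file_data)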
P.S.: Currently test_load.py runs 4x slower on S3 than on HDFS;
performance needs to be improved in future patches. INSERT performance
is also slower than on HDFS, mainly because of the extra copy that
happens between staging and the final location of a file. However,
larger INSERTs come closer to HDFS performance than smaller ones.
ACLs are not handled for S3 in this patch; how to support them still
needs to be discussed before implementing.
Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Reviewed-on: http://gerrit.cloudera.org:8080/2574
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
test_insert_wide_table, when run in parallel, attempts to DROP/CREATE
tables with the same name in the same database. A race thus exists in
which one test can stomp on another's tables.
The fix is to use the unique_database fixture, in which each test can
run in parallel within its own database, and thus its own table.
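Illustrative use of the fixture (the test body is a simplified
assumption, not the actual test):

  # 'unique_database' hands each test its own scratch database, so
  # parallel runs of the same test no longer race on a shared table name.
  def test_insert_wide_table(self, unique_database):
      self.client.execute("create table %s.wide_tbl (c0 int)" % unique_database)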
Change-Id: If3b22f1b089d20c6024d6d680af82f102342394e
Reviewed-on: http://gerrit.cloudera.org:8080/2871
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Michael Brown <mikeb@cloudera.com>
Allow Impala to start with only a running HMS (and no additional
services like HDFS, HBase, Hive, YARN) and use the local file system.
Skip all tests that need these services, that use HDFS caching, or that
assume multiple impalads are running.
To run Impala with the local filesystem, set TARGET_FILESYSTEM to 'local' and
WAREHOUSE_LOCATION_PREFIX to a location on the local filesystem where the current user has
permissions since this is the location where the test data will be extracted.
Test coverage (with core strategy) in comparison with HDFS and S3:
HDFS 1348 tests passed
S3 1157 tests passed
Local Filesystem 1161 tests passed
Change-Id: Ic9718c7e0307273382b1cc6baf203ff2fb2acd03
Reviewed-on: http://gerrit.cloudera.org:8080/1352
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Readability: Alex Behm <alex.behm@cloudera.com>
Many python files had a hashbang and the executable bit set even
though they were not intended to be run as standalone scripts. That
made it very difficult to determine which python files are actually
scripts.
A future patch will update the hashbang in real python scripts so they
use $IMPALA_HOME/bin/impala-python.
Change-Id: I04eafdc73201feefe65b85817a00474e182ec2ba
Reviewed-on: http://gerrit.cloudera.org:8080/599
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
This patch encapsulates pytest's skipif markers in classes, which
leads to the following benefits:
- Provide context and grouping for tests being skipped.
- As we improve test reporting, annotations will give us a better idea of coverage.
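A minimal sketch of the pattern (the condition and names are
assumptions, not the actual markers):

  import os
  import pytest

  class SkipIfS3:
      # Named markers document *why* a test is skipped and give one
      # place to update the condition.
      insert = pytest.mark.skipif(
          os.environ.get('TARGET_FILESYSTEM') == 's3',
          reason='INSERT not supported on S3 yet')

  @SkipIfS3.insert
  def test_insert_something():
      pass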
Change-Id: Ib0557fb78c873047c214bb62bb6b045ceabaf0c9
Reviewed-on: http://gerrit.cloudera.org:8080/297
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
Reviewed-on: http://gerrit.cloudera.org:8080/343
Add skip markers for S3 that categorize the tests skipped against S3,
to help see what coverage is missing. Soon we'll be reworking some
tests and/or adding new tests to fill the important gaps.
Also, add a mechanism to parameterize paths in the .test files, and
start using these new variables. This is a step toward enabling some
more tests against S3.
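A sketch of how such path variables might be expanded (the variable
name is an assumption for illustration):

  def expand_path_vars(section_text, filesystem_prefix):
      # e.g. '$FILESYSTEM_PREFIX/test-warehouse/tbl' becomes
      # 's3a://my-bucket/test-warehouse/tbl' when running against S3.
      return section_text.replace('$FILESYSTEM_PREFIX', filesystem_prefix)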
Finally, a fix for buildall.sh to stop the minicluster before applying
the metastore snapshot. Otherwise this step fails, since the metastore
DB is in use.
Change-Id: I142434ed67bed407e61d7b2c90f825734fc0dce0
Reviewed-on: http://gerrit.cloudera.org:8080/127
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
This patch introduces a new pytest marker that skips tests that
currently don't work when S3 is used as the underlying file system. The
set of blacklisted tests is a superset of the tests that cannot be run
against S3; follow-up patches will remove some of the test files from
the blacklist.
Change-Id: I39a58223d3435f0bd6496ffd00a2d483b751693d
Reviewed-on: http://gerrit.cloudera.org:8080/82
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
Introduces support for writing tables stored as Avro files. This supports writing all
data types except TIMESTAMP. Supports the following COMPRESSION_CODECs: NONE, DEFLATE,
SNAPPY.
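Illustrative usage of the option (the 'client' handle and table names
are hypothetical):

  client.execute("SET COMPRESSION_CODEC=SNAPPY")
  client.execute("INSERT INTO avro_tbl SELECT * FROM src_tbl")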
Change-Id: Ica62063a4f172533c30dd1e8b0a11856da452467
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3863
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 15c6066d05d5077bee0d5123d26777b0715eb9c6)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4056
This supports both uncompressed and block-compressed formats;
row-compressed formats are not supported. The type of compression is
specified using the query parameter COMPRESSION_CODEC, with values
NONE, GZIP, BZIP2, and SNAPPY.
Note: this patch only has basic testing. More extensive testing will be done when this
avro writer is used in data loading.
Change-Id: Id284bd4f3a28e27e49d56b1127cdc83c736feb61
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3541
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
This patch does a few things:
1) Move the metadata tests into their own folder under tests/. I think it's useful to
loosely categorize them so it's easier to run a subset of the tests that are most
useful for the changes you are making.
2) Reduce the test vectors for query_tests. We should have identical
coverage in the daily exhaustive runs, but the normal runs should be
much faster. In particular, this de-emphasizes scanner tests, since
that code is more stable now.
3) Misc test cleanup/consolidate python test files/etc.
Change-Id: I03c2f34877aed192c2a50665bd5e15fa85e12f1e
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3831
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
Realized that if we run it with text/codec it will successfully insert
a record ("hi", "there") into tinytable, which is undesirable. Skip
this test for compressed text.
Change-Id: I8076f4fcac17d7f7530ac4993b0d4b3bd89d5fa0
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3668
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Previously the test expected that inserting with any text/codec
combination where codec != none would fail. With the compressed-text
read patch, such an insert succeeds by inserting uncompressed text,
ignoring the compression type. This is a temporary fix to keep this
(code coverage) test passing until the text writers are in.
Change-Id: Ib34bf661ee90e3bcbb76a225df8809ae13e835fa
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3659
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: jenkins
For example, you can now do something like:
result_set = execute("select * from tbl")
result_row = result_set[0]
result_row['col_alias'] or result_row[4]
to access column values. If the column alias/position does not exist an exception is
thrown.
Change-Id: Ie4b65619ed17fd90bf39e0966a7fc7e1180dbc5c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2719
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2922
For wide Avro tables, ReadZLong() would get inlined many times into a
single function body, causing LLVM to crash. Not inlining doesn't seem
to have a performance impact on narrow tables, and helps with wide
tables.
This change also adds tests over wide (i.e. many-column) tables. The
test tables are produced by specifying shell commands to generate test
tables in functional_schema_template.sql, which are executed in
generate-schema-statements.py. In the SQL templates, sections starting
with a ` are treated as shell commands. The output of the shell
command is then used as the section text. This is only a starting
point; it isn't currently implemented for all sections, and may have
to be tweaked if we use this mechanism for all tables.
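A sketch of the convention (the resolver function is an assumption,
though subprocess-based expansion matches the description above):

  import subprocess

  def resolve_section(section_text):
      # Sections starting with a backtick are shell commands; their
      # stdout becomes the section text used by
      # generate-schema-statements.py.
      if section_text.startswith('`'):
          return subprocess.check_output(section_text[1:], shell=True)
      return section_text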
Change-Id: Ife0d857d19b21534167a34c8bc06bc70bef34910
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2206
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
(cherry picked from commit 1c5951e3cce25a048208ab9bb3a3aed95e41cf67)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2353
Tested-by: jenkins
Partition column expressions are analysed twice for INSERT statements -
once to infer the type and so add a possible cast, and once to compute
stats on the resulting expr. However, this process resulted in a
partition column expr that was an IntLiteral getting the smallest type
that would contain its value, rather than retaining the
column-compatible type that had been assigned to it.
This patch does the minimum thing, which is make IntLiteral.analyze()
idempotent. Doing the same thing to Expr and LiteralExpr unearths some
other bugs, which we will have to fix in a follow-on patch (see
IMPALA-884).
Change-Id: Ie22fc5d3f4832c735a1ebc0ef78f50d736f597fd
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1931
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 1912d65ea21a5025d385948642f0d4aadad91abf)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1947
This is because in HdfsTable we call "expr.castTo(colType)", but
BooleanLiteral (incorrectly) didn't implement "uncheckedCastTo()". This
meant that instead of a BooleanLiteral being returned, we got back a
CastExpr, which cannot be cast to LiteralExpr.
While making this change, it turned out that boolean partition columns
are also broken in Hive. I filed HIVE-6590 for these issues, and we
decided to disable INSERT into a boolean partition column in Impala
due to this bug.
Change-Id: I3e295bb96aadc08d64faf551f6393a7128a7ef27
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1755
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
This change adds support for cluster-synchronized catalog operations.
This provides the guarantee that after a catalog op completes, all
other subscribers to the catalog topic have also processed that update.
This is useful when load balancing, because a common workflow is to
target a different impalad for each statement executed.
For example if each of the following were executed sequentially, but targeting
a different node:
1) CREATE TABLE Foo
2) INSERT INTO Foo
3) SELECT * FROM Foo
4) INSERT INTO Foo ....
Since both the INSERT and the CREATE update the catalog, it would not work as expected
without this patch. The user might either get a "table not found" error or would be
missing partition information from the INSERT.
The downside is that this approach to DDL takes a bit longer because we need to wait
until all subscribers have processed an update. If all nodes are healthy, this overhead
should not be significantly longer than the current DDL time. However, a single bad node
might slow down or completely block the completion of all DDL operations. By default
this feature is disabled, but it can be enabled using a new query option: SYNCED_DDL=1
To test this, the base test suite was updated to support selecting a random impalad
to execute each query section in a query test file. This is currently only enabled
for the insert and DDL tests, but could be leveraged by more tests in the future.
TODO: Add additional failure tests around this functionality.
TODO: Add an explicit "sync" statement so users do not need to run all their DDL
in this mode (since it is slower).
Change-Id: I45e757a931bf2a4740cc0cdd1e76ce49a1e22b83
Reviewed-on: http://gerrit.ent.cloudera.com:8080/899
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
This change adds support for faster DDL via the CatalogServer by directly
returning the TCatalogObject from each catalog operation and using this result
to update the local impalad's catalog cache directly, rather than waiting
for a state store heartbeat that contains the change.
Because the Impalad's catalog can now be updated in two ways, it means that
we need to be careful when applying updates to ensure no work gets "undone".
For example, consider the following sequence of events:
t1: [Direct Update] - Add item A - (Catalog Version 9)
t2: [Direct Update] - Drop item A - (Catalog Version 10)
t3: [StateStore Update] - (From Catalog Version 9)
In this case, we need to ensure that the state store update in t3 does not undo the
drop in t2, even though that update will contain the change to "add item A".
To support this, we now check the catalog versions before adding any item to ensure
that an existing item does not overwrite an item with a newer catalog version.
To handle the case of removals, a new CatalogUpdateLog is introduced.
This log tracks the catalog version in which each item was removed from
the catalog. When adding a new catalog object, we check whether it was
removed in a catalog version greater than the version of the incoming
object. If so, the update is ignored.
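A sketch of the two checks described above (Python pseudocode; the real
logic is in the impalad catalog cache):

  def apply_add(cache, deletion_log, obj):
      existing = cache.get(obj.name)
      # Never let an older object overwrite a newer one.
      if existing is not None and existing.catalog_version >= obj.catalog_version:
          return
      # Ignore adds for objects dropped in a later catalog version.
      if deletion_log.get(obj.name, -1) > obj.catalog_version:
          return
      cache[obj.name] = obj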
This covers most updates, but there is still one concurrency issue that is not covered
with this change. If someone issues an "invalidate metadata" concurrently with a
direct catalog operation, it may briefly set the catalog back in time. This seems like
okay behavior to me (the command is invalidating the catalog metadata). If we want
to address this the CatalogUpdateLog could be extended to track additions to the catalog
and we could replay the log after invalidating the metadata (as one possible solution).
Change-Id: Icc9bdecc3c32436708bf9e9e7974f91d40e514f2
Reviewed-on: http://gerrit.ent.cloudera.com:8080/864
Tested-by: jenkins
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
The Impala CatalogService manages the caching and dissemination of cluster-wide metadata.
The CatalogService combines the metadata from the Hive Metastore, the NameNode,
and potentially additional sources in the future. The CatalogService uses the
StateStore to broadcast metadata updates across the cluster.
The CatalogService also directly handles metadata update requests from
impalad servers (DDL requests). It exposes a Thrift interface to allow
impalads to connect directly and execute their DDL operations.
The CatalogService has two main components: a C++ server that
implements StateStore integration, the Thrift service implementation,
and the export of the debug webpage/metrics; and the Java Catalog that
manages the caching and updating of all the metadata. For each
StateStore heartbeat, a delta of all metadata updates is broadcast to
the rest of the cluster.
Some Notes On the Changes
---
* The metadata is all sent as thrift structs. To do this, all catalog
objects (Tables/Views, Databases, UDFs) have a thrift struct to
represent them. These are sent with each statestore delta update.
* The existing Catalog class has been separated into two separate
sub-classes, ImpaladCatalog and CatalogServiceCatalog. See the comments
on those classes for more details.
What is working:
* New CatalogService created
* Working with statestore delta updates and latest UDF changes
* DDL performed on Node 1 is now visible on all other nodes without a "refresh".
* Each DDL operation against the Catalog Service returns the catalog
version that contains the change. An impalad will wait for the
statestore heartbeat that contains this version before returning from
the DDL command (see the sketch after this list).
* All table types (Hbase, Hdfs, Views) getting their metadata propagated properly
* Block location information included in CS updates and used by Impalads
* Column and table stats included in CS updates and used by Impalads
* Query tests are all passing
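A sketch of the wait described in the DDL bullet above (Python
pseudocode; the names are illustrative, not the actual API):

  import time

  def wait_for_catalog_version(local_catalog, target_version, poll_s=0.1):
      # The local version advances as statestore delta updates are
      # applied; return once the update containing the DDL's change
      # has arrived.
      while local_catalog.current_version() < target_version:
          time.sleep(poll_s)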
Still TODO:
* Directly return catalog object metadata from DDL requests
* Poll the Hive Metastore to detect new/dropped/modified tables
* Reorganize the FE code for the Catalog Service. I don't think we want everything in the
same JAR.
Change-Id: I8c61296dac28fb98bcfdc17361f4f141d3977eda
Reviewed-on: http://gerrit.ent.cloudera.com:8080/601
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
Split out the encoder/type for parquet reader/writer. I think this puts us
in a better place to support future encodings.
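For reference, a minimal sketch of dictionary encoding, the technique
behind the size wins below (Python; the real encoder is C++):

  def dict_encode(values):
      dictionary, codes, index = [], [], {}
      for v in values:
          if v not in index:
              index[v] = len(dictionary)
              dictionary.append(v)
          codes.append(index[v])
      # Low-cardinality columns (l_discount, l_linenumber, ...) shrink
      # a lot; near-unique ones (l_comment) do not benefit.
      return dictionary, codes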
On the tpch lineitem table, the results are:
Before:
BytesWritten: 236.45 MB
Per Column Sizes:
l_comment: 75.71 MB
l_commitdate: 8.64 MB
l_discount: 11.19 MB
l_extendedprice: 33.02 MB
l_linenumber: 4.56 MB
l_linestatus: 869.98 KB
l_orderkey: 8.99 MB
l_partkey: 27.02 MB
l_quantity: 11.58 MB
l_receiptdate: 8.65 MB
l_returnflag: 1.40 MB
l_shipdate: 8.65 MB
l_shipinstruct: 1.45 MB
l_shipmode: 2.17 MB
l_suppkey: 21.91 MB
l_tax: 10.68 MB
After:
BytesWritten: 198.63 MB (84%)
Per Column Sizes:
l_comment: 75.71 MB (100%)
l_commitdate: 8.64 MB (100%)
l_discount: 2.89 MB (25.8%)
l_extendedprice: 33.13 MB (100.33%)
l_linenumber: 1.50 MB (32.89%)
l_linestatus: 870.26 KB (100.032%)
l_orderkey: 9.18 MB (102.11%)
l_partkey: 27.10 MB (100.29%)
l_quantity: 4.32 MB (37.31%)
l_receiptdate: 8.65 MB (100%)
l_returnflag: 1.40 MB (100%)
l_shipdate: 8.65 MB (100%)
l_shipinstruct: 1.45 MB (100%)
l_shipmode: 2.17 MB (100%)
l_suppkey: 10.11 MB (46.14%)
l_tax: 2.89 MB (27.06%)
The table is overall 84% as big (i.e. 16% smaller). A few columns got marginally
bigger. If the file filled the 1 GB, I'd expect the overhead to decrease even
more.
The restructuring to use a virtual call doesn't seem to change things much and
will go away when we codegen the scanner.
Here's what the query times look like with this patch (note this is on
the before data files, so only string cols are dictionary encoded).
Before query times:
Insert Time: 8.5 sec
select *: 2.3 sec
select avg(l_orderkey): .33 sec
After query times:
Insert Time: 9.5 sec <-- Longer due to doing dictionary encoding
select *: 2.4 sec <-- kind of noisy, possibly a slight slow down
select avg(l_orderkey): .33 sec
Change-Id: I213fdca1bb972cc200dc0cd9fb14b77a8d36d9e6
Reviewed-on: http://gerrit.ent.cloudera.com:8080/238
Tested-by: jenkins <kitchen-build@cloudera.com>
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
With this change the Python tests are now called as part of buildall
and the corresponding Java tests have been disabled. The new tests can
also be invoked by calling ./tests/run-tests.sh directly.
This includes a fix from Nong for a bug that caused wrong results for
limit on non-io-manager formats.
This is the first set of changes required to start moving our
functional test infrastructure from JUnit to Python. After
investigating a number of options, I decided to go with a python test
executor named py.test (http://pytest.org/). It is very flexible, open
source (MIT licensed), and will enable us to do some cool things like
parallel test execution.
As part of this change, we now use our "test vectors" for query test
execution. This is very nice because it means that if you load the
"core" dataset, you know you will be able to run the "core" query tests
(specified by --exploration_strategy when running the tests).
You will see that each combination of table format + query exec options
is now treated as an individual test case. This will make it much
easier to debug exactly where something failed.
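As an illustration of the idea (plain pytest parametrization shown for
clarity; the actual harness uses its own test-vector mechanism):

  import pytest

  @pytest.mark.parametrize('file_format', ['text', 'seq', 'parquet'])
  @pytest.mark.parametrize('batch_size', [0, 1, 16])
  def test_query(file_format, batch_size):
      # Each (file_format, batch_size) combination is reported as its
      # own test case, so failures pinpoint the exact configuration.
      pass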
These new tests can be run using the script at tests/run-tests.sh