impala

mirror of https://github.com/apache/impala.git synced 2026-01-04 00:00:56 -05:00

Author	SHA1	Message	Date
Sailesh Mukil	ed7f5ebf53	IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and have the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS permformance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Reviewed-on: http://gerrit.cloudera.org:8080/2574 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:49 -07:00
Anuj Phadke	a915293109	IMPALA-1850: Allow fs.defaultFS to be set to a non-HDFS filesystem This change whitelists the supported filesystems which can be set as Default FS for Impala to run on. This patch configures Impala to use S3 as the default filesystem, rather than a secondary filesystem as before. Change-Id: I2f45bef6c94ece634045acb906d12591587ccfed Reviewed-on: http://gerrit.cloudera.org:8080/1121 Reviewed-by: anujphadke <aphadke@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:40 -07:00
Casey Ching	074e5b4349	Remove hashbang from non-script python files Many python files had a hashbang and the executable bit set though they were not intended to be run a standalone script. That makes determining which python files are actually scripts very difficult. A future patch will update the hashbang in real python scripts so they use $IMPALA_HOME/bin/impala-python. Change-Id: I04eafdc73201feefe65b85817a00474e182ec2ba Reviewed-on: http://gerrit.cloudera.org:8080/599 Reviewed-by: Casey Ching <casey@cloudera.com> Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Internal Jenkins	2015-08-04 05:26:07 +00:00
Nong Li	fd35cee887	Reorganize/reduce end to end test time. This patch does a few things: 1) Move the metadata tests into their own folder under tests/. I think it's useful to loosely categorize them so it's easier to run a subset of the tests that are most useful for the changes you are making. 2) Reduce the test vectors for query_tests. We should have identical coverage in the daily exhaustive runs but the normal runs should be much better. In particular, deemphasizing scanner tests since that code is more stable now. 3) Misc test cleanup/consolidate python test files/etc. Change-Id: I03c2f34877aed192c2a50665bd5e15fa85e12f1e Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3831 Tested-by: jenkins Reviewed-by: Nong Li <nong@cloudera.com>	2014-08-17 12:43:57 -07:00
Lenni Kuff	bb09b5270f	IMPALA-839: Update tests to be more thorough when run exhaustively Some tests have constraints that were there only to help reduce runtime which reduces coverage when running in exhaustive mode. The majority of the constraints are because it adds no value to run the test across additional dimensions (or it is invalid to run with those dimensions). Updates the tests that have legitimate constraints to use two new helper methods for constraining the table format dimension: create_uncompressed_text_dimension() create_parquet_dimension() These will create a dimension that will produce a single test vector, either uncompressed text or parquet respectively. Change-Id: Id85387c1efd5d192f8059ef89934933389bfe247 Reviewed-on: http://gerrit.ent.cloudera.com:8080/2149 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: jenkins (cherry picked from commit e02acbd469bc48c684b2089405b4a20552802481) Reviewed-on: http://gerrit.ent.cloudera.com:8080/2290	2014-04-18 20:11:31 -07:00
Lenni Kuff	35817f6a17	Support faster DDL operations via the CatalogServer This change adds support for faster DDL via the CatalogServer by directly returning the TCatalogObject from each catalog operation and using this result to update the local impalad's catalog cache directly, rather than waiting for a state store heartbeat that contains the change. Because the Impalad's catalog can now be updated in two ways, it means that we need to be careful when applying updates to ensure no work gets "undone". For example, consider the following sequence of events: t1: [Direct Update] - Add item A - (Catalog Version 9) t2: [Direct Update] - Drop item A - (Catalog Version 10) t3: [StateStore Update] - (From Catalog Version 9) In this case, we need to ensure that the state store update in t3 does not undo the drop in t2, even though that update will contain the change to "add item A". To support this, we now check the catalog versions before adding any item to ensure that an existing item does not overwrite an item with a newer catalog version. To handle the case of removals, a new CatalogUpdateLog is introduced. This log tracks the catalog version each item was removed from the catalog. When adding a new catalog object, it is checked to see if this object was removed in a catalog version > than the version of the current object. If so, the update is ignored. This covers most updates, but there is still one concurrency issue that is not covered with this change. If someone issues an "invalidate metadata" concurrently with a direct catalog operation, it may briefly set the catalog back in time. This seems like okay behavior to me (the command is invalidating the catalog metadata). If we want to address this the CatalogUpdateLog could be extended to track additions to the catalog and we could replay the log after invalidating the metadata (as one possible solution). Change-Id: Icc9bdecc3c32436708bf9e9e7974f91d40e514f2 Reviewed-on: http://gerrit.ent.cloudera.com:8080/864 Tested-by: jenkins Reviewed-by: Lenni Kuff <lskuff@cloudera.com>	2014-01-08 10:53:58 -08:00
Henry Robinson	79b36a5eb3	IMPALA-375: Add column permutation clause to INSERT statement	2014-01-08 10:50:59 -08:00

7 Commits