Files
impala/testdata/workloads/functional-query/queries/QueryTest/insert.test
Sailesh Mukil 27815818b9 IMPALA-3452: S3: Disable Impala staging for INSERTs via flag for speedup
INSERTs on S3 are slower because of double buffering where we buffer
once locally and once in a staging directoy in S3 before moving the
file(s) to the final location. Also, moving the file from the staging
directory to the final location in HDFS is a quick rename which is
only a metadata operation. However, on S3, renames are not supported,
thus becoming a full file copy instead of just a metadata rename
operation.

This patch instroduces a boolean query option "s3_skip_insert_staging"
which avoids the staging step on S3 and allows the sinks to write to
the final location directly.

This trades in consistency for the sake of performance. If a node(s)
fails during the query, then we will end up with inconsistent results
in the final location.

P.S: This option is disabled for INSERT OVERWRITE queries as that
would require cleaning the destination directory before moving the
final files there. However, the coordinator is responsible for the
cleaning which takes place only after the table sinks have moved
the files to the final location. Thus, INSERT OVERWRITE queries must
still have their files moved to a staging location by the table sinks.

Performance gains:
 - For non-partitioned tables, the INSERT queries run 4-4.5x faster on
   S3. (Tested on a 63GB INSERT to a table)
 - For heavily partitioned tables, there is considerable improvement
   in the order of 4-5 minutes on queries that take ~27 minutes but
   queries are still slow because of IMPALA-3482 where the catalog
   takes too long to update all the metadata. (Tested with a query
   that creates 2.4K partitions in a table totalling ~19GB).

Change-Id: Iff9620d41ba0d5fb1aa0c9f4abb48866fc2b0698
Reviewed-on: http://gerrit.cloudera.org:8080/2905
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:18:00 -07:00

31 KiB