mirror of https://github.com/apache/impala.git synced 2025-12-30 03:01:44 -05:00

Lars Volker 8ea21d099f IMPALA-2523: Make HdfsTableSink aware of clustered input

IMPALA-2521 introduced clustering for insert statements. This change
makes the HdfsTableSink aware of clustered inputs, so that partitions
are opened, written, and closed one by one.
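
The open/write/close ordering described above can be illustrated with a minimal Python sketch. This is not the HdfsTableSink implementation (which is C++); the function name `write_clustered` and the event tuples are hypothetical, and the sketch assumes the input rows already arrive grouped by partition key.

```python
def write_clustered(rows, partition_key):
    """Process rows that arrive grouped by partition key.

    Hypothetical illustration: because the input is clustered, at most one
    partition is open at any time. Returns the ordered list of
    (action, key) events instead of touching a real file system.
    """
    events = []
    current = None  # key of the partition that is currently open, if any
    for row in rows:
        key = partition_key(row)
        if key != current:
            # A new key means the previous partition is complete.
            if current is not None:
                events.append(("close", current))
            events.append(("open", key))
            current = key
        events.append(("write", key))
    # Close the last partition once the input is exhausted.
    if current is not None:
        events.append(("close", current))
    return events


if __name__ == "__main__":
    # Rows are (year, month) pairs, already clustered by that pair.
    sample = [(2009, 1), (2009, 1), (2009, 2)]
    for event in write_clustered(sample, lambda r: r):
        print(event)
```

Without clustering, a sink would have to keep a writer open for every partition it has seen so far; with clustered input a single open writer is enough.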

This change also adds/modifies tests in several ways:

- clustered insert tests switch from selecting all rows from
  alltypessmall to alltypes. Together with varying settings for
  batch_size, this results in a larger number of row batches being
  written.
- clustered insert tests select from alltypes instead of
  functional.alltypes to make sure we also select from various input
  formats.
- clustered insert tests have been added to select from alltypestiny to
  create inserts with 1 and 2 rows per partition respectively.
- exhaustive insert tests now use different values for batch_size: 1,
  16, 0 (meaning default, 1024). This is limited to uncompressed parquet
  files, to maintain a reasonable runtime. On my machine, execution of
  test.insert took 1778 seconds, compared to 1002 seconds with just the
  default row batch size.
- There is additional testing in test_insert_behaviour.py to make sure
  that insertion over several row batches only creates one file per
  partition.
- It renames the test_insert method to make it unique in the file and
  allow for effective filtering with -k.
- It adds tests to the Analyzer test suite.
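
The property checked by the batch_size tests above (many row batches, still one file per partition) can be sketched as follows. This is an illustrative model only, not code from the Impala test suite; the helper names are hypothetical, and the only detail taken from the commit message is that batch_size 0 means the default of 1024.

```python
DEFAULT_BATCH_SIZE = 1024  # per the commit message, batch_size=0 means this default


def effective_batch_size(batch_size):
    """Map the batch_size setting to the number of rows per batch."""
    return DEFAULT_BATCH_SIZE if batch_size == 0 else batch_size


def split_into_batches(rows, batch_size):
    """Cut the row list into consecutive batches of the effective size."""
    size = effective_batch_size(batch_size)
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def files_per_partition(batches, partition_key):
    """Count files a clustered writer would create (hypothetical model).

    A new file is started only when the partition key changes, so batch
    boundaries inside a partition do not create additional files.
    """
    files = {}
    current = None
    for batch in batches:
        for row in batch:
            key = partition_key(row)
            if key != current:
                files[key] = files.get(key, 0) + 1
                current = key
    return files


if __name__ == "__main__":
    # Two partitions with 40 rows each, already clustered by partition.
    rows = [("a", i) for i in range(40)] + [("b", i) for i in range(40)]
    for setting in (1, 16, 0):
        batches = split_into_batches(rows, setting)
        print(setting, len(batches), files_per_partition(batches, lambda r: r[0]))
```

Smaller batch sizes raise the number of batches the writer sees, which is why they exercise the batch-boundary handling more thoroughly at the cost of a longer test run.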

Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Reviewed-on: http://gerrit.cloudera.org:8080/4863
Reviewed-by: Lars Volker <lv@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins

2016-11-22 02:51:20 +00:00

IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-22 02:51:20 +00:00

bin

IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-22 02:51:20 +00:00

cmake_modules

IMPALA-3676: Use clang as a static analysis tool

2016-11-04 00:13:12 +00:00

common

IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-22 02:51:20 +00:00

docs

IMPALA-3398: Add docs to main Impala branch.

2016-11-17 22:38:44 +00:00

ext-data-source

IMPALA-4259: build Impala without any test cluster setup.

2016-10-13 05:45:47 +00:00

IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-22 02:51:20 +00:00

infra

IMPALA-3872: allow providing PyPi mirror for python packages

2016-11-08 05:34:50 +00:00

shell

IMPALA-3713,IMPALA-4439: Fix Kudu DML shell reporting

2016-11-17 04:13:25 +00:00

ssh_keys

Move ssh keys from bin directory to fix packaging build break

2014-01-08 10:44:12 -08:00

testdata

IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-22 02:51:20 +00:00

tests

IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-22 02:51:20 +00:00

www

IMPALA-1169: Admission control info on the queries debug webpage

2016-11-07 23:26:02 +00:00

.clang-format

Match .clang-format more closely to actual practice.

2016-10-14 00:08:17 +00:00

.clang-tidy

IMPALA-3676: Use clang as a static analysis tool

2016-11-04 00:13:12 +00:00

.gitignore

Remove vim plugin config file from .gitignore

2016-10-31 22:44:54 +00:00

buildall.sh

IMPALA-3676: Use clang as a static analysis tool

2016-11-04 00:13:12 +00:00

CMakeLists.txt

IMPALA-3676: Use clang as a static analysis tool

2016-11-04 00:13:12 +00:00

DISCLAIMER

IMPALA-3808: Add incubating DISCLAIMER from the Incubator Branding Guide

2016-09-02 02:12:45 +00:00

EXPORT_CONTROL.md

IMPALA-4406: Add cryptography export control notice

2016-11-04 18:26:40 +00:00

LICENSE.txt

IMPALA-4230: ASF policy issues from 2.7.0 rc3.

2016-10-19 23:59:02 +00:00

LOGS.md

Consolidate test and cluster logs under a single directory.

2016-03-28 19:23:22 +00:00

NOTICE.txt

IMPALA-3918: Remove Cloudera copyrights and add ASF license header

2016-08-09 08:19:41 +00:00

README.md

IMPALA-4406: Add cryptography export control notice

2016-11-04 18:26:40 +00:00

README.md

Welcome to Impala

Lightning-fast, distributed SQL queries for petabytes of data stored in Apache Hadoop clusters.

Impala is a modern, massively-distributed, massively-parallel, C++ query engine that lets you analyze, transform and combine data from a variety of data sources:

- Best of breed performance and scalability.
- Support for data stored in HDFS, Apache HBase and Amazon S3.
- Wide analytic SQL support, including window functions and subqueries.
- On-the-fly code generation using LLVM to generate CPU-efficient code tailored specifically to each individual query.
- Support for the most commonly-used Hadoop file formats, including the Apache Parquet (incubating) project.
- Apache-licensed, 100% open source.

More about Impala

To learn more about Impala as a business user, or to try Impala live or in a VM, please visit the Impala homepage.

If you are interested in contributing to Impala as a developer, or learning more about Impala's internals and architecture, visit the Impala wiki.

Supported Platforms

Impala only supports Linux at the moment.

Build Instructions

./buildall.sh -notests

Export Control Notice

This distribution uses cryptographic software and may be subject to export controls. Please refer to EXPORT_CONTROL.md for more information.
