mirror of https://github.com/apache/impala.git synced 2026-01-07 00:02:28 -05:00

Zoltan Borok-Nagy f8015ff68d IMPALA-9512: Full ACID Milestone 2: Validate rows against the valid write id list

Minor compactions can compact several delta directories into a single
delta directory. The current directory filtering algorithm had to be
modified to handle minor compacted directories and prefer those over
plain delta directories. This happens in the Frontend, mostly in
AcidUtils.java.
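
The preference for compacted directories can be illustrated with a minimal sketch. This is not the code in AcidUtils.java: the names `Delta`, `parse_delta` and `filter_deltas` are hypothetical, and the sketch assumes a simplified rule where a delta directory is dropped whenever its write id range is fully covered by an already selected, wider delta. Statement-id suffixes and base directories are ignored.

```python
import re
from typing import List, NamedTuple, Optional


class Delta(NamedTuple):
    """Hypothetical parsed form of a delta directory name."""
    name: str
    min_write_id: int
    max_write_id: int


# delta_<minWriteId>_<maxWriteId> with an optional _v<visibilityTxnId> suffix.
_DELTA_RE = re.compile(r"^delta_(\d+)_(\d+)(?:_v\d+)?$")


def parse_delta(name: str) -> Optional[Delta]:
    m = _DELTA_RE.match(name)
    if m is None:
        return None
    return Delta(name, int(m.group(1)), int(m.group(2)))


def filter_deltas(names: List[str]) -> List[str]:
    """Keep the widest deltas and drop the ones they fully cover."""
    deltas = [d for d in map(parse_delta, names) if d is not None]
    # Smallest min write id first; for equal min, the widest range first,
    # so a compacted delta is seen before the plain deltas it replaces.
    deltas.sort(key=lambda d: (d.min_write_id, -d.max_write_id))
    selected: List[Delta] = []
    covered_up_to = 0
    for d in deltas:
        if d.max_write_id <= covered_up_to:
            continue  # range already covered by a selected delta
        selected.append(d)
        covered_up_to = d.max_write_id
    return [d.name for d in selected]
```

With the directories `delta_0000001_0000001`, `delta_0000002_0000002` and `delta_0000001_0000002_v0000010`, only the last one survives, because it covers the write id ranges of the other two.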

Hive Streaming Ingestion writes similar delta directories, but they
might contain rows Impala cannot see based on its valid write id list.

E.g. we can have the following delta directory:

full_acid/delta_0000001_0000010/0000 # minWriteId: 1
                                     # maxWriteId: 10

This delta dir contains rows with write ids between 1 and 10. But our
valid write id list might only allow us to see write ids less than 5.
Therefore we need to check the ACID write id column (named
originalTransaction) to determine which rows are valid.

Delta directories written by Hive Streaming don't have a visibility txn
id, so we can recognize them based on the directory name. If there's
a visibilityTxnId and it is committed => every row is valid:

full_acid/delta_0000001_0000010_v01234 # has visibilityTxnId
                                       # every row is valid
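
A minimal sketch of the name-based check described above. The function names are hypothetical, and the sketch assumes the check that the visibility txn is committed happens elsewhere:

```python
import re
from typing import Optional

# delta_<minWriteId>_<maxWriteId> with an optional _v<visibilityTxnId> suffix.
_DELTA_RE = re.compile(r"^delta_(\d+)_(\d+)(?:_v(\d+))?$")


def visibility_txn_id(dir_name: str) -> Optional[int]:
    """Return the visibilityTxnId encoded in the name, or None if absent."""
    m = _DELTA_RE.match(dir_name)
    if m is None:
        raise ValueError("not a delta directory: %s" % dir_name)
    return int(m.group(3)) if m.group(3) is not None else None


def needs_row_validation(dir_name: str) -> bool:
    """Directories without a visibilityTxnId need their rows validated."""
    return visibility_txn_id(dir_name) is None
```

For `delta_0000001_0000010_v01234` the sketch reports visibilityTxnId 1234 and no row validation; for `delta_0000001_0000010` it reports that rows must be validated.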

If there's no visibilityTxnId then it was created via Hive Streaming,
therefore we need to validate rows. Fortunately Hive Streaming writes
rows with different write ids into different ORC stripes, so we don't
need to validate the write id per row. If we had statistics, we could
validate per stripe, but since Hive Streaming doesn't write statistics
we validate the write id per ORC row batch. An alternative would be a
2-pass read: first read a single value from each stripe's
'currentTransaction' field, then read the stripe only if the write id
is valid.
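
The per-row-batch validation can be sketched as follows. This is an illustration, not the backend implementation: `ValidWriteIds`, `RowBatch` and `visible_rows` are hypothetical names, the valid write id list is modelled as a high watermark plus a set of invalid (open or aborted) write ids, and every row batch is assumed to come from a single stripe and thus to carry a single write id.

```python
from typing import FrozenSet, Iterable, List, NamedTuple


class ValidWriteIds(NamedTuple):
    """Hypothetical, simplified model of a valid write id list."""
    high_watermark: int
    invalid_ids: FrozenSet[int]  # open or aborted write ids

    def is_valid(self, write_id: int) -> bool:
        return (write_id <= self.high_watermark
                and write_id not in self.invalid_ids)


class RowBatch(NamedTuple):
    """Hypothetical row batch: the write id column plus the row payloads."""
    write_ids: List[int]
    rows: List[str]


def visible_rows(batches: Iterable[RowBatch],
                 valid: ValidWriteIds) -> List[str]:
    """Keep or drop whole batches based on the first write id in each."""
    out: List[str] = []
    for batch in batches:
        if not batch.rows:
            continue
        # Assumption: all rows of a batch share one write id, so checking
        # the first value decides the whole batch.
        if valid.is_valid(batch.write_ids[0]):
            out.extend(batch.rows)
    return out
```

With a high watermark of 4 (only write ids less than 5 are visible), a batch with write id 7 is dropped as a whole, while batches with write ids 1 and 4 are kept.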

Testing
 * the frontend logic is tested in AcidUtilsTest
 * the backend row validation is tested in test_acid_row_validation

Change-Id: I5ed74585a2d73ebbcee763b0545be4412926299d
Reviewed-on: http://gerrit.cloudera.org:8080/15818
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>

2020-05-20 21:00:44 +00:00

IMPALA-9512: Full ACID Milestone 2: Validate rows against the valid write id list

2020-05-20 21:00:44 +00:00

bin

IMPALA-9708: Remove Sentry support

2020-05-20 17:43:40 +00:00

cmake_modules

IMPALA-9335 (part 2): Fix rebased KRPC to compile

2020-02-04 23:03:58 +00:00

common

IMPALA-9512: Full ACID Milestone 2: Validate rows against the valid write id list

2020-05-20 21:00:44 +00:00

docker

IMPALA-9679: Remove some jars from Docker images

2020-05-16 22:39:40 +00:00

docs

IMPALA-9541: [DOCS] add steps to dynamically change log levels

2020-05-14 02:47:18 +00:00

ext-data-source

IMPALA-9008: Serialize Maven invocations to deflake impala-minimal-hive-exec

2019-10-09 15:58:04 +00:00

IMPALA-9512: Full ACID Milestone 2: Validate rows against the valid write id list

2020-05-20 21:00:44 +00:00

impala-parent

IMPALA-9708: Remove Sentry support

2020-05-20 17:43:40 +00:00

infra

IMPALA-9708: Remove Sentry support

2020-05-20 17:43:40 +00:00

lib/python

IMPALA-9157: Make helper function exec_local_command python 2.6 compatible

2019-11-25 22:52:15 +00:00

query-event-hook-api

IMPALA-9008: Serialize Maven invocations to deflake query-event-hook-api

2019-10-15 10:58:33 +00:00

security

KUDU-2305: Limit sidecars to INT_MAX and fortify socket code

2018-03-22 10:27:08 +00:00

shaded-deps

IMPALA-9548: UdfExecutorTest failures after HIVE-22893

2020-03-26 00:48:46 +00:00

shell

Revert "IMPALA-9718: Delete pkg_resources from IMPALA_HOME/shell/"

2020-05-07 23:15:32 +00:00

ssh_keys

Move ssh keys from bin directory to fix packaging build break

2014-01-08 10:44:12 -08:00

testdata

IMPALA-9512: Full ACID Milestone 2: Validate rows against the valid write id list

2020-05-20 21:00:44 +00:00

tests

IMPALA-9512: Full ACID Milestone 2: Validate rows against the valid write id list

2020-05-20 21:00:44 +00:00

www

IMPALA-6360: Don't show full query statement on Impala WebUI by default

2020-03-23 17:59:40 +00:00

.clang-format

IMPALA-8047 Support .proto files in .clang-format

2019-01-07 21:55:23 +00:00

.clang-tidy

IMPALA-9128: part 2: dump traces for slow RPCs

2019-11-14 20:24:58 +00:00

.gitattributes

IMPALA-8879: upgrade bootstrap for debug page to 4.3.1

2019-08-31 21:58:31 +00:00

.gitignore

Update .gitignore with VSCode artifacts

2020-01-17 20:56:57 +00:00

buildall.sh

IMPALA-9708: Remove Sentry support

2020-05-20 17:43:40 +00:00

CMakeLists.txt

IMPALA-3926: part 2: avoid setting LD_LIBRARY_PATH

2020-05-07 08:50:44 +00:00

EXPORT_CONTROL.md

IMPALA-4406: Add cryptography export control notice

2016-11-04 18:26:40 +00:00

LICENSE.txt

Revert "IMPALA-9718: Delete pkg_resources from IMPALA_HOME/shell/"

2020-05-07 23:15:32 +00:00

LOGS.md

Consolidate test and cluster logs under a single directory.

2016-03-28 19:23:22 +00:00

NOTICE.txt

2019-01-19 01:20:41 +00:00

README-build.md

IMPALA-9708: Remove Sentry support

2020-05-20 17:43:40 +00:00

README.md

IMPALA-9646: clean up README

2020-04-21 19:27:34 +00:00

setup.cfg

Ignore flake8 W503 about breaking before operators

2018-08-13 19:39:38 +00:00

README.md

Welcome to Impala

Lightning-fast, distributed SQL queries for petabytes of data stored in Apache Hadoop clusters.

Impala is a modern, massively-distributed, massively-parallel, C++ query engine that lets you analyze, transform and combine data from a variety of data sources:

 * Best of breed performance and scalability.
 * Support for data stored in HDFS, Apache HBase, Apache Kudu, Amazon S3, Azure Data Lake Storage, Apache Hadoop Ozone and more!
 * Wide analytic SQL support, including window functions and subqueries.
 * On-the-fly code generation using LLVM to generate lightning-fast code tailored specifically to each individual query.
 * Support for the most commonly-used Hadoop file formats, including Apache Parquet and Apache ORC.
 * Support for industry-standard security protocols, including Kerberos, LDAP and TLS.
 * Apache-licensed, 100% open source.

More about Impala

To learn more about Impala as a business user, or to try Impala live or in a VM, please visit the Impala homepage. Detailed documentation for administrators and users is available at Apache Impala documentation.

If you are interested in contributing to Impala as a developer, or learning more about Impala's internals and architecture, visit the Impala wiki.

Supported Platforms

Impala only supports Linux at the moment.

Export Control Notice

This distribution uses cryptographic software and may be subject to export controls. Please refer to EXPORT_CONTROL.md for more information.

Build Instructions

See Impala's developer documentation to get started.

The detailed build notes in README-build.md describe the project layout and build.

Languages

C++ 49.3%

Java 30.4%

Python 14.5%

JavaScript 1.3%

C 1.2%

Other 3.2%