mirror of https://github.com/apache/impala.git synced 2026-01-09 15:00:11 -05:00


Tim Armstrong a2e88f0e6c IMPALA-3495: incorrect join result due to implicit cast in Murmur hash

We observed that some spilling joins started returning incorrect
results. The behaviour seems to occur when a codegen'd insert function
and a non-codegen'd probe function are used together (or vice versa),
and only in a subset of cases.

The bug appears to be a result of the implicit cast of the uint32_t seed
value to the int32_t hash argument of HashTable::Hash(). The result is
implementation-defined (before C++20) if the uint32_t does not fit in
the int32_t. In Murmur hash, this value is subsequently cast to a
uint64_t, so we have a chain of uint32_t->int32_t->uint64_t conversions.
It would require a very careful reading of the C++ standard to
understand what the expected result is, and whether we're seeing a
compiler bug or just implementation-defined behaviour, but we can avoid
it entirely by keeping the values unsigned.
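
The conversion chain can be illustrated with a minimal sketch. The two
helpers below are hypothetical stand-ins written for this note, not
Impala's HashTable::Hash() or its Murmur hash code. They only show how
the final 64-bit value depends on whether the seed passes through a
signed 32-bit type on the way.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration; not Impala's HashTable::Hash().

// Models a seed that arrives through a signed 32-bit parameter and is then
// widened to 64 bits. A negative value is sign-extended, so the upper 32
// bits of the result are all ones.
inline uint64_t WidenViaSigned(int32_t seed) {
  return static_cast<uint64_t>(seed);
}

// Models a seed that stays unsigned. Widening zero-extends, so the upper 32
// bits of the result are always zero.
inline uint64_t WidenUnsigned(uint32_t seed) {
  return static_cast<uint64_t>(seed);
}
```

For a seed with the high bit set, such as 0x80000001, WidenUnsigned
yields 0x0000000080000001. If the implicit uint32_t->int32_t conversion
wraps to a negative value, as it does on common two's-complement
compilers, WidenViaSigned yields 0xFFFFFFFF80000001 instead. Two code
paths that disagree on this step would hash the same row differently.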

Testing:
I was able to reproduce the issue under a very specific set of
circumstances, listed below. Before this change the query consistently
returned 0 rows. After the change it consistently returned the correct
results. I haven't had much luck creating a suitable regression test.

* 1 impalad
* --disable_mem_pools=true
* use tpch_20_parquet;
* set mem_limit=1275mb;
* TPC-H query 7:

select
  supp_nation,
  cust_nation,
  l_year,
  sum(volume) as revenue
from (
  select
    n1.n_name as supp_nation,
    n2.n_name as cust_nation,
    year(l_shipdate) as l_year,
    l_extendedprice * (1 - l_discount) as volume
  from
    supplier,
    lineitem,
    orders,
    customer,
    nation n1,
    nation n2
  where
    s_suppkey = l_suppkey
    and o_orderkey = l_orderkey
    and c_custkey = o_custkey
    and s_nationkey = n1.n_nationkey
    and c_nationkey = n2.n_nationkey
    and (
      (n1.n_name = 'FRANCE' and n2.n_name = 'GERMANY')
      or (n1.n_name = 'GERMANY' and n2.n_name = 'FRANCE')
    )
    and l_shipdate between '1995-01-01' and '1996-12-31'
  ) as shipping
group by
  supp_nation,
  cust_nation,
  l_year
order by
  supp_nation,
  cust_nation,
  l_year

Change-Id: I952638dc94119a4bc93126ea94cc6a3edf438956
Reviewed-on: http://gerrit.cloudera.org:8080/3034
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins

2016-05-12 23:06:35 -07:00

IMPALA-3495: incorrect join result due to implicit cast in Murmur hash

2016-05-12 23:06:35 -07:00

bin

Add ninja support for faster incremental builds

2016-05-12 14:17:53 -07:00

cmake_modules

IMPALA-3166: basic perf support and asm dumps for codegened code

2016-05-12 14:18:03 -07:00

common

IMPALA-3480: Add query options for min/max filter sizes

2016-05-12 23:06:35 -07:00

ext-data-source

IMPALA-3384: add missing frontend -> ext-data-source dependency.

2016-05-12 14:17:47 -07:00

IMPALA-3453: S3: Uneven split sizes are generated for Parquet causing execution skew

2016-05-12 14:18:04 -07:00

infra

IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

2016-05-12 14:17:49 -07:00

llvm-ir

Misc. codegen utilities

2016-02-10 04:44:31 +00:00

shell

IMPALA-3397: Source query files from shell.

2016-05-12 14:17:54 -07:00

ssh_keys

Move ssh keys from bin directory to fix packaging build break

2014-01-08 10:44:12 -08:00

testdata

IMPALA-3480: Add query options for min/max filter sizes

2016-05-12 23:06:35 -07:00

tests

IMPALA-3490: Add flag to reduce minidump size

2016-05-12 14:18:04 -07:00

www

IMPALA-2198: Differentiate queries in exceptional states in web UI

2016-05-12 14:17:50 -07:00

.gitignore

Add .impala_compiler_opts to .gitignore

2016-05-12 14:17:58 -07:00

buildall.sh

Add ninja support for faster incremental builds

2016-05-12 14:17:53 -07:00

CMakeLists.txt

IMPALA-2686: Add breakpad crash handler to all daemons

2016-05-12 14:17:52 -07:00

LICENSE.txt

Add text of Apache license

2014-05-08 11:16:53 -07:00

LOGS.md

Consolidate test and cluster logs under a single directory.

2016-03-28 19:23:22 +00:00

NOTICE.txt

Add NOTICE.txt file to Impala repo

2014-07-02 15:23:24 -07:00

README.md

Add explanation that Incubator repo is not buildable yet

2016-04-12 14:15:55 -07:00

README.md

Welcome to Impala

Lightning-fast, distributed SQL queries for petabytes of data stored in Apache Hadoop clusters.

Impala is a modern, massively-distributed, massively-parallel, C++ query engine that lets you analyze, transform and combine data from a variety of data sources:

* Best of breed performance and scalability.
* Support for data stored in HDFS, Apache HBase and Amazon S3.
* Wide analytic SQL support, including window functions and subqueries.
* On-the-fly code generation using LLVM to generate CPU-efficient code tailored specifically to each individual query.
* Support for the most commonly-used Hadoop file formats, including the Apache Parquet (incubating) project.
* Apache-licensed, 100% open source.

More about Impala

To learn more about Impala as a business user, or to try Impala live or in a VM, please visit the Impala homepage.

If you are interested in contributing to Impala as a developer, or learning more about Impala's internals and architecture, visit the Impala wiki.

Building Impala

This Apache Incubator repository is not currently buildable, but it has the complete source code for Impala minus some third-party dependencies. See https://github.com/cloudera/Impala for the buildable Impala source and https://issues.cloudera.org/browse/IMPALA-3223 to track progress on making this repository buildable.

Languages

C++ 49.2%

Java 30.4%

Python 14.5%

JavaScript 1.4%

C 1.2%

Other 3.2%