5 Commits

Author SHA1 Message Date
Sailesh Mukil
50bd015f2d IMPALA-5333: Add support for Impala to work with ADLS
This patch leverages the AdlFileSystem in Hadoop to allow
Impala to talk to the Azure Data Lake Store. This patch makes
functional changes and adds test infrastructure for
testing Impala over ADLS.

We do not support ACLs on ADLS since the Hadoop ADLS
connector does not integrate ADLS ACLs with Hadoop users/groups.

For testing, we use the azure-data-lake-store-python client
from Microsoft. This client seems to have some consistency
issues. For example, a drop table through Impala deletes
the files in ADLS, but listing that directory through
the Python client immediately after the drop still shows
the files. This behavior is unexpected, since ADLS claims to be
strongly consistent. Some tests have been skipped due to this
limitation, using the tag SkipIfADLS.slow_client. Tracked by
IMPALA-5335.

The azure-data-lake-store-python client also only works on CentOS 6.6
or later, so the Python dependencies for Azure will not be downloaded
when the TARGET_FILESYSTEM is not "adls". ADLS tests are expected to
run on a machine that is running at least CentOS 6.6.
Note: This is only a test limitation, not a functional one. Clusters
with older OSes like CentOS 6.4 will still work with ADLS.
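
The conditional dependency download described above could look roughly
like the sketch below. It is an illustration only, not the actual
bootstrap logic: the function name requirement_files and both file
names are hypothetical, and only the TARGET_FILESYSTEM variable and the
"adls" value come from this message.

```python
import os


def requirement_files(env=None):
    """Return the requirement files to install (file names are hypothetical)."""
    env = os.environ if env is None else env
    files = ["requirements.txt"]
    # The Azure client packages are only needed when testing against ADLS,
    # so they are skipped for every other target filesystem.
    if env.get("TARGET_FILESYSTEM") == "adls":
        files.append("adls-requirements.txt")
    return files
```

Passing the environment as a parameter keeps the decision easy to check
without modifying the real process environment.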

Added another dependency to bootstrap_build.sh for the ADLS Python
client.

Testing: Ran core tests with and without TARGET_FILESYSTEM as
'adls' to make sure that all tests pass and that nothing breaks.

Change-Id: Ic56b9988b32a330443f24c44f9cb2c80842f7542
Reviewed-on: http://gerrit.cloudera.org:8080/6910
Tested-by: Impala Public Jenkins
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
2017-05-25 19:35:24 +00:00
Thomas Tauber-Marshall
ee9fbeca90 IMPALA-5340: Query profile displays stale query state
Previously, updates to the query state in ClientRequestState were
not immediately reflected in the query profile, potentially
leading to the profile showing an incorrect state for an extended
period during execution.

In particular, queries were being shown in the 'CREATED' state
long after they had started 'RUNNING'.

The fix is to update the profile whenever the state is updated.
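
The idea behind the fix can be sketched as follows: every state change
goes through a single method that also writes the new state into the
profile, so the two cannot drift apart. This is a simplified Python
model, not the C++ code in ClientRequestState; the class names, the
method names and the "Query State" key are assumptions made for
illustration.

```python
class ProfileSketch:
    """Stand-in for a runtime profile holding key/value info strings."""

    def __init__(self):
        self.info = {}

    def add_info_string(self, key, value):
        self.info[key] = value


class RequestStateSketch:
    """Stand-in for a request state that owns a query state and a profile."""

    def __init__(self, profile):
        self._profile = profile
        self._state = None
        self.update_state("CREATED")

    def update_state(self, new_state):
        # Update the in-memory state and the profile together so the
        # profile never shows a stale value.
        self._state = new_state
        self._profile.add_info_string("Query State", new_state)
```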

Testing:
- Extended existing hs2 tests and added a beeswax test to check
  for expected query states in the profile

Change-Id: I952319b7308a24d4e2dff924199c0c771bce25b3
Reviewed-on: http://gerrit.cloudera.org:8080/6923
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
2017-05-20 03:17:59 +00:00
Thomas Tauber-Marshall
f195b7577c IMPALA-5305: test_observability.py failing on s3, localFS and Isilon
A test that was recently added, test_observability::test_scan_summary,
uses an HBase table. It needs to be restricted so that it does not
run on S3, localFS or Isilon.
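
A restriction of this kind can be sketched with the standard library's
unittest.skipIf, used here in place of the project's own skip markers.
The filesystem identifiers "s3", "local" and "isilon", the "hdfs"
default and the helper name are assumptions made for illustration.

```python
import os
import unittest

# Hypothetical identifiers for the filesystems on which the test must not run.
RESTRICTED_FILESYSTEMS = {"s3", "local", "isilon"}


def is_restricted(target_fs):
    """Return True if a test that needs an HBase table should be skipped."""
    return target_fs in RESTRICTED_FILESYSTEMS


# The "hdfs" default is an assumption for this sketch.
TARGET_FS = os.environ.get("TARGET_FILESYSTEM", "hdfs")


class ScanSummarySketch(unittest.TestCase):
    @unittest.skipIf(is_restricted(TARGET_FS),
                     "test uses an HBase table; restricted on this filesystem")
    def test_scan_summary(self):
        pass  # test body elided in this sketch
```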

Change-Id: I9863cf3f885eb1d2152186de34e093497af83d99
Reviewed-on: http://gerrit.cloudera.org:8080/6859
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
2017-05-12 19:34:16 +00:00
Thomas Tauber-Marshall
49b6af54c8 IMPALA-4499: Table name missing from exec summary
For scan nodes, previously only HDFS tables showed the name
of the table in the 'Detail' section for the scan node. This
change adds the table name for all scan node types (Kudu,
HBase, and DataSource).

Testing:
- Added an e2e test in test_observability.

Change-Id: If4fd13f893aea4e7df8a2474d7136770660e4324
Reviewed-on: http://gerrit.cloudera.org:8080/6832
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-05-10 22:56:38 +00:00
Thomas Tauber-Marshall
7fad3e5dc3 IMPALA-3002/IMPALA-1473: Cardinality observability cleanup
IMPALA-3002:
The shell prints an incorrect value for '#Rows' in the exec
summary for broadcast nodes due to incorrect logic around
whether to use max or agg stats. This patch makes the behavior
consistent with the way the backend treats exec summaries in
summary-util.cc. This incorrect logic was also duplicated in
the impala_beeswax test framework.
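
The choice between max and agg stats can be illustrated with a small
sketch. It assumes that, for a broadcast node, the per-instance maximum
is the meaningful row count, because summing across instances would
count the same rows several times. That assumption, the function name
and the plain-list input are illustrative and are not taken from
summary-util.cc.

```python
def rows_for_summary(per_instance_rows, is_broadcast):
    """Pick the '#Rows' value to display for one node of the exec summary."""
    agg = sum(per_instance_rows)
    peak = max(per_instance_rows) if per_instance_rows else 0
    # Assumed rule: broadcast nodes report the per-instance maximum,
    # all other nodes report the aggregate over instances.
    return peak if is_broadcast else agg
```

With three instances that each report 100 rows, the broadcast case
yields 100 while the non-broadcast case yields 300.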

IMPALA-1473:
When there is a merging exchange with a limit, we may copy rows
into the output batch beyond the limit. In this case, we currently
update the output batch's size to reflect the limit, but we also
need to update ExecNode::num_rows_returned_ or the exec summary
may show that the exchange node returned more rows than it really
did.
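
The bookkeeping problem can be modelled in a few lines. When a batch
pushes the running total past the limit, both the batch and the
returned-row counter have to be clamped; clamping only the batch leaves
the counter too high. This is a simplified Python model rather than the
actual exchange node code, and the class and method names are
hypothetical.

```python
class ExchangeSketch:
    """Toy model of a node that returns rows in batches up to a limit."""

    def __init__(self, limit):
        self.limit = limit
        self.num_rows_returned = 0

    def emit(self, batch):
        """Account for a filled batch and return only the rows within the limit."""
        self.num_rows_returned += len(batch)
        if self.num_rows_returned > self.limit:
            excess = self.num_rows_returned - self.limit
            # Trim the rows copied beyond the limit ...
            batch = batch[:len(batch) - excess]
            # ... and clamp the counter too, so the summary stays accurate.
            self.num_rows_returned = self.limit
        return batch
```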

Additionally, PlanFragmentExecutor::GetNext does not update
rows_produced_counter_ in some cases, leading the runtime profile
to display an incorrect value for 'RowsProduced'.

Change-Id: I386719370386c9cff09b8b35d15dc712dc6480aa
Reviewed-on: http://gerrit.cloudera.org:8080/4679
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-10-15 01:25:51 +00:00