Commit Graph

12 Commits

Author SHA1 Message Date
Fucun Chu
157086cb80 IMPALA-10771: Add Tencent COS support
This patch adds support for COS (Cloud Object Storage). Because it uses
hadoop-cos, the implementation is similar to that of other remote FileSystems.

New flags for COS:
- num_cos_io_threads: Number of COS I/O threads. Defaults to 16.

Follow-up:
- Support for caching COS file handles will be addressed in
   IMPALA-10772.
- test_concurrent_inserts and test_failing_inserts in
   test_acid_stress.py are skipped due to slow file listing on
   COS (IMPALA-10773).

Tests:
 - Upload hdfs test data to a COS bucket. Modify all locations in HMS
   DB to point to the COS bucket. Remove some hdfs caching params.
   Run CORE tests.

Change-Id: Idce135a7591d1b4c74425e365525be3086a39821
Reviewed-on: http://gerrit.cloudera.org:8080/17503
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-12-08 16:32:02 +00:00
stiga-huang
2dfc68d852 IMPALA-7712: Support Google Cloud Storage
This patch adds support for GCS (Google Cloud Storage). Because it uses the
gcs-connector, the implementation is similar to that of other remote
FileSystems.

New flags for GCS:
 - num_gcs_io_threads: Number of GCS I/O threads. Defaults to 16.

Follow-up:
 - Support for spilling to GCS will be addressed in IMPALA-10561.
 - Support for caching GCS file handles will be addressed in
   IMPALA-10568.
 - test_concurrent_inserts and test_failing_inserts in
   test_acid_stress.py are skipped due to slow file listing on
   GCS (IMPALA-10562).
 - Some tests are skipped due to issues introduced by /etc/hosts setting
   on GCE instances (IMPALA-10563).

Tests:
 - Compile and create hdfs test data on a GCE instance. Upload test data
   to a GCS bucket. Modify all locations in HMS DB to point to the GCS
   bucket. Remove some hdfs caching params. Run CORE tests.
 - Compile and load snapshot data to a GCS bucket. Run CORE tests.

Change-Id: Ia91ec956de3b620cccf6a1244b56b7da7a45b32b
Reviewed-on: http://gerrit.cloudera.org:8080/17121
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-13 11:20:08 +00:00
Anurag Mantripragada
f49f8d8a32 IMPALA-9132: Explain statements should not cause nullptr in
LogLineageRecord()

For DDLs, LogLineageRecord() adds certain fields in the backend before
flushing the lineage. It uses ddl_exec_response() to get these fields.
However, explain is a special kind of DDL which does not have an
associated catalog_op_executor_. This causes explain statements to
throw NPE when ddl_exec_response() is called.

Currently, tools like Atlas do not track lineages for explain
statements. This change skips lineage logging for explain statements
and, more generally, adds a nullptr check for catalog_op_executor_.

Testing:
Added a test to verify lineage is not created for explain statements.

Change-Id: Iccc20fd5a80841c820ebeb4edffccebea30df76e
Reviewed-on: http://gerrit.cloudera.org:8080/14646
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-11-06 22:14:04 +00:00
Sahil Takiar
e8fda1f224 IMPALA-9117, IMPALA-7726: Fixed a few unit tests for ABFS
This patch makes the following changes / fixes when running Impala tests
on ABFS:
* Skips some tests in test_lineage.py that don't work on ABFS / ADLS
(they were already skipped for S3)
* Skips some tests in test_mt_dop.py; the test creates a directory that
ends with a period (and ABFS does not support writing files or
directories that end with a period)
* Removes the ABFS skip flag SkipIfABFS.trash (IMPALA-7726: "Drop with
purge tests fail against ABFS due to trash misbehavior"); I removed
these flags and looped the tests overnight with no failures, so it is
likely that whatever bug was causing this has now been fixed
* Now that HADOOP-15860 has been resolved, and the agreed upon behavior
for ABFS is that it will fail if a client tries to write a file /
directory that ends with a period, I added a new entry to the SkipIfABFS
class called file_or_folder_name_ends_with_period and applied it where
necessary

Testing:
* Ran core tests on ABFS

Change-Id: I18ae5b0f7de6aa7628a1efd780ff30a0cc3c5285
Reviewed-on: http://gerrit.cloudera.org:8080/14636
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-11-06 05:44:01 +00:00
Anurag Mantripragada
378a108ecc IMPALA-9070: Include table location in lineage for 'CREATE EXTERNAL
TABLE' DDL.

Atlas needs table location to establish lineage between a newly
created external table and its table location.

The table location information is not available until the createTable
catalog op succeeds. After this change, location information is sent
to the backend in the TDDLExecResponse message which adds it to the
lineage graph. This information is sent only for create external
table queries.

Testing:
Added a test to verify the tableLocation field is populated for a
create external table query lineage. Also, modified the
lineage.test file to include location information for all lineages.

Change-Id: If02b0cc16d52c1956298171628f5737cab62ce9f
Reviewed-on: http://gerrit.cloudera.org:8080/14515
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-10-21 20:12:47 +00:00
Anurag Mantripragada
b64efa76f7 IMPALA-9053: DDLs should generate lineage graphs.
DDLs like 'create table' should generate minimal lineage graphs so
that consumers like Atlas can use information like 'queryText' to
establish lineages.

This change adds a call to the computeLineageGraph() method during the
analysis phase of createTable, which populates the graph with basic
information like queryText. If it is a CTAS, this graph is enhanced
in the "insert" phase with dependencies.

Testing:
Add an EE test to verify lineage information and also to check it
is flushed to disk properly.

Change-Id: Ia6c7ed9fe3265fd777fe93590cf4eb2d9ba0dd1e
Reviewed-on: http://gerrit.cloudera.org:8080/14458
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-10-17 20:11:06 +00:00
Bharath Vissapragada
d155ef01f5 IMPALA-8572: Skip test_lineage_output on S3
The test has an HBase dependency that does not run on S3.

Change-Id: I781c2dc42c042747eed6134cea4f3f0879a40294
Reviewed-on: http://gerrit.cloudera.org:8080/14230
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Sahil Takiar <stakiar@cloudera.com>
2019-09-16 17:04:17 +00:00
Bharath Vissapragada
32dbf5d78a IMPALA-8572: Log query events before unregister.
Currently, the query events (audits and lineages) are logged as a
part of query unregistration. This delays the event logging in cases
where Unregister() is delayed by the client for some reason (e.g., Hue
does not call Unregister until the browser tab is closed) or the client
goes away without calling Unregister and the query timeout kicks in.

This patch moves this event logging to an earlier stage in the query
lifecycle. Moved the event logging related code into ClientRequestState
for easier code refactoring.

The conditions under which the events are logged are slightly
modified by this patch. Without the patch, events are logged for
unsuccessful queries if at least a single fetch is performed. This patch
relaxes this guarantee to log events for any query that reaches
the FINISHED state (rows are available to fetch by the client) and does
not wait for a fetch to be performed. This simplifies the coordinator
state machine by avoiding unnecessary synchronization.

Added some test coverage for coordinator side code paths for writing
lineages. FE-specific lineage tests only verified the correctness of the
lineage created but did not test whether it was being flushed correctly
to disk.

Change-Id: I639b9c1acb9806b29292cd85be2863688453ca2e
Reviewed-on: http://gerrit.cloudera.org:8080/14143
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-09-12 02:02:44 +00:00
Fredy Wijaya
d4c05b96e2 IMPALA-8638: Fix flaky TestLineage::test_create_table_timestamp
This patch fixes a bug in TestLineage::test_create_table_timestamp, which
used the same lineage log directory as
TestLineage::test_start_end_timestamp and could therefore fail the query ID
assertion. The fix is to use a different lineage log directory than the
one used by TestLineage::test_start_end_timestamp.
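The isolation idea can be sketched in Python. This is only an illustration, not the code in test_lineage.py: the helper name new_lineage_log_dir and the file name lineage.log are hypothetical, and the sketch assumes each test is free to choose its own log directory.

```python
import os
import tempfile


def new_lineage_log_dir(test_name):
    """Create a fresh, uniquely named directory for one test's lineage logs."""
    return tempfile.mkdtemp(prefix=f"{test_name}-lineage-")


# Each test gets its own directory, so neither can read the other's records.
dir_a = new_lineage_log_dir("test_start_end_timestamp")
dir_b = new_lineage_log_dir("test_create_table_timestamp")

# A log written by the first test is invisible to the second.
with open(os.path.join(dir_a, "lineage.log"), "w") as f:
    f.write("{}\n")

print(os.listdir(dir_a))
print(os.listdir(dir_b))
```

With separate directories, a query ID assertion in one test can only ever see records produced by that test.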

Testing:
- Ran test_lineage.py multiple times and it still passed.

Change-Id: I5813e4a570c181ba196b9ddf0210c8a0d92e21e8
Reviewed-on: http://gerrit.cloudera.org:8080/13560
Reviewed-by: Fredy Wijaya <fwijaya@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-06-07 23:23:06 +00:00
Fredy Wijaya
d9af99589f IMPALA-8564: Add table/view create time in the lineage graph
This patch adds table/view create time in the lineage graph. This is
needed for Impala/Atlas integration. See ATLAS-3080.

Below is an example of the updated lineage graph.
{
    "queryText":"create table lineage_test_tbl as select int_col, tinyint_col from functional.alltypes",
    "queryId":"0:0",
    "hash":"407f23b24758ffcb2ac445b9703f5c44",
    "user":"dummy_user",
    "timestamp":1547867921,
    "edges":[
        {
            "sources":[
                1
            ],
            "targets":[
                0
            ],
            "edgeType":"PROJECTION"
        },
        {
            "sources":[
                3
            ],
            "targets":[
                2
            ],
            "edgeType":"PROJECTION"
        }
    ],
    "vertices":[
        {
            "id":0,
            "vertexType":"COLUMN",
            "vertexId":"int_col",
            "metadata":{
                "tableName":"default.lineage_test_tbl",
                "tableCreateTime":1559151337
            }
        },
        {
            "id":1,
            "vertexType":"COLUMN",
            "vertexId":"functional.alltypes.int_col",
            "metadata":{
                "tableName":"functional.alltypes",
                "tableCreateTime":1559151317
            }
        },
        {
            "id":2,
            "vertexType":"COLUMN",
            "vertexId":"tinyint_col",
            "metadata":{
                "tableName":"default.lineage_test_tbl",
                "tableCreateTime":1559151337
            }
        },
        {
            "id":3,
            "vertexType":"COLUMN",
            "vertexId":"functional.alltypes.tinyint_col",
            "metadata":{
                "tableName":"functional.alltypes",
                "tableCreateTime":1559151317
            }
        }
    ]
}
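
One way to read the graph above is to resolve each edge's "sources" and "targets" ids against the "vertices" list. A minimal Python sketch, assuming the record has been parsed as JSON and uses the field names shown above; the helper name resolve_edges is hypothetical and not part of Impala:

```python
import json


def resolve_edges(lineage):
    """Return (source vertexId, target vertexId, edgeType) triples."""
    # Map numeric vertex ids to their human-readable vertexId strings.
    names = {v["id"]: v["vertexId"] for v in lineage["vertices"]}
    triples = []
    for edge in lineage["edges"]:
        for src in edge["sources"]:
            for dst in edge["targets"]:
                triples.append((names[src], names[dst], edge["edgeType"]))
    return triples


# A trimmed copy of the record above: one edge and its two vertices.
record = json.loads("""
{
  "queryId": "0:0",
  "edges": [{"sources": [1], "targets": [0], "edgeType": "PROJECTION"}],
  "vertices": [
    {"id": 0, "vertexType": "COLUMN", "vertexId": "int_col",
     "metadata": {"tableName": "default.lineage_test_tbl",
                  "tableCreateTime": 1559151337}},
    {"id": 1, "vertexType": "COLUMN",
     "vertexId": "functional.alltypes.int_col",
     "metadata": {"tableName": "functional.alltypes",
                  "tableCreateTime": 1559151317}}
  ]
}
""")

for src, dst, kind in resolve_edges(record):
    print(f"{src} -> {dst} ({kind})")
```

The new tableCreateTime value sits under each vertex's "metadata" entry, so a consumer can look it up by the same vertex id.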

Testing:
- Updated lineage tests in PlannerTest
- Updated test_lineage.py
- Ran all FE tests

Change-Id: If4f578d7b299a76c30323b10a883ba32f8713d82
Reviewed-on: http://gerrit.cloudera.org:8080/13399
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-06-05 06:39:04 +00:00
Tianyi Wang
c505a8159b IMPALA-6210: Add query id to lineage graph logging
Some tools use lineage graph logging to collect query metrics. Currently
only query hash is present in this log. Adding query id into it makes
such accounting easier.

Testing: The equality of query id in the query profile and lineage log
is checked in test_lineage.py. A test for TUniqueIdUtil is added to the
FE tests.
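
A consumer could use the new field to look up one query's record. The Python sketch below is illustrative only: it assumes the lineage log directory holds files with one JSON record per line and a "queryId" field, and the helper name find_lineage_record, the file name, and the sample records are all made up for the demo.

```python
import json
import os
import tempfile


def find_lineage_record(log_dir, query_id):
    """Return the first record in log_dir whose queryId matches, else None."""
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path) as log_file:
            for line in log_file:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                record = json.loads(line)
                if record.get("queryId") == query_id:
                    return record
    return None


# Demo with a throwaway directory and made-up records.
demo_dir = tempfile.mkdtemp(prefix="lineage-demo-")
with open(os.path.join(demo_dir, "lineage.log"), "w") as f:
    f.write(json.dumps({"queryId": "aaaa:1", "hash": "h1"}) + "\n")
    f.write(json.dumps({"queryId": "bbbb:2", "hash": "h2"}) + "\n")

found = find_lineage_record(demo_dir, "bbbb:2")
print(found)
```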

Change-Id: I4adbd02df37a234dbb79f58b7c46ca11a914229f
Reviewed-on: http://gerrit.cloudera.org:8080/8589
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-06 00:52:19 +00:00
Dan Hecht
035b775a6d IMPALA-4440: lineage timestamps can go backwards across daylight savings transitions
Using TimestampValue (or equivalent string representation) for
timestamps that require a point in time doesn't work because the same
time can represent multiple points in time. For example, the timestamp:
'2016-11-13 01:01 AM' occurred twice last weekend.

Instead, we should use unix time directly rather than trying to derive
unix time from a (timezone-less) timestamp.
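
The ambiguity can be reproduced with a short Python sketch. The date and the fixed US Pacific offsets below are illustrative assumptions chosen for the demo (a fall-back hour), not values taken from Impala:

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets on either side of a fall-back transition (assumed for the demo).
PDT = timezone(timedelta(hours=-7), "PDT")
PST = timezone(timedelta(hours=-8), "PST")

# The same wall-clock reading, once before and once after the clocks go back.
first = datetime(2016, 11, 6, 1, 1, tzinfo=PDT)
second = datetime(2016, 11, 6, 1, 1, tzinfo=PST)

# Without the offset, the two readings format identically...
fmt = "%Y-%m-%d %I:%M %p"
same_wall_clock = first.strftime(fmt) == second.strftime(fmt)

# ...but their unix times differ by one hour.
gap_seconds = int(second.timestamp() - first.timestamp())

print(same_wall_clock, gap_seconds)
```

Storing the unix time directly keeps the two instants distinct, whereas a timezone-less string cannot say which of the two was meant.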

Note that there are other questionable uses of TimestampValue for
internal Impala service stuff, but I want to fix them separately as they
are not as important and fixing does add some risk.

While I'm here, remove a template TimestampValue constructor that was
unused and is confusing.

We don't have any end-to-end tests that exercise column lineage, so add
a simple custom cluster test that enables lineage and verifies the start
and end unix times are within appropriate bounds.  The other column
lineage graph fields are at least tested via planner tests.

Automated regression testing for the specific daylight savings issue is
difficult as we'd have to cross the daylight savings boundary at just
the right time during query execution in order to reproduce
reliably. But open to ideas.

Testing:
- loop the new test overnight without any failures.
- exhaustive run.

Change-Id: I34e435fc3511e65bc62906205cb558f2c116a8a9
Reviewed-on: http://gerrit.cloudera.org:8080/5129
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-11-21 22:18:37 +00:00