mirror of
https://github.com/apache/impala.git
synced 2025-12-19 09:58:28 -05:00
Updated all the associated topics. Change-Id: Id4c5e9aa4d060ceaa426908a444d280a5564749d Reviewed-on: http://gerrit.cloudera.org:8080/17469 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
123 lines
5.5 KiB
XML
123 lines
5.5 KiB
XML
<?xml version="1.0" encoding="UTF-8"?>
|
|
<!--
|
|
Licensed to the Apache Software Foundation (ASF) under one
|
|
or more contributor license agreements. See the NOTICE file
|
|
distributed with this work for additional information
|
|
regarding copyright ownership. The ASF licenses this file
|
|
to you under the Apache License, Version 2.0 (the
|
|
"License"); you may not use this file except in compliance
|
|
with the License. You may obtain a copy of the License at
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
Unless required by applicable law or agreed to in writing,
|
|
software distributed under the License is distributed on an
|
|
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
KIND, either express or implied. See the License for the
|
|
specific language governing permissions and limitations
|
|
under the License.
|
|
-->
|
|
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
|
<concept id="lineage" rev="2.2.0">
|
|
|
|
<title>Viewing Lineage Information for Impala Data</title>
|
|
<titlealts audience="PDF"><navtitle>Viewing Lineage Info</navtitle></titlealts>
|
|
<prolog>
|
|
|
|
<metadata>
|
|
|
|
<data name="Category" value="Impala"/>
|
|
<data name="Category" value="Lineage"/>
|
|
<data name="Category" value="Governance"/>
|
|
<data name="Category" value="Data Management"/>
|
|
<data name="Category" value="Navigator"/>
|
|
<data name="Category" value="Administrators"/>
|
|
|
|
</metadata>
|
|
|
|
</prolog>
|
|
|
|
<conbody>
|
|
|
|
<p rev="2.2.0">
|
|
<indexterm audience="hidden">lineage</indexterm>
|
|
<indexterm audience="hidden">column lineage</indexterm>
|
|
<term>Lineage</term> is a feature that helps you track where data originated, and how
|
|
data propagates through the system through SQL statements such as
|
|
<codeph>SELECT</codeph>, <codeph>INSERT</codeph>, and <codeph>CREATE
|
|
TABLE AS SELECT</codeph>.
|
|
</p>
|
|
<p>
|
|
This type of tracking is important in high-security configurations, especially in
|
|
highly regulated industries such as healthcare, pharmaceuticals, financial services and
|
|
intelligence. For such kinds of sensitive data, it is important to know all
|
|
the places in the system that contain that data or other data derived from it; to verify who has accessed
|
|
that data; and to be able to doublecheck that the data used to make a decision was processed correctly and
|
|
not tampered with.
|
|
</p>
|
|
|
|
<section id="column_lineage">
|
|
|
|
<title>Column Lineage</title>
|
|
|
|
<p>
|
|
<term>Column lineage</term> tracks information in fine detail, at the level of
|
|
particular columns rather than entire tables.
|
|
</p>
|
|
|
|
<p>
|
|
For example, if you have a table with information derived from web logs, you might copy that data into
|
|
other tables as part of the ETL process. The ETL operations might involve transformations through
|
|
expressions and function calls, and rearranging the columns into more or fewer tables
|
|
(<term>normalizing</term> or <term>denormalizing</term> the data). Then for reporting, you might issue
|
|
queries against multiple tables and views. In this example, column lineage helps you determine that data
|
|
that entered the system as <codeph>RAW_LOGS.FIELD1</codeph> was then turned into
|
|
<codeph>WEBSITE_REPORTS.IP_ADDRESS</codeph> through an <codeph>INSERT ... SELECT</codeph> statement. Or,
|
|
conversely, you could start with a reporting query against a view, and trace the origin of the data in a
|
|
field such as <codeph>TOP_10_VISITORS.USER_ID</codeph> back to the underlying table and even further back
|
|
to the point where the data was first loaded into Impala.
|
|
</p>
|
|
|
|
<p>
|
|
When you have tables where you need to track or control access to sensitive information at the column
|
|
level, see <xref href="impala_authorization.xml#authorization"/> for how to implement column-level
|
|
security. You set up authorization using the Ranger framework, create views that refer to specific sets of
|
|
columns, and then assign authorization privileges to those views rather than the underlying tables.
|
|
</p>
|
|
|
|
</section>
|
|
|
|
<section id="lineage_data">
|
|
|
|
<title>Lineage Data for Impala</title>
|
|
|
|
<p>
|
|
The lineage feature is enabled by default. When lineage logging is enabled, the serialized column lineage
|
|
graph is computed for each query and stored in a specialized log file in JSON format.
|
|
</p>
|
|
|
|
<p>
|
|
Impala records queries in the lineage log if they complete successfully, or fail due to authorization
|
|
errors. For write operations such as <codeph>INSERT</codeph> and <codeph>CREATE TABLE AS SELECT</codeph>,
|
|
the statement is recorded in the lineage log only if it successfully completes. Therefore, the lineage
|
|
feature tracks data that was accessed by successful queries, or that was attempted to be accessed by
|
|
unsuccessful queries that were blocked due to authorization failure. These kinds of queries represent data
|
|
that really was accessed, or where the attempted access could represent malicious activity.
|
|
</p>
|
|
|
|
<p>
|
|
Impala does not record in the lineage log queries that fail due to syntax errors or that fail or are
|
|
cancelled before they reach the stage of requesting rows from the result set.
|
|
</p>
|
|
|
|
<p>
|
|
To enable or disable this feature, set or remove the <codeph>-lineage_event_log_dir</codeph>
|
|
configuration option for the <cmdname>impalad</cmdname> daemon.
|
|
</p>
|
|
|
|
</section>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|