<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="lineage" rev="2.2.0">

<title>Viewing Lineage Information for Impala Data</title>
<titlealts audience="PDF"><navtitle>Viewing Lineage Info</navtitle></titlealts>
<prolog>

<metadata>

<data name="Category" value="Impala"/>
<data name="Category" value="Lineage"/>
<data name="Category" value="Governance"/>
<data name="Category" value="Data Management"/>
<data name="Category" value="Navigator"/>
<data name="Category" value="Administrators"/>

</metadata>

</prolog>

<conbody>

<p rev="2.2.0">
<indexterm audience="hidden">lineage</indexterm>
<indexterm audience="hidden">column lineage</indexterm>
<term>Lineage</term> is a feature in the Cloudera Navigator data
management component that helps you track where data originated and how
it propagates through the system by way of SQL statements such as
<codeph>SELECT</codeph>, <codeph>INSERT</codeph>, and
<codeph>CREATE TABLE AS SELECT</codeph>. Impala is covered by the Cloudera Navigator
lineage features in <keyword keyref="impala22_full"/> and higher. </p>

<p>
This type of tracking is important in high-security configurations, especially in highly regulated industries
such as healthcare, pharmaceuticals, financial services, and intelligence. For this kind of sensitive data, it is important to know all
the places in the system that contain that data or other data derived from it; to verify who has accessed
that data; and to be able to double-check that the data used to make a decision was processed correctly and
not tampered with.
</p>

<p>
You interact with this feature through <term>lineage diagrams</term> showing relationships between tables and
columns. For instructions about interpreting lineage diagrams, see
<xref audience="integrated" href="cn_iu_lineage.xml" /><xref audience="standalone" href="http://www.cloudera.com/documentation/enterprise/latest/topics/cn_iu_lineage.html" scope="external" format="html"/>.
</p>

<section id="column_lineage">

<title>Column Lineage</title>

<p>
<term>Column lineage</term> tracks information in fine detail, at the level of
particular columns rather than entire tables.
</p>

<p>
For example, if you have a table with information derived from web logs, you might copy that data into
other tables as part of the ETL process. The ETL operations might involve transformations through
expressions and function calls, and rearranging the columns into more or fewer tables
(<term>normalizing</term> or <term>denormalizing</term> the data). Then, for reporting, you might issue
queries against multiple tables and views. In this example, column lineage helps you determine that data
that entered the system as <codeph>RAW_LOGS.FIELD1</codeph> was then turned into
<codeph>WEBSITE_REPORTS.IP_ADDRESS</codeph> through an <codeph>INSERT ... SELECT</codeph> statement. Or,
conversely, you could start with a reporting query against a view, and trace the origin of the data in a
field such as <codeph>TOP_10_VISITORS.USER_ID</codeph> back to the underlying table and even further back
to the point where the data was first loaded into Impala.
</p>
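<p>
As a minimal sketch of the scenario above, the following statements show the kind of
<codeph>INSERT ... SELECT</codeph> statement and view definition that column lineage would connect. The
table, view, and column names <codeph>RAW_LOGS.FIELD1</codeph>, <codeph>WEBSITE_REPORTS.IP_ADDRESS</codeph>,
and <codeph>TOP_10_VISITORS.USER_ID</codeph> come from the example; the column
<codeph>RAW_LOGS.FIELD2</codeph>, the column <codeph>WEBSITE_REPORTS.USER_ID</codeph>, and the
<codeph>VISITS</codeph> alias are hypothetical and are introduced only for illustration.
</p>

<codeblock>-- ETL step: lineage would link RAW_LOGS.FIELD1 to WEBSITE_REPORTS.IP_ADDRESS.
-- FIELD2 and WEBSITE_REPORTS.USER_ID are hypothetical columns.
INSERT INTO website_reports (ip_address, user_id)
  SELECT field1, field2 FROM raw_logs;

-- Reporting view: lineage would link TOP_10_VISITORS.USER_ID back to
-- WEBSITE_REPORTS.USER_ID, and from there back to RAW_LOGS.
CREATE VIEW top_10_visitors AS
  SELECT user_id, COUNT(*) AS visits
  FROM website_reports
  GROUP BY user_id
  ORDER BY visits DESC
  LIMIT 10;
</codeblock>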
<p>
When you have tables where you need to track or control access to sensitive information at the column
level, see <xref href="impala_authorization.xml#authorization"/> for how to implement column-level
security. You set up authorization using the Sentry framework, create views that refer to specific sets of
columns, and then assign authorization privileges to those views rather than to the underlying tables.
</p>
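<p>
The view-based approach might look like the following sketch. It assumes that Sentry authorization is
already enabled and that the role already exists. The view name <codeph>website_reports_no_ip</codeph>,
the role name <codeph>analyst_role</codeph>, and the <codeph>user_id</codeph> column are hypothetical.
</p>

<codeblock>-- Expose only the non-sensitive column through a view (hypothetical names).
CREATE VIEW website_reports_no_ip AS
  SELECT user_id FROM website_reports;

-- Grant the privilege on the view rather than on the underlying table.
GRANT SELECT ON TABLE website_reports_no_ip TO ROLE analyst_role;
</codeblock>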

</section>

<section id="lineage_data">

<title>Lineage Data for Impala</title>

<p>
The lineage feature is enabled by default. When lineage logging is enabled, Impala computes the column
lineage graph for each query, serializes it, and stores it in a dedicated log file in JSON format.
</p>

<p>
Impala records queries in the lineage log if they complete successfully, or fail due to authorization
errors. For write operations such as <codeph>INSERT</codeph> and <codeph>CREATE TABLE AS SELECT</codeph>,
the statement is recorded in the lineage log only if it successfully completes. Therefore, the lineage
feature tracks data that was accessed by successful queries, or that unsuccessful queries attempted to
access before being blocked by an authorization failure. These kinds of queries represent data
that really was accessed, or where the attempted access could represent malicious activity.
</p>

<p>
The lineage log does not include queries that fail due to syntax errors, or that fail or are
cancelled before they reach the stage of requesting rows from the result set.
</p>

<p>
To enable or disable this feature on a system not managed by Cloudera Manager, set or remove the
<codeph>-lineage_event_log_dir</codeph> configuration option for the <cmdname>impalad</cmdname> daemon. For
information about turning the lineage feature on and off through Cloudera Manager, see
<xref audience="integrated" href="datamgmt_impala_lineage_log.xml"/><xref audience="standalone" href="http://www.cloudera.com/documentation/enterprise/latest/topics/datamgmt_impala_lineage_log.html" scope="external" format="html"/>.
</p>
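<p>
For example, on a system not managed by Cloudera Manager, the option might appear among the
<cmdname>impalad</cmdname> startup flags as shown here. The directory path is an assumption chosen for
illustration; substitute a location that the user running <cmdname>impalad</cmdname> can write to.
</p>

<codeblock>-lineage_event_log_dir=/var/log/impalad/lineage
</codeblock>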

</section>

</conbody>

</concept>