Files
impala/docs/topics/impala_compression_codec.xml
John Russell 8377b9949c Global search/replace: audience="Cloudera" -> audience="hidden".
For this change to land in master, the audience="hidden" code review
needs to be completed first. Otherwise, the doc build would still work
but the audience="hidden" content would be visible rather than hidden as
desired.

Some work happening in parallel might introduce additional instances of
audience="Cloudera". I suggest addressing those in a followup CR so this
global change can land quickly.

Since the changes apply across so many different files, but are so
narrow in scope, I suggest that the way to validate (check that no
extraneous changes were introduced accidentally) is to diff just the
changed lines:

git diff -U0 HEAD^ HEAD

In patch set 2, I updated other topics marked audience="Cloudera"
by CRs that were pushed in the meantime.

Change-Id: Ic93d89da77e1f51bbf548a522d98d0c4e2fb31c8
Reviewed-on: http://gerrit.cloudera.org:8080/5613
Reviewed-by: John Russell <jrussell@cloudera.com>
Tested-by: Impala Public Jenkins
2017-01-18 19:31:57 +00:00

117 lines
4.1 KiB
XML

<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="2.0.0" id="compression_codec">
<title>COMPRESSION_CODEC Query Option (<keyword keyref="impala20"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>COMPRESSION_CODEC</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Compression"/>
<data name="Category" value="File Formats"/>
<data name="Category" value="Parquet"/>
<data name="Category" value="Snappy"/>
<data name="Category" value="Gzip"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<!-- The initial part of this paragraph is copied straight from the #parquet_compression topic. -->
<!-- Could turn into a conref. -->
<p rev="2.0.0">
<indexterm audience="hidden">COMPRESSION_CODEC query option</indexterm>
When Impala writes Parquet data files using the <codeph>INSERT</codeph> statement, the underlying compression
is controlled by the <codeph>COMPRESSION_CODEC</codeph> query option.
</p>
<note>
Prior to Impala 2.0, this option was named <codeph>PARQUET_COMPRESSION_CODEC</codeph>. In Impala 2.0 and
later, the <codeph>PARQUET_COMPRESSION_CODEC</codeph> name is not recognized. Use the more general name
<codeph>COMPRESSION_CODEC</codeph> for new code.
</note>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<codeblock>SET COMPRESSION_CODEC=<varname>codec_name</varname>;</codeblock>
<p>
The allowed values for this query option are <codeph>SNAPPY</codeph> (the default), <codeph>GZIP</codeph>,
and <codeph>NONE</codeph>.
</p>
<note>
A Parquet file created with <codeph>COMPRESSION_CODEC=NONE</codeph> is still typically smaller than the
original data, due to encoding schemes such as run-length encoding and dictionary encoding that are applied
separately from compression.
</note>
<p></p>
<p>
The option value is not case-sensitive.
</p>
<p>
If the option is set to an unrecognized value, all kinds of queries will fail due to the invalid option
setting, not just queries involving Parquet tables. (The value <codeph>BZIP2</codeph> is also recognized, but
is not compatible with Parquet tables.)
</p>
<p>
<b>Type:</b> <codeph>STRING</codeph>
</p>
<p>
<b>Default:</b> <codeph>SNAPPY</codeph>
</p>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<codeblock>set compression_codec=gzip;
insert into parquet_table_highly_compressed select * from t1;
set compression_codec=snappy;
insert into parquet_table_compression_plus_fast_queries select * from t1;
set compression_codec=none;
insert into parquet_table_no_compression select * from t1;
set compression_codec=foo;
select * from t1 limit 5;
ERROR: Invalid compression codec: foo
</codeblock>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
For information about how compressing Parquet data files affects query performance, see
<xref href="impala_parquet.xml#parquet_compression"/>.
</p>
</conbody>
</concept>