Many python files had a hashbang and the executable bit set even though
they were not intended to be run as standalone scripts. That makes
determining which python files are actually scripts very difficult.
A future patch will update the hashbang in real python scripts so they
use $IMPALA_HOME/bin/impala-python.
Change-Id: I04eafdc73201feefe65b85817a00474e182ec2ba
Reviewed-on: http://gerrit.cloudera.org:8080/599
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
This patch introduces changes to run tests against Isilon, combined with minor cleanup of
the test and client code.
For Isilon, it:
- Populates the SkipIfIsilon class with appropriate pytest markers.
- Introduces a new default for the hdfs client in order to connect to Isilon.
- Cleans up a few test files to take the underlying filesystem into account.
- Cleans up the interface for metadata/test_insert_behaviour, query_test/test_ddl
On the client side, we introduce a wrapper around a few of pywebhdfs's methods, specifically:
- delete_file_dir does not throw an error if the file does not exist.
- get_file_dir_status automatically strips the leading '/'
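The two wrapper behaviours above can be sketched roughly as follows. This is an
illustration only, not the code in this patch: LenientHdfsClient, FakeClient and the
local FileNotFound exception are hypothetical stand-ins, and the wrapped client is
assumed to expose delete_file_dir and get_file_dir_status like pywebhdfs does.

```python
class FileNotFound(Exception):
    """Hypothetical stand-in for the client's 'path does not exist' error."""


class LenientHdfsClient:
    """Wraps a WebHDFS-style client and changes only the two behaviours above."""

    def __init__(self, client):
        self._client = client

    def delete_file_dir(self, path, recursive=False):
        """Delete path; return False instead of raising if it does not exist."""
        try:
            return self._client.delete_file_dir(path, recursive=recursive)
        except FileNotFound:
            return False

    def get_file_dir_status(self, path):
        """Strip leading '/' so callers can pass absolute or relative paths."""
        return self._client.get_file_dir_status(path.lstrip('/'))


class FakeClient:
    """In-memory fake, used only to exercise the wrapper in this sketch."""

    def __init__(self, paths):
        self.paths = set(paths)

    def delete_file_dir(self, path, recursive=False):
        if path not in self.paths:
            raise FileNotFound(path)
        self.paths.remove(path)
        return True

    def get_file_dir_status(self, path):
        if path.startswith('/'):
            raise ValueError('path must not start with "/"')
        if path not in self.paths:
            raise FileNotFound(path)
        return {'path': path}
```

With this shape, tests can call delete_file_dir unconditionally during cleanup
without first checking whether the path exists.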
Change-Id: Ic630886e253e43b2daaf5adc8dedc0a271b0391f
Reviewed-on: http://gerrit.cloudera.org:8080/370
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
The libhdfs hdfsListDirectory API documentation is wrong. It says the call returns
NULL when there is an error, but it also returns NULL when the directory
is empty. Impala needs to check errno to determine whether an error happened.
The HDFS issue is addressed by HDFS-8407.
Change-Id: I9574c321a56fe339d4ccc3bb5bea59bc41f48ac4
(cherry picked from commit 20da688af19ca41576c82fd7b7d49b4346dbae92)
Reviewed-on: http://gerrit.cloudera.org:8080/394
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
This patch encapsulates pytest's skipif markers in classes. It leads to the following
benefits:
- Provide context and grouping for tests being skipped.
- As we improve test reporting, annotations will give us a better idea of coverage.
Change-Id: Ib0557fb78c873047c214bb62bb6b045ceabaf0c9
Reviewed-on: http://gerrit.cloudera.org:8080/297
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
Reviewed-on: http://gerrit.cloudera.org:8080/343
Add skip markers for S3 that can be used to categorize the tests that
are skipped against S3 to help see what coverage is missing. Soon
we'll be reworking some tests and/or adding new tests to get back the
important gaps.
Also, add a mechanism to parameterize paths in the .test files, and
start using these new variables. This is a step toward enabling some
more tests against S3.
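As an illustration of the path parameterization idea (not the mechanism used in this
patch), a .test file section could be expanded with simple variable substitution.
The variable name FILESYSTEM_PREFIX and the helper expand_test_section are
hypothetical and chosen only for this sketch.

```python
from string import Template


def expand_test_section(text, variables):
    """Replace $NAME placeholders in a .test section with filesystem-specific values.

    Unknown placeholders are left untouched so that unrelated '$' text survives.
    """
    return Template(text).safe_substitute(variables)
```

The same query text can then run against HDFS or S3 by changing only the
variable values supplied by the test harness.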
Finally, a fix for buildall.sh to stop the minicluster before applying
the metastore snapshot. Otherwise, this fails since the ms db is in
use.
Change-Id: I142434ed67bed407e61d7b2c90f825734fc0dce0
Reviewed-on: http://gerrit.cloudera.org:8080/127
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Impala only checks the first applicable ACL entry, but a user can be a member of more
than one group. If any of the matching group entries contains the requested permissions,
access should be granted.
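A minimal sketch of the intended group check, assuming a simplified model in which
each ACL entry is a (kind, name, perms) tuple. AclEntry and group_access_allowed are
hypothetical names; owner, named-user, mask and "other" handling are deliberately
left out.

```python
from collections import namedtuple

# Simplified ACL entry: kind is e.g. 'group', perms is a string such as 'rw'.
AclEntry = namedtuple('AclEntry', ['kind', 'name', 'perms'])


def group_access_allowed(entries, user_groups, requested):
    """Return True if ANY group entry matching one of the user's groups
    grants all requested permissions (instead of stopping at the first match)."""
    requested = set(requested)
    for entry in entries:
        if entry.kind == 'group' and entry.name in user_groups:
            if requested <= set(entry.perms):
                return True
    return False
```

A user in both 'analysts' (r) and 'etl' (rw) is granted write access here, whereas
checking only the first matching entry would have denied it.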
Change-Id: I16164ee906cf147e2f1f2fd389762593e85a1e84
Reviewed-on: http://gerrit.cloudera.org:8080/104
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
This patch introduces a new pytest marker that skips tests that currently don't work when
s3 is used as the underlying file system. The set of blacklisted tests is a superset of
tests that cannot be run with s3. Follow up patches will remove some of the test files
from the blacklist.
Change-Id: I39a58223d3435f0bd6496ffd00a2d483b751693d
Reviewed-on: http://gerrit.cloudera.org:8080/82
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
DataSink should check whether there are any rows to output before creating a data file
or sending data, so that it doesn't generate a zero-size data file or create an empty
partition directory. It also fixes the issue reported in IMPALA-1432.
Change-Id: I58c995f7d5cda203c23bdd9d09776e4cf35c2246
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5545
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: jenkins
Although Hive ignores directories with '_' or '.' as prefixes, some
tools only ignore those beginning with '_'.
Change-Id: I499491b0cb1919c4b3a46efcc45b57ad56bfdf86
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4985
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 469d4bd85b33fd0282594197055bb5dce47ecc9e)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4999
Impala needs to combine file permission and getAclStatus output to get full
Acl list and use that to check permissions.
Change-Id: I6d5884932423573e545680a2747d85bdf5793909
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4683
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Juan Yu <jyu@cloudera.com>
This patch forces LOAD and INSERT to check ACLs during analysis. We
mimic the behaviour of HDFS's ACL checking by adding code to
FsPermissionChecker.
Change-Id: I42660db1da13ceaef63f582cff2c2078e08f90a1
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4428
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
The tests pass every time locally (in a 60 minute run), but fail
intermittently on our build machines.
Change-Id: I62d5ea0df8c42728a538b29bd16006be3179bfd3
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3489
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
If a partition had a location that did not exist in HDFS, Impala would
refuse to load its metadata. This meant a typo could render a table
unloadable. We fix this problem by removing the existence check from the
frontend, and by inheriting access from the first extant parent of the
partition directory.
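The "first extant parent" lookup can be sketched as below. This uses the local
filesystem as a stand-in for HDFS, and first_existing_ancestor and inherited_mode
are hypothetical helper names, not functions from this patch.

```python
import os


def first_existing_ancestor(path):
    """Walk up from path until a directory that actually exists is found."""
    current = os.path.abspath(path)
    while not os.path.exists(current):
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return current


def inherited_mode(path):
    """Permission bits a missing partition directory would inherit."""
    return os.stat(first_existing_ancestor(path)).st_mode & 0o777
```

A partition whose location contains a typo then resolves to the nearest existing
directory for access checks instead of making the whole table unloadable.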
Fixing this exposed a second issue, where Impala wouldn't create
directories for partitions in the right place after an INSERT if the
partition location had been changed. To get this right we have to plumb
the partition ID through to Coordinator::FinalizeSuccessfulInsert(), so
that the coordinator can look up the partition's location from the
query-wide descriptor table. As a by-product, this patch rationalises
the per-partition, per-fragment statistics gathering a little bit by
putting almost all the per-partition stats into TInsertPartitionStatus.
Change-Id: I9ee0a1a1ef62cf28f55be3249e8142c362083163
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2851
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins
their parent's permissions
This patch adds --insert_inherit_permissions. If true, all
new partition directories created by INSERT will inherit their
permissions from their parent. When false, the directories are created
with the default permissions.
Change-Id: Ib2b4c251e51ea5048387169678e8dde34ecfe5f6
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1917
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
Partition column expressions are analysed twice for INSERT statements -
once to infer the type and so to add a possible cast, and once to
compute stats on the resulting expr. However, this process resulted in
a partition column expr that was an IntLiteral getting the smallest type
that would contain its value, rather than retaining the
column-compatible type that had been assigned to it.
This patch does the minimum thing, which is make IntLiteral.analyze()
idempotent. Doing the same thing to Expr and LiteralExpr unearths some
other bugs, which we will have to fix in a follow-on patch (see
IMPALA-884).
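The idempotency fix can be modelled as follows. This is a hypothetical Python model
of the idea, not Impala's Java IntLiteral: the type table, cast_to and the
_analyzed flag are all assumptions made for this sketch.

```python
# (type name, exclusive upper bound) in increasing order of width.
INT_TYPES = [('TINYINT', 2**7), ('SMALLINT', 2**15), ('INT', 2**31), ('BIGINT', 2**63)]


class IntLiteral:
    def __init__(self, value):
        self.value = value
        self.type = None
        self._analyzed = False

    def analyze(self):
        """Pick the smallest fitting type, but only on the first call."""
        if self._analyzed:
            return  # a second analysis must not undo a type set by cast_to()
        for name, limit in INT_TYPES:
            if -limit <= self.value < limit:
                self.type = name
                break
        else:
            raise ValueError('literal out of range: %d' % self.value)
        self._analyzed = True

    def cast_to(self, type_name):
        """Assign a column-compatible type after the first analysis."""
        self.type = type_name
```

Without the early return, the second analyze() call would shrink the literal back
to TINYINT and lose the column-compatible type.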
Change-Id: Ie22fc5d3f4832c735a1ebc0ef78f50d736f597fd
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1931
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 1912d65ea21a5025d385948642f0d4aadad91abf)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1947
This patch goes some way to improving recovery after an INSERT
fails. Inserts now write intermediate results to
<table_dir>/.impala_insert_staging. After execution completes, either
successfully or not, the query-specific directory under that directory
is deleted.
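The staging flow described above can be sketched like this, using the local
filesystem in place of HDFS. run_insert and the write_results callback are
hypothetical; only the .impala_insert_staging directory name comes from this patch.

```python
import os
import shutil
import uuid

STAGING_DIR_NAME = '.impala_insert_staging'


def run_insert(table_dir, write_results):
    """Write into a query-specific staging dir, move results into place,
    and always remove the query-specific dir afterwards."""
    query_dir = os.path.join(table_dir, STAGING_DIR_NAME, uuid.uuid4().hex)
    os.makedirs(query_dir)
    try:
        # write_results writes files into query_dir and returns their names.
        for name in write_results(query_dir):
            os.rename(os.path.join(query_dir, name), os.path.join(table_dir, name))
    finally:
        # Runs on success and on failure, so no partial results are left behind.
        shutil.rmtree(query_dir, ignore_errors=True)
```

Note that this sketch does not undo files already moved to their final location,
which matches the second future item listed below.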
This doesn't complete the job for better cleanup (although this goes as
far as IMPALA-449 suggests). Two things to do in the future:
* Have each backend delete its own staging files on error. The
difficulty getting there now is that backends don't know if they are
cancelled in error or because a LIMIT was reached.
* If the operation to move files to their final destinations should
fail during FinalizeQuery(), the coordinator should perform
compensation actions and delete the files that made it.
Note: We also considered a query-wide and impalad-wide option to change
the staging dir. There are advantages to this (all intermediate results
go to a known location which is easy to clean up on failure), but also
security and other operational concerns. Worth revisiting in the future.
Change-Id: Ia54cf36db6a382e359877f87d7d40aad7fdb77be
Reviewed-on: http://gerrit.ent.cloudera.com:8080/670
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
The Impala CatalogService manages the caching and dissemination of cluster-wide metadata.
The CatalogService combines the metadata from the Hive Metastore, the NameNode,
and potentially additional sources in the future. The CatalogService uses the
StateStore to broadcast metadata updates across the cluster.
The CatalogService also directly handles executing metadata update requests from
impalad servers (DDL requests). It exposes a Thrift interface to allow impalads to
connect directly and execute their DDL operations.
The CatalogService has two main components - a C++ server that implements StateStore
integration, Thrift service implementation, and exporting of the debug webpage/metrics.
The other main component is the Java Catalog that manages caching and updating of all
the metadata. For each StateStore heartbeat, a delta of all metadata updates is broadcast
to the rest of the cluster.
Some Notes On the Changes
---
* The metadata is all sent as thrift structs. To do this all catalog objects (Tables/Views,
Databases, UDFs) have a thrift struct to represent them. These are sent with each statestore
delta update.
* The existing Catalog class has been separated into two separate sub-classes: an
ImpaladCatalog and a CatalogServiceCatalog. See the comments on those classes for more
details.
What is working:
* New CatalogService created
* Working with statestore delta updates and latest UDF changes
* DDL performed on Node 1 is now visible on all other nodes without a "refresh".
* Each DDL operation against the Catalog Service will return the catalog version that
contains the change. An impalad will wait for the statestore heartbeat that contains this
version before returning from the DDL command.
* All table types (Hbase, Hdfs, Views) getting their metadata propagated properly
* Block location information included in CS updates and used by Impalads
* Column and table stats included in CS updates and used by Impalads
* Query tests are all passing
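The "wait for the heartbeat that contains this version" behaviour in the list above
can be sketched with a condition variable. CatalogVersionTracker and its method
names are hypothetical and only model the idea; the real implementation lives in
the C++ impalad code.

```python
import threading


class CatalogVersionTracker:
    """Tracks the newest catalog version seen in statestore heartbeats."""

    def __init__(self):
        self._version = 0
        self._cond = threading.Condition()

    def on_heartbeat(self, version):
        """Called when a heartbeat delivers catalog updates up to 'version'."""
        with self._cond:
            if version > self._version:
                self._version = version
                self._cond.notify_all()

    def wait_for_version(self, min_version, timeout=None):
        """Block until the local catalog has reached min_version.

        Returns True if the version was reached, False if the timeout expired.
        """
        with self._cond:
            return self._cond.wait_for(
                lambda: self._version >= min_version, timeout=timeout)
```

A DDL handler would call wait_for_version() with the version returned by the
Catalog Service before reporting success to the client.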
Still TODO:
* Directly return catalog object metadata from DDL requests
* Poll the Hive Metastore to detect new/dropped/modified tables
* Reorganize the FE code for the Catalog Service. I don't think we want everything in the
same JAR.
Change-Id: I8c61296dac28fb98bcfdc17361f4f141d3977eda
Reviewed-on: http://gerrit.ent.cloudera.com:8080/601
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
OVERWRITE
INSERT OVERWRITE into an unpartitioned table is supposed to remove all
data files from the root. This should not include hidden files or
directories. This patch excludes hidden files from deletion, and adds a
test case.
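A rough sketch of the overwrite rule for an unpartitioned table, using the local
filesystem in place of HDFS. The helper names are hypothetical, and treating both
'.' and '_' as hidden prefixes and skipping all sub-directories are assumptions
made for this example.

```python
import os

# Assumed hidden-name prefixes, following the Hive convention.
HIDDEN_PREFIXES = ('.', '_')


def is_hidden(name):
    return name.startswith(HIDDEN_PREFIXES)


def clear_unpartitioned_table(table_dir):
    """Remove visible data files from the table root; keep hidden files
    and directories. Returns the names of the files that were removed."""
    removed = []
    for name in sorted(os.listdir(table_dir)):
        path = os.path.join(table_dir, name)
        if is_hidden(name) or os.path.isdir(path):
            continue
        os.remove(path)
        removed.append(name)
    return removed
```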
Partition directories are still removed in their entirety: the cost of
statting a large number of files and directories rather than issuing a
single "rm -rf" outweighs the benefits of preserving hidden files for
now.
Hive does not preserve hidden files in either configuration.
Change-Id: Ia73e55e011c26c88f14745075210cf359764e3c1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/418
Tested-by: jenkins
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>