impala/tests/query_test/test_metadata_query_statements.py
Lenni Kuff a2cbd2820e Add Catalog Service and support for automatic metadata refresh
The Impala CatalogService manages the caching and dissemination of cluster-wide metadata.
The CatalogService combines the metadata from the Hive Metastore, the NameNode,
and potentially additional sources in the future. The CatalogService uses the
StateStore to broadcast metadata updates across the cluster.
The CatalogService also directly handles executing metadata update requests from
impalad servers (DDL requests). It exposes a Thrift interface that allows impalads to
connect directly and execute their DDL operations.
The CatalogService has two main components. The first is a C++ server that implements
StateStore integration, the Thrift service implementation, and exporting of the debug
webpage/metrics. The second is the Java Catalog that manages caching and updating of all
the metadata. For each StateStore heartbeat, a delta of all metadata updates is broadcast
to the rest of the cluster.
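
The per-heartbeat delta described above can be sketched roughly as follows. This is
an illustration only, not the CatalogService code: the VersionedCatalog class, its
method names, and the in-memory dict are hypothetical, and the sketch assumes that
every change bumps a single catalog-wide version number.

```python
class VersionedCatalog:
    """Hypothetical in-memory catalog; every change bumps a global version."""

    def __init__(self):
        self.version = 0
        self._objects = {}  # object name -> (version of last change, payload)

    def put(self, name, payload):
        """Add or update an object; return the catalog version containing it."""
        self.version += 1
        self._objects[name] = (self.version, payload)
        return self.version

    def delta_since(self, last_sent_version):
        """Return (current version, objects changed after last_sent_version)."""
        changed = {name: payload
                   for name, (version, payload) in self._objects.items()
                   if version > last_sent_version}
        return self.version, changed


catalog = VersionedCatalog()
catalog.put("functional.alltypes", {"type": "TABLE"})
# First heartbeat: nothing has been sent yet, so everything is in the delta.
sent_version, first_delta = catalog.delta_since(0)
catalog.put("hive_test_db", {"type": "DATABASE"})
# Second heartbeat: only the object changed since the last send is included.
sent_version, second_delta = catalog.delta_since(sent_version)
```

The sketch only covers additions and updates. Dropped objects would need their own
entries in the delta, which is left out here.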

Some Notes On the Changes
---
* The metadata is all sent as Thrift structs. To do this, all catalog objects (Tables/Views,
Databases, UDFs) have a Thrift struct to represent them. These are sent with each statestore
delta update.
* The existing Catalog class has been separated into two sub-classes: an
ImpaladCatalog and a CatalogServiceCatalog. See the comments on those classes for more
details.
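
As a rough picture of "one struct per catalog object", the sketch below uses Python
dataclasses as a stand-in for Thrift structs. All class and field names here are
hypothetical and are not taken from the Thrift definitions in this change. UDFs are
omitted for brevity.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TableStruct:
    """Hypothetical stand-in for a table/view struct."""
    db_name: str
    tbl_name: str
    table_type: str  # e.g. "HDFS", "HBASE", or "VIEW" in this sketch


@dataclass
class DatabaseStruct:
    """Hypothetical stand-in for a database struct."""
    db_name: str


@dataclass
class CatalogObjectStruct:
    """Envelope: exactly one of the optional payload fields is expected to be set."""
    object_type: str      # "TABLE" or "DATABASE" in this sketch
    catalog_version: int  # catalog version that contains this object
    table: Optional[TableStruct] = None
    database: Optional[DatabaseStruct] = None


update = [
    CatalogObjectStruct("DATABASE", 1, database=DatabaseStruct("hive_test_db")),
    CatalogObjectStruct("TABLE", 2,
                        table=TableStruct("hive_test_db", "testtbl", "HDFS")),
]
# asdict() recursively converts nested dataclasses into plain dicts, which is
# roughly the shape a wire serializer would walk over.
wire_ready = [asdict(obj) for obj in update]
```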

What is working:
* New CatalogService created
* Working with statestore delta updates and latest UDF changes
* DDL performed on Node 1 is now visible on all other nodes without a "refresh".
* Each DDL operation against the Catalog Service will return the catalog version that
  contains the change. An impalad will wait for the statestore heartbeat that contains this
  version before returning from the DDL command.
* All table types (HBase, HDFS, Views) have their metadata propagated properly
* Block location information included in CS updates and used by Impalads
* Column and table stats included in CS updates and used by Impalads
* Query tests are all passing
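
The "wait for the heartbeat that contains this version" behaviour from the list above
can be sketched with a condition variable. This is a simplified illustration, not the
impalad implementation: LocalCatalogVersion and its methods are hypothetical names,
and a timer thread stands in for the statestore heartbeat.

```python
import threading


class LocalCatalogVersion:
    """Hypothetical tracker for the catalog version an impalad has applied."""

    def __init__(self):
        self._version = 0
        self._cond = threading.Condition()

    def apply_update(self, version):
        """Record that an update carrying `version` was applied; wake waiters."""
        with self._cond:
            self._version = max(self._version, version)
            self._cond.notify_all()

    def wait_for_version(self, min_version, timeout_s=None):
        """Block until the applied version reaches min_version; False on timeout."""
        with self._cond:
            return self._cond.wait_for(lambda: self._version >= min_version,
                                       timeout=timeout_s)


tracker = LocalCatalogVersion()
# Pretend the DDL response said the change is in catalog version 7, and that the
# heartbeat carrying version 7 arrives shortly afterwards.
threading.Timer(0.05, tracker.apply_update, args=(7,)).start()
ddl_visible = tracker.wait_for_version(7, timeout_s=5.0)
```

The timeout is a choice made for this sketch so that a missed update cannot block the
caller forever.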

Still TODO:
* Directly return catalog object metadata from DDL requests
* Poll the Hive Metastore to detect new/dropped/modified tables
* Reorganize the FE code for the Catalog Service. I don't think we want everything in the
  same JAR.

Change-Id: I8c61296dac28fb98bcfdc17361f4f141d3977eda
Reviewed-on: http://gerrit.ent.cloudera.com:8080/601
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:53:11 -08:00

107 lines
3.8 KiB
Python

#!/usr/bin/env python
# Copyright (c) 2012 Cloudera, Inc. All rights reserved.
# Impala tests for queries that query metadata and set session settings
import logging
import pytest
from subprocess import call
from tests.common.test_vector import *
from tests.common.impala_test_suite import *
from tests.util.shell_util import exec_shell_cmd
# TODO: For these tests to pass, all table metadata must be created exhaustively.
# The tests should be modified to remove that requirement.

class TestMetadataQueryStatements(ImpalaTestSuite):
  @classmethod
  def get_workload(self):
    return 'functional-query'

  @classmethod
  def add_test_dimensions(cls):
    super(TestMetadataQueryStatements, cls).add_test_dimensions()
    # There is no reason to run these tests using all dimensions.
    cls.TestMatrix.add_constraint(lambda v:\
        v.get_value('table_format').file_format == 'text' and\
        v.get_value('table_format').compression_codec == 'none')

  def setup_method(self, method):
    self.cleanup_db('hive_test_db')

  def teardown_method(self, method):
    self.cleanup_db('hive_test_db')

  def test_show_tables(self, vector):
    self.run_test_case('QueryTest/show', vector)

  def test_describe_table(self, vector):
    self.run_test_case('QueryTest/describe', vector)

  def test_describe_formatted(self, vector):
    # Describe a partitioned table.
    self.exec_and_compare_hive_and_impala_hs2("describe formatted functional.alltypes")
    self.exec_and_compare_hive_and_impala_hs2(
        "describe formatted functional_text_lzo.alltypes")

    # Describe an unpartitioned table.
    self.exec_and_compare_hive_and_impala_hs2("describe formatted tpch.lineitem")
    self.exec_and_compare_hive_and_impala_hs2("describe formatted functional.jointbl")

    try:
      # Describe a view
      self.exec_and_compare_hive_and_impala_hs2(\
          "describe formatted functional.alltypes_view_sub")
    except AssertionError:
      pytest.xfail("Investigate minor difference in displaying null vs empty values")

  def test_use_table(self, vector):
    self.run_test_case('QueryTest/use', vector)

  @pytest.mark.execute_serially
  def test_impala_sees_hive_created_tables_and_databases(self, vector):
    db_name = 'hive_test_db'
    tbl_name = 'testtbl'
    self.client.refresh()
    result = self.execute_query("show databases")
    assert db_name not in result.data
    call(["hive", "-e", "CREATE DATABASE %s" % db_name])

    result = self.execute_query("show databases")
    assert db_name not in result.data

    self.client.refresh()
    result = self.execute_query("show databases")
    assert db_name in result.data

    # Make sure no tables show up in the new database
    result = self.execute_query("show tables in %s" % db_name)
    assert len(result.data) == 0

    self.client.refresh()
    result = self.execute_query("show tables in %s" % db_name)
    assert len(result.data) == 0

    call(["hive", "-e", "CREATE TABLE %s.%s (i int)" % (db_name, tbl_name)])

    result = self.execute_query("show tables in %s" % db_name)
    assert tbl_name not in result.data

    self.client.refresh()
    result = self.execute_query("show tables in %s" % db_name)
    assert tbl_name in result.data

    # Make sure we can actually use the table
    self.execute_query(("insert overwrite table %s.%s "
                        "select 1 from functional.alltypes limit 5"
                        % (db_name, tbl_name)))
    result = self.execute_scalar("select count(*) from %s.%s" % (db_name, tbl_name))
    assert int(result) == 5

    call(["hive", "-e", "DROP TABLE %s.%s " % (db_name, tbl_name)])
    call(["hive", "-e", "DROP DATABASE %s" % db_name])

    # Requires a refresh to see the dropped database
    result = self.execute_query("show databases")
    assert db_name in result.data

    self.client.refresh()
    result = self.execute_query("show databases")
    assert db_name not in result.data