diff --git a/README-build.md b/README-build.md
new file mode 100644
index 000000000..1297b8698
--- /dev/null
+++ b/README-build.md
@@ -0,0 +1,69 @@
+This document introduces the Impala project layout and some key configuration variables.
+Beware that it may become stale over time as the project evolves.
+
+# Detailed Build Notes
+
+Impala can be built with pre-built components or components downloaded from S3.
+The components needed to build Impala are Apache Hadoop, Hive, HBase, and Sentry.
+If you need to manually override the locations or versions of these components, you
+can do so through the environment variables and scripts listed below.
+
+## Scripts and directories
+
+| Location | Purpose |
+|------------------------------|---------|
+| bin/impala-config.sh | This script must be sourced to set up all environment variables properly to allow other scripts to work |
+| bin/impala-config-local.sh | A script can be created in this location to set local overrides for any environment variables |
+| bin/impala-config-branch.sh | A version of the above that can be checked into a branch for convenience. |
+| bin/bootstrap_build.sh | A helper script to bootstrap some of the build requirements. |
+| bin/bootstrap_development.sh | A helper script to bootstrap a developer environment. Please read it before using. |
+| be/build/ | Impala build output goes here. |
+| be/generated-sources/ | Thrift and other generated source will be found here. |
+
+## Build Related Variables
+
+| Environment variable | Default value | Description |
+|----------------------|---------------|-------------|
+| IMPALA_HOME | | Top level Impala directory |
+| IMPALA_TOOLCHAIN | "${IMPALA_HOME}/toolchain" | Native toolchain directory (for compilers, libraries, etc.) |
+| SKIP_TOOLCHAIN_BOOTSTRAP | "false" | Skips downloading the toolchain and any Python dependencies if "true" |
+| CDH_BUILD_NUMBER | | Identifier to indicate the CDH build number |
+| CDH_COMPONENTS_HOME | "${IMPALA_HOME}/toolchain/cdh_components-${CDH_BUILD_NUMBER}" | Location of the CDH components within the toolchain. |
+| CDH_MAJOR_VERSION | "5" | Identifier used to uniqueify paths for potentially incompatible component builds. |
+| IMPALA_CONFIG_SOURCED | "1" | Set by ${IMPALA_HOME}/bin/impala-config.sh (internal use) |
+| JAVA_HOME | "/usr/lib/jvm/${JAVA_VERSION}" | Used to locate Java |
+| JAVA_VERSION | "java-7-oracle-amd64" | Can override to set a local Java version. |
+| JAVA | "${JAVA_HOME}/bin/java" | Java binary location. |
+| CLASSPATH | | See bin/set-classpath.sh for details. |
+| PYTHONPATH | | Will be changed to include: "${IMPALA_HOME}/shell/gen-py" "${IMPALA_HOME}/testdata" "${THRIFT_HOME}/python/lib/python2.7/site-packages" "${HIVE_HOME}/lib/py" |
+
+## Source Directories for Impala
+
+| Environment variable | Default value | Description |
+|----------------------|---------------|-------------|
+| IMPALA_BE_DIR | "${IMPALA_HOME}/be" | Backend directory. Build output is also stored here. |
+| IMPALA_FE_DIR | "${IMPALA_HOME}/fe" | Frontend directory |
+| IMPALA_COMMON_DIR | "${IMPALA_HOME}/common" | Common code (thrift, function registry) |
+
+## Various Compilation Settings
+
+| Environment variable | Default value | Description |
+|----------------------|---------------|-------------|
+| IMPALA_BUILD_THREADS | "8" or set to number of processors by default. | Used for make -j and distcc -j settings. |
+| IMPALA_MAKE_FLAGS | "" | Any extra settings to pass to make. Also used when copying udfs / udas into HDFS. |
+| USE_SYSTEM_GCC | "0" | If set to any other value, directs cmake to not set GCC_ROOT, CMAKE_C_COMPILER, CMAKE_CXX_COMPILER, as well as setting TOOLCHAIN_LINK_FLAGS |
+| IMPALA_CXX_COMPILER | "default" | Used by cmake (cmake_modules/toolchain and clang_toolchain.cmake) to select gcc / clang |
+| USE_GOLD_LINKER | "true" | Directs backend cmake to use gold. |
+| IS_OSX | "false" | (Experimental) currently only used to disable Kudu. |
+
+## Dependencies
+| Environment variable | Default value | Description |
+|----------------------|---------------|-------------|
+| HADOOP_HOME | "${CDH_COMPONENTS_HOME}/hadoop-${IMPALA_HADOOP_VERSION}/" | Used to locate Hadoop |
+| HADOOP_INCLUDE_DIR | "${HADOOP_HOME}/include" | For 'hdfs.h' |
+| HADOOP_LIB_DIR | "${HADOOP_HOME}/lib" | For 'libhdfs.a' or 'libhdfs.so' |
+| HIVE_HOME | "${CDH_COMPONENTS_HOME}/hive-${IMPALA_HIVE_VERSION}/" | |
+| HBASE_HOME | "${CDH_COMPONENTS_HOME}/hbase-${IMPALA_HBASE_VERSION}/" | |
+| SENTRY_HOME | "${CDH_COMPONENTS_HOME}/sentry-${IMPALA_SENTRY_VERSION}/" | Used to set up test data |
+| THRIFT_HOME | "${IMPALA_TOOLCHAIN}/thrift-${IMPALA_THRIFT_VERSION}" | |
+
diff --git a/README.md b/README.md
index f37be9ad6..7faa6532e 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # Welcome to Impala
 
-Lightning-fast, distributed [SQL](http://en.wikipedia.org/wiki/SQL) queries for petabytes
+Lightning-fast, distributed [SQL](https://en.wikipedia.org/wiki/SQL) queries for petabytes
 of data stored in Apache Hadoop clusters.
 
 Impala is a modern, massively-distributed, massively-parallel, C++ query engine that lets
@@ -8,18 +8,24 @@ you analyze, transform and combine data from a variety of data sources:
 
 * Best of breed performance and scalability.
 * Support for data stored in [HDFS](https://hadoop.apache.org/),
-  [Apache HBase](http://hbase.apache.org/) and [Amazon S3](http://aws.amazon.com/s3/).
+  [Apache HBase](https://hbase.apache.org/), [Apache Kudu](https://kudu.apache.org/),
+  [Amazon S3](https://aws.amazon.com/s3/),
+  [Azure Data Lake Storage](https://azure.microsoft.com/en-us/services/storage/data-lake-storage/),
+  [Apache Hadoop Ozone](https://hadoop.apache.org/ozone/) and more!
 * Wide analytic SQL support, including window functions and subqueries.
-* On-the-fly code generation using [LLVM](http://llvm.org/) to generate CPU-efficient
+* On-the-fly code generation using [LLVM](http://llvm.org/) to generate lightning-fast
   code tailored specifically to each individual query.
-* Support for the most commonly-used Hadoop file formats, including the
-  [Apache Parquet](https://parquet.apache.org/) project.
+* Support for the most commonly-used Hadoop file formats, including
+  [Apache Parquet](https://parquet.apache.org/) and [Apache ORC](https://orc.apache.org).
+* Support for industry-standard security protocols, including Kerberos, LDAP and TLS.
 * Apache-licensed, 100% open source.
 
 ## More about Impala
 
 To learn more about Impala as a business user, or to try Impala live or in a VM, please
-visit the [Impala homepage](https://impala.apache.org).
+visit the [Impala homepage](https://impala.apache.org). Detailed documentation for
+administrators and users is available at
+[Apache Impala documentation](https://impala.apache.org/impala-docs.html).
 
 If you are interested in contributing to Impala as a developer, or learning more about
 Impala's internals and architecture, visit the
@@ -36,70 +42,8 @@ Please refer to EXPORT\_CONTROL.md for more information.
 
 ## Build Instructions
 
-See bin/bootstrap_build.sh.
+See [Impala's developer documentation](https://cwiki.apache.org/confluence/display/IMPALA/Impala+Home)
+to get started.
 
-### Detailed Build Notes
-
-Impala can be built with pre-built components or components downloaded from S3.
-The components needed to build Impala are Apache Hadoop, Hive, HBase, and Sentry.
-If you need to manually override the locations or versions of these components, you
-can do so through the environment variables and scripts listed below.
-
-##### Scripts and directories
-
-| Location | Purpose |
-|------------------------------|---------|
-| bin/impala-config.sh | This script must be sourced to setup all environment variables properly to allow other scripts to work |
-| bin/impala-config-local.sh | A script can be created in this location to set local overrides for any environment variables |
-| bin/impala-config-branch.sh | A version of the above that can be checked into a branch for convenience. |
-| bin/bootstrap_build.sh | A helper script to bootstrap some of the build requirements. |
-| bin/bootstrap_development.sh | A helper script to bootstrap a developer environment. Please read it before using. |
-| be/build/ | Impala build output goes here. |
-| be/generated-sources/ | Thrift and other generated source will be found here. |
-
-##### Build Related Variables
-
-| Environment variable | Default value | Description |
-|----------------------|---------------|-------------|
-| IMPALA_HOME | | Top level Impala directory |
-| IMPALA_TOOLCHAIN | "${IMPALA_HOME}/toolchain" | Native toolchain directory (for compilers, libraries, etc.) |
-| SKIP_TOOLCHAIN_BOOTSTRAP | "false" | Skips downloading the toolchain any python dependencies if "true" |
-| CDH_BUILD_NUMBER | | Identifier to indicate the CDH build number
-| CDH_COMPONENTS_HOME | "${IMPALA_HOME}/toolchain/cdh_components-${CDH_BUILD_NUMBER}" | Location of the CDH components within the toolchain. |
-| CDH_MAJOR_VERSION | "5" | Identifier used to uniqueify paths for potentially incompatible component builds. |
-| IMPALA_CONFIG_SOURCED | "1" | Set by ${IMPALA_HOME}/bin/impala-config.sh (internal use) |
-| JAVA_HOME | "/usr/lib/jvm/${JAVA_VERSION}" | Used to locate Java |
-| JAVA_VERSION | "java-7-oracle-amd64" | Can override to set a local Java version. |
-| JAVA | "${JAVA_HOME}/bin/java" | Java binary location. |
-| CLASSPATH | | See bin/set-classpath.sh for details. |
-| PYTHONPATH | Will be changed to include: "${IMPALA_HOME}/shell/gen-py" "${IMPALA_HOME}/testdata" "${THRIFT_HOME}/python/lib/python2.7/site-packages" "${HIVE_HOME}/lib/py" |
-
-##### Source Directories for Impala
-
-| Environment variable | Default value | Description |
-|----------------------|---------------|-------------|
-| IMPALA_BE_DIR | "${IMPALA_HOME}/be" | Backend directory. Build output is also stored here. |
-| IMPALA_FE_DIR | "${IMPALA_HOME}/fe" | Frontend directory |
-| IMPALA_COMMON_DIR | "${IMPALA_HOME}/common" | Common code (thrift, function registry) |
-
-##### Various Compilation Settings
-
-| Environment variable | Default value | Description |
-|----------------------|---------------|-------------|
-| IMPALA_BUILD_THREADS | "8" or set to number of processors by default. | Used for make -j and distcc -j settings. |
-| IMPALA_MAKE_FLAGS | "" | Any extra settings to pass to make. Also used when copying udfs / udas into HDFS. |
-| USE_SYSTEM_GCC | "0" | If set to any other value, directs cmake to not set GCC_ROOT, CMAKE_C_COMPILER, CMAKE_CXX_COMPILER, as well as setting TOOLCHAIN_LINK_FLAGS |
-| IMPALA_CXX_COMPILER | "default" | Used by cmake (cmake_modules/toolchain and clang_toolchain.cmake) to select gcc / clang |
-| USE_GOLD_LINKER | "true" | Directs backend cmake to use gold. |
-| IS_OSX | "false" | (Experimental) currently only used to disable Kudu. |
-
-##### Dependencies
-| Environment variable | Default value | Description |
-|----------------------|---------------|-------------|
-| HADOOP_HOME | "${CDH_COMPONENTS_HOME}/hadoop-${IMPALA_HADOOP_VERSION}/" | Used to locate Hadoop |
-| HADOOP_INCLUDE_DIR | "${HADOOP_HOME}/include" | For 'hdfs.h' |
-| HADOOP_LIB_DIR | "${HADOOP_HOME}/lib" | For 'libhdfs.a' or 'libhdfs.so' |
-| HIVE_HOME | "${CDH_COMPONENTS_HOME}/{hive-${IMPALA_HIVE_VERSION}/" | |
-| HBASE_HOME | "${CDH_COMPONENTS_HOME}/hbase-${IMPALA_HBASE_VERSION}/" | |
-| SENTRY_HOME | "${CDH_COMPONENTS_HOME}/sentry-${IMPALA_SENTRY_VERSION}/" | Used to setup test data |
-| THRIFT_HOME | "${IMPALA_TOOLCHAIN}/thrift-${IMPALA_THRIFT_VERSION}" | |
+[Detailed build notes](README-build.md) contains more information on the project
+layout and build.
diff --git a/bin/rat_exclude_files.txt b/bin/rat_exclude_files.txt
index a125c3909..04d73d0c7 100644
--- a/bin/rat_exclude_files.txt
+++ b/bin/rat_exclude_files.txt
@@ -92,7 +92,7 @@ be/src/util/cache/rl-cache-test.cc
 be/src/testutil/certificates-info.txt
 bin/README-RUNNING-BENCHMARKS
 LOGS.md
-README.md
+README*.md
 */README
 */README.dox
 */README.txt