Set up and use Presto on Apsara File Storage for HDFS

This topic describes how to set up and use Presto on Apsara File Storage for HDFS.

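Prerequisites

Apsara File Storage for HDFS is activated, and a file system instance and a mount target are created. For details, see the Apsara File Storage for HDFS quick start guide.
A Hadoop cluster is set up, and JDK 1.8 or later is installed on every node of the cluster. Hadoop 2.7.2 or later is recommended. This topic uses Apache Hadoop 2.8.5.
Hive is downloaded and installed in the cluster. This topic uses Hive 2.3.7.
The Presto installation package and presto-cli-xxx-executable.jar are downloaded. This topic uses Presto 0.265.1.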
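
The version requirements above can be checked before going further. The following is a minimal sketch, assuming `java` and `hadoop` are on the `PATH` and that `sort -V` (GNU coreutils version sort) is available. The helper names `version_ge` and `check_min` are hypothetical and introduced only for this example.

```shell
#!/usr/bin/env bash
# Returns 0 when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Print OK or FAIL for one tool; returns 1 on FAIL.
check_min() {
  local name="$1" actual="$2" minimum="$3"
  if version_ge "$actual" "$minimum"; then
    echo "OK: $name $actual >= $minimum"
  else
    echo "FAIL: $name $actual < $minimum"
    return 1
  fi
}

status=0

if command -v java >/dev/null 2>&1; then
  # `java -version` writes to stderr, for example: java version "1.8.0_292"
  java_ver="$(java -version 2>&1 | grep -m 1 'version' | sed -E 's/.*"([^"]+)".*/\1/')"
  check_min "JDK" "$java_ver" "1.8" || status=1
fi

if command -v hadoop >/dev/null 2>&1; then
  # `hadoop version` prints, for example: Hadoop 2.8.5
  hadoop_ver="$(hadoop version 2>/dev/null | head -n 1 | awk '{print $2}')"
  check_min "Hadoop" "$hadoop_ver" "2.7.2" || status=1
fi

echo "prerequisite check status: $status"
```

Tools that are not installed are skipped rather than reported, so an empty result means nothing was checked on that node.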

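Background information

Presto is an open source distributed SQL query engine designed for interactive analytic queries.

Note

In this topic, Presto reads data on Apsara File Storage for HDFS by connecting to the Hive metastore service. Using Presto on Apsara File Storage for HDFS requires some additional dependency packages. For details, see Step 2: Configure Presto.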

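Step 1: Mount the Apsara File Storage for HDFS instance on the Hadoop cluster

Configure the Apsara File Storage for HDFS instance in the Hadoop cluster. For details, see Mount an Apsara File Storage for HDFS file system.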

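Step 2: Configure Presto

Run the following command to extract the Presto installation package.
```
tar -zxf presto-server-0.265.1.tar.gz
```
Run the following command to create the etc directory in the directory where Presto was extracted.
```
mkdir presto-server-0.265.1/etc
```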

Configure the node environment.

Create the etc/node.properties file.

vim presto-server-0.265.1/etc/node.properties

Add the following content to the etc/node.properties file.

node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data
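
The node.id value above is a placeholder. Presto expects node.id to be unique for each node and to stay the same across restarts and upgrades. The following is a minimal sketch of one way to generate it once per node, assuming `uuidgen` or the Linux file `/proc/sys/kernel/random/uuid` is available. `PRESTO_ETC` is a hypothetical variable introduced for this example.

```shell
#!/usr/bin/env bash
# Directory holding the Presto configuration. PRESTO_ETC is a hypothetical
# variable introduced for this sketch.
PRESTO_ETC="${PRESTO_ETC:-presto-server-0.265.1/etc}"
NODE_PROPS="$PRESTO_ETC/node.properties"

mkdir -p "$PRESTO_ETC"

# Write the file only when it does not exist yet, so the ID stays stable
# across restarts and upgrades.
if [ ! -f "$NODE_PROPS" ]; then
  # Prefer uuidgen; fall back to the Linux kernel's UUID source.
  node_id="$(uuidgen 2>/dev/null || cat /proc/sys/kernel/random/uuid)"
  cat > "$NODE_PROPS" <<EOF
node.environment=production
node.id=$node_id
node.data-dir=/var/presto/data
EOF
fi

# Show the ID that this node will use.
grep '^node.id=' "$NODE_PROPS"
```

Run the script on every node separately. Copying a generated node.properties file between nodes would duplicate the ID.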

Configure the JVM parameters.

Create the etc/jvm.config file.

vim presto-server-0.265.1/etc/jvm.config

Add the following content to the etc/jvm.config file.

-server
-Xmx8G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError

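Configure the Presto server.

This topic configures the coordinator and a worker on the same machine. You can also deploy the coordinator and the workers on different machines. For details, see the official Presto documentation.

Create the etc/config.properties file.

vim presto-server-0.265.1/etc/config.properties

Add the following content to the etc/config.properties file.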

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
# Replace xx.xx.xx.xx with the IP address of this machine.
discovery.uri=http://xx.xx.xx.xx:8080
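
The per-node query memory limit in config.properties has to fit inside the JVM heap set by -Xmx in jvm.config. The following is a minimal sketch that compares the two values. It checks a necessary condition only, because Presto also keeps part of the heap for other allocations. The helper names `to_mb` and `check_memory` are hypothetical, and only whole-number sizes with a G, GB, M, or MB suffix are handled.

```shell
#!/usr/bin/env bash
# Convert a size such as 8G, 2GB, or 512MB to mebibytes.
to_mb() {
  local value="${1%B}"            # drop an optional trailing "B" (2GB -> 2G)
  local number="${value%[GgMm]}"  # strip the unit letter (2G -> 2)
  case "$value" in
    *[Gg]) echo $(( number * 1024 )) ;;
    *[Mm]) echo "$number" ;;
    *)     echo "unsupported size: $1" >&2; return 1 ;;
  esac
}

# Succeeds when query.max-total-memory-per-node is below the -Xmx heap size.
# Usage: check_memory <etc-dir>
check_memory() {
  local etc_dir="$1" heap total heap_mb total_mb
  heap="$(sed -n 's/^-Xmx//p' "$etc_dir/jvm.config" | head -n 1)"
  total="$(sed -n 's/^query\.max-total-memory-per-node=//p' "$etc_dir/config.properties" | head -n 1)"
  heap_mb="$(to_mb "$heap")" || return 1
  total_mb="$(to_mb "$total")" || return 1
  if [ "$total_mb" -lt "$heap_mb" ]; then
    echo "OK: query.max-total-memory-per-node ($total) < -Xmx ($heap)"
  else
    echo "FAIL: query.max-total-memory-per-node ($total) must be below -Xmx ($heap)"
    return 1
  fi
}

# Run the check only when both files exist in the assumed location.
ETC_DIR="${1:-presto-server-0.265.1/etc}"
if [ -f "$ETC_DIR/jvm.config" ] && [ -f "$ETC_DIR/config.properties" ]; then
  check_memory "$ETC_DIR"
fi
```

With the values used in this topic (-Xmx8G and query.max-total-memory-per-node=2GB), the check passes.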

Configure the log level.
1. Create the etc/log.properties file.
```
vim presto-server-0.265.1/etc/log.properties
```
2. Add the following content to the etc/log.properties file.
```
com.facebook.presto=INFO
```

Configure the Presto data source.

Create the etc/catalog directory.
```
mkdir presto-server-0.265.1/etc/catalog
```

Create the etc/catalog/hive.properties file.

vim presto-server-0.265.1/etc/catalog/hive.properties

Add the following content to the etc/catalog/hive.properties file.

connector.name=hive-hadoop2
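# Replace xxxx with the IP address of the host that runs the Hive metastore service.
hive.metastore.uri=thrift://xxxx:9083
# Replace with the path to the Hadoop core-site.xml file on this node, where the Apsara File Storage for HDFS file system is already mounted.
hive.config.resources=/path/to/core-site.xml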
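
Presto's .properties files follow Java properties syntax, in which `#` starts a comment only at the beginning of a line. A note written after a value on the same line becomes part of that value, which can turn a URI or a path into an invalid one. The following is a minimal sketch that lists such lines. The function name `find_trailing_comments` is hypothetical, and the pattern is a heuristic that would also flag a value that legitimately contains a space followed by `#`.

```shell
#!/usr/bin/env bash
# Print every "key=value" line that carries a "#..." note after whitespace.
# Returns 0 when such lines exist and 1 when the file is clean (grep semantics).
find_trailing_comments() {
  grep -nE '^[^#!][^=]*=.*[[:space:]]#' "$1"
}

# Scan the configuration files created in this step, if they exist.
for f in presto-server-0.265.1/etc/*.properties \
         presto-server-0.265.1/etc/catalog/*.properties; do
  [ -f "$f" ] || continue
  if find_trailing_comments "$f"; then
    echo "^ move the note(s) above to their own line in $f"
  fi
done
```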

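Compile and replace the JAR package.
Presto is packaged with maven-shade-plugin, which relocates (renames) the packages of the Hadoop dependencies that it bundles. The Apsara File Storage for HDFS Java SDK and Hadoop both depend on protobuf-xxx.jar. When Presto reads data on Apsara File Storage for HDFS through the Hive metastore, the SDK cannot find the protobuf classes under their relocated names and reports an error. To avoid this compatibility issue, add the Apsara File Storage for HDFS Java SDK as a dependency of the Hadoop package used by Presto, and then recompile that package, hadoop-apache2-xxx.jar.
1. Check the version of presto-hadoop-apache2 used by the installed Presto.
  For Presto 0.265.1, the source POM file shows that the hadoop-apache2 version is 2.7.4-9.
2. Download the source code of the matching presto-hadoop-apache2 version.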
```
git clone -b 2.7.4-9 https://github.com/prestodb/presto-hadoop-apache2.git
```
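3. Add the dependency for the latest Apsara File Storage for HDFS Java SDK to the POM file of the source code. This topic uses Java SDK 1.0.5.
  Open the POM file.
```
vim presto-hadoop-apache2/pom.xml
```
  Add the following dependency inside the dependencies element.
```
<dependency>
    <groupId>com.aliyun.dfs</groupId>
    <artifactId>aliyun-sdk-dfs</artifactId>
    <version>1.0.5</version>
</dependency>
```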
4. Compile presto-hadoop-apache2.
```
cd presto-hadoop-apache2
mvn clean package -DskipTests
```
5. Check the generated hadoop-apache2-2.7.4-9.jar.
  1. Run the following command to return to the parent directory.
    cd ..
  2. Run the following command to check that hadoop-apache2-2.7.4-9.jar was generated as expected.
    ls -lh presto-hadoop-apache2/target/
6. Replace the old JAR package.
  1. Move the old JAR package out of the way by renaming it.
    mv presto-server-0.265.1/plugin/hive-hadoop2/hadoop-apache2-2.7.4-9.jar presto-server-0.265.1/plugin/hive-hadoop2/hadoop-apache2-2.7.4-9.jar.bak
  2. Copy the recompiled hadoop-apache2-2.7.4-9.jar to the plugin directory.
    cp presto-hadoop-apache2/target/hadoop-apache2-2.7.4-9.jar presto-server-0.265.1/plugin/hive-hadoop2/
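
The backup and copy in the last step can be wrapped in a small helper so that the original JAR is kept even if the step is repeated, and nothing is touched when the build produced no JAR. The following is a minimal sketch. The function name `replace_jar` is hypothetical and not part of Presto.

```shell
#!/usr/bin/env bash
# Replace a plugin JAR with a rebuilt one, keeping the original as "<name>.bak".
# Usage: replace_jar <rebuilt-jar> <plugin-dir>
replace_jar() {
  local new_jar="$1" plugin_dir="$2" name target
  name="$(basename "$new_jar")"
  target="$plugin_dir/$name"

  # Do not touch the plugin directory unless the rebuilt JAR exists and is non-empty.
  if [ ! -s "$new_jar" ]; then
    echo "rebuilt JAR missing or empty: $new_jar" >&2
    return 1
  fi

  # Back up only once, so a second run never overwrites the original JAR.
  if [ -f "$target" ] && [ ! -f "$target.bak" ]; then
    mv "$target" "$target.bak"
  fi

  cp "$new_jar" "$target"
}

# Example call with the paths used in this topic:
# replace_jar presto-hadoop-apache2/target/hadoop-apache2-2.7.4-9.jar \
#             presto-server-0.265.1/plugin/hive-hadoop2
```

To roll back, rename the .bak file to its original name.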

Configure presto-cli-xxx-executable.jar.
1. Copy the downloaded presto-cli-xxx-executable.jar to the presto-server-0.265.1/bin/ directory.
```
cp presto-cli-0.265.1-executable.jar presto-server-0.265.1/bin/
```
2. Rename presto-cli-xxx-executable.jar in the presto-server-0.265.1/bin/ directory.
```
mv presto-server-0.265.1/bin/presto-cli-0.265.1-executable.jar presto-server-0.265.1/bin/presto
```
3. Grant the execute permission on the renamed file.
```
chmod +x presto-server-0.265.1/bin/presto
```

Step 3: Verify Presto

Run the following command to start the Hive metastore service.
```
hive --service metastore
```

Start the Presto server and connect to the Hive metastore.

Start the Presto server.

presto-server-0.265.1/bin/launcher start

Connect to the Hive metastore.

presto-server-0.265.1/bin/presto --server localhost:8080 --catalog hive
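
The launcher start command returns while the server is still initializing in the background, so a CLI session opened immediately afterward may fail to connect. The following is a minimal sketch that waits until the server answers over HTTP, assuming `curl` is installed. The function name `wait_for_url` is hypothetical, and an HTTP answer from the /v1/info endpoint is taken here only as a rough readiness signal.

```shell
#!/usr/bin/env bash
# Poll a URL until it answers with an HTTP success status or the timeout
# (in seconds) expires.
# Usage: wait_for_url <url> [timeout-seconds]
wait_for_url() {
  local url="$1" timeout="${2:-60}" waited=0
  until curl -sf -o /dev/null --max-time 2 "$url"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for $url" >&2
      return 1
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
  echo "ready: $url"
}

# Example with the port configured in etc/config.properties:
# presto-server-0.265.1/bin/launcher start
# wait_for_url http://localhost:8080/v1/info 120 \
#   && presto-server-0.265.1/bin/presto --server localhost:8080 --catalog hive
```

If the wait times out, the server log under the directory set by node.data-dir is the first place to look.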

Use Presto to create a database in Hive.

Create a test directory on the Apsara File Storage for HDFS instance.

hadoop fs -mkdir dfs://f-xxxxxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/tmp/presto_test

Create the database in Hive.

CREATE SCHEMA hive.presto_test
WITH (location = 'dfs://f-xxxxxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/tmp/presto_test');

Create a table in the new database and add data to it.

Switch to the new database.
```
use presto_test;
```

Create a table.

 CREATE TABLE user_info_test (
   user_id bigint,
   firstname varchar,
   lastname varchar,
   country varchar
 )
 WITH (
   format = 'TEXTFILE'
 );

Insert data into the new table.

INSERT INTO user_info_test VALUES(1,'Dennis','Hu','CN'),(2,'Json','Lv','Jpn'),(3,'Mike','Lu','USA');

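Check whether the data of the new table exists on the Apsara File Storage for HDFS instance.
```
hadoop fs -ls dfs://f-xxxxxx.cn-zhangjiakou.dfs.aliyuncs.com:10290/tmp/presto_test/*
```
If the output lists the data files of the new table, Presto can write data to Apsara File Storage for HDFS.
Run a word-count-style aggregation to check whether Presto can read data on Apsara File Storage for HDFS and compute on it.
```
SELECT country,count(*) FROM user_info_test GROUP BY country;
```
If the returned counts match the rows inserted into the table, Presto can read data on Apsara File Storage for HDFS and compute on it.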
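
The same query can also be run without an interactive session and checked automatically. The following is a minimal sketch, assuming the Presto CLI options `--schema`, `--execute`, and `--output-format TSV`. The function name `check_counts` and the `SCHEMA` variable are hypothetical. Set `SCHEMA` to the name of the database created earlier.

```shell
#!/usr/bin/env bash
# Read "country<TAB>count" rows on stdin and succeed only when the three
# inserted countries (CN, Jpn, USA) are each present with a count of 1.
check_counts() {
  awk -F '\t' '
    $2 == 1 && ($1 == "CN" || $1 == "Jpn" || $1 == "USA") { seen[$1] = 1 }
    END {
      n = 0
      for (k in seen) n++
      exit (n == 3 ? 0 : 1)
    }'
}

# Example (SCHEMA is a placeholder for the database created earlier):
# presto-server-0.265.1/bin/presto --server localhost:8080 --catalog hive \
#   --schema "$SCHEMA" --output-format TSV \
#   --execute 'SELECT country, count(*) FROM user_info_test GROUP BY country' \
#   | check_counts && echo "read check passed"
```

The check assumes the table holds only the three rows inserted in this topic. Running the INSERT statement more than once changes the counts and makes the check fail.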