This topic describes how to use the Spark-SQL command-line tool and provides related examples.
Prerequisites
Obtain the Spark-SQL toolkit
You can click Download to obtain dla-spark-toolkit.tar.gz, or download it with wget:
wget https://dla003.oss-cn-hangzhou.aliyuncs.com/dla_spark_toolkit_1/dla-spark-toolkit.tar.gz
After the download completes, extract the toolkit:
tar zxvf dla-spark-toolkit.tar.gz
Note: This toolkit requires JDK 8 or later.
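To confirm that the JDK requirement is met, you can check which Java version is on your PATH; this is only a quick sanity check and not part of the toolkit itself:
java -version
# The reported version should be 1.8 (JDK 8) or higher.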
Procedure
- View the help information
- Run the following commands to view the usage help:
cd /path/to/dla-spark-toolkit
./bin/spark-sql --help
- After running the help command, the output is as follows (a usage sketch based on these options follows the output):
./spark-sql [options] [cli option]
Options:
  --keyId                    Your ALIYUN_ACCESS_KEY_ID, required
  --secretId                 Your ALIYUN_ACCESS_KEY_SECRET, required
  --regionId                 Your Cluster Region Id, required
  --vcName                   Your Virtual Cluster Name, required
  --oss-keyId                Your ALIYUN_ACCESS_KEY_ID to upload local resource to oss
                             default is same as --keyId
  --oss-secretId             Your ALIYUN_ACCESS_KEY_SECRET, default is same as --secretId
  --oss-endpoint             Oss endpoint where the resource will upload.
                             default is http://oss-$regionId.aliyuncs.com
  --oss-upload-path          the user oss path where the resource will upload
                             If you want to upload a local jar package to the OSS directory,
                             you need to specify this parameter
  --class CLASS_NAME         Your application's main class (for Java / Scala apps).
  --name NAME                A name of your application.
  --jars JARS                Comma-separated list of jars to include on the driver
                             and executor classpaths.
  --conf PROP=VALUE          Arbitrary Spark configuration property
  --help, -h                 Show this help message and exit.
  --driver-resource-spec     Indicates the resource specifications used by the driver:
                             small | medium | large
  --executor-resource-spec   Indicates the resource specifications used by the executor:
                             small | medium | large
  --num-executors            Number of executors to launch
  --properties-file          Default properties file location, only local files are supported
  --py-files PY_FILES        Comma-separated list of .zip, .egg, or .py files to place
                             on the PYTHONPATH for Python apps.
  --files FILES              Comma-separated list of files to be placed in the working
                             directory of each executor. File paths of these files in
                             executors can be accessed via SparkFiles.get(fileName).
                             Specially, you can pass in a custom log output format file
                             named `log4j.properties`
                             Note: The file name must be `log4j.properties` to take effect
  --status job_id            If given, requests the status and details of the job specified
  --verbose                  print more messages, enable spark-submit print job status
                             and more job details.

 List Spark Job Only:
  --list                     List Spark Job, should use specify --vcName and --regionId
  --pagenumber, -pn          Set page number which want to list (default: 1)
  --pagesize, -ps            Set page size which want to list (default: 1)

 Get Job Log Only:
  --get-log job_id           Get job log

 Kill Spark Job Only:
  --kill job_id,job_id       Comma-separated list of job to kill spark job with specific ids

 Spark Offline SQL options:
  -e <quoted-query-string>   SQL from command line
  -f <filename>              SQL from files
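The job-management options above are combined with the required credential and cluster options. As a hypothetical sketch (every value below is a placeholder for your own credentials, region, and virtual cluster), listing the ten most recently submitted jobs might look like this:
# List recent Spark jobs in a virtual cluster (all values are placeholders).
./bin/spark-sql \
  --keyId <your-access-key-id> \
  --secretId <your-access-key-secret> \
  --regionId cn-hangzhou \
  --vcName <your-virtual-cluster-name> \
  --list -pn 1 -ps 10
Similarly, --status <job_id>, --get-log <job_id>, and --kill <job_id> query the status of, fetch the log of, or terminate a specific job.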
- Configure common parameters in spark-defaults.conf. The spark-defaults.conf file currently supports the following parameters (for the Spark conf entries, only commonly used settings are listed; a filled-in sketch follows the template):
# cluster information
# AccessKeyId
#keyId =
# AccessKeySecret
#secretId =
# RegionId
#regionId =
# set vcName
#vcName =
# set OssUploadPath, if you need upload local resource
#ossUploadPath =

##spark conf
# driver specifications : small 1c4g | medium 2c8g | large 4c16g
#spark.driver.resourceSpec =
# executor instance number
#spark.executor.instances =
# executor specifications : small 1c4g | medium 2c8g | large 4c16g
#spark.executor.resourceSpec =
# when use ram, role arn
#spark.dla.roleArn =
# when use option -f or -e, set catalog implementation
#spark.sql.catalogImplementation =
# config dla oss connectors
#spark.dla.connectors = oss
# config eni, if you want to use eni
#spark.dla.eni.enable =
#spark.dla.eni.vswitch.id =
#spark.dla.eni.security.group.id =
# config log location, need an oss path to store logs
#spark.dla.job.log.oss.uri =
# config spark access dla metadata
#spark.sql.hive.metastore.version = dla
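For reference, a minimal filled-in version of this file might look like the sketch below; every value is a placeholder and the Spark settings are only examples, so adjust them to your own cluster:
# cluster information (placeholder values)
keyId = <your-access-key-id>
secretId = <your-access-key-secret>
regionId = cn-hangzhou
vcName = <your-virtual-cluster-name>
# spark conf (example values)
spark.driver.resourceSpec = medium
spark.executor.instances = 2
spark.executor.resourceSpec = medium
spark.dla.connectors = oss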
Note:
- The spark-submit script automatically reads the configuration in conf/spark-defaults.conf.
- Parameters specified on the command line take precedence over those in conf/spark-defaults.conf (see the example after this note).
- For the mapping between regions and regionId values, see the mapping table.
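Because command-line parameters take precedence, you can keep credentials and defaults in conf/spark-defaults.conf and override individual settings per submission. A hypothetical sketch (the query is only a placeholder):
# Override the executor count from spark-defaults.conf for this one job;
# credentials, regionId, and vcName are read from conf/spark-defaults.conf.
./bin/spark-sql \
  --verbose \
  --conf spark.executor.instances=5 \
  -e "select 1"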
- Submit an offline SQL job. The spark-sql tool provides -e for executing multiple SQL statements separated by semicolons, and -f for executing the statements in a SQL file (each statement ends with a semicolon; a sample SQL file is sketched after this step). You can put the configuration specified by conf fields into conf/spark-defaults.conf and then submit the job in the following format:
$ ./bin/spark-sql \
  --verbose \
  --name offlinesql \
  -e "select * from t1;insert into table t1 values(4,'test');select * from t1"

## Or you can put the SQL statements into a file, separate the statements with semicolons `;`, and point the -f option to the SQL file
$ ./bin/spark-sql \
  --verbose \
  --name offlinesql \
  -f /path/to/your/sql/file

## The output is as follows
++++++++++++++++++executing sql: select * from t1
| id|name|
| 1| zz|
| 2| xx|
| 3| yy|
| 4|test|
++++++++++++++++++ end ++++++++++++++++++
++++++++++++++++++executing sql: insert into table t1 values(4,'test')
||
++++++++++++++++++ end ++++++++++++++++++
++++++++++++++++++executing sql: select * from t1
| id|name|
| 1| zz|
| 2| xx|
| 3| yy|
| 4|test|
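To round out the -f form, the sketch below writes the same statements shown above into a local file and submits it; the file path is arbitrary and used only for illustration:
# Write the SQL statements to a local file (path is a placeholder).
cat > /tmp/offlinesql.sql <<'EOF'
select * from t1;
insert into table t1 values(4,'test');
select * from t1
EOF

# Submit the file with -f; the output matches the -e form shown above.
./bin/spark-sql \
  --verbose \
  --name offlinesql \
  -f /tmp/offlinesql.sql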