EasyRec supports two Hive file storage formats: CSV and Parquet. This topic uses examples to describe how to train, evaluate, and run predictions with an EasyRec model based on Hive in a Data Science cluster.
Prerequisites
Procedure
- Open the security group. Add all public IP addresses of the DataScience cluster to the security group of the Hadoop cluster, for ports 10000 and 9000. For more information, see Add security group rules.
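After the security group is updated, reachability of the two ports can be verified from the DataScience cluster. The following is a minimal sketch; `<hadoop-header-ip>` is a placeholder for the public IP address of the Hadoop cluster's header node:

```
# Check that HiveServer2 (port 10000) and the HDFS NameNode (port 9000)
# are reachable from the DataScience cluster.
nc -zv <hadoop-header-ip> 10000
nc -zv <hadoop-header-ip> 9000
```

If either check fails, revisit the security group rules before continuing.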
- Modify the files in the ml_on_ds directory.
- Upload the obtained dsdemo*.zip package to the header node of the DataScience cluster.
- Connect to the DataScience cluster over SSH. For more information, see Log on to a cluster.
- Decompress dsdemo*.zip.
- Modify the easyrec_model.config file in the ml_on_ds directory. Use ./ml_on_ds/testdata/dssm_hiveinput/dssm_hive_csv_input.config or dssm_hive_parquet_input.config as easyrec_model.config.
yes | cp ./testdata/dssm_hiveinput/dssm_hive_csv_input.config ./easyrec_model.config
The following snippet shows the relevant part of easyrec_model.config. Modify host and username based on your cluster information.
hive_train_input {
  host: "*.*.*.*"
  username: "admin"
  port: 10000
}
hive_eval_input {
  host: "*.*.*.*"
  username: "admin"
  port: 10000
}
train_config {
  ...
}
eval_config {
  ...
}
data_config {
  input_type: HiveInput
  eval_batch_size: 1024
  ...
}
| Parameter | Description |
| --- | --- |
| host | Public IP address of the header node of the Hadoop cluster. |
| username | LDAP username of the Hadoop cluster. Default value: admin. |
| input_type | Input type. EasyRec supports two Hive input types: HiveInput and HiveParquetInput. |
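If you train on the Parquet tables instead, the corresponding config file (dssm_hive_parquet_input.config) uses the Parquet input type. A sketch of the relevant data_config section, based on the CSV variant above:

```
data_config {
  input_type: HiveParquetInput
  eval_batch_size: 1024
  ...
}
```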
- Build the image. Modify the config file in the ml_on_ds directory and set DATABASE, TRAIN_TABLE_NAME, EVAL_TABLE_NAME, PREDICT_TABLE_NAME, PREDICT_OUTPUT_TABLE_NAME, and PARTITION_NAME. Configure the address of the container registry; make sure the registry is private and is an enterprise-edition instance. After the configuration is complete, run the make build push command to build and push the image.
- Prepare data.
- Run the following command in the ml_on_ds directory to start the Spark SQL CLI, which is used to operate on the Hive tables:
spark-sql
- Execute the following statements to prepare the test data:
create database pai_online_projects location 'hdfs://192.168.*.*:9000/pai_online_projects.db';

-- CSV data
CREATE TABLE pai_online_projects.easyrec_demo_taobao_train_data(
  `clk` BIGINT, `buy` BIGINT, `pid` STRING, `adgroup_id` STRING, `cate_id` STRING,
  `campaign_id` STRING, `customer` STRING, `brand` STRING, `user_id` STRING,
  `cms_segid` STRING, `cms_group_id` STRING, `final_gender_code` STRING,
  `age_level` STRING, `pvalue_level` STRING, `shopping_level` STRING,
  `occupation` STRING, `new_user_class_level` STRING, `tag_category_list` STRING,
  `tag_brand_list` STRING, `price` BIGINT)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'hdfs://192.168.*.*:9000/pai_online_projects.db/easyrec_demo_taobao_train_data';

CREATE TABLE pai_online_projects.easyrec_demo_taobao_test_data(
  `clk` BIGINT, `buy` BIGINT, `pid` STRING, `adgroup_id` STRING, `cate_id` STRING,
  `campaign_id` STRING, `customer` STRING, `brand` STRING, `user_id` STRING,
  `cms_segid` STRING, `cms_group_id` STRING, `final_gender_code` STRING,
  `age_level` STRING, `pvalue_level` STRING, `shopping_level` STRING,
  `occupation` STRING, `new_user_class_level` STRING, `tag_category_list` STRING,
  `tag_brand_list` STRING, `price` BIGINT)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'hdfs://192.168.*.*:9000/pai_online_projects.db/easyrec_demo_taobao_test_data';

load data local inpath './testdata/taobao/19700101/train/easyrec_demo_taobao_train_data_10000.csv'
into table pai_online_projects.easyrec_demo_taobao_train_data partition(dt = 'yyyymmdd');

load data local inpath './testdata/taobao/19700101/test/easyrec_demo_taobao_test_data_1000.csv'
into table pai_online_projects.easyrec_demo_taobao_test_data partition(dt = 'yyyymmdd');

-- Parquet data
CREATE TABLE pai_online_projects.easyrec_demo_taobao_train_data_parquet(
  `clk` BIGINT, `buy` BIGINT, `pid` STRING, `adgroup_id` STRING, `cate_id` STRING,
  `campaign_id` STRING, `customer` STRING, `brand` STRING, `user_id` STRING,
  `cms_segid` STRING, `cms_group_id` STRING, `final_gender_code` STRING,
  `age_level` STRING, `pvalue_level` STRING, `shopping_level` STRING,
  `occupation` STRING, `new_user_class_level` STRING, `tag_category_list` STRING,
  `tag_brand_list` STRING, `price` BIGINT)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 'hdfs://192.168.*.*:9000/pai_online_projects.db/easyrec_demo_taobao_train_data_parquet';

CREATE TABLE pai_online_projects.easyrec_demo_taobao_test_data_parquet(
  `clk` BIGINT, `buy` BIGINT, `pid` STRING, `adgroup_id` STRING, `cate_id` STRING,
  `campaign_id` STRING, `customer` STRING, `brand` STRING, `user_id` STRING,
  `cms_segid` STRING, `cms_group_id` STRING, `final_gender_code` STRING,
  `age_level` STRING, `pvalue_level` STRING, `shopping_level` STRING,
  `occupation` STRING, `new_user_class_level` STRING, `tag_category_list` STRING,
  `tag_brand_list` STRING, `price` BIGINT)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 'hdfs://192.168.*.*:9000/pai_online_projects.db/easyrec_demo_taobao_test_data_parquet';

insert into pai_online_projects.easyrec_demo_taobao_train_data_parquet partition(dt='yyyymmdd')
select clk, buy, pid, adgroup_id, cate_id, campaign_id, customer, brand, user_id,
       cms_segid, cms_group_id, final_gender_code, age_level, pvalue_level,
       shopping_level, occupation, new_user_class_level, tag_category_list,
       tag_brand_list, price
from pai_online_projects.easyrec_demo_taobao_train_data
where dt = 'yyyymmdd';

insert into pai_online_projects.easyrec_demo_taobao_test_data_parquet partition(dt='yyyymmdd')
select clk, buy, pid, adgroup_id, cate_id, campaign_id, customer, brand, user_id,
       cms_segid, cms_group_id, final_gender_code, age_level, pvalue_level,
       shopping_level, occupation, new_user_class_level, tag_category_list,
       tag_brand_list, price
from pai_online_projects.easyrec_demo_taobao_test_data
where dt = 'yyyymmdd';
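After the data is loaded, a quick row-count query in the Spark SQL CLI can confirm that the partitions were populated. A sketch:

```
-- Verify that each partition of the CSV and Parquet tables contains data.
select dt, count(*) from pai_online_projects.easyrec_demo_taobao_train_data group by dt;
select dt, count(*) from pai_online_projects.easyrec_demo_taobao_train_data_parquet group by dt;
```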
- Run the following command to train the model:
kubectl apply -f tfjob_easyrec_training_hive.yaml
- Run the following command to evaluate the model:
kubectl apply -f tfjob_easyrec_eval_hive.yaml
- Run the following command to run predictions with the model:
kubectl apply -f tfjob_easyrec_predict_hive.yaml
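Each of the commands above submits a TFJob to the Kubernetes cluster. Its progress can be followed with standard kubectl commands; the following is a generic sketch, since the actual job and pod names depend on the YAML files:

```
# List the submitted TFJobs and inspect the training pods.
kubectl get tfjobs
kubectl get pods
kubectl logs -f <pod-name>   # <pod-name> is a placeholder for an actual pod
```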