EasyRec supports two Hive file storage formats: CSV and Parquet. This topic walks through an example of training, evaluating, and running predictions with an EasyRec model on Hive data in a Data Science cluster.

Prerequisites

  • A Hadoop cluster has been created. For details, see Create a cluster.
  • A DataScience cluster has been created with the EasyRec and TensorFlow services selected. For details, see Create a cluster.
  • The dsdemo code has been downloaded: if you have created a DataScience cluster, search for group number 32497587 in DingTalk and join the group to obtain the dsdemo code.

Procedure

  1. Open the security group.
    Add all public IP addresses of the DataScience cluster to the security group of the Hadoop cluster, opening ports 10000 and 9000. For details, see Add security group rules.
  2. Modify the files in the ml_on_ds directory.
    1. Upload the downloaded dsdemo*.zip package to the header node of the DataScience cluster.
    2. Connect to the DataScience cluster over SSH. For details, see Log on to a cluster.
    3. Decompress dsdemo*.zip.
    4. Modify the easyrec_model.config file in the ml_on_ds directory.
      Use ./ml_on_ds/testdata/dssm_hiveinput/dssm_hive_csv_input.config or dssm_hive_parquet_input.config as easyrec_model.config. For example, for CSV input:
      yes | cp ./testdata/dssm_hiveinput/dssm_hive_csv_input.config ./easyrec_model.config

      The following snippet from easyrec_model.config shows the fields to modify. Set host and username according to your cluster information:

      hive_train_input {
          host: "*.*.*.*"
          username: "admin"
          port: 10000
      }
      hive_eval_input {
          host: "*.*.*.*"
          username: "admin"
          port: 10000
      }
      train_config {
          ...
      }
      eval_config {
          ...
      }
      data_config {
          input_type: HiveInput
          eval_batch_size: 1024
          ...
      }
      Parameter    Description
      host         The public IP address of the header node of the Hadoop cluster.
      username     The LDAP username of the Hadoop cluster. Default value: admin.
      input_type   The input type. EasyRec supports two Hive input types: HiveInput and HiveParquetInput.
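      To train on the Parquet tables instead, start from dssm_hive_parquet_input.config. The relevant difference is the input type in data_config; a minimal illustrative fragment (other fields as in the CSV variant):

```
data_config {
    input_type: HiveParquetInput
    eval_batch_size: 1024
    ...
}
```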
  3. Build the image.
    Modify the config file in the ml_on_ds directory: set DATABASE, TRAIN_TABLE_NAME, EVAL_TABLE_NAME, PREDICT_TABLE_NAME, PREDICT_OUTPUT_TABLE_NAME, and PARTITION_NAME. Also set the address of your container registry, and make sure the registry is private and of the Enterprise Edition. After the configuration is complete, run the make build push command to build and push the image.
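    A sketch of these settings, assuming the demo database and tables created in the next step. The variable names are the ones listed above, but the values are hypothetical placeholders (the predict-output table name in particular is invented for illustration) — check the config file shipped in your dsdemo copy for the exact keys:

```shell
# Hypothetical placeholder values -- replace with your own names.
DATABASE=pai_online_projects
TRAIN_TABLE_NAME=easyrec_demo_taobao_train_data
EVAL_TABLE_NAME=easyrec_demo_taobao_test_data
PREDICT_TABLE_NAME=easyrec_demo_taobao_test_data
PREDICT_OUTPUT_TABLE_NAME=easyrec_demo_taobao_predict_output   # hypothetical name
PARTITION_NAME=yyyymmdd   # replace with the actual partition value
```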
  4. Prepare the data.
    1. Run the following command in the ml_on_ds directory to start Hive:
      spark-sql
    2. Run the following statements to prepare the test data:
      CREATE DATABASE pai_online_projects LOCATION 'hdfs://192.168.*.*:9000/pai_online_projects.db';
      
      --------------------- CSV data ---------------------
      
      CREATE TABLE pai_online_projects.easyrec_demo_taobao_train_data(`clk` BIGINT, `buy` BIGINT, `pid` STRING, `adgroup_id` STRING, `cate_id` STRING, `campaign_id` STRING, `customer` STRING, `brand` STRING, `user_id` STRING, `cms_segid` STRING, `cms_group_id` STRING, `final_gender_code` STRING, `age_level` STRING, `pvalue_level` STRING, `shopping_level` STRING, `occupation` STRING, `new_user_class_level` STRING, `tag_category_list` STRING, `tag_brand_list` STRING, `price` BIGINT)
      PARTITIONED BY (dt STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION 'hdfs://192.168.*.*:9000/pai_online_projects.db/easyrec_demo_taobao_train_data';
      
      
      CREATE TABLE pai_online_projects.easyrec_demo_taobao_test_data(`clk` BIGINT, `buy` BIGINT, `pid` STRING, `adgroup_id` STRING, `cate_id` STRING, `campaign_id` STRING, `customer` STRING, `brand` STRING, `user_id` STRING, `cms_segid` STRING, `cms_group_id` STRING, `final_gender_code` STRING, `age_level` STRING, `pvalue_level` STRING, `shopping_level` STRING, `occupation` STRING, `new_user_class_level` STRING, `tag_category_list` STRING, `tag_brand_list` STRING, `price` BIGINT)
      PARTITIONED BY (dt STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION 'hdfs://192.168.*.*:9000/pai_online_projects.db/easyrec_demo_taobao_test_data';
      
      
      LOAD DATA LOCAL INPATH './testdata/taobao/19700101/train/easyrec_demo_taobao_train_data_10000.csv' INTO TABLE pai_online_projects.easyrec_demo_taobao_train_data PARTITION(dt = 'yyyymmdd');
      
      LOAD DATA LOCAL INPATH './testdata/taobao/19700101/test/easyrec_demo_taobao_test_data_1000.csv' INTO TABLE pai_online_projects.easyrec_demo_taobao_test_data PARTITION(dt = 'yyyymmdd');
      
      --------------------- Parquet data ---------------------
      
      CREATE TABLE pai_online_projects.easyrec_demo_taobao_train_data_parquet(`clk` BIGINT, `buy` BIGINT, `pid` STRING, `adgroup_id` STRING, `cate_id` STRING, `campaign_id` STRING, `customer` STRING, `brand` STRING, `user_id` STRING, `cms_segid` STRING, `cms_group_id` STRING, `final_gender_code` STRING, `age_level` STRING, `pvalue_level` STRING, `shopping_level` STRING, `occupation` STRING, `new_user_class_level` STRING, `tag_category_list` STRING, `tag_brand_list` STRING, `price` BIGINT)
      PARTITIONED BY (dt STRING)
      STORED AS PARQUET
      LOCATION 'hdfs://192.168.*.*:9000/pai_online_projects.db/easyrec_demo_taobao_train_data_parquet';
      
      
      CREATE TABLE pai_online_projects.easyrec_demo_taobao_test_data_parquet(`clk` BIGINT, `buy` BIGINT, `pid` STRING, `adgroup_id` STRING, `cate_id` STRING, `campaign_id` STRING, `customer` STRING, `brand` STRING, `user_id` STRING, `cms_segid` STRING, `cms_group_id` STRING, `final_gender_code` STRING, `age_level` STRING, `pvalue_level` STRING, `shopping_level` STRING, `occupation` STRING, `new_user_class_level` STRING, `tag_category_list` STRING, `tag_brand_list` STRING, `price` BIGINT)
      PARTITIONED BY (dt STRING)
      STORED AS PARQUET
      LOCATION 'hdfs://192.168.*.*:9000/pai_online_projects.db/easyrec_demo_taobao_test_data_parquet';
      
      
      INSERT INTO pai_online_projects.easyrec_demo_taobao_train_data_parquet PARTITION(dt = 'yyyymmdd')
      SELECT clk, buy, pid, adgroup_id, cate_id, campaign_id, customer, brand, user_id, cms_segid, cms_group_id, final_gender_code, age_level, pvalue_level, shopping_level, occupation, new_user_class_level, tag_category_list, tag_brand_list, price FROM pai_online_projects.easyrec_demo_taobao_train_data WHERE dt = 'yyyymmdd';
      
      
      INSERT INTO pai_online_projects.easyrec_demo_taobao_test_data_parquet PARTITION(dt = 'yyyymmdd')
      SELECT clk, buy, pid, adgroup_id, cate_id, campaign_id, customer, brand, user_id, cms_segid, cms_group_id, final_gender_code, age_level, pvalue_level, shopping_level, occupation, new_user_class_level, tag_category_list, tag_brand_list, price FROM pai_online_projects.easyrec_demo_taobao_test_data WHERE dt = 'yyyymmdd';
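      Before launching training, it can help to verify that the partitions and rows were actually loaded. A quick check, run in the same spark-sql session (table and partition names as created above):

```
-- Confirm the demo partition exists and contains rows.
SHOW PARTITIONS pai_online_projects.easyrec_demo_taobao_train_data;
SELECT COUNT(*) FROM pai_online_projects.easyrec_demo_taobao_train_data WHERE dt = 'yyyymmdd';
SELECT COUNT(*) FROM pai_online_projects.easyrec_demo_taobao_test_data_parquet WHERE dt = 'yyyymmdd';
```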
                                      
  5. Run the following command to train the model:
     kubectl apply -f tfjob_easyrec_training_hive.yaml
  6. Run the following command to evaluate the model:
     kubectl apply -f tfjob_easyrec_eval_hive.yaml
  7. Run the following command to run predictions with the model:
     kubectl apply -f tfjob_easyrec_predict_hive.yaml
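The jobs above run asynchronously in the Kubernetes cluster. To follow their progress, the usual kubectl commands apply (a sketch; the actual job and pod names come from the metadata in the tfjob_*.yaml files):

```
# List the submitted TFJob resources and their states.
kubectl get tfjobs

# Inspect one job, then follow a worker pod's training log.
kubectl describe tfjob <job-name>
kubectl logs -f <pod-name>
```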