Offline scheduling

This document uses an ad click-through rate (CTR) prediction pipeline to demonstrate offline scheduling with data mining components in Platform for AI (PAI).

Background information

The pipeline works as follows:

  1. Train a model on historical data in Platform for AI (PAI).

  2. Schedule the model with DataWorks.

  3. Perform CTR prediction for ad placements at midnight every day to select and deliver ads that meet specified criteria.

The dataset for this pipeline is randomly generated. Therefore, this document does not evaluate the pipeline's results and instead focuses on how to build the pipeline and schedule it with DataWorks.

Step 1: Prepare the dataset

The pipeline uses a MaxCompute partitioned table for its dataset. The training data includes historical records from September 19 and 20, 2016, and the prediction target is the data for September 21, 2016.

| Field | Type | Description |
| --- | --- | --- |
| id | STRING | The unique identifier of the ad. |
| age | DOUBLE | The age of the target audience. |
| sex | DOUBLE | The gender of the target audience. 1 indicates male, and 0 indicates female. |
| duration | DOUBLE | The duration, in seconds, that the ad is displayed on the UI. |
| place | DOUBLE | The ad placement position. The value ranges from 0 to 4, representing the positions from top to bottom. |
| ctr | DOUBLE | The CTR label. If the number of ad clicks divided by the number of impressions is greater than 0.03, this field is set to 1. Otherwise, it is set to 0. |
| dt | STRING | The partition date in the YYYYMMDD format. |
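
The labeling rule for the ctr field can be sketched in a few lines of Python. This is only an illustration: the function name `ctr_label`, its parameters, and the handling of ads with zero impressions are assumptions, because the source describes only the 0.03 threshold.

```python
def ctr_label(clicks: int, impressions: int, threshold: float = 0.03) -> float:
    """Return 1.0 if clicks / impressions is greater than the threshold, else 0.0."""
    if impressions <= 0:
        # Assumption: an ad with no impressions is labeled 0.
        # The source does not describe this case.
        return 0.0
    return 1.0 if clicks / impressions > threshold else 0.0


print(ctr_label(5, 100))  # ratio 0.05 is above the threshold
print(ctr_label(2, 100))  # ratio 0.02 is below the threshold
```

The label is stored as a DOUBLE to match the column type of the ad table.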

Run the following statements in a MaxCompute client to create a partitioned table named ad and add the partitions for the training data. For more information, see Create tables.

-- Create the table; dt is the partition column.
create table if not exists ad (id STRING, age DOUBLE, sex DOUBLE, duration DOUBLE, place DOUBLE, ctr DOUBLE) partitioned by (dt STRING);
-- Add the partitions that hold the training data.
alter table ad add if not exists partition (dt='20160919') partition (dt='20160920');

The following table shows sample rows from the ad table. You can use the Tunnel command to import data into the partitioned table. For more information, see Import Data.

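| id | age | sex | duration | place | ctr | dt |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 49 | 1 | 9 | 0 | 0 | 20160919 |
| 1 | 17 | 1 | 3 | 1 | 1 | 20160919 |
| 2 | 44 | 0 | 4 | 0 | 0 | 20160919 |
| 3 | 14 | 1 | 9 | 1 | 0 | 20160919 |
| 4 | 44 | 1 | 5 | 4 | 0 | 20160919 |
| 5 | 10 | 1 | 9 | 3 | 1 | 20160919 |
| 6 | 42 | 1 | 7 | 3 | 0 | 20160919 |
| 7 | 51 | 1 | 3 | 1 | 1 | 20160919 |
| 8 | 18 | 0 | 3 | 3 | 0 | 20160919 |
| 9 | 39 | 0 | 8 | 4 | 1 | 20160919 |
| 10 | 45 | 1 | 3 | 2 | 0 | 20160919 |
| 11 | 57 | 0 | 8 | 2 | 0 | 20160919 |
| 12 | 14 | 0 | 7 | 2 | 1 | 20160919 |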
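
Because the dataset is randomly generated, a file for one partition could be produced with a short script like the following. This is a sketch under stated assumptions: the helper names `generate_rows` and `to_csv`, the value ranges for age and duration, and the CSV layout are all hypothetical. The dt column is left out of the file on the assumption that the partition is specified when the data is imported.

```python
import csv
import io
import random

# Non-partition columns of the ad table, in table order.
FIELDS = ["id", "age", "sex", "duration", "place", "ctr"]


def generate_rows(n, seed=0):
    """Generate n random rows that match the ad table schema."""
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    rows = []
    for i in range(n):
        rows.append({
            "id": str(i),
            "age": float(rng.randint(10, 60)),     # assumed age range
            "sex": float(rng.randint(0, 1)),       # 1 = male, 0 = female
            "duration": float(rng.randint(1, 9)),  # assumed range, in seconds
            "place": float(rng.randint(0, 4)),     # positions 0 to 4
            "ctr": float(rng.randint(0, 1)),       # random 0/1 label
        })
    return rows


def to_csv(rows):
    """Serialize rows as headerless CSV text, one line per row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    for row in rows:
        writer.writerow(row)
    return buf.getvalue()


print(to_csv(generate_rows(3)))
```

Writing the returned text to a file gives one input file per partition, for example one for 20160919 and one for 20160920.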

Step 2: Create a pipeline

  1. Create and open a custom pipeline. For more information, see Create a custom pipeline.

  2. Build the pipeline.

    1. In the component list on the left, under Source/Target, drag two Read Table components onto the canvas, and rename them to ad-1 and ad-2 respectively.

    2. From the component list on the left, navigate to the Data Preprocessing folder and drag two Normalization components to the canvas.

    3. From the component list on the left, navigate to Machine Learning > Binary Classification and drag the Binary Logistic Regression component to the canvas.

    4. From the component list on the left, navigate to the Machine Learning folder and drag the Predicted component to the canvas.

    5. In the component list on the left, drag the Write Table component under Source/Target onto the canvas, and rename it to ad_result-1.

    6. Connect the components to build a pipeline as shown in the following figure.

      [Figure: offline model pipeline]

      | No. | Description |
      | --- | --- |
      | 1 | Import data sources. |
      | 2 | Preprocess data. |
      | 3 | Train the model. |
      | 4 | Make predictions. |

  3. Configure the component parameters.

    1. Click the ad-2 (training data source) and ad-1 (prediction data source) components on the canvas. In the right-side pane, configure the data sources for the pipeline.

      | Tab | Parameter | Description |
      | --- | --- | --- |
      | Select Table | Table Name | Enter ad. |
      | Select Table | Partition | Select the Partition check box. |
      | Select Table | Parameter | Set this parameter to dt=@@{yyyyMMdd} to ensure that daily incremental data is used for prediction. |
      | Field Details | Source Table Fields | After you configure parameters on the Select Table tab, the system automatically populates the Source Table Fields. You do not need to configure this parameter. |

    2. Click the Normalization-1 and Normalization-2 components on the canvas. On the Field Settings tab in the right-side pane, click Select Fields and select the DOUBLE and INT fields.

    3. Click the Binary Logistic Regression component on the canvas. In the right-side pane, configure its parameters. Configure only the following parameters and leave the others at their default values.

      | Tab | Parameter | Description |
      | --- | --- | --- |
      | Field Settings | Feature Columns For Training | Select the age, sex, duration, and place columns. |
      | Field Settings | Target Column | Select the ctr column. |

    4. Click the Predicted component on the canvas. In the right-side pane, configure its parameters. Configure only the following parameters and leave the others at their default values.

      | Tab | Parameter | Description |
      | --- | --- | --- |
      | Field Settings | Feature Columns | Select the age, sex, duration, and place columns. |
      | Field Settings | Reserved Columns | Select the ctr column. |

    5. Click the ad_result-1 component on the canvas. On the Select Table tab in the right-side pane, set Output Table Name to ad_result.

  4. Click the Run icon in the upper-left corner of the canvas to run the pipeline.

  5. After the pipeline finishes running, right-click the ad_result-1 component and choose View Data > Output to view the prediction results.

    The output columns are:

    • prediction_result: indicates whether the ad is predicted to be clicked. A value of 1 means clicked, and 0 means not clicked.

    • prediction_score: indicates the probability that the ad is clicked.
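
To make the data flow of Step 2 concrete, the following pure-Python sketch mirrors the four stages: read rows, normalize the feature columns, train a binary logistic regression, and emit prediction_result and prediction_score. It is not the implementation of the PAI components. Min-max scaling, plain gradient descent, the 0.5 decision threshold, and all function names are assumptions made for illustration. The input rows are the first eight rows of the sample table in Step 1.

```python
import math

FEATURES = ["age", "sex", "duration", "place"]

# First eight sample rows from Step 1: (age, sex, duration, place, ctr).
SAMPLE = [
    (49, 1, 9, 0, 0), (17, 1, 3, 1, 1), (44, 0, 4, 0, 0), (14, 1, 9, 1, 0),
    (44, 1, 5, 4, 0), (10, 1, 9, 3, 1), (42, 1, 7, 3, 0), (51, 1, 3, 1, 1),
]
ROWS = [dict(zip(FEATURES + ["ctr"], map(float, r))) for r in SAMPLE]


def min_max_normalize(rows, columns):
    """Return copies of rows with each listed column scaled to [0, 1]."""
    out = [dict(r) for r in rows]
    for col in columns:
        lo = min(r[col] for r in rows)
        hi = max(r[col] for r in rows)
        span = hi - lo
        for r in out:
            r[col] = (r[col] - lo) / span if span else 0.0
    return out


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(row, weights, bias):
    """Modeled probability that the label is 1 for this row."""
    return sigmoid(sum(w * row[f] for w, f in zip(weights, FEATURES)) + bias)


def train(rows, epochs=500, lr=0.5):
    """Fit weights and bias with batch gradient descent on the log loss."""
    weights, bias = [0.0] * len(FEATURES), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(FEATURES), 0.0
        for r in rows:
            err = score(r, weights, bias) - r["ctr"]
            for j, f in enumerate(FEATURES):
                grad_w[j] += err * r[f]
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predict(rows, weights, bias):
    """Keep the reserved ctr column and add the two prediction columns."""
    results = []
    for r in rows:
        p = score(r, weights, bias)
        results.append({
            "ctr": r["ctr"],
            "prediction_result": 1 if p > 0.5 else 0,  # assumed threshold
            "prediction_score": p,
        })
    return results


normalized = min_max_normalize(ROWS, FEATURES)
weights, bias = train(normalized)
# For brevity, the sketch predicts on the training rows; the pipeline
# predicts on the separate ad-1 data source.
for row in predict(normalized, weights, bias):
    print(row)
```

The sketch keeps ctr as a reserved column in the output so that each prediction can be compared with its label, which matches the Reserved Columns setting described above.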

Step 3: Perform offline scheduling

  1. Use DataWorks to create, configure, and commit a PAI task. For more information, see Create and run a PAI Studio node.

    When you configure the scheduling task, set Time to run the prediction pipeline at midnight (00:00) every day. For more information, see Configure time properties.

  2. On the task submission page, click O&M in the upper-right corner to view task logs. For more information, see Manage auto-triggered tasks.
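
The interaction between the daily schedule and the dt=@@{yyyyMMdd} partition parameter can be illustrated as follows. This is a hedged sketch, not DataWorks behavior: the helpers `resolve_partition` and `next_midnight` are hypothetical, and the source does not state which date (run date or data date) the scheduler substitutes for the placeholder, so the date is passed in explicitly.

```python
from datetime import date, datetime, time, timedelta

PLACEHOLDER = "@@{yyyyMMdd}"


def resolve_partition(template: str, day: date) -> str:
    """Replace the date placeholder with the given day in yyyyMMdd form."""
    return template.replace(PLACEHOLDER, day.strftime("%Y%m%d"))


def next_midnight(now: datetime) -> datetime:
    """Return the next 00:00 after the given time."""
    return datetime.combine(now.date() + timedelta(days=1), time.min)


run_at = next_midnight(datetime(2016, 9, 20, 15, 30))
print(run_at)                                              # 2016-09-21 00:00:00
print(resolve_partition("dt=@@{yyyyMMdd}", run_at.date()))  # dt=20160921
```

With a parameterized partition, each scheduled run reads a different partition of the ad table without any change to the pipeline itself.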

Related documents