Import data in batches

Advantages

Compared to importing data through an API, the bulkload feature provides the following advantages:

Faster imports: Imports data more than 10 times faster.
Stable online services: Does not consume online service resources, so running services are not affected.
Flexible resource separation: Separates online and offline resources for flexible resource usage.
Multiple data sources: Supports importing data from CSV, ORC, Parquet, and MaxCompute sources.
No code required: Loads data in batches in bypass mode without writing any code.
Cost-effective: LTS uses serverless Spark to provide elastic computing resources for bulkload operations. Resources are scaled as needed and billed on a pay-as-you-go basis, so you do not need to provision long-running computing resources.

Prerequisites

Before you import data in batches, make sure that the following requirements are met:

LTS is activated and you are logged on to the LTS console. For more information, see Activate and log on to LTS.
The Lindorm compute engine is activated. For more information, see Activation and upgrade/downgrade.
A Spark data source is added. For more information, see Add a Spark data source.

Supported data sources

The following table lists the supported source and destination data sources.

Source data source	Destination data source
MaxCompute Table	LindormTable
HDFS CSV or OSS CSV	LindormTable
HDFS Parquet or OSS Parquet	LindormTable
HDFS ORC or OSS ORC	LindormTable

Submission methods

You can submit a bulkload job in one of the following ways:

Submit a job using the LTS console
Submit a job using an API operation

Submit a job using the LTS console

Step 1: Add data sources

Log on to the LTS console. For more information, see Activate and log on to LTS.
In the navigation pane on the left, choose Data Source Management > Add Data Source to add the following data sources:
- Add a MaxCompute (ODPS) data source. For more information, see MaxCompute data source.
- Add a Lindorm wide table data source. For more information, see Lindorm wide table data source.
- Add an HDFS data source. For more information, see Add an HDFS data source.

Step 2: Create and configure a bulkload job

In the navigation pane on the left, choose Import To Lindorm/HBase > Universal Import.
- For LTS versions earlier than 3.8.12.4.3, choose Import To Lindorm/HBase > Bulkload.
- To view the LTS version, go to the Configuration Information section on the Instance Details page in the Lindorm console.
Click Create Job and configure the following parameters.

Data source settings

Configuration item	Parameter	Description
Select Data Source	Source Data Source	Select the ODPS or HDFS data source that you added.
	Destination Data Source	Select the Lindorm wide table data source that you added.

Reader configuration

The Plugin Configuration > Reader Configuration parameters depend on the source data source type.

MaxCompute (ODPS) source

Parameter	Description
`table`	The name of the MaxCompute (ODPS) table.
`column`	The names of the MaxCompute (ODPS) columns to import.
`partition`	The partition information. Leave this empty for a non-partitioned table.
`numPartitions`	The degree of parallelism for reading data.

CSV source (HDFS or OSS)

Parameter	Description
`filePath`	The directory where the CSV file is located.
`header`	Specifies whether the CSV file contains a header row.
`delimiter`	The delimiter used in the CSV file.
`column`	The column names and their corresponding types in the CSV file.

Parquet source (HDFS or OSS)

Parameter	Description
`filePath`	The directory where the Parquet file is located.
`column`	The column names in the Parquet file.

For configuration examples, see Reader plugin configuration examples.

Writer configuration

The Plugin Configuration > Writer Configuration parameters apply to all source types.

Parameter	Description
`namespace`	The namespace of the Lindorm wide table.
`lindormTable`	The name of the Lindorm wide table.
`compression`	The compression algorithm. Currently, only `zstd` is supported. To disable compression, set this to `none`.
`columns`	The column mapping. The configuration depends on the destination table type. See Column mapping.
`timestamp`	The timestamp of the data in the Lindorm wide table. Accepted values: a 13-digit Long, or a String in the `yyyy-MM-dd HH:mm:ss` or `yyyy-MM-dd HH:mm:ss SSS` format.
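
The two accepted `timestamp` forms can be converted into one another. The following is a minimal Python sketch, assuming that the 13-digit Long is a Unix timestamp in milliseconds and that the string is interpreted in UTC. The source does not state the time zone, so verify this against your instance. The function name `to_epoch_millis` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Layouts that mirror the two documented string formats:
# yyyy-MM-dd HH:mm:ss SSS and yyyy-MM-dd HH:mm:ss.
# %f reads the millisecond digits as a fraction of a second,
# so exactly three digits are expected for the SSS part.
_LAYOUTS = ("%Y-%m-%d %H:%M:%S %f", "%Y-%m-%d %H:%M:%S")
_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(text: str) -> int:
    """Convert a timestamp string to epoch milliseconds (UTC is an assumption)."""
    for layout in _LAYOUTS:
        try:
            parsed = datetime.strptime(text, layout)
        except ValueError:
            continue
        # Integer division of timedeltas avoids floating-point rounding.
        delta = parsed.replace(tzinfo=timezone.utc) - _EPOCH
        return delta // timedelta(milliseconds=1)
    raise ValueError(f"unsupported timestamp format: {text!r}")


print(to_epoch_millis("2022-07-01 10:00:00"))
print(to_epoch_millis("2022-07-01 10:00:00 123"))
```

Under the UTC assumption, both calls return 13-digit values, which matches the length that the Long form requires.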

Column mapping

Lindorm wide tables: Specify the column names of the Lindorm SQL wide table. The columns must correspond to the columns in the reader configuration.
HBase-compatible Lindorm tables: Specify the standard column names of the HBase table. The columns must correspond to the columns in the reader configuration.

For configuration examples, see Writer plugin configuration examples.

Job running parameters

Configuration item	Parameter	Description
Job Running Parameter Configuration	Spark Driver Specification	Select the Spark driver specification.
	Spark Executor Specification	Select the Spark executor specification.
	Number Of Executors	Enter the number of executors.
	Spark Configuration	Enter the Spark configuration. This parameter is optional.

Step 3: Submit and monitor the job

Click Create.
On the Bulkload page, click the Job Name to view the job details.
- Click the Job Name to view the Spark UI of the job.
- Click Details to view the execution logs of the job.

If data is evenly distributed across partitions in the destination Lindorm wide table, it takes about 1 hour to import 100 GB of data with a 4:1 compression ratio. The actual time may vary.

Configuration examples

Reader plugin configuration examples

MaxCompute (ODPS) data source

{
  "table": "test",
  "column": [
    "id",
    "intcol",
    "doublecol",
    "stringcol",
    "string1col",
    "decimalcol"
  ],
  "partition": [
    "pt=1"
  ],
  "numPartitions": 10
}

CSV file in an HDFS data source

{
  "filePath": "csv/",
  "header": false,
  "delimiter": ",",
  "column": [
    "id|string",
    "intcol|int",
    "doublecol|double",
    "stringcol|string",
    "string1col|string",
    "decimalcol|decimal"
  ]
}
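
The CSV reader configuration above can also be generated instead of written by hand. The following is a minimal Python sketch that builds the `name|type` column entries from pairs and serializes the result. The helper name `csv_reader_config` and its arguments are hypothetical, and only the four documented keys are emitted.

```python
import json


def csv_reader_config(file_path, columns, header=False, delimiter=","):
    """Build a CSV reader configuration from (name, type) pairs."""
    return {
        "filePath": file_path,
        "header": header,
        "delimiter": delimiter,
        # Each entry uses the name|type form shown in the example above.
        "column": [f"{name}|{col_type}" for name, col_type in columns],
    }


config = csv_reader_config(
    "csv/",
    [("id", "string"), ("intcol", "int"), ("doublecol", "double")],
)
print(json.dumps(config, indent=2))
```

The printed JSON can be pasted into Reader Configuration or passed as `readerConfig` when a job is submitted through the API.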

Parquet file in an HDFS data source

{
  "filePath": "parquet/",
  "column": [
    "id",
    "intcol",
    "doublecol",
    "stringcol",
    "string1col",
    "decimalcol"
  ]
}

Writer plugin configuration examples

Lindorm SQL table

{
  "namespace": "default",
  "lindormTable": "xxx",
  "compression": "zstd",
  "timestamp": "2022-07-01 10:00:00",
  "columns": [
    "id",
    "intcol",
    "doublecol",
    "stringcol",
    "string1col",
    "decimalcol"
  ]
}

Lindorm table compatible with HBase

In this format, `ROW` represents the row key and the value after `||` indicates the data type. For non-row-key columns, use the `column family:column name||column type` format.

{
  "namespace": "default",
  "lindormTable": "xxx",
  "compression": "zstd",
  "timestamp": "2022-07-01 10:00:00",
  "columns": [
    "ROW||String",
    "f:intcol||Int",
    "f:doublecol||Double",
    "f:stringcol||String",
    "f:string1col||String",
    "f:decimalcol||Decimal"
  ]
}
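
The naming rule for HBase-compatible tables can be applied mechanically. The following is a minimal Python sketch, assuming a single column family named `f` as in the example. `hbase_columns` is a hypothetical helper and not part of LTS.

```python
def hbase_columns(row_key_type, fields, family="f"):
    """Return writer `columns` entries for an HBase-compatible table.

    row_key_type: type of the row key, for example "String".
    fields: (column name, column type) pairs for the non-row-key columns.
    """
    # The row key always uses the reserved name ROW.
    entries = [f"ROW||{row_key_type}"]
    # Other columns use the family:name||type form.
    entries += [f"{family}:{name}||{col_type}" for name, col_type in fields]
    return entries


columns = hbase_columns("String", [("intcol", "Int"), ("doublecol", "Double")])
print(columns)
```

The order of the entries must match the order of the reader `column` list, because the writer columns correspond to the reader columns.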

Submit a job using an API operation

All API operations use the LTS master endpoint in the following format: http://{LTSMaster}:12311. Replace {LTSMaster} with the master hostname of your Lindorm instance, which you can find in the Basic Information section on the Cluster Information page of the LTS console.

Submit a job

Method: POST
Endpoint: http://{LTSMaster}:12311/pro/proc/bulkload/create

Parameters

Parameter	Description
`src`	The name of the source data source.
`dst`	The name of the destination data source.
`readerConfig`	The reader plugin configuration in JSON format. For configuration examples, see Reader plugin configuration examples.
`writerConfig`	The writer plugin configuration in JSON format. For configuration examples, see Writer plugin configuration examples.
`driverSpec`	The specification of the Spark driver. Valid values: `small`, `medium`, `large`, and `xlarge`. We recommend that you set this parameter to `large`.
`instances`	The number of executors.
`fileType`	The source file type. If the source data source is HDFS, set this parameter to `CSV` or `Parquet`.
`sparkAdditionalParams`	The extension parameters. This parameter is optional.

Example

curl -X POST "http://{LTSMaster}:12311/pro/proc/bulkload/create" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "src=hdfs&dst=ld" \
  -d 'readerConfig={"filePath":"parquet/","column":["id","intcol","doublecol","stringcol","string1col","decimalcol"]}' \
  -d 'writerConfig={"columns":["ROW||String","f:intcol||Int","f:doublecol||Double","f:stringcol||String","f:string1col||String","f:decimalcol||Decimal"],"namespace":"default","lindormTable":"bulkload_test","compression":"zstd"}' \
  -d "driverSpec=large&instances=5&fileType=Parquet"

Response

The message field contains the job ID.

{"success":"true","message":"proc-91-ff383c616e5242888b398e51359c****"}
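
The same request can be assembled in a script. The following is a minimal Python sketch that only builds the URL and the form body and does not send anything. It assumes that the service accepts standard percent-encoded form values, which the source does not confirm. The helper name `build_create_request` and the host `ltsmaster.example` are hypothetical.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_create_request(master, src, dst, reader, writer,
                         driver_spec="large", instances=5, file_type=None):
    """Return (url, body) for the bulkload create operation."""
    params = {
        "src": src,
        "dst": dst,
        # The plugin configurations are passed as JSON strings.
        "readerConfig": json.dumps(reader, separators=(",", ":")),
        "writerConfig": json.dumps(writer, separators=(",", ":")),
        "driverSpec": driver_spec,
        "instances": str(instances),
    }
    if file_type is not None:
        params["fileType"] = file_type
    url = f"http://{master}:12311/pro/proc/bulkload/create"
    # urlencode percent-encodes characters such as {, ", and |.
    return url, urlencode(params)


url, body = build_create_request(
    "ltsmaster.example", "hdfs", "ld",
    reader={"filePath": "parquet/", "column": ["id", "intcol"]},
    writer={"columns": ["ROW||String", "f:intcol||Int"],
            "namespace": "default", "lindormTable": "bulkload_test",
            "compression": "zstd"},
    file_type="Parquet",
)
print(url)
# Decode the body again to check that the JSON survives the round trip.
decoded = parse_qs(body)
print(decoded["writerConfig"][0])
```

The body can be sent with any HTTP client as a POST request with the `Content-Type: application/x-www-form-urlencoded` header, as in the curl example.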

Get job information

Method: GET
Endpoint: http://{LTSMaster}:12311/pro/proc/{procId}/info
Parameter: `procId`, the job ID.

Example

curl http://{LTSMaster}:12311/pro/proc/proc-91-ff383c616e5242888b398e51359c****/info

Response

{
    "data":{
        "checkJobs":Array,
        "procId":"proc-91-ff383c616e5242888b398e51359c****",
        "incrJobs":Array,
        "procConfig":Object,
        "stage":"WAIT_FOR_SUCCESS",
        "fullJobs":Array,
        "mergeJobs":Array,
        "srcDS":"hdfs",
        "sinkDS":"ld-uf6el41jkba96****",
        "state":"RUNNING",
        "schemaJob":Object,
        "procType":"SPARK_BULKLOAD"
    },
    "success":"true"
}

In the sample, `Array` and `Object` stand for nested values that are omitted here. The response includes the following key fields:

Field	Description
`procId`	The job ID.
`srcDS`	The source data source.
`sinkDS`	The destination data source.
`state`	The job status, such as `RUNNING`.
`stage`	The job stage, such as `WAIT_FOR_SUCCESS`.
`procType`	The job type, such as `SPARK_BULKLOAD`.
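
The key fields can be pulled out of a response with a few lines of code. The following is a minimal Python sketch that parses a response body and returns only the fields listed in the table. The function name `summarize_job` is hypothetical, and the sample text is modeled on the response shown earlier, with the omitted nested values replaced by empty ones so that it is valid JSON.

```python
import json


def summarize_job(response_text):
    """Extract the documented key fields from a job-information response."""
    payload = json.loads(response_text)
    # The sample reports success as the string "true"; a JSON boolean
    # is accepted as well, in case the service returns one.
    if str(payload.get("success")).lower() != "true":
        raise RuntimeError(f"request failed: {payload}")
    data = payload["data"]
    keys = ("procId", "srcDS", "sinkDS", "state", "stage", "procType")
    return {key: data.get(key) for key in keys}


sample = """
{"data": {"checkJobs": [], "procConfig": {},
          "procId": "proc-91-ff383c616e5242888b398e51359c****",
          "stage": "WAIT_FOR_SUCCESS", "srcDS": "hdfs",
          "sinkDS": "ld-uf6el41jkba96****", "state": "RUNNING",
          "procType": "SPARK_BULKLOAD"},
 "success": "true"}
"""
summary = summarize_job(sample)
print(summary["state"], summary["stage"])
```

The source shows only the `RUNNING` state and the `WAIT_FOR_SUCCESS` stage, so the sketch does not try to interpret other values.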

Stop a job

Method: GET
Endpoint: http://{LTSMaster}:12311/pro/proc/{procId}/abort
Parameter: `procId`, the job ID.

Example

curl http://{LTSMaster}:12311/pro/proc/proc-91-ff383c616e5242888b398e51359c****/abort

Response

{"success":"true","message":"ok"}

Retry a job

Method: GET
Endpoint: http://{LTSMaster}:12311/pro/proc/{procId}/retry
Parameter: `procId`, the job ID.

Example

curl http://{LTSMaster}:12311/pro/proc/proc-91-ff383c616e5242888b398e51359c****/retry

Response

{"success":"true","message":"ok"}

Delete a job

Method: GET
Endpoint: http://{LTSMaster}:12311/pro/proc/{procId}/delete
Parameter: `procId`, the job ID.

Example

curl http://{LTSMaster}:12311/pro/proc/proc-91-ff383c616e5242888b398e51359c****/delete

Response

{"success":"true","message":"ok"}
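
The information, stop, retry, and delete operations share one URL pattern and differ only in the last path segment. The following is a minimal Python sketch that builds these URLs. The helper name `job_action_url`, the host `ltsmaster.example`, and the job ID `proc-91-example` are hypothetical placeholders.

```python
VALID_ACTIONS = ("info", "abort", "retry", "delete")


def job_action_url(master, proc_id, action):
    """Return the URL for one of the documented per-job operations."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return f"http://{master}:12311/pro/proc/{proc_id}/{action}"


for action in VALID_ACTIONS:
    print(job_action_url("ltsmaster.example", "proc-91-example", action))
```

Each URL is called with a GET request, as in the curl examples above.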