Configure the Databricks input component

The Databricks input component reads data from a Databricks data source. To synchronize Databricks data to other data sources, configure the Databricks input component as the source, then configure the target data source.

Prerequisites

A Databricks data source is created. For more information, see Create a Databricks data source.
Your account has read-through permission on the data source. If it does not, request the permission. For more information, see Request data source permissions.

Procedure

In the top navigation bar of the Dataphin homepage, choose Develop > Data Integration.
In the top navigation bar, select a project. In Dev-Prod mode, also select an environment.
In the left-side navigation pane, click Batch Pipeline. From the Batch Pipeline list, click the target batch pipeline to open its configuration page.
In the upper-right corner, click Component Library to open the Component Library panel.
In the Component Library panel, select Input, find Databricks, and drag it to the canvas.
Click the icon on the Databricks component to open the Databricks Input Configuration dialog box.

In the Databricks Input Configuration dialog box, configure the parameters.

Parameter	Description
Step Name	The component name. Dataphin auto-generates a name that you can modify. The name can contain only Chinese characters, letters, underscores (_), and digits, and can be up to 64 characters long.
Datasource	Lists all Databricks data sources and project-level data sources in the current Dataphin instance, regardless of whether you have read-through permission on them. Click the icon to copy the data source name. To request read-through permission, click Request next to the data source. For more information, see Request data source permissions. If no Databricks data source exists, click Create Data Source. For more information, see Create a Databricks data source.
Time Zone	The time zone used to process time-formatted data. Defaults to the time zone of the selected data source and cannot be modified. Note: For tasks created before V5.1.2, you can select Data Source Default Configuration or Channel Configuration Time Zone. The default is Channel Configuration Time Zone. Data Source Default Configuration uses the time zone of the selected data source. Channel Configuration Time Zone uses the time zone set in Properties > Channel Configuration for the current task.
Schema (optional)	Select the schema of the target table. If not specified, the schema configured in the data source is used. When a project is selected as the data source, the schema is automatically determined by the project.
Table	Search by keyword or enter the exact name and click Exact Match. The system validates the table after selection. Click the icon to copy the table name.
Shard Key (optional)	Shards data by the specified column so that the shards can be read concurrently. Use this parameter together with the concurrency setting. For optimal performance, use a primary key or an indexed column. Important: For date and time types, the system shards based on the total time range and the concurrency. Even distribution of records across shards is not guaranteed.
Batch Read Count (optional)	The number of records to read per batch, for example, 1,024. Batching reduces data source interactions, improves I/O efficiency, and lowers network latency.
Input Filter (optional)	A Databricks-compatible condition expression to filter source data. Note: Enter only the condition that follows WHERE. Do not include the WHERE keyword. You can use system global variables, such as the data timestamp ${bizdate}.
Output Fields	Displays all fields matching the selected table and filter conditions. Remove fields you do not want to pass to downstream components. Note: The data source table does not support hierarchical classification. To delete a single field, click the icon in the Operation column. To delete multiple fields in a batch, click Field Management. In the Field Management dialog box, select the fields to remove, click the left arrow icon to move them to the unselected list, then click OK.
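
The Shard Key description notes that date and time columns are sharded by the total time range and the concurrency, so even distribution is not guaranteed. The following Python sketch illustrates why. It assumes equal-width range splitting, which is a simplification for illustration and not Dataphin's documented algorithm. The function name split_range and the sample timestamps are hypothetical.

```python
from datetime import datetime


def split_range(low, high, concurrency):
    """Split [low, high] into `concurrency` equal-width intervals.

    Returns a list of (start, end) pairs. Works for numbers and datetimes.
    This is an assumed splitting strategy, used only for illustration.
    """
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    step = (high - low) / concurrency
    bounds = [low + step * i for i in range(concurrency)] + [high]
    return list(zip(bounds[:-1], bounds[1:]))


# Hypothetical shard-key values: one old row, nine rows on the most recent day.
rows = [datetime(2024, 1, 1)] + [datetime(2024, 1, 4, hour) for hour in range(9)]

shards = split_range(min(rows), max(rows), concurrency=3)

# Count rows per shard. Intervals are half-open, except the last one,
# which also includes its upper bound.
counts = []
for index, (start, end) in enumerate(shards):
    is_last = index == len(shards) - 1
    counts.append(
        sum(1 for value in rows if start <= value < end or (is_last and value == end))
    )

print(counts)  # [1, 0, 9]
```

Each shard covers the same span of time, but nearly all rows land in the last shard. A column whose values are spread evenly, such as a sequential primary key, avoids this kind of skew.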

Click OK to save the Databricks input component configuration.
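
The Input Filter parameter takes only the condition that follows WHERE and can reference global variables such as ${bizdate}. The following Python sketch shows how such a condition might be resolved and appended to a query. It is an illustration only, not Dataphin's implementation. The function build_query, the table name orders, the column names, and the yyyyMMdd value used for bizdate are all assumptions.

```python
from string import Template


def build_query(table, columns, condition=None, variables=None):
    """Assemble a SELECT statement with an optional filter condition.

    `condition` is the text after WHERE, without the keyword itself.
    `variables` maps placeholder names such as "bizdate" to their values.
    """
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    condition = (condition or "").strip()
    if condition:
        # Reject input that repeats the WHERE keyword.
        if condition.split(None, 1)[0].upper() == "WHERE":
            raise ValueError("enter only the condition, without the WHERE keyword")
        # Replace ${name} placeholders with the supplied values.
        resolved = Template(condition).substitute(variables or {})
        sql += f" WHERE {resolved}"
    return sql


# Hypothetical filter, as it would be typed into the Input Filter field.
input_filter = "ds = '${bizdate}' AND amount > 0"

query = build_query(
    table="orders",                     # hypothetical table name
    columns=["id", "amount"],           # hypothetical output fields
    condition=input_filter,
    variables={"bizdate": "20240101"},  # assumed value format
)
print(query)
# SELECT id, amount FROM orders WHERE ds = '20240101' AND amount > 0
```

Entering "WHERE ds = '${bizdate}'" in this sketch raises an error, which mirrors the rule that the keyword must be omitted.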
