Configure the Databricks input component
The Databricks input component reads data from a Databricks data source. To synchronize Databricks data to other data sources, configure the Databricks input component as the source, then configure the target data source.
Prerequisites
-
A Databricks data source is created. For more information, see Create a Databricks data source.
-
The account has read-through permission on the data source. If not, request the permission. For more information, see Request data source permissions.
Procedure
-
In the top navigation bar of the Dataphin homepage, choose Develop > Data Integration.
-
In the top navigation bar, select a project. In Dev-Prod mode, also select an environment.
-
In the left-side navigation pane, click Batch Pipeline. From the Batch Pipeline list, click the target offline pipeline to open its configuration page.
-
In the upper-right corner, click Component Library to open the Component Library panel.
-
In the Component Library panel, select Input, find Databricks, and drag it to the canvas.
-
Click the icon on the Databricks component to open the Databricks Input Configuration dialog box.
-
In the Databricks Input Configuration dialog box, configure the parameters.
Parameter
Description
Step Name
The component name. Dataphin auto-generates a name that you can modify. Naming rules:
-
Allows Chinese characters, letters, underscores (_), and digits only.
-
Maximum 64 characters.
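The naming rules above can be checked with a small validator. This is a minimal sketch, not Dataphin's actual validation logic: the function name `is_valid_step_name` is hypothetical, "Chinese characters" is approximated by the CJK Unified Ideographs block, "letters" is assumed to mean ASCII letters, and an empty name is assumed to be invalid.

```python
import re

# Assumptions (not confirmed by the product documentation):
# - "Chinese characters" is approximated by U+4E00 to U+9FFF.
# - "letters" means ASCII letters A-Z and a-z.
# - An empty name is rejected.
_STEP_NAME_RE = re.compile(r"[\u4e00-\u9fffA-Za-z0-9_]{1,64}")


def is_valid_step_name(name: str) -> bool:
    """Return True if the name uses only allowed characters and is 1 to 64 characters long."""
    return _STEP_NAME_RE.fullmatch(name) is not None
```

For example, `is_valid_step_name("databricks_input_1")` passes, while a name containing a hyphen or longer than 64 characters is rejected.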
Datasource
Lists all Databricks data sources and project-level data sources in the current Dataphin instance, regardless of whether you have read-through permission on them. Click the icon to copy the data source name.
To request read-through permission, click Request next to the data source. For more information, see Request data source permissions.
If no Databricks data source exists, click Create Data Source. For more information, see Create a Databricks data source.
Time Zone
The time zone used to process time-formatted data. Defaults to the time zone of the selected data source and cannot be modified.
Note: For tasks created before V5.1.2, you can select Data Source Default Configuration or Channel Configuration Time Zone. Default: Channel Configuration Time Zone.
-
Data Source Default Configuration: uses the time zone of the selected data source.
-
Channel Configuration Time Zone: uses the time zone set in Properties > Channel Configuration for the current task.
Schema (optional)
Select the schema that contains the table to read. If not specified, the schema configured in the data source is used.
When a project is selected as the data source, the schema is automatically determined by the project.
Table
Search by keyword or enter the exact name and click Exact Match. The system validates the table after selection. Click the icon to copy the table name.
Shard Key (optional)
Shards data by the specified column for concurrent reading. Use with the concurrency setting. Use a primary key or indexed column for optimal performance.
Important: For date and time types, the system shards based on the total time range and concurrency. Even distribution is not guaranteed.
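To illustrate why range-based sharding does not guarantee even distribution, the following sketch splits the value range of a shard key into one sub-range per concurrent reader. It is an assumption about the general technique, not a description of how Dataphin implements sharding; the function name `split_range` and the use of an integer key are hypothetical.

```python
from typing import List, Tuple


def split_range(lo: int, hi: int, concurrency: int) -> List[Tuple[int, int]]:
    """Split the inclusive range [lo, hi] into at most `concurrency`
    contiguous, inclusive sub-ranges of nearly equal width."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    if hi < lo:
        return []
    total = hi - lo + 1
    parts = min(concurrency, total)   # never create empty sub-ranges
    base, extra = divmod(total, parts)
    ranges = []
    start = lo
    for i in range(parts):
        # The first `extra` sub-ranges get one additional value each.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges
```

Each sub-range covers a similar span of key values, but the number of rows that fall into each one depends on how the data is distributed. A time-based key with most rows concentrated in a few days, for instance, produces shards of very different sizes.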
Batch Read Count (optional)
The number of records to read per batch, for example, 1,024. Reading in batches reduces the number of round trips to the data source, which improves I/O efficiency and reduces network overhead.
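The effect of a batch size can be sketched with a generator that groups records into fixed-size batches. This is an illustration of the general idea under stated assumptions, not Dataphin's reader: the function name `read_in_batches` is hypothetical, and any iterable stands in for the data source.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def read_in_batches(records: Iterable[T], batch_size: int = 1024) -> Iterator[List[T]]:
    """Yield lists of up to `batch_size` records until the source is exhausted."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

With a batch size of 1,024, reading 2,500 records takes three fetches (1,024, 1,024, and 452 records) instead of 2,500 single-record fetches.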
Input Filter (optional)
A Databricks-compatible condition expression to filter source data.
Note:
-
Enter only the condition after WHERE. Do not include the WHERE keyword.
-
You can use system global variables, such as the data timestamp ${bizdate}.
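The following sketch shows how a filter condition of this kind might be combined with a query. It is an assumption for illustration, not Dataphin's implementation: the function `build_query`, the table name `orders`, the column name `ds`, and the `yyyyMMdd` value used for `${bizdate}` are all hypothetical.

```python
import re
from typing import Dict


def build_query(table: str, input_filter: str, variables: Dict[str, str]) -> str:
    """Append a filter condition to a SELECT statement after replacing
    ${name} placeholders with the supplied values."""
    condition = input_filter.strip()
    # The condition must not start with the WHERE keyword itself.
    if re.match(r"where\b", condition, flags=re.IGNORECASE):
        raise ValueError("enter only the condition, without the WHERE keyword")
    for name, value in variables.items():
        condition = condition.replace("${" + name + "}", value)
    if not condition:
        return f"SELECT * FROM {table}"
    return f"SELECT * FROM {table} WHERE {condition}"


# Hypothetical usage: the filter as it would be typed into Input Filter.
query = build_query("orders", "ds = '${bizdate}'", {"bizdate": "20240101"})
```

Here the text entered in the Input Filter field is `ds = '${bizdate}'`, and the placeholder is resolved before the condition is applied.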
Output Fields
Displays all fields matching the selected table and filter conditions. Remove fields you do not want to pass to downstream components.
Note: The data source table does not support hierarchical classification.
-
Delete a single field: Click the icon in the Operation column.
-
Delete multiple fields in batches: Click Field Management. In the Field Management dialog box, select the fields to remove, click the left arrow icon to move them to the unselected list, then click OK.
-
Click OK to save the Databricks input component configuration.