When you synchronize data from a data source to OpenSearch, field values in the source may need transformation before indexing — for example, splitting a comma-separated string into an array, or stripping HTML tags. Data processing plug-ins handle these transformations automatically as data syncs, so you don't need to pre-process data outside of OpenSearch.
Data processing plug-ins are available only when you synchronize data through a configured data source. If you upload data by calling API operations or by using OpenSearch SDKs, process the data before you upload it.
Prerequisites
Before you begin, ensure that you have:
A configured data source for your OpenSearch application
Field mappings defined between your source tables and OpenSearch tables
Configure plug-ins when setting up the data source, not when defining the application schema. Plug-ins are only available after a data source is configured.
Data source constraints
Note these constraints before configuring plug-ins:
ApsaraDB RDS and PolarDB: An OpenSearch table can be associated with multiple source tables (supports database and table sharding).
MaxCompute: An OpenSearch table can be associated with only one source table. To synchronize from multiple MaxCompute source tables, join them into a single table first.
Available plug-ins
OpenSearch provides five plug-ins for field transformation during synchronization:
| Plug-in | Transformation type | What it does |
|---|---|---|
| JsonKeyValueExtractor | JSON extraction | Extracts a specified key's value from a JSON-formatted source field |
| MultiValueSpliter | Value splitting | Splits a source field into multiple values using a delimiter |
| KeyValueExtractor | Key-value extraction | Extracts keys and values from key-value pair source fields |
| StringCatenateExtractor | String concatenation | Concatenates values from multiple fields into a single string |
| HTMLTagRemover | HTML stripping | Removes HTML tags from a source field value |
JsonKeyValueExtractor
Extracts the value of a specified key from a JSON-formatted source field and maps it to the destination field. Only the value of the specified key can be extracted.
Type requirement: The extracted value type must match the destination field type. If the types are mismatched, the extracted value is silently dropped.
Array conversion: If the extracted value is a JSON array, it is automatically converted to an Array type field value.
Example
Source field value:
{"title": "the content", "body": "the content"}
To extract the title key, configure the plug-in to target title. The destination field receives "the content".
For Array types:
LITERAL_ARRAY source:
{"tags": ["a", "b", "c"]}
INT_ARRAY source:
{"tags": [1, 2, 3]}
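For data uploaded through the API or SDKs, where the plug-in is not available, the same extraction can be approximated before upload. The following is a minimal Python sketch, not the plug-in's implementation. The function name `extract_json_key` and the `expected_type` parameter are hypothetical, and returning `None` stands in for the documented behavior of dropping a mismatched value.

```python
import json
from typing import Any, Optional


def extract_json_key(raw: str, key: str, expected_type: type) -> Optional[Any]:
    """Return the value stored under `key`, or None if it is missing or mistyped."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON: nothing to extract
    if not isinstance(doc, dict):
        return None  # only JSON objects have keys
    value = doc.get(key)
    # Assumption: a type mismatch drops the value, as described above.
    if isinstance(value, expected_type):
        return value
    return None


title = extract_json_key('{"title": "the content", "body": "the content"}', "title", str)
literal_array = extract_json_key('{"tags": ["a", "b", "c"]}', "tags", list)
int_array = extract_json_key('{"tags": [1, 2, 3]}', "tags", list)
dropped = extract_json_key('{"title": "the content"}', "title", int)
```

Here `title` holds the string value, the two array samples come back as Python lists, and `dropped` is `None` because a string does not match the requested `int` type.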
MultiValueSpliter
Splits a source field value into multiple values using a specified delimiter, and writes the results to an Array type destination field.
Type requirement: The destination field must be of Array type.
Delimiter support:
| Delimiter type | How to specify |
|---|---|
| Common non-printable characters (e.g., \t) | Write directly |
| Uncommon non-printable characters | Use Unicode notation (e.g., \u001D) |
| Multi-character delimiters (e.g., ##, \t\t) | Write directly |
Example
Source field value: 1,2,3
Specify , as the delimiter. The Array type destination field then receives three values: 1, 2, and 3.
For more configuration details, see MultiValueSpliter configuration.
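To make the delimiter rules concrete, here is a minimal Python sketch of the same splitting logic, for example as a pre-processing step before an API or SDK upload. It is an illustration under assumptions, not the plug-in's code: `split_multi_value` is a hypothetical helper, and treating an empty source value as an empty array is a choice made for this sketch.

```python
from typing import List


def split_multi_value(raw: str, delimiter: str) -> List[str]:
    """Split `raw` on `delimiter` and return the pieces as a list."""
    if not delimiter:
        raise ValueError("delimiter must not be empty")
    if raw == "":
        return []  # assumption: an empty source value yields an empty array
    return raw.split(delimiter)


# Single-character delimiter, converted for an INT_ARRAY style destination.
int_values = [int(v) for v in split_multi_value("1,2,3", ",")]

# Common non-printable delimiter written directly.
tab_values = split_multi_value("a\tb", "\t")

# Uncommon non-printable delimiter written in Unicode notation.
gs_values = split_multi_value("x\u001Dy", "\u001D")

# Multi-character delimiter.
hash_values = split_multi_value("p##q##r", "##")
```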
KeyValueExtractor
Extracts the values of a specified key from a source field formatted as key-value pairs, and maps the result to the destination field. Only the values of the specified key can be extracted. Delimiters are not required.
Type requirement: The extracted value type must match the destination field type. If the types are mismatched, the value is silently dropped. If a delimiter separates extracted values, the destination field must be of Array type.
Duplicate key behavior: If two identical keys exist, only the value of the second key is extracted.
Example
Source field value: key1:value1,value2;key2:value3
Configuration:
Key-value pairs are separated by semicolons (;): separates key1:value1,value2 from key2:value3
Keys and values are separated by colons (:): separates key from value
Values are separated by commas (,): separates multiple values for a key
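The three separators in this example can be sketched in Python as follows. This is only an illustration of the parsing described above. `parse_key_values` and its parameter names are hypothetical, and skipping malformed pairs is an assumption made for this sketch. A repeated key overwrites the earlier entry, which matches the duplicate-key behavior described above.

```python
from typing import Dict, List


def parse_key_values(raw: str, pair_sep: str = ";", kv_sep: str = ":",
                     value_sep: str = ",") -> Dict[str, List[str]]:
    """Parse 'k1:v1,v2;k2:v3' style text into a dict of value lists."""
    result: Dict[str, List[str]] = {}
    for pair in raw.split(pair_sep):
        if kv_sep not in pair:
            continue  # assumption: skip fragments without a key-value separator
        key, _, values = pair.partition(kv_sep)
        # A later duplicate key replaces the earlier one.
        result[key] = values.split(value_sep)
    return result


parsed = parse_key_values("key1:value1,value2;key2:value3")
key1_values = parsed["key1"]  # two values, so an Array type destination is needed
key2_values = parsed["key2"]
duplicate = parse_key_values("k:a;k:b")
```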
StringCatenateExtractor
Concatenates values from multiple destination table fields into a single string in a specified order.
Type requirement: This plug-in cannot concatenate fields of the INT type. We recommend that you use fields of the LITERAL type.
Field source: Fields must come from the destination table, not the source table. Separate multiple field names with commas (,).
System variable: Use $table to include the current table name in the concatenated string. $table is only populated when a table-sharding wildcard is configured.
Example
To concatenate field1 and field2 with an underscore (_) separator, list the fields as field1,field2. The resulting string has the form <field1 value>_<field2 value>.
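As a rough illustration of the concatenation, the Python sketch below joins string fields in the listed order. It is not the plug-in's implementation. `concatenate_fields`, the sample row, and the table name `orders_01` are hypothetical. Two behaviors are assumptions of this sketch: non-string values are rejected to mirror the INT restriction, and an unset `$table` contributes an empty string.

```python
from typing import Mapping, Optional, Sequence


def concatenate_fields(row: Mapping[str, object], field_names: Sequence[str],
                       separator: str = "_",
                       table_name: Optional[str] = None) -> str:
    """Join the named fields of `row` in order, separated by `separator`."""
    parts = []
    for name in field_names:
        if name == "$table":
            # Assumption: without a known table name, $table is left empty.
            parts.append(table_name or "")
            continue
        value = row[name]
        if not isinstance(value, str):
            # Assumption: mirror the INT restriction by rejecting non-strings.
            raise TypeError(f"field {name!r} is not a string")
        parts.append(value)
    return separator.join(parts)


row = {"field1": "abc", "field2": "def"}
# Field names are given as a comma-separated list, as described above.
joined = concatenate_fields(row, "field1,field2".split(","))
with_table = concatenate_fields(row, ["$table", "field1"], table_name="orders_01")
```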
HTMLTagRemover
Strips all HTML tags from a source field value. The destination field receives the plain text content.
Example
Source field value: <div id="copyright">OpenSearch</div>
After processing, the destination field value is: OpenSearch
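When the plug-in is not available, for example for API or SDK uploads, similar stripping can be done with Python's standard `html.parser` module. The sketch below is one possible approach, not the plug-in's algorithm. The names `_TextCollector` and `remove_html_tags` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class _TextCollector(HTMLParser):
    """Collects the text found between tags and ignores the tags themselves."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def remove_html_tags(raw: str) -> str:
    """Return `raw` with HTML tags removed, keeping the text content."""
    parser = _TextCollector()
    parser.feed(raw)
    parser.close()
    return "".join(parser.chunks)


plain = remove_html_tags('<div id="copyright">OpenSearch</div>')
```

Note that this sketch also keeps the text inside script and style elements, so filter those separately if the source fields can contain them.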
Limitations
| Constraint | Detail |
|---|---|
| API and SDK uploads | Plug-ins are not available. Process data before uploading. |
| MaxCompute data source | One OpenSearch table maps to one MaxCompute source table only |
| StringCatenateExtractor | Cannot concatenate INT type fields |
| JsonKeyValueExtractor and KeyValueExtractor | Type mismatch between the extracted value and destination field causes silent data loss |
| Plug-in configuration timing | Configure only after a data source is set up; not available during schema definition |