Paraformer audio file transcription iOS SDK - Alibaba Cloud Model Studio

This document describes how to use the Paraformer audio file transcription iOS SDK to convert speech to text.

User guide: For an introduction to the models and selection recommendations, see Audio file transcription.

Getting started

Get an API key: Get and configure an API key

Note
To grant temporary access to third-party applications or users, or to strictly control risky operations such as accessing or deleting sensitive data, use a temporary API key. A temporary API key is valid for 60 seconds. Obtain a new one after it expires.
Download the SDK and run the sample code:
- Download the latest SDK package.
- Unzip the package and add nuisdk.framework to your project.
- In Build Phases → Link Binary With Libraries, add nuisdk.framework.
- In General → Frameworks, Libraries, and Embedded Content, set nuisdk.framework to Embed & Sign.
- Open the sample project in Xcode. The sample code is in the DashParaformerFileTranscriberViewController class. Replace the API key to test the feature.

Call procedure

Synchronous mode

Initialize the SDK.
Configure parameters as needed.
Call nui_file_trans_start to start the recognition task. Set the async_request parameter to false.
In the onFileTransEventCallback callback, listen for the EVENT_FILE_TRANS_RESULT event to retrieve the final recognition result.
Call nui_release to release the SDK resources.

Asynchronous mode

Initialize the SDK.
Configure parameters as needed.
Call nui_file_trans_start to start the recognition task. Set the async_request parameter to true.
Call nui_file_trans_query to query the recognition progress or results.
In the onFileTransEventCallback callback, listen for the EVENT_FILE_TRANS_QUERY_RESULT event to obtain the current query result.
In the onFileTransEventCallback callback, listen for the EVENT_FILE_TRANS_RESULT event to obtain the final recognition result.
Call nui_release to release the SDK resources.

Request parameters

Connection and control parameters

Configure the SDK by passing a JSON string as the parameters argument of the nui_initialize interface.

Parameter example: The following JSON string shows only some of the parameters. Add others as needed.

```
{
    "url": "wss://dashscope.aliyuncs.com/api-ws/v1/inference",
    "apikey": "st-****",
    "device_id": "my_device_id",
    "service_mode": "1"
}
```

Parameter descriptions

Parameter	Type	Required	Description
`url`	`String`	Yes	The endpoint. This is fixed to `wss://dashscope.aliyuncs.com/api-ws/v1/inference`.
`apikey`	`String`	Yes	The API key. For improved security, use a temporary API key. It has a short validity period to reduce the risk of leakage.
`service_mode`	`String`	Yes	The running mode. For audio file transcription, this is fixed to `"1"`.
`device_id`	`String`	Yes	A unique string that identifies the end user. You can set it to an in-app user ID or a unique device identifier generated by the client. This ID is mainly used for log tracking and troubleshooting.
`debug_path`	`String`	No	The storage path for log files. This parameter takes effect only when `save_log` is set to `YES` in the call to the nui_initialize interface. You must set a log file path in this case, or an error will occur. A maximum of two log files are kept locally.
`max_log_file_size`	`int`	No	The maximum size of a log file in bytes. This parameter takes effect only when `save_log` is set to `YES` in the call to the nui_initialize interface. Default value: 104857600 (100 × 1024 × 1024 bytes, which is 100 MiB).
`log_track_level`	`int`	No	Controls the filtering level for log content sent externally through the onNuiLogTrackCallback. Default value: 2. Valid values: 0 (LOG_LEVEL_VERBOSE), 1 (LOG_LEVEL_DEBUG), 2 (LOG_LEVEL_INFO), 3 (LOG_LEVEL_WARNING), 4 (LOG_LEVEL_ERROR), 5 (LOG_LEVEL_NONE, which disables this feature). Note: `log_track_level` and `level` (set through the nui_initialize interface) together determine which logs are sent to the callback. A log is sent only if its level value is greater than or equal to both `log_track_level` and `level`. For example, if `log_track_level` is set to 2 (INFO) and `level` is set to 3 (WARNING), only logs at the WARNING level and higher (value >= 3) are sent.
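
The way `log_track_level` and `level` combine can be written as a small predicate. The following C sketch only illustrates the rule described in the table and is not SDK code: the `demo_` names are hypothetical, and the numeric values are copied from the value list for `log_track_level`. The SDK's own `NuiSdkLogLevel` constants should be used in a real app.

```c
#include <stdbool.h>

/* Local mirror of the documented level values (0 = VERBOSE ... 5 = NONE).
   Hypothetical names; not the SDK's NuiSdkLogLevel enum. */
enum demo_log_level {
    DEMO_LOG_VERBOSE = 0,
    DEMO_LOG_DEBUG   = 1,
    DEMO_LOG_INFO    = 2,
    DEMO_LOG_WARNING = 3,
    DEMO_LOG_ERROR   = 4,
    DEMO_LOG_NONE    = 5
};

/* Returns true when a message at msg_level passes both thresholds,
   i.e. msg_level >= log_track_level and msg_level >= level. */
static bool demo_log_is_forwarded(int msg_level, int log_track_level, int level)
{
    if (log_track_level >= DEMO_LOG_NONE) {
        return false; /* 5 disables log tracking entirely */
    }
    return msg_level >= log_track_level && msg_level >= level;
}
```

With `log_track_level` set to 2 and `level` set to 3, this predicate accepts WARNING and ERROR messages and rejects INFO messages, which matches the example in the table.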

Speech recognition effect parameters

Use the nui_set_param API to configure the nls_config parameter, or use the nui_file_trans_start API to configure all speech recognition effect parameters.

Parameter example: The following JSON string shows only some of the parameters. Add others as needed.

```
{
    "file_urls": [
        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"
    ],
    "async_request": false,
    "nls_config": {
        "model": "paraformer-v2",
        "disfluency_removal_enabled": false,
        "timestamp_alignment_enabled": false
    }
}
```

Parameter descriptions

Parameter	Type	Required	Description
`file_urls`	`array[string]`	Yes	A list of URLs for the audio or video files to be transcribed. HTTP and HTTPS are supported. A single request supports only 1 URL. If your audio files are stored in OSS, note that the SDK does not support temporary URLs that start with the oss:// prefix. Audio formats: `aac`, `amr`, `avi`, `flac`, `flv`, `m4a`, `mkv`, `mov`, `mp3`, `mp4`, `mpeg`, `ogg`, `opus`, `wav`, `webm`, `wma`, `wmv`. Important: Because of the large number of audio and video formats and their variations, not all of them can be tested, and the API cannot guarantee that every format is recognized correctly. Test your files to verify that you get normal speech recognition results. Audio sample rate (varies by model): paraformer-v2 and paraformer-v1 support any sample rate; paraformer-8k-v2 and paraformer-8k-v1 support only an 8 kHz sample rate; paraformer-mtl-v1 supports a sample rate of 16 kHz or higher. Audio file size and duration: The audio file cannot exceed 2 GB, and the duration must be within 12 hours. To process a file that exceeds these limits, pre-process the file to reduce its size. For best practices, see Pre-process video files to improve transcription efficiency (for audio file transcription).
`async_request`	`boolean`	No	Specifies whether the speech recognition request is asynchronous. Default value: `false`. Valid values: `true` (asynchronous request), `false` (synchronous request).
`apikey`	`string`	No	If the `apikey` in the Connection and control parameters is a temporary API key, you can update it here to prevent it from expiring.
`nls_config`	`object`	Yes	The core configuration object for speech recognition. It contains key parameters such as model selection and recognition effect controls.
`nls_config.model`	`string`	Yes	The speech recognition model.
`nls_config.language_hints`	`array[string]`	No	Specifies the language codes of the speech to be recognized. This parameter applies only to the paraformer-v2 model. Default value: `["zh", "en"]`. Supported language codes: zh (Chinese), en (English), ja (Japanese), yue (Cantonese), ko (Korean), de (German), fr (French), ru (Russian).
`nls_config.disfluency_removal_enabled`	`boolean`	No	Specifies whether to filter out disfluencies, such as "um" and "ah". Default value: `false`. Valid values: `true` (filter), `false` (do not filter).
`nls_config.timestamp_alignment_enabled`	`boolean`	No	Specifies whether to enable the timestamp alignment feature. Default value: `false`. Valid values: `true` (enable), `false` (disable).
`nls_config.special_word_filter`	`object`	No	Specifies the sensitive words to be processed during speech recognition and supports different processing methods for different sensitive words. If this parameter is not passed, the system's built-in sensitive word filtering logic is enabled: words in the recognition result that match the Alibaba Cloud Model Studio sensitive word list are replaced with an equal number of asterisks (`*`). If this parameter is passed, you can implement the following sensitive word processing policies: Replace with `*`: replaces matching sensitive words with an equal number of asterisks (`*`). Filter directly: completely removes matching sensitive words from the recognition result. The value of this parameter must be a JSON object with the following structure: `{ "filter_with_signed": { "word_list": ["test"] }, "filter_with_empty": { "word_list": ["start", "happen"] }, "system_reserved_filter": true }` JSON field descriptions: `filter_with_signed`: Type: Object. Required: No. Description: Configures a list of sensitive words to be replaced with asterisks (`*`). Matching words in the recognition result are replaced with an equal number of asterisks (`*`). Example: Based on the JSON example, the speech recognition result for "Help me test this piece of code" will be "Help me **** this piece of code". Internal fields: `word_list`: An array of strings that lists the sensitive words to be replaced. `filter_with_empty`: Type: Object. Required: No. Description: Configures a list of sensitive words to be removed (filtered) from the recognition result. Matching words are completely deleted from the result. Example: Based on the JSON example, the speech recognition result for "Is the game about to start?" will be "Is the game about to?". Internal fields: `word_list`: An array of strings that lists the sensitive words to be completely removed (filtered). `system_reserved_filter`: Type: Boolean. Required: No. Default value: true. Description: Specifies whether to enable the system's preset sensitive word rules. If set to `true`, the system's built-in sensitive word filtering logic is also enabled, and words in the recognition result that match the Alibaba Cloud Model Studio sensitive word list are replaced with an equal number of asterisks (`*`).
`nls_config.channel_id`	`array[integer]`	No	Specifies the indexes of the audio tracks in a multi-track audio file to recognize. The index starts from 0. For example, [0] indicates that only the first track is recognized, and [0, 1] indicates that both the first and second tracks are recognized. If you omit this parameter, the first track is processed by default. Important Each specified audio track is billed separately. For example, a request for [0, 1] for a single file incurs two separate charges. Default value: `[0]`
`nls_config.diarization_enabled`	`boolean`	No	Automatic speaker diarization. This feature is disabled by default. This feature is applicable only to mono audio. Multi-channel audio does not support speaker diarization. When this feature is enabled, the recognition results will display a `speaker_id` field to distinguish different speakers. Note If you enable speaker diarization, keep the audio duration under 2 hours. Exceeding this limit may cause recognition failures or timeouts. For an example of `speaker_id`, see Recognition result description.
`nls_config.speaker_count`	`integer`	No	A reference value for the number of speakers. To use this feature, you must set `diarization_enabled` to `true`. By default, the number of speakers is automatically determined. If you configure this parameter, it only helps the algorithm try to output the specified number of speakers, but it cannot guarantee that this number will be output. Value range: `[2, 100]`. This feature is used to distinguish between multiple speakers, so you must set it to at least 2.
`nls_config.vocabulary_id`	`string`	No	The ID of the hotword list, used to improve the recognition accuracy of specific words. This parameter applies to v2 and later models. For more information about how to use hotwords, see Customize hotwords.
`nls_config.resources`	`array[object]`	No	The hotword resource configuration for v1 models. This feature is the same as `vocabulary_id`, but the configuration method is different: `resources` is an array of objects. Each object contains the `resource_id` and `resource_type` fields: `resource_id`: A `string`. The hotword ID. `resource_type`: A `string`. The value is fixed to "`asr_phrase`". Example: `{ "nls_config": { "resources": [ { "resource_id": "xxxxxxxxxxxx", "resource_type": "asr_phrase" } ] } }` For more information about how to use hotwords, see Customize and manage hotwords for Paraformer speech recognition.
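
Because nui_file_trans_start receives these parameters as a C string, an app typically assembles the JSON before the call. The following C sketch shows one possible way to do that with `snprintf`, together with a pre-check that restates the sample rate rules from the `file_urls` row. It is not part of the SDK: the `demo_` function names are hypothetical, only three of the documented fields are emitted, and the inputs are assumed not to contain characters that need JSON escaping.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Restates the documented sample rate support per model.
   Returns false for model names that the table does not list. */
static bool demo_sample_rate_supported(const char *model, int sample_rate_hz)
{
    if (strcmp(model, "paraformer-8k-v2") == 0 ||
        strcmp(model, "paraformer-8k-v1") == 0) {
        return sample_rate_hz == 8000;      /* 8 kHz only */
    }
    if (strcmp(model, "paraformer-mtl-v1") == 0) {
        return sample_rate_hz >= 16000;     /* 16 kHz or higher */
    }
    if (strcmp(model, "paraformer-v2") == 0 ||
        strcmp(model, "paraformer-v1") == 0) {
        return sample_rate_hz > 0;          /* any sample rate */
    }
    return false;
}

/* Writes the request JSON into out. Returns the number of characters
   written, or -1 if an argument is missing or the buffer is too small.
   file_url and model are inserted verbatim (no JSON escaping). */
static int demo_build_start_params(char *out, size_t out_size,
                                   const char *file_url,
                                   const char *model,
                                   bool async_request)
{
    if (out == NULL || file_url == NULL || model == NULL) {
        return -1;
    }
    int n = snprintf(out, out_size,
                     "{\"file_urls\":[\"%s\"],"
                     "\"async_request\":%s,"
                     "\"nls_config\":{\"model\":\"%s\"}}",
                     file_url,
                     async_request ? "true" : "false",
                     model);
    if (n < 0 || (size_t)n >= out_size) {
        return -1; /* encoding error or truncated output */
    }
    return n;
}
```

Checking the `snprintf` return value against the buffer size avoids passing a silently truncated, and therefore invalid, JSON string to the SDK. For inputs that may contain quotes or backslashes, a JSON library such as `NSJSONSerialization` is the safer choice.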

Key interfaces

NeoNui

nui_initialize

Initializes the speech recognition SDK instance. The SDK is a singleton. Do not initialize it again before you call nui_release.

Method signature

```
-(NuiResultCode) nui_initialize:(const char *)parameters
                       logLevel:(NuiSdkLogLevel)level
                        saveLog:(BOOL)save_log;
```

Parameter descriptions

Parameter	Type	Description
`parameters`	`char*`	A JSON string that contains authentication, connection, and debugging parameters. See Connection and control parameters.
`level`	`NuiSdkLogLevel`	Controls the printing level of the SDK's own logs.
`save_log`	`BOOL`	Specifies whether to save local logs. If set to `YES`, you must specify a path using `debug_path` in the Connection and control parameters. You can also set the file size using `max_log_file_size`.

Return value description

Returns an error code. For more information, see Query error codes.

nui_set_param

This interface sets or updates the nls_config parameter independently. If you provide all parameters at once in nui_file_trans_start, you do not need to call this method.

Method signature

```
-(NuiResultCode) nui_set_params:(const char *)params;
```

Parameter descriptions

Parameter	Type	Description
`params`	`char*`	The nls_config parameter in Speech recognition effect parameters. Parameters other than nls_config cannot be set using this method. Example: `{ "nls_config": { "model": "paraformer-v2", "disfluency_removal_enabled": false, "timestamp_alignment_enabled": false } }`

Return value description

Returns an error code. For more information, see Query error codes.

nui_file_trans_start

Starts a recognition task.

Method signature

```
-(NuiResultCode) nui_file_trans_start:(const char *)params
                               taskId:(char *)task_id;
```

Parameter descriptions

Parameter	Type	Description
`params`	`char*`	Speech recognition effect parameters. Example: `{ "file_urls": [ "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav" ], "async_request": false, "nls_config": { "model": "paraformer-v2", "disfluency_removal_enabled": false, "timestamp_alignment_enabled": false } }`
`task_id`	`char*`	The task ID. The SDK internally generates a random string. You can get the task_id after this interface is successfully called.

Return value description

Returns an error code. For more information, see Query error codes.

nui_file_trans_query

Queries the current status and result of an asynchronous task. After a successful call, the result is returned by the EVENT_FILE_TRANS_QUERY_RESULT event in the onFileTransEventCallback callback.

Method signature

```
-(NuiResultCode) nui_file_trans_query:(const char *)task_id;
```

Parameter descriptions

Parameter	Type	Description
`task_id`	`char*`	The ID of the task to query.

Return value description

Returns an error code. For more information, see Query error codes.

nui_file_trans_cancel

Cancels the current task.

Method signature

```
-(NuiResultCode) nui_file_trans_cancel:(const char *)task_id;
```

Parameter descriptions

Parameter	Type	Description
`task_id`	`char*`	The ID of the task to cancel.

Return value description

Returns an error code. For more information, see Query error codes.

nui_release

Releases all internal resources of the SDK and forcibly terminates all ongoing tasks. After you call this method, the SDK instance becomes unavailable. To use the instance again, you must call nui_initialize to initialize it.

Method signature
```
-(NuiResultCode) nui_release;
```
Return value description

Returns an error code. For more information, see Query error codes.

nui_get_version

Retrieves the current SDK version information.

Method signature
```
-(const char*) nui_get_version;
```
Return value description

The current SDK version information.

NeoNuiSdkDelegate: Listener callback

onFileTransEventCallback: Listen for events and speech recognition results

Method signature

```
-(void) onFileTransEventCallback:(NuiCallbackEvent)nuiEvent
                       asrResult:(const char *)asr_result
                          taskId:(const char *)task_id
                        ifFinish:(BOOL)finish
                         retCode:(int)code;
```

Parameter descriptions

Parameter	Type	Description
`nuiEvent`	`NuiCallbackEvent`	The callback event.
`asr_result`	`char*`	The speech recognition result.
`task_id`	`char*`	The task ID.
`finish`	`BOOL`	A flag that indicates whether the current recognition round is complete.
`code`	`int`	The error code. This is valid when an EVENT_ASR_ERROR event occurs. For more information, see Query error codes.

onFileTransLogTrackCallback: Listen for tracking logs

This callback receives detailed internal logs from the SDK for troubleshooting and debugging.

```
-(void)onFileTransLogTrackCallback:(NuiSdkLogLevel)level
                        logMessage:(const char *)log;
```

NuiCallbackEvent: Event types

Event	Description
EVENT_FILE_TRANS_CONNECTED	Successfully connected to the service.
EVENT_FILE_TRANS_UPLOADED	Successfully uploaded the audio file for transcription.
EVENT_FILE_TRANS_QUERY_RESULT	The result of a task query.
EVENT_FILE_TRANS_RESULT	The final transcription result.
EVENT_ASR_ERROR	An error occurred during speech recognition.
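
An app usually branches on these events inside onFileTransEventCallback. The following C sketch shows one way to classify them. It relies on assumptions: the `DEMO_` enum is a local stand-in whose numeric values are arbitrary (use the SDK's `NuiCallbackEvent` constants in real code), and treating EVENT_ASR_ERROR as the end of a task is a guess that the source does not confirm.

```c
#include <stdbool.h>

/* Hypothetical local stand-ins for the documented event names. */
typedef enum {
    DEMO_EVENT_FILE_TRANS_CONNECTED,
    DEMO_EVENT_FILE_TRANS_UPLOADED,
    DEMO_EVENT_FILE_TRANS_QUERY_RESULT,
    DEMO_EVENT_FILE_TRANS_RESULT,
    DEMO_EVENT_ASR_ERROR
} demo_event;

/* True when the event is expected to carry recognition text in asr_result. */
static bool demo_event_has_result(demo_event ev)
{
    return ev == DEMO_EVENT_FILE_TRANS_QUERY_RESULT ||
           ev == DEMO_EVENT_FILE_TRANS_RESULT;
}

/* True when the app can stop waiting for the task.
   Assumption: an error event also ends the task. */
static bool demo_event_is_terminal(demo_event ev)
{
    return ev == DEMO_EVENT_FILE_TRANS_RESULT ||
           ev == DEMO_EVENT_ASR_ERROR;
}
```

In asynchronous mode, a query result is a progress snapshot, so the sketch treats only the final result and the error event as terminal. The `finish` flag passed to the callback remains the authoritative completion signal.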