This document describes how to use the Paraformer audio file transcription iOS SDK to convert speech to text.
User guide: For an introduction to the models and selection recommendations, see Audio file transcription.
Getting started
- Get an API key: Get and configure an API key
  Note: To grant temporary access to third-party applications or users, or to strictly control risky operations such as accessing or deleting sensitive data, use a temporary API key. A temporary API key is valid for 60 seconds, and you must obtain a new one after it expires.
- Download the SDK and run the sample code:
  - Unzip the package and add nuisdk.framework to your project.
  - In Build Phases → Link Binary With Libraries, add nuisdk.framework.
  - In General → Frameworks, Libraries, and Embedded Content, set nuisdk.framework to Embed & Sign.
  - Open the sample project in Xcode. The sample code is in the DashParaformerFileTranscriberViewController class. Replace the API key to test the feature.
Call procedure
Synchronous mode
- Initialize the SDK.
- Configure parameters as needed.
- Call nui_file_trans_start to start the recognition task. Set the async_request parameter to false.
- In the onFileTransEventCallback callback, listen for the EVENT_FILE_TRANS_RESULT event to retrieve the final recognition result.
- Call nui_release to release the SDK resources.
Asynchronous mode
- Initialize the SDK.
- Configure parameters as needed.
- Call nui_file_trans_start to start the recognition task. Set the async_request parameter to true.
- Call nui_file_trans_query to query the recognition progress or results.
- In the onFileTransEventCallback callback, listen for the EVENT_FILE_TRANS_QUERY_RESULT event to obtain the current query result.
- In the onFileTransEventCallback callback, listen for the EVENT_FILE_TRANS_RESULT event to obtain the final recognition result.
- Call nui_release to release the SDK resources.
Request parameters
Connection and control parameters
You can configure the SDK by passing a JSON string to the parameters parameter of the nui_initialize interface.
- Parameter example: The following JSON string is an example. Not all parameters are listed. Add other parameters as needed in your code.
  {
    "url": "wss://dashscope.aliyuncs.com/api-ws/v1/inference",
    "apikey": "st-****",
    "device_id": "my_device_id",
    "service_mode": "1"
  }
- Parameter descriptions
  Parameter | Type | Required | Description
url | String | Yes
The endpoint. This is fixed to wss://dashscope.aliyuncs.com/api-ws/v1/inference.
apikey | String | Yes
The API key. For improved security, use a temporary API key. Its short validity period reduces the risk of leakage.
service_mode | String | Yes
The running mode. For audio file transcription, this is fixed to "1".
device_id | String | Yes
A unique string that identifies the end user. You can set it to an in-app user ID or a unique device identifier generated by the client. This ID is mainly used for log tracking and troubleshooting.
debug_path | String | No
The storage path for log files.
This parameter takes effect only when save_log is set to YES in the call to the nui_initialize interface. In that case you must set a log file path, or an error occurs. A maximum of two log files are kept locally.
max_log_file_size | int | No
The maximum size of a log file, in bytes.
This parameter takes effect only when save_log is set to YES in the call to the nui_initialize interface. Default value: 104857600 (100 × 1024 × 1024 bytes, which is 100 MiB).
log_track_level | int | No
Controls the filtering level for log content sent externally through the onNuiLogTrackCallback.
Default value: 2.
Value range:
- 0: LOG_LEVEL_VERBOSE
- 1: LOG_LEVEL_DEBUG
- 2: LOG_LEVEL_INFO
- 3: LOG_LEVEL_WARNING
- 4: LOG_LEVEL_ERROR
- 5: LOG_LEVEL_NONE (disables this feature)
Note: log_track_level and level (set through the nui_initialize interface) together determine which logs are sent to the callback. For a log to be sent, its level value must be greater than or equal to both the log_track_level and level values. For example, if log_track_level is set to 2 (INFO) and level is set to 3 (WARNING), only logs at the WARNING level and higher (value >= 3) are sent.
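The combined effect of log_track_level and level can be illustrated with a small sketch. It is written in Python only to make the documented rule concrete and is not SDK code: the helper name should_send_log is hypothetical, and it assumes that level uses the same 0 to 5 numeric scale listed for log_track_level, as the WARNING example suggests.

```python
# Numeric levels as listed for log_track_level.
LOG_LEVEL_NAMES = {
    0: "VERBOSE",
    1: "DEBUG",
    2: "INFO",
    3: "WARNING",
    4: "ERROR",
}
LOG_LEVEL_NONE = 5  # disables log tracking


def should_send_log(message_level, log_track_level, level):
    """Return True if a log passes both thresholds.

    Hypothetical helper: a log is sent only when its level value is
    greater than or equal to both log_track_level and level.
    """
    return message_level >= max(log_track_level, level)


if __name__ == "__main__":
    # log_track_level = 2 (INFO), level = 3 (WARNING)
    for value, name in LOG_LEVEL_NAMES.items():
        print(name, should_send_log(value, 2, 3))
```

Because no message has a level of 5, setting either threshold to LOG_LEVEL_NONE suppresses every log under this rule.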
Speech recognition effect parameters
You can use the nui_set_param interface to configure the nls_config parameter, or use the nui_file_trans_start interface to configure all speech recognition effect parameters.
- Parameter example: The following JSON string is an example. Not all parameters are listed. Add other parameters as needed in your code.
  {
    "file_urls": [
      "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"
    ],
    "async_request": false,
    "nls_config": {
      "model": "paraformer-v2",
      "disfluency_removal_enabled": false,
      "timestamp_alignment_enabled": false
    }
  }
- Parameter descriptions
  Parameter | Type | Required | Description
file_urls | array[string] | Yes
A list of URLs for the audio or video files to be transcribed. The HTTP and HTTPS protocols are supported. A single request supports only one URL.
If your audio files are stored in OSS, note that the SDK does not support temporary URLs that start with the oss:// prefix.
- Audio formats: aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv
  Important: Audio and video formats have many variants, so it is not technically possible to test all of them, and the API cannot guarantee that every format is recognized correctly. Test your files to verify that they produce normal speech recognition results.
- Audio sampling rate: Varies by model:
  - paraformer-v2 supports any sample rate.
  - paraformer-v1 supports any sample rate.
  - paraformer-8k-v2 supports only an 8 kHz sample rate.
  - paraformer-8k-v1 supports only an 8 kHz sample rate.
  - paraformer-mtl-v1 supports a sample rate of 16 kHz or higher.
- Audio file size and duration: The audio file cannot exceed 2 GB, and its duration cannot exceed 12 hours.
  To process a file that exceeds these limits, pre-process the file to reduce its size. For best practices, see Pre-process video files to improve transcription efficiency (for audio file transcription).
async_request | boolean | No
Specifies whether the speech recognition request is asynchronous.
Default value: false.
Value range:
- true: asynchronous request
- false: synchronous request
apikey | string | No
If the apikey in the Connection and control parameters is a temporary API key, you can update it here to prevent it from expiring.
nls_config | object | Yes
The core configuration object for speech recognition. It contains key parameters such as model selection and recognition effect controls.
nls_config.model | string | Yes
The speech recognition model.
nls_config.language_hints | array[string] | No
Specifies the language codes of the speech to be recognized. This parameter applies only to the paraformer-v2 model.
Default value: ["zh", "en"].
Supported language codes:
- zh: Chinese
- en: English
- ja: Japanese
- yue: Cantonese
- ko: Korean
- de: German
- fr: French
- ru: Russian
nls_config.disfluency_removal_enabled | boolean | No
Specifies whether to filter out disfluencies, such as "um" and "ah".
Default value: false.
Value range:
- true: Filter
- false: Do not filter
nls_config.timestamp_alignment_enabled | boolean | No
Specifies whether to enable the timestamp alignment feature.
Default value: false.
Value range:
- true: Enable
- false: Disable
nls_config.special_word_filter | object | No
Specifies the sensitive words to be processed during speech recognition. Different processing methods can be applied to different sensitive words.
If this parameter is not passed, the system's built-in sensitive word filtering logic is enabled. Words in the recognition result that match the Alibaba Cloud Model Studio sensitive word list are replaced with an equal number of asterisks (*).
If this parameter is passed, you can implement the following sensitive word processing policies:
- Replace with *: Replaces matching sensitive words with an equal number of asterisks (*).
- Filter directly: Completely removes matching sensitive words from the recognition result.
The value of this parameter must be a JSON object with the following structure:
{
  "filter_with_signed": {
    "word_list": ["test"]
  },
  "filter_with_empty": {
    "word_list": ["start", "happen"]
  },
  "system_reserved_filter": true
}
JSON field descriptions:
- filter_with_signed
  - Type: Object.
  - Required: No.
  - Description: Configures a list of sensitive words to be replaced with asterisks (*). Matching words in the recognition result are replaced with an equal number of asterisks (*).
  - Example: Based on the JSON example, the speech recognition result for "Help me test this piece of code" will be "Help me **** this piece of code".
  - Internal fields:
    - word_list: An array of strings that lists the sensitive words to be replaced.
- filter_with_empty
  - Type: Object.
  - Required: No.
  - Description: Configures a list of sensitive words to be removed (filtered) from the recognition result. Matching words are completely deleted from the result.
  - Example: Based on the JSON example, the speech recognition result for "Is the game about to start?" will be "Is the game about to?".
  - Internal fields:
    - word_list: An array of strings that lists the sensitive words to be completely removed (filtered).
- system_reserved_filter
  - Type: Boolean.
  - Required: No.
  - Default value: true.
  - Description: Specifies whether to enable the system's preset sensitive word rules. If set to true, the system's built-in sensitive word filtering logic is also enabled. Words in the recognition result that match the Alibaba Cloud Model Studio sensitive word list are replaced with an equal number of asterisks (*).
nls_config.channel_id | array[integer] | No
Specifies the indexes of the audio tracks to recognize in a multi-track audio file. The index starts from 0. For example, [0] indicates that only the first track is recognized, and [0, 1] indicates that both the first and second tracks are recognized. If you omit this parameter, the first track is processed by default.
Important: Each specified audio track is billed separately. For example, a request for [0, 1] for a single file incurs two separate charges.
Default value: [0].
nls_config.diarization_enabled | boolean | No
Specifies whether to enable automatic speaker diarization. This feature is disabled by default.
This feature applies only to mono audio. Multi-channel audio does not support speaker diarization.
When this feature is enabled, the recognition results include a speaker_id field to distinguish different speakers. For an example of speaker_id, see Recognition result description.
Note: If you enable speaker diarization, keep the audio duration under 2 hours. Exceeding this limit may cause recognition failures or timeouts.
nls_config.speaker_count | integer | No
A reference value for the number of speakers. To use this parameter, you must set diarization_enabled to true.
By default, the number of speakers is determined automatically. If you set this parameter, the algorithm tries to output the specified number of speakers but cannot guarantee it.
Value range: [2, 100]. Because this feature distinguishes between multiple speakers, the value must be at least 2.
nls_config.vocabulary_id | string | No
The ID of the hotword list, used to improve the recognition accuracy of specific words. This parameter applies to v2 and later models. For more information about how to use hotwords, see Customize hotwords.
nls_config.resources | array[object] | No
The hotword resource configuration for v1 models. It serves the same purpose as vocabulary_id but is configured differently: resources is an array of objects, each of which contains the resource_id and resource_type fields:
- resource_id: A string. The hotword ID.
- resource_type: A string. The value is fixed to "asr_phrase".
Example:
{
  "nls_config": {
    "resources": [
      {
        "resource_id": "xxxxxxxxxxxx",
        "resource_type": "asr_phrase"
      }
    ]
  }
}
For more information about how to use hotwords, see Customize and manage hotwords for Paraformer speech recognition.
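The two user-defined policies of special_word_filter (replace with asterisks, and remove) can be approximated with a short client-side sketch. This is only an illustration in Python of the behavior described above, not the service's implementation: the function name apply_special_word_filter is hypothetical, plain substring matching is an assumption, and system_reserved_filter is not modeled because the built-in word list is not available to the client.

```python
def apply_special_word_filter(text, config):
    """Approximate the documented special_word_filter policies.

    Hypothetical helper. Assumes plain, case-sensitive substring matching;
    the real service may match words differently.
    """
    masked_words = config.get("filter_with_signed", {}).get("word_list", [])
    removed_words = config.get("filter_with_empty", {}).get("word_list", [])

    # Policy 1: replace each match with the same number of asterisks.
    for word in masked_words:
        text = text.replace(word, "*" * len(word))

    # Policy 2: delete each match from the result.
    for word in removed_words:
        text = text.replace(word, "")

    return text


# Same structure as the JSON example for special_word_filter.
config = {
    "filter_with_signed": {"word_list": ["test"]},
    "filter_with_empty": {"word_list": ["start", "happen"]},
    "system_reserved_filter": True,
}

if __name__ == "__main__":
    print(apply_special_word_filter("Please test the start button", config))
```

Note that removing a word this way leaves the surrounding whitespace untouched, so the sketch may space its output differently from the service.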
Key interfaces
NeoNui
nui_initialize
Initializes the speech recognition SDK instance. The SDK is a singleton. Do not initialize it again before you call nui_release.
- Method signature
  -(NuiResultCode) nui_initialize:(const char *)parameters logLevel:(NuiSdkLogLevel)level saveLog:(BOOL)save_log;
- Parameter descriptions
  | Parameter | Type | Description |
  | --- | --- | --- |
  | parameters | const char * | A JSON string that contains authentication, connection, and debugging parameters. See Connection and control parameters. |
  | level | NuiSdkLogLevel | Controls the printing level of the SDK's own logs. |
  | save_log | BOOL | Specifies whether to save local logs. If set to YES, you must specify a path using debug_path in the Connection and control parameters. You can also set the maximum file size using max_log_file_size. |
- Return value description
  Returns an error code. For more information, see Query error codes.
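As a rough illustration of how the parameters JSON string for nui_initialize fits together, the following Python sketch assembles it and enforces the documented dependency between save_log and debug_path. It is not SDK code: build_init_params is a hypothetical helper, and in an iOS app the same JSON would be built in Objective-C or Swift and passed as a C string.

```python
import json

# Fixed values taken from the parameter descriptions.
ENDPOINT = "wss://dashscope.aliyuncs.com/api-ws/v1/inference"
SERVICE_MODE_FILE_TRANSCRIPTION = "1"


def build_init_params(apikey, device_id, save_log=False, debug_path=None,
                      max_log_file_size=None, log_track_level=None):
    """Return the JSON string for the parameters argument (hypothetical helper)."""
    if save_log and not debug_path:
        # Documented rule: a log path is mandatory when save_log is YES.
        raise ValueError("debug_path is required when save_log is YES")

    params = {
        "url": ENDPOINT,
        "apikey": apikey,
        "device_id": device_id,
        "service_mode": SERVICE_MODE_FILE_TRANSCRIPTION,
    }
    if save_log:
        # Log file settings only take effect when save_log is YES.
        params["debug_path"] = debug_path
        if max_log_file_size is not None:
            params["max_log_file_size"] = max_log_file_size
    if log_track_level is not None:
        if log_track_level not in range(6):
            raise ValueError("log_track_level must be between 0 and 5")
        params["log_track_level"] = log_track_level
    return json.dumps(params)


if __name__ == "__main__":
    print(build_init_params("st-****", "my_device_id"))
```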
nui_set_param
Sets or updates the nls_config parameter independently. If you provide all parameters at once in nui_file_trans_start, you do not need to call this method.
- Method signature
  -(NuiResultCode) nui_set_params:(const char *)params;
- Parameter descriptions
  | Parameter | Type | Description |
  | --- | --- | --- |
  | params | const char * | The nls_config parameter in Speech recognition effect parameters. Parameters other than nls_config cannot be set using this method. |
  Example:
  {
    "nls_config": {
      "model": "paraformer-v2",
      "disfluency_removal_enabled": false,
      "timestamp_alignment_enabled": false
    }
  }
- Return value description
  Returns an error code. For more information, see Query error codes.
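Several nls_config fields depend on each other or on the chosen model. The Python sketch below gathers those documented constraints in one place before the JSON is serialized. It is an illustration only: build_nls_config is a hypothetical helper, and treating every model name that ends in "-v1" as a v1 model (which takes resources instead of vocabulary_id) is an assumption.

```python
import json


def build_nls_config(model, language_hints=None, diarization_enabled=False,
                     speaker_count=None, hotword_id=None):
    """Build an nls_config dict and check documented constraints (hypothetical)."""
    config = {"model": model}

    if language_hints is not None:
        # language_hints applies only to the paraformer-v2 model.
        if model != "paraformer-v2":
            raise ValueError("language_hints applies only to paraformer-v2")
        config["language_hints"] = list(language_hints)

    if diarization_enabled:
        config["diarization_enabled"] = True

    if speaker_count is not None:
        if not diarization_enabled:
            raise ValueError("speaker_count requires diarization_enabled = true")
        if not 2 <= speaker_count <= 100:
            raise ValueError("speaker_count must be in [2, 100]")
        config["speaker_count"] = speaker_count

    if hotword_id is not None:
        if model.endswith("-v1"):  # assumption: "-v1" suffix marks a v1 model
            config["resources"] = [
                {"resource_id": hotword_id, "resource_type": "asr_phrase"}
            ]
        else:
            config["vocabulary_id"] = hotword_id

    return config


# JSON string in the shape expected for the params argument.
payload = json.dumps(
    {"nls_config": build_nls_config("paraformer-v2", language_hints=["zh", "en"])}
)

if __name__ == "__main__":
    print(payload)
```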
nui_file_trans_start
Starts a file transcription task.
- Method signature
  -(NuiResultCode) nui_file_trans_start(const char *params, char *task_id);
- Parameter descriptions
  | Parameter | Type | Description |
  | --- | --- | --- |
  | params | const char * | The speech recognition effect parameters. |
  | task_id | char * | The task ID. The SDK internally generates a random string, which you can read from task_id after this interface is called successfully. |
  Example of params:
  {
    "file_urls": [
      "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"
    ],
    "async_request": false,
    "nls_config": {
      "model": "paraformer-v2",
      "disfluency_removal_enabled": false,
      "timestamp_alignment_enabled": false
    }
  }
- Return value description
  Returns an error code. For more information, see Query error codes.
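The params string for nui_file_trans_start can be assembled and sanity-checked before it is handed to the SDK. The following Python sketch shows one way to do that under the documented rules (exactly one URL, HTTP or HTTPS only, one charge per selected track). The helper names build_file_trans_request and billed_track_count are hypothetical and not part of the SDK.

```python
import json
from urllib.parse import urlparse


def build_file_trans_request(file_url, nls_config, async_request=False, apikey=None):
    """Return the params JSON string for a single file (hypothetical helper)."""
    scheme = urlparse(file_url).scheme.lower()
    if scheme not in ("http", "https"):
        # For example, oss:// temporary URLs are not supported by the SDK.
        raise ValueError("file URL must use http or https")
    if "model" not in nls_config:
        raise ValueError("nls_config.model is required")

    request = {
        "file_urls": [file_url],  # a single request supports only one URL
        "async_request": async_request,
        "nls_config": nls_config,
    }
    if apikey is not None:
        # Optional: pass an updated temporary API key with the request.
        request["apikey"] = apikey
    return json.dumps(request)


def billed_track_count(nls_config):
    """Each selected audio track is billed separately; the default is [0]."""
    return len(nls_config.get("channel_id", [0]))


if __name__ == "__main__":
    print(build_file_trans_request(
        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
        {"model": "paraformer-v2"},
    ))
```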
nui_file_trans_query
Queries the current status and result of an asynchronous task. After a successful call, the result is returned by the EVENT_FILE_TRANS_QUERY_RESULT event in the onFileTransEventCallback callback.
- Method signature
  -(NuiResultCode) nui_file_trans_query(const char *task_id);
- Parameter descriptions
  | Parameter | Type | Description |
  | --- | --- | --- |
  | task_id | const char * | The ID of the task to query. |
- Return value description
  Returns an error code. For more information, see Query error codes.
nui_file_trans_cancel
Cancels the current task.
- Method signature
  -(NuiResultCode) nui_file_trans_cancel(const char *task_id);
- Parameter descriptions
  | Parameter | Type | Description |
  | --- | --- | --- |
  | task_id | const char * | The ID of the task to cancel. |
- Return value description
  Returns an error code. For more information, see Query error codes.
nui_release
Releases all internal resources of the SDK and forcibly terminates all ongoing tasks. After you call this method, the SDK instance becomes unavailable. To use the instance again, you must call nui_initialize to initialize it.
- Method signature
  -(NuiResultCode) nui_release;
- Return value description
  Returns an error code. For more information, see Query error codes.
nui_get_version
Retrieves the current SDK version information.
- Method signature
  -(const char*) nui_get_version;
- Return value description
  The current SDK version information.
NeoNuiSdkDelegate: Listener callback
onFileTransEventCallback: Listen for events and speech recognition results
- Method signature
  -(void) onFileTransEventCallback:(NuiCallbackEvent)nuiEvent asrResult:(const char *)asr_result taskId:(const char *)task_id ifFinish:(BOOL)finish retCode:(int)code;
- Parameter descriptions
  | Parameter | Type | Description |
  | --- | --- | --- |
  | nuiEvent | NuiCallbackEvent | The callback event. |
  | asr_result | const char * | The speech recognition result. |
  | task_id | const char * | The task ID. |
  | finish | BOOL | A flag that indicates whether the current recognition round is complete. |
  | code | int | The error code. This is valid when an EVENT_ASR_ERROR event occurs. For more information, see Query error codes. |
onFileTransLogTrackCallback: Listen for tracking logs
This callback receives detailed internal logs from the SDK for troubleshooting and debugging.
-(void)onFileTransLogTrackCallback:(NuiSdkLogLevel)level
logMessage:(const char *)log;
NuiCallbackEvent: Event types
| Event | Description |
| --- | --- |
| EVENT_FILE_TRANS_CONNECTED | Successfully connected to the service. |
| EVENT_FILE_TRANS_UPLOADED | Successfully uploaded the audio file for transcription. |
| EVENT_FILE_TRANS_QUERY_RESULT | The result of a task query. |
| EVENT_FILE_TRANS_RESULT | The final transcription result. |
| EVENT_ASR_ERROR | An error occurred during speech recognition. |
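One way to reason about these events in a callback is to sort them into a few outcomes. The Python sketch below does this with event names as plain strings, purely for illustration. In the SDK the events are NuiCallbackEvent values delivered to onFileTransEventCallback, and the outcome labels and the classify_event helper are hypothetical.

```python
def classify_event(event_name):
    """Map a documented event name to a coarse outcome label (hypothetical)."""
    if event_name == "EVENT_FILE_TRANS_RESULT":
        return "final"      # final transcription result is available
    if event_name == "EVENT_FILE_TRANS_QUERY_RESULT":
        return "progress"   # answer to a task query
    if event_name == "EVENT_ASR_ERROR":
        return "error"      # inspect the error code passed to the callback
    if event_name in ("EVENT_FILE_TRANS_CONNECTED", "EVENT_FILE_TRANS_UPLOADED"):
        return "status"     # informational milestones
    raise ValueError("unknown event: " + event_name)


if __name__ == "__main__":
    for name in ("EVENT_FILE_TRANS_CONNECTED", "EVENT_FILE_TRANS_RESULT", "EVENT_ASR_ERROR"):
        print(name, "->", classify_event(name))
```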