Paraformer audio file transcription iOS SDK


This document describes how to use the Paraformer audio file transcription iOS SDK to convert speech to text.

User guide: For an introduction to the models and selection recommendations, see Audio file transcription.

Getting started

  1. Get an API key: Get and configure an API key

    Note

    To grant temporary access to third-party applications or users, or to strictly control risky operations such as accessing or deleting sensitive data, use a temporary API key. A temporary API key is valid for 60 seconds. After it expires, you must obtain a new one.

  2. Download the SDK and run the sample code:

    • Download the latest SDK package.

    • Unzip the package and add nuisdk.framework to your project.

    • In Build Phases → Link Binary With Libraries, add nuisdk.framework.

    • In General → Frameworks, Libraries, and Embedded Content, set nuisdk.framework to Embed & Sign.

    • Open the sample project in Xcode. The sample code is in the DashParaformerFileTranscriberViewController class. Replace the API key to test the feature.
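Because a temporary API key is valid for only 60 seconds, an app usually needs to track when the key was obtained and fetch a new one before starting a task. The following is a minimal sketch of that check. It assumes the 60-second period is counted from the moment the app receives the key, and the function name and the `safetyMargin` parameter are hypothetical, not part of the SDK.

```objc
#import <Foundation/Foundation.h>

// Validity period of a temporary API key, as stated in the note in step 1.
static const NSTimeInterval kTemporaryKeyLifetime = 60.0;

// Returns YES when a key obtained at `issuedAt` should be replaced before use.
// `safetyMargin` is a hypothetical buffer (in seconds) that covers network latency.
// Assumption: the validity period starts when the app receives the key.
static BOOL TemporaryKeyNeedsRefresh(NSDate *issuedAt, NSDate *now, NSTimeInterval safetyMargin) {
    NSTimeInterval age = [now timeIntervalSinceDate:issuedAt];
    return age >= (kTemporaryKeyLifetime - safetyMargin);
}
```

If the check returns YES, request a new key from your own backend before you initialize the SDK or start a task.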

Call procedure

Synchronous mode

  1. Initialize the SDK.

  2. Configure parameters as needed.

  3. Call nui_file_trans_start to start the recognition task. Set the async_request parameter to false.

  4. In the onFileTransEventCallback callback, listen for the EVENT_FILE_TRANS_RESULT event to retrieve the final recognition result.

  5. Call nui_release to release the SDK resources.

Asynchronous mode

  1. Initialize the SDK.

  2. Configure parameters as needed.

  3. Call nui_file_trans_start to start the recognition task. Set the async_request parameter to true.

  4. Call nui_file_trans_query to query the recognition progress or results.

  5. In the onFileTransEventCallback callback, listen for the EVENT_FILE_TRANS_QUERY_RESULT event to obtain the current query result.

  6. In the onFileTransEventCallback callback, listen for the EVENT_FILE_TRANS_RESULT event to obtain the final recognition result.

  7. Call nui_release to release the SDK resources.

Request parameters

Connection and control parameters

You can configure the SDK by passing a JSON string to the parameters parameter of the nui_initialize interface.

  • Parameter example: The following JSON string is an example. Not all parameters are listed. Add other parameters as needed in your code.

    {
        "url": "wss://dashscope.aliyuncs.com/api-ws/v1/inference",
        "apikey": "st-****",
        "device_id": "my_device_id",
        "service_mode": "1"
    }
  • Parameter descriptions

    Parameter

    Type

    Required

    Description

    url

    String

    Yes

    The endpoint. This is fixed to wss://dashscope.aliyuncs.com/api-ws/v1/inference.

    apikey

    String

    Yes

    The API key. For improved security, use a temporary API key. It has a short validity period to reduce the risk of leakage.

    service_mode

    String

    Yes

    The running mode. For audio file transcription, this is fixed to "1".

    device_id

    String

    Yes

    A unique string that identifies the end user. You can set it to an in-app user ID or a unique device identifier generated by the client. This ID is mainly used for log tracking and troubleshooting.

    debug_path

    String

    No

    The storage path for log files.

    This parameter takes effect only when save_log is set to YES in the call to the nui_initialize interface. You must set a log file path in this case, or an error will occur.

    A maximum of two log files are kept locally.

    max_log_file_size

    int

    No

    The maximum size of a log file in bytes.

    This parameter takes effect only when save_log is set to YES in the call to the nui_initialize interface.

    Default value: 104857600 (100 × 1024 × 1024 bytes, which is 100 MiB).

    log_track_level

    int

    No

    Controls the filtering level for log content sent externally through the onNuiLogTrackCallback.

    Default value: 2.

    Value range:

    • 0: LOG_LEVEL_VERBOSE

    • 1: LOG_LEVEL_DEBUG

    • 2: LOG_LEVEL_INFO

    • 3: LOG_LEVEL_WARNING

    • 4: LOG_LEVEL_ERROR

    • 5: LOG_LEVEL_NONE (disables this feature)

    Note: log_track_level and level (set through the nui_initialize interface) together determine the final logs that are sent to the callback. For a log to be sent, its level value must be greater than or equal to both the log_track_level and level values. For example, if log_track_level is set to 2 (INFO) and level is set to 3 (WARNING), only logs at the WARNING level and higher (value >= 3) are sent.
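The combined effect of log_track_level and level can be expressed as a single comparison. The sketch below mirrors the rule in the note above. The function name is illustrative, and the integer arguments stand in for the SDK's NuiSdkLogLevel values listed in the table.

```objc
#import <Foundation/Foundation.h>

// Returns YES when a log entry is sent to the log tracking callback.
// A log is forwarded only if its level is greater than or equal to both thresholds.
// Levels follow the table above: 0 = VERBOSE, 1 = DEBUG, 2 = INFO,
// 3 = WARNING, 4 = ERROR, 5 = NONE.
static BOOL LogIsForwarded(NSInteger logLevel, NSInteger logTrackLevel, NSInteger sdkLevel) {
    return logLevel >= MAX(logTrackLevel, sdkLevel);
}
```

With log_track_level set to 2 and level set to 3, a WARNING log (3) passes and an INFO log (2) does not. Setting log_track_level to 5 blocks every log, because no log entry has a level of 5.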

Speech recognition effect parameters

You can use the nui_set_param API to configure the nls_config parameter, or use the nui_file_trans_start API to configure all speech recognition effect parameters.

  • Parameter example: The following JSON string is an example. Not all parameters are listed. Add other parameters as needed in your code.

    {
        "file_urls": [
            "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"
        ],
        "async_request": false,
        "nls_config": {
            "model":"paraformer-v2",
            "disfluency_removal_enabled":false,
            "timestamp_alignment_enabled": false
        }
    }
  • Parameter descriptions

    Parameter

    Type

    Required

    Description

    file_urls

    array[string]

    Yes

    A list of URLs for the audio or video files to be transcribed. The HTTP and HTTPS protocols are supported. A single request supports only 1 URL.

    If your audio files are stored in OSS, the SDK does not support temporary URLs that start with the oss:// prefix.

    • Audio formats: aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv

      Important

      Because of the large number of audio and video formats and their variations, it is not technically possible to test all of them. The API cannot guarantee that all formats will be correctly recognized. Test your files to verify that you can get normal speech recognition results.

    • Audio sampling rate

      Varies by model:

      • paraformer-v2 supports any sample rate.

      • paraformer-v1 supports any sample rate.

      • paraformer-8k-v2 supports only an 8 kHz sample rate.

      • paraformer-8k-v1 supports only an 8 kHz sample rate.

      • paraformer-mtl-v1 supports a sample rate of 16 kHz or higher.

    • Audio file size and duration: The audio file cannot exceed 2 GB. The duration must be within 12 hours.

      If you want to process a file that exceeds these limits, you can try to pre-process the file to reduce its size. For more information about best practices for file pre-processing, see Pre-process video files to improve transcription efficiency (for audio file transcription).

    async_request

    boolean

    No

    Specifies whether the speech recognition request is asynchronous.

    Default value: false.

    Value range:

    • true: asynchronous request

    • false: synchronous request

    apikey

    string

    No

    If the apikey in the Connection and control parameters is a temporary API key, you can update it here to prevent it from expiring.

    nls_config

    object

    Yes

    The core configuration object for speech recognition. It contains key parameters such as model selection and recognition effect controls.

    nls_config.model

    string

    Yes

    The speech recognition model.

    nls_config.language_hints

    array[string]

    No

    Specifies the language codes of the speech to be recognized. This parameter applies only to the paraformer-v2 model.

    Default value: ["zh", "en"].

    Supported language codes:

    • zh: Chinese

    • en: English

    • ja: Japanese

    • yue: Cantonese

    • ko: Korean

    • de: German

    • fr: French

    • ru: Russian

    nls_config.disfluency_removal_enabled

    boolean

    No

    Specifies whether to filter out disfluencies, such as "um" and "ah".

    Default value: false.

    Value range:

    • true: Filter

    • false: Do not filter

    nls_config.timestamp_alignment_enabled

    boolean

    No

    Specifies whether to enable the timestamp alignment feature.

    Default value: false.

    Value range:

    • true: Enable

    • false: Disable

    nls_config.special_word_filter

    object

    No

    Specifies the sensitive words to be processed during speech recognition and supports different processing methods for different sensitive words.

    If this parameter is not passed, the system's built-in sensitive word filtering logic is enabled. Words in the recognition result that match the Alibaba Cloud Model Studio sensitive word list are replaced with an equal number of asterisks (*).

    If this parameter is passed, you can implement the following sensitive word processing policies:

    • Replace with *: Replaces matching sensitive words with an equal number of asterisks (*).

    • Filter directly: Completely removes matching sensitive words from the recognition result.

    The value of this parameter must be a JSON Object with the following structure:

    {
      "filter_with_signed": {
        "word_list": ["test"]
      },
      "filter_with_empty": {
        "word_list": ["start", "happen"]
      },
      "system_reserved_filter": true
    }

    JSON field descriptions:

    • filter_with_signed

      • Type: Object.

      • Required: No.

      • Description: Configures a list of sensitive words to be replaced with asterisks (*). Matching words in the recognition result are replaced with an equal number of asterisks (*).

      • Example: Based on the JSON example, the speech recognition result for "Help me test this piece of code" will be "Help me **** this piece of code".

      • Internal fields:

        • word_list: An array of strings that lists the sensitive words to be replaced.

    • filter_with_empty

      • Type: Object.

      • Required: No.

      • Description: Configures a list of sensitive words to be removed (filtered) from the recognition result. Matching words are completely deleted from the result.

      • Example: Based on the JSON example, the speech recognition result for "Is the game about to start?" will be "Is the game about to?".

      • Internal fields:

        • word_list: An array of strings that lists the sensitive words to be completely removed (filtered).

    • system_reserved_filter

      • Type: Boolean value.

      • Required: No.

      • Default value: true.

      • Description: Specifies whether to enable the system's preset sensitive word rules. If set to true, the system's built-in sensitive word filtering logic is also enabled. Words in the recognition result that match the Alibaba Cloud Model Studio sensitive word list are replaced with an equal number of asterisks (*).

    nls_config.channel_id

    array[integer]

    No

    Specifies the indexes of the audio tracks in a multi-track audio file to recognize. The index starts from 0. For example, [0] indicates that only the first track is recognized, and [0, 1] indicates that both the first and second tracks are recognized. If you omit this parameter, the first track is processed by default.

    Important

    Each specified audio track is billed separately. For example, a request for [0, 1] for a single file incurs two separate charges.

    Default value: [0]

    nls_config.diarization_enabled

    boolean

    No

    Specifies whether to enable automatic speaker diarization. This feature is disabled by default.

    This feature applies only to mono audio. Multi-channel audio does not support speaker diarization.

    When this feature is enabled, the recognition results include a speaker_id field that distinguishes the speakers.

    Note

    If you enable speaker diarization, keep the audio duration under 2 hours. Exceeding this limit may cause recognition failures or timeouts.

    For an example of speaker_id, see Recognition result description.

    nls_config.speaker_count

    integer

    No

    A reference value for the number of speakers. To use this feature, you must set diarization_enabled to true.

    By default, the number of speakers is automatically determined. If you configure this parameter, it only helps the algorithm try to output the specified number of speakers, but it cannot guarantee that this number will be output.

    Value range: [2, 100]. This feature is used to distinguish between multiple speakers, so you must set it to at least 2.

    nls_config.vocabulary_id

    string

    No

    The ID of the hotword list, used to improve the recognition accuracy of specific words. This parameter applies to v2 and later models. For more information about how to use hotwords, see Customize hotwords.

    nls_config.resources

    array[object]

    No

    The hotword resource configuration for v1 models. This feature is the same as vocabulary_id, but the configuration method is different:

    resources is an array of objects. Each object contains the resource_id and resource_type fields:

    • resource_id: A string. The hotword ID.

    • resource_type: A string. The value is fixed to "asr_phrase".

    Example:

    {
        "nls_config": {
              "resources": [
                  {
                      "resource_id": "xxxxxxxxxxxx",
                      "resource_type": "asr_phrase"
                  }
              ]
        }
    }

    For more information about how to use hotwords, see Customize and manage hotwords for Paraformer speech recognition.
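Several constraints in the table above can be checked on the client before a request is sent. The following sketch validates an nls_config dictionary against three of them: model is required, language_hints applies only to paraformer-v2, and speaker_count requires diarization_enabled and a value in [2, 100]. The function name is hypothetical, and the check is a convenience on the client side, not a description of how the service validates requests.

```objc
#import <Foundation/Foundation.h>

// Returns human-readable problems found in an nls_config dictionary.
// An empty array means none of the checked rules was violated.
static NSArray<NSString *> *ValidateNlsConfig(NSDictionary<NSString *, id> *config) {
    NSMutableArray<NSString *> *problems = [NSMutableArray array];

    // nls_config.model is required.
    NSString *model = config[@"model"];
    if (![model isKindOfClass:[NSString class]] || model.length == 0) {
        [problems addObject:@"model is required"];
    }

    // language_hints applies only to the paraformer-v2 model.
    if (config[@"language_hints"] != nil && ![model isEqual:@"paraformer-v2"]) {
        [problems addObject:@"language_hints applies only to paraformer-v2"];
    }

    // speaker_count needs diarization_enabled = true and a value in [2, 100].
    NSNumber *speakerCount = config[@"speaker_count"];
    if (speakerCount != nil) {
        if (![config[@"diarization_enabled"] boolValue]) {
            [problems addObject:@"speaker_count requires diarization_enabled to be true"];
        }
        NSInteger count = speakerCount.integerValue;
        if (count < 2 || count > 100) {
            [problems addObject:@"speaker_count must be in the range [2, 100]"];
        }
    }
    return problems;
}
```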

Key interfaces

NeoNui

nui_initialize

Initializes the speech recognition SDK instance. The SDK is a singleton. Do not initialize it again before you call nui_release.

  • Method signature

    -(NuiResultCode) nui_initialize:(const char *)parameters
                           logLevel:(NuiSdkLogLevel)level
                            saveLog:(BOOL)save_log;
  • Parameter descriptions

    | Parameter | Type | Description |
    | --- | --- | --- |
    | parameters | char* | A JSON string that contains authentication, connection, and debugging parameters. See Connection and control parameters. |
    | level | NuiSdkLogLevel | Controls the printing level of the SDK's own logs. |
    | save_log | BOOL | Specifies whether to save local logs. If set to YES, you must specify a path using debug_path in the Connection and control parameters. You can also set the file size using max_log_file_size. |

  • Return value description

    Returns an error code. For more information, see Query error codes.
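The parameters string can be assembled with NSJSONSerialization instead of being written by hand, which avoids quoting mistakes. The sketch below builds the required fields from Connection and control parameters and adds debug_path only when logs are saved. The function name is hypothetical, and returning nil for a missing path is a design choice of this sketch.

```objc
#import <Foundation/Foundation.h>

// Builds the JSON string for the `parameters` argument of nui_initialize.
// `apiKey` and `deviceId` must not be nil.
// Returns nil if `saveLog` is YES but `debugPath` is empty, because debug_path
// must be set when save_log is YES.
static NSString *BuildInitParameters(NSString *apiKey, NSString *deviceId,
                                     BOOL saveLog, NSString *debugPath) {
    if (saveLog && debugPath.length == 0) {
        return nil;
    }
    NSMutableDictionary *params = [@{
        @"url": @"wss://dashscope.aliyuncs.com/api-ws/v1/inference",
        @"apikey": apiKey,
        @"device_id": deviceId,
        @"service_mode": @"1"   // fixed value for audio file transcription
    } mutableCopy];
    if (saveLog) {
        params[@"debug_path"] = debugPath;
    }
    NSData *data = [NSJSONSerialization dataWithJSONObject:params options:0 error:NULL];
    return data ? [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding] : nil;
}
```

Pass the UTF8String of the returned string as the parameters argument, and pass the same saveLog value as save_log.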

nui_set_param

This interface sets or updates the nls_config parameter independently. If you provide all parameters at once in nui_file_trans_start, you do not need to call this method.

  • Method signature

    -(NuiResultCode) nui_set_params:(const char *)params;
  • Parameter descriptions

    Parameter

    Type

    Description

    params

    char*

    The nls_config parameter in Speech recognition effect parameters. Parameters other than nls_config cannot be set using this method.

    Example:

    {
        "nls_config": {
            "model":"paraformer-v2",
            "disfluency_removal_enabled":false,
            "timestamp_alignment_enabled": false
        }
    }
  • Return value description

    Returns an error code. For more information, see Query error codes.

nui_file_trans_start

Starts an audio file transcription task.

  • Method signature

    -(NuiResultCode) nui_file_trans_start:(const char *)params
                                   taskId:(char *)task_id;
  • Parameter descriptions

    Parameter

    Type

    Description

    params

    char*

    Speech recognition effect parameters.

    Example:

    {
        "file_urls": [
            "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"
        ],
        "async_request": false,
        "nls_config": {
            "model":"paraformer-v2",
            "disfluency_removal_enabled":false,
            "timestamp_alignment_enabled": false
        }
    }

    task_id

    char*

    The task ID. The SDK internally generates a random string. You can get the task_id after this interface is successfully called.

  • Return value description

    Returns an error code. For more information, see Query error codes.
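The params string for this interface can be built the same way as any other JSON payload. The sketch below wraps a single file URL, the async_request flag, and an nls_config dictionary, and rejects URLs that are not HTTP or HTTPS (for example, oss:// URLs) before any request is made. The function name is hypothetical.

```objc
#import <Foundation/Foundation.h>

// Builds the JSON string for the `params` argument of nui_file_trans_start.
// `nlsConfig` must not be nil and must contain only JSON-compatible values.
// Returns nil when `fileURL` does not use the HTTP or HTTPS protocol.
static NSString *BuildFileTransParams(NSString *fileURL, BOOL asyncRequest,
                                      NSDictionary *nlsConfig) {
    NSString *scheme = [NSURL URLWithString:fileURL].scheme.lowercaseString;
    if (![scheme isEqualToString:@"http"] && ![scheme isEqualToString:@"https"]) {
        return nil;
    }
    NSDictionary *params = @{
        @"file_urls": @[fileURL],                    // a single request supports only one URL
        @"async_request": asyncRequest ? @YES : @NO, // serialized as a JSON boolean
        @"nls_config": nlsConfig
    };
    NSData *data = [NSJSONSerialization dataWithJSONObject:params options:0 error:NULL];
    return data ? [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding] : nil;
}
```

Pass the UTF8String of the returned string as params. The task_id argument is an output buffer that the SDK fills. This document does not state the required buffer size, so check the sample project for it.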

nui_file_trans_query

You can use this interface to query the current status and result of an asynchronous task. After a successful call, the result is returned by the EVENT_FILE_TRANS_QUERY_RESULT event in the onFileTransEventCallback callback.

  • Method signature

    -(NuiResultCode) nui_file_trans_query:(const char *)task_id;
  • Parameter descriptions

    | Parameter | Type | Description |
    | --- | --- | --- |
    | task_id | char* | The ID of the task to query. |

  • Return value description

    Returns an error code. For more information, see Query error codes.

nui_file_trans_cancel

Cancels the current task.

  • Method signature

    -(NuiResultCode) nui_file_trans_cancel:(const char *)task_id;
  • Parameter descriptions

    | Parameter | Type | Description |
    | --- | --- | --- |
    | task_id | char* | The ID of the task to cancel. |

  • Return value description

    Returns an error code. For more information, see Query error codes.

nui_release

Releases all internal resources of the SDK and forcibly terminates all ongoing tasks. After you call this method, the SDK instance becomes unavailable. To use the instance again, you must call nui_initialize to initialize it.

  • Method signature

    -(NuiResultCode) nui_release;
  • Return value description

    Returns an error code. For more information, see Query error codes.

nui_get_version

Retrieves the current SDK version information.

  • Method signature

    -(const char*) nui_get_version;
  • Return value description

    The current SDK version information.

NeoNuiSdkDelegate: Listener callback

onFileTransEventCallback: Listen for events and speech recognition results

  • Method signature

    -(void) onFileTransEventCallback:(NuiCallbackEvent)nuiEvent
                           asrResult:(const char *)asr_result
                              taskId:(const char *)task_id
                            ifFinish:(BOOL)finish
                             retCode:(int)code;
  • Parameter descriptions

    | Parameter | Type | Description |
    | --- | --- | --- |
    | nuiEvent | NuiCallbackEvent | The callback event. |
    | asr_result | char* | The speech recognition result. |
    | task_id | char* | The task ID. |
    | finish | BOOL | A flag that indicates whether the current recognition round is complete. |
    | code | int | The error code. This is valid when an EVENT_ASR_ERROR event occurs. For more information, see Query error codes. |
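The asr_result parameter arrives as a C string, so it is usually converted to a Foundation object before use. The sketch below assumes the payload is a UTF-8 JSON object, which this document does not state explicitly, and returns nil in every other case. The function name is hypothetical.

```objc
#import <Foundation/Foundation.h>

// Converts the asr_result C string into a dictionary.
// Assumption: the payload is a UTF-8 encoded JSON object.
// Returns nil for NULL input, invalid UTF-8, or text that is not a JSON object.
static NSDictionary *ParseAsrResult(const char *asrResult) {
    if (asrResult == NULL) {
        return nil;
    }
    // stringWithUTF8String: copies the bytes, so the result does not depend
    // on the lifetime of the original buffer.
    NSString *text = [NSString stringWithUTF8String:asrResult];
    NSData *data = [text dataUsingEncoding:NSUTF8StringEncoding];
    if (data == nil) {
        return nil;
    }
    id object = [NSJSONSerialization JSONObjectWithData:data options:0 error:NULL];
    return [object isKindOfClass:[NSDictionary class]] ? object : nil;
}
```

Copying the string inside the callback is a cautious choice, because this document does not say how long the pointer stays valid after the callback returns.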

onFileTransLogTrackCallback: Listen for tracking logs

This callback receives detailed internal logs from the SDK for troubleshooting and debugging.

-(void)onFileTransLogTrackCallback:(NuiSdkLogLevel)level
                        logMessage:(const char *)log;

NuiCallbackEvent: Event types

| Event | Description |
| --- | --- |
| EVENT_FILE_TRANS_CONNECTED | Successfully connected to the service. |
| EVENT_FILE_TRANS_UPLOADED | Successfully uploaded the audio file for transcription. |
| EVENT_FILE_TRANS_QUERY_RESULT | The result of a task query. |
| EVENT_FILE_TRANS_RESULT | The final transcription result. |
| EVENT_ASR_ERROR | An error occurred during speech recognition. |
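A callback implementation typically branches on the event type to decide whether a task is still running. The sketch below uses a local stand-in enum so that it compiles without the SDK. The enum names and numeric values are illustrative and do not reflect the SDK header, and treating EVENT_ASR_ERROR as terminal is an assumption of this sketch.

```objc
#import <Foundation/Foundation.h>

// Local stand-in for the SDK's NuiCallbackEvent type. Values are arbitrary.
typedef NS_ENUM(NSInteger, DemoEvent) {
    DemoEventConnected,    // EVENT_FILE_TRANS_CONNECTED
    DemoEventUploaded,     // EVENT_FILE_TRANS_UPLOADED
    DemoEventQueryResult,  // EVENT_FILE_TRANS_QUERY_RESULT
    DemoEventResult,       // EVENT_FILE_TRANS_RESULT
    DemoEventError         // EVENT_ASR_ERROR
};

// Returns YES when the event ends the task, so the caller can stop querying.
// Assumption: an error event means the task produces no further results.
static BOOL EventEndsTask(DemoEvent event) {
    switch (event) {
        case DemoEventResult:
        case DemoEventError:
            return YES;
        case DemoEventConnected:
        case DemoEventUploaded:
        case DemoEventQueryResult:
            return NO;
    }
    return NO;
}
```

In asynchronous mode, a query result does not end the task, so the app keeps calling nui_file_trans_query until the final result or an error arrives.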