Paraformer real-time speech recognition Android SDK

This document provides a detailed guide on using the Paraformer real-time speech recognition Android software development kit (SDK) to convert speech to text.

User guide: For an introduction to the models and selection recommendations, see Real-time speech recognition - Fun-ASR/Gummy/Paraformer.

Online experience: An online experience is available only for paraformer-realtime-v2, paraformer-realtime-8k-v2, and paraformer-realtime-v1.

Getting started

  1. Obtain an API key: See Get an API key. For security, configure the API key as an environment variable instead of hard-coding it.

    Note

    To grant temporary access permissions to third-party applications or users, or to strictly control risky operations such as accessing or deleting sensitive data, use a temporary API key. A temporary API key is valid for 60 seconds by default and must be reacquired after it expires.

  2. Download the SDK and run the sample code:

    • Download the latest SDK package.

    • Unzip the package. Obtain the AAR-format SDK from the app/libs folder and add it to your project dependencies.
      For Android C++ integration, use the android_libs and android_include folders in the ZIP package to obtain the dynamic libraries and header files.

    • Open the project in Android Studio. The sample code is in DashParaformerSpeechTranscriberActivity.java. Replace the API key in the sample with your own key to test the feature.

Invocation steps

  1. Initialize the SDK.

  2. Set the required parameters. Use the parameters parameter of the initialize method to set connection and control parameters. Use the setParams method to set speech recognition effect parameters.

  3. Call startDialog to start the recognition process.

  4. In the onNuiAudioStateChanged callback, start the recording device based on the audio state.

  5. In the onNuiNeedAudioData callback, continuously provide recording data.

  6. In the onNuiEventCallback callback, listen for events and retrieve the speech recognition results.

  7. Call stopDialog to stop recognition. Confirm that recognition has ended by listening for the EVENT_TRANSCRIBER_COMPLETE event.

  8. When the recognition feature is no longer needed, call the release method to release SDK resources.
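
The call order in the steps above can be modeled as a small state machine. The following is a minimal sketch, not SDK code: RecognizerLifecycle and its State values are hypothetical names, the class only tracks the order of calls, and it assumes that a new startDialog is allowed after EVENT_TRANSCRIBER_COMPLETE without calling setParams again.

```java
import java.util.EnumSet;

/** Hypothetical helper that tracks the call order described above. Not part of the SDK. */
final class RecognizerLifecycle {
    enum State { RELEASED, INITIALIZED, CONFIGURED, RUNNING, STOPPING }

    private State state = State.RELEASED;

    State state() { return state; }

    // Each method records that the matching SDK call was made.
    void onInitialize()  { move(EnumSet.of(State.RELEASED), State.INITIALIZED); }
    void onSetParams()   { move(EnumSet.of(State.INITIALIZED, State.CONFIGURED), State.CONFIGURED); }
    void onStartDialog() { move(EnumSet.of(State.CONFIGURED), State.RUNNING); }
    void onStopDialog()  { move(EnumSet.of(State.RUNNING), State.STOPPING); }

    /** Call when EVENT_TRANSCRIBER_COMPLETE arrives (assumption: the instance is then reusable). */
    void onTranscriberComplete() { move(EnumSet.of(State.STOPPING), State.CONFIGURED); }

    /** release is accepted from any state in this sketch. */
    void onRelease() { state = State.RELEASED; }

    private void move(EnumSet<State> allowedFrom, State to) {
        if (!allowedFrom.contains(state)) {
            throw new IllegalStateException("Cannot go from " + state + " to " + to);
        }
        state = to;
    }
}
```

A wrapper around the SDK could call these methods next to the real calls, so that an out-of-order call such as startDialog before initialize fails fast with an IllegalStateException.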

Request parameters

Connection and control parameters

You can configure these parameters by passing a JSON string in the parameters parameter of the initialize method.

  • Parameter example: The following sample JSON string does not list all parameters. Add other parameters as needed when you write your code:

    {
        "url": "wss://dashscope.aliyuncs.com/api-ws/v1/inference",
        "apikey": "st-****",
        "device_id": "my_device_id",
        "service_mode": "1"
    }
  • Parameter descriptions

    Parameter

    Type

    Required

    Description

    url

    String

    Yes

    The service endpoint. This is fixed at wss://dashscope.aliyuncs.com/api-ws/v1/inference.

    apikey

    String

    Yes

    The API key. To reduce the risk of a long-term key being exposed, use a more secure temporary API key with a short validity period.

    service_mode

    String

    Yes

    The operating mode. For real-time speech recognition, this is fixed at "1".

    device_id

    String

    Yes

    A unique string that identifies the end user. You can set this to an in-app user ID or a unique device identifier generated by the client. This ID is mainly used for log tracking and troubleshooting.

    debug_path

    String

    No

    The storage path for log files.

    This parameter takes effect only when save_log is set to true in the initialize call. In that case you must set a log file path; otherwise, an error occurs.

    A maximum of two log files are kept locally.

    save_wav

    String

    No

    Specifies whether to save audio files for debugging. The audio files are saved to the path specified by debug_path.

    Default value: "false".

    Valid values:

    • "true": Yes

    • "false": No

    This parameter takes effect only when save_log is set to true in the initialize call. The debug_path parameter must also be set.

    max_log_file_size

    int

    No

    The maximum size of a log file in bytes.

    This parameter takes effect only when save_log is set to true in the initialize call.

    Default value: 104857600 (100 × 1024 × 1024 bytes, which is 100 MiB).

    log_track_level

    int

    No

    Controls the filter level of log content sent externally through the log callback (onNuiLogTrackCallback).

    Default value: 2.

    Valid values:

    • 0: LOG_LEVEL_VERBOSE

    • 1: LOG_LEVEL_DEBUG

    • 2: LOG_LEVEL_INFO

    • 3: LOG_LEVEL_WARNING

    • 4: LOG_LEVEL_ERROR

    • 5: LOG_LEVEL_NONE (disables this feature)

    Note: log_track_level and level (set through the initialize interface) together determine the final logs that are sent to the callback. A log is sent to the callback only if its level value is greater than or equal to both the log_track_level and level values. For example, if log_track_level is set to 2 (INFO) and level is set to 3 (WARNING), only logs at the WARNING level and higher (value >= 3) are sent to the callback.
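
The rule in the note above can be written as a single predicate. This is a sketch only: LogFilter is a hypothetical helper, and it assumes that the level argument of initialize maps to the same numeric scale (0 to 5) as log_track_level.

```java
/** Hypothetical helper that mirrors the filtering rule described above. Not an SDK class. */
final class LogFilter {
    private LogFilter() {}

    /**
     * @param logLevel      numeric level of one log line (0 = VERBOSE ... 4 = ERROR)
     * @param logTrackLevel value of log_track_level (5 = LOG_LEVEL_NONE disables forwarding)
     * @param sdkLevel      numeric value assumed for the level argument of initialize
     * @return true if the log line would be sent to the log callback
     */
    static boolean isForwarded(int logLevel, int logTrackLevel, int sdkLevel) {
        if (logTrackLevel >= 5) {
            return false; // LOG_LEVEL_NONE: nothing is forwarded
        }
        // The line must reach both thresholds, so compare against the stricter one.
        return logLevel >= Math.max(logTrackLevel, sdkLevel);
    }
}
```

With log_track_level set to 2 and level set to 3, isForwarded(3, 2, 3) is true and isForwarded(2, 2, 3) is false, which matches the example in the note.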

Speech recognition effect parameters

You can configure these parameters by passing a JSON string in the params parameter of the setParams method.

  • Parameter example: The following sample JSON string does not list all parameters. Add other parameters as needed when you write your code:

    {
        "service_type": 4,
        "nls_config": {
            "model": "paraformer-realtime-v2",
            "sr_format": "pcm",
            "sample_rate": 16000
        }
    }
  • Parameter descriptions

    Top-level parameter

    Type

    Required

    Description

    service_type

    int

    Yes

    The voice service type. For real-time speech recognition, this is fixed at 4.

    nls_config

    object

    Yes

    The core configuration object for speech recognition. It contains key parameters such as model selection and recognition effect controls.

    nls_config.model

    string

    Yes

    The speech recognition model.

    nls_config.sr_format

    string

    Yes

    The format of the audio to be recognized.

    Supported audio formats: pcm, wav, opus.

    Important
    • opus: Provide PCM-encoded audio as input. The SDK encodes it to the Opus format internally.

    • wav/pcm: The audio must be PCM encoded.

    nls_config.sample_rate

    int

    Yes

    The sample rate of the audio to be recognized, in Hz.

    This varies by model:

    • paraformer-realtime-v2 supports any sample rate.

    • paraformer-realtime-v1 only supports a 16000 Hz sample rate.

    • paraformer-realtime-8k-v2 only supports an 8000 Hz sample rate.

    • paraformer-realtime-8k-v1 only supports an 8000 Hz sample rate.

    nls_config.disfluency_removal_enabled

    boolean

    No

    Specifies whether to filter out filler words, such as "um" and "ah".

    Default value: false.

    nls_config.language_hints

    array[string]

    No

    Sets the language codes for recognition. If you cannot determine the language in advance, you can leave this unset. The model will automatically detect the language.

    Supported language codes:

    • zh: Chinese

    • en: English

    • ja: Japanese

    • yue: Cantonese

    • ko: Korean

    • de: German

    • fr: French

    • ru: Russian

    This parameter only takes effect for multilingual models.

    nls_config.semantic_punctuation_enabled

    boolean

    No

    Sets the sentence segmentation mode.

    Default value: false.

    Valid values:

    • true: Enables semantic segmentation and disables VAD segmentation.

    • false: Enables VAD segmentation and disables semantic segmentation.

    Semantic segmentation provides higher accuracy and is suitable for meeting transcription scenarios. Voice Activity Detection (VAD) segmentation has lower latency and is suitable for real-time interactive scenarios.

    This parameter takes effect only for v2 and later models.

    nls_config.max_sentence_silence

    int

    No

    The silence duration threshold for VAD segmentation, in ms.

    Default value: 800.

    Value range: [200, 6000].

    When the silence duration after a speech segment exceeds this threshold, the system determines that the sentence has ended.

    This parameter takes effect only when the semantic_punctuation_enabled parameter is false and for v2 and later models.

    nls_config.multi_threshold_mode_enabled

    boolean

    No

    Specifies whether to enable the long-segment prevention mode. Enabling this prevents VAD from creating excessively long segments.

    Default value: false (disabled).

    Valid values:

    • true: Enabled

    • false: Disabled

    This parameter takes effect only when the semantic_punctuation_enabled parameter is false and for v2 and later models.

    nls_config.punctuation_prediction_enabled

    boolean

    No

    Specifies whether to automatically add punctuation to the recognition results.

    Default value: true (yes).

    Valid values:

    • true: Yes

    • false: No

    This parameter takes effect only for v2 and later models.

    nls_config.heartbeat

    boolean

    No

    Specifies whether to maintain a persistent connection with the server.

    Default value: false.

    Valid values:

    • true: The connection to the server stays open while silent audio is sent continuously.

    • false: The server closes the connection after a 60-second timeout, even if silent audio is sent continuously. This timeout is default server-side behavior and cannot be configured by the client.

    This parameter takes effect only for v2 and later models.

    nls_config.inverse_text_normalization_enabled

    boolean

    No

    Specifies whether to enable Inverse Text Normalization (ITN). When enabled, Chinese numerals are converted to Arabic numerals.

    Default value: true (enabled).

    Valid values:

    • true: Enabled

    • false: Disabled

    This parameter takes effect only for v2 and later models.

    nls_config.vocabulary_id

    string

    No

    The ID of the hotword vocabulary, used to improve the recognition accuracy of specific words. This parameter applies to v2 and later models. For information about how to use hotwords, see Customize hotwords.

    nls_config.resources

    array[object]

    No

    The hotword resource configuration for v1 models. This feature is the same as vocabulary_id but is configured differently:

    resources is an array of objects. Each element contains the resource_id and resource_type fields:

    • resource_id: A string that specifies the hotword ID.

    • resource_type: A string with a fixed value of "asr_phrase".

    Example:

    {
        "nls_config": {
              "resources": [
                  {
                      "resource_id": "xxxxxxxxxxxx",
                      "resource_type": "asr_phrase"
                  }
              ]
        }
    }

    For information about how to use hotwords, see Customize and manage Paraformer speech recognition hotwords.
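
The constraints in the table above (the sample rate each model accepts and the [200, 6000] range of max_sentence_silence) can be checked on the client before the JSON string is passed to setParams. The following is a minimal sketch under stated assumptions: RecognitionParams is a hypothetical helper, the JSON is assembled by string concatenation, and the model and format values are assumed to contain no characters that need JSON escaping.

```java
/** Hypothetical helper that validates and builds the setParams JSON. Not an SDK class. */
final class RecognitionParams {
    private RecognitionParams() {}

    /** Rejects sample rates that the model list above rules out. Unknown models are not checked. */
    static void checkSampleRate(String model, int sampleRateHz) {
        int required;
        switch (model) {
            case "paraformer-realtime-v1":
                required = 16000;
                break;
            case "paraformer-realtime-8k-v2":
            case "paraformer-realtime-8k-v1":
                required = 8000;
                break;
            default:
                return; // paraformer-realtime-v2 accepts any rate; other names are not validated
        }
        if (sampleRateHz != required) {
            throw new IllegalArgumentException(model + " requires " + required + " Hz");
        }
    }

    /** Enforces the documented value range of max_sentence_silence, in ms. */
    static void checkMaxSentenceSilence(int millis) {
        if (millis < 200 || millis > 6000) {
            throw new IllegalArgumentException("max_sentence_silence must be in [200, 6000]");
        }
    }

    /** Builds the JSON string for setParams after validating the inputs. */
    static String toJson(String model, String srFormat, int sampleRateHz, int maxSentenceSilenceMs) {
        checkSampleRate(model, sampleRateHz);
        checkMaxSentenceSilence(maxSentenceSilenceMs);
        return "{\"service_type\":4,\"nls_config\":{"
                + "\"model\":\"" + model + "\","
                + "\"sr_format\":\"" + srFormat + "\","
                + "\"sample_rate\":" + sampleRateHz + ","
                + "\"max_sentence_silence\":" + maxSentenceSilenceMs
                + "}}";
    }
}
```

For example, RecognitionParams.toJson("paraformer-realtime-v2", "pcm", 16000, 800) yields a string that could be passed to setParams, while requesting 8000 Hz for paraformer-realtime-v1 throws an IllegalArgumentException before any SDK call is made.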

Key interfaces

NativeNui

initialize

Initializes the speech recognition SDK instance. The SDK is a singleton and cannot be initialized again until you call release.

This API call is blocking and should be made on a non-UI thread.

  • Method signature

    public synchronized int initialize(final INativeNuiCallback callback,
                                       String parameters,
                                       final Constants.LogLevel level,
                                       final boolean save_log)
  • Parameter descriptions

    | Parameter | Type | Description |
    | --- | --- | --- |
    | callback | INativeNuiCallback | The implementation of the event and data callback interface. |
    | parameters | String | A JSON string that contains authentication, connection, and debugging parameters. For more information, see Connection and control parameters. |
    | level | Constants.LogLevel | Controls the print level of the SDK's own logs. |
    | save_log | boolean | Specifies whether to save binary logs. If set to true, you must specify a path with debug_path in the connection and control parameters. You can also set the file size with max_log_file_size. |

  • Return value

    Returns an error code. For more information, see Query error codes.
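
Because initialize blocks, one option is to build the parameters string first and then run the call on a worker thread. The following is a minimal sketch: InitializeSketch, quote, connectionParams, and initializeOffUiThread are hypothetical names, and the IntSupplier stands in for a lambda that would wrap the real initialize call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntSupplier;

/** Hypothetical helper for preparing and running initialize. Not an SDK class. */
final class InitializeSketch {
    private InitializeSketch() {}

    /** Wraps a value in double quotes, escaping backslashes and quotes for JSON. */
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Builds the connection and control parameters. debugPath may be null to omit debug_path. */
    static String connectionParams(String apiKey, String deviceId, String debugPath) {
        StringBuilder sb = new StringBuilder("{");
        sb.append("\"url\":").append(quote("wss://dashscope.aliyuncs.com/api-ws/v1/inference"));
        sb.append(",\"apikey\":").append(quote(apiKey));
        sb.append(",\"device_id\":").append(quote(deviceId));
        sb.append(",\"service_mode\":").append(quote("1"));
        if (debugPath != null) {
            sb.append(",\"debug_path\":").append(quote(debugPath));
        }
        return sb.append("}").toString();
    }

    /** Runs the blocking call on a worker thread and returns a Future holding its error code. */
    static Future<Integer> initializeOffUiThread(IntSupplier blockingInitialize) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        Callable<Integer> task = blockingInitialize::getAsInt;
        Future<Integer> result = worker.submit(task);
        worker.shutdown(); // the submitted task still runs; the thread exits afterwards
        return result;
    }
}
```

In an app, the supplier would typically be a lambda that calls initialize with the string from connectionParams, and the returned error code would be read from the Future away from the UI thread.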

setParams

Sets speech recognition effect parameters in JSON format. You must call this method before you call startDialog.

startDialog

Starts the recognition process.

  • Method signature

    public synchronized int startDialog(VadMode vad_mode, String dialog_params)
  • Parameter descriptions

    | Parameter | Type | Description |
    | --- | --- | --- |
    | vad_mode | VadMode | The VAD mode. This is fixed at VadMode.TYPE_P2T. |
    | dialog_params | String | A JSON string. When the temporary API key set through the apikey parameter in the connection and control parameters expires, you can update it here. |

    Example dialog_params content:

    {
      "apikey": "st-****"
    }
  • Return value

    Returns an error code. For more information, see Query error codes.
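
Since a temporary API key is valid for 60 seconds by default, the dialog_params string can be built from a key that is refreshed once it is too old. The following is a minimal sketch: DialogParamsSketch is a hypothetical helper, the Supplier stands in for whatever mechanism your application uses to obtain a new temporary key, and the key is assumed to contain no characters that need JSON escaping.

```java
import java.util.function.Supplier;

/** Hypothetical helper that keeps a temporary key fresh for dialog_params. Not an SDK class. */
final class DialogParamsSketch {
    /** Default validity of a temporary API key, as stated in Getting started. */
    static final long ASSUMED_TTL_MILLIS = 60_000L;

    private final Supplier<String> keyFetcher;
    private String key;
    private long issuedAtMillis;

    DialogParamsSketch(Supplier<String> keyFetcher, long nowMillis) {
        this.keyFetcher = keyFetcher;
        this.key = keyFetcher.get();
        this.issuedAtMillis = nowMillis;
    }

    /** Returns the dialog_params JSON, fetching a new key first if the current one has expired. */
    String dialogParams(long nowMillis) {
        if (nowMillis - issuedAtMillis >= ASSUMED_TTL_MILLIS) {
            key = keyFetcher.get();
            issuedAtMillis = nowMillis;
        }
        return "{\"apikey\":\"" + key + "\"}";
    }
}
```

The result of dialogParams(System.currentTimeMillis()) could then be passed as the second argument of startDialog.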

stopDialog

Stops the recognition process. After this method is called, the server returns the final recognition result and ends the task.

  • Method signature

    public synchronized int stopDialog();
  • Return value

    Returns an error code. For more information, see Query error codes.

cancelDialog

Immediately stops the recognition process. After this method is called, the task ends immediately without waiting for the server to return the final recognition result.

  • Method signature

    public synchronized int cancelDialog();
  • Return value

    Returns an error code. For more information, see Query error codes.

release

Releases all internal resources of the SDK. After this method is called, the SDK instance becomes unavailable. To use the instance again, you must call initialize to re-initialize it.

  • Method signature

    public synchronized int release();
  • Return value

    Returns an error code. For more information, see Query error codes.

GetVersion

Retrieves the current SDK version information.

  • Method signature

    public synchronized String GetVersion();
  • Return value

    The current SDK version information.

INativeNuiCallback: Listener callbacks

onNuiEventCallback: Listen for events and speech recognition results

  • Method signature

    void onNuiEventCallback(NuiEvent event, final int resultCode, final int arg2, KwsResult kwsResult, AsrResult asrResult);
  • Parameter descriptions

    | Parameter | Type | Description |
    | --- | --- | --- |
    | event | NuiEvent | The callback event. |
    | resultCode | int | The error code. This is valid only when an EVENT_ASR_ERROR event occurs. |
    | asrResult | AsrResult | The speech recognition result. |
    | kwsResult | KwsResult | The voice wake-up result. It is not used for speech recognition, so you can ignore this parameter. |
    | arg2 | int | Reserved parameter. |

onNuiAudioStateChanged: Listen for audio state

The SDK uses this callback to notify you when to start or stop recording.

  • Method signature

    void onNuiAudioStateChanged(AudioState state);
  • AudioState descriptions

    | State | Description |
    | --- | --- |
    | STATE_OPEN | The interaction starts. You can open the recording device to record audio. |
    | STATE_PAUSE | The interaction stops. You can stop recording. |
    | STATE_CLOSE | The SDK instance is released. You can completely shut down the recording device. |
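
The three states above map naturally onto a recorder's start, stop, and release operations. The following is a minimal sketch: the AudioState enum is a local mirror of the SDK values so that the example compiles on its own, and Recorder is a hypothetical abstraction that, on Android, would typically wrap android.media.AudioRecord.

```java
/** Hypothetical dispatch from audio state to recorder actions. Not SDK code. */
final class AudioStateSketch {
    private AudioStateSketch() {}

    /** Local mirror of the AudioState values listed above. */
    enum AudioState { STATE_OPEN, STATE_PAUSE, STATE_CLOSE }

    /** Hypothetical recorder abstraction. */
    interface Recorder {
        void start();
        void stop();
        void release();
    }

    /** Same shape as the callback, with the recorder passed in explicitly for the sketch. */
    static void onNuiAudioStateChanged(AudioState state, Recorder recorder) {
        switch (state) {
            case STATE_OPEN:
                recorder.start();   // interaction starts: begin capturing audio
                break;
            case STATE_PAUSE:
                recorder.stop();    // interaction stops: pause capturing
                break;
            case STATE_CLOSE:
                recorder.release(); // SDK released: free the recording device
                break;
        }
    }
}
```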

onNuiNeedAudioData: Fill in audio data for recognition

After the recognition process starts, this callback is triggered continuously. You must provide the audio data to be recognized in this callback.

  • Method signature

    int onNuiNeedAudioData(byte[] buffer, int len);
  • Parameter descriptions

    | Parameter | Type | Description |
    | --- | --- | --- |
    | buffer | byte[] | The audio data to fill. |
    | len | int | The number of bytes of the audio data to fill. |

  • Return value

    The actual number of bytes filled.
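
The contract above (fill up to len bytes and return the count actually written) can be illustrated with any byte source. The following is a minimal sketch: AudioFeedSketch is a hypothetical class, and a java.io.InputStream stands in for the recording device, which on Android would usually be read through android.media.AudioRecord instead.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical audio feeder that copies PCM bytes from a stream. Not SDK code. */
final class AudioFeedSketch {
    private final InputStream pcmSource;

    AudioFeedSketch(InputStream pcmSource) {
        this.pcmSource = pcmSource;
    }

    /** Fills buffer with up to len bytes and returns the number of bytes actually filled. */
    int onNuiNeedAudioData(byte[] buffer, int len) {
        int want = Math.min(len, buffer.length);
        int filled = 0;
        try {
            while (filled < want) {
                int n = pcmSource.read(buffer, filled, want - filled);
                if (n < 0) {
                    break; // source exhausted
                }
                filled += n;
            }
        } catch (IOException e) {
            // Report only what was copied before the failure.
        }
        return filled;
    }
}
```

Returning the partial count rather than len keeps the return value accurate when the source delivers fewer bytes than requested.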

onNuiLogTrackCallback: Listen for tracking logs

This callback is used to receive detailed internal logs from the SDK for troubleshooting and debugging.

  • Method signature

    default void onNuiLogTrackCallback(Constants.LogLevel level, String log)

NuiEvent: Event types

| Event | Description |
| --- | --- |
| EVENT_TRANSCRIBER_STARTED | The task started successfully. |
| EVENT_VAD_START | This event is triggered after the task starts. It does not indicate that the start of a human voice has been detected. |
| EVENT_VAD_END | The end of a human voice is detected. |
| EVENT_ASR_PARTIAL_RESULT | Intermediate speech recognition result. |
| EVENT_ASR_ERROR | An error occurred during speech recognition. |
| EVENT_MIC_ERROR | Triggered because no audio data was received for 2 consecutive seconds. |
| EVENT_SENTENCE_END | The end of a sentence is detected. A complete recognition result for the sentence is returned. |
| EVENT_TRANSCRIBER_COMPLETE | Speech recognition is complete. |
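
One way to consume these events is to keep the latest intermediate result separate from the list of finished sentences. The following is a minimal sketch: TranscriptCollector is a hypothetical class, the NuiEvent enum is a local mirror of the events above, and a plain String stands in for the text carried by AsrResult, whose fields are not described in this document.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical event consumer that assembles a transcript. Not SDK code. */
final class TranscriptCollector {
    /** Local mirror of the NuiEvent values listed above. */
    enum NuiEvent {
        EVENT_TRANSCRIBER_STARTED, EVENT_VAD_START, EVENT_VAD_END, EVENT_ASR_PARTIAL_RESULT,
        EVENT_ASR_ERROR, EVENT_MIC_ERROR, EVENT_SENTENCE_END, EVENT_TRANSCRIBER_COMPLETE
    }

    private final List<String> sentences = new ArrayList<>();
    private String partial = "";
    private int lastErrorCode;
    private boolean complete;

    /** text is a stand-in for the recognition text extracted from AsrResult. */
    void onEvent(NuiEvent event, int resultCode, String text) {
        switch (event) {
            case EVENT_ASR_PARTIAL_RESULT:
                partial = text;        // intermediate result replaces the previous one
                break;
            case EVENT_SENTENCE_END:
                sentences.add(text);   // final text for the sentence
                partial = "";
                break;
            case EVENT_ASR_ERROR:
                lastErrorCode = resultCode;
                break;
            case EVENT_TRANSCRIBER_COMPLETE:
                complete = true;       // safe to treat the task as finished
                break;
            default:
                break;                 // other events are ignored in this sketch
        }
    }

    List<String> sentences() { return sentences; }
    String partial() { return partial; }
    int lastErrorCode() { return lastErrorCode; }
    boolean isComplete() { return complete; }
}
```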