Paraformer real-time speech recognition Android SDK

This document describes how to use the Paraformer real-time speech recognition Android software development kit (SDK) to convert speech to text.

User guide: For an introduction to the models and selection recommendations, see Real-time speech recognition - Fun-ASR/Gummy/Paraformer.

Online experience: An online experience is available only for paraformer-realtime-v2, paraformer-realtime-8k-v2, and paraformer-realtime-v1.

Getting started

Obtain an API key: See Get an API key. For security, configure the API key as an environment variable instead of hard-coding it.

Note
To grant temporary access permissions to third-party applications or users, or to strictly control risky operations such as accessing or deleting sensitive data, use a temporary API key. A temporary API key is valid for 60 seconds by default and must be reacquired after it expires.
Download the SDK and run the sample code:
- Download the latest SDK package.
- Unzip the package. Obtain the AAR-format SDK from the app/libs folder and add it to your project dependencies.
  For Android C++ integration, use the android_libs and android_include folders in the ZIP package to obtain the dynamic libraries and header files.
- Open the project in Android Studio. The sample code is in DashParaformerSpeechTranscriberActivity.java. Replace the API key to test the feature.

Invocation steps

Initialize the SDK.
Set the required parameters. Pass connection and control parameters in the parameters argument of the initialize method, and set speech recognition effect parameters with the setParams method.
Call startDialog to start the recognition process.
In the onNuiAudioStateChanged callback, start the recording device based on the audio state.
In the onNuiNeedAudioData callback, continuously provide recording data.
In the onNuiEventCallback callback, listen for events and retrieve the speech recognition results.
Call stopDialog to stop recognition. Confirm that recognition has ended by listening for the EVENT_TRANSCRIBER_COMPLETE event.
When the recognition feature is no longer needed, call the release method to release SDK resources.
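
The call order above can be sketched as follows. `SpeechSdk` and `RecordingSdk` are hypothetical, simplified stand-ins written for this example; they are not the SDK's `NativeNui` class, whose real signatures are listed under Key interfaces. The sketch also assumes that a return value of 0 means success. Only the order of the calls is the point.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for the SDK surface (not the real NativeNui class).
interface SpeechSdk {
    int initialize(String parameters);
    int setParams(String params);
    int startDialog();
    int stopDialog();
    int release();
}

// Fake SDK that records the order of calls and always reports success.
final class RecordingSdk implements SpeechSdk {
    final List<String> calls = new ArrayList<>();
    public int initialize(String parameters) { calls.add("initialize"); return 0; }
    public int setParams(String params) { calls.add("setParams"); return 0; }
    public int startDialog() { calls.add("startDialog"); return 0; }
    public int stopDialog() { calls.add("stopDialog"); return 0; }
    public int release() { calls.add("release"); return 0; }
}

final class LifecycleSketch {
    // Assumption: a return value of 0 means success.
    static int runSession(SpeechSdk sdk, String connectionJson, String effectJson) {
        int code = sdk.initialize(connectionJson);
        if (code != 0) return code; // this sketch does not release after a failed initialize
        try {
            code = sdk.setParams(effectJson);
            if (code != 0) return code;
            code = sdk.startDialog();
            if (code != 0) return code;
            // Audio flows through the callbacks here. A real app would wait for
            // EVENT_TRANSCRIBER_COMPLETE after stopDialog before releasing.
            return sdk.stopDialog();
        } finally {
            sdk.release();
        }
    }
}
```

Calling `LifecycleSketch.runSession(new RecordingSdk(), "{}", "{}")` records the five calls in the documented order. Wrapping the session in try/finally is a design choice of this sketch so that the stand-in is always released once it has been initialized.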

Request parameters

Connection and control parameters

You can configure these parameters by passing a JSON string in the parameters parameter of the initialize method.

Parameter example: The following sample JSON string does not list every parameter. Add other parameters as needed when you build the string:

{
    "url": "wss://dashscope.aliyuncs.com/api-ws/v1/inference",
    "apikey": "st-****",
    "device_id": "my_device_id",
    "service_mode": "1"
}
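
A minimal sketch of building this JSON string at runtime, so that the API key is passed in rather than hard-coded. The class and method names (`ConnectionParams`, `quote`, `build`) are hypothetical helpers written for this example and are not part of the SDK. The string is assembled by hand to keep the example dependency-free; on Android, `org.json.JSONObject` would be the more usual choice.

```java
final class ConnectionParams {
    static final String URL = "wss://dashscope.aliyuncs.com/api-ws/v1/inference";

    // Wraps s in double quotes and escapes characters that JSON strings require escaped.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // Builds the four required connection and control parameters.
    static String build(String apiKey, String deviceId) {
        if (apiKey == null || apiKey.isEmpty()) {
            throw new IllegalArgumentException("apikey is required");
        }
        if (deviceId == null || deviceId.isEmpty()) {
            throw new IllegalArgumentException("device_id is required");
        }
        return "{"
                + "\"url\":" + quote(URL) + ","
                + "\"apikey\":" + quote(apiKey) + ","
                + "\"device_id\":" + quote(deviceId) + ","
                + "\"service_mode\":\"1\"" // fixed at "1" for real-time speech recognition
                + "}";
    }
}
```

The result of `ConnectionParams.build(apiKey, deviceId)` is what would be passed as the parameters argument of initialize.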

Parameter descriptions

Parameter	Type	Required	Description
`url`	`String`	Yes	The service endpoint. This is fixed at `wss://dashscope.aliyuncs.com/api-ws/v1/inference`.
`apikey`	`String`	Yes	The API key. To reduce the risk of a long-term key being exposed, use a more secure temporary API key with a short validity period.
`service_mode`	`String`	Yes	The operating mode. For real-time speech recognition, this is fixed at `"1"`.
`device_id`	`String`	Yes	A unique string that identifies the end user. You can set this to an in-app user ID or a unique device identifier generated by the client. This ID is mainly used for log tracking and troubleshooting.
`debug_path`	`String`	No	The storage path for log files. This parameter takes effect only when `save_log` is set to true in the initialize call. In that case the path is required; if it is missing, an error occurs. A maximum of two log files are kept locally.
`save_wav`	`String`	No	Specifies whether to save audio files for debugging. The audio files are saved to the path specified by `debug_path`. Valid values: "true" (save) and "false" (do not save). Default value: "false". This parameter takes effect only when `save_log` is set to true in the initialize call and `debug_path` is set.
`max_log_file_size`	`int`	No	The maximum size of a log file in bytes. This parameter takes effect only when `save_log` is set to true in the initialize call. Default value: 104857600 (100 × 1024 × 1024 bytes, which is 100 MiB).
`log_track_level`	`int`	No	Controls the filter level of the log content delivered through the log callback (`onNuiLogTrackCallback`). Default value: 2. Valid values: 0 (LOG_LEVEL_VERBOSE), 1 (LOG_LEVEL_DEBUG), 2 (LOG_LEVEL_INFO), 3 (LOG_LEVEL_WARNING), 4 (LOG_LEVEL_ERROR), 5 (LOG_LEVEL_NONE, which disables this feature). Note: `log_track_level` and `level` (set in the initialize call) together determine which logs reach the callback. A log is delivered only if its level value is greater than or equal to both `log_track_level` and `level`. For example, if `log_track_level` is set to 2 (INFO) and `level` is set to 3 (WARNING), only logs at the WARNING level and higher (value >= 3) are sent to the callback.
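
The filtering rule for `log_track_level` can be illustrated with a small predicate. This is a sketch of the rule as described in the table, not SDK code: `LogTrackFilter` is a hypothetical class, and it assumes the `level` argument of initialize uses the same numeric scale as `log_track_level`.

```java
final class LogTrackFilter {
    // Numeric levels as listed for log_track_level.
    static final int VERBOSE = 0;
    static final int DEBUG = 1;
    static final int INFO = 2;
    static final int WARNING = 3;
    static final int ERROR = 4;
    static final int NONE = 5; // disables the log callback

    // True if a log at logLevel would be delivered to the callback.
    static boolean reachesCallback(int logLevel, int logTrackLevel, int sdkLevel) {
        if (logTrackLevel == NONE) {
            return false;
        }
        // The log must pass both thresholds.
        return logLevel >= logTrackLevel && logLevel >= sdkLevel;
    }
}
```

With `log_track_level` at INFO and `level` at WARNING, `reachesCallback(LogTrackFilter.INFO, 2, 3)` is false and `reachesCallback(LogTrackFilter.WARNING, 2, 3)` is true, which matches the example in the table.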

Speech recognition effect parameters

You can configure these parameters by passing a JSON string in the params parameter of the setParams method.

Parameter example: The following sample JSON string does not list every parameter. Add other parameters as needed when you build the string:

{
    "service_type": 4,
    "nls_config": {
        "model": "paraformer-realtime-v2",
        "sr_format": "pcm",
        "sample_rate": "16000"
    }
}

Parameter descriptions

Top-level parameter	Type	Required	Description
`service_type`	`int`	Yes	The Voice Service type. For real-time speech recognition, this is fixed at `4`.
`nls_config`	`object`	Yes	The core configuration object for speech recognition. It contains key parameters such as model selection and recognition effect controls.
`nls_config.model`	`string`	Yes	The speech recognition model.
`nls_config.sr_format`	`string`	Yes	The format of the audio to be recognized. Supported audio formats: pcm, wav, opus. Important: For all three formats, the audio that you supply must be PCM encoded. For opus, the SDK encodes the PCM data into the Opus format internally.
`nls_config.sample_rate`	`int`	Yes	The sample rate of the audio to be recognized, in Hz. This varies by model: paraformer-realtime-v2 supports any sample rate. paraformer-realtime-v1 only supports a 16000 Hz sample rate. paraformer-realtime-8k-v2 only supports an 8000 Hz sample rate. paraformer-realtime-8k-v1 only supports an 8000 Hz sample rate.
`nls_config.disfluency_removal_enabled`	`boolean`	No	Specifies whether to filter out filler words, such as "um" and "ah". Default value: false.
`nls_config.language_hints`	`array[string]`	No	The language codes to recognize. If you cannot determine the language in advance, leave this unset and the model detects the language automatically. Supported language codes: zh (Chinese), en (English), ja (Japanese), yue (Cantonese), ko (Korean), de (German), fr (French), ru (Russian). This parameter takes effect only for multilingual models.
`nls_config.semantic_punctuation_enabled`	`boolean`	No	Sets the sentence segmentation mode. Default value: false. Valid values: true (enables semantic segmentation and disables VAD segmentation); false (enables VAD segmentation and disables semantic segmentation). Semantic segmentation provides higher accuracy and suits meeting transcription. Voice Activity Detection (VAD) segmentation has lower latency and suits real-time interaction. This parameter takes effect only for v2 and later models.
`nls_config.max_sentence_silence`	`int`	No	The silence duration threshold for VAD segmentation, in ms. Default value: 800. Value range: [200, 6000]. When the silence after a speech segment exceeds this threshold, the system determines that the sentence has ended. This parameter takes effect only when `semantic_punctuation_enabled` is false, and only for v2 and later models.
`nls_config.multi_threshold_mode_enabled`	`boolean`	No	Specifies whether to enable the long-segment prevention mode, which prevents VAD from creating excessively long segments. Default value: false. Valid values: true (enabled), false (disabled). This parameter takes effect only when `semantic_punctuation_enabled` is false, and only for v2 and later models.
`nls_config.punctuation_prediction_enabled`	`boolean`	No	Specifies whether to automatically add punctuation to the recognition results. Default value: true. Valid values: true (add punctuation), false (do not add punctuation). This parameter takes effect only for v2 and later models.
`nls_config.heartbeat`	`boolean`	No	Specifies whether to maintain a persistent connection with the server. Default value: false. Valid values: true (the connection stays open as long as you keep sending silent audio); false (the connection times out and is closed after 60 seconds, even if you keep sending silent audio). The 60-second timeout is a default server-side behavior and cannot be configured by the client. This parameter takes effect only for v2 and later models.
`nls_config.inverse_text_normalization_enabled`	`boolean`	No	Specifies whether to enable Inverse Text Normalization (ITN). When enabled, Chinese numerals are converted to Arabic numerals. Default value: true. Valid values: true (enabled), false (disabled). This parameter takes effect only for v2 and later models.
`nls_config.vocabulary_id`	`string`	No	The ID of the hotword vocabulary, used to improve the recognition accuracy of specific words. This parameter applies to v2 and later models. For information about how to use hotwords, see Customize hotwords.
`nls_config.resources`	`array[object]`	No	The hotword resource configuration for v1 models. This feature is the same as `vocabulary_id` but is configured differently: `resources` is an array of objects. Each element contains the `resource_id` and `resource_type` fields: `resource_id`: A `string` that specifies the hotword ID. `resource_type`: A `string` with a fixed value of "`asr_phrase`". Example: `{ "nls_config": { "resources": [ { "resource_id": "xxxxxxxxxxxx", "resource_type": "asr_phrase" } ] } }` For information about how to use hotwords, see Customize and manage Paraformer speech recognition hotwords.
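
Several of the constraints in this table can be checked on the client before calling setParams. The following is a minimal sketch of such a check; `EffectParamsCheck` is a hypothetical helper written for this example, and the SDK may validate differently or not at all. Rejecting unknown model names is a choice of the sketch, not documented behavior.

```java
import java.util.ArrayList;
import java.util.List;

final class EffectParamsCheck {
    // Sample-rate rule per model, as listed for nls_config.sample_rate.
    static boolean sampleRateSupported(String model, int sampleRate) {
        switch (model) {
            case "paraformer-realtime-v2":
                return sampleRate > 0; // any sample rate
            case "paraformer-realtime-v1":
                return sampleRate == 16000;
            case "paraformer-realtime-8k-v2":
            case "paraformer-realtime-8k-v1":
                return sampleRate == 8000;
            default:
                return false; // unknown model: this sketch rejects it
        }
    }

    // Returns human-readable problems; an empty list means no problem was found.
    // Pass null for maxSentenceSilenceMs when the parameter is not set.
    static List<String> check(String model, int sampleRate,
                              boolean semanticPunctuation, Integer maxSentenceSilenceMs) {
        List<String> problems = new ArrayList<>();
        if (!sampleRateSupported(model, sampleRate)) {
            problems.add("sample_rate " + sampleRate + " is not supported by " + model);
        }
        if (maxSentenceSilenceMs != null) {
            if (maxSentenceSilenceMs < 200 || maxSentenceSilenceMs > 6000) {
                problems.add("max_sentence_silence must be in [200, 6000]");
            }
            if (semanticPunctuation) {
                problems.add("max_sentence_silence has no effect when "
                        + "semantic_punctuation_enabled is true");
            }
        }
        return problems;
    }
}
```

For example, `check("paraformer-realtime-8k-v2", 16000, false, null)` reports one problem because that model only supports 8000 Hz.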

Key interfaces

NativeNui

initialize

Initializes the speech recognition SDK instance. The SDK is a singleton and cannot be initialized again until you call release.

This API call is blocking and should be made on a non-UI thread.

Method signature

public synchronized int initialize(final INativeNuiCallback callback,
                                   String parameters,
                                   final Constants.LogLevel level,
                                   final boolean save_log)

Parameter descriptions

Parameter	Type	Description
`callback`	`INativeNuiCallback`	The implementation of the event and data callback interface.
`parameters`	`String`	A JSON string that contains authentication, connection, and debugging parameters. For more information, see Connection and control parameters.
`level`	`Constants.LogLevel`	Controls the print level of the SDK's own logs.
`save_log`	`boolean`	Specifies whether to save binary logs. If set to `true`, you must specify a path with `debug_path` in the connection and control parameters. You can also set the file size with `max_log_file_size`.

Return value

Returns an error code. For more information, see Query error codes.

setParams

Sets speech recognition effect parameters in JSON format. You must call this method before you call startDialog.

Method signature

public synchronized int setParams(String params)

Parameter descriptions

Parameter	Type	Description
`params`	`String`	Speech recognition effect parameters.

Return value

Returns an error code. For more information, see Query error codes.

startDialog

Starts the recognition process.

Method signature

public synchronized int startDialog(VadMode vad_mode, String dialog_params)

Parameter descriptions

Parameter	Type	Description
`vad_mode`	`VadMode`	The VAD mode. This is fixed at `VadMode.TYPE_P2T`.
`dialog_params`	`String`	A JSON string. If the temporary API key set in the `apikey` connection and control parameter has expired, pass an updated key here.

The content of `dialog_params` is in JSON format:

{
  "apikey": "st-****"
}
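
One way to keep the key in `dialog_params` current is to track when the temporary key was fetched and fetch a new one once its validity period has elapsed. The sketch below assumes the default 60-second validity mentioned in the Getting started note. `TemporaryKeyTracker` and the `fetchKey` supplier (for example, a call to your own backend) are hypothetical and not part of the SDK. Sending the key on every startDialog call, rather than only after expiry, is a simplification made here.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class TemporaryKeyTracker {
    private final Supplier<String> fetchKey;  // hypothetical: obtains a new temporary API key
    private final LongSupplier clockMillis;   // injectable clock, e.g. System::currentTimeMillis
    private final long validityMillis;
    private String key;
    private long fetchedAt;

    TemporaryKeyTracker(Supplier<String> fetchKey, LongSupplier clockMillis, long validityMillis) {
        this.fetchKey = fetchKey;
        this.clockMillis = clockMillis;
        this.validityMillis = validityMillis;
    }

    // Returns a dialog_params JSON string, fetching a new key if the cached one has expired.
    // Assumes the key contains no characters that need JSON escaping.
    String dialogParams() {
        long now = clockMillis.getAsLong();
        if (key == null || now - fetchedAt >= validityMillis) {
            key = fetchKey.get();
            fetchedAt = now;
        }
        return "{\"apikey\":\"" + key + "\"}";
    }
}
```

In an app, the tracker could be created with `System::currentTimeMillis` and `60_000L`, and the result of `dialogParams()` passed as the second argument of startDialog.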

Return value

Returns an error code. For more information, see Query error codes.

stopDialog

Stops the recognition process. After this method is called, the server returns the final recognition result and ends the task.

Method signature
```
public synchronized int stopDialog();
```
Return value

Returns an error code. For more information, see Query error codes.

cancelDialog

Immediately stops the recognition process. After this method is called, the task ends immediately without waiting for the server to return the final recognition result.

Method signature
```
public synchronized int cancelDialog();
```
Return value

Returns an error code. For more information, see Query error codes.

release

Releases all internal resources of the SDK. After this method is called, the SDK instance becomes unavailable. To use the instance again, you must call initialize to re-initialize it.

Method signature
```
public synchronized int release();
```
Return value

Returns an error code. For more information, see Query error codes.

GetVersion

Retrieves the current SDK version information.

Method signature

public synchronized String GetVersion();

Return value

The current SDK version information.

INativeNuiCallback: Listener callbacks

onNuiEventCallback: Listen for events and speech recognition results

Method signature

void onNuiEventCallback(NuiEvent event, final int resultCode, final int arg2, KwsResult kwsResult, AsrResult asrResult);

Parameter descriptions

Parameter	Type	Description
`event`	`NuiEvent`	The callback event.
`resultCode`	`int`	The error code. This is valid only when an EVENT_ASR_ERROR event occurs.
`asrResult`	`AsrResult`	The speech recognition result.
`kwsResult`	`KwsResult`	The voice wake-up (keyword spotting) result. It is not used for speech recognition, so you can ignore this parameter.
`arg2`	`int`	Reserved parameter.

onNuiAudioStateChanged: Listen for audio state

The SDK uses this callback to notify you when to start or stop recording.

Method signature

void onNuiAudioStateChanged(AudioState state);

AudioState descriptions

State	Description
`STATE_OPEN`	The interaction starts. You can open the recording device to record audio.
`STATE_PAUSE`	The interaction stops. You can stop recording.
`STATE_CLOSE`	The SDK instance is released. You can completely shut down the recording device.
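
The mapping from audio state to recorder action might look like the following sketch. `AudioStateSketch` mirrors the state names in the table but is a stand-in enum, and `Recorder` and `LoggingRecorder` are hypothetical types written for this example. On Android the three actions would typically correspond to `AudioRecord.startRecording()`, `AudioRecord.stop()`, and `AudioRecord.release()`.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the SDK's AudioState values listed above.
enum AudioStateSketch { STATE_OPEN, STATE_PAUSE, STATE_CLOSE }

// Hypothetical abstraction over the recording device.
interface Recorder {
    void start();
    void stop();
    void release();
}

// Test double that logs which recorder actions were requested.
final class LoggingRecorder implements Recorder {
    final List<String> log = new ArrayList<>();
    public void start() { log.add("start"); }
    public void stop() { log.add("stop"); }
    public void release() { log.add("release"); }
}

final class AudioStateHandler {
    // Translates an audio state notification into the matching recorder action.
    static void onStateChanged(AudioStateSketch state, Recorder recorder) {
        switch (state) {
            case STATE_OPEN:
                recorder.start();   // interaction starts: begin recording
                break;
            case STATE_PAUSE:
                recorder.stop();    // interaction stops: stop recording
                break;
            case STATE_CLOSE:
                recorder.release(); // SDK released: shut the device down
                break;
        }
    }
}
```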

onNuiNeedAudioData: Fill in audio data for recognition

After the recognition process starts, this callback is triggered continuously. You must provide the audio data to be recognized in this callback.

Method signature

int onNuiNeedAudioData(byte[] buffer, int len);

Parameter descriptions

Parameter	Type	Description
`buffer`	`byte[]`	The audio data to fill.
`len`	`int`	The number of bytes of the audio data to fill.

Return value

The actual number of bytes filled.
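
The fill-and-return-count contract can be sketched with a plain `InputStream` standing in for the recording device. `AudioFeeder` is a hypothetical helper, not SDK code. Returning 0 at end of stream is an assumption of this sketch, because the value the SDK expects when no audio is available is not stated here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class AudioFeeder {
    // Copies up to len bytes from source into buffer and returns the number actually copied.
    static int fill(InputStream source, byte[] buffer, int len) {
        int want = Math.min(len, buffer.length);
        int filled = 0;
        try {
            while (filled < want) {
                int n = source.read(buffer, filled, want - filled);
                if (n < 0) {
                    break; // end of stream: report what was copied so far
                }
                filled += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return filled;
    }
}
```

An implementation of onNuiNeedAudioData could delegate to a helper of this shape and return its result. With a real recorder, the read call would go to the recording device instead of a stream.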

onNuiLogTrackCallback: Listen for tracking logs

This callback is used to receive detailed internal logs from the SDK for troubleshooting and debugging.

default void onNuiLogTrackCallback(Constants.LogLevel level, String log)

`NuiEvent`: Event types

Event	Description
EVENT_TRANSCRIBER_STARTED	The task started successfully.
EVENT_VAD_START	This event is triggered after the task starts. It does not indicate that the start of a human voice has been detected.
EVENT_VAD_END	The end of a human voice is detected.
EVENT_ASR_PARTIAL_RESULT	Intermediate speech recognition result.
EVENT_ASR_ERROR	An error occurred during speech recognition.
EVENT_MIC_ERROR	Triggered because no audio data was received for 2 consecutive seconds.
EVENT_SENTENCE_END	The end of a sentence is detected. A complete recognition result for the sentence is returned.
EVENT_TRANSCRIBER_COMPLETE	Speech recognition is complete.
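
A handler for these events might accumulate results as in the sketch below. `EventSketch` mirrors the event names in the table but is a stand-in enum, and `TranscriptCollector` is a hypothetical class. The `text` argument stands in for whatever text the `AsrResult` object carries; its fields are not described in this document, so a plain string is used.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the SDK's NuiEvent values listed above.
enum EventSketch {
    EVENT_TRANSCRIBER_STARTED, EVENT_VAD_START, EVENT_VAD_END,
    EVENT_ASR_PARTIAL_RESULT, EVENT_ASR_ERROR, EVENT_MIC_ERROR,
    EVENT_SENTENCE_END, EVENT_TRANSCRIBER_COMPLETE
}

final class TranscriptCollector {
    final List<String> sentences = new ArrayList<>(); // completed sentences, in order
    String partial = "";                              // latest intermediate result
    boolean complete = false;                         // set on EVENT_TRANSCRIBER_COMPLETE
    int lastError = 0;                                // last resultCode seen with EVENT_ASR_ERROR

    void onEvent(EventSketch event, int resultCode, String text) {
        switch (event) {
            case EVENT_ASR_PARTIAL_RESULT:
                partial = text;      // overwrite: each partial replaces the previous one
                break;
            case EVENT_SENTENCE_END:
                sentences.add(text); // sentence is final
                partial = "";
                break;
            case EVENT_ASR_ERROR:
                lastError = resultCode; // resultCode is only meaningful for this event
                break;
            case EVENT_TRANSCRIBER_COMPLETE:
                complete = true;     // safe point to treat the task as finished
                break;
            default:
                break;               // other events need no bookkeeping in this sketch
        }
    }
}
```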
