This document provides a detailed guide on using the Paraformer real-time speech recognition Android software development kit (SDK) to convert speech to text.
User guide: For an introduction to the models and selection recommendations, see Real-time speech recognition - Fun-ASR/Gummy/Paraformer.
Online experience: An online experience is available only for paraformer-realtime-v2, paraformer-realtime-8k-v2, and paraformer-realtime-v1.
Getting started
-
Obtain an API key: Get an API key. For security purposes, configure the API key as an environment variable.
Note: To grant temporary access permissions to third-party applications or users, or to strictly control risky operations such as accessing or deleting sensitive data, use a temporary API key. A temporary API key is valid for 60 seconds by default and must be reacquired after it expires.
-
Download the SDK and run the sample code:
-
Unzip the package. Obtain the AAR-format SDK from the app/libs folder and add it to your project dependencies.
For Android C++ integration, use the android_libs and android_include folders in the ZIP package to obtain the dynamic libraries and header files.
-
Open the project in Android Studio. The sample code is in
DashParaformerSpeechTranscriberActivity.java. Replace the API key to test the feature.
Invocation steps
-
Initialize the SDK.
-
Set the required parameters. Use the parameters parameter of the initialize method to set connection and control parameters. Use the setParams method to set speech recognition effect parameters.
-
Call startDialog to start the recognition process.
-
In the onNuiAudioStateChanged callback, start the recording device based on the audio state.
-
In the onNuiNeedAudioData callback, continuously provide recording data.
-
In the onNuiEventCallback callback, listen for events and retrieve the speech recognition results.
-
Call stopDialog to stop recognition. Confirm that recognition has ended by listening for the EVENT_TRANSCRIBER_COMPLETE event.
-
When the recognition feature is no longer needed, call the release method to release SDK resources.
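The call order described in the steps above can be made explicit with a small guard object. The following is a minimal sketch, not part of the SDK: RecognitionLifecycle and all of its method names are hypothetical, and the sketch assumes that a new startDialog may follow EVENT_TRANSCRIBER_COMPLETE without calling setParams again, which this document does not confirm.

```java
/** Hypothetical call-order guard for the invocation steps; not an SDK class. */
final class RecognitionLifecycle {
    enum State { RELEASED, INITIALIZED, CONFIGURED, RUNNING, STOPPING }

    private State state = State.RELEASED;

    State state() { return state; }

    // Step 1: initialize. The SDK is a singleton, so this is only valid after release.
    void initialized() { move(State.RELEASED, State.INITIALIZED); }

    // Step 2: setParams must be called before startDialog.
    void paramsSet() { move(State.INITIALIZED, State.CONFIGURED); }

    // Step 3: startDialog begins recognition.
    void dialogStarted() { move(State.CONFIGURED, State.RUNNING); }

    // Step 7, first half: stopDialog was called; the final result is still pending.
    void dialogStopRequested() { move(State.RUNNING, State.STOPPING); }

    // Step 7, second half: EVENT_TRANSCRIBER_COMPLETE confirms that recognition ended.
    // Assumption: the instance can then start another dialog with the same params.
    void transcriberComplete() { move(State.STOPPING, State.CONFIGURED); }

    // Step 8: release is accepted from any state except RELEASED.
    void released() {
        if (state == State.RELEASED) {
            throw new IllegalStateException("release called before initialize");
        }
        state = State.RELEASED;
    }

    private void move(State expected, State next) {
        if (state != expected) {
            throw new IllegalStateException("expected " + expected + " but was " + state);
        }
        state = next;
    }
}
```

An application could call these methods next to the corresponding SDK calls and callbacks so that an out-of-order call fails fast during development.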
Request parameters
Connection and control parameters
You can configure these parameters by passing a JSON string in the parameters parameter of the initialize method.
-
Parameter example: The following is a sample JSON string. Not all parameters are listed. Add other parameters as needed in your code:
{ "url": "wss://dashscope.aliyuncs.com/api-ws/v1/inference", "apikey": "st-****", "device_id": "my_device_id", "service_mode": "1" }
-
Parameter descriptions
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | String | Yes | The service endpoint. This is fixed at wss://dashscope.aliyuncs.com/api-ws/v1/inference. |
| apikey | String | Yes | The API key. To reduce the risk of a long-term key being exposed, use a more secure temporary API key with a short validity period. |
| service_mode | String | Yes | The operating mode. For real-time speech recognition, this is fixed at "1". |
| device_id | String | Yes | A unique string that identifies the end user. You can set this to an in-app user ID or a unique device identifier generated by the client. This ID is mainly used for log tracking and troubleshooting. |
| debug_path | String | No | The storage path for log files. This parameter takes effect only when save_log is set to true in the initialize call. You must set a log file path; otherwise, an error occurs. A maximum of two log files are kept locally. |
| save_wav | String | No | Specifies whether to save audio files for debugging. The audio files are saved to the path specified by debug_path. Default value: "false". Valid values: "true" (yes), "false" (no). This parameter takes effect only when save_log is set to true in the initialize call. The debug_path parameter must also be set. |
| max_log_file_size | int | No | The maximum size of a log file, in bytes. This parameter takes effect only when save_log is set to true in the initialize call. Default value: 104857600 (100 × 1024 × 1024 bytes, which is 100 MiB). |
| log_track_level | int | No | Controls the filter level of log content sent externally through the log callback (onNuiLogTrackCallback). Default value: 2. Valid values: 0 (LOG_LEVEL_VERBOSE), 1 (LOG_LEVEL_DEBUG), 2 (LOG_LEVEL_INFO), 3 (LOG_LEVEL_WARNING), 4 (LOG_LEVEL_ERROR), 5 (LOG_LEVEL_NONE, disables this feature). |
Note: log_track_level and level (set through the initialize interface) together determine which logs are sent to the callback. A log is sent to the callback only if its level value is greater than or equal to both the log_track_level and level values. For example, if log_track_level is set to 2 (INFO) and level is set to 3 (WARNING), only logs at the WARNING level and higher (value >= 3) are sent to the callback.
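The filtering rule in the note above can be sketched in a few lines. This is an illustration only: LogTrackFilter and isDelivered are hypothetical names, not SDK APIs, and the sketch assumes that the level argument of initialize maps onto the same 0 to 5 numeric scale as log_track_level, as the example in the note suggests.

```java
/** Hypothetical helper that mirrors the documented log filtering rule; not an SDK class. */
final class LogTrackFilter {
    // Numeric values as listed for log_track_level.
    static final int VERBOSE = 0;
    static final int DEBUG = 1;
    static final int INFO = 2;
    static final int WARNING = 3;
    static final int ERROR = 4;
    static final int NONE = 5;

    /**
     * Returns true if a log with the given level would reach the callback,
     * that is, if its level is greater than or equal to both thresholds.
     */
    static boolean isDelivered(int logLevel, int logTrackLevel, int initializeLevel) {
        return logLevel >= Math.max(logTrackLevel, initializeLevel);
    }
}
```

With log_track_level set to INFO and level set to WARNING, an INFO log is filtered out while WARNING and ERROR logs pass. Setting log_track_level to NONE filters out every log, because no log level listed above reaches 5.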
Speech recognition effect parameters
You can configure these parameters by passing a JSON string in the params parameter of the setParams method.
-
Parameter example: The following is a sample JSON string. Not all parameters are listed. Add other parameters as needed in your code:
{ "service_type": 4, "nls_config": { "model": "paraformer-realtime-v2", "sr_format": "pcm", "sample_rate": "16000" } }
-
Parameter descriptions
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| service_type | int | Yes | The voice service type. For real-time speech recognition, this is fixed at 4. |
| nls_config | object | Yes | The core configuration object for speech recognition. It contains key parameters such as model selection and recognition effect controls. |
| nls_config.model | string | Yes | The speech recognition model. |
| nls_config.sr_format | string | Yes | The format of the audio to be recognized. Supported audio formats: pcm, wav, opus. Important: for opus, the input must be PCM encoded, and the SDK internally encodes it into the OPUS format. For wav and pcm, the input must be PCM encoded. |
| nls_config.sample_rate | int | Yes | The sample rate of the audio to be recognized, in Hz. This varies by model: paraformer-realtime-v2 supports any sample rate; paraformer-realtime-v1 supports only 16000 Hz; paraformer-realtime-8k-v2 supports only 8000 Hz; paraformer-realtime-8k-v1 supports only 8000 Hz. |
| nls_config.disfluency_removal_enabled | boolean | No | Specifies whether to filter out filler words, such as "um" and "ah". Default value: false. |
| nls_config.language_hints | array[string] | No | Sets the language codes for recognition. If you cannot determine the language in advance, leave this unset and the model automatically detects the language. Supported language codes: zh (Chinese), en (English), ja (Japanese), yue (Cantonese), ko (Korean), de (German), fr (French), ru (Russian). This parameter takes effect only for multilingual models. |
| nls_config.semantic_punctuation_enabled | boolean | No | Sets the sentence segmentation mode. Default value: false. Valid values: true (enables semantic segmentation and disables VAD segmentation), false (enables VAD segmentation and disables semantic segmentation). Semantic segmentation provides higher accuracy and is suitable for meeting transcription scenarios. Voice Activity Detection (VAD) segmentation has lower latency and is suitable for real-time interactive scenarios. This parameter takes effect only for v2 and later models. |
| nls_config.max_sentence_silence | int | No | The silence duration threshold for VAD segmentation, in ms. Default value: 800. Value range: [200, 6000]. When the silence duration after a speech segment exceeds this threshold, the system determines that the sentence has ended. This parameter takes effect only for v2 and later models, and only when semantic_punctuation_enabled is false. |
| nls_config.multi_threshold_mode_enabled | boolean | No | Specifies whether to enable the long-segment prevention mode. Enabling this prevents VAD from creating excessively long segments. Default value: false (disabled). Valid values: true (enabled), false (disabled). This parameter takes effect only for v2 and later models, and only when semantic_punctuation_enabled is false. |
| nls_config.punctuation_prediction_enabled | boolean | No | Specifies whether to automatically add punctuation to the recognition results. Default value: true (yes). Valid values: true (yes), false (no). This parameter takes effect only for v2 and later models. |
| nls_config.heartbeat | boolean | No | Specifies whether to maintain a persistent connection with the server. Default value: false. Valid values: true (the connection to the server is maintained without interruption as long as silent audio is continuously sent), false (the connection is closed due to a timeout after 60 seconds, even if silent audio is continuously sent; this 60-second timeout is a default server-side behavior and cannot be configured by the client). This parameter takes effect only for v2 and later models. |
| nls_config.inverse_text_normalization_enabled | boolean | No | Specifies whether to enable Inverse Text Normalization (ITN). When enabled, Chinese numerals are converted to Arabic numerals. Default value: true (enabled). Valid values: true (enabled), false (disabled). This parameter takes effect only for v2 and later models. |
| nls_config.vocabulary_id | string | No | The ID of the hotword vocabulary, used to improve the recognition accuracy of specific words. This parameter applies to v2 and later models. For information about how to use hotwords, see Customize hotwords. |
| nls_config.resources | array[object] | No | The hotword resource configuration for v1 models. This feature is the same as vocabulary_id but is configured differently: resources is an array of objects, and each element contains a resource_id field (a string that specifies the hotword ID) and a resource_type field (a string with a fixed value of "asr_phrase"). Example: { "nls_config": { "resources": [ { "resource_id": "xxxxxxxxxxxx", "resource_type": "asr_phrase" } ] } } For information about how to use hotwords, see Customize and manage Paraformer speech recognition hotwords. |
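Several of the constraints above, namely the per-model sample rate and the max_sentence_silence range, can be checked on the client before calling setParams. The following is a minimal sketch under stated assumptions: NlsConfigCheck and validate are hypothetical names, not SDK APIs, and the rules are copied from the parameter descriptions above rather than from any server-side validation logic.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-flight check for nls_config values; not an SDK class. */
final class NlsConfigCheck {
    /** Returns a list of human-readable problems; an empty list means no rule was violated. */
    static List<String> validate(String model, int sampleRateHz, int maxSentenceSilenceMs) {
        List<String> problems = new ArrayList<>();
        switch (model) {
            case "paraformer-realtime-v2":
                break; // documented as supporting any sample rate
            case "paraformer-realtime-v1":
                if (sampleRateHz != 16000) {
                    problems.add(model + " supports only 16000 Hz, got " + sampleRateHz);
                }
                break;
            case "paraformer-realtime-8k-v2":
            case "paraformer-realtime-8k-v1":
                if (sampleRateHz != 8000) {
                    problems.add(model + " supports only 8000 Hz, got " + sampleRateHz);
                }
                break;
            default:
                problems.add("no sample rate rule is listed for model " + model);
        }
        // Documented value range for max_sentence_silence is [200, 6000] ms.
        if (maxSentenceSilenceMs < 200 || maxSentenceSilenceMs > 6000) {
            problems.add("max_sentence_silence must be within [200, 6000], got "
                    + maxSentenceSilenceMs);
        }
        return problems;
    }
}
```

For example, validate("paraformer-realtime-v1", 8000, 800) reports one problem, while validate("paraformer-realtime-v2", 44100, 800) reports none.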
Key interfaces
NativeNui
initialize
Initializes the speech recognition SDK instance. The SDK is a singleton and cannot be initialized again until you call release.
This call is blocking, so make it on a non-UI thread.
-
Method signature
public synchronized int initialize(final INativeNuiCallback callback, String parameters, final Constants.LogLevel level, final boolean save_log)
-
Parameter descriptions
| Parameter | Type | Description |
| --- | --- | --- |
| callback | INativeNuiCallback | The implementation of the event and data callback interface. |
| parameters | String | A JSON string that contains authentication, connection, and debugging parameters. For more information, see Connection and control parameters. |
| level | Constants.LogLevel | Controls the print level of the SDK's own logs. |
| save_log | boolean | Specifies whether to save binary logs. If set to true, you must specify a path with debug_path in the connection and control parameters. You can also set the file size with max_log_file_size. |
-
Return value
Returns an error code. For more information, see Query error codes.
setParams
Sets speech recognition effect parameters in JSON format. You must call this method before you call startDialog.
-
Method signature
public synchronized int setParams(String params)
-
Parameter descriptions
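| Parameter | Type | Description |
| --- | --- | --- |
| params | String | A JSON string that contains the speech recognition effect parameters. For more information, see Speech recognition effect parameters. |
-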
Return value
Returns an error code. For more information, see Query error codes.
startDialog
Starts the recognition process.
-
Method signature
public synchronized int startDialog(VadMode vad_mode, String dialog_params)
-
Parameter descriptions
| Parameter | Type | Description |
| --- | --- | --- |
| vad_mode | VadMode | The VAD mode. This is fixed at VadMode.TYPE_P2T. |
| dialog_params | String | Use this parameter to supply a new temporary API key when the key set in the apikey connection and control parameter has expired. The content is in JSON format: { "apikey": "st-****" } |
-
Return value
Returns an error code. For more information, see Query error codes.
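Because a temporary API key is valid for 60 seconds by default, an application may need to check the key's age and pass a fresh key through dialog_params. The following is a minimal sketch: DialogParams, forApiKey, and isExpired are hypothetical helper names, not SDK APIs, and how the new key is fetched is outside the scope of this document. On Android, org.json.JSONObject would usually build the string instead of manual concatenation.

```java
/** Hypothetical helper for building the dialog_params string; not an SDK class. */
final class DialogParams {
    /** Builds a JSON object of the form {"apikey":"..."} for the dialog_params argument. */
    static String forApiKey(String apiKey) {
        return "{\"apikey\":\"" + escapeJson(apiKey) + "\"}";
    }

    /** Returns true once the validity window that started at issuedAtMillis has elapsed. */
    static boolean isExpired(long issuedAtMillis, long nowMillis, long validitySeconds) {
        return nowMillis - issuedAtMillis >= validitySeconds * 1000L;
    }

    // Escapes only backslashes and double quotes, which is enough for typical key strings.
    private static String escapeJson(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

A caller might evaluate isExpired(issuedAt, System.currentTimeMillis(), 60) before each startDialog and, if it returns true, obtain a new temporary key and pass forApiKey(newKey) as dialog_params.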
stopDialog
Stops the recognition process. After this method is called, the server returns the final recognition result and ends the task.
-
Method signature
public synchronized int stopDialog();
-
Return value
Returns an error code. For more information, see Query error codes.
cancelDialog
Immediately stops the recognition process. After this method is called, the task ends immediately without waiting for the server to return the final recognition result.
-
Method signature
public synchronized int cancelDialog();
-
Return value
Returns an error code. For more information, see Query error codes.
release
Releases all internal resources of the SDK. After this method is called, the SDK instance becomes unavailable. To use the instance again, you must call initialize to re-initialize it.
-
Method signature
public synchronized int release();
-
Return value
Returns an error code. For more information, see Query error codes.
GetVersion
Retrieves the current SDK version information.
-
Method signature
public synchronized String GetVersion();
-
Return value
The current SDK version information.
INativeNuiCallback: Listener callbacks
onNuiEventCallback: Listen for events and speech recognition results
-
Method signature
void onNuiEventCallback(NuiEvent event, final int resultCode, final int arg2, KwsResult kwsResult, AsrResult asrResult);
-
Parameter descriptions
| Parameter | Type | Description |
| --- | --- | --- |
| event | NuiEvent | The callback event. |
| resultCode | int | The error code. This is valid when an EVENT_ASR_ERROR event occurs. |
| asrResult | AsrResult | The speech recognition result. |
| kwsResult | KwsResult | The voice wake-up result. It is not used for speech recognition, and you can ignore this parameter. |
| arg2 | int | Reserved parameter. |
onNuiAudioStateChanged: Listen for audio state
The SDK uses this callback to notify you when to start or stop recording.
-
Method signature
void onNuiAudioStateChanged(AudioState state);
-
AudioState descriptions
| State | Description |
| --- | --- |
| STATE_OPEN | The interaction starts. You can open the recording device to record audio. |
| STATE_PAUSE | The interaction stops. You can stop recording. |
| STATE_CLOSE | The SDK instance is released. You can completely shut down the recording device. |
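The mapping from audio states to recorder actions can be sketched as a simple switch. To keep the example self-contained, it declares its own AudioState enum as a stand-in for the SDK type and a hypothetical Recorder interface. In an Android app, the recorder would typically wrap android.media.AudioRecord, whose startRecording(), stop(), and release() methods correspond to the three actions.

```java
/** Sketch of an audio state handler; the nested types are stand-ins, not SDK classes. */
final class AudioStateHandler {
    /** Stand-in for the SDK's AudioState type, using the documented state names. */
    enum AudioState { STATE_OPEN, STATE_PAUSE, STATE_CLOSE }

    /** Hypothetical abstraction over the recording device. */
    interface Recorder {
        void start();
        void stop();
        void release();
    }

    private final Recorder recorder;

    AudioStateHandler(Recorder recorder) {
        this.recorder = recorder;
    }

    /** Mirrors the callback: start, stop, or shut down the recorder based on the state. */
    void onNuiAudioStateChanged(AudioState state) {
        switch (state) {
            case STATE_OPEN:
                recorder.start();   // interaction starts: begin recording
                break;
            case STATE_PAUSE:
                recorder.stop();    // interaction stops: stop recording
                break;
            case STATE_CLOSE:
                recorder.release(); // SDK released: shut the device down completely
                break;
        }
    }
}
```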
onNuiNeedAudioData: Fill in audio data for recognition
After the recognition process starts, this callback is triggered continuously. You must provide the audio data to be recognized in this callback.
-
Method signature
int onNuiNeedAudioData(byte[] buffer, int len);
-
Parameter descriptions
| Parameter | Type | Description |
| --- | --- | --- |
| buffer | byte[] | The audio data to fill. |
| len | int | The number of bytes of the audio data to fill. |
-
Return value
The actual number of bytes filled.
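A possible shape for this callback is shown below. The sketch reads from a java.io.InputStream so that it runs without Android. In an app, the source would typically be android.media.AudioRecord and its read(byte[], int, int) method. AudioFeeder is a hypothetical class name, and returning 0 when no data is available or a read fails is an assumption, since this document only states that the return value is the number of bytes actually filled.

```java
import java.io.IOException;
import java.io.InputStream;

/** Sketch of an audio data provider; the InputStream stands in for the microphone. */
final class AudioFeeder {
    private final InputStream pcmSource;

    AudioFeeder(InputStream pcmSource) {
        this.pcmSource = pcmSource;
    }

    /** Fills up to len bytes of PCM data into buffer and returns the number of bytes filled. */
    int onNuiNeedAudioData(byte[] buffer, int len) {
        try {
            // Never read more than the buffer can hold.
            int read = pcmSource.read(buffer, 0, Math.min(len, buffer.length));
            // InputStream.read returns -1 at end of stream; report that as 0 bytes filled.
            return Math.max(read, 0);
        } catch (IOException e) {
            // Assumption: a failed read is reported as 0 bytes filled.
            return 0;
        }
    }
}
```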
onNuiLogTrackCallback: Listen for tracking logs
This callback is used to receive detailed internal logs from the SDK for troubleshooting and debugging.
default void onNuiLogTrackCallback(Constants.LogLevel level, String log)
NuiEvent: Event types
| Event | Description |
| --- | --- |
| EVENT_TRANSCRIBER_STARTED | The task started successfully. |
| EVENT_VAD_START | This event is triggered after the task starts. It does not indicate that the start of a human voice has been detected. |
| EVENT_VAD_END | The end of a human voice is detected. |
| EVENT_ASR_PARTIAL_RESULT | Intermediate speech recognition result. |
| EVENT_ASR_ERROR | An error occurred during speech recognition. |
| EVENT_MIC_ERROR | Triggered because no audio data was received for 2 consecutive seconds. |
| EVENT_SENTENCE_END | The end of a sentence is detected. A complete recognition result for the sentence is returned. |
| EVENT_TRANSCRIBER_COMPLETE | Speech recognition is complete. |
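One way to consume these events is to keep the latest intermediate text, append each finished sentence, and record errors and completion. The following is a minimal sketch with stand-in types: the nested NuiEvent enum copies the event names from the table, TranscriptCollector is a hypothetical class, and the recognized text is passed as a plain String because this document does not describe the fields of AsrResult.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Sketch of an event consumer; the nested enum is a stand-in for the SDK's NuiEvent. */
final class TranscriptCollector {
    enum NuiEvent {
        EVENT_TRANSCRIBER_STARTED, EVENT_VAD_START, EVENT_VAD_END,
        EVENT_ASR_PARTIAL_RESULT, EVENT_ASR_ERROR, EVENT_MIC_ERROR,
        EVENT_SENTENCE_END, EVENT_TRANSCRIBER_COMPLETE
    }

    private final List<String> sentences = new ArrayList<>();
    private String partial = "";
    private int lastErrorCode = 0;
    private boolean complete = false;

    /** resultText is a stand-in for whatever text the application extracts from AsrResult. */
    void onEvent(NuiEvent event, int resultCode, String resultText) {
        switch (event) {
            case EVENT_ASR_PARTIAL_RESULT:
                partial = resultText;        // intermediate result, replaced on each update
                break;
            case EVENT_SENTENCE_END:
                sentences.add(resultText);   // complete result for one sentence
                partial = "";
                break;
            case EVENT_ASR_ERROR:
                lastErrorCode = resultCode;  // resultCode is valid for this event
                break;
            case EVENT_TRANSCRIBER_COMPLETE:
                complete = true;             // recognition has ended
                break;
            default:
                break;                       // other events need no bookkeeping here
        }
    }

    List<String> sentences() { return Collections.unmodifiableList(sentences); }
    String partial() { return partial; }
    int lastErrorCode() { return lastErrorCode; }
    boolean isComplete() { return complete; }
}
```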