Fun-ASR Real-Time Speech Recognition Android SDK


This document describes how to use the Fun-ASR real-time speech recognition Android SDK to convert speech into text.

User guide: For model descriptions and selection recommendations, see Real-Time Speech Recognition - Fun-ASR/Gummy/Paraformer.

Getting Started

  1. Obtain your API key and API host.

  2. Download the SDK and run the sample code:

    • Download the latest SDK package.

    • Unzip the package. Add the AAR SDK from the app/libs folder to your project dependencies.
      For C++ integration on Android, use the dynamic libraries in the android_libs folder and the header files in the android_include folder of the package.

    • Open the project in Android Studio. The sample code is in DashFunAsrSpeechTranscriberActivity.java. Replace the API key in the sample with your own to test the feature.

Call Steps

  1. Initialize the SDK.

  2. Set parameters based on your business needs. Pass connection and control parameters through the parameters argument of initialize, and speech recognition effect parameters through the params argument of setParams.

  3. Call startDialog to start the recognition process.

  4. In the onNuiAudioStateChanged callback, open or close the audio recording device according to the reported audio state.

  5. In the onNuiNeedAudioData callback, continuously provide recorded audio data.

  6. In the onNuiEventCallback callback, listen for events and retrieve speech recognition results.

  7. Call stopDialog to stop the recognition. You can confirm that the recognition has ended by listening for the EVENT_TRANSCRIBER_COMPLETE event.

  8. When you no longer need the recognition feature, call the release method to release SDK resources.
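The call steps above can be sketched in plain Java as follows. This is a minimal sketch under stated assumptions, not the SDK's API: SpeechSession is a hypothetical interface whose methods stand in for the NativeNui calls (the real initialize also takes a callback, a log level, and save_log, and the real startDialog also takes a VadMode), and a return code of 0 is assumed to mean success. Your onNuiEventCallback is expected to count down the latch when EVENT_TRANSCRIBER_COMPLETE arrives, so that resources are not released before the final result is delivered.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical interface standing in for the NativeNui calls; not the SDK's API.
interface SpeechSession {
    int initialize(String connectionParams);
    int setParams(String effectParams);
    int startDialog(String dialogParams);
    int stopDialog();
    int release();
}

final class SessionRunner {
    static final int SUCCESS = 0;  // assumption: see the Error Code Reference
    static final int TIMEOUT = -1; // hypothetical local code, not an SDK error code

    private SessionRunner() {}

    /** Runs one session in the documented order and returns the first failing code. */
    static int runSession(SpeechSession s, String initJson, String paramsJson,
                          Runnable feedAudio, CountDownLatch complete,
                          long timeoutMs) throws InterruptedException {
        int rc = s.initialize(initJson);     // 1. blocking: call off the UI thread
        if (rc != SUCCESS) return rc;
        try {
            rc = s.setParams(paramsJson);    // 2. effect parameters, before startDialog
            if (rc != SUCCESS) return rc;
            rc = s.startDialog("{}");        // 3. start recognition
            if (rc != SUCCESS) return rc;
            feedAudio.run();                 // 4-6. audio and results flow via callbacks
            rc = s.stopDialog();             // 7. ask the server for the final result
            if (rc != SUCCESS) return rc;
            // Wait for EVENT_TRANSCRIBER_COMPLETE before tearing down.
            return complete.await(timeoutMs, TimeUnit.MILLISECONDS) ? SUCCESS : TIMEOUT;
        } finally {
            s.release();                     // 8. free SDK resources
        }
    }
}
```

Releasing inside runSession suits an app that runs a single session; an app that recognizes repeatedly would keep the instance initialized and call release only when the feature is no longer needed.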

Request Parameters

Connection and Control Parameters

You can configure these parameters by passing a JSON string through the parameters argument of the initialize method.
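If you assemble this string in code instead of hard-coding it, a helper along these lines keeps the keys consistent and escapes caller-supplied values such as device_id. It is a sketch with hypothetical names (InitParams, build); on Android you would more likely use org.json.JSONObject, which ships with the platform, but plain string building keeps the example runnable on any JVM.

```java
// Hypothetical helper that builds the JSON string for the parameters argument of initialize.
final class InitParams {
    static final String URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";

    private InitParams() {}

    /** Quotes s as a JSON string literal, escaping quotes, backslashes, and control characters. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    /** debugPath may be null; it is required only when initialize is called with save_log = true. */
    static String build(String apiKey, String deviceId, String debugPath) {
        StringBuilder sb = new StringBuilder("{");
        sb.append("\"url\":").append(quote(URL));
        sb.append(",\"apikey\":").append(quote(apiKey));
        sb.append(",\"device_id\":").append(quote(deviceId));
        sb.append(",\"service_mode\":\"1\"");   // "1" = real-time speech recognition
        if (debugPath != null) {
            sb.append(",\"debug_path\":").append(quote(debugPath));
        }
        return sb.append('}').toString();
    }
}
```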

  • Parameter example: The following code shows a sample JSON string. Not all parameters are listed. You can add other parameters as needed during implementation.

    {
        "url": "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference",
        "apikey": "st-****",
        "device_id": "my_device_id",
        "service_mode": "1"
    }
  • Parameter description

    • url (String, required): Service endpoint. Must be wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference.

    • apikey (String, required): API key.

    • service_mode (String, required): Operation mode. For real-time speech recognition, this must be "1".

    • device_id (String, required): A unique string identifying the end user, such as an in-app user ID or a client-generated device identifier. It is used mainly for log tracing and troubleshooting.

    • debug_path (String, optional): Storage path for log files.
      Takes effect only when save_log is set to true in the initialize call. In that case the path is mandatory, and omitting it causes an error.
      The system keeps at most two local log files.

    • save_wav (String, optional): Whether to save audio files for debugging. Audio files are saved in the debug_path directory.
      Default value: "false". Valid values: "true" (save) and "false" (do not save).
      Takes effect only when save_log is set to true in the initialize call; debug_path must also be set.

    • max_log_file_size (int, optional): Maximum size of each log file, in bytes.
      Takes effect only when save_log is set to true in the initialize call.
      Default value: 104857600 (100 × 1024 × 1024 bytes, or 100 MiB).

    • log_track_level (int, optional): Minimum level of the logs forwarded to an external destination through the onNuiLogTrackCallback callback.
      Default value: 2.
      Valid values:
      • 0: LOG_LEVEL_VERBOSE
      • 1: LOG_LEVEL_DEBUG
      • 2: LOG_LEVEL_INFO
      • 3: LOG_LEVEL_WARNING
      • 4: LOG_LEVEL_ERROR
      • 5: LOG_LEVEL_NONE (disables this feature)

    Note: log_track_level and level (set via the initialize interface) jointly determine which logs are finally sent through the callback. A log entry is sent only if its level number is greater than or equal to both log_track_level and level. For example, if log_track_level is set to 2 (INFO) and level is set to 3 (WARNING), only logs at WARNING level or higher (numeric value ≥ 3) are sent.
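The filtering rule in the note above can be expressed as a one-line predicate. This sketch assumes that the level argument of initialize uses the same 0 to 5 numbering as log_track_level, which the example in the note implies; LogTrackFilter and isForwarded are hypothetical names.

```java
// Hypothetical sketch of the documented rule for which logs reach onNuiLogTrackCallback.
// Numeric levels follow log_track_level: 0 VERBOSE, 1 DEBUG, 2 INFO, 3 WARNING, 4 ERROR, 5 NONE.
final class LogTrackFilter {
    static final int LOG_LEVEL_NONE = 5;

    private LogTrackFilter() {}

    /** True if an entry at entryLevel passes both log_track_level and initialize's level. */
    static boolean isForwarded(int entryLevel, int logTrackLevel, int initLevel) {
        if (logTrackLevel == LOG_LEVEL_NONE) {
            return false;                                   // 5 disables forwarding entirely
        }
        return entryLevel >= Math.max(logTrackLevel, initLevel);
    }
}
```

With the note's example (log_track_level 2, level 3), isForwarded(2, 2, 3) is false and isForwarded(3, 2, 3) is true.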

Speech Recognition Effect Parameters

You can configure these parameters by passing a JSON string through the params argument of the setParams method.

  • Parameter example: The following code shows a sample JSON string. Not all parameters are listed. You can add other parameters as needed during implementation.

    {
        "service_type": 4,
        "nls_config": {
            "model": "fun-asr-realtime",
            "sr_format": "pcm",
            "sample_rate": 16000,
            "parameters": {
                "speech_noise_threshold": 0.0
            }
        }
    }
  • Parameter description

    • service_type (int, required): Voice service type. For real-time speech recognition, this must be 4.

    • nls_config (object, required): Core speech recognition configuration, including model selection and recognition behavior.

    • nls_config.model (string, required): Speech recognition model, for example fun-asr-realtime.

    • nls_config.sr_format (string, required): Audio format for recognition. Supported formats: pcm, wav, opus.
      Important: Whichever format you choose, the audio you supply must be PCM-encoded. With opus, the SDK encodes the PCM audio to OPUS internally before sending it.

    • nls_config.sample_rate (int, required): Audio sampling rate in Hz. Only 16000 is supported.

    • nls_config.semantic_punctuation_enabled (boolean, optional): Sentence segmentation mode.
      Default value: false.
      Valid values:
      • true: Semantic punctuation is used and VAD (Voice Activity Detection) segmentation is disabled.
      • false: VAD segmentation is used and semantic punctuation is disabled.
      Semantic punctuation is more accurate and suits meeting transcription. VAD segmentation has lower latency and suits real-time interactive scenarios.

    • nls_config.max_sentence_silence (int, optional): VAD silence threshold for sentence segmentation, in milliseconds. When the silence after speech exceeds this threshold, the sentence is considered complete.
      Default value: 800. Valid range: [200, 6000].
      Takes effect only when semantic_punctuation_enabled is false.

    • nls_config.multi_threshold_mode_enabled (boolean, optional): Whether to prevent VAD segmentation from producing overly long sentences.
      Default value: false (disabled). Valid values: true (enabled), false (disabled).
      Takes effect only when semantic_punctuation_enabled is false.

    • nls_config.heartbeat (boolean, optional): Whether to keep the connection to the server alive.
      Default value: false.
      Valid values:
      • true: The connection stays open even while only silent audio is being sent.
      • false: The connection is closed after 60 seconds of inactivity, even if silent audio is still being sent. This 60-second timeout is a server-side default and cannot be configured on the client.

    • nls_config.vocabulary_id (string, optional): ID of a hotword vocabulary that improves recognition accuracy for specific terms. For instructions, see Customize Hotwords.

    • nls_config.language_hints (array[string], optional): Language code for recognition. If the language is not known in advance, leave this parameter unset and the model identifies the language automatically.
      Only the first value in the array is used; any other values are ignored.
      Supported language codes by model:
      • fun-asr-realtime, fun-asr-realtime-2025-11-07: zh (Chinese), en (English), ja (Japanese)
      • fun-asr-realtime-2025-09-15: zh (Chinese), en (English)

    • nls_config.parameters (object, optional): Additional parameters, as a JSON object.

    • nls_config.parameters.speech_noise_threshold (float, optional): Speech-noise detection threshold, which controls VAD sensitivity.
      Range: [-1.0, 1.0].
      Guidelines:
      • Near -1: Lowers the noise threshold, so more noise may be transcribed as speech.
      • Near +1: Raises the noise threshold, so some speech may be filtered out as noise.
      Important: This is an advanced parameter, and adjustments can significantly affect recognition quality.
      • Test thoroughly before adjusting.
      • Make small adjustments (step size 0.1) based on your audio environment.
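Several of these constraints are easy to violate silently, for example a max_sentence_silence outside [200, 6000], or one set while semantic punctuation is on, so it can help to validate them before calling setParams. The following is a sketch with hypothetical names (AsrParams, build). It rejects values the parameter descriptions above mark as invalid or ineffective, which is a stricter policy than the service itself may apply, and it does not escape the model or language strings, assuming they are plain identifiers such as fun-asr-realtime and zh.

```java
import java.util.Set;

// Hypothetical builder for the setParams JSON that checks the documented constraints.
final class AsrParams {
    private static final Set<String> FORMATS = Set.of("pcm", "wav", "opus");

    private AsrParams() {}

    /** Null arguments are omitted from the JSON so the service defaults apply. */
    static String build(String model, String format, boolean semanticPunctuation,
                        Integer maxSentenceSilenceMs, String languageHint,
                        Double speechNoiseThreshold) {
        if (!FORMATS.contains(format)) {
            throw new IllegalArgumentException("sr_format must be pcm, wav, or opus");
        }
        StringBuilder nls = new StringBuilder();
        nls.append("\"model\":\"").append(model).append('"');
        nls.append(",\"sr_format\":\"").append(format).append('"');
        nls.append(",\"sample_rate\":16000");                 // only 16000 Hz is supported
        nls.append(",\"semantic_punctuation_enabled\":").append(semanticPunctuation);
        if (maxSentenceSilenceMs != null) {
            if (semanticPunctuation) {                        // ignored unless VAD segmentation is used
                throw new IllegalArgumentException("max_sentence_silence requires VAD segmentation");
            }
            if (maxSentenceSilenceMs < 200 || maxSentenceSilenceMs > 6000) {
                throw new IllegalArgumentException("max_sentence_silence must be in [200, 6000]");
            }
            nls.append(",\"max_sentence_silence\":").append(maxSentenceSilenceMs);
        }
        if (languageHint != null) {                           // only the first hint is read
            nls.append(",\"language_hints\":[\"").append(languageHint).append("\"]");
        }
        if (speechNoiseThreshold != null) {
            double t = speechNoiseThreshold;
            if (!(t >= -1.0 && t <= 1.0)) {                   // also rejects NaN
                throw new IllegalArgumentException("speech_noise_threshold must be in [-1.0, 1.0]");
            }
            nls.append(",\"parameters\":{\"speech_noise_threshold\":").append(t).append('}');
        }
        return "{\"service_type\":4,\"nls_config\":{" + nls + "}}";
    }
}
```

For example, build("fun-asr-realtime", "pcm", false, null, null, 0.0) produces the same settings as the sample JSON above, plus an explicit semantic_punctuation_enabled of false.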

Key Interfaces

NativeNui

initialize

Initializes the speech recognition SDK instance. The SDK uses the singleton pattern. Do not reinitialize the instance before you call release.

This method blocks. Call it from a non-UI thread.

  • Method signature

    public synchronized int initialize(final INativeNuiCallback callback,
                                       String parameters,
                                       final Constants.LogLevel level,
                                       final boolean save_log)
  • Parameter description

    • callback (INativeNuiCallback): Implementation of the event and data callback interface.

    • parameters (String): JSON string containing authentication, connection, and debugging parameters. See Connection and Control Parameters.

    • level (Constants.LogLevel): Controls the SDK's internal log printing level.

    • save_log (boolean): Whether to save local logs. If set to true, you must specify a path using debug_path in Connection and Control Parameters, and you can optionally limit the file size using max_log_file_size.

  • Return value description

    Returns an error code. For more information, see Error Code Reference.
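Because initialize blocks and must not run on the UI thread, one option is to submit it to a dedicated worker thread. This sketch uses only java.util.concurrent; SdkWorker is a hypothetical name, and the IntSupplier you pass would wrap your actual initialize call. A single-thread executor also keeps later SDK calls in submission order.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntSupplier;

// Hypothetical wrapper that runs blocking SDK calls off the calling (UI) thread.
final class SdkWorker implements AutoCloseable {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    /** Queues a blocking call; the Future yields its error code once it finishes. */
    Future<Integer> submit(IntSupplier blockingCall) {
        Callable<Integer> task = blockingCall::getAsInt;
        return executor.submit(task);
    }

    /** Stops accepting work; already-queued calls still run. */
    @Override
    public void close() {
        executor.shutdown();
    }
}
```

In an app you might call worker.submit(() -> nui.initialize(callback, params, level, false)), where nui, callback, params, and level are your own objects, and hand the resulting code back to the UI thread, for example through a Handler created on Looper.getMainLooper(). Do not call get() on the Future from the UI thread, since that would block it again.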

setParams

Sets the speech recognition effect parameters as a JSON string (see Speech Recognition Effect Parameters). Call this method before startDialog.

startDialog

Starts the recognition process.

  • Method signature

    public synchronized int startDialog(VadMode vad_mode, String dialog_params)
  • Parameter description

    • vad_mode (VadMode): VAD mode. Must be VadMode.TYPE_P2T.

    • dialog_params (String): If the apikey in Connection and Control Parameters is a temporary API key, pass a new key here once the previous one expires, formatted as JSON:

      {
        "apikey": "st-****"
      }
  • Return value description

    Returns an error code. For more information, see Error Code Reference.
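If you use temporary API keys, the app has to know when the current key expires and obtain a new one before the next startDialog. A minimal sketch, assuming you know each key's lifetime and have your own way to obtain a new key (the fetcher is hypothetical, for example a call to your backend); TemporaryKeyHolder is also a hypothetical name. It always sends the current key in dialog_params and refreshes it once it has expired, and it assumes keys contain no characters that need JSON escaping.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Hypothetical holder that supplies dialog_params JSON with a valid temporary key.
final class TemporaryKeyHolder {
    private final Supplier<String> fetcher;   // hypothetical: obtains a new st- key
    private final Duration lifetime;          // assumption: known validity of each key
    private String key;
    private Instant expiresAt = Instant.MIN;  // forces a fetch on first use

    TemporaryKeyHolder(Supplier<String> fetcher, Duration lifetime) {
        this.fetcher = fetcher;
        this.lifetime = lifetime;
    }

    /** Returns {"apikey":"..."} for startDialog, fetching a new key if the old one expired. */
    synchronized String dialogParams(Instant now) {
        if (!now.isBefore(expiresAt)) {
            key = fetcher.get();
            expiresAt = now.plus(lifetime);
        }
        return "{\"apikey\":\"" + key + "\"}";
    }
}
```

In practice you may want to pass a lifetime somewhat shorter than the real one, so that a key is not about to expire just as a dialog starts.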

stopDialog

Ends the recognition. After you call this method, the server returns the final recognition result and ends the task.

  • Method signature

    public synchronized int stopDialog();
  • Return value description

    Returns an error code. For more information, see Error Code Reference.

cancelDialog

Immediately ends the recognition. After you call this method, the task ends without waiting for the final recognition result from the server.

  • Method signature

    public synchronized int cancelDialog();
  • Return value description

    Returns an error code. For more information, see Error Code Reference.

release

Releases all internal SDK resources. After this method is called, the SDK instance becomes unusable. To use the instance again, you must reinitialize it by calling initialize.

  • Method signature

    public synchronized int release();
  • Return value description

    Returns an error code. For more information, see Error Code Reference.
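Because the SDK is a singleton that must not be initialized again before release is called, and is unusable after release until initialize runs again, some apps track that state explicitly. A sketch under assumptions: NuiLifecycle is hypothetical, the IntSupplier arguments wrap your real initialize and release calls, and 0 is assumed to mean success.

```java
import java.util.function.IntSupplier;

// Hypothetical guard for the documented initialize/release lifecycle.
final class NuiLifecycle {
    static final int SUCCESS = 0;  // assumption: see the Error Code Reference

    private boolean initialized;

    /** Runs initCall only if not already initialized; throws otherwise. */
    synchronized int initialize(IntSupplier initCall) {
        if (initialized) {
            throw new IllegalStateException("already initialized; call release() first");
        }
        int rc = initCall.getAsInt();
        initialized = (rc == SUCCESS);
        return rc;
    }

    /** Runs releaseCall if initialized; afterwards the instance counts as unusable. */
    synchronized int release(IntSupplier releaseCall) {
        if (!initialized) {
            return SUCCESS;            // nothing to release in this sketch
        }
        initialized = false;
        return releaseCall.getAsInt();
    }

    synchronized boolean isInitialized() {
        return initialized;
    }
}
```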

GetVersion

Retrieves the current SDK version information.

  • Method signature

    public synchronized String GetVersion();
  • Return value description

    The current SDK version information.

INativeNuiCallback: Listener Callbacks

onNuiEventCallback: Listen for Events and Speech Recognition Results

  • Method signature

    void onNuiEventCallback(NuiEvent event, final int resultCode, final int arg2, KwsResult kwsResult, AsrResult asrResult);
  • Parameter description

    • event (NuiEvent): Callback event.

    • resultCode (int): Error code. Valid only for the EVENT_ASR_ERROR event.

    • arg2 (int): Reserved parameter.

    • kwsResult (KwsResult): Wake-word detection result. Ignore this parameter.

    • asrResult (AsrResult): Speech recognition result.

onNuiAudioStateChanged: Listen for Audio State

The SDK uses this callback to notify your application when to start or stop recording.

  • Method signature

    void onNuiAudioStateChanged(AudioState state);
  • AudioState description

    • STATE_OPEN: Interaction started. Open the audio recording device.

    • STATE_PAUSE: Interaction stopped. Stop recording.

    • STATE_CLOSE: SDK instance released. Fully close the audio recording device.
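The three states map directly onto recorder actions. In this sketch MicState is a hypothetical mirror of the SDK's AudioState values and Recorder is a hypothetical interface. With Android's AudioRecord, start, stop, and release would plausibly correspond to startRecording(), stop(), and release(); that mapping is an assumption about how you manage your recorder, not something the SDK prescribes.

```java
// Hypothetical mirror of the SDK's AudioState values.
enum MicState { STATE_OPEN, STATE_PAUSE, STATE_CLOSE }

// Hypothetical recorder abstraction over your audio capture code.
interface Recorder {
    void start();
    void stop();
    void release();
}

// Translates each audio state notification into the matching recorder action.
final class AudioStateHandler {
    private final Recorder recorder;

    AudioStateHandler(Recorder recorder) {
        this.recorder = recorder;
    }

    void onAudioStateChanged(MicState state) {
        switch (state) {
            case STATE_OPEN:  recorder.start();   break; // interaction started
            case STATE_PAUSE: recorder.stop();    break; // interaction stopped
            case STATE_CLOSE: recorder.release(); break; // SDK instance released
        }
    }
}
```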

onNuiNeedAudioData: Provide Audio Data for Recognition

After recognition starts, the SDK calls this callback repeatedly. Supply the next chunk of audio data each time it is called.

  • Method signature

    int onNuiNeedAudioData(byte[] buffer, int len);
  • Parameter description

    • buffer (byte[]): Buffer to fill with audio data.

    • len (int): Number of bytes of audio data requested.

  • Return value description

    The number of bytes actually written to buffer.
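An implementation reads up to len bytes from the audio source into buffer and returns the count. This sketch reads from a generic InputStream (for example a PCM file during testing); AudioFeeder is a hypothetical name. It assumes 16-bit mono PCM, which this page does not state explicitly: at 16000 Hz that is 32000 bytes per second, so a 3200-byte request covers 100 ms.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical onNuiNeedAudioData helper that fills the SDK's buffer from a PCM stream.
final class AudioFeeder {
    // Assumption: 16-bit (2-byte) mono samples at the required 16000 Hz.
    static final int BYTES_PER_SECOND = 16000 * 2;

    private final InputStream pcm;

    AudioFeeder(InputStream pcm) {
        this.pcm = pcm;
    }

    /** Duration of audio represented by the given number of bytes, in milliseconds. */
    static long durationMs(int bytes) {
        return bytes * 1000L / BYTES_PER_SECOND;
    }

    /** Fills buffer with up to len bytes; returns the number of bytes actually written. */
    int onNeedAudioData(byte[] buffer, int len) {
        int want = Math.min(len, buffer.length);
        int filled = 0;
        try {
            while (filled < want) {
                int n = pcm.read(buffer, filled, want - filled);
                if (n < 0) {
                    break;                 // end of stream: return what we have
                }
                filled += n;
            }
        } catch (IOException e) {
            // Source failed; report only the bytes written before the error.
        }
        return filled;
    }
}
```

Looping until len bytes are read suits file streams; with a live microphone, a single AudioRecord.read(buffer, 0, len) call in its default blocking mode usually suffices. If no audio data arrives for 2 consecutive seconds, the SDK reports EVENT_MIC_ERROR.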

onNuiLogTrackCallback: Listen for Trace Logs

This callback receives detailed internal SDK logs that can be used for troubleshooting and debugging.

  • Method signature

    default void onNuiLogTrackCallback(Constants.LogLevel level, String log);

NuiEvent: Event Types

  • EVENT_TRANSCRIBER_STARTED: Task started successfully.

  • EVENT_VAD_START: Triggered immediately after the task starts. It does not indicate that the onset of speech was detected.

  • EVENT_VAD_END: Voice endpoint (end of speech) detected.

  • EVENT_ASR_PARTIAL_RESULT: Intermediate speech recognition result.

  • EVENT_ASR_ERROR: An error occurred during speech recognition. The error code is passed in resultCode.

  • EVENT_MIC_ERROR: Triggered when no audio data has been received for 2 consecutive seconds.

  • EVENT_SENTENCE_END: End of a sentence detected. Carries the complete recognition result for that sentence.

  • EVENT_TRANSCRIBER_COMPLETE: Speech recognition completed.
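A typical onNuiEventCallback keeps finished sentences separate from the sentence still in progress. In this sketch Event is a hypothetical mirror of the NuiEvent values, and text stands for the recognized text you extract from asrResult, whose format this page does not describe. It assumes each EVENT_ASR_PARTIAL_RESULT carries the full text of the current sentence so far, so the partial text is replaced rather than appended.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mirror of the NuiEvent values; the real enum lives in the SDK.
enum Event {
    EVENT_TRANSCRIBER_STARTED, EVENT_VAD_START, EVENT_VAD_END,
    EVENT_ASR_PARTIAL_RESULT, EVENT_ASR_ERROR, EVENT_MIC_ERROR,
    EVENT_SENTENCE_END, EVENT_TRANSCRIBER_COMPLETE
}

// Accumulates finished sentences plus the sentence currently being recognized.
final class TranscriptState {
    private final List<String> sentences = new ArrayList<>();
    private String partial = "";
    private boolean complete;
    private Integer errorCode;

    void onEvent(Event event, int resultCode, String text) {
        switch (event) {
            case EVENT_ASR_PARTIAL_RESULT:
                partial = text;            // assumption: cumulative text of the current sentence
                break;
            case EVENT_SENTENCE_END:
                sentences.add(text);       // final text for this sentence
                partial = "";
                break;
            case EVENT_ASR_ERROR:
                errorCode = resultCode;    // resultCode is valid only for this event
                break;
            case EVENT_TRANSCRIBER_COMPLETE:
                complete = true;
                break;
            default:
                break;                     // other events are not used in this sketch
        }
    }

    /** Finished sentences followed by the in-progress one, joined by separator. */
    String displayText(String separator) {
        List<String> parts = new ArrayList<>(sentences);
        if (!partial.isEmpty()) {
            parts.add(partial);
        }
        return String.join(separator, parts);
    }

    boolean isComplete() { return complete; }

    Integer errorCode() { return errorCode; }
}
```

Pass an empty separator for languages written without spaces, such as Chinese, and a space for English.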