Audio labeling templates

iTAG provides labeling templates for audio classification, audio segmentation, and automatic speech recognition (ASR). When you create a labeling job, you select a template based on your use case. This topic describes the use cases for these audio templates and their input and output data structures.

Background information

This topic describes the data structure for the following audio labeling templates:

Audio classification
Audio segmentation
Automatic speech recognition (ASR)

Audio classification

Audio classification assigns one or more predefined labels to an audio clip. This labeling template supports both single-label and multi-label classification.

Use cases
A common use case is ambient sound classification.

Data structure

Input data
Each line in the input .manifest file is a JSON object representing a single audio file to be labeled. Each object must contain the source field, nested under data, which gives the location of the audio file.
```
{"data":{"source":"oss://example-bucket.oss-cn-hangzhou.aliyuncs.com/iTAG/audio/1.wav"}}
...
```
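As a minimal sketch of how such a file might be produced, the following Python builds one JSON line per audio file in the shape of the sample above. The function name `build_manifest_lines` and the second object path are hypothetical and used only for illustration; they are not part of iTAG.

```python
import json


def build_manifest_lines(audio_uris):
    """Return one JSON line per audio file, shaped like the sample above."""
    lines = []
    for uri in audio_uris:
        # Each record nests the file location under data.source.
        record = {"data": {"source": uri}}
        # Compact separators reproduce the sample's whitespace-free layout.
        lines.append(json.dumps(record, separators=(",", ":")))
    return lines


# Hypothetical object paths, for illustration only.
uris = [
    "oss://example-bucket.oss-cn-hangzhou.aliyuncs.com/iTAG/audio/1.wav",
    "oss://example-bucket.oss-cn-hangzhou.aliyuncs.com/iTAG/audio/2.wav",
]
manifest_lines = build_manifest_lines(uris)
print("\n".join(manifest_lines))
```

Writing the joined lines to a file with the .manifest extension gives one record per line, as the input format requires.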

Output data

Each line in the output .manifest file is a JSON object containing the source audio file's location and its annotation results. The following example shows the JSON structure:

```
{
    "data": {
        "source": "oss://example-bucket.oss-cn-hangzhou.aliyuncs.com/audio/6.wav"
    },
    "label-1432993193909231616": {
        "results": [
            {
                "questionId": "1",
                "data": "Label 1",
                "markTitle": "single-choice",
                "type": "survey/value"
            }
        ]
    }
}
```
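The structure above can be read back with a few lines of Python. This is a sketch, not an official client: the function name `extract_answers` is hypothetical, and it assumes that annotation results always sit under a top-level key beginning with `label-`, as in the example.

```python
import json


def extract_answers(manifest_line):
    """Return (source, {markTitle: data}) for one output manifest line."""
    record = json.loads(manifest_line)
    source = record["data"]["source"]
    answers = {}
    for key, value in record.items():
        # Assumption: results live under keys that start with "label-".
        if key.startswith("label-"):
            for result in value.get("results", []):
                answers[result["markTitle"]] = result["data"]
    return source, answers


# One output line, copied from the example above.
sample_line = json.dumps({
    "data": {"source": "oss://example-bucket.oss-cn-hangzhou.aliyuncs.com/audio/6.wav"},
    "label-1432993193909231616": {
        "results": [
            {
                "questionId": "1",
                "data": "Label 1",
                "markTitle": "single-choice",
                "type": "survey/value",
            }
        ]
    },
})

source, answers = extract_answers(sample_line)
print(source, answers)
```

Keying the answers by `markTitle` keeps the lookup readable when a job defines more than one question.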

Audio segmentation

Audio segmentation identifies and labels specific time segments within an audio file. You mark the start and end time of each segment on the audio waveform.

Use cases
A common use case is conversation analysis.

Data structure

Input data
Each line in the input .manifest file is a JSON object representing a single audio file to be labeled. Each object must contain the source field, nested under data, which gives the location of the audio file.
```
{"data":{"source":"oss://example-bucket.oss-cn-hangzhou.aliyuncs.com/iTAG/audio/1.wav"}}
...
```
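Before submitting a manifest, it can help to check that every line satisfies the rule stated above. The sketch below does only that: it verifies that each line parses as a JSON object and carries a non-empty `data.source` string. The function name `find_invalid_lines` is hypothetical, and skipping blank lines is an assumption rather than documented iTAG behavior.

```python
import json


def find_invalid_lines(lines):
    """Return (line_number, reason) pairs for lines that break the format."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # Assumption: blank lines are ignored.
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        data = record.get("data") if isinstance(record, dict) else None
        source = data.get("source") if isinstance(data, dict) else None
        if not isinstance(source, str) or not source:
            problems.append((number, "missing data.source"))
    return problems


candidate_lines = [
    '{"data":{"source":"oss://example-bucket.oss-cn-hangzhou.aliyuncs.com/iTAG/audio/1.wav"}}',
    '{"data":{}}',
    "not json",
]
problems = find_invalid_lines(candidate_lines)
print(problems)
```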

Output data

Each line in the output .manifest file is a JSON object containing the source audio file's location and its annotation results. The following example shows the JSON structure:

```
{
    "data": {
        "source": "oss://example-bucket.oss-cn-hangzhou.aliyuncs.com/audio/21.wav"
    },
    "label-1435480301706092544": {
        "results": [
            {
                "duration": 0,
                "objects": [
                    {
                        "result": {
                            "Audio recognition result": "This is the transcribed content for segment 1.",
                            "single-choice": "Label 1"
                        },
                        "color": null,
                        "id": "wavesurfer_ei0aet9uvp8",
                        "start": 2.3886218302094817,
                        "end": 4.635545755237045
                    },
                    {
                        "result": {
                            "Audio recognition result": "This is the transcribed content for segment 2.",
                            "single-choice": "Label 2"
                        },
                        "color": null,
                        "id": "wavesurfer_kl39gnlb2k",
                        "start": 5.698280044101433,
                        "end": 7.348048511576626
                    }
                ],
                "empty": false
            }
        ]
    }
}
```
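To work with the segments in this output, one option is to flatten the nested `objects` array into a list of segments ordered by start time. The sketch below assumes that `start` and `end` are expressed in seconds and that results sit under a key beginning with `label-`; neither is stated explicitly in this topic. The function name `extract_segments` and the computed `length` field are hypothetical.

```python
import json


def extract_segments(record):
    """Flatten every labeled segment in one output record, sorted by start."""
    segments = []
    for key, value in record.items():
        if not key.startswith("label-"):
            continue
        for result in value.get("results", []):
            for obj in result.get("objects", []):
                segments.append({
                    "id": obj["id"],
                    "start": obj["start"],
                    "end": obj["end"],
                    # Derived field, not part of the iTAG output.
                    "length": obj["end"] - obj["start"],
                    "fields": obj["result"],
                })
    return sorted(segments, key=lambda seg: seg["start"])


# Record copied from the example above.
record = json.loads("""
{
  "data": {"source": "oss://example-bucket.oss-cn-hangzhou.aliyuncs.com/audio/21.wav"},
  "label-1435480301706092544": {
    "results": [{
      "duration": 0,
      "objects": [
        {"result": {"Audio recognition result": "This is the transcribed content for segment 1.",
                    "single-choice": "Label 1"},
         "color": null, "id": "wavesurfer_ei0aet9uvp8",
         "start": 2.3886218302094817, "end": 4.635545755237045},
        {"result": {"Audio recognition result": "This is the transcribed content for segment 2.",
                    "single-choice": "Label 2"},
         "color": null, "id": "wavesurfer_kl39gnlb2k",
         "start": 5.698280044101433, "end": 7.348048511576626}
      ],
      "empty": false
    }]
  }
}
""")

segments = extract_segments(record)
for seg in segments:
    print(seg["id"], round(seg["start"], 2), round(seg["end"], 2), seg["fields"]["single-choice"])
```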

Automatic speech recognition (ASR)

ASR converts spoken audio into written text. This labeling template lets you transcribe audio files and apply relevant labels.

Use cases
A common use case is dialect recognition.

Data structure

Input data
Each line in the input .manifest file is a JSON object representing a single audio file to be labeled. Each object must contain the source field, nested under data, which gives the location of the audio file.
```
{"data":{"source":"oss://example-bucket.oss-cn-hangzhou.aliyuncs.com/iTAG/audio/1.wav"}}
...
```

Output data

Each line in the output .manifest file is a JSON object containing the source audio file's location and its annotation results. The following example shows the JSON structure:

```
{
    "data": {
        "source": "oss://example-bucket.oss-cn-hangzhou.aliyuncs.com/audio/14.wav"
    },
    "label-1435448359497441280": {
        "results": [
            {
                "questionId": "1",
                "data": "This is the transcribed content.",
                "markTitle": "Audio recognition result",
                "type": "survey/value"
            },
            {
                "questionId": "3",
                "data": [
                    "Label 1",
                    "Label 2"
                ],
                "markTitle": "multiple-choice",
                "type": "survey/multivalue"
            }
        ]
    }
}
```
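In this output, the transcript and the labels arrive as separate entries in `results`, with `data` holding a string for `survey/value` and an array for `survey/multivalue`. A possible way to separate them is sketched below. The function name `split_asr_results` is hypothetical, and the sketch assumes the transcript entry can be recognized by the `markTitle` shown in the example, which may differ in your job.

```python
def split_asr_results(record, transcript_title="Audio recognition result"):
    """Return (transcript, labels) from one ASR output record."""
    transcript = None
    labels = []
    for key, value in record.items():
        # Assumption: results live under keys that start with "label-".
        if not key.startswith("label-"):
            continue
        for result in value.get("results", []):
            if result["markTitle"] == transcript_title:
                transcript = result["data"]
            elif result["type"] == "survey/multivalue":
                labels.extend(result["data"])  # data is a list of labels
            else:
                labels.append(result["data"])  # data is a single label
    return transcript, labels


# Record copied from the example above.
asr_record = {
    "data": {"source": "oss://example-bucket.oss-cn-hangzhou.aliyuncs.com/audio/14.wav"},
    "label-1435448359497441280": {
        "results": [
            {
                "questionId": "1",
                "data": "This is the transcribed content.",
                "markTitle": "Audio recognition result",
                "type": "survey/value",
            },
            {
                "questionId": "3",
                "data": ["Label 1", "Label 2"],
                "markTitle": "multiple-choice",
                "type": "survey/multivalue",
            },
        ]
    },
}

transcript, labels = split_asr_results(asr_record)
print(transcript, labels)
```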