torchaudio.backend
Overview
The torchaudio.backend module provides the implementations behind the audio file I/O functions torchaudio.info, torchaudio.load, and torchaudio.save.
Two implementations are currently available:
- "sox_io" (default on Linux/macOS)
- "soundfile" (default on Windows)
Note
Instead of calling the functions in torchaudio.backend directly, please use torchaudio.info, torchaudio.load, and torchaudio.save after selecting the backend with torchaudio.set_audio_backend().
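The platform defaults listed in this overview can be sketched in a few lines. This is an illustration only: default_backend_name is a hypothetical helper, not part of torchaudio, and it simply restates the defaults above so the returned name could be passed to torchaudio.set_audio_backend().

```python
import sys


def default_backend_name(platform: str = sys.platform) -> str:
    """Return the documented default backend name for a platform string.

    Hypothetical helper for illustration; torchaudio does not expose it.
    """
    # sys.platform is "win32" on Windows, where "sox_io" is not available.
    if platform.startswith("win"):
        return "soundfile"
    # Linux ("linux") and macOS ("darwin") default to "sox_io".
    # Other platforms are not covered by the text; this sketch treats them
    # like Linux/macOS, which is an assumption.
    return "sox_io"


print(default_backend_name())
```

The function takes the platform string as a parameter so the rule can be checked for a platform other than the one the code runs on.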
Availability
The "sox_io" backend requires a C++ extension module, which is included in the Linux/macOS binary distributions. This backend is not available on Windows.
The "soundfile" backend requires SoundFile. Please refer to the SoundFile documentation for installation instructions.
Common Data Structure
Structures used to report the metadata of audio files.
AudioMetaData
- class torchaudio.backend.common.AudioMetaData(sample_rate: int, num_frames: int, num_channels: int, bits_per_sample: int, encoding: str)
- Return type of the torchaudio.info function. This class is used by the "sox_io" backend and the "soundfile" backend.
- Variables:
- sample_rate (int) – Sample rate 
- num_frames (int) – The number of frames 
- num_channels (int) – The number of channels 
- bits_per_sample (int) – The number of bits per sample. This is 0 for lossy formats, or when it cannot be accurately inferred. 
- encoding (str) – Audio encoding. The value of encoding is one of the following:
- PCM_S: Signed integer linear PCM
- PCM_U: Unsigned integer linear PCM
- PCM_F: Floating point linear PCM
- FLAC: Flac, Free Lossless Audio Codec
- ULAW: Mu-law
- ALAW: A-law
- MP3: MP3, MPEG-1 Audio Layer III
- VORBIS: OGG Vorbis
- AMR_WB: Adaptive Multi-Rate Wideband
- AMR_NB: Adaptive Multi-Rate Narrowband
- OPUS: Opus
- HTK: Single channel 16-bit PCM
- UNKNOWN: None of the above
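As a rough sketch of how these fields fit together, the stand-in below mirrors the five documented attributes and derives a duration from them. AudioMetaDataSketch and duration_seconds are hypothetical names used only for illustration; the real class is the one returned by torchaudio.info.

```python
from dataclasses import dataclass


@dataclass
class AudioMetaDataSketch:
    """Stand-in with the same five fields as AudioMetaData (illustration only)."""

    sample_rate: int
    num_frames: int
    num_channels: int
    bits_per_sample: int
    encoding: str


def duration_seconds(meta: AudioMetaDataSketch) -> float:
    # One frame holds one sample per channel, so frames / rate gives seconds.
    return meta.num_frames / meta.sample_rate


meta = AudioMetaDataSketch(
    sample_rate=16000,
    num_frames=48000,
    num_channels=2,
    bits_per_sample=16,
    encoding="PCM_S",
)
print(duration_seconds(meta))  # 3.0
```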
 
 
 
Sox IO Backend
The sox_io backend is available and the default on Linux/macOS, and is not available on Windows.
The I/O functions of this backend support TorchScript.
You can switch from another backend to the sox_io backend with the following:
torchaudio.set_audio_backend("sox_io")
info
- torchaudio.backend.sox_io_backend.info(filepath: str, format: Optional[str] = None) -> AudioMetaData
- Get signal information of an audio file.
- Parameters:
- filepath (path-like object or file-like object) – Source of audio data. When the function is not compiled by TorchScript (e.g. with torch.jit.script), the following types are accepted:
- path-like: file path
- file-like: object with a read(size: int) -> bytes method, which returns a byte string of at most size length.
When the function is compiled by TorchScript, only the str type is allowed.
Note: When the input is a file-like object, this function cannot get the correct length (num_frames) for certain formats, such as vorbis. In this case, the value of num_frames is 0.
Note: This argument is intentionally annotated as str only, for TorchScript compiler compatibility.
 
- format (str or None, optional) – Override the format detection with the given format. Providing this argument might help when libsox cannot infer the format from the header or extension.
 
- Returns:
- Metadata of the given audio. 
- Return type:
- AudioMetaData
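The file-like requirement described for filepath is only a read(size) method that returns at most size bytes. A minimal sketch of an object satisfying that contract follows; ChunkedReader is a hypothetical class, and the byte string is placeholder data rather than a valid audio file.

```python
import io


class ChunkedReader:
    """Hypothetical file-like wrapper exposing only read(size) -> bytes."""

    def __init__(self, data: bytes) -> None:
        self._buffer = io.BytesIO(data)

    def read(self, size: int) -> bytes:
        # Returns at most `size` bytes; an empty bytes object means end of data.
        return self._buffer.read(size)


reader = ChunkedReader(b"RIFF....WAVEfmt ")
print(reader.read(4))  # b'RIFF'
```

Objects such as io.BytesIO or a file opened in binary mode already provide this method, so a wrapper like this is only needed for custom data sources.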
 
load
- torchaudio.backend.sox_io_backend.load(filepath: str, frame_offset: int = 0, num_frames: int = -1, normalize: bool = True, channels_first: bool = True, format: Optional[str] = None) -> Tuple[Tensor, int]
- Load audio data from file.
Note: This function can handle all the codecs that the underlying libsox can handle; however, it is tested on the following formats:
- WAV, AMB
- 32-bit floating-point
- 32-bit signed integer 
- 24-bit signed integer 
- 16-bit signed integer 
- 8-bit unsigned integer (WAV only) 
 
- MP3 
- FLAC 
- OGG/VORBIS 
- OPUS 
- SPHERE 
- AMR-NB 
To load MP3, FLAC, OGG/VORBIS, OPUS and other codecs that libsox does not handle natively, your installation of torchaudio has to be linked to libsox and the corresponding codec libraries, such as libmad or libmp3lame.
By default (normalize=True, channels_first=True), this function returns a Tensor with float32 dtype and the shape [channel, time].
Warning: The normalize argument does not perform volume normalization. It only converts the sample type from the native sample type to torch.float32.
When the input format is WAV with an integer type, such as 32-bit signed integer, 16-bit signed integer, 24-bit signed integer, or 8-bit unsigned integer, providing normalize=False makes this function return an integer Tensor whose samples are expressed within the whole range of the corresponding dtype: an int32 tensor for 32-bit signed PCM, int16 for 16-bit signed PCM, and uint8 for 8-bit unsigned PCM. Since torch does not support an int24 dtype, 24-bit signed PCM is converted to int32 tensors.
The normalize argument has no effect on 32-bit floating-point WAV and other formats, such as flac and mp3. For these formats, this function always returns a float32 Tensor.
- Parameters:
- filepath (path-like object or file-like object) – Source of audio data. When the function is not compiled by TorchScript (e.g. with torch.jit.script), the following types are accepted:
- path-like: file path
- file-like: object with a read(size: int) -> bytes method, which returns a byte string of at most size length.
When the function is compiled by TorchScript, only the str type is allowed.
Note: This argument is intentionally annotated as str only, for TorchScript compiler compatibility.
- frame_offset (int) – Number of frames to skip before starting to read data.
- num_frames (int, optional) – Maximum number of frames to read. -1 reads all the remaining samples, starting from frame_offset. This function may return fewer frames if there are not enough frames in the given file.
- normalize (bool, optional) – When True, this function converts the native sample type to float32. Default: True. If the input file is integer WAV, giving False changes the resulting Tensor type to an integer type. This argument has no effect for formats other than integer WAV.
- channels_first (bool, optional) – When True, the returned Tensor has dimension [channel, time]. Otherwise, the returned Tensor’s dimension is [time, channel].
- format (str or None, optional) – Override the format detection with the given format. Providing this argument might help when libsox cannot infer the format from the header or extension.
 
- Returns:
- Resulting Tensor and sample rate.
- If the input file has integer wav format and normalize=False, then it has integer type, else float32 type. If channels_first=True, it has [channel, time] else [time, channel].
 
- Return type:
- (torch.Tensor, int) 
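The dtype rule described for normalize can be restated as a small decision function. This is a sketch of the documented behavior, not the library's implementation: loaded_dtype is a hypothetical helper, it returns dtype names as strings, and it takes encoding names in the AudioMetaData style ("PCM_S", "PCM_U", "PCM_F").

```python
def loaded_dtype(fmt: str, encoding: str, bits_per_sample: int, normalize: bool = True) -> str:
    """Name the dtype the text above describes for a loaded Tensor.

    Hypothetical helper; it only restates the documented rule.
    """
    is_integer_wav = fmt == "wav" and encoding in ("PCM_S", "PCM_U")
    # normalize=True, or any input that is not integer WAV, yields float32.
    if normalize or not is_integer_wav:
        return "float32"
    if encoding == "PCM_U" and bits_per_sample == 8:
        return "uint8"
    if encoding == "PCM_S" and bits_per_sample == 16:
        return "int16"
    if encoding == "PCM_S" and bits_per_sample in (24, 32):
        # torch has no int24 dtype, so 24-bit PCM is widened to int32.
        return "int32"
    raise ValueError(f"combination not covered by the text: {encoding}, {bits_per_sample}-bit")


print(loaded_dtype("wav", "PCM_S", 24, normalize=False))  # int32
print(loaded_dtype("flac", "FLAC", 16, normalize=False))  # float32
```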
 
save
- torchaudio.backend.sox_io_backend.save(filepath: str, src: Tensor, sample_rate: int, channels_first: bool = True, compression: Optional[float] = None, format: Optional[str] = None, encoding: Optional[str] = None, bits_per_sample: Optional[int] = None)
- Save audio data to file.
- Parameters:
- filepath (str or pathlib.Path) – Path to save the file to. This function also handles pathlib.Path objects, but is annotated as str for TorchScript compiler compatibility.
- src (torch.Tensor) – Audio data to save. Must be a 2D tensor.
- sample_rate (int) – Sampling rate.
- channels_first (bool, optional) – If True, the given tensor is interpreted as [channel, time], otherwise [time, channel].
- compression (float or None, optional) – Used for formats other than WAV. This corresponds to the -C option of the sox command.
- "mp3"
- Either bitrate (in kbps) with quality factor, such as 128.2, or VBR encoding with quality factor, such as -4.2. Default: -4.5.
- "flac"
- Whole number from 0 to 8. 8 is the default and the highest compression.
- "ogg", "vorbis"
- Number from -1 to 10; -1 is the highest compression and lowest quality. Default: 3.
See the details at http://sox.sourceforge.net/soxformat.html.
- format (str or None, optional) – Override the audio format. When the filepath argument is a path-like object, the audio format is inferred from the file extension. If the file extension is missing or different, you can specify the correct format with this argument. When the filepath argument is a file-like object, this argument is required. Valid values are "wav", "mp3", "ogg", "vorbis", "amr-nb", "amb", "flac", "sph", "gsm", and "htk".
- encoding (str or None, optional) – Changes the encoding for the supported formats. This argument is effective only for supported formats, such as "wav", "amb" and "sph". Valid values are:
- "PCM_S" (signed integer linear PCM)
- "PCM_U" (unsigned integer linear PCM)
- "PCM_F" (floating point PCM)
- "ULAW" (mu-law)
- "ALAW" (a-law)
Default values
If not provided, the default value is picked based on format and bits_per_sample.
- "wav", "amb"
- If both encoding and bits_per_sample are not provided, the dtype of the Tensor is used to determine the default value:
- "PCM_U" if dtype is uint8
- "PCM_S" if dtype is int16 or int32
- "PCM_F" if dtype is float32
 
- "PCM_U" if bits_per_sample=8
- "PCM_S" otherwise
 
- "sph" format
- the default value is "PCM_S"
 
 
 
- bits_per_sample (int or None, optional) – Changes the bit depth for the supported formats. When format is one of "wav", "flac", "sph", or "amb", you can change the bit depth. Valid values are 8, 16, 24, 32 and 64.
Default values
If not provided, the default values are picked based on format and encoding.
- "wav", "amb"
- If both encoding and bits_per_sample are not provided, the dtype of the Tensor is used:
- 8 if dtype is uint8
- 16 if dtype is int16
- 32 if dtype is int32 or float32
 
- 8 if encoding is "PCM_U", "ULAW" or "ALAW"
- 16 if encoding is "PCM_S"
- 32 if encoding is "PCM_F"
 
- "flac" format
- the default value is 24
 
- "sph" format
- 16 if encoding is "PCM_U", "PCM_S", "PCM_F" or not provided.
- 8 if encoding is "ULAW" or "ALAW"
 
- "amb" format
- 8 if encoding is "PCM_U", "ULAW" or "ALAW"
- 16 if encoding is "PCM_S" or not provided.
- 32 if encoding is "PCM_F"
 
 
 
 
Supported formats/encodings/bit depths/compression are:
- "wav", "amb"
- 32-bit floating-point PCM 
- 32-bit signed integer PCM 
- 24-bit signed integer PCM 
- 16-bit signed integer PCM 
- 8-bit unsigned integer PCM 
- 8-bit mu-law 
- 8-bit a-law 
 - Note: Default encoding/bit depth is determined by the dtype of the input Tensor. 
- "mp3"
- Fixed bit rate (such as 128 kbps) and variable bit rate compression. Default: VBR with high quality.
- "flac"
- 8-bit
- 16-bit
- 24-bit (default)
 
- "ogg", "vorbis"
- Different quality levels. Default: approx. 112 kbps
 
- "sph"
- 8-bit signed integer PCM 
- 16-bit signed integer PCM 
- 24-bit signed integer PCM 
- 32-bit signed integer PCM (default) 
- 8-bit mu-law 
- 8-bit a-law 
- 16-bit a-law 
- 24-bit a-law 
- 32-bit a-law 
 
- "amr-nb"
- Bitrate ranging from 4.75 kbit/s to 12.2 kbit/s. Default: 4.75 kbit/s 
- "gsm"
- Lossy Speech Compression, CPU intensive. 
- "htk"
- Uses a default single-channel 16-bit PCM format. 
Note: To save into formats that libsox does not handle natively (such as "mp3", "flac", "ogg" and "vorbis"), your installation of torchaudio has to be linked to libsox and the corresponding codec libraries, such as libmad or libmp3lame.
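The default-value rules given for encoding and bits_per_sample with the "wav" and "amb" formats can be combined into one lookup. The sketch below only restates those rules: wav_defaults is a hypothetical helper, dtypes are passed as strings instead of torch dtypes, and it is not how torchaudio implements the selection.

```python
from typing import Optional, Tuple

# Defaults when neither encoding nor bits_per_sample is given (keyed by dtype name).
_DTYPE_DEFAULTS = {
    "uint8": ("PCM_U", 8),
    "int16": ("PCM_S", 16),
    "int32": ("PCM_S", 32),
    "float32": ("PCM_F", 32),
}

# Default bit depth when only the encoding is given.
_BITS_BY_ENCODING = {"PCM_U": 8, "ULAW": 8, "ALAW": 8, "PCM_S": 16, "PCM_F": 32}


def wav_defaults(
    dtype: str,
    encoding: Optional[str] = None,
    bits_per_sample: Optional[int] = None,
) -> Tuple[str, int]:
    """Resolve (encoding, bits_per_sample) for "wav"/"amb" per the rules above."""
    if encoding is None and bits_per_sample is None:
        return _DTYPE_DEFAULTS[dtype]
    if encoding is None:
        # Only the bit depth was given: 8-bit means unsigned, anything else signed.
        encoding = "PCM_U" if bits_per_sample == 8 else "PCM_S"
    if bits_per_sample is None:
        bits_per_sample = _BITS_BY_ENCODING[encoding]
    return encoding, bits_per_sample


print(wav_defaults("float32"))                   # ('PCM_F', 32)
print(wav_defaults("int16", bits_per_sample=8))  # ('PCM_U', 8)
print(wav_defaults("float32", encoding="ULAW"))  # ('ULAW', 8)
```

Note that once either argument is supplied, the Tensor dtype no longer influences the result in this sketch, which matches the wording of the rules above.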
Soundfile Backend
The "soundfile" backend is available when SoundFile is installed. This backend is the default on Windows.
You can switch from another backend to the "soundfile" backend with the following:
torchaudio.set_audio_backend("soundfile")
info
- torchaudio.backend.soundfile_backend.info(filepath: str, format: Optional[str] = None) -> AudioMetaData
- Get signal information of an audio file.
Note: The filepath argument is intentionally annotated as str only, even though it accepts pathlib.Path objects as well. This is for consistency with the "sox_io" backend, which has a restriction on type annotation due to TorchScript compiler compatibility.
- Parameters:
- filepath (path-like object or file-like object) – Source of audio data.
- format (str or None, optional) – Not used. PySoundFile does not accept a format hint.
 
- Returns:
- Metadata of the given audio.
- Return type:
- AudioMetaData
 
load
- torchaudio.backend.soundfile_backend.load(filepath: str, frame_offset: int = 0, num_frames: int = -1, normalize: bool = True, channels_first: bool = True, format: Optional[str] = None) -> Tuple[Tensor, int]
- Load audio data from file.
Note: The formats this function can handle depend on the soundfile installation. This function is tested on the following formats:
- WAV
- 32-bit floating-point
- 32-bit signed integer 
- 16-bit signed integer 
- 8-bit unsigned integer 
 
- FLAC 
- OGG/VORBIS 
- SPHERE 
By default (normalize=True, channels_first=True), this function returns a Tensor with float32 dtype and the shape [channel, time].
Warning: The normalize argument does not perform volume normalization. It only converts the sample type from the native sample type to torch.float32.
When the input format is WAV with an integer type, such as 32-bit signed integer, 16-bit signed integer, 24-bit signed integer, or 8-bit unsigned integer, providing normalize=False makes this function return an integer Tensor whose samples are expressed within the whole range of the corresponding dtype: an int32 tensor for 32-bit signed PCM, int16 for 16-bit signed PCM, and uint8 for 8-bit unsigned PCM. Since torch does not support an int24 dtype, 24-bit signed PCM is converted to int32 tensors.
The normalize argument has no effect on 32-bit floating-point WAV and other formats, such as flac and mp3. For these formats, this function always returns a float32 Tensor.
Note: The filepath argument is intentionally annotated as str only, even though it accepts pathlib.Path objects as well. This is for consistency with the "sox_io" backend, which has a restriction on type annotation due to TorchScript compiler compatibility.
- Parameters:
- filepath (path-like object or file-like object) – Source of audio data. 
- frame_offset (int, optional) – Number of frames to skip before starting to read data.
- num_frames (int, optional) – Maximum number of frames to read. -1 reads all the remaining samples, starting from frame_offset. This function may return fewer frames if there are not enough frames in the given file.
- normalize (bool, optional) – When True, this function converts the native sample type to float32. Default: True. If the input file is integer WAV, giving False changes the resulting Tensor type to an integer type. This argument has no effect for formats other than integer WAV.
- channels_first (bool, optional) – When True, the returned Tensor has dimension [channel, time]. Otherwise, the returned Tensor’s dimension is [time, channel].
- format (str or None, optional) – Not used. PySoundFile does not accept a format hint.
 
- Returns:
- Resulting Tensor and sample rate.
- If the input file has integer wav format and normalization is off, then it has integer type, else float32 type. If channels_first=True, it has [channel, time] else [time, channel].
 
- Return type:
- (torch.Tensor, int) 
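The interaction of frame_offset, num_frames and channels_first can be pictured with plain lists. The sketch below uses nested Python lists in place of Tensors and a hypothetical select_frames helper; it illustrates the documented windowing and layout rules and is not the backend's code.

```python
from typing import List


def select_frames(
    frames: List[List[float]],
    frame_offset: int = 0,
    num_frames: int = -1,
    channels_first: bool = True,
) -> List[List[float]]:
    """Window and lay out audio stored as frames[time][channel]."""
    # -1 means "read everything after frame_offset".
    end = None if num_frames == -1 else frame_offset + num_frames
    # Slicing silently returns fewer frames when the file is too short.
    window = frames[frame_offset:end]
    if not channels_first:
        return [list(frame) for frame in window]  # [time, channel]
    num_channels = len(frames[0]) if frames else 0
    # Transpose to [channel, time].
    return [[frame[ch] for frame in window] for ch in range(num_channels)]


# Four frames of two-channel audio.
frames = [[0.0, 0.5], [0.1, 0.6], [0.2, 0.7], [0.3, 0.8]]
print(select_frames(frames, frame_offset=1, num_frames=2))  # [[0.1, 0.2], [0.6, 0.7]]
print(select_frames(frames, frame_offset=3, num_frames=5))  # [[0.3], [0.8]]
```

The second call asks for five frames but only one remains after the offset, so a single frame per channel comes back.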
 
save
- torchaudio.backend.soundfile_backend.save(filepath: str, src: Tensor, sample_rate: int, channels_first: bool = True, compression: Optional[float] = None, format: Optional[str] = None, encoding: Optional[str] = None, bits_per_sample: Optional[int] = None)
- Save audio data to file.
Note: The formats this function can handle depend on the soundfile installation. This function is tested on the following formats:
- WAV
- 32-bit floating-point
- 32-bit signed integer 
- 16-bit signed integer 
- 8-bit unsigned integer 
 
- FLAC 
- OGG/VORBIS 
- SPHERE 
Note: The filepath argument is intentionally annotated as str only, even though it accepts pathlib.Path objects as well. This is for consistency with the "sox_io" backend, which has a restriction on type annotation due to TorchScript compiler compatibility.
- Parameters:
- filepath (str or pathlib.Path) – Path to audio file.
- src (torch.Tensor) – Audio data to save. Must be a 2D tensor.
- sample_rate (int) – Sampling rate.
- channels_first (bool, optional) – If True, the given tensor is interpreted as [channel, time], otherwise [time, channel].
- compression (float or None, optional) – Not used. It is here only for interface compatibility with the "sox_io" backend.
- format (str or None, optional) – Override the audio format. When the filepath argument is a path-like object, the audio format is inferred from the file extension. If the file extension is missing or different, you can specify the correct format with this argument. When the filepath argument is a file-like object, this argument is required. Valid values are "wav", "ogg", "vorbis", "flac" and "sph".
- encoding (str or None, optional) – Changes the encoding for supported formats. This argument is effective only for supported formats, such as "wav", "flac" and "sph". Valid values are:
- "PCM_S" (signed integer linear PCM)
- "PCM_U" (unsigned integer linear PCM)
- "PCM_F" (floating point PCM)
- "ULAW" (mu-law)
- "ALAW" (a-law)
 
- bits_per_sample (int or None, optional) – Changes the bit depth for the supported formats. When format is one of "wav", "flac" or "sph", you can change the bit depth. Valid values are 8, 16, 24, 32 and 64.
 
Supported formats/encodings/bit depths/compression are:
- "wav"
- 32-bit floating-point PCM 
- 32-bit signed integer PCM 
- 24-bit signed integer PCM 
- 16-bit signed integer PCM 
- 8-bit unsigned integer PCM 
- 8-bit mu-law 
- 8-bit a-law 
 - Note:
- Default encoding/bit depth is determined by the dtype of the input Tensor. 
 
- "flac"
- 8-bit
- 16-bit (default)
- 24-bit
 
- "ogg", "vorbis"
- Doesn’t accept changing configuration.
 
- "sph"
- 8-bit signed integer PCM 
- 16-bit signed integer PCM 
- 24-bit signed integer PCM 
- 32-bit signed integer PCM (default) 
- 8-bit mu-law 
- 8-bit a-law 
- 16-bit a-law 
- 24-bit a-law 
- 32-bit a-law 
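The rule for the format parameter of this backend's save (infer from the file extension unless an override is given) could be sketched as follows. resolve_format and SOUNDFILE_SAVE_FORMATS are hypothetical names, and lowercasing the extension is an assumption of this sketch rather than documented behavior.

```python
from pathlib import Path
from typing import Optional, Union

# Valid values listed for the "soundfile" backend's save().
SOUNDFILE_SAVE_FORMATS = ("wav", "ogg", "vorbis", "flac", "sph")


def resolve_format(filepath: Union[str, Path], format: Optional[str] = None) -> str:
    """Pick the audio format: explicit override first, file extension otherwise."""
    if format is None:
        # Assumption: the extension is compared case-insensitively.
        format = Path(filepath).suffix.lstrip(".").lower()
    if format not in SOUNDFILE_SAVE_FORMATS:
        raise ValueError(f"unsupported or missing format: {format!r}")
    return format


print(resolve_format("clip.flac"))                # flac
print(resolve_format("clip.bin", format="wav"))   # wav
```

A path with no extension and no override raises ValueError here, which mirrors the case where the text says the format argument must be supplied.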
 
 
