torchaudio.models
The models subpackage contains definitions of models for addressing common audio tasks.
ConvTasNet
class torchaudio.models.ConvTasNet(num_sources: int = 2, enc_kernel_size: int = 16, enc_num_feats: int = 512, msk_kernel_size: int = 3, msk_num_feats: int = 128, msk_num_hidden_feats: int = 512, msk_num_layers: int = 8, msk_num_stacks: int = 3)
- Conv-TasNet: a fully-convolutional time-domain audio separation network.
- Parameters
- num_sources (int) – The number of sources to split. 
- enc_kernel_size (int) – The convolution kernel size of the encoder/decoder, <L>. 
- enc_num_feats (int) – The feature dimension passed to the mask generator, <N>. 
- msk_kernel_size (int) – The convolution kernel size of the mask generator, <P>. 
- msk_num_feats (int) – The input/output feature dimension of each conv block in the mask generator, <B, Sc>. 
- msk_num_hidden_feats (int) – The internal feature dimension of each conv block in the mask generator, <H>. 
- msk_num_layers (int) – The number of layers in one conv block of the mask generator, <X>. 
- msk_num_stacks (int) – The number of conv blocks in the mask generator, <R>. 
 
 - Note
- This implementation corresponds to the “non-causal” setting in the paper.
- Reference:
- Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation, Luo, Yi and Mesgarani, Nima 
 
forward(input: torch.Tensor) → torch.Tensor
- Perform source separation. Generate audio source waveforms.
- Parameters
- input (torch.Tensor) – 3D Tensor with shape [batch, channel==1, frames] 
- Returns
- 3D Tensor with shape [batch, channel==num_sources, frames] 
- Return type
- torch.Tensor 
 
Wav2Letter
class torchaudio.models.Wav2Letter(num_classes: int = 40, input_type: str = 'waveform', num_features: int = 1)
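- Wav2Letter model architecture from the paper “Wav2Letter: an End-to-End ConvNet-based Speech Recognition System”.
- \(\text{padding} = \frac{\text{ceil}(\text{kernel} - \text{stride})}{2}\)
- Parameters
- num_classes (int) – Number of classes to be classified. (Default: 40)
- input_type (str) – Type of input the model takes: waveform, power_spectrum or mfcc. (Default: waveform)
- num_features (int) – Number of input features that the network will receive. (Default: 1)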
forward(x: torch.Tensor) → torch.Tensor
- Parameters
- x (torch.Tensor) – Tensor of dimension (batch_size, num_features, input_length). 
- Returns
- Predictor tensor of dimension (batch_size, number_of_classes, input_length). 
- Return type
- Tensor 
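The padding rule quoted in the Wav2Letter class description can be evaluated directly. The following is a minimal sketch that applies the formula literally as written (ceil first, then division by 2). The helper name `conv_padding` and the kernel/stride values are hypothetical and chosen only for illustration; they are not taken from this page.

```python
import math


def conv_padding(kernel, stride):
    """Evaluate padding = ceil(kernel - stride) / 2 exactly as the formula is written."""
    return math.ceil(kernel - stride) / 2


# Hypothetical layer settings, used only to exercise the formula.
print(conv_padding(48, 2))  # 23.0
print(conv_padding(7, 1))   # 3.0
```

For integer kernel and stride values with an even difference, the result is simply half of that difference.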
 
 
WaveRNN
class torchaudio.models.WaveRNN(upsample_scales: List[int], n_classes: int, hop_length: int, n_res_block: int = 10, n_rnn: int = 512, n_fc: int = 512, kernel_size: int = 5, n_freq: int = 128, n_hidden: int = 128, n_output: int = 128)
- WaveRNN model based on the implementation from fatchord.
- The original implementation was introduced in “Efficient Neural Audio Synthesis”. The input channels of waveform and spectrogram have to be 1. The product of upsample_scales must equal hop_length.
- Parameters
- upsample_scales – the list of upsample scales. 
- n_classes – the number of output classes. 
- hop_length – the number of samples between the starts of consecutive frames. 
- n_res_block – the number of ResBlocks in the stack. (Default: 10)
- n_rnn – the dimension of the RNN layer. (Default: 512)
- n_fc – the dimension of the fully connected layer. (Default: 512)
- kernel_size – the kernel size of the first Conv1d layer. (Default: 5)
- n_freq – the number of bins in a spectrogram. (Default: 128)
- n_hidden – the number of hidden dimensions of the resblock. (Default: 128)
- n_output – the number of output dimensions of the melresnet. (Default: 128)
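The constraint that the product of upsample_scales must equal hop_length can be checked before constructing the model. The following is a minimal sketch using only the standard library; the helper name `scales_match_hop_length` is hypothetical and is not part of torchaudio.

```python
from math import prod


def scales_match_hop_length(upsample_scales, hop_length):
    """Return True when the upsample scales multiply out to hop_length."""
    return prod(upsample_scales) == hop_length


# Values taken from the WaveRNN example on this page: 5 * 5 * 8 == 200.
print(scales_match_hop_length([5, 5, 8], 200))  # True

# A mismatched configuration: 4 * 4 * 8 == 128, not 200.
print(scales_match_hop_length([4, 4, 8], 200))  # False
```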
 
 - Example
>>> import torchaudio
>>> from torchaudio.models import WaveRNN
>>> from torchaudio.transforms import MelSpectrogram
>>> wavernn = WaveRNN(upsample_scales=[5,5,8], n_classes=512, hop_length=200)
>>> waveform, sample_rate = torchaudio.load(file)
>>> # waveform shape: (n_batch, n_channel, (n_time - kernel_size + 1) * hop_length)
>>> specgram = MelSpectrogram(sample_rate)(waveform)  # shape: (n_batch, n_channel, n_freq, n_time)
>>> output = wavernn(waveform, specgram)
>>> # output shape: (n_batch, n_channel, (n_time - kernel_size + 1) * hop_length, n_classes)
forward(waveform: torch.Tensor, specgram: torch.Tensor) → torch.Tensor
- Pass the input through the WaveRNN model.
- Parameters
- waveform – the input waveform to the WaveRNN layer, with shape (n_batch, 1, (n_time - kernel_size + 1) * hop_length) 
- specgram – the input spectrogram to the WaveRNN layer, with shape (n_batch, 1, n_freq, n_time) 
 
- Returns
- Tensor of shape (n_batch, 1, (n_time - kernel_size + 1) * hop_length, n_classes) 
- Return type
- Tensor
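The shape relationships stated for forward can be worked through with plain arithmetic. The following is a minimal sketch that derives the expected waveform, spectrogram, and output shapes from the documented formulas. The helper name `wavernn_shapes` and the sample values (n_batch=2, n_time=10) are hypothetical; the sketch does not call torchaudio.

```python
def wavernn_shapes(n_batch, n_time, hop_length, n_classes, kernel_size=5, n_freq=128):
    """Compute the tensor shapes documented for WaveRNN.forward."""
    # Number of waveform samples that correspond to n_time spectrogram frames.
    n_samples = (n_time - kernel_size + 1) * hop_length
    return {
        "waveform": (n_batch, 1, n_samples),
        "specgram": (n_batch, 1, n_freq, n_time),
        "output": (n_batch, 1, n_samples, n_classes),
    }


shapes = wavernn_shapes(n_batch=2, n_time=10, hop_length=200, n_classes=512)
print(shapes["waveform"])  # (2, 1, 1200)
print(shapes["specgram"])  # (2, 1, 128, 10)
print(shapes["output"])    # (2, 1, 1200, 512)
```

With the default kernel_size of 5, the waveform covers four fewer frames than the spectrogram: (10 - 5 + 1) * 200 = 1200 samples.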