torchaudio.models
The models subpackage contains definitions of models for addressing common audio tasks.
ConvTasNet
class torchaudio.models.ConvTasNet(num_sources: int = 2, enc_kernel_size: int = 16, enc_num_feats: int = 512, msk_kernel_size: int = 3, msk_num_feats: int = 128, msk_num_hidden_feats: int = 512, msk_num_layers: int = 8, msk_num_stacks: int = 3, msk_activate: str = 'sigmoid')
- Conv-TasNet: a fully-convolutional time-domain audio separation network, introduced in Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation [1].
- Parameters
- num_sources (int, optional) – The number of sources to split. 
- enc_kernel_size (int, optional) – The convolution kernel size of the encoder/decoder, <L>. 
- enc_num_feats (int, optional) – The feature dimensions passed to mask generator, <N>. 
- msk_kernel_size (int, optional) – The convolution kernel size of the mask generator, <P>. 
- msk_num_feats (int, optional) – The input/output feature dimension of conv block in the mask generator, <B, Sc>. 
- msk_num_hidden_feats (int, optional) – The internal feature dimension of conv block of the mask generator, <H>. 
- msk_num_layers (int, optional) – The number of layers in one conv block of the mask generator, <X>. 
- msk_num_stacks (int, optional) – The number of conv blocks of the mask generator, <R>. 
- msk_activate (str, optional) – The activation function of the mask output (Default: sigmoid).
 
- Note: This implementation corresponds to the “non-causal” setting in the paper.
forward(input: torch.Tensor) → torch.Tensor
- Perform source separation. Generate audio source waveforms.
- Parameters
- input (torch.Tensor) – 3D Tensor with shape [batch, channel==1, frames] 
- Returns
- 3D Tensor with shape [batch, channel==num_sources, frames] 
- Return type
- Tensor 
 
 
DeepSpeech
class torchaudio.models.DeepSpeech(n_feature: int, n_hidden: int = 2048, n_class: int = 40, dropout: float = 0.0)
- DeepSpeech model architecture from Deep Speech: Scaling up end-to-end speech recognition [2].
- Parameters
- n_feature – Number of input features 
- n_hidden – Internal hidden unit size. 
- n_class – Number of output classes 
 
forward(x: torch.Tensor) → torch.Tensor
- Parameters
- x (torch.Tensor) – Tensor of dimension (batch, channel, time, feature). 
- Returns
- Predictor tensor of dimension (batch, time, class). 
- Return type
- Tensor 
 
 
Tacotron2
class torchaudio.models.Tacotron2(mask_padding: bool = False, n_mels: int = 80, n_symbol: int = 148, n_frames_per_step: int = 1, symbol_embedding_dim: int = 512, encoder_embedding_dim: int = 512, encoder_n_convolution: int = 3, encoder_kernel_size: int = 5, decoder_rnn_dim: int = 1024, decoder_max_step: int = 2000, decoder_dropout: float = 0.1, decoder_early_stopping: bool = True, attention_rnn_dim: int = 1024, attention_hidden_dim: int = 128, attention_location_n_filter: int = 32, attention_location_kernel_size: int = 31, attention_dropout: float = 0.1, prenet_dim: int = 256, postnet_n_convolution: int = 5, postnet_kernel_size: int = 5, postnet_embedding_dim: int = 512, gate_threshold: float = 0.5)
- Tacotron2 model based on the implementation from Nvidia.
- The original implementation was introduced in Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions [3].
- Parameters
- mask_padding (bool, optional) – Use mask padding (Default: False).
- n_mels (int, optional) – Number of mel bins (Default: 80).
- n_symbol (int, optional) – Number of symbols for the input text (Default: 148).
- n_frames_per_step (int, optional) – Number of frames processed per step, only 1 is supported (Default: 1).
- symbol_embedding_dim (int, optional) – Input embedding dimension (Default: 512).
- encoder_n_convolution (int, optional) – Number of encoder convolutions (Default: 3).
- encoder_kernel_size (int, optional) – Encoder kernel size (Default: 5).
- encoder_embedding_dim (int, optional) – Encoder embedding dimension (Default: 512).
- decoder_rnn_dim (int, optional) – Number of units in decoder LSTM (Default: 1024).
- decoder_max_step (int, optional) – Maximum number of output mel spectrograms (Default: 2000).
- decoder_dropout (float, optional) – Dropout probability for decoder LSTM (Default: 0.1).
- decoder_early_stopping (bool, optional) – Continue decoding after all samples are finished (Default: True).
- attention_rnn_dim (int, optional) – Number of units in attention LSTM (Default: 1024).
- attention_hidden_dim (int, optional) – Dimension of attention hidden representation (Default: 128).
- attention_location_n_filter (int, optional) – Number of filters for attention model (Default: 32).
- attention_location_kernel_size (int, optional) – Kernel size for attention model (Default: 31).
- attention_dropout (float, optional) – Dropout probability for attention LSTM (Default: 0.1).
- prenet_dim (int, optional) – Number of ReLU units in prenet layers (Default: 256).
- postnet_n_convolution (int, optional) – Number of postnet convolutions (Default: 5).
- postnet_kernel_size (int, optional) – Postnet kernel size (Default: 5).
- postnet_embedding_dim (int, optional) – Postnet embedding dimension (Default: 512).
- gate_threshold (float, optional) – Probability threshold for stop token (Default: 0.5).
 
forward(tokens: torch.Tensor, token_lengths: torch.Tensor, mel_specgram: torch.Tensor, mel_specgram_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]
- Pass the input through the Tacotron2 model. This is in teacher forcing mode, which is generally used for training.
- The input tokens should be padded with zeros to length max of token_lengths. The input mel_specgram should be padded with zeros to length max of mel_specgram_lengths.
- Parameters
- tokens (Tensor) – The input tokens to Tacotron2 with shape (n_batch, max of token_lengths). 
- token_lengths (Tensor) – The valid length of each sample in tokens with shape (n_batch, ).
- mel_specgram (Tensor) – The target mel spectrogram with shape (n_batch, n_mels, max of mel_specgram_lengths). 
- mel_specgram_lengths (Tensor) – The length of each mel spectrogram with shape (n_batch, ). 
 
- Returns
- Tensor
- Mel spectrogram before Postnet with shape (n_batch, n_mels, max of mel_specgram_lengths). 
- Tensor
- Mel spectrogram after Postnet with shape (n_batch, n_mels, max of mel_specgram_lengths). 
- Tensor
- The output for stop token at each time step with shape (n_batch, max of mel_specgram_lengths). 
- Tensor
- Sequence of attention weights from the decoder with shape (n_batch, max of mel_specgram_lengths, max of token_lengths). 
 
- Return type
- [Tensor, Tensor, Tensor, Tensor] 
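The zero-padding convention for tokens and token_lengths described above can be prepared with plain PyTorch utilities. This is a minimal sketch: the symbol IDs are made up, and pad_sequence is one possible way to build the batch, not something the Tacotron2 documentation prescribes.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical encoded sentences of different lengths (symbol IDs are invented).
encoded = [
    torch.tensor([5, 12, 7, 3]),
    torch.tensor([9, 4]),
    torch.tensor([8, 1, 6]),
]

# Valid length of each sample, shape (n_batch, ).
token_lengths = torch.tensor([len(seq) for seq in encoded])

# Pad with zeros up to the longest sample, shape (n_batch, max of token_lengths).
tokens = pad_sequence(encoded, batch_first=True, padding_value=0)

print(tokens)
print(token_lengths)
```

The same pattern applies to mel_specgram and mel_specgram_lengths, padding along the time axis instead.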
 
infer(tokens: torch.Tensor, lengths: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
- Use Tacotron2 for inference. The input is a batch of encoded sentences (tokens) and their corresponding lengths (lengths). The output is the generated mel spectrograms, their corresponding lengths, and the attention weights from the decoder.
- The input tokens should be padded with zeros to length max of lengths.
- Parameters
- tokens (Tensor) – The input tokens to Tacotron2 with shape (n_batch, max of lengths). 
- lengths (Tensor or None, optional) – The valid length of each sample in tokens with shape (n_batch, ). If None, it is assumed that all the tokens are valid. Default: None
 
- Returns
- Tensor
- The predicted mel spectrogram with shape (n_batch, n_mels, max of mel_specgram_lengths). 
- Tensor
- The length of the predicted mel spectrogram with shape (n_batch, ). 
- Tensor
- Sequence of attention weights from the decoder with shape (n_batch, max of mel_specgram_lengths, max of lengths). 
 
- Return type
- (Tensor, Tensor, Tensor) 
 
 
Wav2Letter
class torchaudio.models.Wav2Letter(num_classes: int = 40, input_type: str = 'waveform', num_features: int = 1)
- Wav2Letter model architecture from Wav2Letter: an End-to-End ConvNet-based Speech Recognition System [4].
- \(\text{padding} = \frac{\text{ceil}(\text{kernel} - \text{stride})}{2}\)
- Parameters
forward(x: torch.Tensor) → torch.Tensor
- Parameters
- x (torch.Tensor) – Tensor of dimension (batch_size, num_features, input_length). 
- Returns
- Predictor tensor of dimension (batch_size, number_of_classes, input_length). 
- Return type
- Tensor 
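The padding formula given for Wav2Letter can be evaluated directly. The sketch below is a literal transcription of that formula; the (kernel, stride) pairs are hypothetical values picked for illustration and are not taken from this page.

```python
import math


def conv_padding(kernel: int, stride: int) -> float:
    """Literal transcription of padding = ceil(kernel - stride) / 2."""
    return math.ceil(kernel - stride) / 2


# Hypothetical (kernel, stride) pairs, used only to exercise the formula.
for kernel, stride in [(250, 160), (48, 2), (7, 1)]:
    print(kernel, stride, conv_padding(kernel, stride))
```

Note that, as written, the formula yields a non-integer value whenever kernel - stride is odd, so an actual layer definition would need to round the result.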
 
 
Wav2Vec2.0 / HuBERT
Model
Wav2Vec2Model
class torchaudio.models.Wav2Vec2Model(feature_extractor: torch.nn.Module, encoder: torch.nn.Module, aux: Optional[torch.nn.Module] = None)
- Encoder model used in wav2vec 2.0 [5].
- Note: To build the model, please use one of the factory functions.
- Parameters
- feature_extractor (torch.nn.Module) – Feature extractor that extracts feature vectors from raw audio Tensor. 
- encoder (torch.nn.Module) – Encoder that converts the audio features into the sequence of probability distribution (in negative log-likelihood) over labels. 
- aux (torch.nn.Module or None, optional) – Auxiliary module. If provided, the output from encoder is passed to this module. 
 
extract_features(waveforms: torch.Tensor, lengths: Optional[torch.Tensor] = None, num_layers: Optional[int] = None) → Tuple[List[torch.Tensor], Optional[torch.Tensor]]
- Extract feature vectors from raw waveforms.
- This returns the list of outputs from the intermediate layers of the transformer block in the encoder.
- Parameters
- waveforms (Tensor) – Audio tensor of shape (batch, frames). 
- lengths (Tensor or None, optional) – Indicates the valid length of each audio in the batch. Shape: (batch, ). When waveforms contains audio of different durations, providing the lengths argument lets the model compute the corresponding valid output lengths and apply the proper mask in the transformer attention layer. If None, it is assumed that the entire audio waveform length is valid.
- num_layers (int or None, optional) – If given, limit the number of intermediate layers to go through. Providing 1 will stop the computation after going through one intermediate layer. If not given, the outputs from all the intermediate layers are returned. 
 
- Returns
- List of Tensors
- Features from requested layers. Each Tensor is of shape: (batch, time frame, feature dimension) 
- Tensor or None
- If the lengths argument was provided, a Tensor of shape (batch, ) is returned. It indicates the valid length in time axis of each feature Tensor.
 
- Return type
- (List[Tensor], Optional[Tensor]) 
 
forward(waveforms: torch.Tensor, lengths: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, Optional[torch.Tensor]]
- Compute the sequence of probability distribution over labels.
- Parameters
- waveforms (Tensor) – Audio tensor of shape (batch, frames). 
- lengths (Tensor or None, optional) – Indicates the valid length of each audio in the batch. Shape: (batch, ). When waveforms contains audio of different durations, providing the lengths argument lets the model compute the corresponding valid output lengths and apply the proper mask in the transformer attention layer. If None, it is assumed that all the audio in waveforms have valid length. Default: None.
 
- Returns
- Tensor
- The sequences of probability distribution (in logit) over labels. Shape: (batch, frames, num labels). 
- Tensor or None
- If the lengths argument was provided, a Tensor of shape (batch, ) is returned. It indicates the valid length in time axis of the output Tensor.
 
- Return type
- (Tensor, Optional[Tensor]) 
 
 
Factory Functions
wav2vec2_model
torchaudio.models.wav2vec2_model(extractor_mode: str, extractor_conv_layer_config: Optional[List[Tuple[int, int, int]]], extractor_conv_bias: bool, encoder_embed_dim: int, encoder_projection_dropout: float, encoder_pos_conv_kernel: int, encoder_pos_conv_groups: int, encoder_num_layers: int, encoder_num_heads: int, encoder_attention_dropout: float, encoder_ff_interm_features: int, encoder_ff_interm_dropout: float, encoder_dropout: float, encoder_layer_norm_first: bool, encoder_layer_drop: float, aux_num_out: Optional[int]) → torchaudio.models.Wav2Vec2Model
- Build a custom Wav2Vec2Model.
- Note: The “feature extractor” below corresponds to ConvFeatureExtractionModel in the original fairseq implementation. This is referred to as “(convolutional) feature encoder” in the wav2vec 2.0 [5] paper. The “encoder” below corresponds to TransformerEncoder, and this is referred to as “Transformer” in the paper.
- Parameters
- extractor_mode (str) – Operation mode of feature extractor. Valid values are "group_norm" or "layer_norm". If "group_norm", then a single normalization is applied in the first convolution block. Otherwise, all the convolution blocks will have layer normalization. This option corresponds to extractor_mode from fairseq.
- extractor_conv_layer_config (list of integer tuples or None) – Configuration of convolution layers in feature extractor. List of convolution configuration, i.e. [(output_channel, kernel_size, stride), ...]. If None is provided, then the following default value is used:
  [
    (512, 10, 5),
    (512, 3, 2),
    (512, 3, 2),
    (512, 3, 2),
    (512, 3, 2),
    (512, 2, 2),
    (512, 2, 2),
  ]
  This option corresponds to conv_feature_layers from fairseq.
- extractor_conv_bias (bool) – Whether to include bias term to each convolution operation. This option corresponds to conv_bias from fairseq.
- encoder_embed_dim (int) – The dimension of embedding in encoder. This option corresponds to encoder_embed_dim from fairseq.
- encoder_projection_dropout (float) – The dropout probability applied after the input feature is projected to encoder_embed_dim. This option corresponds to dropout_input from fairseq.
- encoder_pos_conv_kernel (int) – The kernel size of convolutional positional embeddings. This option corresponds to conv_pos from fairseq.
- encoder_pos_conv_groups (int) – The number of groups of convolutional positional embeddings. This option corresponds to conv_pos_groups from fairseq.
- encoder_num_layers (int) – The number of self attention layers in transformer block. This option corresponds to encoder_layers from fairseq.
- encoder_num_heads (int) – The number of heads in self attention layers. This option corresponds to encoder_attention_heads from fairseq.
- encoder_attention_dropout (float) – The dropout probability applied after softmax in self-attention layer. This option corresponds to attention_dropout from fairseq.
- encoder_ff_interm_features (int) – The dimension of hidden features in feed forward layer. This option corresponds to encoder_ffn_embed_dim from fairseq.
- encoder_ff_interm_dropout (float) – The dropout probability applied in feedforward layer. This option corresponds to activation_dropout from fairseq.
- encoder_dropout (float) – The dropout probability applied at the end of feed forward layer. This option corresponds to dropout from fairseq.
- encoder_layer_norm_first (bool) – Control the order of layer norm in transformer layer and each encoder layer. If True, in transformer layer, layer norm is applied before features are fed to encoder layers. In encoder layer, two layer norms are applied before and after self attention. If False, in transformer layer, layer norm is applied after features are fed to encoder layers. In encoder layer, two layer norms are applied after self attention, before and after feed forward. This option corresponds to layer_norm_first from fairseq.
- encoder_layer_drop (float) – Probability to drop each encoder layer during training. This option corresponds to layerdrop from fairseq.
- aux_num_out (int or None) – When provided, attach an extra linear layer on top of encoder, which can be used for fine-tuning. 
 
- Returns
- The resulting model. 
- Return type
- Wav2Vec2Model
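The default extractor_conv_layer_config determines how many feature frames the feature extractor produces for a given number of input samples. The sketch below works this out in plain Python, assuming each convolution is applied without padding, so that a layer maps a length L to floor((L - kernel_size) / stride) + 1. The helper name num_output_frames is hypothetical and is not part of torchaudio.

```python
from typing import List, Tuple

# Default configuration listed above: (output_channel, kernel_size, stride).
DEFAULT_CONFIG: List[Tuple[int, int, int]] = (
    [(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512, 2, 2)] * 2
)


def num_output_frames(num_samples: int, config: List[Tuple[int, int, int]]) -> int:
    """Hypothetical helper: frames left after each unpadded convolution layer."""
    length = num_samples
    for _out_channels, kernel_size, stride in config:
        # Unpadded 1D convolution: floor((L - kernel) / stride) + 1, never negative.
        length = max(0, (length - kernel_size) // stride + 1)
    return length


# One second of audio sampled at 16 kHz.
print(num_output_frames(16000, DEFAULT_CONFIG))
```

Under this assumption the strides multiply to 320 samples per frame, which is why one second of 16 kHz audio yields roughly 50 frames.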
 
wav2vec2_base
torchaudio.models.wav2vec2_base(encoder_projection_dropout: float = 0.1, encoder_attention_dropout: float = 0.1, encoder_ff_interm_dropout: float = 0.1, encoder_dropout: float = 0.1, encoder_layer_drop: float = 0.1, aux_num_out: Optional[int] = None) → torchaudio.models.Wav2Vec2Model
- Build Wav2Vec2Model with “base” architecture from wav2vec 2.0 [5].
- Parameters
- encoder_projection_dropout (float) – See wav2vec2_model().
- encoder_attention_dropout (float) – See wav2vec2_model().
- encoder_ff_interm_dropout (float) – See wav2vec2_model().
- encoder_dropout (float) – See wav2vec2_model().
- encoder_layer_drop (float) – See wav2vec2_model().
- aux_num_out (int or None, optional) – See wav2vec2_model().
 
- Returns
- The resulting model. 
- Return type
- Wav2Vec2Model
 
wav2vec2_large
torchaudio.models.wav2vec2_large(encoder_projection_dropout: float = 0.1, encoder_attention_dropout: float = 0.1, encoder_ff_interm_dropout: float = 0.1, encoder_dropout: float = 0.1, encoder_layer_drop: float = 0.1, aux_num_out: Optional[int] = None) → torchaudio.models.Wav2Vec2Model
- Build Wav2Vec2Model with “large” architecture from wav2vec 2.0 [5].
- Parameters
- encoder_projection_dropout (float) – See wav2vec2_model().
- encoder_attention_dropout (float) – See wav2vec2_model().
- encoder_ff_interm_dropout (float) – See wav2vec2_model().
- encoder_dropout (float) – See wav2vec2_model().
- encoder_layer_drop (float) – See wav2vec2_model().
- aux_num_out (int or None, optional) – See wav2vec2_model().
 
- Returns
- The resulting model. 
- Return type
- Wav2Vec2Model
 
wav2vec2_large_lv60k
torchaudio.models.wav2vec2_large_lv60k(encoder_projection_dropout: float = 0.1, encoder_attention_dropout: float = 0.0, encoder_ff_interm_dropout: float = 0.1, encoder_dropout: float = 0.0, encoder_layer_drop: float = 0.1, aux_num_out: Optional[int] = None) → torchaudio.models.Wav2Vec2Model
- Build Wav2Vec2Model with “large lv-60k” architecture from wav2vec 2.0 [5].
- Parameters
- encoder_projection_dropout (float) – See wav2vec2_model().
- encoder_attention_dropout (float) – See wav2vec2_model().
- encoder_ff_interm_dropout (float) – See wav2vec2_model().
- encoder_dropout (float) – See wav2vec2_model().
- encoder_layer_drop (float) – See wav2vec2_model().
- aux_num_out (int or None, optional) – See wav2vec2_model().
 
- Returns
- The resulting model. 
- Return type
- Wav2Vec2Model
 
hubert_base
torchaudio.models.hubert_base(encoder_projection_dropout: float = 0.1, encoder_attention_dropout: float = 0.1, encoder_ff_interm_dropout: float = 0.0, encoder_dropout: float = 0.1, encoder_layer_drop: float = 0.05, aux_num_out: Optional[int] = None) → torchaudio.models.Wav2Vec2Model
- Build HuBERT model with “base” architecture from HuBERT [6].
- Parameters
- encoder_projection_dropout (float) – See wav2vec2_model().
- encoder_attention_dropout (float) – See wav2vec2_model().
- encoder_ff_interm_dropout (float) – See wav2vec2_model().
- encoder_dropout (float) – See wav2vec2_model().
- encoder_layer_drop (float) – See wav2vec2_model().
- aux_num_out (int or None, optional) – See wav2vec2_model().
 
- Returns
- The resulting model. 
- Return type
- Wav2Vec2Model
 
hubert_large
torchaudio.models.hubert_large(encoder_projection_dropout: float = 0.0, encoder_attention_dropout: float = 0.0, encoder_ff_interm_dropout: float = 0.0, encoder_dropout: float = 0.0, encoder_layer_drop: float = 0.0, aux_num_out: Optional[int] = None) → torchaudio.models.Wav2Vec2Model
- Build HuBERT model with “large” architecture from HuBERT [6].
- Parameters
- encoder_projection_dropout (float) – See wav2vec2_model().
- encoder_attention_dropout (float) – See wav2vec2_model().
- encoder_ff_interm_dropout (float) – See wav2vec2_model().
- encoder_dropout (float) – See wav2vec2_model().
- encoder_layer_drop (float) – See wav2vec2_model().
- aux_num_out (int or None, optional) – See wav2vec2_model().
 
- Returns
- The resulting model. 
- Return type
- Wav2Vec2Model
 
hubert_xlarge
torchaudio.models.hubert_xlarge(encoder_projection_dropout: float = 0.0, encoder_attention_dropout: float = 0.0, encoder_ff_interm_dropout: float = 0.0, encoder_dropout: float = 0.0, encoder_layer_drop: float = 0.0, aux_num_out: Optional[int] = None) → torchaudio.models.Wav2Vec2Model
- Build HuBERT model with “extra large” architecture from HuBERT [6].
- Parameters
- encoder_projection_dropout (float) – See wav2vec2_model().
- encoder_attention_dropout (float) – See wav2vec2_model().
- encoder_ff_interm_dropout (float) – See wav2vec2_model().
- encoder_dropout (float) – See wav2vec2_model().
- encoder_layer_drop (float) – See wav2vec2_model().
- aux_num_out (int or None, optional) – See wav2vec2_model().
 
- Returns
- The resulting model. 
- Return type
- Wav2Vec2Model
 
Utility Functions
import_huggingface_model
torchaudio.models.wav2vec2.utils.import_huggingface_model(original: torch.nn.Module) → torchaudio.models.Wav2Vec2Model
- Build Wav2Vec2Model from the corresponding model object of Hugging Face’s Transformers.
- Parameters
- original (torch.nn.Module) – An instance of Wav2Vec2ForCTC from transformers.
- Returns
- Imported model. 
- Return type
- Wav2Vec2Model
- Example
>>> import torchaudio
>>> from transformers import Wav2Vec2ForCTC
>>> from torchaudio.models.wav2vec2.utils import import_huggingface_model
>>>
>>> original = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
>>> model = import_huggingface_model(original)
>>>
>>> waveforms, _ = torchaudio.load("audio.wav")
>>> logits, _ = model(waveforms)
 
import_fairseq_model
torchaudio.models.wav2vec2.utils.import_fairseq_model(original: torch.nn.Module) → torchaudio.models.Wav2Vec2Model
- Build Wav2Vec2Model from the corresponding model object of fairseq.
- Parameters
- original (torch.nn.Module) – An instance of fairseq’s Wav2Vec2.0 or HuBERT model. One of fairseq.models.wav2vec.wav2vec2_asr.Wav2VecEncoder, fairseq.models.wav2vec.wav2vec2.Wav2Vec2Model or fairseq.models.hubert.hubert_asr.HubertEncoder.
- Returns
- Imported model. 
- Return type
- Wav2Vec2Model
- Example: Loading pretrain-only model
>>> import fairseq.checkpoint_utils
>>> import torch
>>> import torchaudio
>>> from torchaudio.models.wav2vec2.utils import import_fairseq_model
>>>
>>> # Load model using fairseq
>>> model_file = 'wav2vec_small.pt'
>>> model, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task([model_file])
>>> original = model[0]
>>> imported = import_fairseq_model(original)
>>>
>>> # Perform feature extraction
>>> waveform, _ = torchaudio.load('audio.wav')
>>> features, _ = imported.extract_features(waveform)
>>>
>>> # Compare result with the original model from fairseq
>>> reference = original.feature_extractor(waveform).transpose(1, 2)
>>> torch.testing.assert_allclose(features, reference)
- Example: Fine-tuned model
>>> import fairseq.checkpoint_utils
>>> import torch
>>> import torchaudio
>>> from torchaudio.models.wav2vec2.utils import import_fairseq_model
>>>
>>> # Load model using fairseq
>>> model_file = 'wav2vec_small_960h.pt'
>>> model, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task([model_file])
>>> original = model[0]
>>> imported = import_fairseq_model(original.w2v_encoder)
>>>
>>> # Perform encoding
>>> waveform, _ = torchaudio.load('audio.wav')
>>> emission, _ = imported(waveform)
>>>
>>> # Compare result with the original model from fairseq
>>> mask = torch.zeros_like(waveform)
>>> reference = original(waveform, mask)['encoder_out'].transpose(0, 1)
>>> torch.testing.assert_allclose(emission, reference)
 
WaveRNN
class torchaudio.models.WaveRNN(upsample_scales: List[int], n_classes: int, hop_length: int, n_res_block: int = 10, n_rnn: int = 512, n_fc: int = 512, kernel_size: int = 5, n_freq: int = 128, n_hidden: int = 128, n_output: int = 128)
- WaveRNN model based on the implementation from fatchord.
- The original implementation was introduced in Efficient Neural Audio Synthesis [7]. The input channels of waveform and spectrogram have to be 1. The product of upsample_scales must equal hop_length.
- Parameters
- upsample_scales – the list of upsample scales. 
- n_classes – the number of output classes. 
- hop_length – the number of samples between the starts of consecutive frames. 
- n_res_block – the number of ResBlock in stack. (Default: 10)
- n_rnn – the dimension of RNN layer. (Default: 512)
- n_fc – the dimension of fully connected layer. (Default: 512)
- kernel_size – the kernel size of the first Conv1d layer. (Default: 5)
- n_freq – the number of bins in a spectrogram. (Default: 128)
- n_hidden – the number of hidden dimensions of resblock. (Default: 128)
- n_output – the number of output dimensions of melresnet. (Default: 128)
 
- Example
>>> import torchaudio
>>> from torchaudio.models import WaveRNN
>>> from torchaudio.transforms import MelSpectrogram
>>>
>>> wavernn = WaveRNN(upsample_scales=[5,5,8], n_classes=512, hop_length=200)
>>> waveform, sample_rate = torchaudio.load(file)
>>> # waveform shape: (n_batch, n_channel, (n_time - kernel_size + 1) * hop_length)
>>> specgram = MelSpectrogram(sample_rate)(waveform)  # shape: (n_batch, n_channel, n_freq, n_time)
>>> output = wavernn(waveform, specgram)
>>> # output shape: (n_batch, n_channel, (n_time - kernel_size + 1) * hop_length, n_classes)
forward(waveform: torch.Tensor, specgram: torch.Tensor) → torch.Tensor
- Pass the input through the WaveRNN model.
- Parameters
- waveform – the input waveform to the WaveRNN layer (n_batch, 1, (n_time - kernel_size + 1) * hop_length) 
- specgram – the input spectrogram to the WaveRNN layer (n_batch, 1, n_freq, n_time) 
 
- Returns
- shape (n_batch, 1, (n_time - kernel_size + 1) * hop_length, n_classes) 
- Return type
- Tensor 
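The two shape constraints stated for WaveRNN (the product of upsample_scales must equal hop_length, and the waveform length is tied to the number of spectrogram frames) can be checked with plain arithmetic before building a model. The helper name expected_waveform_length and the value n_time = 20 are hypothetical, used only for illustration.

```python
import math

upsample_scales = [5, 5, 8]
hop_length = 200
kernel_size = 5  # default kernel size of the first Conv1d layer

# Constraint from the class description: prod(upsample_scales) == hop_length.
assert math.prod(upsample_scales) == hop_length


def expected_waveform_length(n_time: int, kernel_size: int, hop_length: int) -> int:
    """Hypothetical helper: waveform samples matching a spectrogram of n_time frames."""
    return (n_time - kernel_size + 1) * hop_length


# A spectrogram with 20 frames pairs with a waveform of this many samples.
n_time = 20
print(expected_waveform_length(n_time, kernel_size, hop_length))
```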
 
infer(specgram: torch.Tensor, lengths: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, Optional[torch.Tensor]]
- Inference method of WaveRNN.
- This function currently only supports multinomial sampling, which assumes the network is trained on cross entropy loss.
- Parameters
- specgram (Tensor) – Batch of spectrograms. Shape: (n_batch, n_freq, n_time). 
- lengths (Tensor or None, optional) – Indicates the valid length of each spectrogram in the batch. Shape: (batch, ). When specgram contains spectrograms with different durations, providing the lengths argument lets the model compute the corresponding valid output lengths. If None, it is assumed that all the spectrograms in specgram have valid length. Default: None.
 
- Returns
- Tensor
- The inferred waveform of size (n_batch, 1, n_time). 1 stands for a single channel. 
- Tensor or None
- If the lengths argument was provided, a Tensor of shape (batch, ) is returned. It indicates the valid length in time axis of the output Tensor.
 
- Return type
- (Tensor, Optional[Tensor]) 
 
 
References
- [1] Yi Luo and Nima Mesgarani. Conv-tasnet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8):1256–1266, Aug 2019. URL: http://dx.doi.org/10.1109/TASLP.2019.2915167, doi:10.1109/taslp.2019.2915167. 
- [2] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. Deep speech: scaling up end-to-end speech recognition. 2014. arXiv:1412.5567. 
- [3] Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, and others. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4779–4783. IEEE, 2018. 
- [4] Ronan Collobert, Christian Puhrsch, and Gabriel Synnaeve. Wav2letter: an end-to-end convnet-based speech recognition system. 2016. arXiv:1609.03193. 
- [5] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. Wav2vec 2.0: a framework for self-supervised learning of speech representations. 2020. arXiv:2006.11477. 
- [6] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: self-supervised speech representation learning by masked prediction of hidden units. 2021. arXiv:2106.07447. 
- [7] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. 2018. arXiv:1802.08435.