ava.models package

Submodules

ava.models.vae module

A Variational Autoencoder (VAE) for spectrogram data.

VAE References

[1]

Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).

https://arxiv.org/abs/1312.6114

[2]

Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. “Stochastic backpropagation and approximate inference in deep generative models.” arXiv preprint arXiv:1401.4082 (2014).

https://arxiv.org/abs/1401.4082

class ava.models.vae.VAE(save_dir='', lr=0.001, z_dim=32, model_precision=10.0, device_name='auto')

Bases: torch.nn.Module

Variational Autoencoder class for single-channel images.

save_dir

Directory where the model is saved. Defaults to ''.

Type:str, optional
lr

Model learning rate. Defaults to 1e-3.

Type:float, optional
z_dim

Latent dimension. Defaults to 32.

Type:int, optional
model_precision

Precision of the observation model. Defaults to 10.0.

Type:float, optional
device_name

Name of device to train the model on. When 'auto' is passed, 'cuda' is chosen if torch.cuda.is_available(), otherwise 'cpu' is chosen. Defaults to 'auto'.

Type:{'cpu', 'cuda', 'auto'}, optional

Notes

The model is trained to maximize the standard ELBO objective:

\[\mathcal{L} = \mathbb{E}_{q(z|x)} \left[ \log p(x,z) \right] + \mathbb{H}[q(z|x)]\]

where \(p(x,z) = p(z)p(x|z)\) and \(\mathbb{H}\) is differential entropy. The prior \(p(z)\) is a unit spherical normal distribution. The conditional distribution \(p(x|z)\) is set as a spherical normal distribution to prevent overfitting. The variational distribution, \(q(z|x)\), is an approximately rank-1 multivariate normal distribution: its covariance is a rank-1 term plus a diagonal. Here, \(q(z|x)\) and \(p(x|z)\) are parameterized by neural networks. Gradients are passed through stochastic layers via the reparameterization trick, implemented by the PyTorch rsample method.

The dimensions of the network are hard-coded for use with 128 x 128 spectrograms. Although a desired latent dimension can be passed to __init__, the dimensions of the network limit the practical range of values to roughly 8 to 64 dimensions. Changing the image dimensions will require updating the parameters of the layers defined in _build_network.

decode(z)

Compute \(p(x|z)\).

\[p(x|z) = \mathcal{N}(\mu, \Lambda^{-1})\]
\[\Lambda = \mathtt{model\_precision} \cdot I\]

where \(\mu\) is a deterministic function of z, \(\Lambda\) is a precision matrix, and \(I\) is the identity matrix.

Parameters:z (torch.Tensor) – Batch of latent samples with shape [batch_size, self.z_dim]
Returns:x – Batch of means mu, described above. Shape: [batch_size, X_DIM=128*128]
Return type:torch.Tensor
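
As an illustration of the observation model above, the log-density of x under a spherical normal with a given precision can be written in a few lines. This is a minimal sketch rather than the package's own code: the function name spherical_gaussian_log_prob is hypothetical, and the inputs are flat Python sequences for brevity.

```python
import math


def spherical_gaussian_log_prob(x, mu, precision=10.0):
    """Log-density of x under N(mu, (precision * I)^-1).

    Hypothetical helper: x and mu are flat sequences of equal length.
    """
    dim = len(x)
    squared_error = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    log_normalizer = 0.5 * dim * math.log(precision / (2.0 * math.pi))
    return log_normalizer - 0.5 * precision * squared_error


# A perfect reconstruction attains the largest log-density for a given precision.
perfect = spherical_gaussian_log_prob([0.2, 0.7], [0.2, 0.7])
worse = spherical_gaussian_log_prob([0.2, 0.7], [0.0, 0.0])
```

A larger precision penalizes squared reconstruction error more heavily relative to the prior and entropy terms of the ELBO.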
encode(x)

Compute \(q(z|x)\).

\[q(z|x) = \mathcal{N}(\mu, \Sigma)\]
\[\Sigma = u u^{T} + \mathtt{diag}(d)\]

where \(\mu\), \(u\), and \(d\) are deterministic functions of x and \(\Sigma\) denotes a covariance matrix.

Parameters:x (torch.Tensor) – The input images, with shape: [batch_size, height=128, width=128]
Returns:
  • mu (torch.Tensor) – Posterior mean, with shape [batch_size, self.z_dim]
  • u (torch.Tensor) – Posterior covariance factor, as defined above. Shape: [batch_size, self.z_dim]
  • d (torch.Tensor) – Posterior diagonal factor, as defined above. Shape: [batch_size, self.z_dim]
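
One way to represent a covariance of this form in PyTorch is torch.distributions.LowRankMultivariateNormal, which takes a covariance factor and a positive diagonal. The sketch below uses random stand-ins for mu, u, and d (not real encoder outputs) and checks that the distribution's covariance matches the formula above; whether the package constructs the posterior exactly this way is an assumption.

```python
import torch
from torch.distributions import LowRankMultivariateNormal

batch_size, z_dim = 4, 32
mu = torch.zeros(batch_size, z_dim)
u = torch.randn(batch_size, z_dim)
d = torch.rand(batch_size, z_dim) + 1e-3  # diagonal entries must be positive

# cov_factor expects a trailing rank dimension, so u becomes [batch_size, z_dim, 1].
q = LowRankMultivariateNormal(mu, u.unsqueeze(-1), d)

# Dense covariance built directly from the formula, for comparison.
sigma = u.unsqueeze(-1) @ u.unsqueeze(-2) + torch.diag_embed(d)
```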
forward(x, return_latent_rec=False)

Send x round trip and compute a loss.

In more detail: given x, compute \(q(z|x)\) and sample \(\hat{z} \sim q(z|x)\). Then compute \(\log p(x|\hat{z})\), the log-likelihood of the input x given the latent sample \(\hat{z}\). We also need the likelihood of \(\hat{z}\) under the model’s prior, \(p(\hat{z})\), and the entropy of the latent conditional distribution, \(\mathbb{H}[q(z|x)]\). The ELBO can then be estimated as:

\[\frac{1}{N} \sum_{i=1}^N \left( \mathbb{E}_{\hat{z} \sim q(z|x_i)} \left[ \log p(x_i,\hat{z}) \right] + \mathbb{H}[q(z|x_i)] \right)\]

where \(N\) denotes the number of samples from the data distribution and the expectation is estimated using a single latent sample, \(\hat{z}\). In practice, the outer expectation is estimated using minibatches.

Parameters:
  • x (torch.Tensor) – A batch of samples from the data distribution (spectrograms). Shape: [batch_size, height=128, width=128]
  • return_latent_rec (bool, optional) – Whether to return latent means and reconstructions. Defaults to False.
Returns:

  • loss (torch.Tensor) – Negative ELBO times the batch size. Shape: []
  • latent (numpy.ndarray, if return_latent_rec) – Latent means. Shape: [batch_size, self.z_dim]
  • reconstructions (numpy.ndarray, if return_latent_rec) – Reconstructed means. Shape: [batch_size, height=128, width=128]
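
The estimate described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the method's actual code: single_sample_elbo and the linear stand-in decoder are hypothetical, x is already flattened to [batch_size, x_dim], and the encoder outputs mu, u, and d are passed in directly.

```python
import math

import torch
from torch.distributions import LowRankMultivariateNormal


def single_sample_elbo(x, mu, u, d, decode, model_precision=10.0):
    """Sum over the batch of single-sample ELBO estimates.

    x: [batch_size, x_dim]; mu, u, d: [batch_size, z_dim];
    decode: callable mapping latent samples to reconstruction means.
    """
    q = LowRankMultivariateNormal(mu, u.unsqueeze(-1), d)
    z = q.rsample()  # reparameterized sample, so gradients reach mu, u, and d
    x_mean = decode(z)
    x_dim, z_dim = x.shape[-1], z.shape[-1]
    # log p(z) under the unit spherical normal prior.
    log_pz = -0.5 * (z.pow(2).sum(dim=-1) + z_dim * math.log(2.0 * math.pi))
    # log p(x|z) under a spherical normal with the given precision.
    log_px_given_z = (
        0.5 * x_dim * math.log(model_precision / (2.0 * math.pi))
        - 0.5 * model_precision * (x - x_mean).pow(2).sum(dim=-1)
    )
    return (log_px_given_z + log_pz + q.entropy()).sum()


# Toy usage with a linear stand-in decoder and small dimensions.
torch.manual_seed(0)
batch_size, z_dim, x_dim = 3, 4, 16
decoder = torch.nn.Linear(z_dim, x_dim)
x = torch.rand(batch_size, x_dim)
mu = torch.zeros(batch_size, z_dim, requires_grad=True)
u = torch.randn(batch_size, z_dim)
d = torch.ones(batch_size, z_dim)
loss = -single_sample_elbo(x, mu, u, d, decoder)
loss.backward()
```

Summing rather than averaging over the batch matches the documented return value, the negative ELBO times the batch size.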

get_latent(loader)

Get latent means for all syllables in the given loader.

Parameters:loader (torch.utils.data.DataLoader) – ava.models.vae_dataset.SyllableDataset DataLoader.
Returns:latent – Latent means. Shape: [len(loader.dataset), self.z_dim]
Return type:numpy.ndarray

Note

  • Make sure your loader is not set to shuffle if you’re going to match these with labels or other fields later.
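
The ordering point in the note can be checked on a toy dataset. The sketch below uses a TensorDataset of indices as a stand-in for a syllable dataset; with shuffle=False, concatenating the batches recovers the dataset order, which is what makes row i of the returned array line up with item i.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset in which each item is its own index, so ordering is easy to inspect.
dataset = TensorDataset(torch.arange(10))
loader = DataLoader(dataset, batch_size=4, shuffle=False)

# Batches arrive in dataset order, so the stacked output follows the dataset.
seen = torch.cat([batch[0] for batch in loader])
```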
load_state(filename)

Load all the model parameters from the given .tar file.

The .tar file should be written by self.save_state.

Parameters:filename (str) – File containing a model state.

Note

  • self.lr, self.save_dir, and self.z_dim are not loaded.
save_state(filename)

Save all the model parameters to the given file.
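
A common PyTorch pattern for this kind of save/load pair is to store state dictionaries in a single file with torch.save and restore them with torch.load. The sketch below shows that general pattern with a toy model; the helper names and dictionary keys are hypothetical, and the contents of the file actually written by save_state may differ.

```python
import os
import tempfile

import torch


def save_checkpoint(model, optimizer, filename):
    """Write model and optimizer state to one file (key names are illustrative)."""
    torch.save({
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, filename)


def load_checkpoint(model, optimizer, filename, device='cpu'):
    """Restore state written by save_checkpoint into existing objects."""
    checkpoint = torch.load(filename, map_location=device)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])


# Round trip with a toy model standing in for the VAE.
source = torch.nn.Linear(4, 2)
target = torch.nn.Linear(4, 2)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'checkpoint.tar')
    save_checkpoint(source, torch.optim.Adam(source.parameters()), path)
    load_checkpoint(target, torch.optim.Adam(target.parameters()), path)
```

Because self.lr, self.save_dir, and self.z_dim are not loaded, the model should be constructed with a matching z_dim before load_state is called.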

test_epoch(test_loader)

Test the model on a held-out test set and return an ELBO estimate.

Parameters:test_loader (torch.utils.data.DataLoader) – ava.models.vae_dataset.SyllableDataset DataLoader for test set
Returns:elbo – An unbiased estimate of the ELBO, estimated using samples from test_loader.
Return type:float
train_epoch(train_loader)

Train the model for a single epoch.

Parameters:train_loader (torch.utils.data.DataLoader) – ava.models.vae_dataset.SyllableDataset DataLoader for training set
Returns:elbo – A biased estimate of the ELBO, estimated using samples from train_loader.
Return type:float
train_loop(loaders, epochs=100, test_freq=2, save_freq=10, vis_freq=1)

Train the model for multiple epochs, testing and saving along the way.

Parameters:
  • loaders (dictionary) – Dictionary mapping the keys 'test' and 'train' to respective torch.utils.data.DataLoader objects.
  • epochs (int, optional) – Number of (possibly additional) epochs to train the model for. Defaults to 100.
  • test_freq (int, optional) – Testing is performed every test_freq epochs. Defaults to 2.
  • save_freq (int, optional) – The model is saved every save_freq epochs. Defaults to 10.
  • vis_freq (int, optional) – Syllable reconstructions are plotted every vis_freq epochs. Defaults to 1.
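
To make the interaction of the three frequency parameters concrete, the sketch below lists which actions would fire at each epoch. It rests on assumptions that the documentation does not confirm: epochs are counted from 1, an action fires when the epoch number is a multiple of its frequency, and a frequency of None disables the action. The function epoch_actions is hypothetical.

```python
def epoch_actions(epochs=100, test_freq=2, save_freq=10, vis_freq=1):
    """Yield (epoch, actions) pairs; a frequency of None disables that action."""
    frequencies = {'test': test_freq, 'save': save_freq, 'visualize': vis_freq}
    for epoch in range(1, epochs + 1):
        # Assumed rule: an action fires on multiples of its frequency.
        actions = [name for name, freq in frequencies.items()
                   if freq is not None and epoch % freq == 0]
        yield epoch, actions


schedule = dict(epoch_actions(epochs=10))
```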
visualize(loader, num_specs=5, gap=(2, 6), save_filename='reconstruction.pdf')

Plot spectrograms and their reconstructions.

Spectrograms are chosen at random from the DataLoader's dataset.

Parameters:
  • loader (torch.utils.data.DataLoader) – Spectrogram DataLoader
  • num_specs (int, optional) – Number of spectrogram pairs to plot. Defaults to 5.
  • gap (int or tuple of two ints, optional) – The vertical and horizontal gap between images, in pixels. Defaults to (2, 6).
  • save_filename (str, optional) – Where to save the plot, relative to self.save_dir. Defaults to 'reconstruction.pdf'.
Returns:

  • specs (numpy.ndarray) – Spectrograms from loader.
  • rec_specs (numpy.ndarray) – Corresponding spectrogram reconstructions.

ava.models.vae.X_DIM = 16384

Processed spectrogram dimension: freq_bins * time_bins

ava.models.vae.X_SHAPE = (128, 128)

Processed spectrogram shape: [freq_bins, time_bins]

ava.models.vae_dataset module

Methods for feeding syllable data to the VAE.

Meant to be used with ava.models.vae.VAE.

class ava.models.vae_dataset.SyllableDataset(filenames, sylls_per_file, transform=None)

Bases: torch.utils.data.Dataset

torch.utils.data.Dataset for animal vocalization syllables

ava.models.vae_dataset.get_syllable_data_loaders(partition, batch_size=64, shuffle=(True, False), num_workers=4)

Return a pair of DataLoaders given a test/train split.

Parameters:
  • partition (dictionary) – Test/train split: a dictionary that maps the keys 'test' and 'train' to disjoint lists of .hdf5 filenames containing syllables.
  • batch_size (int, optional) – Batch size of the returned DataLoaders. Defaults to 64.
  • shuffle (tuple of bools, optional) – Whether to shuffle data for the train and test DataLoaders, respectively. Defaults to (True, False).
  • num_workers (int, optional) – How many subprocesses to use for data loading. Defaults to 4.
Returns:

dataloaders – Dictionary mapping two keys, 'test' and 'train', to respective torch.utils.data.DataLoader objects.

Return type:

dictionary

ava.models.vae_dataset.get_syllable_partition(dirs, split, shuffle=True, max_num_files=None)

Partition the filenames into a random test/train split.

Parameters:
  • dirs (list of strings) – List of directories containing saved syllable hdf5 files.
  • split (float) – Portion of the hdf5 files to use for training, \(0 < \mathtt{split} \leq 1.0\)
  • shuffle (bool, optional) – Whether to shuffle the hdf5 files. Defaults to True.
  • max_num_files ({int, None}, optional) – Upper bound on the total number of files across the train and test partitions. If None, all files are used. Defaults to None.
Returns:

partition – Contains two keys, 'test' and 'train', that map to lists of hdf5 files. Defines the random test/train split.

Return type:

dict
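
The partitioning logic described above can be sketched in plain Python. This is an illustration under assumptions, not the function's actual implementation: partition_filenames, its seed argument, the rounding rule, and the example filenames are all hypothetical, and directory listing is omitted.

```python
import random


def partition_filenames(filenames, split, shuffle=True, max_num_files=None, seed=None):
    """Split filenames into disjoint 'train' and 'test' lists (illustrative only)."""
    if not 0.0 < split <= 1.0:
        raise ValueError('split must satisfy 0 < split <= 1')
    filenames = sorted(filenames)
    if shuffle:
        random.Random(seed).shuffle(filenames)
    if max_num_files is not None:
        # Cap the total number of files before splitting.
        filenames = filenames[:max_num_files]
    index = int(round(split * len(filenames)))
    return {'train': filenames[:index], 'test': filenames[index:]}


# Hypothetical filenames; an 80/20 split of ten files.
files = ['syll_{:04d}.hdf5'.format(i) for i in range(10)]
partition = partition_filenames(files, split=0.8, seed=0)
```

Splitting at the file level keeps every syllable from a given file on one side of the split.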

ava.models.window_vae_dataset module

Methods for feeding randomly sampled spectrogram data to the shotgun VAE.

Meant to be used with ava.models.vae.VAE.

TO DO

  • replace affinewarp with ava.preprocessing.warping
ava.models.window_vae_dataset.DEFAULT_WARP_PARAMS = {'l2_reg_scale': 1e-07, 'n_knots': 0, 'smoothness_reg_scale': 0.1, 'warp_reg_scale': 0.01}

Default time-warping parameters sent to affinewarp

class ava.models.window_vae_dataset.FixedWindowDataset(audio_filenames, roi_filenames, p, transform=None, dataset_length=2048, min_spec_val=None)

Bases: torch.utils.data.Dataset

write_hdf5_files(save_dir, num_files=500, sylls_per_file=100)

Write hdf5 files containing spectrograms of random audio chunks.

TO DO

  • Write to multiple directories.

Note

  • This should be consistent with ava.preprocessing.preprocess.process_sylls.
Parameters:
  • save_dir (str) – Directory to save hdf5s in.
  • num_files (int, optional) – Number of files to save. Defaults to 500.
  • sylls_per_file (int, optional) – Number of syllables in each file. Defaults to 100.
class ava.models.window_vae_dataset.WarpedWindowDataset(audio_filenames, p, transform=None, dataset_length=2048, load_warp=False, save_warp=True, start_q=-0.1, stop_q=1.1, warp_fn=None, warp_params={}, warp_type='spectrogram')

Bases: torch.utils.data.Dataset

get_specific_item(query_filename, quantile)

Return a specific window of birdsong as a NumPy array.

Parameters:
  • query_filename (str) – Audio filename.
  • quantile (float) – 0 <= quantile <= 1
Returns:

spec – Spectrogram.

Return type:

numpy.ndarray

get_whole_warped_spectrogram(query_filename, time_bins=128)

Get an entire warped song motif.

Parameters:
  • query_filename (str) – Which audio file to use.
  • time_bins (int, optional) – Number of time bins.
Returns:

spec – Spectrogram.

Return type:

numpy.ndarray

write_hdf5_files(save_dir, num_files=400, sylls_per_file=100)

Write hdf5 files containing spectrograms of random audio chunks.

Note

This should be consistent with ava.preprocessing.preprocess.process_sylls.

TO DO

  • Add the option to also write segments. This could be useful for noise removal.
Parameters:
  • save_dir (str) – Where to write.
  • num_files (int, optional) – Number of files to write. Defaults to 400.
  • sylls_per_file (int, optional) – Number of spectrograms to write per file. Defaults to 100.
ava.models.window_vae_dataset.get_fixed_window_data_loaders(partition, p, batch_size=64, shuffle=(True, False), num_workers=4, min_spec_val=None)

Get DataLoaders for training and testing: fixed-duration shotgun VAE

Parameters:
  • partition (dict) – Output of ava.models.window_vae_dataset.get_window_partition.
  • p (dict) – Preprocessing parameters. Must contain keys: …
  • batch_size (int, optional) – Defaults to 64.
  • shuffle (tuple of bool, optional) – Whether to shuffle train and test sets, respectively. Defaults to (True, False).
  • num_workers (int, optional) – Number of CPU workers to feed data to the network. Defaults to 4.
Returns:

loaders – Maps the keys 'train' and 'test' to their respective DataLoaders.

Return type:

dict

ava.models.window_vae_dataset.get_warped_window_data_loaders(audio_dirs, p, batch_size=64, num_workers=4, load_warp=False, warp_fn=None, warp_params={}, warp_type='spectrogram')

Get DataLoaders for training and testing: warped shotgun VAE

Warning

  • Audio files must all be the same duration! You can use segmenting.utils.write_segments_to_audio to extract audio from song segments, writing them as separate .wav files.

TO DO

  • Add a train/test split!
Parameters:
  • audio_dirs (list of str) – Audio directories.
  • p (dict) – Preprocessing parameters. Must contain keys: 'window_length', 'nperseg', 'noverlap', 'min_freq', 'max_freq', 'spec_min_val', and 'spec_max_val'.
  • batch_size (int, optional) – DataLoader batch size. Defaults to 64.
  • num_workers (int, optional) – Number of CPU workers to retrieve data for the model. Defaults to 4.
  • load_warp (bool, optional) – Whether to load a previously saved time warping result. Defaults to False.
  • warp_fn ({str, None}, optional) – Where the x-knots and y-knots should be saved and loaded. Defaults to None.
  • warp_params (dict, optional) – Parameters passed to affinewarp. Defaults to {}.
  • warp_type ({'amplitude', 'spectrogram', 'null'}, optional) – Whether to time-warp using amplitude traces, full spectrograms, or not warp at all. Defaults to 'spectrogram'.
Returns:

loaders – Maps the keys 'train' and 'test' to their respective DataLoaders.

Return type:

dict

ava.models.window_vae_dataset.get_window_partition(audio_dirs, roi_dirs, split=0.8, shuffle=True, exclude_empty_roi_files=True)

Get a train/test split for fixed-duration shotgun VAE.

Parameters:
  • audio_dirs (list of str) – Audio directories.
  • roi_dirs (list of str) – ROI (segment) directories.
  • split (float, optional) – Train/test split. Defaults to 0.8, indicating an 80/20 train/test split.
  • shuffle (bool, optional) – Whether to shuffle at the audio file level. Defaults to True.
  • exclude_empty_roi_files (bool, optional) – Defaults to True.
Returns:

partition – Defines the test/train split. The keys 'test' and 'train' each map to a dictionary with keys 'audio' and 'rois', which both map to numpy arrays containing filenames.

Return type:

dict

Module contents

AVA models module

Contains

ava.models.vae
Defines the variational autoencoder (VAE).
ava.models.vae_dataset
Feeds syllable data to the VAE.
ava.models.window_vae_dataset
Feeds randomly sampled spectrogram data to the shotgun VAE.