Resources for the Audio Application⚓
This application is based on sound recordings, mostly from the Silent Cities Dataset (except for Session 1), from which we extract sub-datasets specifically for this course.
Resources for Session 2⚓
Dataset⚓
The data for Lab Session 2 is based on sound recordings from the Silent Cities Dataset, site 059. A few features:
- Audio was recorded automatically for one minute every 10 minutes at 48 kHz, using an AudioMoth device (Open Acoustic Devices)
- We selected 100 sounds that include a bird song and 100 sounds that do NOT include a bird song
- The data was split into train and test sets (80/20)
Latent Space⚓
As in Lab 1, the sounds have been embedded in a latent space using CNN14, a deep learning model from the PANNs paper. We will delve into the details of Deep Learning and feature extraction from Course 4 onwards.
For now, you can just open the numpy arrays containing all samples in the latent space from the embeddings-audio-lab2.npz file. This file behaves like a dictionary whose arrays are indexed by the keys "X_train", "y_train", "X_test", and "y_test". The class names are also included under the key "class_names".
The file can be loaded with the short snippet below.
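A minimal loading sketch, assuming the file sits in the working directory and uses the key names listed above:

```python
import numpy as np

# Open the archive of CNN14 embeddings; it behaves like a dictionary of arrays
# (allow_pickle is only needed if class_names was stored as Python objects)
data = np.load("embeddings-audio-lab2.npz", allow_pickle=True)

X_train = data["X_train"]          # training embeddings
y_train = data["y_train"]          # training labels
X_test = data["X_test"]            # test embeddings
y_test = data["y_test"]            # test labels
class_names = data["class_names"]  # e.g. "bird" and "other"

print(X_train.shape, X_test.shape, class_names)
```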
Work to do⚓
Compute the classification on this data, using the technique you chose. Please refer to the Lab Session 2 main page for details.
As an example, using the K-Nearest Neighbour algorithm with K=10 gives the following (poor!) results:
|       | Precision | Recall | F1-score |
| ----- | --------- | ------ | -------- |
| bird  | 0.52      | 0.48   | 0.50     |
| other | 0.37      | 0.41   | 0.17     |
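For reference, a sketch of such a baseline using scikit-learn's KNeighborsClassifier on the arrays loaded earlier (your exact scores may differ slightly):

```python
from sklearn.neighbors import KNeighborsClassifier

# K-Nearest Neighbours baseline with K=10 on the CNN14 embeddings
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
```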
You should be able to replicate these results using the `classification_report` function from scikit-learn:
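A minimal sketch, assuming the y_pred predictions from the baseline above (the target_names argument is optional; it simply attaches the class names to the report):

```python
from sklearn.metrics import classification_report

# Per-class precision, recall and F1-score on the test set
print(classification_report(y_test, y_pred, target_names=list(class_names)))
```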
This is quite a difficult example, as a default method gives results that may be no better than chance. Try to see how far you can get, especially for the "bird" class, as the "other" class may be very difficult. Good luck!
Resources for Session 1⚓
Dataset⚓
For Lab Session 1, we use all training examples from two classes of the ESC-50 dataset.
- The two classes are "Dog barking" and "Fireworks".
- Each sound is 5 seconds long
- The sampling rate is 44.1 kHz
More details can be found in the original paper.
Visualisation - Listening to a few sounds⚓
- Dog
- Fireworks
Latent Space⚓
The 80 sounds have been embedded in a latent space using CNN14, a deep learning model from the PANNs paper. We will delve into the details of Deep Learning and feature extraction from Course 4 onwards.
For now, you can just open the numpy array containing all samples in the latent space from the embeddings-audio-lab1.npz file.
Work to do⚓
Compute, visualize, and interpret the distance matrix, as explained on the Lab Session 1 main page.
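A minimal sketch, assuming the embeddings are stored as a single array in embeddings-audio-lab1.npz (the first key of the archive is used here; adapt it to the keys actually present) and using Euclidean distances:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

# Load the Lab 1 embeddings; inspect data.files to see the available keys
data = np.load("embeddings-audio-lab1.npz")
print(data.files)
X = data[data.files[0]]  # assume the first array holds the 80 embeddings

# Pairwise Euclidean distance matrix between all sounds
D = cdist(X, X, metric="euclidean")

# Visualise the 80 x 80 matrix; if the sounds are ordered by class,
# well-separated classes should appear as blocks along the diagonal
plt.imshow(D, cmap="viridis")
plt.colorbar(label="Euclidean distance")
plt.xlabel("Sound index")
plt.ylabel("Sound index")
plt.title("Pairwise distances in the CNN14 latent space")
plt.show()
```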