Resources for the Text Application
This application is based on Wikipedia articles, from which short extracts were taken. The data comes from a Kaggle dataset, not one we created for this class.
Resources for Session 2
Dataset
Main features:
- 300 texts, each containing only the first 30 words of the original description.
- Of the 300 texts, 100 belong to the "Conifer" class, 100 to the "Fern" class, and 100 to the "Moss" class.
- "Conifer", "Fern", and "Moss" are labelled as classes "0", "1", and "2", respectively.
- For each class, 80 of the 100 examples are used for training and 20 for testing. The training/test split was made at random, and the examples are already provided to you as training and test sets.
Latent Space
As in Lab 1, the 300 texts have been embedded into a latent space using RoBERTa, a deep learning model from this paper. We will delve into the details of Deep Learning in courses 4 and 5.
The training and testing datasets are available in the embeddings-text-lab2.npz file. This file behaves like a dictionary whose entries are keyed by "X_train", "y_train", "X_test", and "y_test".
The file can be loaded with a code snippet along the following lines.
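A minimal loading sketch, assuming the file sits in the current working directory and uses the key names listed above:

```python
import numpy as np

# Load the archive; each split is accessed by its key.
data = np.load("embeddings-text-lab2.npz")
X_train, y_train = data["X_train"], data["y_train"]
X_test, y_test = data["X_test"], data["y_test"]

# 3 classes x 80 training / 20 test examples each.
print(X_train.shape, y_train.shape)  # expected: 240 training samples
print(X_test.shape, y_test.shape)    # expected: 60 test samples
```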
Work to do
Perform the classification on this data using the technique you chose. Please refer to the Lab Session 2 main page for details.
As an example, the K-Nearest Neighbours algorithm with K=10 already achieves good performance, with an accuracy of 0.7. Detailed per-class results are given in the following table:
| Class | Precision | Recall | F1-score |
| --- | --- | --- | --- |
| 0 | 0.71 | 0.75 | 0.73 |
| 1 | 0.71 | 0.60 | 0.65 |
| 2 | 0.68 | 0.75 | 0.71 |
You should be able to replicate these results using the classification_report function from scikit-learn.
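A minimal sketch of one way to reproduce these numbers, reusing the arrays loaded above; the classifier and the value K=10 come from the text, while the rest is standard scikit-learn usage:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Fit a K-Nearest Neighbours classifier with K=10 on the training embeddings.
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)

# Predict the test labels and print per-class precision, recall, and F1-score.
y_pred = knn.predict(X_test)
print(classification_report(y_test, y_pred))
```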
Resources for Session 1
Dataset
Main features:
- 200 texts.
- Of the 200 texts, 100 belong to the "bird" class and are Wikipedia pages about birds. The other 100 were selected at random.
Visualisation of a few examples
Here are a few samples from the dataset, extracted from Wikipedia pages:
- 'The pygmy white-eye (Oculocincta squamifrons), also known as the pygmy ibon, is a species of bird in the white-eye family Zosteropidae. It is monotypic within the genus Oculocincta.'
- 'Joey Altman is an American chef, restaurateur, TV host and writer.'
- 'GSAT-5P, or GSAT-5 Prime, was an Indian communications satellite which was lost in a launch failure in December 2010. Part of the Indian National Satellite System, it was intended to operate in geosynchronous orbit as a replacement for INSAT-3E.'
Latent Space
The 200 texts have been embedded into a latent space using RoBERTa, a deep learning model from this paper. We will delve into the details of Deep Learning in courses 4 and 5.
For now, you can simply load the NumPy array containing all samples in the latent space from the embeddings-text-lab1.npz file. The data is stored under the "embeddings" key in the dictionary.
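A minimal loading sketch, assuming the file sits in the current working directory:

```python
import numpy as np

# Load the archive and extract the embedding matrix.
data = np.load("embeddings-text-lab1.npz")
embeddings = data["embeddings"]

print(embeddings.shape)  # one row per text, one column per latent dimension
```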
Work to do
Compute, visualise, and interpret the distance matrix, as explained on the Lab Session 1 main page.
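As a starting point, here is a minimal sketch of one possible approach, reusing the embeddings array loaded above; the Euclidean metric and the colour map are illustrative choices, not requirements:

```python
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

# Pairwise Euclidean distances between all 200 embeddings: a 200 x 200 matrix.
dist_matrix = squareform(pdist(embeddings, metric="euclidean"))

# Display the matrix. If the samples are ordered by class, a block structure
# should appear when the classes are well separated in the latent space.
plt.imshow(dist_matrix, cmap="viridis")
plt.colorbar(label="Euclidean distance")
plt.title("Pairwise distance matrix")
plt.show()
```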