
Resources for the Text Application

This application is based on a Kaggle dataset of extracts from Wikipedia articles; it is not a dataset we created for this class.

Resources for Session 2

Dataset

Main features:

  • 300 texts, each containing only the first 30 words of its description.
  • Of the 300 texts, 100 are from the "Conifer" class, 100 from the "Fern" class, and 100 from the "Moss" class.
  • "Conifer", "Fern" and "Moss" are labelled as classes "0", "1", and "2", respectively.
  • For each class, 80 of the 100 examples are used for training and 20 for testing. The training/test split was made randomly, and the examples are already provided to you as training and test sets.

Latent Space

As in Lab 1, the 300 texts have been embedded in a latent space using RoBERTa, a deep learning model from this paper. We will delve into the details of deep learning in courses 4 and 5.

The training and testing datasets are available in the embeddings-text-lab2.npz file. This file is a dictionary whose values are accessed via the keys "X_train", "y_train", "X_test", and "y_test".

The file can therefore be loaded using the following code snippet:

import numpy as np

train_test_dataset = np.load(PATH_TO_EMBEDDINGS_LAB2)

X_train, X_test, y_train, y_test = train_test_dataset['X_train'], train_test_dataset['X_test'], train_test_dataset['y_train'], train_test_dataset['y_test']

# X_train should have a shape of (240, 768), i.e. (number of samples x embedding dimension).
# y_test should have a shape of (60,), i.e. the number of test samples. Each sample's label is an integer.
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

Work to do

Perform the classification on this data, using the technique you chose. Please refer to the Lab Session 2 main page for details.

As an example, using the K-Nearest Neighbour algorithm with K=10 for classification already achieves good performance, with an accuracy of 0.7. Detailed results can be found in the following table:

Class   Precision   Recall   F1-score
0       0.71        0.75     0.73
1       0.71        0.60     0.65
2       0.68        0.75     0.71
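
A minimal sketch of such a classifier, using scikit-learn's KNeighborsClassifier and reusing the variables from the loading snippet above (the default Euclidean metric is assumed here; the exact settings behind the table above are not specified):

from sklearn.neighbors import KNeighborsClassifier

# Fit a K-Nearest Neighbour classifier with K=10 on the training embeddings.
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)

# Predict the class of each test embedding and measure the overall accuracy.
y_pred = knn.predict(X_test)
print(f"Accuracy: {knn.score(X_test, y_test):.2f}")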

You should be able to replicate these results using the classification_report function from scikit-learn:

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

Resources for Session 1

Dataset

Main features:

  • 200 texts
  • Of the 200 texts, 100 are from the "bird" class and are Wikipedia pages about birds. The other 100 are selected at random.

Visualisation of a few examples

Here are some samples from the dataset, extracted from Wikipedia pages:

  • 'The pygmy white-eye (Oculocincta squamifrons), also known as the pygmy ibon, is a species of bird in the white-eye family Zosteropidae. It is monotypic within the genus Oculocincta.'

  • 'Joey Altman is an American chef, restaurateur, TV host and writer.'

  • 'GSAT-5P, or GSAT-5 Prime, was an Indian communications satellite which was lost in a launch failure in December 2010. Part of the Indian National Satellite System, it was intended to operate in geosynchronous orbit as a replacement for INSAT-3E.'

Latent Space

The 200 texts have been embedded in a latent space using RoBERTa, a deep learning model from this paper. We will delve into the details of deep learning in courses 4 and 5.

For now, you can simply open the numpy array containing all samples in the latent space from the embeddings-text-lab1.npz file. The data is stored under the "embeddings" key in the dictionary.
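
A minimal loading sketch, where PATH_TO_EMBEDDINGS_LAB1 is a hypothetical placeholder mirroring the Session 2 snippet:

import numpy as np

# PATH_TO_EMBEDDINGS_LAB1 is a placeholder for the location of embeddings-text-lab1.npz.
data = np.load(PATH_TO_EMBEDDINGS_LAB1)
embeddings = data['embeddings']

# Expected shape: (200, 768) if the embedding dimension matches Session 2.
print(f"Shape of embeddings: {embeddings.shape}")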

Work to do

Compute, visualise and interpret the distance matrix, as explained on the Lab Session 1 main page.
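
One possible sketch, assuming pairwise Euclidean distances (the Lab Session 1 main page may prescribe a different metric, such as cosine distance):

import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

# Pairwise Euclidean distances between all 200 embeddings: a (200, 200) matrix.
distance_matrix = cdist(embeddings, embeddings, metric='euclidean')

# Visualise the matrix as a heatmap. If samples are ordered by class
# (e.g. the 100 "bird" texts first), a block structure should appear.
plt.imshow(distance_matrix, cmap='viridis')
plt.colorbar(label='Euclidean distance')
plt.title('Pairwise distance matrix of the text embeddings')
plt.show()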