
Resources for the Text Application

This application is based on a Kaggle dataset of extracts from Wikipedia articles; it is not a dataset we created for this class.

Resources for Session 2

Dataset

Main features:

  • 300 texts, each containing only the first 30 words of its description.
  • Of the 300 texts, 100 are from the "Conifer" class, 100 from the "Fern" class, and 100 from the "Moss" class.
  • "Conifer", "Fern" and "Moss" are labelled as classes "0", "1", and "2", respectively.
  • For each class, 80 of the 100 examples are used for training and 20 for testing. The training/test split was made randomly, and the examples are already provided to you as training and test sets.

Latent Space

As in Lab 1, the 300 texts have been embedded in a latent space using RoBERTa, a deep learning model from this paper. We will delve into the details of deep learning in courses 4 and 5.

The training and testing datasets are available in the embeddings-text-lab2.npz file. This file is a dictionary whose values are accessed via the keys "X_train", "y_train", "X_test", and "y_test".

The file can therefore be loaded using the following code snippet:

import numpy as np

train_test_dataset = np.load(PATH_TO_EMBEDDINGS_LAB2)

X_train, X_test, y_train, y_test = train_test_dataset['X_train'], train_test_dataset['X_test'], train_test_dataset['y_train'], train_test_dataset['y_test']

# X_train should have a shape of (240, 768), i.e. (number of samples x embedding dimension).
# y_test should have a shape of (60,), i.e. the number of test samples. Each sample's label is an integer.
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

Work to do

Perform the classification on this data, using the technique you chose. Please refer to the Lab Session 2 main page for details.

As an example, using the K-Nearest Neighbour algorithm with K=10 for classification already achieves good performance, with an accuracy of 0.7. Detailed results can be found in the following table:

Class   Precision   Recall   F1-score
0       0.71        0.75     0.73
1       0.71        0.60     0.65
2       0.68        0.75     0.71
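
A minimal sketch of such a classifier, using scikit-learn's KNeighborsClassifier and reusing the variables from the loading snippet above (the default Euclidean metric is assumed here; the exact settings behind the table above are not specified):

from sklearn.neighbors import KNeighborsClassifier

# Fit a K-Nearest Neighbour classifier with K=10 on the training embeddings.
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)

# Predict the class of each test embedding and measure the overall accuracy.
y_pred = knn.predict(X_test)
print(f"Accuracy: {knn.score(X_test, y_test):.2f}")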

You should be able to replicate these results using the classification_report function from scikit-learn:

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

Resources for Session 1

Dataset

Main features:

  • 200 texts
  • Of the 200 texts, 100 are from the "bird" class and are Wikipedia pages about birds. The other 100 are selected at random.

Visualisation of a few examples

Here are some samples from the dataset, extracted from Wikipedia pages:

  • 'The pygmy white-eye (Oculocincta squamifrons), also known as the pygmy ibon, is a species of bird in the white-eye family Zosteropidae. It is monotypic within the genus Oculocincta.'

  • 'Joey Altman is an American chef, restaurateur, TV host and writer.'

  • 'GSAT-5P, or GSAT-5 Prime, was an Indian communications satellite which was lost in a launch failure in December 2010. Part of the Indian National Satellite System, it was intended to operate in geosynchronous orbit as a replacement for INSAT-3E.'

Latent Space

The 200 texts have been embedded in a latent space using RoBERTa, a deep learning model from this paper. We will delve into the details of deep learning in courses 4 and 5.

For now, you can simply open the numpy array containing all samples in the latent space from the embeddings-text-lab1.npz file. The data is stored under the "embeddings" key in the dictionary.
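
A minimal loading sketch, where PATH_TO_EMBEDDINGS_LAB1 is a hypothetical placeholder mirroring the Session 2 snippet:

import numpy as np

# PATH_TO_EMBEDDINGS_LAB1 is a placeholder for the location of embeddings-text-lab1.npz.
data = np.load(PATH_TO_EMBEDDINGS_LAB1)
embeddings = data['embeddings']

# Expected shape: (200, 768) if the embedding dimension matches Session 2.
print(f"Shape of embeddings: {embeddings.shape}")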

Work to do

Compute, visualise and interpret the distance matrix, as explained on the Lab Session 1 main page.
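
One possible sketch, assuming pairwise Euclidean distances (the Lab Session 1 main page may prescribe a different metric, such as cosine distance):

import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

# Pairwise Euclidean distances between all 200 embeddings: a (200, 200) matrix.
distance_matrix = cdist(embeddings, embeddings, metric='euclidean')

# Visualise the matrix as a heatmap. If samples are ordered by class
# (e.g. the 100 "bird" texts first), a block structure should appear.
plt.imshow(distance_matrix, cmap='viridis')
plt.colorbar(label='Euclidean distance')
plt.title('Pairwise distance matrix of the text embeddings')
plt.show()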