TP Node Level Parallelism⚓

Source codes can be found in this repository: Sources

Warm up⚓

We will start by setting up the environment for multi-node MPI execution. First you will need to connect to a couple of servers other than the one you are running on, that we will call

fl-tp-br-XXX
fl-tp-br-YYY
fl-tp-br-ZZZ

To make sure that no password is asked when you connect, you can generate a ssh key (which will be seen on all nodes thanks to the shared HOME filesystem)

ssh-keygen -t rsa -f $HOME/.ssh/id_rsa_mpi
eval `ssh-agent`
ssh-add $HOME/.ssh/id_rsa_mpi

To finish setting up the ssh environment you will have to validate the identity of all elec servers. To speed things up you can execute the following

cat known_hosts_elec >> $HOME/.ssh/known_hosts

Now recover the main_1.cpp file, compile it using the OpenMPI C++ compiler and run it on the three nodes

mpicxx -Wall main_1.cpp -o main_1
mpirun -np 4 -host localhost:1,fl-tp-br-XXX:1,fl-tp-br-YYY:1,fl-tp-br-ZZZ:1 main_1

The :1 suffix after the name of the nodes indicates how many MPI processes can be spawned on the corresponding node. The sum of all of these values must be greater or equal to the number of processes you want to run (-np).

Question 1

Now analyze the execution command and the code to answer these questions:

Verify that different processor names are printed.
What does the program do? Do all nodes execute the same code
Look at the man page to understand the options passed to mpirun. How can you run the program with a different number of nodes?
Now look at the code. How does the structure of the code differ w.r.t a non-parallel C++ code? How is the MPI environment initialized and destroyed?
The master and compute nodes execute different code. What makes this possible?

Brute force it!⚓

The program in main_2.cpp attempt to find the string that was used to generate the sha512 hash you will provide as input. Hash functions are functions that map an arbitrary sequence of bits into a finite sequence of bits. By construction these functions are surjective and do not admit an inverse, which makes them perfect for security applications. One of their most common use cases is to verify passwords: any decent website or service provider will not store your password on their system as it would be at risk of being divulged if a data breach were to occur; instead they will store a hash of your password. When you send your password for authentication they will compute its hash and compare it to the stored value, which will (almost) unequivocally verify that your password is correct. Hashes can also be used to verify data authenticity, integrity, ... in addition to applications outside of cryptography, such as constructing data maps (see std::map).

When attackers obtain password hashes out of data breach they usually want to recover the original password. But given the non-invertibility of hash functions they have to use brute force and try several combination of words until they can guess one that generates the correct hash (more advanced techniques exist to speed up the process).

The objective of this exercise is to recover the original word out of the hash using a brute force approach on the "cluster" at your disposal. You will have to compute the hashes of all the words in the dictionary file using the computation nodes, until you get the find the correct one. Several parts of the code have been removed on purpose and you will have to set up the communication part of the program. To simplify things, the master node will not be doing any of the computations itself, it will only be in charge of dispatching sequences of words to the computation nodes.

To set up the environment, you will download a dictionary of words and passwords specially designed for the kind of task we want to do! The download and decompressing might take a bit of time...

wget https://crackstation.net/files/crackstation-human-only.txt.gz
gzip -d crackstation-human-only.txt.gz

Question 2

Compare the dictionary you have downloaded to the system dictionary /usr/share/dict/words in terms of size, content,...

To simplify things for your first attempt at password cracking, we will create auxiliary dictionaries that contain only passwords made of 8 characters

grep -x '.\{8\}' /usr/share/dict/words > words-8-chars.txt
grep -x '.\{8\}' crackstation-human-only.txt > crackstation-human-only-8-chars.txt

Working on this simplified dictionary will allow you to know in advance the length of the messages you need to send between nodes, use this wisely!

Question 3

The code in main_2.cpp provides a skeleton of the cracking program you will have implement. Before compiling and executing it you will have to fill in the and implement the different functions indicated by a "TODO".

How would you parallelize the brute force search of the hash?
Have a look at the code for the master and the computation nodes and understand the order in which the different functions are called.
What are the master and computation nodes going to do?
Check the man MPI documentation to figure out what functions you will have to use.
With this in mind you can start implementing the different functions. At first you should keep using the system dictionary that is easier to work with.

To verify if your code runs, start by generating the hash of a common word:

MY_HASH=$(echo -e -n "asteroid" | sha512sum | cut -f1 -d" ")

then you can try to compile and run the code

mpicxx -Wall -O3 -std=c++17 main_2.cpp -o main_2 -lcrypto
mpirun -np 4 -host localhost:1,fl-tp-br-XXX:1,fl-tp-br-YYY:1,fl-tp-br-ZZZ:1 main_2 \
words-8-chars.txt 10000 $MY_HASH

Once everything is working you can start loading larger parts of the dictionary by changing the first argument, and try to find more original words. If you feel the system dictionary is too poor, you can use the crackstation-human-only-8-chars.txt file you have downloaded as dictionary.

Question 4

Now that you have a running code answer these questions:

How does the computation time scale with the number of nodes/number of words?
Try loading all the words in the crackstation-human-only-8-chars.txt, are you encountering any issues?
Does the program stop its work immediately after finding the hash? If not why? Is it intended? How would you solve this?

Question 5

Bonus: Modify your code to make it work with words of arbitrary size!

Phantom cluster⚓

In this exercise, we will use the previous experience to compute the discrete Fourier transform (DFT) of a signal of the form

\[ \left[v\right]_n = \left(1 + A \cos(2 \pi f_m t_n)\right) \cos(2 \pi f_p t_n) \]

and \(t_n = t_\text{min} + n d_t\). The DFT can be computed by (i) computing the DFT matrix

\[ \left[W\right]_{mn} = \frac{\omega^{mn}}{\sqrt{N}}\,, \quad (m,n) \in [0, N-1]^2\,, \]

where \(\omega = \exp\left(-2 \pi \mathrm{i} / N\right)\) and implement a (parallel) matrix-vector product (MVP)

\[ \left[W s\right]_{m} = \sum_{k=0}^{N-1} \left[W\right]_{mk} \left[s\right]_{k}\,\quad m \in [0, N-1]\,. \]

This way of computing the DFT is extremely inefficient and is only used here for didactic purposes. The fast Fourier transform (FFT) algorithm should be used in any other context!

In HPC, system matrices are often too big to store or multiply in a single node. We will replicate this setting here by distributing the rows of the DFT matrix to different nodes. Each row will be computed and stored on a single compute node.

Question 6

To verify that your code is working, the result of the DFT will be displayed in graph form. What do you expect to see, precisely ?

Question 7

Analyze the provided skeleton code asking yourself the same questions as in the previous exercise and complete the missing functions.

To compile and run the code use the following commands

mpicxx -Wall -O3 -std=c++17 main_3.cpp -o main_3
mpirun -np 4 -host localhost:1,fl-tp-br-XXX:1,fl-tp-br-YYY:1,fl-tp-br-ZZZ:1 main_3 1000 10 100 0 1

The arguments of main_3 are, in order, \(N\), \(f_m\), \(f_p\), \(t_\text{min}\), and \(t_\text{max}\).

Question 8

We have made the choice of splitting the DFT matrix into rows, is it the most efficient way of handling its distribution? Consider the case of a very large matrix for which you can not store a full row on one of your nodes.

Question 9

Implement the solution to the previous problem. What additional challenges will you encounter?