Parallelizing over GPU cards

Without data transfer to the GPGPU card

The EASEA parallelization of an evolutionary algorithm over GPGPU cards is straightforward (just add the "-cuda" option on the easena compile line) provided the evaluation function does not need data to evaluate individuals.

For example, the weierstrass example (in the examples directory) can first be compiled for the CPU only, without the option:

$ make easeaclean ; easena weierstrass.ez ; make

On a PC with an Intel Core i7-9700K overclocked to 4.6GHz and an NVIDIA GeForce RTX 2080 Ti, the run over 100 generations gives the following:

------------------------------------------------------------------------------------------------
|GENER.|    ELAPSED    |    PLANNED    |     ACTUAL    |BEST INDIVIDUAL|  AVG  | WORST | STAND |
|NUMBER|     TIME      | EVALUATION NB | EVALUATION NB |    FITNESS    |FITNESS|FITNESS|  DEV  |
------------------------------------------------------------------------------------------------
      0	         0.989s	           2048	           2048	1.232901840e+02	1.4e+02	4.8e+00	1.6e+02
	...
     99	        97.065s	         204800	         204800	5.162325287e+01	5.2e+01	1.4e-01	5.3e+01

Now, when compiled with:

$ make easeaclean ; easena weierstrass.ez -cuda ; make

the result becomes:

------------------------------------------------------------------------------------------------
|GENER.|    ELAPSED    |    PLANNED    |     ACTUAL    |BEST INDIVIDUAL|  AVG  | WORST | STAND |
|NUMBER|     TIME      | EVALUATION NB | EVALUATION NB |    FITNESS    |FITNESS|FITNESS|  DEV  |
------------------------------------------------------------------------------------------------
     0	         0.080s	           2048	           2048	1.232649612e+02	1.4e+02	4.8e+00	1.6e+02
	...
    99	         1.784s	         204800	         204800	5.043983459e+01	5.1e+01	1.5e-01	5.1e+01

so a speedup of about x54 is observed (97.065s / 1.784s ≈ 54.4)!
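
The speedup is simply the ratio of the two elapsed times at generation 99. A minimal Python sketch of that calculation, using only the timings printed in the two tables above (the variable and function names are illustrative, not part of EASEA):

```python
# Elapsed times at generation 99, copied from the two runs above.
cpu_elapsed_s = 97.065  # compiled with: easena weierstrass.ez
gpu_elapsed_s = 1.784   # compiled with: easena weierstrass.ez -cuda


def speedup(reference_s: float, accelerated_s: float) -> float:
    """Return how many times faster the accelerated run is."""
    return reference_s / accelerated_s


print(f"speedup: x{speedup(cpu_elapsed_s, gpu_elapsed_s):.1f}")
```

Both timings are wall-clock times for the whole run, so the ratio covers everything the program does, not only the evaluation step.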

This looks very nice, but is it really the maximum that can be obtained?

The GPU of the NVIDIA RTX 2080 Ti card contains 4352 cores, but the population of the evolutionary algorithm is only 2048 individuals!

Let's have a look at what happens by launching:

$ ./weierstrass --popSize 4352 --nbOffspring 4352
------------------------------------------------------------------------------------------------
|GENER.|    ELAPSED    |    PLANNED    |     ACTUAL    |BEST INDIVIDUAL|  AVG  | WORST | STAND |
|NUMBER|     TIME      | EVALUATION NB | EVALUATION NB |    FITNESS    |FITNESS|FITNESS|  DEV  |
------------------------------------------------------------------------------------------------
     0	         0.092s	           4352	           4352	1.195859451e+02	1.4e+02	4.8e+00	1.6e+02
	...
    99	         3.150s	         435200	         435200	3.926587677e+01	4.0e+01	7.4e-02	4.0e+01

What seems strange is that the execution time went from 1.784s to 3.150s: even though 2048 individuals is fewer than the number of cores on the card, raising the population to 4352 did not come at constant time...

This comes from the fact that the SPMD parallelization of NVIDIA cards relies heavily on a thread scheduling that is based on pipelining, therefore implementing "spatio-temporal" parallelism. In order to maximize efficiency, it is necessary to "overload" the cores, i.e. to assign more threads than there are physical cores on the card (spatial parallelism), so that the card can also benefit from pipelining (temporal parallelism).
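
To get a feel for what "overloading" means in numbers, the sketch below computes how many individuals fall on each physical core for 2048 individuals, 4352 individuals and two further doublings. It assumes one GPU thread per individual, which is an assumption made for illustration and is not stated on this page; the 4352-core figure is the one quoted above.

```python
CORES = 4352  # core count of the RTX 2080 Ti, as quoted above


def individuals_per_core(pop_size: int, cores: int = CORES) -> float:
    """Load factor, assuming (hypothetically) one thread per individual."""
    return pop_size / cores


for pop_size in (2048, 4352, 8704, 17408):
    load = individuals_per_core(pop_size)
    print(f"popSize {pop_size:>5}: {load:.2f} individuals per core")
```

A load factor below 1 leaves cores idle; values above 1 are what gives the scheduler spare threads to pipeline.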

Launching the run with 8704 individuals yields the following results:

$ ./weierstrass --popSize 8704 --nbOffspring 8704
------------------------------------------------------------------------------------------------
|GENER.|    ELAPSED    |    PLANNED    |     ACTUAL    |BEST INDIVIDUAL|  AVG  | WORST | STAND |
|NUMBER|     TIME      | EVALUATION NB | EVALUATION NB |    FITNESS    |FITNESS|FITNESS|  DEV  |
------------------------------------------------------------------------------------------------
     0	         0.091s	           8704	           8704	1.195859451e+02	1.4e+02	4.7e+00	1.6e+02
	...
    99	         5.480s	         870400	         870400	3.589046860e+01	3.6e+01	9.3e-02	3.7e+01

Elapsed time has not doubled (3.150s to 5.480s, about x1.74) even though the population size has doubled...

Let's launch the run with 17408 individuals:

$ ./weierstrass --popSize 17408 --nbOffspring 17408
------------------------------------------------------------------------------------------------
|GENER.|    ELAPSED    |    PLANNED    |     ACTUAL    |BEST INDIVIDUAL|  AVG  | WORST | STAND |
|NUMBER|     TIME      | EVALUATION NB | EVALUATION NB |    FITNESS    |FITNESS|FITNESS|  DEV  |
------------------------------------------------------------------------------------------------
     0	         0.108s	          17408	          17408	1.195859451e+02	1.4e+02	4.7e+00	1.6e+02
	...
    99	        10.328s	        1740800	        1740800	3.181253624e+01	3.2e+01	9.5e-02	3.3e+01

Here, the runtime has nearly doubled (5.480s to 10.328s, about x1.88), meaning that the card is properly loaded.
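
One way to read the four GPU runs together is to convert each of them into a throughput, i.e. evaluations per second of elapsed time. The sketch below uses only the generation-99 figures from the tables above (the names are illustrative). The elapsed time covers the whole run rather than the evaluation step alone, so the result is an indicative figure, not a measurement of the GPU kernel itself.

```python
# (popSize, evaluations at generation 99, elapsed seconds), from the GPU runs above.
gpu_runs = [
    (2048, 204800, 1.784),
    (4352, 435200, 3.150),
    (8704, 870400, 5.480),
    (17408, 1740800, 10.328),
]


def throughput(evaluations: int, elapsed_s: float) -> float:
    """Evaluations per second of wall-clock time."""
    return evaluations / elapsed_s


previous_time = None
for pop_size, evaluations, elapsed_s in gpu_runs:
    line = f"popSize {pop_size:>5}: {throughput(evaluations, elapsed_s):>9.0f} eval/s"
    if previous_time is not None:
        # Runtime growth relative to the previous, smaller population.
        line += f", runtime x{elapsed_s / previous_time:.2f} vs previous run"
    print(line)
    previous_time = elapsed_s
```

Throughput keeps rising as the population grows, but the gain shrinks with each increase, which is consistent with the card approaching full load.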