Parallelizing over GPU cards

Without data transfer to the GPGPU card

The EASEA parallelization of an evolutionary algorithm over GPGPU cards is straightforward (just add the "-cuda" option on the easena compile line) provided the evaluation function does not need data to evaluate individuals.
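
To give an idea of why no data transfer is needed in this case, here is a minimal CUDA sketch (this is not the code generated by easena; the kernel name, the genome size DIM and the Weierstrass-like formula are assumptions for illustration only): each thread evaluates one individual using nothing but the genome that already resides in GPU memory.

#include <cuda_runtime.h>

#define DIM  10   /* genome size, assumed for this sketch           */
#define KMAX 20   /* number of terms in the Weierstrass-like series */

/* One thread evaluates one individual: no external data is read,   */
/* only the population itself, which already lives in GPU memory.   */
__global__ void evaluatePopulation(const float *genomes, float *fitness, int popSize)
{
    int ind = blockIdx.x * blockDim.x + threadIdx.x;
    if (ind >= popSize) return;

    const float a = 0.5f, b = 3.0f, pi = 3.14159265f;
    float f = 0.0f;
    for (int i = 0; i < DIM; i++) {
        float x = genomes[ind * DIM + i];
        for (int k = 0; k <= KMAX; k++)
            f += powf(a, (float)k) * cosf(2.0f * pi * powf(b, (float)k) * (x + 0.5f));
    }
    fitness[ind] = f;
}

Such a kernel would be launched with one thread per individual, e.g. evaluatePopulation<<<(popSize + 255) / 256, 256>>>(dGenomes, dFitness, popSize);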

For example, the weierstrass example (in the examples directory) can first be compiled for the CPU with:

$ make easeaclean ; easena weierstrass.ez ; make

On a PC with an Intel Core i7-9700K overclocked to 4.6GHz and an NVIDIA GeForce RTX 2080 Ti, the runtime over 100 generations is the following:

------------------------------------------------------------------------------------------------
|GENER.|    ELAPSED    |    PLANNED    |     ACTUAL    |BEST INDIVIDUAL|  AVG  | WORST | STAND |
|NUMBER|     TIME      | EVALUATION NB | EVALUATION NB |    FITNESS    |FITNESS|FITNESS|  DEV  |
------------------------------------------------------------------------------------------------
      0	         0.989s	           2048	           2048	1.232901840e+02	1.4e+02	4.8e+00	1.6e+02
	...
     99	        97.065s	         204800	         204800	5.162325287e+01	5.2e+01	1.4e-01	5.3e+01

Now, when compiled with:

$ make easeaclean ; easena weierstrass.ez -cuda ; make

the result becomes:

------------------------------------------------------------------------------------------------
|GENER.|    ELAPSED    |    PLANNED    |     ACTUAL    |BEST INDIVIDUAL|  AVG  | WORST | STAND |
|NUMBER|     TIME      | EVALUATION NB | EVALUATION NB |    FITNESS    |FITNESS|FITNESS|  DEV  |
------------------------------------------------------------------------------------------------
     0	         0.080s	           2048	           2048	1.232649612e+02	1.4e+02	4.8e+00	1.6e+02
	...
    99	         1.784s	         204800	         204800	5.043983459e+01	5.1e+01	1.5e-01	5.1e+01

so a x52.96 speedup is observed! One can also notice that the result is not identical even though the seed is the same, because the random number generators used on the CPU and on the GPU are different.

This looks very nice, but is it really the maximum that can be obtained?

The processor of the NVIDIA RTX 2080 Ti card contains 4352 cores, but the population of the evolutionary algorithm only counts 2048 individuals!

Let's have a look at what happens when launching:

$ ./weierstrass --popSize 4352 --nbOffspring 4352
------------------------------------------------------------------------------------------------
|GENER.|    ELAPSED    |    PLANNED    |     ACTUAL    |BEST INDIVIDUAL|  AVG  | WORST | STAND |
|NUMBER|     TIME      | EVALUATION NB | EVALUATION NB |    FITNESS    |FITNESS|FITNESS|  DEV  |
------------------------------------------------------------------------------------------------
     0	         0.092s	           4352	           4352	1.195859451e+02	1.4e+02	4.8e+00	1.6e+02
	...
    99	         3.150s	         435200	         435200	3.926587677e+01	4.0e+01	7.4e-02	4.0e+01

What seems strange is that the execution time went from 1.784s to 3.150s, meaning that even though the 2048 individuals were fewer than the number of cores on the card, adding individuals up to 4352 was not done at constant time...

This comes from the fact that the SPMD parallelization of NVIDIA cards relies heavily on a thread scheduling based on pipelining, thereby implementing "spatio-temporal" parallelism: in order to maximize efficiency, it is necessary to "overload" the cores, i.e. to assign more threads than there are physical cores on the card (spatial parallelism), so as to benefit from pipelining (temporal parallelism).
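
This over-provisioning is visible in the device properties themselves. The following self-contained sketch (not part of EASEA) queries them with the CUDA runtime API; the 64 FP32 cores per SM is the figure for the Turing architecture and would differ on other generations:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int coresPerSM = 64;   /* FP32 cores per SM on Turing (RTX 2080 Ti); architecture-dependent */

    printf("Device                : %s\n", prop.name);
    printf("Multiprocessors (SMs) : %d\n", prop.multiProcessorCount);
    printf("FP32 cores (approx.)  : %d\n", prop.multiProcessorCount * coresPerSM);
    printf("Max resident threads  : %d\n",
           prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor);
    return 0;
}

On an RTX 2080 Ti this should report 68 SMs, i.e. 4352 FP32 cores, but a maximum resident-thread count on the order of 70000: the scheduler is designed to keep far more threads in flight than there are cores, precisely so that pipelining can hide latencies.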

Launching the run with 8704 individuals yields the following results:

$ ./weierstrass --popSize 8704 --nbOffspring 8704
------------------------------------------------------------------------------------------------
|GENER.|    ELAPSED    |    PLANNED    |     ACTUAL    |BEST INDIVIDUAL|  AVG  | WORST | STAND |
|NUMBER|     TIME      | EVALUATION NB | EVALUATION NB |    FITNESS    |FITNESS|FITNESS|  DEV  |
------------------------------------------------------------------------------------------------
     0	         0.091s	           8704	           8704	1.195859451e+02	1.4e+02	4.7e+00	1.6e+02
	...
    99	         5.480s	         870400	         870400	3.589046860e+01	3.6e+01	9.3e-02	3.7e+01

Elapsed time has not doubled even though the population size has doubled...

Let's launch the run with 17408 individuals:

$ ./weierstrass --popSize 17408 --nbOffspring 17408
------------------------------------------------------------------------------------------------
|GENER.|    ELAPSED    |    PLANNED    |     ACTUAL    |BEST INDIVIDUAL|  AVG  | WORST | STAND |
|NUMBER|     TIME      | EVALUATION NB | EVALUATION NB |    FITNESS    |FITNESS|FITNESS|  DEV  |
------------------------------------------------------------------------------------------------
     0	         0.108s	          17408	          17408	1.195859451e+02	1.4e+02	4.7e+00	1.6e+02
	...
    99	        10.328s	        1740800	        1740800	3.181253624e+01	3.2e+01	9.5e-02	3.3e+01

Here, the runtime has doubled, meaning that the card is properly loaded.
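
The same saturation point can be located outside of EASEA with a small micro-benchmark. The sketch below (kernel name, busy loop and population sizes are arbitrary choices for illustration) times a dummy evaluation kernel for increasing population sizes with CUDA events; the card is fully loaded once the measured time starts growing linearly with the population size:

#include <cuda_runtime.h>
#include <stdio.h>

/* Stand-in for a costly evaluation: each thread burns a fixed amount of work. */
__global__ void dummyEval(float *fitness, int popSize)
{
    int ind = blockIdx.x * blockDim.x + threadIdx.x;
    if (ind >= popSize) return;
    float f = 0.0f;
    for (int k = 0; k < 20000; k++)
        f += __sinf(f + ind * 0.001f + k);
    fitness[ind] = f;
}

int main(void)
{
    const int maxPop = 69632;            /* several times the number of cores */
    float *dFitness;
    cudaMalloc(&dFitness, maxPop * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int pop = 2048; pop <= maxPop; pop *= 2) {
        cudaEventRecord(start);
        dummyEval<<<(pop + 255) / 256, 256>>>(dFitness, pop);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        printf("popSize %6d : %8.3f ms\n", pop, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dFitness);
    return 0;
}

The per-generation times observed above (1.784s, 3.150s, 5.480s, 10.328s for 2048, 4352, 8704 and 17408 individuals) follow exactly this pattern: sub-linear growth while the card still has spare scheduling capacity, then linear growth once it is saturated.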