Parallelizing over GPU cards
Current revision as of 5 May 2020, 12:18
Without data transfer to the GPGPU card
The EASEA parallelization of an evolutionary algorithm over GPGPU cards is straightforward (just add the "-cuda" option on the easena compile line) provided the evaluation function does not need data to evaluate individuals.
For example, the weierstrass example (in the examples directory) can first be compiled for the CPU only, with:
$ make easeaclean ; easena weierstrass.ez ; make
On a PC with an Intel Core i7-9700K overclocked to 4.6GHz and an NVIDIA GeForce RTX 2080 Ti, the runtime over 100 generations is the following:
------------------------------------------------------------------------------------------------
|GENER.|    ELAPSED    |    PLANNED    |     ACTUAL    |BEST INDIVIDUAL|  AVG  | WORST | STAND |
|NUMBER|     TIME      | EVALUATION NB | EVALUATION NB |    FITNESS    |FITNESS|FITNESS|  DEV  |
------------------------------------------------------------------------------------------------
     0          0.989s            2048            2048  1.232901840e+02 1.4e+02 4.8e+00 1.6e+02
...
    99         97.065s          204800          204800  5.162325287e+01 5.2e+01 1.4e-01 5.3e+01
Now, when compiled with:
$ make easeaclean ; easena weierstrass.ez -cuda ; make
the result becomes:
------------------------------------------------------------------------------------------------
|GENER.|    ELAPSED    |    PLANNED    |     ACTUAL    |BEST INDIVIDUAL|  AVG  | WORST | STAND |
|NUMBER|     TIME      | EVALUATION NB | EVALUATION NB |    FITNESS    |FITNESS|FITNESS|  DEV  |
------------------------------------------------------------------------------------------------
     0          0.080s            2048            2048  1.232649612e+02 1.4e+02 4.8e+00 1.6e+02
...
    99          1.784s          204800          204800  5.043983459e+01 5.1e+01 1.5e-01 5.1e+01
so a speedup of about x54.4 is observed (97.065s / 1.784s)!
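The speedup can be recomputed directly from the two elapsed times printed at generation 99. The following is only a minimal sketch: the function name `speedup` and the variable names are chosen here for illustration and are not part of EASEA.

```python
# Elapsed times at generation 99, copied from the two runs above (seconds).
cpu_seconds = 97.065   # compiled without -cuda
gpu_seconds = 1.784    # compiled with -cuda


def speedup(baseline, accelerated):
    """Return how many times faster the accelerated run is than the baseline."""
    return baseline / accelerated


print(f"speedup: x{speedup(cpu_seconds, gpu_seconds):.2f}")
```

Note that both timings cover the whole run (not only the evaluation function), so this is the end-to-end speedup of the algorithm on this machine.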
This looks very nice, but is it really the maximum that can be obtained?
The processor of the NVIDIA RTX 2080 Ti card contains 4352 cores, but the population of the evolutionary algorithm is only 2048 individuals!
Let's have a look at what happens by launching:
$ ./weierstrass --popSize 4352 --nbOffspring 4352
------------------------------------------------------------------------------------------------
|GENER.|    ELAPSED    |    PLANNED    |     ACTUAL    |BEST INDIVIDUAL|  AVG  | WORST | STAND |
|NUMBER|     TIME      | EVALUATION NB | EVALUATION NB |    FITNESS    |FITNESS|FITNESS|  DEV  |
------------------------------------------------------------------------------------------------
     0          0.092s            4352            4352  1.195859451e+02 1.4e+02 4.8e+00 1.6e+02
...
    99          3.150s          435200          435200  3.926587677e+01 4.0e+01 7.4e-02 4.0e+01
What seems strange is that execution time went from 1.784s to 3.150s: even though 2048 individuals are fewer than the number of cores on the card, raising the population to 4352 individuals did not come at constant time...
This comes from the fact that the SPMD parallelization of NVIDIA cards relies heavily on a thread scheduling based on pipelining, which implements a "spatio-temporal" parallelism: in order to maximize efficiency, it is necessary to "overload" the cores, i.e. to assign more threads than there are physical cores available on the card (spatial parallelism), so as to benefit from pipelining (temporal parallelism).
Launching the run with 8704 individuals yields the following results:
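To make the idea of "overloading" concrete, the sketch below computes how many threads each core would receive for the population sizes used on this page. It assumes one GPU thread per individual, which is an assumption made for illustration rather than something stated here; `threads_per_core` is a hypothetical helper name.

```python
CUDA_CORES = 4352  # number of cores of the RTX 2080 Ti, as stated above


def threads_per_core(pop_size, cores=CUDA_CORES):
    """Average number of threads per physical core, assuming 1 thread per individual."""
    return pop_size / cores


for pop_size in (2048, 4352, 8704, 17408):
    ratio = threads_per_core(pop_size)
    print(f"{pop_size:>6} individuals -> {ratio:.2f} threads per core")
```

A ratio below 1 means that some cores have nothing to do; a ratio above 1 is what gives the scheduler room to pipeline threads.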
$ ./weierstrass --popSize 8704 --nbOffspring 8704
------------------------------------------------------------------------------------------------
|GENER.|    ELAPSED    |    PLANNED    |     ACTUAL    |BEST INDIVIDUAL|  AVG  | WORST | STAND |
|NUMBER|     TIME      | EVALUATION NB | EVALUATION NB |    FITNESS    |FITNESS|FITNESS|  DEV  |
------------------------------------------------------------------------------------------------
     0          0.091s            8704            8704  1.195859451e+02 1.4e+02 4.7e+00 1.6e+02
...
    99          5.480s          870400          870400  3.589046860e+01 3.6e+01 9.3e-02 3.7e+01
Elapsed time has not doubled (3.150s to 5.480s) even though the population size has doubled...
Let's launch the run with 17408 individuals:
$ ./weierstrass --popSize 17408 --nbOffspring 17408
------------------------------------------------------------------------------------------------
|GENER.|    ELAPSED    |    PLANNED    |     ACTUAL    |BEST INDIVIDUAL|  AVG  | WORST | STAND |
|NUMBER|     TIME      | EVALUATION NB | EVALUATION NB |    FITNESS    |FITNESS|FITNESS|  DEV  |
------------------------------------------------------------------------------------------------
     0          0.108s           17408           17408  1.195859451e+02 1.4e+02 4.7e+00 1.6e+02
...
    99         10.328s         1740800         1740800  3.181253624e+01 3.2e+01 9.5e-02 3.3e+01
Here, the runtime has nearly doubled (5.480s to 10.328s, about x1.88), meaning that the card is properly loaded.
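The four GPU runs can be compared by looking at how the elapsed time grows each time the population doubles (or roughly doubles, from 2048 to 4352), and at the resulting number of evaluations per second. The sketch below only reuses the figures printed above; the names `runs`, `evals_per_second` and `time_ratios` are illustrative, and the elapsed time covers the whole run, not just the evaluation function.

```python
GENERATIONS = 100  # generations 0 to 99, as in the runs above

# Population size -> elapsed time at generation 99 (seconds), copied from the runs above.
runs = {2048: 1.784, 4352: 3.150, 8704: 5.480, 17408: 10.328}


def evals_per_second(pop_size, seconds, generations=GENERATIONS):
    """Evaluations performed per second of elapsed time over the whole run."""
    return pop_size * generations / seconds


def time_ratios(timings):
    """Ratio of elapsed time between each population size and the previous one."""
    sizes = sorted(timings)
    return {b: timings[b] / timings[a] for a, b in zip(sizes, sizes[1:])}


for pop_size, seconds in sorted(runs.items()):
    rate = evals_per_second(pop_size, seconds)
    print(f"{pop_size:>6} individuals: {seconds:>7.3f}s, {rate:>9.0f} evaluations/s")

for pop_size, ratio in time_ratios(runs).items():
    print(f"up to {pop_size:>6} individuals: elapsed time x{ratio:.2f}")
```

With these figures, throughput keeps rising from about 115,000 to about 169,000 evaluations per second, and the time ratio only approaches 2 on the last step, which is consistent with the card becoming properly loaded around 17408 individuals.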