Parallelizing over CPU threads

Launch the compiled executable with the option "--nbCPUThreads n", where n is the number of threads to use.

Example with the weierstrass program (in the "examples" directory downloaded with easea), run on an 8-core Intel Core i7-9700K CPU:

$ ./weierstrass --nbGen 10
...
9	9.83		20480		9.16e+01  1.04e+02  4.12e+00  1.17e+02
finished computation at Sun May  3 13:16:58 2020
elapsed time: 9.83656s
EASEA LOG [INFO]: Seed: 1588504609

When launched with 2 threads, the result of the parallelized algorithm is the following:

$ ./weierstrass --nbCPUThreads 2 --nbGen 10 --seed 1588504609
...
9	4.95		20480		9.16e+01  1.04e+02  4.12e+00  1.17e+02
finished computation at Sun May  3 17:21:44 2020 
elapsed time: 4.9501s
EASEA LOG [INFO]: Seed: 1588504609

Note that, when launched with the same seed, the parallel algorithm gives exactly the same results as the sequential one (only the fitness function is run in parallel). The speedup factor is x1.987.
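Why the results stay identical can be illustrated with a minimal sketch. This is not EASEA's actual implementation: the fitness function, population size and genome length below are hypothetical stand-ins. The idea is that every random draw happens sequentially from the seeded generator, and only the (deterministic) fitness evaluations are handed to worker threads. The sketch only demonstrates reproducibility: in CPython, threads do not speed up CPU-bound code.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def fitness(genome):
    # Hypothetical stand-in for an expensive, deterministic fitness function.
    return sum(x * x for x in genome)


def evaluate_population(seed, n_threads, pop_size=8, genome_len=4):
    # All random draws are made sequentially from one seeded generator,
    # so the population does not depend on the number of threads.
    rng = random.Random(seed)
    population = [[rng.uniform(-1.0, 1.0) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    # Only the fitness evaluations are distributed over the threads.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map() returns results in input order, whatever the scheduling.
        return list(pool.map(fitness, population))


sequential = evaluate_population(seed=1588504609, n_threads=1)
parallel = evaluate_population(seed=1588504609, n_threads=4)
print(sequential == parallel)
```

Because the fitness function has no side effects and the results are collected in population order, the thread count changes only the wall-clock time, not the values.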

Now, with 4 threads:

$ ./weierstrass --nbCPUThreads 4 --nbGen 10 --seed 1588504609
...
9	2.51		20480		9.16e+01  1.04e+02  4.12e+00  1.17e+02
finished computation at Sun May  3 17:25:07 2020
elapsed time: 2.51069s

... the speedup factor is x3.918.

With 8 threads (remember that this is an 8-core CPU):

$ ./weierstrass --nbCPUThreads 8 --nbGen 10 --seed 1588504609
...
9	1.23		20480		9.16e+01  1.04e+02  4.12e+00  1.17e+02
finished computation at Sun May  3 17:28:48 2020
elapsed time: 1.23833s

... the speedup factor is x7.943.

If we increase the number of threads to 20, well beyond the number of cores:

$ ./weierstrass --nbCPUThreads 20 --nbGen 10 --seed 1588504609
...
9	1.23		20480		9.16e+01  1.04e+02  4.12e+00  1.17e+02
EASEA LOG [INFO]: Stopping criterion is reached 
finished computation at Sun May  3 17:30:16 2020
elapsed time: 1.23324s

... the speedup factor is nearly identical: x7.976.
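The speedup factors quoted above are simply the sequential elapsed time divided by the parallel elapsed time. A small sketch to recompute them from the timings reported on this page (the function names are ours, and the per-thread efficiency is an extra indicator not reported above):

```python
def speedup(t_sequential, t_parallel):
    # Speedup = elapsed time with 1 thread / elapsed time with n threads.
    return t_sequential / t_parallel


def efficiency(t_sequential, t_parallel, n_threads):
    # Fraction of the ideal (linear) speedup obtained per thread.
    return speedup(t_sequential, t_parallel) / n_threads


# Elapsed times (in seconds) copied from the runs above on the 8-core CPU.
T1 = 9.83656
timings = {2: 4.9501, 4: 2.51069, 8: 1.23833, 20: 1.23324}

for n, t in timings.items():
    print(f"{n:>2} threads: speedup x{speedup(T1, t):.3f}, "
          f"efficiency per thread {efficiency(T1, t, n):.2f}")
```

The per-thread efficiency stays close to 1 up to 8 threads and drops sharply with 20 threads, since the 12 extra threads have no additional core to run on.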

Note that the speedup does not always level off at the number of cores like this.

On a MacBook Air 8.1 with a Dual-Core Intel Core i5 (1 processor, 2 cores, hyper-threading enabled), the execution times are:

1 thread: 10.5831s
2 threads: 6.42092s, speedup = x1.648
3 threads: 5.4376s, speedup = x1.946
4 threads: 4.98705s, speedup = x2.122 (!)
5 threads: 4.93427s, speedup = x2.145 (!!)
6 threads: 4.83617s, speedup = x2.188 (!!!)
7 threads: 4.90261s, speedup = x2.159
8 threads: 4.90446s, speedup = x2.158

A small variability is normal (the system is running many other processes in parallel, and temperature is also a factor), but it is interesting to see that on this 2-core processor, the best speedup, above x2 (!), was obtained with... 6 threads.

There must be some serious scheduling optimization going on down there... :-)

In conclusion, the general advice is to ask for more threads than there are cores in the processor, and to let the system deal with the scheduling and load balancing.
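As an illustration of this advice, here is a minimal sketch that builds the command line with more threads than cores. The helper names and the factor of 2 are assumptions made for the example, not an EASEA recommendation; only the "--nbCPUThreads" and "--nbGen" options come from the runs shown on this page. Note that os.cpu_count() reports logical CPUs, so hyper-threaded cores are counted twice.

```python
import os


def suggested_threads(n_cores, oversubscription=2):
    # Hypothetical rule of thumb: ask for a multiple of the core count.
    return max(1, n_cores * oversubscription)


def build_command(executable, n_threads, extra_args=()):
    # Assemble the argument list for the compiled EASEA executable.
    return [executable, "--nbCPUThreads", str(n_threads), *extra_args]


cores = os.cpu_count() or 1  # cpu_count() may return None
cmd = build_command("./weierstrass", suggested_threads(cores), ["--nbGen", "10"])
print(" ".join(cmd))
```

The resulting list can be passed as-is to subprocess.run() to launch the program. The timings above suggest that oversubscribing costs little, but the best factor is machine-dependent and worth measuring.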
