Parallelizing over CPU threads


Launch the compiled executable with the option "--nbCPUThreads n", where n is the number of threads to use.

Example with the weierstrass example (in the "examples" directory downloaded with EASEA) on an 8-core Intel Core i7-9700K CPU:

$ ./weierstrass
[||||||||||||||||||||||||||||||||||||||||] 100% 33.132s

When launched with 2 threads, the result of the parallelized algorithm is the following:

$ ./weierstrass --nbCPUThreads 2
[||||||||||||||||||||||||||||||||||||||||] 100% 16.696s

Please note that, when launched with the same seed, the parallel algorithm gives results identical to those of the sequential one (only the fitness function is run in parallel). The speedup factor is x1.984 (33.132s / 16.696s).

Now, with 4 threads:

$ ./weierstrass --nbCPUThreads 4
[||||||||||||||||||||||||||||||||||||||||] 100% 8.446s

...the speedup factor is x3.922.

With 8 threads (remember that this is an 8-core CPU):

$ ./weierstrass --nbCPUThreads 8
[||||||||||||||||||||||||||||||||||||||||] 100% 4.07s

...the speedup factor is x8.14! A speedup above 8 on an 8-core processor... not bad! :-)

If we increase the number of threads to 20:

$ ./weierstrass --nbCPUThreads 20
[||||||||||||||||||||||||||||||||||||||||] 100% 4.069s

...the speedup factor is nearly identical: x8.142.

It is nice to obtain a speedup that is greater than the number of cores of the machine :-)
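
The speedup factors quoted above are simply the sequential time divided by the parallel time. As a small sketch (written in Python, which is not part of EASEA; the names `times` and `speedup` are made up for this illustration), they can be recomputed from the timings reported above:

```python
# Wall-clock times in seconds, copied from the i7-9700K runs above.
# Key = value given to --nbCPUThreads; key 1 stands for the plain
# sequential run (launched without the option).
times = {1: 33.132, 2: 16.696, 4: 8.446, 8: 4.07, 20: 4.069}


def speedup(t_sequential, t_parallel):
    """Speedup factor = sequential time / parallel time."""
    return t_sequential / t_parallel


baseline = times[1]
speedups = {n: speedup(baseline, t) for n, t in times.items()}

for n, s in speedups.items():
    print(f"{n:>2} threads: x{s:.3f}")
```

The last digit may differ from the figures in the text, depending on whether the value is rounded or truncated.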

Please note that the maximum speedup is not always obtained when the number of threads equals the number of cores of the processor. On a MacBook Air 8.1 with a Dual-Core Intel Core i5 (1 processor, 2 cores, hyper-threading activated), the execution times are:

1 thread: 10.5831s
2 threads: 6.42092s, speedup = x1.648
3 threads: 5.43762s, speedup = x1.946
4 threads: 4.98705s, speedup = x2.122 (!)
5 threads: 4.93427s, speedup = x2.145 (!!)
6 threads: 4.83617s, speedup = x2.188 (!!!)
7 threads: 4.90261s, speedup = x2.158
8 threads: 4.90446s, speedup = x2.157

A small variability is normal (the system is running many other processes in parallel, and temperature is also a factor), but it is interesting to see that on this 2-core processor, the best speedup (above 2!) was obtained with... 6 threads (not 2).
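
To put a number on this, here is a small Python sketch (not part of EASEA; the variable names are only for illustration) that takes the MacBook Air timings listed above, picks the fastest thread count, and measures how far apart the runs with 4 to 8 threads really are:

```python
# Execution times in seconds for 1 to 8 threads, copied from the list above.
mac_times = {
    1: 10.5831, 2: 6.42092, 3: 5.43762, 4: 4.98705,
    5: 4.93427, 6: 4.83617, 7: 4.90261, 8: 4.90446,
}

# Thread count with the shortest execution time.
best = min(mac_times, key=mac_times.get)
best_speedup = mac_times[1] / mac_times[best]

# Relative spread of the runs with 4 to 8 threads, compared to the fastest one.
plateau = [mac_times[n] for n in range(4, 9)]
spread = (max(plateau) - min(plateau)) / min(plateau)

print(f"best: {best} threads, speedup x{best_speedup:.3f}")
print(f"spread between 4 and 8 threads: {spread:.1%}")
```

From 4 threads onwards, the times differ by only a few percent, which is of the same order as the run-to-run variability mentioned above.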

There must be some serious scheduling optimization going on down there... :-)

In conclusion, the general advice would be to ask for more threads than there are cores in the processor, and let the system deal with the scheduling and load balancing.
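
Since the best number of threads depends on the machine, one way to find it is to try several values and keep the fastest. The following is only a rough sketch in Python (not an EASEA tool): `time_run`, `sweep` and `best_thread_count` are hypothetical helper names, and the only thing taken from EASEA is the `--nbCPUThreads` option shown above.

```python
import subprocess
import time


def time_run(cmd):
    """Run cmd (a list of arguments) once; return its wall-clock time in seconds."""
    start = time.perf_counter()
    # The program's standard output is discarded to keep the terminal clean.
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def sweep(base_cmd, thread_counts):
    """Time base_cmd once per thread count, appending --nbCPUThreads n each time."""
    results = {}
    for n in thread_counts:
        results[n] = time_run(list(base_cmd) + ["--nbCPUThreads", str(n)])
    return results


def best_thread_count(results):
    """Return the thread count that gave the shortest time."""
    return min(results, key=results.get)
```

For example, `results = sweep(["./weierstrass"], [1, 2, 4, 8, 16])` followed by `best_thread_count(results)` would try five settings on the weierstrass example. The times measured this way include the start-up of the process, so they will not match exactly the time printed by the program, and a single run per setting is subject to the variability discussed above.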