Parallelizing over CPU threads

Revision as of 4 May 2020, 11:46

Launch the executable with the option "--nbCPUThreads n", where n is the number of threads to use.

Example with the weierstrass example (in the "examples" directory downloaded with easea) on an 8-core Intel Core i7-9700K CPU:

$ ./weierstrass
[||||||||||||||||||||||||||||||||||||||||] 100% 33.132s

When launched with 2 threads, the result of the parallelized algorithm is the following:

$ ./weierstrass --nbCPUThreads 2
[||||||||||||||||||||||||||||||||||||||||] 100% 16.696s

Note that, when launched with the same seed, the parallel algorithm produces results identical to those of the sequential one (only the fitness function is run in parallel). The speedup factor is x1.984.

Now, with 4 threads:

$ ./weierstrass --nbCPUThreads 4
[||||||||||||||||||||||||||||||||||||||||] 100% 8.446s

...the speedup factor is x3.922.

With 8 threads (remember that this is an 8-core CPU):

$ ./weierstrass --nbCPUThreads 8
[||||||||||||||||||||||||||||||||||||||||] 100% 4.07s

... the speedup factor is x8.14! Not bad! :-)

If we increase the number of threads further (20):

$ ./weierstrass --nbCPUThreads 20
[||||||||||||||||||||||||||||||||||||||||] 100% 4.069s

... the speedup factor is nearly identical: x8.142

It is nice to obtain a speedup that is greater than the number of cores of the machine :-)
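
The speedup factors quoted above are simply the sequential run time divided by the parallel run time. As a minimal sketch, the following Python snippet recomputes them from the timings reported for the i7-9700K. The names `timings`, `speedup` and `efficiency` are illustrative and not part of EASEA, and the efficiency figure assumes that threads beyond the 8 physical cores have to share those cores.

```python
# Wall-clock times (in seconds) reported above for the 8-core i7-9700K,
# keyed by the value passed to --nbCPUThreads (1 = run without the option).
timings = {1: 33.132, 2: 16.696, 4: 8.446, 8: 4.07, 20: 4.069}


def speedup(t_sequential, t_parallel):
    """Speedup = sequential time / parallel time."""
    return t_sequential / t_parallel


def efficiency(s, n_threads, n_cores=8):
    """Speedup per core in use (assumption: extra threads share the cores)."""
    return s / min(n_threads, n_cores)


for n, t in timings.items():
    s = speedup(timings[1], t)
    print(f"{n:>2} threads: {t:>7.3f}s  speedup x{s:.3f}  "
          f"efficiency {efficiency(s, n):.2f}")
```

An efficiency close to 1 means that each core contributes almost a full sequential run's worth of work, which is what the measurements above show up to 8 threads.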

Please note that the maximum speedup is not always obtained when the number of threads equals the number of cores of the processor. On a MacBook Air 8.1 with a dual-core Intel Core i5 (1 processor, 2 cores, hyper-threading enabled), the execution times are:

1 thread: 10.5831s
2 threads: 6.42092s, speedup = x1.648
3 threads: 5.43762s, speedup = x1.946
4 threads: 4.98705s, speedup = x2.122 (!)
5 threads: 4.93427s, speedup = x2.145 (!!)
6 threads: 4.83617s, speedup = x2.188 (!!!)
7 threads: 4.90261s, speedup = x2.158
8 threads: 4.90446s, speedup = x2.157

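A small variability is normal (the system runs many other processes in parallel, and temperature is also a factor), but it is interesting to see that on this 2-core processor, the best speedup (above 2!) was obtained with 6 threads, not 2. With hyper-threading enabled, the 2 cores expose 4 logical processors, which may partly explain why the gains continue beyond 2 threads.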

There must be some serious scheduling optimization going on down there... :-)
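
The best thread count can be read off the MacBook Air measurements mechanically. The sketch below does so in Python and also lists the thread counts whose run time is within 5% of the best one. The 5% tolerance is an arbitrary assumption chosen to absorb the run-to-run variability mentioned above, and all names are illustrative.

```python
# Run times (in seconds) measured on the dual-core MacBook Air,
# keyed by number of threads.
macbook_timings = {
    1: 10.5831, 2: 6.42092, 3: 5.43762, 4: 4.98705,
    5: 4.93427, 6: 4.83617, 7: 4.90261, 8: 4.90446,
}

# Thread count with the shortest run time.
best_n = min(macbook_timings, key=macbook_timings.get)
best_time = macbook_timings[best_n]
best_speedup = macbook_timings[1] / best_time

# Thread counts that are "as good as the best" within a tolerance
# (assumption: 5% is a plausible bound on measurement noise).
TOLERANCE = 0.05
near_best = [n for n, t in macbook_timings.items()
             if t <= best_time * (1 + TOLERANCE)]

print(f"best: {best_n} threads, speedup x{best_speedup:.3f}")
print(f"within {TOLERANCE:.0%} of the best: {near_best}")
```

Under this tolerance, every setting from 4 to 8 threads is practically equivalent, so the exact optimum matters less than asking for enough threads.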

In conclusion, the general advice would be to ask for more threads than there are cores in the processor, and let the system deal with the scheduling and load balancing.
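
One way to follow this advice without hard-coding a number is to derive the thread count from the machine at launch time. The Python sketch below is only an illustration: the helper names and the oversubscription factor of 2 are assumptions, not EASEA recommendations. It relies on `os.cpu_count()`, which reports logical CPUs (so hyper-threads are counted), and it only builds and prints the command line rather than running it.

```python
import os


def suggested_thread_count(oversubscription=2, logical_cpus=None):
    """Return a thread count larger than the number of logical CPUs.

    The factor of 2 is an arbitrary illustrative choice.
    """
    if logical_cpus is None:
        # os.cpu_count() may return None if the count cannot be determined.
        logical_cpus = os.cpu_count() or 1
    return max(1, logical_cpus * oversubscription)


def build_command(executable="./weierstrass", n_threads=None):
    """Build the argument list for a run with --nbCPUThreads."""
    if n_threads is None:
        n_threads = suggested_thread_count()
    return [executable, "--nbCPUThreads", str(n_threads)]


# Print the command that would be launched on this machine.
print(" ".join(build_command()))
```

The resulting list can be passed as-is to `subprocess.run` if the launch should be scripted.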