Updated almost 11 years ago by Knödlseder Jürgen
Computation Benchmarks¶
Mac OS X benchmarks¶
Here is a summary of benchmarks obtained on a 2.66 GHz Intel Core i7 Mac OS X 10.6.8 system for executing a given computation 100000000 (one hundred million) times (execution times in seconds). The benchmarks have been obtained in both double precision and single precision:
| Computation | double | float | Comment |
|---|---|---|---|
| pow(x,2) | 1.90 | 1.19 | |
| x*x | 0.49 | 0.48 | Prefer multiplication over pow(x,2) |
| pow(x,2.01) | 7.96 | 4.19 | pow is a very time consuming operation |
| x/a | 0.98 | 1.22 | |
| x*b (where b=1/a) | 0.46 | 0.66 | Prefer multiplication by the inverse over division |
| x+1.5 | 0.40 | 0.40 | |
| x-1.5 | 0.49 | 0.49 | Prefer addition over subtraction |
| sin | 4.66 | 2.39 | |
| cos | 4.64 | 2.46 | |
| tan | 5.40 | 2.84 | tan is pretty time consuming |
| acos | 2.18 | 0.94 | |
| sqrt | 1.29 | 1.37 | |
| log10 | 2.60 | 2.48 | |
| log | 2.72 | 2.33 | |
| exp | 7.17 | 7.20 | exp is a very time consuming operation (comparable to pow) |
Note that pow and the trigonometric functions are significantly faster (by a factor of about 2) in single precision than in double precision, whereas exp takes essentially the same time in both.
Benchmark comparison for different systems¶
And here is a comparison for various computing systems (double precision). In this comparison, the Mac is about the fastest; galileo (a 32-bit system) is pretty fast for multiplications; kepler is the laziest (AMD related? Multi-core related?); fermi and the CI13 virtual box are about the same (there is no notable difference between gcc and clang on the virtual box).
| Computation | Mac OS X | galileo | kepler | dirac | fermi | CI13 (gcc 4.8.0) | CI13 (clang 3.1) |
|---|---|---|---|---|---|---|---|
| pow(x,2) | 1.90 | 5.73 | 4.83 | 3.5 | 2.65 | 1.94 | 1.99 |
| x*x | 0.49 | 0.31 | 1.04 | 1.06 | 0.5 | 0.58 | 0.57 |
| pow(x,2.01) | 7.96 | 10.96 | 17.53 | 17.73 | 11.11 | 8.71 | 8.44 |
| x/a | 0.98 | 1.24 | 1.87 | 1.92 | 1.03 | 1.15 | 1.16 |
| x*b (where b=1/a) | 0.46 | 0.27 | 0.99 | 0.99 | 0.51 | 0.54 | 0.54 |
| x+1.5 | 0.40 | 0.27 | 0.96 | 1.02 | 0.43 | 0.47 | 0.47 |
| x-1.5 | 0.49 | 0.27 | 1.08 | 1.1 | 0.57 | 0.47 | 0.47 |
| sin | 4.66 | 4.76 | 10.46 | 10.44 | 6.72 | 5.62 | 5.52 |
| cos | 4.64 | 4.68 | 10.16 | 10.28 | 6.35 | 5.65 | 5.62 |
| tan | 5.40 | 6.27 | 15.23 | 15.4 | 8.61 | 8.11 | 7.98 |
| acos | 2.18 | 9.57 | 7.49 | 7.75 | 4.48 | 3.86 | 2.93 |
| sqrt | 1.29 | 3.29 | 2.33 | 2.4 | 0.97 | 2.02 | 1.84 |
| log10 | 2.60 | 5.33 | 12.91 | 12.58 | 7.71 | 6.54 | 6.47 |
| log | 2.72 | 5.15 | 10.64 | 10.66 | 6.32 | 5.26 | 5.09 |
| exp | 7.17 | 10 | 4.78 | 4.8 | 1.85 | 2.03 | 2.02 |
Underlined numbers show the fastest, bold numbers the slowest computations.
And the same for single precision:
| Computation | Mac OS X | galileo | kepler | dirac | fermi | CI13 (gcc 4.8.0) | CI13 (clang 3.1) |
|---|---|---|---|---|---|---|---|
| pow(x,2) | 1.19 | 1.77 | 3.27 | 3 | 1.35 | 1.54 | 0.9 |
| x*x | 0.48 | 0.3 | 0.99 | 1 | 0.47 | 0.54 | 0.54 |
| pow(x,2.01) | 4.19 | 10.64 | 29.81 | 30.21 | 14.42 | 13 | 12.29 |
| x/a | 1.22 | 1.24 | 2.77 | 2.79 | 1.2 | 1.37 | 1.4 |
| x*b (where b=1/a) | 0.66 | 0.27 | 1.72 | 1.74 | 0.67 | 0.76 | 0.79 |
| x+1.5 | 0.40 | 0.27 | 1.03 | 1.04 | 0.4 | 0.46 | 0.47 |
| x-1.5 | 0.49 | 0.27 | 1.13 | 1.14 | 0.54 | 0.47 | 0.47 |
| sin | 2.39 | 4.92 | 116.41 | 119.06 | 54 | 41.2 | 40.22 |
| cos | 2.46 | 4.85 | 116.47 | 119.27 | 53.93 | 40.91 | 40.3 |
| tan | 2.84 | 6.47 | 120.69 | 122 | 55.14 | 42.36 | 41.83 |
| acos | 0.94 | 9.02 | 8.6 | 8.71 | 3.86 | 2.81 | 2.38 |
| sqrt | 1.37 | 2.27 | 3.77 | 3.75 | 1.5 | 1.84 | 1.55 |
| log10 | 2.48 | 4.15 | 12.74 | 12.59 | 6.28 | 5.74 | 4.97 |
| log | 2.33 | 3.83 | 10.07 | 10.42 | 5.16 | 4.88 | 4.11 |
| exp | 7.20 | 9.96 | 17.51 | 18.32 | 10.77 | 10.21 | 10.18 |
Note the enormous speed penalty of the single-precision trigonometric functions on most of the systems. Single-precision arithmetic is only consistently faster on Mac OS X.
Here are the specifications of the machines used for benchmarking:
- Mac OS X: 2.66 GHz Intel Core i7, Mac OS X 10.6.8, gcc 4.2.1
- galileo: 32-bit, Intel Xeon, 2.8 GHz, gcc 3.2.2
- kepler: 64-bit, AMD Opteron 6174, 12C, 2.20 GHz, gcc 4.1.2
- dirac: 64-bit, AMD Opteron 6174, 12C, 2.20 GHz, gcc 4.1.2
- fermi: 64-bit, Intel(R) Xeon(R) CPU E5450 @ 3.00GHz, gcc 4.1.2
- CI13: 64-bit (virtual box)
Behind the scenes¶
Here is some information to help understand what happens.
Kepler¶
I did some experiments to see how the compiled code differs for different variants of the sin function call. In particular, I tested:
- std::sin(double)
- std::sin(float)
- sin(double)
- sin(float)

It turned out that the call to std::sin(float) calls the function sinf, while all the other variants call sin. The execution time difference is therefore related to the different implementations of sin and sinf on kepler.
Note that sin and sinf are implemented in /lib64/libm.so.6 on kepler. This library is part of the GNU C library libc (see http://www.gnu.org/software/libc/).
Using std::sin(double)¶
When std::sin(double) is used, the compiler generates a call to the sin function. Note that the same behavior is obtained when calling sin(double) (without the std:: prefix).
```
$ nano stdsin.cpp
#include <cmath>
int main(void)
{
    double arg    = 1.0;
    double result = std::sin(arg);
    return 0;
}
$ g++ -S stdsin.cpp
$ more stdsin.s
main:
.LFB97:
        pushq   %rbp
.LCFI0:
        movq    %rsp, %rbp
.LCFI1:
        subq    $32, %rsp
.LCFI2:
        movabsq $4607182418800017408, %rax
        movq    %rax, -16(%rbp)
        movq    -16(%rbp), %rax
        movq    %rax, -24(%rbp)
        movsd   -24(%rbp), %xmm0
        call    sin
        movsd   %xmm0, -24(%rbp)
        movq    -24(%rbp), %rax
        movq    %rax, -8(%rbp)
        movl    $0, %eax
        leave
        ret
```
Using std::sin(float)¶
When std::sin(float) is used, the compiler generates a call to the sinf function.
```
$ nano floatstdsin.cpp
#include <cmath>
int main(void)
{
    float arg    = 1.0;
    float result = std::sin(arg);
    return 0;
}
$ g++ -S floatstdsin.cpp
$ more floatstdsin.s
main:
.LFB97:
        pushq   %rbp
.LCFI3:
        movq    %rsp, %rbp
.LCFI4:
        subq    $32, %rsp
.LCFI5:
        movl    $0x3f800000, %eax
        movl    %eax, -8(%rbp)
        movl    -8(%rbp), %eax
        movl    %eax, -20(%rbp)
        movss   -20(%rbp), %xmm0
        call    _ZSt3sinf
        movss   %xmm0, -20(%rbp)
        movl    -20(%rbp), %eax
        movl    %eax, -4(%rbp)
        movl    $0, %eax
        leave
        ret
_ZSt3sinf:
.LFB57:
        pushq   %rbp
.LCFI0:
        movq    %rsp, %rbp
.LCFI1:
        subq    $16, %rsp
.LCFI2:
        movss   %xmm0, -4(%rbp)
        movl    -4(%rbp), %eax
        movl    %eax, -12(%rbp)
        movss   -12(%rbp), %xmm0
        call    sinf
        movss   %xmm0, -12(%rbp)
        movl    -12(%rbp), %eax
        movl    %eax, -12(%rbp)
        movss   -12(%rbp), %xmm0
        leave
        ret
```
Using sin(float)¶
When sin(float) is used, the compiler performs an implicit conversion to double and then calls the sin function.
```
$ nano floatsin.cpp
#include <cmath>
int main(void)
{
    float arg    = 1.0;
    float result = sin(arg);
    return 0;
}
$ g++ -S floatsin.cpp
$ more floatsin.s
main:
.LFB97:
        pushq   %rbp
.LCFI0:
        movq    %rsp, %rbp
.LCFI1:
        subq    $16, %rsp
.LCFI2:
        movl    $0x3f800000, %eax
        movl    %eax, -8(%rbp)
        cvtss2sd -8(%rbp), %xmm0
        call    sin
        cvtsd2ss %xmm0, %xmm0
        movss   %xmm0, -4(%rbp)
        movl    $0, %eax
        leave
        ret
```
Mac OS X¶
And now the same experiment on Mac OS X. It turns out that the code generated by the compiler has the same structure, and the functions that are called are again _sin and _sinf (all function names have a _ prepended on Mac OS X). This means that the implementation of the _sinf function on Mac OS X is considerably faster than the implementation on kepler.
Note that _sin and _sinf are implemented in /usr/lib/libSystem.B.dylib on my Mac OS X system.
Using std::sin(double)¶
When std::sin(double) is used, the compiler generates a call to the _sin function. Note that the same behavior is obtained when calling sin(double) (without the std:: prefix).
```
$ nano stdsin.cpp
#include <cmath>
int main(void)
{
    double arg    = 1.0;
    double result = std::sin(arg);
    return 0;
}
$ g++ -S stdsin.cpp
$ more stdsin.s
_main:
LFB127:
        pushq   %rbp
LCFI0:
        movq    %rsp, %rbp
LCFI1:
        subq    $16, %rsp
LCFI2:
        movabsq $4607182418800017408, %rax
        movq    %rax, -8(%rbp)
        movsd   -8(%rbp), %xmm0
        call    _sin
        movsd   %xmm0, -16(%rbp)
        movl    $0, %eax
        leave
        ret
```
Using std::sin(float)¶
When std::sin(float) is used, the compiler generates a call to the _sinf function.
```
$ nano floatstdsin.cpp
#include <cmath>
int main(void)
{
    float arg    = 1.0;
    float result = std::sin(arg);
    return 0;
}
$ g++ -S floatstdsin.cpp
$ more floatstdsin.s
_main:
LFB127:
        pushq   %rbp
LCFI3:
        movq    %rsp, %rbp
LCFI4:
        subq    $16, %rsp
LCFI5:
        movl    $0x3f800000, %eax
        movl    %eax, -4(%rbp)
        movss   -4(%rbp), %xmm0
        call    __ZSt3sinf
        movss   %xmm0, -8(%rbp)
        movl    $0, %eax
        leave
        ret
__ZSt3sinf:
LFB87:
        pushq   %rbp
LCFI0:
        movq    %rsp, %rbp
LCFI1:
        subq    $16, %rsp
LCFI2:
        movss   %xmm0, -4(%rbp)
        movss   -4(%rbp), %xmm0
        call    _sinf
        leave
        ret
```
Using sin(float)¶
When sin(float) is used, the compiler performs an implicit conversion to double and then calls the _sin function.
```
$ nano floatsin.cpp
#include <cmath>
int main(void)
{
    float arg    = 1.0;
    float result = sin(arg);
    return 0;
}
$ g++ -S floatsin.cpp
$ more floatsin.s
_main:
LFB127:
        pushq   %rbp
LCFI0:
        movq    %rsp, %rbp
LCFI1:
        subq    $16, %rsp
LCFI2:
        movl    $0x3f800000, %eax
        movl    %eax, -4(%rbp)
        cvtss2sd -4(%rbp), %xmm0
        call    _sin
        cvtsd2ss %xmm0, %xmm0
        movss   %xmm0, -8(%rbp)
        movl    $0, %eax
        leave
        ret
```