{{lastupdated_at}} by {{lastupdated_by}}

{{>toc}}

h1. Computation Benchmarks

h2. Mac OS X benchmarks

Here is a summary of some benchmarks obtained on a 2.66 GHz Intel Core i7 system running Mac OS X 10.6.8, for executing a given computation 100000000 (one hundred million) times (execution times in seconds). The benchmarks have been obtained in double precision and in single precision:

| *Computation* | *double* | *float* | *Comment* |
| pow(x,2) | 1.90 | 1.19 | |
| x*x | 0.49 | 0.48 | *Prefer multiplication over pow(x,2)* |
| pow(x,2.01) | 7.96 | 4.19 | @pow@ is a very time consuming operation |
| x/a | 0.98 | 1.22 | |
| x*b (where b=1/a) | 0.46 | 0.66 | *Prefer multiplication by the inverse over division* |
| x+1.5 | 0.40 | 0.40 | |
| x-1.5 | 0.49 | 0.49 | *Prefer addition over subtraction* |
| sin(x) | 4.66 | 2.39 | |
| cos(x) | 4.64 | 2.46 | |
| tan(x) | 5.40 | 2.84 | @tan@ is pretty time consuming |
| acos(x) | 2.18 | 0.94 | |
| sqrt(x) | 1.29 | 1.37 | |
| log10(x) | 2.60 | 2.48 | |
| log(x) | 2.72 | 2.33 | |
| exp(x) | 7.17 | 7.20 | @exp@ is a very time consuming operation (comparable to @pow@) |

Note that @pow@ and the trigonometric functions are significantly faster (by a factor of about 2) in single precision than in double precision, while @exp@ takes about the same time in both.
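The exact benchmark program is not reproduced on this page. As an illustration only, a minimal sketch of such a timing loop could look as follows; the @volatile@ accumulator, the @clock()@ based timing and the specific values are assumptions made to keep the sketch self-contained and to prevent the compiler from optimizing the loop away (the sketch also shows the "multiply by the inverse" variant from the table above):

  #include <cmath>
  #include <cstdio>
  #include <ctime>

  int main(void)
  {
      const long n = 100000000;     // one hundred million iterations
      volatile double sum = 0.0;    // volatile keeps the loop from being optimized away
      double x = 0.5;
      double a = 3.7;
      double b = 1.0 / a;           // precomputed inverse

      std::clock_t start = std::clock();
      for (long i = 0; i < n; ++i) {
          sum = sum + x * b;        // compare against: sum = sum + x / a;
      }
      std::clock_t stop = std::clock();

      std::printf("%.2f seconds\n", double(stop - start) / CLOCKS_PER_SEC);
      return 0;
  }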
h2. Benchmark comparison for different systems

And here is a comparison for various computing systems (double precision). In this comparison, the Mac is about the fastest, galileo (which is a 32 Bit system) is pretty fast for multiplications, kepler is the slowest (AMD related? Multi-core related?), and fermi and the CI13 virtual box are about the same (there is no notable difference between gcc and clang on the virtual box).

| *Computation* | *Mac OS X* | *galileo* | *kepler* | *dirac* | *fermi* | *CI13 (gcc 4.8.0)* | *CI13 (clang 3.1)* |
| pow(x,2) | +1.90+ | *5.73* | 4.83 | 3.5 | 2.65 | 1.94 | 1.99 |
| x*x | 0.49 | +0.31+ | *1.04* | *1.06* | 0.5 | 0.58 | 0.57 |
| pow(x,2.01) | +7.96+ | 10.96 | *17.53* | *17.73* | 11.11 | 8.71 | 8.44 |
| x/a | +0.98+ | 1.24 | *1.87* | *1.92* | 1.03 | 1.15 | 1.16 |
| x*b (where b=1/a) | 0.46 | +0.27+ | *0.99* | *0.99* | 0.51 | 0.54 | 0.54 |
| x+1.5 | 0.40 | +0.27+ | *0.96* | *1.02* | 0.43 | 0.47 | 0.47 |
| x-1.5 | 0.49 | +0.27+ | *1.08* | *1.1* | 0.57 | 0.47 | 0.47 |
| sin(x) | +4.66+ | 4.76 | *10.46* | *10.44* | 6.72 | 5.62 | 5.52 |
| cos(x) | +4.64+ | 4.68 | *10.16* | *10.28* | 6.35 | 5.65 | 5.62 |
| tan(x) | +5.40+ | 6.27 | *15.23* | *15.4* | 8.61 | 8.11 | 7.98 |
| acos(x) | +2.18+ | *9.57* | 7.49 | 7.75 | 4.48 | 3.86 | 2.93 |
| sqrt(x) | 1.29 | *3.29* | 2.33 | 2.4 | +0.97+ | 2.02 | 1.84 |
| log10(x) | +2.60+ | 5.33 | *12.91* | *12.58* | 7.71 | 6.54 | 6.47 |
| log(x) | +2.72+ | 5.15 | *10.64* | *10.66* | 6.32 | 5.26 | 5.09 |
| exp(x) | 7.17 | *10* | 4.78 | 4.8 | +1.85+ | 2.03 | 2.02 |

+Underlined+ numbers show the fastest, *bold* numbers the slowest computations.

And the same for single precision:

| *Computation* | *Mac OS X* | *galileo* | *kepler* | *dirac* | *fermi* | *CI13 (gcc 4.8.0)* | *CI13 (clang 3.1)* |
| pow(x,2) | +1.19+ | 1.77 | *3.27* | 3 | 1.35 | 1.54 | 0.9 |
| x*x | 0.48 | +0.3+ | *0.99* | *1* | 0.47 | 0.54 | 0.54 |
| pow(x,2.01) | +4.19+ | 10.64 | *29.81* | *30.21* | 14.42 | 13 | 12.29 |
| x/a | 1.22 | 1.24 | *2.77* | *2.79* | +1.2+ | 1.37 | 1.4 |
| x*b (where b=1/a) | 0.66 | +0.27+ | *1.72* | *1.74* | 0.67 | 0.76 | 0.79 |
| x+1.5 | 0.40 | +0.27+ | *1.03* | *1.04* | 0.4 | 0.46 | 0.47 |
| x-1.5 | 0.49 | +0.27+ | *1.13* | *1.14* | 0.54 | 0.47 | 0.47 |
| sin(x) | +2.39+ | 4.92 | *116.41* | *119.06* | 54 | 41.2 | 40.22 |
| cos(x) | +2.46+ | 4.85 | *116.47* | *119.27* | 53.93 | 40.91 | 40.3 |
| tan(x) | +2.84+ | 6.47 | *120.69* | *122* | 55.14 | 42.36 | 41.83 |
| acos(x) | +0.94+ | *9.02* | 8.6 | 8.71 | 3.86 | 2.81 | 2.38 |
| sqrt(x) | +1.37+ | 2.27 | *3.77* | *3.75* | 1.5 | 1.84 | 1.55 |
| log10(x) | +2.48+ | 4.15 | *12.74* | *12.59* | 6.28 | 5.74 | 4.97 |
| log(x) | +2.33+ | 3.83 | *10.07* | *10.42* | 5.16 | 4.88 | 4.11 |
| exp(x) | +7.20+ | 9.96 | *17.51* | *18.32* | 10.77 | 10.21 | 10.18 |

Note the *enormous speed penalty of the trigonometric functions in single precision on most of the systems*. Single precision arithmetic is only consistently faster than double precision on Mac OS X.

Here are the specifications of the machines used for benchmarking:

* Mac OS X: 2.66 GHz Intel Core i7, Mac OS X 10.6.8, gcc 4.2.1
* galileo: 32 Bit, Intel Xeon, 2.8 GHz, gcc 3.2.2
* kepler: 64 Bit, AMD Opteron 6174, 12C, 2.20 GHz, gcc 4.1.2
* dirac: 64 Bit, AMD Opteron 6174, 12C, 2.20 GHz, gcc 4.1.2
* fermi: 64 Bit, Intel(R) Xeon(R) CPU E5450 @ 3.00GHz, gcc 4.1.2
* CI13: 64 Bit (virtual box)

h2. Behind the scenes

Here is some background information to understand what happens.

h3. Kepler

I did some experiments to see how the compiled code differs for different variants of the @sin@ function call. In particular, I tested:

* @std::sin(double)@
* @std::sin(float)@
* @sin(double)@
* @sin(float)@

It turned out that the call to @std::sin(float)@ calls the function @sinf@, while all other variants call @sin@. The execution time difference is therefore related to the different implementations of @sin@ and @sinf@ on Kepler. Note that @sin@ and @sinf@ are implemented in @/lib64/libm.so.6@ on Kepler; this library is part of the GNU C library glibc (see http://www.gnu.org/software/libc/).
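As a compact summary of these findings (the assembly for the individual variants is shown in the following subsections), the following sketch annotates which C library function each variant ends up calling on Kepler:

  #include <cmath>

  int main(void)
  {
      double xd = 1.0;
      float  xf = 1.0f;

      double a = std::sin(xd);   // calls sin
      float  b = std::sin(xf);   // calls the std::sin(float) overload (_ZSt3sinf), which calls sinf
      double c = sin(xd);        // calls sin
      float  d = sin(xf);        // xf is implicitly converted to double, then sin is called
      return 0;
  }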
h4. Using std::sin(double)

When @std::sin(double)@ is used, the C library function @sin@ will be called. Note that the same behavior is obtained when calling @sin(double)@ (without the @std@ prefix).

$ nano stdsin.cpp
  #include <cmath>
  int main(void)
  {
      double arg    = 1.0;
      double result = std::sin(arg);
      return 0;
  }   
$ g++ -S stdsin.cpp
$ more stdsin.s
main:
.LFB97:
	pushq	%rbp
.LCFI0:
	movq	%rsp, %rbp
.LCFI1:
	subq	$32, %rsp
.LCFI2:
	movabsq	$4607182418800017408, %rax
	movq	%rax, -16(%rbp)
	movq	-16(%rbp), %rax
	movq	%rax, -24(%rbp)
	movsd	-24(%rbp), %xmm0
	call	sin
	movsd	%xmm0, -24(%rbp)
	movq	-24(%rbp), %rax
	movq	%rax, -8(%rbp)
	movl	$0, %eax
	leave
	ret
h4. Using std::sin(float)

When @std::sin(float)@ is used, the call goes through the @std::sin(float)@ overload (mangled as @_ZSt3sinf@ in the assembly below), which in turn calls the C library function @sinf@.
$ nano floatstdsin.cpp
  #include <cmath>
  int main(void)
  {
      float arg    = 1.0;
      float result = std::sin(arg);
      return 0;
  }   
$ g++ -S floatstdsin.cpp
$ more floatstdsin.s
main:
.LFB97:
	pushq	%rbp
.LCFI3:
	movq	%rsp, %rbp
.LCFI4:
	subq	$32, %rsp
.LCFI5:
	movl	$0x3f800000, %eax
	movl	%eax, -8(%rbp)
	movl	-8(%rbp), %eax
	movl	%eax, -20(%rbp)
	movss	-20(%rbp), %xmm0
	call	_ZSt3sinf
	movss	%xmm0, -20(%rbp)
	movl	-20(%rbp), %eax
	movl	%eax, -4(%rbp)
	movl	$0, %eax
	leave
	ret
_ZSt3sinf:
.LFB57:
	pushq	%rbp
.LCFI0:
	movq	%rsp, %rbp
.LCFI1:
	subq	$16, %rsp
.LCFI2:
	movss	%xmm0, -4(%rbp)
	movl	-4(%rbp), %eax
	movl	%eax, -12(%rbp)
	movss	-12(%rbp), %xmm0
	call	sinf
	movss	%xmm0, -12(%rbp)
	movl	-12(%rbp), %eax
	movl	%eax, -12(%rbp)
	movss	-12(%rbp), %xmm0
	leave
	ret
h4. Using sin(float)

When @sin(float)@ is used, the compiler performs an implicit conversion of the argument to @double@ and then calls the C library function @sin@; the result is converted back to @float@.
$ nano floatsin.cpp
  #include <cmath>
  int main(void)
  {
      float arg    = 1.0;
      float result = sin(arg);
      return 0;
  }   
$ g++ -S floatsin.cpp
$ more floatsin.s
main:
.LFB97:
	pushq	%rbp
.LCFI0:
	movq	%rsp, %rbp
.LCFI1:
	subq	$16, %rsp
.LCFI2:
	movl	$0x3f800000, %eax
	movl	%eax, -8(%rbp)
	cvtss2sd	-8(%rbp), %xmm0
	call	sin
	cvtsd2ss	%xmm0, %xmm0
	movss	%xmm0, -4(%rbp)
	movl	$0, %eax
	leave
	ret
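Since @sinf@ is so much slower than @sin@ on kepler (and dirac), one possible workaround is to force the double precision code path even when working with @float@ data. The following is only a sketch (the helper name @sin_via_double@ is made up for illustration):

  #include <cmath>

  // Sketch: compute the sine in double precision and convert the result
  // back to float, so that sin is called instead of the slow sinf.
  inline float sin_via_double(float x)
  {
      return static_cast<float>(std::sin(static_cast<double>(x)));
  }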
h3. Mac OS X

And now the same experiment on Mac OS X. It turns out that the code generated by the compiler has the same structure, and the functions that are called are again @_sin@ and @_sinf@ (all function names have a @_@ prepended on Mac OS X). This means that the @_sinf@ implementation on Mac OS X is considerably faster than the @sinf@ implementation on kepler. Note that @_sin@ and @_sinf@ are implemented in @/usr/lib/libSystem.B.dylib@ on my Mac OS X.

h4. Using std::sin(double)

When @std::sin(double)@ is used, the C library function @_sin@ will be called. Note that the same behavior is obtained when calling @sin(double)@ (without the @std@ prefix).
$ nano stdsin.cpp
  #include <cmath>
  int main(void)
  {
      double arg    = 1.0;
      double result = std::sin(arg);
      return 0;
  }   
$ g++ -S stdsin.cpp
$ more stdsin.s
_main:
LFB127:
        pushq   %rbp
LCFI0:
        movq    %rsp, %rbp
LCFI1:
        subq    $16, %rsp
LCFI2:
        movabsq $4607182418800017408, %rax
        movq    %rax, -8(%rbp)
        movsd   -8(%rbp), %xmm0
        call    _sin
        movsd   %xmm0, -16(%rbp)
        movl    $0, %eax
        leave
        ret
h4. Using std::sin(float)

When @std::sin(float)@ is used, the call goes through the @std::sin(float)@ overload (mangled as @__ZSt3sinf@), which in turn calls the C library function @_sinf@.
$ nano floatstdsin.cpp
  #include <cmath>
  int main(void)
  {
      float arg    = 1.0;
      float result = std::sin(arg);
      return 0;
  }   
$ g++ -S floatstdsin.cpp
$ more floatstdsin.s
_main:
LFB127:
        pushq   %rbp
LCFI3:
        movq    %rsp, %rbp
LCFI4:
        subq    $16, %rsp
LCFI5:
        movl    $0x3f800000, %eax
        movl    %eax, -4(%rbp)
        movss   -4(%rbp), %xmm0
        call    __ZSt3sinf
        movss   %xmm0, -8(%rbp)
        movl    $0, %eax
        leave
        ret
__ZSt3sinf:
LFB87:
        pushq   %rbp
LCFI0:
        movq    %rsp, %rbp
LCFI1:
        subq    $16, %rsp
LCFI2:
        movss   %xmm0, -4(%rbp)
        movss   -4(%rbp), %xmm0
        call    _sinf
        leave
        ret
h4. Using sin(float)

When @sin(float)@ is used, the compiler performs an implicit conversion of the argument to @double@ and then calls the C library function @_sin@; the result is converted back to @float@.
$ nano floatsin.cpp
  #include <cmath>
  int main(void)
  {
      float arg    = 1.0;
      float result = sin(arg);
      return 0;
  }   
$ g++ -S floatsin.cpp
$ more floatsin.s
_main:
LFB127:
        pushq   %rbp
LCFI0:
        movq    %rsp, %rbp
LCFI1:
        subq    $16, %rsp
LCFI2:
        movl    $0x3f800000, %eax
        movl    %eax, -4(%rbp)
        cvtss2sd        -4(%rbp), %xmm0
        call    _sin
        cvtsd2ss        %xmm0, %xmm0
        movss   %xmm0, -8(%rbp)
        movl    $0, %eax
        leave
        ret