gcc - Benchmarking a pure C++ function -

How do I prevent GCC/Clang from inlining and optimizing out multiple invocations of a pure function?

I am trying to benchmark code of this form:

    int __attribute__ ((noinline)) my_loop(int const* array, int len) {
        // Use array to compute result.
    }

My benchmark code looks something like this:

    int main() {
        const int number = 2048;
        // My own aligned_malloc implementation.
        int* input = (int*)aligned_malloc(sizeof(int) * number, 32);
        // Fill the array with random numbers.
        make_random(input, number);
        const int num_runs = 10000000;
        for (int i = 0; i < num_runs; i++) {
            const int result = my_loop(input, number); // Call pure function.
        }
        // Since the program exits, I don't free input.
    }

As expected, Clang seems to be able to turn this into a no-op at O2 (perhaps even at O1).

A few things I tried in order to actually benchmark my implementation:

  • Accumulate the intermediate results in an integer and print the result at the end:

    const int num_runs = 10000000;
    uint64_t total = 0;
    for (int i = 0; i < num_runs; i++) {
        total += my_loop(input, number); // Call pure function.
    }
    printf("total %llu\n", (unsigned long long)total);

    Sadly, this doesn't seem to work. Clang, at least, is smart enough to realize that it's a pure function and transforms the benchmark into something like this:

    int result = my_loop();
    uint64_t total = num_runs * result;
    printf("total %llu\n", (unsigned long long)total);

  • Set an atomic variable using release semantics at the end of every loop iteration:

    const int num_runs = 10000000;
    std::atomic<uint64_t> result_atomic(0);
    for (int i = 0; i < num_runs; i++) {
        int result = my_loop(input, number); // Call pure function.
        // Tried std::memory_order_release too.
        result_atomic.store(result, std::memory_order_seq_cst);
    }
    printf("result %llu\n", (unsigned long long)result_atomic.load());

    My hope was that, since atomics introduce a happens-before relationship, Clang would be forced to execute my code. Sadly, it still did the optimization above and sets the value of the atomic to num_runs * result in one shot instead of running num_runs iterations of the function.

  • Set a volatile int at the end of every loop iteration, along with summing up the total:

    const int num_runs = 10000000;
    uint64_t total = 0;
    volatile int trigger = 0;
    for (int i = 0; i < num_runs; i++) {
        total += my_loop(input, number); // Call pure function.
        trigger = 1;
    }
    // If I take the printf out, Clang optimizes the code away again.
    printf("total %llu\n", (unsigned long long)total);

    This seems to do the trick, and my benchmarks seem to work. But it is not ideal for a number of reasons:

      • Per my understanding of the C++11 memory model, volatile set operations do not establish a happens-before relationship, so I can't be sure that some compiler will not decide to do the same num_runs * result_of_1_run optimization.

      • This method also seems undesirable because I now have the overhead (however tiny) of setting a volatile int on every run of my loop.

Is there a canonical way of preventing Clang/GCC from optimizing this result away? Maybe with a pragma or something? Bonus points if the ideal method works across compilers.

You can insert an instruction directly into the assembly. I sometimes use a macro for splitting up the assembly, e.g. separating loads from calculations and branching.

#define gcc_split_block(str)  __asm__( "//\n\t// " str "\n\t//\n" ); 

Then, in the source, insert

gcc_split_block("keep please")

before and after your functions.
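A related variant of the same idea, sketched here under some assumptions: instead of an asm comment, use an empty GCC/Clang extended asm statement that takes the pointer or the result as an input and declares a "memory" clobber. The optimizer should then have to assume the array may have changed between iterations and that every result is used, so it should not be able to collapse the loop into a single call. The helper names escape, keep and run_benchmark are hypothetical, and the body of my_loop is only a stand-in for the real function. This relies on GCC/Clang inline asm, so it is not portable to every compiler.

```cpp
#include <cstdint>

// Hypothetical stand-in for the function under test: it just sums the array.
__attribute__((noinline)) int my_loop(const int* array, int len) {
    int sum = 0;
    for (int i = 0; i < len; ++i) sum += array[i];
    return sum;
}

// Empty asm that receives the pointer and clobbers memory. It emits no
// instructions, but the compiler must assume the pointed-to data may change.
static inline void escape(const void* p) {
    __asm__ __volatile__("" : : "g"(p) : "memory");
}

// Empty asm that takes the value as an input operand, so the value
// has to be computed before this point and cannot be discarded.
static inline void keep(int value) {
    __asm__ __volatile__("" : : "r"(value) : "memory");
}

// Hypothetical benchmark loop: calls my_loop num_runs times and sums the results.
std::uint64_t run_benchmark(const int* input, int len, int num_runs) {
    std::uint64_t total = 0;
    for (int i = 0; i < num_runs; ++i) {
        escape(input);                    // array contents may have changed
        int result = my_loop(input, len); // call the function under test
        keep(result);                     // result counts as used every iteration
        total += static_cast<std::uint64_t>(result);
    }
    return total;
}
```

Calling run_benchmark(input, number, num_runs) from main and timing it would replace the loop in the question. It is still worth checking the generated assembly (for example with -S) to confirm that the call really stays inside the loop for your compiler and flags.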

