gcc - Benchmarking a pure C++ function -

- September 15, 2014

How do I prevent gcc/clang from inlining and optimizing out multiple invocations of a pure function?

I am trying to benchmark code of this form:

    int __attribute__ ((noinline)) my_loop(int const* array, int len) {
      // Use array to compute result.
    }

My benchmark code looks like this:

    int main() {
      const int number = 2048;
      // My own aligned_malloc implementation.
      int* input = (int*)aligned_malloc(sizeof(int) * number, 32);
      // Fill the array with random numbers.
      make_random(input, number);
      const int num_runs = 10000000;
      for (int i = 0; i < num_runs; i++) {
        const int result = my_loop(input, number); // Call pure function.
      }
      // Since the program exits, I don't free input.
    }

As expected, clang seems to be able to turn this into a no-op at O2 (perhaps even at O1).

A few things I tried in order to benchmark my implementation are:

Accumulate the intermediate results in an integer and print the result at the end:

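    const int num_runs = 10000000;
    uint64_t total = 0;
    for (int i = 0; i < num_runs; i++) {
      total += my_loop(input, number); // Call pure function.
    }
    // Cast so the argument matches %llu on platforms where uint64_t is unsigned long.
    printf("total %llu\n", (unsigned long long)total);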

Sadly this doesn't seem to work. Clang, at least, is smart enough to realize that this is a pure function and transforms the benchmark into something like this:

    int result = my_loop();
    uint64_t total = num_runs * result;
    printf("total %llu\n", (unsigned long long)total);

Set an atomic variable using release semantics at the end of every loop iteration:

    const int num_runs = 10000000;
    std::atomic<uint64_t> result_atomic(0);
    for (int i = 0; i < num_runs; i++) {
      int result = my_loop(input, number); // Call pure function.
      // Tried std::memory_order_release too.
      result_atomic.store(result, std::memory_order_seq_cst);
    }
    printf("result %llu\n", (unsigned long long)result_atomic.load());

My hope was that, since atomics introduce a happens-before relationship, clang would be forced to execute my code. Sadly it still did the optimization above and sets the value of the atomic to num_runs * result in one shot instead of running num_runs iterations of the function.

Set a volatile int at the end of every loop iteration, along with summing the total:

    const int num_runs = 10000000;
    uint64_t total = 0;
    volatile int trigger = 0;
    for (int i = 0; i < num_runs; i++) {
      total += my_loop(input, number); // Call pure function.
      trigger = 1;
    }
    // If I take this printf out, clang optimizes the code away again.
    printf("total %llu\n", (unsigned long long)total);

This seems to do the trick, and my benchmarks seem to work. It is not ideal for a number of reasons.

Per my understanding of the C++11 memory model, volatile set operations do not establish a happens-before relationship, so I can't be sure that some compiler will not decide to do the same num_runs * result_of_1_run optimization.
Also, this method seems undesirable since I now have the overhead (however tiny) of setting a volatile int on every run of my loop.

Is there a canonical way of preventing clang/gcc from optimizing this result away? Maybe with a pragma or something? Bonus points if this ideal method works across compilers.

You can insert an instruction directly into the assembly. One approach uses a macro for splitting up the assembly, e.g. separating loads from calculations and branching.

#define gcc_split_block(str)  __asm__( "//\n\t// " str "\n\t//\n" );

Then, in your source, insert

    gcc_split_block("keep please")

before and after your functions.
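As a rough sketch of a related inline-assembly idea (a variant, not the macro above): an empty extended asm statement that takes the result as an input and declares a "memory" clobber emits no instructions, yet gcc and clang should have to materialize the result on every iteration and should not be able to assume the array is unchanged between calls. The helper name keep_alive, the summing body of my_loop, and run_benchmark are assumptions made for illustration; this is gcc/clang-specific and will not compile with MSVC.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the original post): an empty asm statement
// that takes `value` as a register input and clobbers memory. It emits no
// instructions, but the compiler must have `value` available here and may
// not assume memory is unchanged across the statement.
static inline void keep_alive(int value) {
  __asm__ __volatile__("" : : "r"(value) : "memory");
}

// Stand-in for the pure function being benchmarked: sums the array.
int __attribute__((noinline)) my_loop(int const* array, int len) {
  int sum = 0;
  for (int i = 0; i < len; i++) sum += array[i];
  return sum;
}

// Calls my_loop num_runs times, with a barrier after each call.
uint64_t run_benchmark(int const* input, int len, int num_runs) {
  assert(len >= 0 && num_runs >= 0);
  uint64_t total = 0;
  for (int i = 0; i < num_runs; i++) {
    int result = my_loop(input, len);
    keep_alive(result);  // Barrier: result is "used", memory is "touched".
    total += result;
  }
  return total;
}
```

It is still worth checking the generated assembly (e.g. with -S) to confirm that the call to my_loop stays inside the loop for your compiler version and flags.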
