gcc - Benchmarking a pure C++ function
How do I prevent gcc/clang from inlining and optimizing out multiple invocations of a pure function?
I am trying to benchmark code of the form
    int __attribute__ ((noinline)) my_loop(int const* array, int len) {
        // Use array to compute result.
    }
My benchmark code looks something like this:
    int main() {
        const int number = 2048;
        // My own aligned_malloc implementation.
        int* input = (int*)aligned_malloc(sizeof(int) * number, 32);
        // Fill the array with some random numbers.
        make_random(input, number);
        const int num_runs = 10000000;
        for (int i = 0; i < num_runs; i++) {
            const int result = my_loop(input, number);  // Call pure function.
        }
        // Since the program exits, I don't free input.
    }
As expected, clang seems to be able to turn this into a no-op at O2 (perhaps even at O1).
A few things I tried in order to actually benchmark my implementation are:
Accumulate the intermediate results in an integer and print the results at the end:
    const int num_runs = 10000000;
    uint64_t total = 0;
    for (int i = 0; i < num_runs; i++) {
        total += my_loop(input, number);  // Call pure function.
    }
    printf("total %llu\n", (unsigned long long)total);
Sadly, this doesn't seem to work. Clang at least is smart enough to realize that this is a pure function, and it transforms the benchmark into something like this:
    int result = my_loop(input, number);
    uint64_t total = num_runs * result;
    printf("total %llu\n", (unsigned long long)total);
Set an atomic variable using release semantics at the end of every loop iteration:
    const int num_runs = 10000000;
    std::atomic<uint64_t> result_atomic(0);
    for (int i = 0; i < num_runs; i++) {
        int result = my_loop(input, number);  // Call pure function.
        // Tried std::memory_order_release too.
        result_atomic.store(result, std::memory_order_seq_cst);
    }
    printf("result %llu\n", (unsigned long long)result_atomic.load());
My hope was that, since atomics introduce a happens-before relationship, clang would be forced to execute my code. Sadly, it still did the optimization above and sets the value of the atomic to num_runs * result in one shot instead of running num_runs iterations of the function.
Set a volatile int at the end of every loop iteration, along with summing the total:
    const int num_runs = 10000000;
    uint64_t total = 0;
    volatile int trigger = 0;
    for (int i = 0; i < num_runs; i++) {
        total += my_loop(input, number);  // Call pure function.
        trigger = 1;
    }
    // If I take this printf out, clang optimizes the code away again.
    printf("total %llu\n", (unsigned long long)total);
This seems to do the trick, and my benchmarks seem to work. However, it is not ideal for a number of reasons.
Per my understanding of the C++11 memory model, volatile set operations do not establish a happens-before relationship, so I can't be sure that some compiler will not decide to do the same num_runs * result_of_1_run optimization.
Also, this method seems undesirable since I now have the overhead (however tiny) of setting a volatile int on every run of my loop.
Is there a canonical way of preventing clang/gcc from optimizing this result away? Maybe with a pragma or something? Bonus points if this ideal method works across compilers.
You can insert an instruction directly into the assembly. I sometimes use a macro for splitting up the assembly, e.g. separating loads from calculations and branching.
    #define gcc_split_block(str) __asm__( "//\n\t// " str "\n\t//\n" );
Then, in your source, insert

    gcc_split_block("keep please")

before and after your functions.
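A related variant of the inline-assembly idea, similar in spirit to the "do not optimize" helpers found in benchmarking libraries, is an empty extended asm statement that claims to modify a value and clobber memory. What follows is only a minimal sketch under stated assumptions, not the macro from the answer above: it relies on the GCC/Clang extended asm extension, the helper names `opaque` and `run_benchmark` are made up for illustration, and the body of `my_loop` is a placeholder sum standing in for the real function.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder for the pure function under test (assumption: a simple sum).
__attribute__((noinline)) int my_loop(int const* array, int len) {
    int sum = 0;
    for (int i = 0; i < len; ++i) sum += array[i];
    return sum;
}

// Hypothetical helper: an empty asm statement that tells the compiler
// `value` is read and possibly rewritten, and that memory may have changed.
// It emits no instructions, but the optimizer must treat `value` as unknown.
template <typename T>
inline void opaque(T& value) {
    __asm__ volatile("" : "+r"(value) : : "memory");
}

// Hypothetical benchmark loop: every iteration hides the input pointer and
// the result from the optimizer, so the call cannot be hoisted or folded.
std::uint64_t run_benchmark(int const* input, int len, int num_runs) {
    assert(len >= 0 && num_runs >= 0);
    std::uint64_t total = 0;
    for (int i = 0; i < num_runs; ++i) {
        opaque(input);                     // input looks freshly produced
        int result = my_loop(input, len);  // must be recomputed each time
        opaque(result);                    // result counts as used
        total += static_cast<std::uint64_t>(result);
    }
    return total;
}
```

Calling `run_benchmark(input, number, num_runs)` from `main` and timing it should then measure `num_runs` real calls. Because the asm body is empty, the per-iteration overhead is limited to keeping the pointer and the result in registers. Compilers without GCC-style extended asm (MSVC, for example) would need a different barrier.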