Optimizations: Compute Bound Code

Many Ways of Optimization

There are many ways to try to optimize code

  • Unnecessary ones

  • Using better algorithms (e.g. sorting and binary search; see the sketch after this list)

  • Function call elimination (inlining vs. spaghetti)

  • Loop unrolling

  • Strength reduction (e.g. using shift instead of mult/div)

  • Tail call elimination
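
A minimal sketch of the “better algorithms” point (an assumed example, not part of the original slides): replacing a linear scan over a sorted array with the C library’s bsearch().

#include <stdlib.h>

/* Comparison callback for bsearch(): returns <0, 0, or >0. */
static int cmp_int(const void *a, const void *b)
{
    int l = *(const int *)a;
    int r = *(const int *)b;
    return (l > r) - (l < r);
}

/* O(log n) membership test instead of an O(n) scan;
   sorted[] must be sorted in ascending order. */
int contains(const int *sorted, size_t n, int key)
{
    return bsearch(&key, sorted, n, sizeof *sorted, cmp_int) != NULL;
}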

Unnecessary Optimizations

if (x != 0)
    x = 0;
  • Rumour has it that this is no faster than simply writing x = 0 unconditionally

  • Produces more instructions, at least

Inlining (1)

Facts up front:

  • Function calls are generally fast

  • A little slower when the definition is in a shared library

  • Used judiciously, the instruction cache makes repeated calls even faster (the callee stays cached)

  • But, as always: it depends

Possible inlining candidate

int add(int l, int r)
{
    return l + r;
}

Inlining (2)

A couple of rules

  • Always write clear code

  • Never avoid defining a function for performance reasons

    • Readability first

    • Can always inline later, during optimization

  • Don’t inline large functions ⟶ instruction cache pollution when called from different locations

  • Use static for implementation-specific functions ⟶ the compiler has much more freedom
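
A minimal sketch of the last two rules (an assumed example, not from the original slides): a small, clearly named static helper keeps the caller readable, and the compiler is free to inline it and drop the out-of-line copy entirely.

/* static: visible only in this translation unit, so the compiler
   knows every caller and may inline the function away. */
static int scale(int x)
{
    return 3 * x + 1;
}

int process(int v)
{
    /* Readable call sites; the call overhead typically disappears
       once optimization is enabled (e.g. -O2). */
    return scale(v) + scale(v + 1);
}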

Inlining (3)

GCC …

  • Does not optimize by default

  • Ignores explicit inline when not optimizing

  • -finline-small-functions (enabled at -O2): inline when function call overhead is larger than body (even when not declared inline)

  • -finline-functions (enabled at -O3): all functions considered for inlining ⟶ heuristics

  • -finline-functions-called-once (enabled at -O1, -O2, -O3, -Os): all static functions that …

  • More ⟶ info gcc

Register Allocation (1)

  • Register access is orders of magnitude faster than main memory access

    • ⟶ Best to keep variables in registers rather than memory

  • CPUs have varying numbers of registers

    • register keyword should not be overused

    • Ignored anyway by most compilers

  • Register allocation

    • Compiler performs flow analysis

    • Live vs. dead variables

    • “Spills” values to memory when registers run out (see the sketch below)

Compiler generally makes better choices than the programmer!
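
A small sketch of live vs. dead variables (an assumed example, not from the original slides): once a value has had its last use, its register becomes free for something else.

int f(int a, int b)
{
    int t = a * a;  /* t becomes live here */
    int u = t + b;  /* last use of t: t is dead afterwards, so u may
                       end up in the very same register */
    return u * u;   /* only when registers run out does the allocator
                       "spill" a value to memory */
}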

Register Allocation (2)

GCC …

  • -fira-* (for Integrated Register Allocator)

  • RTFM please

  • A lot of tuning opportunities for those who care

Peephole Optimization

  • Peephole: manageable set of instructions; “window”

  • Common term for a group of optimizations that operate on a small scale

    • Common subexpression elimination

    • Strength reduction

    • Constant folding

  • Small scale ⟶ within a “basic block” (straight-line code: one entry, one exit)

Peephole Optimization: Common Subexpression Elimination

Sometimes one writes redundant code so as not to compromise readability by introducing yet another variable …

a = b + c + d;
x = b + c + y;

This can be transformed into

tmp = b + c; /* common subexpression */
a = tmp + d;
x = tmp + y;

Peephole Optimization: Strength Reduction

Most programmers prefer to say what they mean (fortunately) …

x = y * 2;

The same effect is achieved more cheaply by …

x = y << 1;

If one knows the “strength” of the operators involved (compilers tend to know), then even this transformation can be worthwhile …

x = y * 3; /* y*3 == y*(4-1) == y*4 - y */
x = (y << 2) - y;

Peephole Optimization: Constant Folding

Another one that might look silly but keeps the code readable …

x = 42;
y = x + 1;

… is likely to be transformed into …

x = 42;
y = 43;

Consider that folding and propagation are applied repeatedly and transitively ⟶ pretty results
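
A tiny sketch of such chaining (an assumed example): each folded constant is propagated into the next expression, so the whole chain collapses at compile time.

int chain(void)
{
    int x = 2;
    int y = x * 3 + 4; /* propagate x == 2, fold: y == 10 */
    int z = y * y;     /* propagate y == 10, fold: z == 100 */
    return z;          /* the whole function folds to: return 100; */
}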

Loop Invariants

The following bogus code …

while (1) {
    x = 42; /* loop invariant */
    y += 2;
}

… will likely end up as …

x = 42;
while (1)
    y += 2;

At least with a minimal amount of optimization enabled (GCC: -fmove-loop-invariants, enabled with -O1 already)

Loop Unrolling

If a loop body is run a known number of times, the loop counter can be omitted.

for (i=0; i<4; i++)
    dst[i] = src[i];

This can be written as

dst[0] = src[0];
dst[1] = src[1];
dst[2] = src[2];
dst[3] = src[3];
  • Complicated heuristics: does the performance gain outweigh instruction cache thrashing?

  • ⟶ I’d keep my hands off it!

Tail Call Optimization

int f(int i)
{
   do_something(i);
   return g(i+1);
}
  • g() is called at the end

  • f()’s stack frame is not used afterwards

  • Optimization: g() can use f()’s stack frame
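
The same idea applies to tail recursion. A minimal sketch (an assumed example, not from the original slides): the recursive call is the last action, so GCC’s sibling/tail call optimization (-foptimize-sibling-calls, enabled at -O2) can reuse the current stack frame and effectively turn the recursion into a loop.

/* Accumulator-style factorial; the recursive call is in tail position. */
static unsigned long fact(unsigned long n, unsigned long acc)
{
    if (n <= 1)
        return acc;
    return fact(n - 1, n * acc); /* tail call: no work remains afterwards */
}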

CPU Optimization, Last Words

Once more: Write clean code!

  • All optimization techniques explained here are performed automatically by the compiler

  • Theory behind optimization is well understood ⟶ engineering discipline

  • Compilers generally perform optimizations better than a programmer would

    • … let alone portably, on different CPUs!

  • “Optimization” is a misnomer ⟶ “Improvement”

    • Compiler cannot make arbitrary code “optimal”

    • Bigger picture is always up to the programmer

    • ⟶ Once more: Write clean code!

  • Work together with the compiler ⟶ use static, const
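
A small sketch of that last point (an assumed example, not from the original slides): static and const hand the compiler facts it can exploit.

#include <stddef.h>

/* const: the data behind p is never written through this pointer.
   static: the function is local to this file, so the compiler sees
   all callers and can inline or specialize it freely. */
static size_t count_zeros(const int *p, size_t n)
{
    size_t zeros = 0;
    for (size_t i = 0; i < n; i++)
        if (p[i] == 0)
            zeros++;
    return zeros;
}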

GCC: Optimization “Levels”

  • -O0: optimization off; the default

  • -O1: most basic optimizations; does as much as possible without compromising compilation time too much

  • -O2: recommended; does everything that has no size impact, is unaggressive, and doesn’t completely chew up compilation time

  • -O3: highest level possible; somewhat aggressive, can sometimes break things, and eats up your CPU and memory while compiling

  • -Os: optimize for size; all of -O2 that doesn’t increase size

  • -Og (since GCC 4.8): “developer mode”; turns on options that don’t interfere with debugging or compilation time

Further Reading