Optimizations: Compute Bound Code¶
Many Ways of Optimization¶
There are many ways to try to optimize code …
Unnecessary ones
Using better algorithms (e.g. sorting and binary search)
Function call elimination (inlining vs. spaghetti)
Loop unrolling
Strength reduction (e.g. using shift instead of mult/div)
Tail call elimination
…
Unnecessary Optimizations¶
if (x != 0)
x = 0;
Rumor has it that this is no faster than writing unconditionally
At the very least, it produces more instructions
Inlining (1)¶
Facts up front:
Function calls are generally fast
A little slower when definition is in a shared library
Used judiciously, the instruction cache makes repeated calls even faster
But, as always: it depends
Possible inlining candidate
int add(int l, int r)
{
return l + r;
}
Inlining (2)¶
A couple rules
Always write clear code
Never avoid defining a function for performance reasons
Readability first
Can always inline later, during optimization
Don’t inline large functions ⟶ instruction cache pollution when called from different locations
Use static for implementation-specific functions ⟶ compiler has much more freedom
Inlining (3)¶
GCC …
Does not optimize by default
Ignores an explicit inline when not optimizing
-finline-small-functions (enabled at -O2): inline when the function call overhead is larger than the body (even when not declared inline)
-finline-functions (enabled at -O3): all functions considered for inlining ⟶ heuristics
-finline-functions-called-once (enabled at -O1, -O2, -O3, -Os): all static functions that …
More ⟶ info gcc
Register Allocation (1)¶
Register access is orders of magnitude faster than main memory access
⟶ Best to keep variables in registers rather than memory
CPUs have varying numbers of registers
register keyword should not be overused
Ignored anyway by most compilers
Register allocation
Compiler performs flow analysis
Live vs. dead variables
“Spills” registers when allocation changes
Compiler generally makes better choices than the programmer!
Register Allocation (2)¶
GCC …
-fira-* (for Integrated Register Allocator)
A lot of tuning opportunities for those who care
Peephole Optimization¶
Peephole: manageable set of instructions; “window”
Common term for a group of optimizations that operate on a small scale
Common subexpression elimination
Strength reduction
Constant folding
Small scale ⟶ “basic block”
Peephole Optimization: Common Subexpression Elimination¶
Sometimes one writes redundant code, in order to not compromise readability by introducing yet another variable …
a = b + c + d;
x = b + c + y;
This can be transformed into
tmp = b + c; /* common subexpression */
a = tmp + d;
x = tmp + y;
Peephole Optimization: Strength Reduction¶
Most programmers prefer to say what they mean (fortunately) …
x = y * 2;
The same effect, but cheaper, is brought about by …
x = y << 1;
If one knows the “strength” of the operators involved (compilers tend to know), then even this transformation can be opportune …
x = y * 3; /* y*(4-1) == y*4-y */
x = (y << 2) - y;
Peephole Optimization: Constant Folding¶
Another one that might look stupid, but is readable …
x = 42;
y = x + 1;
… is likely to be transformed into …
x = 42;
y = 43;
Consider transitive and repeated folding and propagation ⟶ pretty results
Loop Invariants¶
The following bogus code …
while (1) {
x = 42; /* loop invariant */
y += 2;
}
… will likely end up as …
x = 42;
while (1)
y += 2;
At least with a minimal amount of optimization enabled (GCC: -fmove-loop-invariants, already enabled at -O1)
Loop Unrolling¶
If a loop body is run a known number of times, the loop counter can be omitted.
for (i=0; i<4; i++)
dst[i] = src[i];
This can be written as
dst[0] = src[0];
dst[1] = src[1];
dst[2] = src[2];
dst[3] = src[3];
Complicated heuristics: does the performance gain outweigh instruction cache thrashing?
⟶ I’d keep my hands off it!
Tail Call Optimization¶
int f(int i)
{
do_something(i);
return g(i+1);
}
g() is called at the end
f()’s stack frame is not used afterwards
Optimization: g() can use f()’s stack frame
CPU Optimization, Last Words¶
Once more: Write clean code!
All optimization techniques explained are performed automatically, by the compiler
Theory behind optimization is well understood ⟶ engineering discipline
Compilers generally perform optimizations better than a programmer would
… let alone portably, on different CPUs!
“Optimization” is a misnomer ⟶ “Improvement”
Compiler cannot make arbitrary code “optimal”
Bigger picture is always up to the programmer
⟶ Once more: Write clean code!
Work together with the compiler ⟶ use static, const
GCC: Optimization “Levels”¶
-O0: optimization off; the default
-O1: most basic optimizations; does as much as possible without compromising compilation time too much
-O2: recommended; does everything that has no size impact, is unaggressive, and doesn’t completely chew compilation time
-O3: highest level possible; somewhat aggressive, can sometimes break things, eats up your CPU and memory while compiling
-Os: optimize for size; all of -O2 that doesn’t increase size
-Og (since GCC 4.8): “developer mode”; turns on options that don’t interfere with debugging or compilation time