Coding Tip: Separate

I’ve been trying to put together a rather long list of coding tips to post here, so as to share what I’ve learned about coding for performance. I found it hard to put them all together and keep the post down to a few pages, so I’ve decided to post the tips over time. So, here goes my attempt at a first tip!

Separate your concepts – one operation per line. So instead of:

a[x] = b[x]*c[x]

do this:

temp1 = b[x]
temp2 = c[x]
temp3 = temp1 * temp2
a[x] = temp3

“Why?”, you might ask. Well, this simple code breaks down into 2 loads, one multiple, and one store. Putting all of it on one line might save you space in your .cpp file, but in an inner loop you will want to expand it as I have done for better performance.

The performance for this code in an inner loop is gained by hinting to the compiler and the CPU what is going on and what depends on what. Putting all of the loads up front allows it to possibly stream the loads better, and the same with the stores at the end. The compiler can look at this code and have an easier time figuring out how to unroll it. At run-time, it is possible that other instructions could be put in-between some of the operations to use the time wasted waiting for operations to complete (out of order execution). So, if your loads take too long and if it knows that the operations are independent of each other, it might be possible for it to do something else while waiting.

Well, that is my first tip. We’ll see how it goes as to whether I will do more programming tips or something else. I hope this helps.

-Edward

5 Responses to Coding Tip: Separate

  1. Mihai says:

    I think this depends a lot on the platform, compiler and the compiler settings.The VS2005 compiles with default settings definitely generates more instructions and is slower in the second case.I would go with Knuth here: “Premature optimization is the root of all evil”Or maybe you can detail compiler switches, pragmas, show some of the generated asm.

  2. Edward Kandrot says:

    Very strange. This is with a release rather than debug build?This is not an optimization, but a coding style that allows the compiler to optimize and the CPU at runtime to do out of order execution. Once you have profiled and found which of your inner loops are eating all of your time, this is where this coding style comes in handy. I have not yet talked about how to find out where to optimize, but might do so at a later point in time.Thank you for trying out the code and commenting.

  3. john_maloney says:

    Please explain your meaning of “concept”.You say that the exhibited code decomposes to:* two (2) loads,* one (1) multiple,* one (1) store.What is a “multiple”? Is this idea specific to a particular Instruction Set Architecture?- John

  4. Edward says:

    Sorry John, I meant “multiply” not “multiple” – I need idea checkers rather than spell checkers. I used the word concept to stand for any single operation, but wanted to make sure that it was understood that it was more than just the math operation in my simple example. There are implicit operations (like loads and stores) that happen as well and that these need to be thought of when separating operations.

  5. Edward says:

    Mihai, I tried the example code under MSVC 2005 with the default compiler options for release. I created two functions, one for each way I described. The compiler unrolled the loops, so it generated more code, but the code generated was not only the same, it didn’t even generate two functions, just one that it called twice. It recognized, in this simple example, thatr[i] = a[i] * b[i];was the same asint tempa = a[i];int tempb = b[i];int tempt = tempa * tempb;r[i] = tempt;For more complex functions, the hints it gets from a programmer breaking down the operations for it allow it the possibility of generating more efficient code, depending on the compiler. I have not seen a good compiler generate worse code by using this programming style, except in debug mode where it adds loads and stores for each line of code.The general idea is the more your code is RISC-like the faster it will probably run, since most modern chips are RISC internally, even if they have CISC instruction sets. CPU design currently focuses on making the simple operations as fast as possible, and penalizing the more complex operations. By coding this way, one can better avoid having the compiler map ones code into the complex instruction set.I hope I have answered you questions.