Plumbing repairs – Argh! I never have all the parts I need to complete my project. Even though I keep a supply of valves and fittings in my garage, I always seem to need one more trip to the store to get the job done. At least in my household, the time to fix a sink has more to do with driving to and from the hardware store than with the actual repair.


Strangely, microprocessor performance is a lot like my plumbing repairs. If a processor doesn't have the data or instructions it needs, it must fetch that information before it can continue. If the needed information is in its caches (equivalent to the collection of parts in my garage), then the delay is fairly insignificant. However, if the processor needs to fetch the information from main memory (a trip to the hardware store), then overall performance will sag. In many applications, this access time (or latency) determines performance more than any other factor.


The latency problem becomes even more interesting as we add more processor cores into the mix. Continuing with the plumbing analogy, imagine if seven other do-it-yourselfers arrived at the hardware store at the same time as I did -- this is the same as eight processor cores attempting to access memory at the same time. Chances are pretty good that some of us will have to wait in line while the others pay for their goods. The average shopping time for each of us increases.


What if everyone in town decided to visit the same hardware store at the same time as me? I could easily spend days waiting in traffic before I even reached the store parking lot. Of course, this never happens, because it’s extremely unlikely that everyone will go to the store at the same time. Furthermore, there are many hardware stores in my town, which tends to balance the load during peak hours. The equivalent multiprocessing solution for limiting congestion is to have multiple memory channels. More memory channels help to balance the load and reduce the average access latency.


Just to underscore the importance of watching latency, let’s look at a few numbers. For most microprocessors, accessing the L1 cache takes between two and four nanoseconds. Contrast this with main memory latencies of sixty to one hundred eighty nanoseconds. Taking the ratios of those figures (60 ns ÷ 4 ns at the low end, 180 ns ÷ 2 ns at the high end), programs with poor cache utilization (and hence, long average latencies) can easily run fifteen to ninety times slower!


Okay, latency is important. But where do we go from here? As a starting point, I recommend obtaining a copy of the “Intel® 64 and IA-32 Architectures Optimization Reference Manual”. It can be freely downloaded from Intel's website and contains a wealth of information about optimizing memory access. Emerson Network Power Embedded Computing also has sophisticated tools for estimating multi-core processing performance. I will gladly address whatever questions you might have. I look forward to hearing from you.