
When I was a kid, I loved superheroes, and I was definitely a fan of the Super Friends (which you can still watch today on Boomerang). One consistent theme - just as with most superhero teams (the X-Men, the Avengers, the Fantastic Four, The Incredibles, etc.) - was that, despite some members having incredible powers, it almost always took the effort of the whole team to carry the day. Even today, in the modern version of the Super Friends - known as "Justice League Unlimited" (JLU) - that mantra continues. In one episode of JLU, a superhero known as "the Atom" came to the rescue by shrinking down to microscopic size to slip into a highly armored bomb and defuse it just before it blew. In an amusing coincidence, the Intel® Atom™ brand fills a similar role on the Intel team of super-powered processors.


As the world's smallest x86-compatible processor, the Intel® Atom™ processor series has an incredibly low TDP (from 0.65W to 2.4W) and an almost more incredibly low price (from $20 to $135). In the case of the embedded (long-life: 7+ years of manufacturing support) versions - the Z510 and Z530 - both have a TDP of 2W, priced at $20 and $70 respectively (1ku direct tray price). There is also the embedded "sidekick" to consider - the Intel® US15W Chipset - which likewise has an incredibly low TDP and price (2.3W and $25 1ku direct tray). Together, the embedded versions of the Intel® Atom™ Processor with its System Controller Hub (SCH - an all-in-one chipset) form a total solution with a lower TDP than the next lowest-TDP embedded Intel® Processor available today (the Intel® Celeron® M Processor ULV 423 @ 5.5W).


So, from that perspective, the embedded Intel® Atom™ brand can squeeze into lower thermal envelopes and lower price points than we could reach in the past - all with a processor that still has multi-GHz performance and, in some cases, the ability to handle more than one thread at the same time (two logical cores in one physical core). The big advantage here isn't just that we can put a chip with the Intel logo on it into deeper places - we could do that with the ARM-based PXA line that we sold off to Marvell - it is that we can do it while keeping 100% compatibility with software written for Intel® Architecture processors. Let me repeat that - it is that important - Intel® Atom™ Processors can run the exact same code that runs on, say, the Quad-Core Intel® Xeon® Processor E5440 (as defined by SPECpower_ssj2008). Imagine that - you can go from a 10W fanless embedded board to a 2.83 GHz, dual-socket, quad-core, rack-mounted telco server and run exactly the same software.
There is only one phrase that describes that: Ultimate Embedded Scalability (UES) - and that is the "superpower" of the embedded Intel® Atom™ brand.


From a development standpoint, it is a lot like golf - the best golfers in the world will tell you that you want to develop a very consistent swing, and then change your clubs to change the distance. The same is true of a lot of embedded development - if you can create rock-solid software (the swing), then you can just change the processor (the club) to get the fit you need - while keeping a remarkably consistent product line. What would be an example of this? How about printers - have the same code running in the low-end, home-office-oriented printer as in your gargantuan, hall-filling, gazillion-pages-a-minute corporate one. Why would this be good? Now there is just one print driver that you (the printer maker) need to develop, regardless of whether the salesperson using it is printing from her home office, printing at the branch office, or printing from corporate headquarters. One driver and one current version for every printer in use - and, since the printer would be using the same Intel® Architecture that the clients are using, even more overlap and code savings. Every code jockey would be an Intel® Architecture developer, all using the same compilers, the same job trackers, the same everything. If this doesn't sound like a big deal - go ask a printer maker (or any other embedded developer) that previously had to use a non-Intel-architecture-based processor when developing some of their lower-end products, and ask them how much effort they waste every day having developers on different systems, using different tools, and then trying to get it all to work together nicely at the end - but bring tissues, because I'd expect you'll see some tears...


But wait, there's more...


Ultimate Embedded Scalability is not the only thing that the embedded Intel® Atom™ brand brings to the table - here are some ways that it could improve our everyday lives:




  • A Smarter Door: Why are we using 14th-century technology to secure our 21st-century homes? Embed an Intel® Atom™ Processor into your front door and it could recognize you as you approach, unlock & open for you (complete with the Star Trek door sound, if you wished), have an embedded webcam in lieu of a peephole (see who is at the door even when you're not home), watch for burglars, electronically sign for packages (the delivery person holds the bar code up to the built-in scanner and you get an email) and even let friends leave vidMail messages.

  • A Smarter Car: People already have navigation systems and DVD players, but what about the next step - such as being able to get an instant answer to the command, "Find the cheapest gasoline in a two-mile radius" - alerts about a traffic jam on your route AND the best way around it - eliminating car keys by using thumb scanners and facial recognition - and "PathTracking" for when your teenager borrows the car (did they really go to the library or just drive by it on the way to the mall so that they can say "I drove to the Library" without lying?)

  • A Smarter Robot: As I mentioned in my blog on robots - iRobot wants to sell you 4 different robots to take care of your floors, and more than that if you have a two-story house! I just want 1 - and I want it to climb stairs (and vacuum them while it's at it), pick up the toys that my son leaves around, and feed the organic pets when the bowls get low - but that will take a lot more brains than iRobot is using today :)


In the end, the embedded Intel® Atom™ brand will be a big success by fitting into small places - and follow-on products down the road will extend that reach even further - it feels like a great time to be involved with the Embedded Intel movement. Feel free to add your own great ideas in the comment space below and let's start inventing the future today!

Message Edited by serenajoy on 03-11-2009 08:05 PM

Embedded applications such as imaging, network intrusion detection, and call processing are good examples of applications that benefit from multi-core. Performance increases because multiple threads running on multiple cores can process many requests simultaneously. This article provides a high-level summary of optimizing a serially designed application for multi-core, using the open-source AMIDE (A Medical Imaging Data Examiner) application as the example. Using rendering techniques, AMIDE generates a display image from complex medical imaging data sets.


When threading an application for multi-core, the largest performance gains come from parallelizing the portions of code where the most CPU-intensive processing is performed. This requires thorough analysis of the application and also suggests that the serial code should be optimized prior to introducing parallelism. Then baseline performance is measured and a goal set for the target multi-core platform.


Threading is introduced to exploit the parallelism in an iterative performance-tuning methodology consisting of the steps illustrated below. Following the 80/20 rule, as the most compute-intensive blocks are parallelized, further changes eventually reach a point of diminishing returns. So, don't lose sight of the goal.





Figure 1 - Performance Tuning Methodology



The code can be parallelized by either of two approaches - decomposing by data or by control (functional decomposition). When there are a number of independent tasks that can run in parallel, the application is suited to functional decomposition. When there is a large set of independent data that is processed through the same operation, data decomposition may be the better fit.



Once the parallelization approach is decided, threads can be introduced into the code by multiple methods. These include the implicit (compiler-based) threading model of OpenMP, and explicit threading, such as POSIX threads (pthreads). Explicit threading is coded manually by the programmer. You have complete control but also bear responsibility for all of the thread management, such as starting and stopping threads, synchronization code, etc. On the other hand, with implicit threading the compiler creates all that underlying parallel code for you. Implicit threading can be appealing to a developer because it minimizes the coding effort, reduces the chance for bugs, provides OS portability, and also allows one code base to be used for both the serial and parallel versions of an application. With either method, after threading, the code needs to be tested and debugged to ensure that no errors were introduced.



The final step in the tuning cycle is measuring the code's performance. Key metrics that affect the overall performance include: cache hit rate, CPU utilization, synchronization overhead, and thread stall overhead. From these data you can determine the efficiency and level of optimization of the code. With multiple threads executing simultaneously, there are many opportunities for bottlenecks and resource contention issues. Sophisticated analysis tools are available to help quickly identify these problem areas, enabling you to tune your application accordingly.



The study of AMIDE concentrated on its core image-rendering engine. Data-flow analysis identified the major code blocks and the amount of time spent in each, and performance analysis identified the area of code where the majority of processing occurred. Since the application was found to have inherent data parallelism, it was threaded using the data decomposition approach, and POSIX threads were chosen as the threading method. OpenMP would also have been a good candidate for implementing the threads, but I suspect the folks who threaded AMIDE were already comfortable with POSIX threads.



The net result of threading AMIDE to run on a four-core Intel® system (two dual-core processors) was a measured 3.3x performance improvement over the single-core baseline.
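As a rough sanity check (my own back-of-the-envelope math, not a figure from the case study), Amdahl's law relates that 3.3x result to the fraction of execution time that was parallelized:

```latex
S = \frac{1}{(1 - p) + p/n}
\quad\Rightarrow\quad
3.3 = \frac{1}{(1 - p) + p/4}
\quad\Rightarrow\quad
p = \frac{1 - 1/3.3}{1 - 1/4} \approx 0.93
```

In other words, achieving 3.3x on four cores implies roughly 93% of the execution time ran in parallel - consistent with the strategy of attacking the most compute-intensive blocks first.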



And how long does this effort take, you ask? The answer depends on the nature of the starting code, the parallelization experience of the engineers, and their knowledge of the specific application. Real-world examples, such as AMIDE, have shown that it's possible to migrate an application from single- to multi-core, with excellent performance results, in only days.



Visit to read the full AMIDE case study.







Compared with single-core, multi-core parallelism enables processing a constant volume of data in less time (quicker turnaround), more data within a constant time (increased throughput), or a combination of both. Symmetric Multiprocessing (SMP) refers to a computer system with multiple CPUs that share the same Operating System (OS) and main memory. SMP OSs are well suited to multi-core due to their inherent parallel-processing capabilities: SMP treats the multi-core hardware as a shared resource, with a single OS image running across all cores, and processes are dynamically assigned to run on the available cores in a truly parallel manner.


Let me define a few terms. A process is generally the "heavyweight" unit of execution: a collection of the resources required for program execution, such as virtual memory, I/O descriptors, the runtime stack, signal handlers, and other control resources. A thread of execution is associated with a process and is viewed as the "lightweight" unit of execution because threads share the process's environment, which makes context switches between threads efficient. Threads also share an address space with the other threads of their process. Task is a term commonly used interchangeably with "process" and "thread," but more accurately it is simply a group of instructions that form part of a program; it is associated more with real-time operating systems.


Existing applications that already break processing into concurrent jobs can realize multi-core benefits with few, if any, changes. For example, a networked printer application with separate threads for image processing and network protocols should see higher performance if those threads can run in parallel. A serial application can be optimized by multi-threading the compute- or data-intensive portions of the program to extract their parallelism. Although this can be tricky, done correctly it produces the best performance and scalability: write the code once and performance will scale on systems with any number of cores. Various Intel® Software Development Products are available to support this effort, including performance analysis, thread debugging and profiling, performance libraries, and C compilers.



To achieve optimal results, the software developer will benefit from understanding a few subtleties of the multi-core architecture and tuning the SMP implementation to take advantage of features that are specific to the processor architecture, such as a shared L2 cache. The Intel® multi-core processor family includes uni-processor systems in which all cores share a common L2 cache, as well as dual- and multi-processor systems. These variants can affect software performance.



The SMP OS normally assigns processes to the available cores on a first-available basis. At some point a process will relinquish control to the OS - for example, while pending an I/O request, or when the OS gives a time slice to another process. When execution resumes, the OS may well assign that process to a different core from the one where it left off. In the uni-processor case, since all cores share a common cache, the caching effect is the same regardless of where the process executes. However, in "multi-package" systems (multi-core processors that do not share a last-level cache), if a process running on one core is suspended and then resumes on a core served by another cache, the chances of a cache miss are greater - a missed opportunity to benefit from the shared L2 cache. This condition can be circumvented by the SMP technique known as "processor affinity," where the programmer manually "pins," or restricts, process execution to a specific subset of cores - in this case, cores that share a cache - thus leveraging the shared L2 cache for increased overall performance. This technique is also useful for threads that frequently share data.



SMP is a great software implementation for multi-core systems. It can take advantage of the additional cores by running multiple applications simultaneously (serial or multi-threaded), and for optimum performance and scalability the software should be programmed for parallelism and be aware of the specific processor platform's cache architecture.



While SMP is great, it's not the only system design for multiprocessing. Stay tuned for more choices!



  • Lori


