My name is Doug Boom and I do support for companies writing software for Intel's extensive line of Ethernet products. While I'm not an 81341 expert, I've done a bunch of work around performance, and I think I can explain what you're seeing.
Reading from main memory causes a flush of the outstanding cache lines: the CPU and memory controller both stop to make sure all pending writes have completed and everything is accounted for internally. Writes tend to be buffered and are therefore much faster, while a read has to wait for that flush to finish, which causes a stall. Because writes don't get flushed until something like a read or a full cache line forces it, writing tends to be faster. There may be some scaling problems in the application as well. Also, the larger the data being read, the larger the area that needs to be locked, which can lead to lower throughput.
Ethernet frames are typically only 1500 bytes, so you can see that this size is still in the highest-rate part of your data set. Even with jumbo frames, that's only 9KB, so it's still in the better part of the data you've collected.
Depending on the frame size you're using, Ethernet can mean processing on the order of 1.44 million packets a second. Most of the time a frame goes all the way up the stack, so it is handled several times (driver, stack, socket, application) before completing. That means the memory subsystem touches it almost every time one of the SW actors (driver, stack) works on it, and all of that work adds up to more CPU utilization. So I would consider your numbers "normal" given the level of detail outlined so far in your post.
If you want to save some CPU, make sure checksum offloading is enabled on both the TX and RX sides, and increase the MTU to the largest size your switch can handle. That moves more data per transaction, which can save some CPU. The offloads also push some of the workload onto the 82546, saving more CPU. But the stock Linux driver has the checksums on by default, so chances are you're already seeing the benefit of the offloads. For a fun test, invert the checksum offloads (if on, turn off; if off, turn on) and retest to see the effect on throughput and/or CPU. Another option is to drop the kernel HZ rate from 1000 back to the old value of 100, which allows longer time slices: more time doing work, less time moving between tasks. Changing the HZ rate in Linux can have other side effects, but it can be a fun experiment to see the effects. You'll need to rebuild your kernel to make the HZ change.
Thanks for using Intel networking products and good luck with your design.

Message Edited by Doug B on 07-28-2009 03:44 PM to correct for spelling mistakes :)
My name is Fred.Fan. I am doing oprofile tuning and memory performance testing on the IOP341 platform.
1) Before discussing the read performance issue, I want to attach the log of a memory test using my simplest test code.
~# ./memory_test -t 1
totoal bytes = 268435456
totoval us = 596164
result 450 MBPS
~# ./memory_test -t 0
totoal bytes = 268435456
totoval us = 1111875
result 241 MBPS
Reads are about twice as slow as writes. But in the ramspeed results the gap is almost 8x. I have also run ramspeed on a PC and on other ARM platforms, and there reads are much faster than writes.
2) My question is why XScale shows the inverse result.
2.1) From the XScale microarchitecture developer's manual, the write buffer is non-stalling and the L2 cache works as write-back with write-allocate. Write-allocate should improve write performance.
But other ARM platforms should have a write buffer too. Why is write slower than read on those other platforms?
2.2) The memory test does pure reads or pure writes, so there is no case where a read stalls waiting for write completion. But there are stalls waiting for the load buffers to complete (when the buffers are full).
2.3) I also have a question about write behavior on a cache miss. If I write one word and then read the next word, what happens? Is it one of the following, or some other behavior?
2.3.1) There are two buffers (one for the write, one for the cache fill), and the cache-fill buffer waits for the write buffer to complete.
2.3.2) The cache line is loaded first, and then the write goes to the cache; the second access (the read) either hits the cache or waits for the cache line load to complete.
3) Normally the MCU setting is 4-4-4. With a 32-bit data width, one burst access needs roughly 16 MCU cycles; the MCU clock is 266MHz for DDR2-533.
DDR2 transfers two data beats per clock with a burst length of 4.
So 16 MCU cycles move 8 words, which works out to about 533MB/s of bandwidth.
But my read test result is 241MB/s, and RAMSPEED's worst case is 130MB/s.
Unless burst access is not happening at all, there is no reason to get such a bad result.
Doug Boom, thanks for your prompt reply.
Currently the Linux kernel E1000 driver supports checksum offload and our system is configured with HZ=100 by default, but the driver does not support jumbo frames.
Here are my questions:
Does the jumbo frame feature require a switch that supports large frames? How much CPU load can the jumbo frame feature save?
Are there any other ways to decrease CPU load?