4 Replies Latest reply on Aug 3, 2009 10:21 AM by DougB

    What cpu loading should be normal for 1Gbit network bandwidth on 81341?

    Green Belt

      I tested on an 81341 + 82546 with a Linux kernel.

      When the network data rate reaches 940 Mb/s, the 81341 CPU load goes up to 100%. I also tested the memory (533 MHz DDR2) write/read speed using ramspeed and found, strangely, that write is much faster than read.


      >> INTEGER & READING         1 Kb block: 819.87 Mb/s
      >> INTEGER & READING         2 Kb block: 815.78 Mb/s
      >> INTEGER & READING         4 Kb block: 829.45 Mb/s
      >> INTEGER & READING         8 Kb block: 830.34 Mb/s
      >> INTEGER & READING        16 Kb block: 822.44 Mb/s
      >> INTEGER & READING        32 Kb block: 739.76 Mb/s
      >> INTEGER & READING        64 Kb block: 502.07 Mb/s
      >> INTEGER & READING       128 Kb block: 493.90 Mb/s
      >> INTEGER & READING       256 Kb block: 496.52 Mb/s
      >> INTEGER & READING       512 Kb block: 291.22 Mb/s
      >> INTEGER & READING      1024 Kb block: 134.80 Mb/s
      >> INTEGER & READING      2048 Kb block: 133.39 Mb/s
      >> INTEGER & READING      4096 Kb block: 133.53 Mb/s
      >> INTEGER & READING      8192 Kb block: 133.40 Mb/s
      >> INTEGER & READING     16384 Kb block: 133.41 Mb/s
      >> INTEGER & READING     32768 Kb block: 133.25 Mb/s

      >> INTEGER & WRITING         1 Kb block: 2231.47 Mb/s
      >> INTEGER & WRITING         2 Kb block: 2329.18 Mb/s
      >> INTEGER & WRITING         4 Kb block: 2324.89 Mb/s
      >> INTEGER & WRITING         8 Kb block: 2367.37 Mb/s
      >> INTEGER & WRITING        16 Kb block: 2279.63 Mb/s
      >> INTEGER & WRITING        32 Kb block: 2377.30 Mb/s
      >> INTEGER & WRITING        64 Kb block: 2349.88 Mb/s
      >> INTEGER & WRITING       128 Kb block: 2273.93 Mb/s
      >> INTEGER & WRITING       256 Kb block: 1698.63 Mb/s
      >> INTEGER & WRITING       512 Kb block: 1395.81 Mb/s
      >> INTEGER & WRITING      1024 Kb block: 966.47 Mb/s
      >> INTEGER & WRITING      2048 Kb block: 945.92 Mb/s
      >> INTEGER & WRITING      4096 Kb block: 942.78 Mb/s
      >> INTEGER & WRITING      8192 Kb block: 929.33 Mb/s
      >> INTEGER & WRITING     16384 Kb block: 916.71 Mb/s
      >> INTEGER & WRITING     32768 Kb block: 924.87 Mb/s



      Can anyone explain this? What CPU load should be considered normal for 1 Gbit bandwidth on the 81341?



      Thanks a lot!

        • Re: What CPU loading should be normal for 1Gbit network bandwidth on 81341?
          Green Belt

          My name is Doug Boom and I do support for companies writing software for Intel's extensive line of Ethernet products.  While I'm not an 81341 expert, I've done a bunch of work around performance, and I think I can explain what you're seeing.


          Reading from main memory will cause a flush of the outstanding cache lines.  Basically, the CPU and memory controller both stop to make sure all pending writes are completed and everything is accounted for in terms of internal processing.  Writes tend to be cached and are therefore much faster.  Reads need the pending writes flushed before they can complete, and this causes a stall.  Because writes don't get flushed until something like a read or a full cache line happens, they tend to be faster.  There may be some scaling problems in the application as well.  Also, the larger the data being read, the larger the area that needs to be locked, which can lead to lower throughput.


          Ethernet frames are typically only 1500 bytes, so you can see from your data that this size is still in the highest-rate part of the data set.  Even with jumbo frames, that is only 9 KB, so it's still in the better part of the data you've collected.


          Depending on the frame size you're using, Ethernet can mean up to about 1.49 million packets a second of processing (the gigabit line rate for minimum-size 64-byte frames).  Most of the time the frame goes all the way up the stack, so the frame is processed several times (driver, stack, socket, application) before completing.  This means it is processed by the memory subsystem almost every time one of the SW actors (driver, stack) works on it.  All of this work adds up to more CPU utilization.  So I would consider your numbers "normal" given the level of detail outlined so far in your post.
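To put numbers on the packet rate, here is a minimal sketch of the line-rate arithmetic. It assumes standard Ethernet framing (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte inter-frame gap); the function name `frames_per_second` is just an illustrative helper, not something from the driver.

```python
# Per-frame overhead on the wire that is not part of the Ethernet frame itself:
# 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap.
WIRE_OVERHEAD_BYTES = 7 + 1 + 12


def frames_per_second(frame_bytes, link_bps=1_000_000_000):
    """Upper bound on frames/s for a frame size that includes header and FCS."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


if __name__ == "__main__":
    # 64 bytes is the minimum frame; 1518 bytes is a full frame at a 1500-byte MTU.
    for label, size in (("64-byte frame", 64), ("1518-byte frame", 1518)):
        print(f"{label}: {frames_per_second(size):,.0f} frames/s")
```

At gigabit speed this works out to roughly 1.49 million frames per second for minimum-size frames and roughly 81 thousand per second for full-size frames, so the per-packet cost depends heavily on the traffic mix.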


          If you want to save some CPU, make sure checksum offloading is enabled on the TX and RX side and increase the MTU to the largest size your switch can handle.  That will move more data per transaction, which can save some CPU.  The offloads also help move some of the workload onto the 82546, saving more CPU.  But the stock Linux driver has the checksums on by default, so chances are you're already seeing the benefit of the offloads.  For a fun test, invert the checksum offloads (if on, turn off; if off, turn on) and retest to see the effect on throughput and/or CPU.  Another thing is to set the kernel timer frequency (HZ) from 1000 to the old 100 value, which will allow for longer time slices, more time doing work, and less time moving between tasks.  Changing the HZ rate in Linux can have other side effects, but it can be a fun experiment to see the effects.  You'll need to rebuild your kernel to make the HZ change.
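As a rough illustration of why a larger MTU helps, the sketch below estimates how many packets per second are needed to carry the 940 Mb/s from the original post at two MTU sizes. It assumes the 940 Mb/s is TCP payload throughput and that each packet carries 40 bytes of IPv4 + TCP headers with no options; both are assumptions, and `packets_per_second` is a hypothetical helper.

```python
def packets_per_second(goodput_bps, mtu, header_bytes=40):
    """Packets/s needed to carry goodput_bps of payload at a given MTU.

    header_bytes is the assumed IPv4 + TCP header size (20 + 20, no options).
    """
    payload_bytes = mtu - header_bytes
    return goodput_bps / 8 / payload_bytes


if __name__ == "__main__":
    goodput = 940e6  # bits per second, the rate reported in the original post
    standard = packets_per_second(goodput, 1500)
    jumbo = packets_per_second(goodput, 9000)
    print(f"MTU 1500: {standard:,.0f} packets/s")
    print(f"MTU 9000: {jumbo:,.0f} packets/s")
    print(f"reduction factor: {standard / jumbo:.1f}x")
```

Under these assumptions the packet count drops by about a factor of six (from about 80 thousand to about 13 thousand per second). CPU load will not fall by the same factor, since the per-byte copy cost stays, but the per-packet work in the driver and stack should shrink accordingly.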


          Thanks for using Intel networking products and good luck with your design.

          Message Edited by Doug B on 07-28-2009 03:44 PM to correct for spelling mistakes :)
            • Re: What CPU loading should be normal for 1Gbit network bandwidth on 81341?
              Green Belt

              My name is Fred.Fan. I am doing oprofile tuning and memory performance tests on the IOP341 platform.


              1) Before discussing the read performance issue, I want to attach the log of a memory test run with my own minimal test code.


               ~# ./memory_test -t 1

              mem_test_write 154


                       totoal bytes = 268435456

                       totoval us = 596164

                       result 450 MBPS

              ~# ./memory_test -t 0

              mem_test_read 127


                       totoal bytes = 268435456

                       totoval us = 1111875

                       result 241 MBPS


              Read is about half as fast as write.

              But in the ramspeed results, the gap is about 7 times for large blocks.


              I have also run ramspeed on a PC and on other ARM platforms, and there read is much better than write.
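For reference, the rates and ratios above can be recomputed from the logged byte counts and times. This is only a sketch of the arithmetic: it assumes the test program's "MBPS" means bytes per microsecond (decimal megabytes per second), which matches the printed results, and `mb_per_s` is an illustrative helper.

```python
def mb_per_s(total_bytes, elapsed_us):
    """Bytes per microsecond is numerically equal to 10**6 bytes per second."""
    return total_bytes / elapsed_us


if __name__ == "__main__":
    # Figures from the memory_test log (268435456 bytes is 256 MiB).
    write_rate = mb_per_s(268435456, 596164)
    read_rate = mb_per_s(268435456, 1111875)
    print(f"memory_test write: {write_rate:.1f} MB/s")
    print(f"memory_test read:  {read_rate:.1f} MB/s")
    print(f"memory_test write/read ratio: {write_rate / read_rate:.2f}")

    # ramspeed figures for the largest (32768 Kb) block from the first post.
    print(f"ramspeed write/read ratio:    {924.87 / 133.25:.2f}")
```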


              2) My question is why XScale shows the inverse result.

               2.1) From the XScale microarchitecture developer's manual, we learn that the write buffer is non-blocking and that the L2 cache operates as write-back and write-allocate. Write-allocate should improve write performance.

              But other ARM platforms should have a write buffer, too.  Why is write slower than read on those platforms?

               2.2) The memory test does only reads or only writes, so there is no case where an access stalls waiting for a write to complete. But there are stalls waiting for the load buffer to complete (when the buffer is full).

               2.3) I also have a question about write behavior on a cache miss. If I write one word and then read the next word, what will happen? Is it one of the cases below, or something else?

                     2.3.1) There are two buffers (one for writes, one for cache fills), and the cache fill buffer waits for the write buffer to complete.

                     2.3.2) The cache line is loaded first and then the write goes to the cache. The following read either hits the cache or waits for the cache line load to complete.

              3) Normally, the MCU timing is set to 4-4-4. With a 32-bit data width, one burst access needs almost 16 MCU cycles. The MCU clock is 266 MHz for 533 MHz DDR.

              DDR2 transfers data twice per clock cycle, and the burst length is 4.

              So 8 words are accessed in 16 MCU cycles. That means 533 MB/s of bandwidth.

              But my read test result is 241 MB/s, and ramspeed's worst case is about 130 MB/s.

              Unless burst access is not being used, there is no reason to get such a bad result.
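The estimate in point 3 can be written out as a short calculation. This is a sketch of that arithmetic only: the 16-cycle window and 8 words per window are the figures given above, not values taken from the 81341 documentation, and the names are illustrative.

```python
MCU_CLOCK_HZ = 266_000_000   # MCU clock for 533 MHz DDR2, as stated above
CYCLES_PER_WINDOW = 16       # MCU cycles assumed per access window (4-4-4 timing)
WORDS_PER_WINDOW = 8         # 32-bit words assumed transferred in that window
BYTES_PER_WORD = 4


def estimated_bandwidth_mb_s(clock_hz=MCU_CLOCK_HZ,
                             cycles=CYCLES_PER_WINDOW,
                             words=WORDS_PER_WINDOW,
                             bytes_per_word=BYTES_PER_WORD):
    """Bytes moved per window times windows per second, in decimal MB/s."""
    bytes_per_window = words * bytes_per_word
    windows_per_second = clock_hz / cycles
    return bytes_per_window * windows_per_second / 1e6


if __name__ == "__main__":
    estimate = estimated_bandwidth_mb_s()
    print(f"estimated bandwidth: {estimate:.0f} MB/s")
    # Measured read rates quoted above, as a fraction of the estimate.
    for label, measured in (("memory_test read", 241), ("ramspeed worst case", 133)):
        print(f"{label}: {measured} MB/s = {measured / estimate:.0%} of estimate")
```

With a 266 MHz clock the estimate comes out at 532 MB/s (533 MB/s if the clock is taken as 266.67 MHz), so the measured reads reach roughly a quarter to a half of it.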


              • Re: What CPU loading should be normal for 1Gbit network bandwidth on 81341?
                Green Belt

                Doug Boom, thanks for your prompt reply.

                Currently the Linux kernel E1000 driver supports checksum offload, and our system is configured for 100 Hz by default, but the driver does not support jumbo frames.

                Here are my questions:

                Does the jumbo frame feature need a switch that supports large packets? How much CPU load can jumbo frames save?

                Are there any other ways to decrease CPU load?