Welcome to the Intel® Embedded Community.
Since you are using Intel® Tunnel Creek, I want to make you aware of a special place to go with your technical questions. The Intel e-Help desk is staffed by Intel representatives dedicated to answering embedded Intel® architecture product, design and development questions for select Intel processors, including Atom.
The Intel e-Help desk is only available to registered Privileged users. Before you can access e-Help, you will first need to upgrade your community membership to Privileged status. Privileged account status also allows you to access special documents and tools at the Intel® Embedded Design Center. As a moderator, I can see from your user profile that the upgrade of your account should be granted quickly.
If you are interested, click here to go to your ‘My Account’ page and request Privileged access.
In the meantime, let's see if someone in the community can help you with an answer.
Well, I requested privileged access, but haven't heard anything back yet.
Since I last posted, I've tried a few more tricks with the topcliff ethernet to try to improve the TX performance, but haven't had any luck. I understand that it probably isn't meant to be a high-performance device, but I would like some idea of what I should expect in terms of performance, just to rule out the possibility that I did something terribly wrong in my software. I've learned that I have an A0-stepping Tunnel Creek processor, so it's possible the issue is a matter of an early silicon erratum.
That said, I'm a little disappointed with the topcliff ethernet design in general. Yes, I know this is an embedded platform, and that this implies things should be kept small and simple. But even given that rationale, I don't understand some of the design choices that were made.
For example, there is no support for scatter/gather. You are restricted to using one descriptor per frame. I suppose that isn't *that* unusual; however, there is also a 64-byte alignment restriction on all DMA addresses (the lower 6 bits of the addresses are not decoded). This means all frames must fit into a single buffer, *and* all buffers have to be 64-byte aligned. You can fudge this for the RX path without it causing too many problems, but for the TX path, unless you are lucky enough to have a protocol stack that guarantees it will meet these constraints, you're almost always going to have to copy the data prior to transmission in order to coalesce and/or align everything correctly.
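To illustrate the kind of bounce-buffer copy this forces on the TX path, here is a minimal sketch in plain C. The `struct frag` and `coalesce_tx` names and the fragment representation are purely illustrative, not anything from the datasheet; only the 64-byte alignment constraint comes from the hardware.

```c
#include <stdint.h>
#include <string.h>

#define TX_ALIGN 64  /* the lower 6 DMA address bits are not decoded */

/* One fragment of an outgoing frame as handed down by the stack. */
struct frag {
    const void *data;
    size_t      len;
};

/* Copy all fragments into a single 64-byte-aligned bounce buffer so the
 * frame fits in one descriptor. 'bounce' must have at least TX_ALIGN - 1
 * bytes of slack beyond the frame size. Returns the aligned address to
 * program into the descriptor; *total receives the coalesced length. */
static void *coalesce_tx(void *bounce, const struct frag *frags,
                         int nfrags, size_t *total)
{
    uint8_t *dst = (uint8_t *)(((uintptr_t)bounce + TX_ALIGN - 1)
                               & ~(uintptr_t)(TX_ALIGN - 1));
    size_t off = 0;
    for (int i = 0; i < nfrags; i++) {
        memcpy(dst + off, frags[i].data, frags[i].len);
        off += frags[i].len;
    }
    *total = off;
    return dst;
}
```

Every transmitted frame pays this extra memcpy unless the stack happens to hand you a single, already-aligned buffer.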
I can only assume those restrictions came about as the result of a desire to limit the complexity of the controller in order to reduce the logic gate count. But that doesn't explain why additional logic was committed to implement a TCP/IP accelerator engine.
Using the TCP/IP accelerator for checksum offload is also problematic. When you enable checksum offload, either on RX or TX, the controller requires that a two-byte pad be inserted between the ethernet frame header (the first 14 bytes of the frame) and the frame payload. No TCP/IP stack uses this frame representation internally, which means when you send a frame, you always have to copy the data to coerce it into the form that the controller expects, and when you receive a frame, you have to copy again to coerce it back into what the protocol stack expects. These extra copies cost CPU cycles, which tends to negate the benefit of having checksum offload in the first place.
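Concretely, the TX-side copy looks something like the following sketch (the function name and the `CKSUM_PAD` constant are my own; the 14-byte header and two-byte pad layout are as described above):

```c
#include <stdint.h>
#include <string.h>

#define ETH_HDR_LEN 14
#define CKSUM_PAD    2   /* pad the checksum engine expects after the header */

/* Rewrite a conventional frame (14-byte ethernet header immediately
 * followed by the payload) into the padded layout the checksum engine
 * expects. 'dst' must have room for len + CKSUM_PAD bytes. Returns the
 * new frame length. This copy is exactly the overhead described above. */
static size_t pad_for_cksum_offload(uint8_t *dst, const uint8_t *src,
                                    size_t len)
{
    memcpy(dst, src, ETH_HDR_LEN);                  /* header       */
    memset(dst + ETH_HDR_LEN, 0, CKSUM_PAD);        /* two-byte pad */
    memcpy(dst + ETH_HDR_LEN + CKSUM_PAD,
           src + ETH_HDR_LEN, len - ETH_HDR_LEN);   /* payload      */
    return len + CKSUM_PAD;
}
```

The RX path needs the mirror image of this to strip the pad back out before handing the frame to the stack.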
Also, while the controller supports jumbo frames for gigabit, there is no way to limit the maximum frame size in the receiver. Most controllers provide a register to specify the maximum frame size: any frames over that size will be discarded by the hardware instead of being DMAed to host memory. But the topcliff ethernet lacks this. The controller is able to receive up to 10.3KB of data in a single frame, and there's no way to force it to receive less. It does flag frames larger than 1514 bytes as 'too long', but it DMAs the data to the host anyway. This means you must always populate the RX DMA ring with buffers that are at least 10.3KB in size, even when you don't expect to use jumbo frames. If you only use, say, 1536-byte buffers, and a jumbo frame is received unexpectedly, the chip will DMA all of it to the host and overrun the buffer.
Always using 10.3KB buffers will avoid the potential buffer overruns, but at the cost of using more memory, which seems counter-intuitive for an embedded platform where the usual goal is to conserve memory.
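For the record, the safe-but-wasteful RX ring setup ends up looking roughly like this (a sketch only: the ring depth, the exact buffer size constant, and the function name are my own assumptions; the 64-byte alignment and the ~10.3KB worst-case frame are the hardware constraints discussed above):

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <stdint.h>

#define RX_ALIGN        64      /* DMA address alignment restriction   */
#define RX_RING_ENTRIES 64      /* ring depth: an arbitrary assumption */
#define RX_BUF_SIZE     10548   /* >= 10.3KB, the largest frame the MAC
                                 * will DMA regardless of what we want */

/* Allocate one worst-case, 64-byte-aligned buffer per RX descriptor.
 * Because the receiver cannot be told to drop oversized frames, every
 * buffer must be big enough for the largest frame the MAC can pass up.
 * Returns the backing pool (caller frees it), or NULL on failure. */
static void *alloc_rx_buffers(void *bufs[RX_RING_ENTRIES])
{
    size_t per_buf = (RX_BUF_SIZE + RX_ALIGN - 1) & ~(size_t)(RX_ALIGN - 1);
    uint8_t *pool;

    if (posix_memalign((void **)&pool, RX_ALIGN,
                       per_buf * RX_RING_ENTRIES) != 0)
        return NULL;
    for (int i = 0; i < RX_RING_ENTRIES; i++)
        bufs[i] = pool + (size_t)i * per_buf;
    return pool;
}
```

With 1536-byte buffers this ring would take ~96KB; at the forced worst-case size it takes ~660KB, which is the memory cost I'm complaining about.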
Lastly, there is some ambiguity in the RX TCP/UDP checksum offload feature. The TCP/IP status field has a bit that is set when the received TCP/UDP checksum is incorrect. However, there is no bit that indicates whether the current frame does or does not contain a TCP/UDP packet. The bit will be 0 when a valid TCP/UDP packet is received, but it will also be 0 when other kinds of frames are received. This makes it hard to tell what the state of that bit really means in a given situation. The only way to be certain it indicates a valid TCP/UDP checksum is to manually inspect the protocol type in the IP header, which you shouldn't need to do. (The PRO/1000 controllers, for example, have both a 'TCP/UDP checksum checked' bit and a 'TCP/UDP checksum error' bit. This makes it much clearer when a frame with a bad TCP/UDP checksum has been received.)
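The manual inspection I'm referring to is something like this sketch: a driver-side check that the frame is actually IPv4 TCP or UDP before trusting the checksum-error bit (the function name and constants are illustrative, and this ignores VLAN tags and IPv6 for brevity):

```c
#include <stdint.h>
#include <stddef.h>

#define ETH_HDR_LEN   14
#define ETH_P_IP      0x0800
#define IPPROTO_TCP_N 6
#define IPPROTO_UDP_N 17

/* Only trust the controller's 'TCP/UDP checksum error' bit when the
 * frame actually carries TCP or UDP. Returns nonzero if the frame is
 * IPv4 TCP/UDP, i.e. the status bit is meaningful for this frame. */
static int cksum_bit_applies(const uint8_t *frame, size_t len)
{
    uint16_t ethertype;
    uint8_t  proto;

    if (len < ETH_HDR_LEN + 20)             /* need a full IPv4 header */
        return 0;
    ethertype = (uint16_t)((frame[12] << 8) | frame[13]);
    if (ethertype != ETH_P_IP)
        return 0;
    proto = frame[ETH_HDR_LEN + 9];         /* IPv4 protocol field */
    return proto == IPPROTO_TCP_N || proto == IPPROTO_UDP_N;
}
```

A 'checksum checked' status bit in hardware would make this per-frame parsing unnecessary, which is exactly my complaint.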
It's these sorts of things that make me question just what kind of performance target I should be aiming for. Even with the extra gymnastics in my driver necessitated by the controller design, the 150Mbps TX performance that I'm getting now seems unusually low.