I've been monitoring the PCIe read packets on my Linux box. Much to my surprise, when the application software requests a block read, the request is broken up into many length-one memory read requests. Is there a way to improve this?
Unlike the write case, there is no such thing as "read combining". However, there is a way to generate requests for more than 8 bytes: prefetching. For memory regions marked as cacheable, the CPU may read data in advance, i.e. before it is actually needed. When data is fetched into the cache, whole cache lines are read. This operation is called a cache line fill. A cache line is 64 bytes on the platform used in this tutorial. Note that caching is disabled by default for all I/O regions that are mapped into memory. The following trace shows what happens if caching is enabled.
My second question is: how do you enable I/O caching in Linux?
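To put rough numbers on what the quoted text implies, here is a small sketch of the arithmetic. It is only a simplified model, not a description of what any particular CPU or root complex actually does: it assumes every read request covers exactly one naturally aligned chunk of a fixed size (8 bytes for an uncached access, 64 bytes for a cache line fill, as stated above). The function name `read_request_count` and the 4096-byte block size are my own choices for illustration.

```python
def read_request_count(start: int, length: int, request_size: int) -> int:
    """Count the aligned chunks of `request_size` bytes touched by the
    byte range [start, start + length).

    Simplified model: one memory read request per aligned chunk.
    Real hardware may behave differently.
    """
    if length <= 0:
        return 0
    first_chunk = start // request_size
    last_chunk = (start + length - 1) // request_size
    return last_chunk - first_chunk + 1


if __name__ == "__main__":
    block = 4096  # hypothetical block size in bytes, chosen for illustration
    for size in (8, 64):
        count = read_request_count(0, block, size)
        print(f"{size:2d}-byte requests for a {block}-byte block: {count}")
```

Under these assumptions, a 4096-byte block needs 512 requests at 8 bytes each but only 64 cache line fills, which is why I would like to get caching enabled for this region.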