Making STM32 Ethernet Work With Cache Enabled
This article explains how turning on CPU cache on modern STM32 chips can silently break Ethernet DMA and cause weird, hard-to-debug network issues. It walks through why this happens and shows simple, practical ways to fix it by keeping Ethernet buffers out of cached memory or properly syncing the cache so the CPU and DMA see the same data.
Overview
The world of microcontrollers was peaceful and predictable until someone introduced advanced interconnect buses. Unhappy with that, someone else introduced caches.
Basically, “modern” microcontrollers have a bunch of internal buses that can be switched or routed. The CPU core has a bunch of buses, and other buses interconnect peripherals.
- Some peripherals are slave stuff, like RAM.
- Some peripherals are bus masters, like DMA.
As, historically, accessing generic RAM through a bus has been slower than accessing special-purpose RAM “tightly coupled” to the CPU, someone thought a cache, sitting between the CPU and memory, could improve this.
When reading, the cache gets populated on the first read, serving subsequent reads. When writing, there are two strategies:
- Write-through: Data is written to the cache (write-allocate, or not: no-write-allocate) and to the memory. This last operation can be deferred (write-allocate only).
- Write-back: Data is written to the cache. To write to memory, the cache needs to be “flushed.” When the cache contains data that has not been written to memory, it is said to be “dirty,” so this operation is also called “cleaning” the cache.
When there is more than one bus master, be it another core or a DMA controller, things can get out of sync quickly.
CPU writes, DMA reads: CPU → Cache → Memory → DMA
- Write-through cache: No problem
  - no-write-allocate: Tx just works
  - write-allocate: the CPU may miss the DMA marking the descriptor available, as the next CPU read will be from the cache → “No descriptors available” (see DMA writes, CPU reads below)
- Write-back cache: As CPU-written data can still be in the cache, the DMA controller may read stale data. The cache needs to be flushed/cleaned before the DMA process starts.
DMA writes, CPU reads: CPU ← Cache ← Memory ← DMA
- The memory area the DMA controller writes to may be cached. Since the controller writes straight to memory, the cache can still hold data from a previous CPU read that is no longer valid. The cache needs to be invalidated when the DMA controller finishes, before the CPU starts reading.
- This is a simplification; things are like that only if data is FULLY aligned to a cache line and not shared with anything else. See Appendix C.
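Both directions can be reproduced on a desk with a toy model. The following is a minimal sketch, not real hardware behavior: it assumes a single 32-byte line, a write-back, write-allocate policy, and invented names (`struct toy`, `cpu_write`, `dma_read`, and so on) that exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE 32u /* assumed cache line size in bytes */

/* Toy model: one line of RAM with a one-line write-back cache in front of it */
struct toy {
  uint8_t ram[LINE];   /* what the DMA controller sees */
  uint8_t cache[LINE]; /* what the CPU sees while the line is valid */
  int valid, dirty;
};

/* CPU read: a miss fills the line from RAM, a hit is served from the cache */
static uint8_t cpu_read(struct toy *t, unsigned i) {
  assert(i < LINE);
  if (!t->valid) {
    memcpy(t->cache, t->ram, LINE);
    t->valid = 1;
    t->dirty = 0;
  }
  return t->cache[i];
}

/* CPU write (write-back, write-allocate): only the cache is updated */
static void cpu_write(struct toy *t, unsigned i, uint8_t v) {
  assert(i < LINE);
  if (!t->valid) {
    memcpy(t->cache, t->ram, LINE);
    t->valid = 1;
  }
  t->cache[i] = v;
  t->dirty = 1;
}

/* The DMA controller bypasses the cache entirely */
static uint8_t dma_read(const struct toy *t, unsigned i) { return t->ram[i]; }
static void dma_write(struct toy *t, unsigned i, uint8_t v) { t->ram[i] = v; }

/* Clean (flush): push a dirty line to RAM; the line stays valid */
static void cache_clean(struct toy *t) {
  if (t->valid && t->dirty) {
    memcpy(t->ram, t->cache, LINE);
    t->dirty = 0;
  }
}

/* Invalidate: drop the line, so the next CPU read goes to RAM */
static void cache_invalidate(struct toy *t) {
  t->valid = 0;
  t->dirty = 0;
}
```

Starting from a zeroed `struct toy`, a `cpu_write()` followed by `dma_read()` returns the old RAM content until `cache_clean()` is called. In the other direction, a `dma_write()` followed by `cpu_read()` returns the old cached value until `cache_invalidate()` is called.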
Objective
We'd like to integrate Mongoose into existing projects, which may already have MPU and I/D caching set. Ideally, if the wizard just drops the “mongoose/” directory into such a project, it should just work. The exact mechanism is to be decided; perhaps it could be driven by the preprocessor definition in mongoose_config.h, but ideally, it should just work without any extra manual definition. Therefore, we do not control the MPU/caching settings of the project, but need to adapt to them.
Possible Strategies
As can be imagined, the above is not free. Caches work by holding lines of data, so invalidating an area means iterating over every line involved, and areas should be aligned to the cache line size, too. The same goes for flushing operations. Working on the whole cache at once heavily penalizes everything else that uses it and, if done frequently, is even worse than disabling the cache altogether (we not only need to read/write memory anyway, but also act on the cache).
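The per-line cost is easy to put into numbers. The helper below is a sketch that assumes 32-byte cache lines; `cache_span` and `is_line_exact` are hypothetical names, not part of any library. It rounds an arbitrary buffer out to whole lines and reports how many lines a clean or invalidate would have to walk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u /* assumption: 32-byte cache lines */

struct line_span {
  uintptr_t start; /* address of the first cache line touched */
  size_t lines;    /* number of cache lines to operate on */
};

/* Round [addr, addr + len) outwards to whole cache lines */
static struct line_span cache_span(uintptr_t addr, size_t len) {
  const uintptr_t mask = ~(uintptr_t) (CACHE_LINE - 1);
  struct line_span s = {addr & mask, 0};
  assert(addr + len >= addr); /* the range must not wrap around */
  if (len > 0) {
    uintptr_t end = (addr + len + CACHE_LINE - 1) & mask;
    s.lines = (size_t) ((end - s.start) / CACHE_LINE);
  }
  return s;
}

/* Nonzero when the buffer owns its cache lines completely, so maintaining
   it cannot touch neighbouring data */
static int is_line_exact(uintptr_t addr, size_t len) {
  return (addr % CACHE_LINE) == 0 && (len % CACHE_LINE) == 0;
}
```

A buffer that fails `is_line_exact()` shares its first or last line with something else, and that is where invalidating becomes dangerous.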
Strategy 1: Avoid the Cache
Depending on the internal buses, some memory sections can be non-cached; that is, the bus switch/matrix connects memory to the CPU, bypassing the cache. We can place our buffers in those sections and relax. This is by far the best option, as it requires no additional actions and doesn't fiddle with hardware it doesn't need. The cache is a man in the middle no one asked for and serves no purpose here, particularly in our architecture, where DMA-written memory goes to a queue and is later copied to a buffer before being processed, or, in the other direction, is just written once and sent.
- Memory area: Unfortunately, Cube doesn't seem to define an area for this purpose, so our efforts would fail when people liberally cut and paste. ST linker files have a single RAM block defined.
- Absolute memory positioning: Ugly, but should work. However, there's a major caveat: there is no way to place a block of memory at an absolute location with GCC. It is possible with clang, and even with Keil or IAR, but not with GCC. We can place a pointer, but to turn that into a block, we need to play with the linker file.
To reinforce this choice, there's also the fact that in some cases the ETH-DMA is unable to access some RAM sections, so we need to craft our linker files to map to usable sections. We've also done things like this for NXP iMXRTs in the past.
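On the C side, the linker-file route usually comes down to tagging the buffers with a named section. The sketch below shows the idea with GCC-style attributes. The section name `.eth_ram`, the buffer count, and the accessor `eth_rx_buf` are all hypothetical, and the linker file still has to map that section to a RAM block that is both non-cached and reachable by the ETH-DMA.

```c
#include <assert.h>
#include <stdint.h>

#define ETH_BUF_SIZE 1536u /* room for one full frame, a multiple of 32 */
#define ETH_RX_BUFS 4u     /* hypothetical queue depth */

/* ".eth_ram" is a hypothetical section name. The linker file must place it
   in non-cached, DMA-reachable RAM; the attribute alone does not do that */
#define ETH_RAM __attribute__((section(".eth_ram"), aligned(32)))

static uint8_t s_rx_bufs[ETH_RX_BUFS][ETH_BUF_SIZE] ETH_RAM;

/* Return a pointer to receive buffer i */
static uint8_t *eth_rx_buf(unsigned i) {
  assert(i < ETH_RX_BUFS);
  return s_rx_bufs[i];
}
```

With the buffers living in such a section, the driver needs no cache maintenance calls at all, which is the whole appeal of this strategy.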
Strategy 2: Mark as Non-Cacheable
If the processor has an MPU, some areas of memory can have specific attributes. We can mark them so the cache will not hold data coming from those locations. The problem with this is that there are few MPU regions, and who are we to decide how our customers will use their MPU? Even though most won't care, those who do need to be aware of what we're doing.
The processor may not have an MPU, or an RTOS might want to make better use of its few regions. In fact, the Arm Cortex-M33 does not have a standard way to work this out, as caches are vendor extensions.
This seems to be ST's preferred way. Their lwIP drivers work with pbufs allocated and managed by lwIP; aligning to cache lines would probably be overkill.
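To give an idea of what "marking an area" involves, here is a sketch that builds the region attribute word for the ARMv7-M MPU found in cores such as the Cortex-M7. The attribute choice (Normal memory, non-cacheable, shareable, execute-never, full access) is an assumption for illustration and not necessarily what ST's code uses. The function names are hypothetical, and ARMv8-M cores such as the Cortex-M33 use a different register layout.

```c
#include <assert.h>
#include <stdint.h>

/* Build an ARMv7-M MPU RASR value for a non-cacheable region.
   Returns 0 if size is not a power of two of at least 32 bytes */
static uint32_t mpu_rasr_noncacheable(uint32_t size) {
  uint32_t bits = 5; /* 2^5 = 32 bytes is the smallest region */
  if (size < 32u || (size & (size - 1u)) != 0u) return 0;
  while ((1u << bits) < size) bits++;
  assert(bits >= 5 && bits <= 31);
  return (1u << 28)            /* XN: no instruction fetches from here */
         | (3u << 24)          /* AP = 0b011: full access */
         | (1u << 19)          /* TEX = 0b001 with C = 0 and B = 0 is */
                               /* Normal memory, non-cacheable */
         | (1u << 18)          /* S: shareable */
         | ((bits - 1u) << 1)  /* SIZE: region is 2^(SIZE + 1) bytes */
         | 1u;                 /* ENABLE */
}

/* The region base address must be aligned to the region size */
static int mpu_base_ok(uint32_t base, uint32_t size) {
  return size != 0u && (base & (size - 1u)) == 0u;
}
```

On a target, the resulting value would go into the MPU's RASR register after selecting a region number and base address. That consumes one of the few regions, which is exactly the cost discussed above.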
Strategy 3: Live With the Cache
We invalidate the proper cache region before reading DMA-written memory, and flush/clean the proper cache region after writing memory that will be read by the DMA.
Some people like to reconfigure the cache to be write-through. From our perspective, this is not much different from just disabling it. It requires the same amount of intrusion and system configuration. Besides that, most of the effort is on the CPU read side of things.
A full Ethernet frame would use 48 32-byte cache lines. Every time a frame is received, those lines are taken away from tasks that could make use of them, just to be invalidated some microseconds later. If the cache has a way to detect lines that are used frequently and keep them, things are fine; otherwise, things may become slower than with the cache disabled. This is an argument in favor of strategy 2 when strategy 1 is not possible and strategy 2 is feasible and convenient.
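In driver terms, living with the cache comes down to two hooks around each DMA transfer. The sketch below assumes 32-byte lines and line-aligned buffers. `dcache_clean` and `dcache_invalidate` are host stand-ins that only count bytes; on a Cortex-M7 target they would typically wrap the CMSIS calls `SCB_CleanDCache_by_Addr()` and `SCB_InvalidateDCache_by_Addr()`. The names `eth_tx_prepare` and `eth_rx_complete` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u /* assumption: 32-byte cache lines */

/* Host stand-ins for the real cache maintenance calls: they only count
   how many bytes were cleaned or invalidated */
static size_t s_cleaned, s_invalidated;
static void dcache_clean(void *p, size_t n) { (void) p; s_cleaned += n; }
static void dcache_invalidate(void *p, size_t n) { (void) p; s_invalidated += n; }

/* Round a length up to a whole number of cache lines */
static size_t round_up_line(size_t n) {
  return (n + CACHE_LINE - 1) & ~(size_t) (CACHE_LINE - 1);
}

/* Call after the CPU has filled buf and before the DMA is started */
static void eth_tx_prepare(void *buf, size_t len) {
  assert(((uintptr_t) buf % CACHE_LINE) == 0); /* buffer owns its lines */
  dcache_clean(buf, round_up_line(len));
}

/* Call after the DMA reports a received frame, before the CPU reads buf */
static void eth_rx_complete(void *buf, size_t len) {
  assert(((uintptr_t) buf % CACHE_LINE) == 0); /* buffer owns its lines */
  dcache_invalidate(buf, round_up_line(len));
}
```

Rounding the length up is only safe because each buffer is assumed to occupy whole cache lines. Descriptors shared with the DMA need the same treatment.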
Hidden Gotchas
There are a lot of buses, some processors might want to reorder accesses, and even compilers might be tempted to reorder instructions, so some specific actions need to wait for others to finish. This is why data synchronization barrier instructions are sprinkled over the code.
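A typical place for such a barrier is between filling a descriptor and telling the DMA to look at it. The sketch below is illustrative only: the descriptor layout, the `DESC_OWN` flag, and `s_fake_tail_reg` (a plain variable standing in for a DMA register) are hypothetical. On non-Arm hosts the `DSB()` macro falls back to a C11 fence so the example still builds.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* On Cortex-M builds use the real instruction; elsewhere fall back to a
   C11 fence so that the sketch compiles and runs on a host */
#if defined(__ARM_ARCH_PROFILE) && __ARM_ARCH_PROFILE == 'M'
#define DSB() __asm volatile("dsb sy" ::: "memory")
#else
#define DSB() atomic_thread_fence(memory_order_seq_cst)
#endif

#define DESC_OWN (1u << 31) /* hypothetical "owned by the DMA" flag */

/* Hypothetical transmit descriptor layout */
struct tx_desc {
  volatile uint32_t addr, len, status;
};

/* Stand-in for the DMA register that makes the controller poll descriptors */
static volatile uint32_t s_fake_tail_reg;

static void eth_tx_kick(struct tx_desc *d, uint32_t buf_addr, uint32_t len) {
  assert((d->status & DESC_OWN) == 0); /* descriptor must belong to the CPU */
  d->addr = buf_addr;
  d->len = len;
  DSB(); /* address and length are complete before ownership changes */
  d->status = DESC_OWN;
  DSB(); /* descriptor is complete before the DMA is told to look at it */
  s_fake_tail_reg = (uint32_t) (uintptr_t) d;
}
```

Without the barriers, the compiler or the bus fabric would be free to let the register write overtake the descriptor writes, and the DMA could fetch a half-written descriptor.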
Conclusion
Mongoose tries to apply strategy 1 when possible, that is, we circumvent the cache in all those architectures that have simple means to do that, like STM32F and STM32H5. When that is not possible, we resort to strategy 3, and our driver will handle all related cache coherency actions.
See Appendix A for microcontroller data, Appendix B for solutions for typical architectures, and Appendix C for the long stories.
Published at DZone with permission of . See the original article here.