Using Delfino as a reference, TI noticed that flash efficiency is tied to the type of code being run. The C28x plus the FPU improves benchmarks by 44% over the IQ math versions when running from zero wait state RAM, and this is where the real impact of the FPU versus the base C28x CPU can be measured. The flash pipeline, however, erodes some of that benefit on Delfino-class devices: when the same arithmetic benchmark is executed from flash, the improvement is only about 29% over IQ math. So even though code running on a C28x FPU still executes 29% faster than the same code running on a fixed-point C28x with IQ math, the efficiency relative to RAM execution clearly favors IQ math. Specifically, TI measured flash efficiency of about 94% with IQ math, while the FPU achieves only about 76%. The reason for this is that FPU instructions are mostly 32-bit, whereas IQ math uses a mixture of 16-bit and 32-bit instructions. When most instructions are 32 bits wide and the flash runs at two wait states with a 64-bit pre-fetch, the pre-fetch mechanism cannot keep the pipeline full, so flash efficiency drops from the low 90s to around 76%. This has been fixed on Concerto by the larger pre-fetch buffer already described: with the 65-nanometer flash and the 128-bit-wide pre-fetch, the FPU benefits will be similar whether code is executed from RAM or from flash. Even better, TI shows that a further 20% increase in efficiency can be reached when parallel instructions are used, such as a load or store in parallel with an FPU instruction. This, again, is supported by compiler version 6.1 and newer.

