Are there any tricks for utilising NEON efficiently on the Cortex-A35? I believe the Cortex-A35 executes in order, so what is the correct way to load and process data?
- Do I need to load data into a batch of NEON registers to hide load latency (as I found described in an article on the Cortex-A8)?
- Does combining load and store operations save CPU cycles (do they execute in parallel)?
- Does preloading (prefetching) improve data-cache behaviour when consecutive buffers are accessed?
- Do ARM and NEON instructions execute in parallel, so that I can interleave ARM and NEON code to save CPU cycles?
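
For concreteness, here is a minimal sketch of the loop pattern these questions refer to, written in C with NEON intrinsics. The function name `add_f32`, the unroll factor of 16 floats and the prefetch distance of 64 floats are illustrative assumptions, not values tuned for or measured on the Cortex-A35. `__builtin_prefetch` is a GCC/Clang builtin, and a plain scalar path is used when NEON is not available.

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical example: dst[i] = a[i] + b[i] for n floats. */
void add_f32(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    /* Process 16 floats per iteration (assumed unroll factor). */
    for (; i + 16 <= n; i += 16) {
        /* Prefetch data for a later iteration (assumed distance),
           only while the target address is still inside the arrays. */
        if (i + 64 < n) {
            __builtin_prefetch(a + i + 64);
            __builtin_prefetch(b + i + 64);
        }

        /* Batch the loads so that no load is immediately followed
           by an instruction that depends on its result. */
        float32x4_t a0 = vld1q_f32(a + i);
        float32x4_t a1 = vld1q_f32(a + i + 4);
        float32x4_t a2 = vld1q_f32(a + i + 8);
        float32x4_t a3 = vld1q_f32(a + i + 12);
        float32x4_t b0 = vld1q_f32(b + i);
        float32x4_t b1 = vld1q_f32(b + i + 4);
        float32x4_t b2 = vld1q_f32(b + i + 8);
        float32x4_t b3 = vld1q_f32(b + i + 12);

        /* Arithmetic on the batch of registers. */
        float32x4_t r0 = vaddq_f32(a0, b0);
        float32x4_t r1 = vaddq_f32(a1, b1);
        float32x4_t r2 = vaddq_f32(a2, b2);
        float32x4_t r3 = vaddq_f32(a3, b3);

        /* Batch the stores. */
        vst1q_f32(dst + i,      r0);
        vst1q_f32(dst + i + 4,  r1);
        vst1q_f32(dst + i + 8,  r2);
        vst1q_f32(dst + i + 12, r3);
    }
#endif
    /* Scalar tail, and the whole loop on targets without NEON. */
    for (; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}
```

Whether the batching or the prefetch actually helps on an in-order core like the A35 is exactly what is being asked; the compiler is also free to reorder these intrinsics, so the generated assembly would need to be checked.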