I agree, it’s a good question.
Something we always try to keep in mind is cache locality: the goal is to keep the relevant data in the L1 cache while we're doing calculations on it.
To that end, we follow a data-oriented design: we allocate contiguous blocks up front, when a collection is loaded.
Also, we keep the C structs (POD types) as small as possible.
The end result is that we can avoid some costly cache misses when processing the data.
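A minimal sketch of that idea in C (the names here are illustrative, not our actual API): one contiguous array of small POD structs, allocated once when the collection is loaded, then processed in a tight linear loop.

```c
#include <stdlib.h>
#include <stddef.h>

typedef struct {
    float x, y;    /* position */
    float vx, vy;  /* velocity */
} Sprite;          /* small POD: 16 bytes, so several fit in one cache line */

typedef struct {
    Sprite *items; /* one contiguous block, allocated up front */
    size_t  count;
} SpritePool;

/* Allocate the whole pool in one contiguous block when the collection loads. */
int sprite_pool_init(SpritePool *pool, size_t count)
{
    pool->items = calloc(count, sizeof(Sprite));
    pool->count = pool->items ? count : 0;
    return pool->items != NULL;
}

/* Linear pass over contiguous memory: the hardware prefetcher keeps the
   cache fed, avoiding the pointer-chasing misses of a node-per-object layout. */
void sprite_pool_update(SpritePool *pool, float dt)
{
    for (size_t i = 0; i < pool->count; ++i) {
        pool->items[i].x += pool->items[i].vx * dt;
        pool->items[i].y += pool->items[i].vy * dt;
    }
}
```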
Keeping the code itself small has a similar effect: it's less code that has to be fetched and run, so more of it stays in the instruction cache!
These practices come from experience with other projects (e.g. game engines) and knowing what works well and what doesn't.
That said, I think there is nothing new or unique about our approach; we just keep these concerns in focus all the time.
And at the end of the day, we let the projects themselves show whether it's performant or not.
Here’s our bunny mark example, running 32k sprites @60fps. (click to add more sprites)