What’s the justification for performing the model → world transformations on the CPU? As I understand Defold’s batching system, when the material’s vertex space is set to World (the default), it transforms every vertex from model to world space on the CPU, writes the transformed positions into a vertex buffer, and the vertex shader then only multiplies them by the view and projection matrices, which are passed as uniforms.
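For reference, this is roughly what the shader side of that setup looks like. It is a sketch modeled on Defold’s built-in sprite vertex program rather than an exact copy: the `position` attribute already arrives in world coordinates, so only the combined view–projection is applied on the GPU.

```glsl
// Vertex program for world-space batching: positions in the vertex
// buffer are already in world coordinates (transformed on the CPU).
uniform highp mat4 view_proj;       // combined view * projection matrix

attribute highp vec4 position;      // already in world space
attribute mediump vec2 texcoord0;

varying mediump vec2 var_texcoord0;

void main()
{
    // Only the view-projection transform is left for the GPU to do.
    gl_Position = view_proj * vec4(position.xyz, 1.0);
    var_texcoord0 = texcoord0;
}
```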
I’ve never seen a batching system that prefers to send vertices to the GPU already transformed into world coordinates, because these transformations are much faster when done in parallel on the GPU. Instead, each object’s world transformation matrix could be passed to the vertex shader as part of the corresponding vertex buffer (see the sketch below). This would also eliminate a problem I’ve seen come up: developers who want easy access to model-space coordinates in the shader without breaking the batch.
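To illustrate what I mean, here is a minimal sketch assuming the batcher could write each object’s world matrix into the vertex stream as four vec4 attributes. The `mtx_world0`..`mtx_world3` attribute names and the extra varying are hypothetical, not something Defold exposes today; the point is that the model → world multiply moves to the GPU while model-space positions stay available:

```glsl
uniform highp mat4 view_proj;        // view * projection, as before

attribute highp vec4 position;       // model-space position
// World matrix delivered as four column attributes alongside each vertex
// (hypothetical attribute names; a mat4 attribute would work the same way).
attribute highp vec4 mtx_world0;
attribute highp vec4 mtx_world1;
attribute highp vec4 mtx_world2;
attribute highp vec4 mtx_world3;

varying highp vec4 var_model_position;

void main()
{
    mat4 mtx_world = mat4(mtx_world0, mtx_world1, mtx_world2, mtx_world3);
    // Model-space coordinates remain accessible without breaking the batch...
    var_model_position = position;
    // ...and the model -> world transform happens in parallel on the GPU.
    gl_Position = view_proj * mtx_world * vec4(position.xyz, 1.0);
}
```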
Was it designed this way to work around a limitation of older hardware? Is it targeting some very old version of OpenGL, or something specific to OpenGL ES?