Why is vector3/4 (and other "vmath" types) a "userdata" type, and not just a table?

Is there a good reason for this?

As far as I understand it, userdata objects are C objects. It seems the vmath library was written in another language (C or C++) and compiled for use with Lua, which would allow for greater efficiency than a pure Lua implementation of the same library would provide.
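
For context, a small illustration of how the difference shows up on the Lua side:

local v = vmath.vector3(1, 2, 3)
print(type(v))        -- "userdata": the actual data lives in C, Lua only holds a reference
print(v.x, v.y, v.z)  -- component access goes through the C binding

local t = { x = 1, y = 2, z = 3 }
print(type(t))        -- "table": a plain Lua value, components are ordinary fields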

4 Likes

It does not matter how vectors/matrices are represented inside the engine. That is an implementation detail. What matters is how they are exposed to Lua.

Userdata is the worst possible type for vectors and matrices. In practice it only leads to performance loss for no reason, because access to the components is slow.

1 Like

Could you please elaborate on this and perhaps provide some links to back up this claim?

Quick benchmark:

local iterations = 10000000

local time = socket.gettime()
local v3 = vmath.vector3(100, 100, 100)
local v3_add = vmath.vector3(0.1, 0.1, 0.1)
for i=1,iterations do
	v3 = v3 + v3_add
end
print("time v3", socket.gettime() - time)

time = socket.gettime()
local t3 = { x = 100, y = 100, z = 100 }
local t3_add = { x = 0.1, y = 0.1, z = 0.1 }
for i=1,iterations do
	v3.x = v3.x + v3_add.x
	v3.y = v3.y + v3_add.y
	v3.z = v3.z + v3_add.z
end
print("time t3", socket.gettime() - time)

Result:

DEBUG:SCRIPT: time v3	2.0486030578613
DEBUG:SCRIPT: time t3	4.6426301002502

In this case vector3 addition is a little over twice as fast when the vector3s are represented as userdata.

2 Likes

DEBUG:SCRIPT: time v3	2.582750082016
DEBUG:SCRIPT: time t3	0.015422105789185

Your example is incorrect. Look carefully.
And there is probably also a bug in the vector addition method:

DEBUG:SCRIPT: time v3	2.4838500022888
DEBUG:SCRIPT: time t3	0.014795064926147
DEBUG:SCRIPT: vmath.vector3(1088062, 1088062, 1088062)
DEBUG:SCRIPT: 
{ --[[0x14edebe0]]
  y = 1000099.999839,
  x = 1000099.999839,
  z = 1000099.999839
}

Would you mind posting the code that produced that output?

EDIT: Nevermind. Got the same result as you with the following code. No idea why the results differ. Defold uses Sony’s open math libraries (the same ones used by Bullet) internally. Maybe @Mathias_Westerdahl has a clue?

	local iterations = 10000000

	local time = socket.gettime()
	local v3 = vmath.vector3(100, 100, 100)
	local v3_add = vmath.vector3(0.1, 0.1, 0.1)
	for i=1,iterations do
		v3 = v3 + v3_add
	end
	pprint(v3)
	print("time v3", socket.gettime() - time)

	time = socket.gettime()
	local t3 = { x = 100, y = 100, z = 100 }
	local t3_add = { x = 0.1, y = 0.1, z = 0.1 }
	for i=1,iterations do
		t3.x = t3.x + t3_add.x
		t3.y = t3.y + t3_add.y
		t3.z = t3.z + t3_add.z
	end
	pprint(t3)
	print("time t3", socket.gettime() - time)

Output:

DEBUG:SCRIPT: vmath.vector3(1088062, 1088062, 1088062)
DEBUG:SCRIPT: time v3	1.935350894928
DEBUG:SCRIPT: 
{ --[[0x94bfbe0]]
  y = 1000099.999839,
  x = 1000099.999839,
  z = 1000099.999839
}
DEBUG:SCRIPT: time t3	0.01155686378479
1 Like

EDIT: You can safely ignore the rest of this post.

The internal representation is vital since the engine performs transforms on a ton of vectors and matrices each frame and that has to be very fast. The Lua interface has to be closely tied to this internal representation. Consider this:

  1. Let’s say it’s possible to expose the data as a Lua table. It would still have to be transformed back to the internal representation after any change to any of the components, which would, of course, add overhead. And note that this creates two copies of the vectors/matrices you use in Lua code (one in the engine structure and one in the Lua table), which leads to…
  2. The big problem of keeping the data in sync. After each update the engine runs its transforms on the internal data, but of course that does not touch any Lua structures. So after, or during, the transform pass, the engine would have to find all Lua table references to each transformed vector and update them.

Userdata exist to deal with this problem.

(I am not a low level Defold/Lua expert so correct me if I am wrong here)

EDIT: Yes, I had a brain melt.

Why would the engine do that? There is no need to keep anything in sync.
If I get the object position and keep it in some variable, its value is not updated automatically. It is a copy of the internal representation, not a reference to it. I must get it again each frame if I need it. Same with matrices, etc.
We get positions and other stuff on demand, after all.

On the Lua side, in turn, we add, subtract, and transform this data in many ways and very often. And this should be fast. For now, as you can see, everything is ~175x slower than it could be.

A workaround exists, of course.
For example, capture positions in Lua tables or even raw numbers. Transform them as you wish. Then (after all calculations) make a vector3 (or better, reuse one) from this data and hand it back to the engine.
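
A minimal sketch of that workaround in a script's update() (the per-frame math here is just for illustration):

-- keep the position as plain numbers, do all the math on them,
-- and only touch vector3 at the boundary to/from the engine
local pos = vmath.vector3()  -- reused vector3, created once

function update(self, dt)
	local p = go.get_position()
	local x, y, z = p.x, p.y, p.z

	-- arbitrary per-frame math on raw numbers (fast under LuaJIT)
	x = x + 10 * dt
	y = y + 20 * dt

	-- write the result back through the reused vector3
	pos.x, pos.y, pos.z = x, y, z
	go.set_position(pos)
end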

1 Like

Yes, you are right. My brain did not work correctly.

As I understand it, the pointer to the underlying structure would need to go into the table, which is not very safe. (https://stackoverflow.com/questions/1434846/lua-bindings-table-vs-userdata)

(EDIT: Hmm, I may be wrong but thinking about it, I don’t think a pointer has to be stored in there)

Could it be that the speedup is because LuaJIT is able to do optimizations that can’t be done if the plus operator is “overloaded” (so to speak)?

Can you run the tests in release mode? Will that change anything?

This is really interesting, and something I’ve wondered too. The potentially massive speedup must be worth exploring?

Is there a simple way this theory could be tested in practice? Aside from the speedup, it would also feel more logical to simplify the userdata black box to a table.

You can try a pure Lua implementation overloading the add operator in a metatable. See http://www.lua.org/manual/5.1/manual.html#2.8

I can only understand two words in that sentence, and they’re not the useful ones. :slight_smile:

2 Likes

Lua allows you to change the behavior of how tables are handled, so instead of, for instance:

local t3 = { x = 100, y = 100, z = 100 }
local t3_add = { x = 0.1, y = 0.1, z = 0.1 }
for i=1,iterations do
	t3.x = t3.x + t3_add.x
	t3.y = t3.y + t3_add.y
	t3.z = t3.z + t3_add.z
end

You can alter the behavior of + by telling Lua to use a custom function for addition. This makes it possible to write the above like this instead:

local t3 = { x = 100, y = 100, z = 100 }
local t3_add = { x = 0.1, y = 0.1, z = 0.1 }
for i=1,iterations do
	t3 = t3 + t3_add
end

The custom add function would take care of adding each table element in turn. The facility Lua has to allow these kinds of things is called a metatable, which is a table with information about another table. It can contain things like what should happen when you add a table to something, what happens when you read or write keys that don’t exist, and so on.
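
A minimal sketch of what that could look like (no error checking, just the __add metamethod):

local vec3_mt = {}
vec3_mt.__add = function(a, b)
	-- create a new table and give it the same metatable, so results can be added again
	return setmetatable({ x = a.x + b.x, y = a.y + b.y, z = a.z + b.z }, vec3_mt)
end

local t3 = setmetatable({ x = 100, y = 100, z = 100 }, vec3_mt)
local t3_add = { x = 0.1, y = 0.1, z = 0.1 }
t3 = t3 + t3_add  -- Lua calls vec3_mt.__add(t3, t3_add)
print(t3.x, t3.y, t3.z)  -- 100.1	100.1	100.1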

1 Like

Our “release” mode isn’t anything like a Windows “release build”.
We always build with -O2, even the “non release” version.
Our release build simply doesn’t have logging or profiling enabled.

I cannot answer your first question exactly (since this was before my time), but some initial thoughts/ideas as to why:

  • We also use vanilla Lua, which is a LOT slower (see benchmark below)
  • Single precision floats, for consistency perhaps?
    As you saw, the results are wildly different, since they have different precision.
  • The paradigm is still to use reactive code, as opposed to do everything in a script (e.g go.animate()).

Your tests are interesting, and they really show the strength of LuaJIT.
However, you will eventually call back into the engine, so there will always be some calls between Lua and C. And those calls cost a lot more.

Q: Is this example close to your current use case?

get_position()/set_position() with T3

As a test, I implemented two functions go.get_position_t3()/go.set_position_t3() which
accept/return a table of 3 elements (e.g. {x=0,y=0,z=0})

Setting a T3 is faster than setting a vector3.
But getting a T3 is slower than getting a vector3 (and a T4 would be even slower).

In combination they cancel each other out.

Take aways

  • LuaJIT is very fast for operators + - / *
    Not a surprise really, since JIT’ed code is really fast.

  • Vector3/Vector4 uses our C backend, which is more heavyweight than using LuaJIT directly.

  • Returning a T3 is actually slower than returning a Vector3

  • Since we promise to keep the code backwards compatible, I don’t really see us changing the return values of our functions, and as seen, it would lead to worse performance.

  • Implementing a separate “luamath” module in pure Lua (no C calls) might be an idea.
    That implementation should have the same API as the “vmath” module (see the sketch after this list).

  • Another idea could be to accept a T3/T4 in the places where we accept a vector3/vector4, to gain some performance. But I fear this may make for a very confusing API: being able to set with a T3 but not get one back.

  • Also, there are other things behind the scenes that I’ve been meaning to investigate.
    E.g. pooling the lightuserdata structs after deletion. That might speed up the return values.
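
A rough sketch of what such a pure Lua “luamath” could look like (hypothetical module, vector3 only, mirroring a small part of the vmath API):

-- luamath.lua (hypothetical): plain tables + metatables, no C calls
local M = {}
local mt = {}

mt.__add = function(a, b)
	return setmetatable({ x = a.x + b.x, y = a.y + b.y, z = a.z + b.z }, mt)
end
mt.__sub = function(a, b)
	return setmetatable({ x = a.x - b.x, y = a.y - b.y, z = a.z - b.z }, mt)
end
mt.__mul = function(a, b)  -- scalar * vector or vector * scalar
	if type(a) == "number" then a, b = b, a end
	return setmetatable({ x = a.x * b, y = a.y * b, z = a.z * b }, mt)
end

function M.vector3(x, y, z)
	return setmetatable({ x = x or 0, y = y or 0, z = z or 0 }, mt)
end

function M.length(v)
	return math.sqrt(v.x * v.x + v.y * v.y + v.z * v.z)
end

return M

Usage would then mirror vmath, e.g. local v = luamath.vector3(1, 2, 3) + luamath.vector3(4, 5, 6).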

All in all, I don’t see a huge gain for the common case here, since we’re bound to our current API and the combined test (get_add_set_test) doesn’t show any noticeable gain.

That said, we always welcome new ideas and tests, to help make Defold a better product.

Benchmarks

LuaJIT

add_test (v + v_add)
DEBUG:SCRIPT: time v3 1.7740211486816
DEBUG:SCRIPT: time t3 0.018021821975708

add_and_set (v + v_add; go.set_position(v))
DEBUG:SCRIPT: time v3 3.1769490242004
DEBUG:SCRIPT: time t3 1.5994951725006 (using go.set_position_t3())

get_add_set_test
DEBUG:SCRIPT: time v3 5.5890138149261
DEBUG:SCRIPT: time t3 5.4159610271454 (using go.get_position_t3() and go.set_position_t3())

-- get_add_set
for i=1,iterations do
    local t3 = go.get_position_t3();
    t3.x = t3.x + t3_add.x
    t3.y = t3.y + t3_add.y
    t3.z = t3.z + t3_add.z
    go.set_position_t3(t3);
end

Vanilla Lua

add_test
DEBUG:SCRIPT: time v3 3.6135270595551
DEBUG:SCRIPT: time t3 1.0665061473846

4 Likes

Maybe. Because of table creation?
But your combined get/set test shows that T3 is at least not slower than V3.

The whole point is not about speeding up getting/setting values, but about speeding up the operations in between.
If we get vector3s instead of tables, the add/subtract/multiply operations on them would be slow because of slow component access.

A userdata vector3 can be substituted with a table vector3 without breaking backward compatibility. And, as your tests show, without performance degradation.

Yes, because of the table creation.

Well, I don’t think this data supports us changing things in such a fundamental way.

You can also return raw numbers (a tuple: x, y, z) and let the user decide whether a table is needed in that particular case.
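
For example (go.get_position_xyz() is hypothetical and does not exist in the engine; this just illustrates the multiple-return-values idea):

-- hypothetical: the engine returns three plain numbers instead of a userdata
local x, y, z = go.get_position_xyz()  -- made-up function, for illustration only

-- the caller can do the math directly on raw numbers...
x = x + 1
y = y + 2

-- ...and only build a vector3 when the engine actually needs one
go.set_position(vmath.vector3(x, y, z))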

I suppose you would like to have the same behavior in Lua as when the engine does transforms?

Certainly, but adding metatables to allow users to add/sub/mul vectors evens out the time difference:

time = socket.gettime()
local m = {
	__add = function(l, r)
		-- no error checking whatsoever here...
		return { x = l.x + r.x, y = l.y + r.y, z = l.z + r.z }
	end
}
local t3 = { x = 100, y = 100, z = 100 }
local t3_add = { x = 0.1, y = 0.1, z = 0.1 }
setmetatable(t3, m)
for i=1,iterations do
	t3 = t3 + t3_add
	setmetatable(t3, m) -- since the prev line creates a new table.
end
pprint(t3)
print("time t3", socket.gettime() - time)

Timing:

DEBUG:SCRIPT: time v3	1.9194610118866
DEBUG:SCRIPT: time t3	1.7654540538788
3 Likes