Lua and Luajit Best Practices?

dlannan · May 11, 2021, 11:43am

The following thread is intended to collect all the good ways to use Lua/Luajit within Defold.
Some of the information will include:

General Lua performance with respect to language use
How to tweak luajit to perform that little bit better
FFI specific performance improvements (for those targeting platform external libs)
Interop performance notes for native libraries.
Defold specific optimizations

Hopefully this could be a useful reference of common links and information that people can add to over time.

Initial References
Defold Optimisation Page

Lua Classic Tips:

http://lua-users.org/wiki/OptimisationCodingTips

Luajit Performance Tips:
https://www.programmersought.com/article/62883257108/

Luajit Performance Tips (from Mike Pall):
http://wiki.luajit.org/Numerical-Computing-Performance-Guide
http://wiki.luajit.org/NYI

dlannan · May 12, 2021, 9:15am

Luajit projects useful for learning interop:
UFO Excellent examples of luajit ffi

Lua for Windows:
A huge array of interop libraries for windows. You can make almost any application using these.

These projects provide a huge array of samples and libraries for people to learn Lua and Luajit with. Highly recommended to new Lua users.

dlannan · May 13, 2021, 10:56am

Some quick tips for lua performance:
Always create local references to functions within tables if they are going to be called many times.
The example here uses the table library. Put this at the top of your lua file, and it will improve perf if you are operating on tables many times in a frame.

local tinsert = table.insert
local tremove = table.remove 
local tconcat = table.concat

You can do this with any module or lua script table you are using.
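
As a minimal sketch of the same idea inside a hot loop (build_list and the element count are made up for illustration, they are not from any Defold API):

```lua
-- Cache the library function once, at file scope.
local tinsert = table.insert

-- Hypothetical helper: fills a list with n squares using the cached reference.
local function build_list(n)
  local list = {}
  for i = 1, n do
    tinsert(list, i * i)   -- no lookup of "table" or "insert" on each call
  end
  return list
end

local squares = build_list(1000)
print(#squares, squares[10])
```

The saving per call is small, so this only pays off when the call count per frame is high.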

dlannan · May 13, 2021, 10:58am

Another classic perf tip:
Always try to use ipairs if you intend to do a large amount of iteration within the main game frame at runtime.

-- Do this
for i, v in ipairs(mytablelist) do --[[ something ]] end
-- Try not to do this
for k, v in pairs(mytablelist) do --[[ something ]] end

If you need to use keyed tables, try to use them for lookups and not for iteration.
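
One way to follow that advice, sketched with made-up data (enemies, enemy_by_id and add_enemy are hypothetical names): keep an array for the per-frame iteration and a separate keyed table that is only ever used for direct lookups.

```lua
-- Hypothetical data: the same objects are reachable two ways.
local enemies = {}        -- array: iterate with ipairs
local enemy_by_id = {}    -- keyed table: lookups only

local function add_enemy(id, hp)
  local e = { id = id, hp = hp }
  enemies[#enemies + 1] = e
  enemy_by_id[id] = e
  return e
end

add_enemy("orc", 30)
add_enemy("bat", 5)
add_enemy("troll", 80)

-- Per-frame work walks the array in order.
local total_hp = 0
for i, e in ipairs(enemies) do
  total_hp = total_hp + e.hp
end

-- A single lookup goes through the keyed table, no iteration needed.
local troll = enemy_by_id["troll"]
print(total_hp, troll.hp)
```

The cost is keeping the two tables in sync when something is removed.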

dlannan · May 13, 2021, 11:09am

Another handy lua tip. This is not performance related, but it is a look into the wonderful world of metatables and metamethods.
Often you want to get the size of a table. You can iterate the table, and if you use insert and remove then table.getn will work most of the time. But it can be frustrating because using table[key] = value can break getn.
What to do? Metatables!!! This is a little OO like, but here's an example of a table with some metamethods that help solve the above problem.

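-- __newindex only fires for keys that are missing from the table itself,
-- so the values live in a separate table and myobject stays empty.
local data = {}
local count = 0
local myobject = {}
local mt = {
  __index = function (tbl, key)
       if(key == "count") then return count end
       return data[key]
  end,
  __newindex = function (tbl, key, value)
       if(key == "count") then return end -- don't allow count modification!
       if(data[key] == nil and value ~= nil) then count = count + 1 end
       if(data[key] ~= nil and value == nil) then count = count - 1 end
       data[key] = value
  end
}
setmetatable(myobject, mt)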

Now when you use myobject[key]=value, myobject will have a count property that shows how many entries have been added and removed through the newindex ([]) method.
When combined with other functions and metamethods you can do some really nice things to make managing tables much easier and more intuitive to the developer.
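
As one possible way of combining this with other metamethods, here is a sketch that wraps the pattern in a factory and adds __call for iteration. new_counted and inventory are hypothetical names. It assumes the values are kept in a private table behind an empty proxy, which also means pairs() on the proxy itself sees nothing, hence the __call iterator.

```lua
-- Hypothetical factory: each call returns an independent counted table.
local function new_counted()
  local data, count = {}, 0
  return setmetatable({}, {
    __index = function(_, key)
      if key == "count" then return count end
      return data[key]
    end,
    __newindex = function(_, key, value)
      if key == "count" then return end            -- count is read-only
      local existed = data[key] ~= nil
      if value ~= nil and not existed then count = count + 1 end
      if value == nil and existed then count = count - 1 end
      data[key] = value
    end,
    -- Calling the table returns an iterator over the stored pairs.
    __call = function() return next, data, nil end,
  })
end

local inventory = new_counted()
inventory.sword = 1
inventory.shield = 1
inventory.sword = 2      -- overwrite, count unchanged
inventory.shield = nil   -- removal, count drops
inventory.count = 99     -- ignored
print(inventory.count)
for k, v in inventory() do print(k, v) end
```

Every read and write now goes through a metamethod, so this is a convenience for bookkeeping and not something to put in a hot loop.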

dlannan · September 24, 2021, 3:11pm

FFI is cool and it is very fast - C fast. But it also has a bunch of baggage. Here are some quick tips and code snippets using ffi, and how you can leverage it.

The Bad
FFI bypasses the normal Lua C API when calling C functions. This means that when you call an FFI C function you are handling pointers and addresses directly, so crashing your application (and even Defold) is a real possibility. Beware!

FFI is platform specific. When you call into the lower levels, you are calling that platform's specific functions, compiled for that platform (OS/hardware). This means that if you need cross-platform support you will need to make FFI mappings for each platform.

Note: FFI will not work on html5 (or I don't think it will - wasm might blow up). FFI should work on the other target platforms though.

The Good
Because FFI lets you call native C functions it is insanely quick. Luajit treats an FFI call like a direct C function call. This means you get the benefits of the JIT compiler as well as the speed of running something at maximum perf on a machine.

It is horribly easy to use. It is one of the best ways to interface with Lua and very easy to write for - just write a C dll/so and call it.

How?
This all sounds interesting Dave, but what is it, and how can I use it?
Here's a quick example. We want to call the OS level malloc to allocate a huge amount of memory (which we can't always do in Lua), and we want to put stuff in it.

local ffi = package.preload.ffi()  -- In Defold this is a little different. Normally you use: local ffi = require("ffi")
-- Declare the functions you want to use (these are C runtime functions)
ffi.cdef[[
void * malloc( size_t bytes );
void free(void *ptr);
]]

-- That's it. We are done! Now we can call malloc and free directly!!
-- malloc returns a void *, which cannot be indexed, so cast it to a typed pointer first.
local mymem = ffi.cast("uint8_t *", ffi.C.malloc( 1e8 ))  -- alloc 100MB - you can make this over 2GB which is lua's own internal limit
assert(mymem ~= nil, "malloc failed")  -- a NULL pointer compares equal to nil

-- Put something in it. FFI lets you use 0 based array assignment!
mymem[0] = 10
mymem[100000] = 20
-- Get the values
pprint(mymem[0], mymem[100000])
-- Let the memory free! Do not forget to do this, or you may end up in a bit of a mess
ffi.C.free(mymem)

Some things to note.
When using C library functions like malloc and free, ffi exposes them through the ffi.C namespace. This is why you call them with a leading ffi.C.
When loading external libraries you need to call ffi.load on the library, which returns a namespace object holding that library's functions. More details here: FFI Library
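
A small sketch of the difference between the two namespaces. The library name "m" (the C math library) is only an assumption for illustration and may not resolve on every platform, which is why the load is wrapped in pcall. In Defold you would get ffi from package.preload.ffi() as mentioned earlier.

```lua
local ffi = require("ffi")   -- in Defold: package.preload.ffi()

ffi.cdef[[
size_t strlen(const char *s);
double cos(double x);
]]

-- Standard C library symbols are reachable through the default namespace ffi.C.
-- Lua strings are passed as const char * automatically.
local len = tonumber(ffi.C.strlen("defold"))
print(len)

-- ffi.load returns a separate namespace for one specific shared library.
local ok, libm = pcall(ffi.load, "m")
if ok then
  print(libm.cos(0))
else
  print("could not load library: " .. tostring(libm))
end
```

The cdef declarations are shared: the same prototype can be called through ffi.C or through a loaded library, depending on where the symbol actually lives.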

dlannan · December 4, 2021, 2:37am

Adding some more info. There is a great discussion here:

The discussion focuses mainly on locals vs table lookups. In general, use a local to access a table property, especially in tight loops or when doing large numbers of lookups.

local mymodule = require("mymodule")
-- Access my module function or table a lot
local myfunc = mymodule.myfunc
local mytbl = mymodule.mytbl

local a, b = 1, 1
for i=1, 1e6 do 
  a = a + mytbl.somenumber
  b = b + myfunc()
end

The problem with accessing tables is that every property access is a hashmap lookup.
If you have a piece of code like:

local mybigtable = require("mybigtable")
-- Do some loopy stuff
for i=1, 1e6 do print(mybigtable.t1.t2.t3.x) end

The access here isn't one hashmap lookup. It's four (five if mybigtable were a global rather than a local).
Get mybigtable → get table t1 (hash lookup) → get table t2 (hash lookup) → get table t3 (hash lookup) → get field x (hash lookup).
If you do this many times then it can create performance problems - especially if the hashmaps you are referencing are large (10K+).

The way around this problem is to localise three of those lookups so you now have:

local mybigtable = require("mybigtable")
local myvec = mybigtable.t1.t2.t3
-- Do some loopy stuff
for i=1, 1e6 do print(myvec.x) end

This is a completely pointless example, but it shows how, rather than doing 4 lookups every step in a loop, you only do one. And that's the primary reason for using locals. A local is a simple variable that saves the expensive lookup calls - think of it like a cache.
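
To make the 4-versus-1 count visible, here is a sketch that wraps each level in an empty proxy whose __index counts accesses. The counted helper and the value 42 are made up for illustration. It counts index operations as written in the source, not what the JIT ends up doing after optimisation.

```lua
local lookups = 0

-- Wrap a table in an empty proxy whose __index counts every field access.
local function counted(real)
  return setmetatable({}, {
    __index = function(_, key)
      lookups = lookups + 1
      return real[key]
    end
  })
end

local mybigtable = counted({ t1 = counted({ t2 = counted({ t3 = counted({ x = 42 }) }) }) })

-- Chained access: t1, t2, t3 and x are each looked up on every step.
local sum = 0
for i = 1, 1000 do sum = sum + mybigtable.t1.t2.t3.x end
local chained = lookups

-- Localised access: three lookups once, then one per step.
lookups = 0
local myvec = mybigtable.t1.t2.t3
for i = 1, 1000 do sum = sum + myvec.x end
local localised = lookups

print(chained, localised)
```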

Some caveats to this method:

Only 200 locals are available to a function scope - but if you are using more than that, then your function is way too big anyway.
Locals can only bring perf benefits if there are benefits to be had, i.e. if all your table accesses are single property accesses like mytable.stuff, then there is probably not a lot of point doing it.
Beware of strings. When you reference a table with string notation as shown, then the first time you use that reference (string - t1, t2, t3, or x) luajit will make a string hashid for it. If you have many (let's say 100K) then on the first reference of these properties luajit will spend some “extra time” creating these hashids. Usually not an issue, but beware if you are dealing with large numbers of anything.

Thanks to jseb for a great discussion on the matter. Sometimes I automatically assume people know this about luajit … For more performance info, please have a look at the docs at the top of the thread. Mike Pall’s info is very valuable in determining how best to manage a luajit design.

dlannan · December 6, 2021, 5:55am

Some notes: Please don't read too much into the sample code here. It is for explanation only. For example this code:

for i=1, 1e6 do print(mybigtable.t1.t2.t3.x) end

Will only do that lookup once (unless there are some metamethods on x), because luajit is smart enough to reuse repeated expressions and results (like most compilers). When does that multiple hash lookup occur? At least once at the start, and depending on how you manipulate the parent tables, how many registers are used, and lots of underlying “ifs”, it may happen multiple times per loop step. The ‘safe’ and simple way to ensure that mybigtable.t1.t2.t3 is not going to impact much is to put in the local reference.
I'm sure this is all fairly obvious, but what is not obvious is how luajit specifically treats all the underlying execution - and that's highly variable. It's perfectly fine to not use locals. And if you are comfortable doing that, don't worry, it's unlikely you will see problems in a normal game or application.
If you think there are perf issues, then test first, get the results (i.e. where the perf problems are), and only then look at applying changes. There's no need to change a style over something like this.
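
A rough way to "test first" is a tiny timing harness around os.clock. This is only a sketch: time_it and the nested table are hypothetical, and a micro-benchmark like this says little about a real game frame, so treat the numbers as a hint and profile the actual code as well.

```lua
-- Hypothetical nested data standing in for a real module table.
local mybigtable = { t1 = { t2 = { t3 = { x = 1 } } } }

-- Time a function with os.clock (CPU time); returns seconds and the result.
local function time_it(fn)
  local start = os.clock()
  local result = fn()
  return os.clock() - start, result
end

local N = 1e6

local t_chain, sum_chain = time_it(function()
  local sum = 0
  for i = 1, N do sum = sum + mybigtable.t1.t2.t3.x end
  return sum
end)

local t_local, sum_local = time_it(function()
  local myvec = mybigtable.t1.t2.t3
  local sum = 0
  for i = 1, N do sum = sum + myvec.x end
  return sum
end)

print(string.format("chained: %.4fs  localised: %.4fs", t_chain, t_local))
```

If the two timings come out close on your runtime, that is a good sign the style change is not worth making.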

morgerion · February 9, 2022, 12:28am

I am interested in memory limitations.
Can you tell me how many bytes are actually used to organize an array element? I mean, since each element has a pointer, it’s probably 8 bytes since the engine is 64-bit. But this is just my supposition. Lua may add some overhead of its own on top.

dlannan · February 9, 2022, 2:36am

Hi @morgerion - this is a difficult thing to answer. Firstly we need to know what runtime you are using. Is it native Lua or Luajit? Each compiles Lua to bytecode, but Luajit (with its JIT) compiles hot bytecode to machine code, which makes it considerably more performant at runtime (mostly relevant to functions). Then, if you are using Luajit, you may just use normal C arrays with ffi - meaning you can have arrays of any size/structure.

I’ll try to answer each as simply as possible with ref links.
Lua native:
Tables in Lua (including arrays) are built on a hash table, with an optional array part for consecutive integer keys:
https://www.lua.org/source/5.1/ltable.h.html
This means the cost of a hash entry is fixed by the size of a Node. As you can see the table is much like a Node pointer list: Lua 5.1.5 source code - lobject.h
Each Node is thus probably what you want to look at - it holds a value and a key (defined just above the Table definition). Each of these is a TValue, which means it can be a number, string, function, nil and some other odd types.
On a 64-bit build each TValue is 16 bytes (an 8-byte value plus a type tag, padded) and the key also carries a next pointer for the hash chain, so a Node works out to about 40 bytes. Elements held in the array part cost only the 16-byte TValue.
Lua 5.1.5 source code - lobject.h

In Luajit this is a little bit different. Most of the above still applies (since Luajit is compatible with Lua 5.1 at the language and C API level) but you will note the Node structure is different:

github.com: LuaJIT/LuaJIT/blob/1d7b5029c5ba36870d25c67524034d452b761d27/src/lj_obj.h#L487

/* -- Table object -------------------------------------------------------- */

/* Hash node. */
typedef struct Node {
  TValue val;		/* Value object. Must be first field. */
  TValue key;		/* Key object. */
  MRef next;		/* Hash chain. */
#if !LJ_GC64
  MRef freetop;		/* Top of free elements (stored in t->node[0]). */
#endif
} Node;

LJ_STATIC_ASSERT(offsetof(Node, val) == 0);

Luajit packs every TValue into 8 bytes, so together with the hash chain reference a Node comes to about 24 bytes per hash entry. If you look at TValue in Luajit you'll notice it has a lot of platform specific code; this is to cater for the variants the JIT can run on. Generally, measuring this node usage should get you close to mem usage.

With Luajit and ffi, life is easy … it's just like using C but directly from Lua. Here's a simple example of how you could work with an array of 32 bit ints.

local ffi = require("ffi")               -- In Defold: local ffi = package.preload.ffi()
local myarray = ffi.new("int[3]", {})    -- Initialise a C like int array with 0's
print(myarray[0])
myarray[0] = 5
print(myarray[0])

Reference: FFI Semantics
Using ffi gives you complete control. However, Luajit will not hold your hand. You are effectively handling pointers, so you need to manage that properly. If you use ffi.new, then the gc will clean up after you once you are finished using it. But if you pass the pointers around and forget to let go of the handle… well… you know… C… right?
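
For memory that does not come from ffi.new, one option is ffi.gc, which attaches a finalizer to a pointer so the collector calls it for you. A minimal sketch, assuming the C runtime's malloc and free are available through ffi.C (the element count is arbitrary):

```lua
local ffi = require("ffi")   -- in Defold: package.preload.ffi()

ffi.cdef[[
void *malloc(size_t bytes);
void free(void *ptr);
]]

local n = 1024
-- malloc'd memory is not tracked by the garbage collector on its own.
local raw = ffi.C.malloc(n * ffi.sizeof("int32_t"))
assert(raw ~= nil, "malloc failed")

-- ffi.gc attaches a finalizer: free() runs when this pointer object is collected.
local ints = ffi.gc(ffi.cast("int32_t *", raw), ffi.C.free)

ints[0] = 5
ints[n - 1] = 7
local first, last = ints[0], ints[n - 1]
print(first, last)

-- To release early, detach the finalizer and free by hand.
ffi.C.free(ffi.gc(ints, nil))
ints = nil
```

The finalizer is tied to that one pointer object, so copies of the address held elsewhere (for example on the C side) become dangling once it is collected.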

Overall, it really depends on what you are measuring too. Lua stores indexed (consecutive integer keys) arrays in a separate array part rather than in the hash part used for other keys (hashmaps), so the per-element footprint of the two is not the same.
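
If you would rather measure than read the source, a rough estimate can be taken from the collector's own accounting with collectgarbage("count"). This is only a sketch (bytes_per_element is a made-up helper) and the figures include spare capacity from table growth, so expect approximate numbers that vary by runtime.

```lua
-- Rough bytes-per-element estimate; build is any function returning a table.
local function bytes_per_element(build, n)
  collectgarbage("collect")
  local before = collectgarbage("count")   -- kilobytes in use
  local t = build(n)
  collectgarbage("collect")
  local after = collectgarbage("count")
  local count = 0
  for _ in pairs(t) do count = count + 1 end  -- also keeps t alive until here
  return (after - before) * 1024 / count
end

local N = 100000

local array_cost = bytes_per_element(function(n)
  local t = {}
  for i = 1, n do t[i] = i end          -- consecutive integer keys
  return t
end, N)

local hash_cost = bytes_per_element(function(n)
  local t = {}
  for i = 1, n do t[i + 0.5] = i end    -- non-integer keys go to the hash part
  return t
end, N)

print(string.format("array: %.1f bytes/element, hash: %.1f bytes/element",
  array_cost, hash_cost))
```

Number keys are used for the hash case so that the cost of the key strings themselves does not get mixed into the result.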

morgerion · May 25, 2022, 7:32am

Is it possible to pass FFI arrays to native extension as userdata?

dlannan · May 25, 2022, 3:57pm

I'm not sure. I think this might be problematic. Memory allocated through ffi is managed by Luajit, which does things like automated destruction. So I suspect there might be problems.
Give it a try. If I get some time, I might have a look at it.