Sense check my custom script profiler please!

I need to optimise my game. I used the profiler to identify some script bottlenecks. The tricky part is then identifying which part of a script is causing the slowdown. I whipped up a script that I can use to measure the time taken to execute each section of a script. My test seems to work, but I’m concerned that I’ve misunderstood or missed something that might muddle my results. I’d appreciate someone looking this over to sense check it before I expand it and start using it.

The main script (I named it ‘brofiler’ :innocent:):

local M = {}

M.total_time = 0
M.segments = {}

function M.frame_init()
    M.total_time = 0
    M.segments = {}
end

function M.start_segment(segment)

    if not M.segments[segment] then
        M.segments[segment] = {total_time = 0}
    end
    
    M.segments[segment].current_start = socket.gettime()
end

function M.end_segment(segment)

    if not M.segments[segment] then
        return
    end

    local t = socket.gettime() - M.segments[segment].current_start

    M.segments[segment].total_time = M.segments[segment].total_time + t

    M.total_time = M.total_time + t

end

function M.frame_final()

    for segment, data in pairs(M.segments) do
        print("Segment: " .. segment .. " took " .. data.total_time .. " (" .. math.ceil(data.total_time / M.total_time * 10000) / 100 .. "%)")
    end
end

return M

Test script to profile:

local brofiler = require "brofiler.brofiler"

function update(self, dt)
	
	brofiler.frame_init()

	brofiler.start_segment("short_loop")

	for i=1,100000 do
		math.random()
	end

	brofiler.end_segment("short_loop")

	brofiler.start_segment("long_loop")

	for i=1,1000000 do
		math.random()
	end

	brofiler.end_segment("long_loop")

	brofiler.frame_final()
end

Output seems to work:

DEBUG:SCRIPT: Segment: short_loop took 0.011999130249023 (12.97%)
DEBUG:SCRIPT: Segment: long_loop took 0.080535888671875 (87.04%)

Thanks!

The script looks good and should work for your purposes.

If you want something a little bit more advanced then I’d recommend testing ProFi:

It sets up debug hooks to catch and measure every line of Lua code that is executed, and it generates a report sorted by number of function calls (or by time spent in each function, I can’t remember which).

Since it uses debug hooks it will slow down code execution further, but since it measures ALL code it will still give you accurate measurements and help you track down problematic code.
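
If I remember correctly it only takes a few lines to set up. Something along these lines (method names from memory, and the require path assumes ProFi.lua sits in the project root, so double-check against the ProFi readme):

local ProFi = require "ProFi"

function init(self)
	ProFi:start() -- start collecting via debug hooks
end

function final(self)
	ProFi:stop() -- stop collecting
	ProFi:writeReport("profi_report.txt") -- write the sorted report to a file
end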

Not sure about others, but I love this kind of detective work, trying to figure out how to optimise and speed up code! :slight_smile:


Thanks for checking, and also appreciate the recommendation - it seems like just what I need!

Due to a lack of knowledge when I started my project, I have a lot of utility functions sitting as globals (rather than, say, in a module). They also create a bunch of new variables each call (including vectors). I was wondering how expensive that would be, so I created a test using an expanded version of my script above.

I created four functions:

  1. Global function that creates new variables
  2. Global function that recycles variables
  3. Local function that creates new variables
  4. Local function that recycles variables

The functions look like this:

function global_function_new_variables()
	local var = 1
	local var2 = vmath.vector3(1, 1, 1)
end

recycled_global_var = 1
recycled_global_var2 = vmath.vector3()
function global_function_recycled_variables()
	recycled_global_var = 1
	recycled_global_var2.x = 1
	recycled_global_var2.y = 1
	recycled_global_var2.z = 1
end

local function local_function_new_variables()
	local var = 1
	local var2 = vmath.vector3(1, 1, 1)
end

local recycled_local_var = 1
local recycled_local_var2 = vmath.vector3()
local function local_function_recycled_variables()
	recycled_local_var = 1
	recycled_local_var2.x = 1
	recycled_local_var2.y = 1
	recycled_local_var2.z = 1
end

Using brofiler (which I made into a library for now, don’t know if it’s worth documenting) I created a test script like this:

local brofiler = require "brofiler.brofiler"

function update(self, dt)
	
	brofiler.frame_init()

	local times_to_execute = 100000

	brofiler.start_segment("local recycled")
	for i=1,times_to_execute do
		local_function_recycled_variables()
	end
	brofiler.end_segment("local recycled")

	brofiler.start_segment("local new")
	for i=1,times_to_execute do
		local_function_new_variables()
	end
	brofiler.end_segment("local new")

	brofiler.start_segment("global recycled")
	for i=1,times_to_execute do
		global_function_recycled_variables()
	end
	brofiler.end_segment("global recycled")

	brofiler.start_segment("global new")
	for i=1,times_to_execute do
		global_function_new_variables()
	end
	brofiler.end_segment("global new")

	brofiler.frame_final()
end

The results I get aren’t consistent, and I suspect my test isn’t designed well.

One result is:

DEBUG:SCRIPT: Tracked segments took a total time of 0.5747s to execute.
DEBUG:SCRIPT: global new took 0.1316s (22.9%), called 1 times
DEBUG:SCRIPT: local recycled took 0.1481s (25.8%), called 1 times
DEBUG:SCRIPT: local new took 0.1264s (22%), called 1 times
DEBUG:SCRIPT: global recycled took 0.1688s (29.4%), called 1 times

Local new is fastest, but only slightly faster than global new. Local recycled (which you’d expect to be the fastest one) is slower than local new and global new. This makes me think that assigning a number to x, y and z on an existing vector is actually slower than creating a new vector, so it isn’t really a “fair” comparison. Maybe? At least the local versions are both faster than their global equivalents.

But in another frame, the result is:

DEBUG:SCRIPT: Tracked segments took a total time of 0.6113s to execute.
DEBUG:SCRIPT: global new took 0.1201s (19.7%), called 1 times
DEBUG:SCRIPT: local recycled took 0.1422s (23.3%), called 1 times
DEBUG:SCRIPT: local new took 0.1406s (23%), called 1 times
DEBUG:SCRIPT: global recycled took 0.2085s (34.2%), called 1 times

In this case, global new is much faster than local new. Strange.

Is there something I can do to improve the validity of this test? I need to spend a lot of time optimising my game and there are a lot of potential areas, and so I am trying to find the most juicy targets to focus on. I thought moving my utility scripts to a module and recycling variables would be faster, but so far this test doesn’t really bear that out.


Variables that are local vs global vs in a module table will have some impact on performance, but not much overall.

A local variable should be the fastest. Global and module should have similar performance as they both index into a table (_G or the module table).
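
As a rough illustration of the indexing difference (just a sketch, the loop is arbitrary): caching a global or module function in a local removes the table lookups on every call inside a hot loop:

-- global access: looks up "math" in the environment and "random" in the math table on every call
for i = 1, 100000 do
	math.random()
end

-- cache the function in a local first: no table lookups inside the loop
local random = math.random
for i = 1, 100000 do
	random()
end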

What you should pay attention to is the creation of short-lived objects. Objects in this case are Lua tables and Lua userdata (i.e. vector3, vector4 and matrix4). These are costly to create and they generate garbage that needs to be collected, which may cause frame stutter. Try to avoid that as much as possible!

  • Avoid table creation inside tight loops
  • Reuse vector3/4 if you can
  • Reuse message tables
  • Cache hashes

Examples:

-- bad: create a new hash every time
-- bad: create a table every time
function bad_damage(target, amount, type)
	msg.post(target, hash("damage"), { amount = amount, type = type })
end

local DAMAGE_MSG_ID = hash("damage") -- cache message hash
local DAMAGE_MSG = {} -- cache message table
function good_damage(target, amount, type)
	DAMAGE_MSG.amount = amount
	DAMAGE_MSG.type = type
	msg.post(target, DAMAGE_MSG_ID, DAMAGE_MSG)
end

--------

-- if no external source can move a game object you might
-- as well get the initial position and then reuse it
-- instead of constantly calling go.get_position()
function init(self)
	self.pos = go.get_position()
	self.dir = vmath.vector3(0, 1, 0)
	self.speed = 100
end

function update(self, dt)
	self.pos = self.pos + self.dir * self.speed * dt
	go.set_position(self.pos)
end

I didn’t realise that, but it makes sense!

I do cache hashes and try to avoid calling go.get_position() more than necessary. I didn’t realise that table creation was costly, and indeed I believe I always create new tables for messages. So that is probably a performance gain I can score, particularly in big central scripts that send messages to a lot of entities.
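
For those central scripts I’m thinking of something along these lines (just a sketch - the message name and target list are made up):

local HIT_MSG_ID = hash("hit") -- cached message hash (hypothetical message)
local HIT_MSG = { amount = 0 } -- one message table reused for every recipient

local function broadcast_hit(targets, amount)
	HIT_MSG.amount = amount
	for i = 1, #targets do
		-- the table is serialized when the message is posted, so reusing it between posts is safe
		msg.post(targets[i], HIT_MSG_ID, HIT_MSG)
	end
end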

Thanks for the examples!

  1. Don’t store and access anything in self (indexing userdata is much slower than indexing regular tables, so use modules for everything) - see the sketch after this list.
  2. Don’t access vector components (x, y, z, w) directly (see 5., because vectors are userdata). Make a native extension that returns (and sets) components in one pass, and don’t touch the userdata.
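
A rough sketch of what I mean by 1. (module and field names are arbitrary, key the table with whatever id you like, e.g. the game object id):

-- state.lua: keep per-object data in a plain Lua table instead of on the self userdata
local M = {}

local states = {}

function M.create(id)
	states[id] = { hp = 100, pos = vmath.vector3() }
	return states[id]
end

function M.get(id)
	return states[id]
end

function M.destroy(id)
	states[id] = nil
end

return M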