I’ve always wondered about this since, to me at least, strings seem like they’d be the better choice: they’re more easily stored in a configuration file and easier to read for both people and computers.
Hashes are much easier for computers to read and deal with since they’re a single 8-byte number rather than an arbitrarily long string of bytes.
Comparing strings, especially long and very similar ones, is much more CPU-expensive. Hence it is much better to use a hash generated from the string than the string itself. However, to fully benefit from that, it’s best to hash the string once and then reuse the hash.
Let’s take a look at a common example, messages:
function on_message(self, message_id, message, sender)
    if message_id == hash("add_xp") then
        self.xp = self.xp + message.xp
    end
end
For each message the script has to calculate the hash and then compare the hashes. Not great, since hashing is also a bit expensive, but not terrible if we don’t expect this message to arrive often or repeatedly.
But when we expect this message to be sent very often, it is super easy to speed up the comparison by pre-hashing and storing the hash:
local ADD_XP = hash("add_xp")

function on_message(self, message_id, message, sender)
    if message_id == ADD_XP then
        self.xp = self.xp + message.xp
    end
end
It’s much faster and it eliminates typos. Such a speed-up is only possible because message_id is already a hash. It also wouldn’t be possible if the built-in functions required strings and did the hashing themselves → that’s why the engine expects a hash, preferably a stored one.
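Taking this one step further, here is a minimal sketch of the same idea (not from the original post): if a script handles several messages you can pre-hash all of the ids once and dispatch through a table keyed by those hashes (hashes work as table keys in Defold). The message names and handler bodies below are made-up examples.

-- Pre-hash every message id this script cares about, once, at load time.
local ADD_XP = hash("add_xp")
local TAKE_DAMAGE = hash("take_damage")   -- hypothetical message
local HEAL = hash("heal")                 -- hypothetical message

-- Dispatch table keyed by the pre-computed hashes.
local handlers = {
    [ADD_XP] = function(self, message) self.xp = self.xp + message.xp end,
    [TAKE_DAMAGE] = function(self, message) self.hp = self.hp - message.amount end,
    [HEAL] = function(self, message) self.hp = self.hp + message.amount end,
}

function on_message(self, message_id, message, sender)
    -- message_id is already a hash, so this is a single table lookup:
    -- no per-message hashing and no string comparison at all.
    local handler = handlers[message_id]
    if handler then
        handler(self, message)
    end
end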
I get that it’s faster, but how much faster is it really?
I don’t imagine it’s any more than a few hundredths of a millisecond.
I’m all for enhancing performance, but hashes are kind of annoying to work with when it comes time to actually store relevant data, since hashing is a one-way, irreversible process.
Let’s measure it! We can for instance use go.get() which takes a property name as the second argument. This should be a hash, but if you pass in a string the engine will hash it for you. We have three tests, each running 5000 times. I chose 5000 since it is not an insanely high number and you could actually imagine some scenario in a game where you need to use hashes 5000 times per frame.
- Test 1 - Passing a string
- Test 2 - Calling hash() every time
- Test 3 - Pre-hashing the string
local ITERATIONS = 5000

-- Test 1 - Passing a string (the engine hashes it on every call)
local now = socket.gettime()
for i = 1, ITERATIONS do
    go.get("#sprite", "tint")
end
print("Test 1 - String", (socket.gettime() - now))

-- Test 2 - Calling hash() every time
local now = socket.gettime()
for i = 1, ITERATIONS do
    go.get("#sprite", hash("tint"))
end
print("Test 2 - Hash", (socket.gettime() - now))

-- Test 3 - Pre-hashing the string once and reusing it
local TINT = hash("tint")
local now = socket.gettime()
for i = 1, ITERATIONS do
    go.get("#sprite", TINT)
end
print("Test 3 - Pre hash", (socket.gettime() - now))
Results:
Test 1 - String 0.0069999694824219
Test 2 - Hash 0.0060000419616699
Test 3 - Pre hash 0.005000114440918
Test 1 and Test 2 should be fairly similar. If you run the test a couple of times you will see the numbers for Test 1 and Test 2 fluctuate a bit. Sometimes Test 1 is faster than Test 2 and sometimes it is the opposite. But Test 3 will always be faster. It is not a huge difference, but it is measurable.
Here’s the codepad for the above if you want to play around with it.
For: local ITERATIONS = 5000000
Test 1 is faster:
Test 1 - String 4.9249999523163
Test 2 - Hash 5.7579998970032
Test 3 - Pre hash 5.0390000343323
Double it again, still faster:
Test 1 - String 9.646999835968
Test 2 - Hash 11.139999866486
Test 3 - Pre hash 9.8889999389648
What is going on? Pre hash is supposed to be the winner, so something is fishy.
Edit: and on my laptop with 10000000:
DEBUG:SCRIPT: Test 1 - String 6.6852281093597
DEBUG:SCRIPT: Test 2 - Hash 7.8060331344604
DEBUG:SCRIPT: Test 3 - Pre hash 6.765282869339
Strings win again, never trust a test…
Repeat with test 3 run first:
DEBUG:SCRIPT: Test 3 - Pre hash 6.7839930057526
DEBUG:SCRIPT: Test 1 - String 6.8664541244507
DEBUG:SCRIPT: Test 2 - Hash 7.1882319450378
At last, the correct result! Yay.
Lies, damned lies, statistics and speed tests.
Restarting…
Test 1 - String 0.0079998970031738
Test 2 - Hash 0.0099999904632568
Test 3 - Pre hash 0.0069999694824219
Restarting…
Test 1 - String 0.0069999694824219
Test 2 - Hash 0.0090000629425049
Test 3 - Pre hash 0.0079998970031738
Restarting…
Test 1 - String 0.0069999694824219
Test 2 - Hash 0.0079998970031738
Test 3 - Pre hash 0.0090000629425049
Here’s a more advanced version where test order is shuffled, garbage is collected before each run, and each test is run multiple times and an average is calculated:
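A minimal sketch of such a harness could look like this, assuming the same go.get() workload and socket.gettime() timer as the tests above; the RUNS count and the shuffle helper are illustrative, not the original code:

local ITERATIONS = 5000
local RUNS = 10

-- The three workloads from the earlier tests.
local TINT = hash("tint")
local tests = {
    { name = "String",   fn = function() go.get("#sprite", "tint") end },
    { name = "Hash",     fn = function() go.get("#sprite", hash("tint")) end },
    { name = "Pre hash", fn = function() go.get("#sprite", TINT) end },
}

-- Shuffle the test order so no test always benefits from running first.
local function shuffle(t)
    for i = #t, 2, -1 do
        local j = math.random(i)
        t[i], t[j] = t[j], t[i]
    end
end

local totals = { ["String"] = 0, ["Hash"] = 0, ["Pre hash"] = 0 }
for run = 1, RUNS do
    shuffle(tests)
    for _, test in ipairs(tests) do
        collectgarbage("collect")              -- collect garbage before each test
        local now = socket.gettime()
        for i = 1, ITERATIONS do
            test.fn()
        end
        totals[test.name] = totals[test.name] + (socket.gettime() - now)
    end
end

-- Report the average time per test across all runs.
for _, name in ipairs({ "String", "Hash", "Pre hash" }) do
    print("Test " .. name .. " average", totals[name] / RUNS)
end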
Looks like both String and Pre hash are nearly the same performance while Hash takes more time.
Is it a good idea to just use strings in Lua scripts and still keep hashes in the engine core?
Well, the engine core will always use hashes. A single uint64_t comparison is a single instruction, as opposed to a string compare. Also, the storage is easier to use (no dynamic memory), and much smaller.
Using strings seems to have the edge generally on my machine, even though the difference is small. This is interesting to me, because I was under the impression that hashes would always be faster if pre-hashed and used as a local variable.
Restarting...
Test hash took 0.055999994277954
Test strings took 0.047999978065491
Test prehash took 0.049599981307983
Restarting...
Test strings took 0.048699998855591
Test prehash took 0.050399994850159
Test hash took 0.058500003814697
Restarting...
Test strings took 0.049200057983398
Test hash took 0.057299971580505
Test prehash took 0.0507000207901
Restarting...
Test strings took 0.048000025749207
Test hash took 0.05620002746582
Test prehash took 0.048399972915649
Restarting...
Test hash took 0.056699991226196
Test strings took 0.04850001335144
Test prehash took 0.050099968910217
Restarting...
Test prehash took 0.049500060081482
Test strings took 0.047499918937683
Test hash took 0.055699992179871
Restarting...
Test hash took 0.057299995422363
Test strings took 0.048100018501282
Test prehash took 0.050599980354309
Restarting...
Test prehash took 0.049900007247925
Test hash took 0.056799936294556
Test strings took 0.048900079727173
Restarting...
Test prehash took 0.049399995803833
Test hash took 0.055900025367737
Test strings took 0.048100018501282
Restarting...
Test hash took 0.056400012969971
Test prehash took 0.049600028991699
Test strings took 0.047899961471558
I think the britzl example is unfair for hashes. In that example "tint" is just a property name; (I guess) there’s nothing related to hashes underneath. But when it is passed as hash("tint"), the engine may see that it’s a hash and do something hash-related underneath. And because hash() takes time, that leads to the Pre hash test taking a little bit more time than the String test.
It is actually a bit surprising. There might be some optimizations we can do.
I don’t know anything about the Defold internals, but in many languages (including Lua), strings are interned – that is, strings are registered internally so that equivalent strings all have the same unique ID and point to the same character buffer. Comparing interned strings is a constant-time operation, since you can just compare the IDs or pointers. If property lookup in Defold is taking place in the Lua layer, then maybe that accounts for the lack of a big difference in performance.
While true that the strings are interned, it’s not really affecting the tasks performed here, as there are no string comparisons.
While strings are interned, unequal strings will require a per-character comparison.
For instance, are the following strings equal?
"indent preformatted text by 4 spaces"
"indent preformatted text by 4 spacez"
A string comparison would require comparing each character up until the last s/z.
But a hash will eliminate that immediately.
Of course, if the hash is just 64 bits, you still want to compare the strings for equality. But if using a good hashing algorithm, statistically, you don’t need to compare the entire string.
With 64 bits from a cryptographic hash, you only need to compare a handful of characters at most to be confident they are equal.
And of course, in a dictionary, hashing is a no-brainer for performance.
Note: whether Lua interns strings or not, your code cannot depend on that because your functions do not know how the strings were created (so you cannot use identity to disqualify equality).
My understanding of string interning is that every string with the same characters is replaced with a pointer to the same unique, shared object, so testing for inequality should be constant time too. Of course, internally this is basically the same as hashing every string, and as you mentioned, it’s generally an implementation detail that shouldn’t be relied on.