Scan text for unique characters (for Chinese, Japanese, Korean fonts)

Pkeod · April 26, 2020, 6:00am

What do people use to build a unique character list for fonts? I need to be able to scan all files in a folder / subfolders and build a list of unique characters to be included in fonts to minimize total download size.

@Ivan_Lytkin I think you mentioned one once, but that was in Slack and the history is gone. I think it was a Python script?

I have .json files and .txt files in various folders. It needs to be able to work with Chinese characters.

Here is one for Unity / textmeshpro

totebo · April 26, 2020, 9:30am

I actually wrote a small Lua tool for this. The language data is read from a .csv file exported from Google Docs, then all languages are parsed and a string is spat out.
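
For readers who want to try the CSV route, here is a minimal sketch in Python rather than Lua. It is not the tool described above. It assumes a header row where one column (hypothetically named `key`) holds the string ID and every other column holds one language; the function name is made up for illustration.

```python
import csv
import io


def unique_chars_per_language(csv_text, key_column="key"):
    """Map each language column to a sorted string of its unique characters."""
    reader = csv.DictReader(io.StringIO(csv_text))
    chars = {}
    for row in reader:
        for column, value in row.items():
            # Skip the ID column, empty cells, and cells beyond the header.
            if column is None or column == key_column or not value:
                continue
            # A str iterates by Unicode code point, so CJK text is handled.
            chars.setdefault(column, set()).update(value)
    return {lang: "".join(sorted(found)) for lang, found in chars.items()}


if __name__ == "__main__":
    sample = "key,en,ja\nquit,Quit?,本当に終了しますか\nyes,Yes,はい\n"
    for lang, found in unique_chars_per_language(sample).items():
        print(lang, found)
```

Sorting the characters is only there to make the output stable between runs; the font does not care about the order.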

Pkeod · April 26, 2020, 11:15am

Share?

I might need to write something custom that runs during builds, scans the locale folders, and inserts the current characters into the language-appropriate font character lists. Maybe a good candidate for an editor script.

Dragosha · April 26, 2020, 11:20am

Same here. Just a Lua script in the game: I run it after the texts are updated and manually copy the console output into the font settings.

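local utf8 = require "helper.utf8"

-- Prints every distinct character found in the strings of `mytexttable`.
-- `unique` also keeps a count of how often each character occurs.
local function uniqueSymbols(mytexttable)
    local unique = {}
    local extra_characters = ""
    for _, v in pairs(mytexttable) do
        if v then
            -- With a UTF-8 aware gsub, "." is meant to match one whole
            -- character rather than one byte.
            utf8.gsub(v, ".", function(c)
                if not unique[c] then
                    unique[c] = 1
                    extra_characters = extra_characters .. c
                else
                    unique[c] = unique[c] + 1
                end
            end)
        end
    end
    print(extra_characters)
end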

totebo · April 26, 2020, 11:24am

That looks very similar to what I’m using!

Pkeod · April 26, 2020, 11:26am

For the text and fonts, I think I will write an editor script that takes a config file of languages and fonts, scans the locale folders, and inserts the needed character lists into the fonts, along with extra characters that are always added. That would probably be easy and clean, and since it would run on every bundle we would not need to worry about it. We have a game translated to Chinese, with a ton of characters and lots of fonts, and it is very messy to copy, paste, and keep track of them by hand.
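
A rough Python sketch of the scanning half of that plan, under stated assumptions: the config layout (`folder`, `font`, `extra` per language) is hypothetical, only string values inside .json files are counted (not keys), and writing the result into the font resources is left out.

```python
import json
from pathlib import Path


def strings_in_json(node):
    """Yield every string value found in a parsed JSON structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from strings_in_json(value)
    elif isinstance(node, list):
        for item in node:
            yield from strings_in_json(item)


def chars_in_folder(folder):
    """Collect unique characters from all .json and .txt files under folder."""
    found = set()
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue
        if path.suffix == ".txt":
            found.update(path.read_text(encoding="utf-8"))
        elif path.suffix == ".json":
            data = json.loads(path.read_text(encoding="utf-8"))
            for text in strings_in_json(data):
                found.update(text)
    # Line breaks are not glyphs, so drop them.
    found -= {"\n", "\r"}
    return found


def build_character_lists(config):
    """Return {font name: sorted character string} for a hypothetical config."""
    result = {}
    for entry in config.values():
        chars = chars_in_folder(entry["folder"]) | set(entry.get("extra", ""))
        # Several languages may share one font, so merge their characters.
        result.setdefault(entry["font"], set()).update(chars)
    return {font: "".join(sorted(chars)) for font, chars in result.items()}
```

A call might look like `build_character_lists({"zh": {"folder": "locale/zh", "font": "chinese", "extra": "0123456789"}})`, where the folder and font names are placeholders.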

Ivan_Lytkin · April 26, 2020, 9:18pm

I use this https://github.com/abadonna/unique-chars

roccosaienz · July 18, 2021, 3:51pm

Hi!

I would like to use this script to extract the unique characters from Japanese text.

I have a (very) basic question: how do I “install” the utf8 package this code uses? If possible, please keep it really simple, I am a complete beginner at this… I am using a Mac.

Thanks for any help!

Dragosha · July 18, 2021, 4:40pm

Choose either:

NativeExtension:

or just place this file into your project: luv/utf8.lua at master · artemshein/luv · GitHub

roccosaienz · July 18, 2021, 5:40pm

Thanks! I have placed the pure Lua implementation in my project.

But something is not working as expected… I have the following Japanese text “本当に終了しますか”, but the console just prints:
本当に終了��
Could it be that the last four characters are not in the console font?

EDIT: same problem with Cyrillic: the text “Выйти” prints as just “Вы��”.

roccosaienz · July 18, 2021, 8:32pm

I switched to Python. Following the suggestion here: Count the number of unique characters in a string in Python - GeeksforGeeks, it is almost trivial to do in Python using a set.

I then run the Python script in the Mac terminal and take the output from there. All characters are printed fine.

I am posting it here for anyone else who is interested.
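
The script is not included in this copy of the thread. What follows is only a minimal sketch of the set-based idea, not the original; the function name and the command-line handling are assumptions.

```python
import sys
from pathlib import Path


def unique_characters(text):
    """Return the unique characters of text in first-seen order."""
    seen = set()
    ordered = []
    for ch in text:
        # Skip line breaks and tabs, and anything already collected.
        if ch in "\n\r\t" or ch in seen:
            continue
        seen.add(ch)
        ordered.append(ch)
    return "".join(ordered)


if __name__ == "__main__":
    # Hypothetical usage: python3 unique_chars.py file1.txt file2.txt
    text = "".join(Path(p).read_text(encoding="utf-8") for p in sys.argv[1:])
    print(unique_characters(text))
```

The set gives fast membership checks, while the list keeps the characters in the order they first appear, which makes the output easier to compare between runs.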

Ciao, Rocco.