Japanese / Chinese text line breaks

Pkeod · May 13, 2019, 2:38pm

What should we do about text that is technically all on one line?

Here’s a machine-translated sample:

モグは厚いかみそりの鋭いシダを突き破った。ここに魔法がありました。彼女はそれを味わうことができた。モグが小川の流れる水の上を石から石に飛び跳ねたとき、彼女の紫色の目は輝く縞を残しました。ここで、それはここにありました、彼女が彼女の暗くてもつれた髪を引っ張ったので、彼女は彼女自身に言いました近く。彼女は見下ろして水晶の破片をかみ合わせた。遠い過去の遺物が隠れた場所に散らばっていました。それは薄暗い光で輝きました。はい！モグは言った、それは魔法でした、彼女がそれを必要としたもの。あなたの力、Faeの惨めさをください。モグは、とらえどころのない水晶を革のような、しわになる手でつぶしたときに言った。水晶が壊れ、光の点が飛び出しました。結晶の魔法の周りのモグ斑点が周りに落ち、モグを浸した。彼女は力の言葉を話し、そして彼女の肌は古いものから新しいものに変わった。抱きしめられてから若々しいまで。光の斑点はMogに感謝するために戻ったが、彼女はそのトリックを持っていないだろう。行って、ファウルクリーチャー！力は私のものです！元に戻すことはできません。モグは言った。彼女は封印されていない魔法のエネルギーを吸収しながら流れる水の上で無重力に踊りまわりを旋回し続けました。しかし、この魔法は十分ではありませんでした、彼女は利己的にそれを隠す貪欲な妖精からもっともっと取る必要があるでしょう。彼女を裏切った人々を罰するのにかかる限り。

Do we need to apply line breaks manually in cases like this, either in the text itself or in code, so that text without line breaks fits better within text nodes? Or is there an option I’m not thinking of?

sven · May 13, 2019, 2:44pm

There is an old issue about correctly handling line breaks in Asian (and other) languages: DEF-1509.

However, there is support for the zero-width space character, which you could include in your strings. (Granted, not the best solution though…)

In Lua, I think you could do something like this (I haven’t tried it myself):

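-- UTF-8 bytes of U+200B (zero-width space)
local zws = "\xe2\x80\x8b"
-- Mark a permitted break point between the two phrases
local a = "モグは厚いかみそりの" .. zws .. "鋭いシダを突き破った。"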
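
To avoid placing the character by hand, the same idea can be applied to a whole string. This is only a minimal sketch: `insert_zws` is a hypothetical helper name, it assumes the input is valid UTF-8, and it writes the zero-width space with decimal escapes because, as far as I know, "\x" escapes are accepted by LuaJIT and Lua 5.2+ but not by plain Lua 5.1.

```lua
-- UTF-8 encoding of U+200B, written with decimal escapes.
local ZWS = "\226\128\139"

-- Hypothetical helper: put a zero-width space between every pair of
-- adjacent UTF-8 characters so a line break is allowed between them.
local function insert_zws(text)
    local chars = {}
    -- One character = one non-continuation byte followed by any
    -- number of continuation bytes (0x80-0xBF). Assumes valid UTF-8.
    for ch in text:gmatch("[^\128-\191][\128-\191]*") do
        chars[#chars + 1] = ch
    end
    return table.concat(chars, ZWS)
end

local wrapped = insert_zws("モグは厚い")
```

The result can then be set on a text node or label as usual. Note that this also splits Latin words, which is usually not what you want for mixed text.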

Pkeod · May 13, 2019, 6:43pm

The “zero-width space” character “\xe2\x80\x8b” does actually produce visible spaces. The engine should probably be fixed so that this does not happen.

[Screenshot: UTF8JapaneseText]

If I add some negative text tracking, it does look better.

Here is an example project:

UTF8JapaneseText.zip (2.1 MB)

This could probably be made smarter by detecting non-Japanese characters and not adding a zero-width space after them, so that words are not split in half when they are mixed in. But then that would mess up the tracking.

There are also some special situations to consider; see Requirements for Japanese Text Layout.

britzl · May 13, 2019, 8:20pm

We do nothing special with the zero-width space character except use it when calculating linebreaks. If the zero-width space character has some width it is probably because it’s like that in the font.

Pkeod · May 13, 2019, 8:36pm

Opened font in FontForge

View-Goto

U+200B Unicode/UTF-8-character table - starting from code position 2000

Width was at 512

Set to 0

Set leading back to 0 on the text node

[Before and after screenshots: UTF8JapaneseText]

So it should be possible to detect English words and not add zero-width spaces within them. There are still more considerations before this is really good, though.
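
A rough sketch of that idea, under some assumptions: `insert_zws_keep_words` is a hypothetical name, the input is assumed to be valid UTF-8, and "English" is approximated as "single-byte ASCII". A zero-width space is inserted between two adjacent characters unless both are ASCII, so Latin words stay whole while CJK characters can still wrap.

```lua
local ZWS = "\226\128\139" -- UTF-8 bytes of U+200B

-- In valid UTF-8, a one-byte character is always ASCII.
local function is_ascii(ch)
    return #ch == 1
end

-- Hypothetical helper: allow breaks around CJK characters,
-- but never inside a run of ASCII characters.
local function insert_zws_keep_words(text)
    local out = {}
    local prev = nil
    for ch in text:gmatch("[^\128-\191][\128-\191]*") do
        if prev and not (is_ascii(prev) and is_ascii(ch)) then
            out[#out + 1] = ZWS
        end
        out[#out + 1] = ch
        prev = ch
    end
    return table.concat(out)
end

local mixed = insert_zws_keep_words("Mogは")
```

Ordinary spaces inside ASCII runs are left alone, since the engine already breaks on those.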

https://www.ptiglobal.com/2018/04/26/the-beauty-of-unicode-zero-width-characters/

Pkeod · April 26, 2020, 7:14am

To add to the topic of zero-width characters: there are other characters which should or should not have zero-width spaces before/after them. https://en.wikipedia.org/wiki/Line_breaking_rules_in_East_Asian_languages

britzl · April 26, 2020, 8:51am

Wow, that is an extensive set of rules! There are basically two groups: 1) characters not permitted at the start of a line and 2) characters not permitted at the end of a line.

Right now we only break on space and zero-width space, and we never break in the middle of a word. I’m not sure whether we need to do anything else?

Pkeod · April 26, 2020, 9:39am

No, nothing is needed from the engine. But before setting text on nodes / labels, we do have to filter it and insert zero-width spaces at the correct places, based on these rules and depending on the language, to get attractive text with line breaks where expected.
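
As a minimal sketch of such a filter (not the actual module, and with assumptions): `insert_breaks` is a hypothetical name, and the two character sets below are only a small illustrative subset of the rules in the Wikipedia article. A zero-width space is inserted between two characters unless the first may not end a line or the second may not start one.

```lua
local ZWS = "\226\128\139" -- UTF-8 bytes of U+200B

local function to_set(list)
    local set = {}
    for _, ch in ipairs(list) do
        set[ch] = true
    end
    return set
end

-- Illustrative subsets only; the full rule sets are much larger.
local NO_LINE_START = to_set({ "、", "。", "，", "！", "？", "）", "」", "』" })
local NO_LINE_END = to_set({ "（", "「", "『" })

-- Hypothetical helper: add break opportunities that respect both sets.
-- Assumes valid UTF-8 input.
local function insert_breaks(text)
    local out = {}
    local prev = nil
    for ch in text:gmatch("[^\128-\191][\128-\191]*") do
        if prev and not NO_LINE_END[prev] and not NO_LINE_START[ch] then
            out[#out + 1] = ZWS
        end
        out[#out + 1] = ch
        prev = ch
    end
    return table.concat(out)
end

local filtered = insert_breaks("「はい」")
```

Because the engine never breaks where there is no space or zero-width space, simply leaving the character out is enough to keep a closing bracket or full stop attached to the preceding character.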

Pkeod · December 7, 2021, 4:31pm

Here is a WIP module meant to address the issues raised in this topic. It is a start at following the rules in Line breaking rules in East Asian languages - Wikipedia, though it would be good if Defold itself could handle these rules on labels/text nodes. Maybe RichText would be a good place to implement the rules, since this is already sort of in its domain.

The WIP module can certainly be improved (made more efficient / faster, or extended to implement more of the rules) by someone who knows how to do that well.

This image illustrates the kind of problems the WrapText module is meant to address. Because all characters are connected there is no natural white space to allow line breaks. But there are still situations where you don’t want certain characters to be left at the start/end of a line.

Pkeod · December 8, 2021, 2:23am

noto-cjk/Sans/OTF/SimplifiedChinese at main · notofonts/noto-cjk · GitHub

When testing this specific font with Chinese characters, attempting to insert the zero-width space between Chinese characters does not seem to work.

Test string

页哨临蛤扩杯桃波楚淡啜遣帝虐能嚷在惨挑茉整精

Zero Width Space
U+200B

FontForge claims there’s a character there, but it seems weird to me.

The other test font I was using seems to look more correct to me.

There must be something I don’t understand in relation to Simplified Chinese fonts with zero width space.

I tried using GitHub - akiirui/RobotoCJKSC: Roboto + Noto Sans CJK SC Combination Fonts to test whether it was somehow an OTF issue. It has the ZWS listed as I’d expect, but the engine still shows ~ when I add the ZWS character between Chinese characters.

I tested just trying to have multiple ZWS characters together and it displays nothing.

With more testing, I think it’s not an issue with the fonts but somehow with the UTF-8 handling.

When I try to include just “\226\128\139” in the extra characters this happens to the raw file:

font: "/assets/fonts/babamoji1004/BABAmoji.ttf"
material: "/builtins/fonts/label-df.material"
size: 15
antialias: 1
alpha: 1.0
outline_alpha: 0.0
outline_width: 0.0
shadow_alpha: 0.0
shadow_blur: 0
shadow_x: 0.0
shadow_y: 0.0
extra_characters: "\357\277\275\n"
  "8\v9"
output_format: TYPE_DISTANCE_FIELD
all_chars: false
cache_width: 0
cache_height: 0
render_mode: MODE_SINGLE_LAYER

It’s something the editor is doing when changing the extra_characters field. I tried setting it to \xe2\x80\x8b and the editor changed it to extra_characters: “\342\200\213”

And now it seems to be working right (but only with the modified TTF, not the OTF), with \342\200\213 listed in the extra characters in the raw file. I don’t fully understand this; I guess they are octal UTF-8 bytes.
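
A quick check of that guess in plain Lua (a sketch; both helper names are made up for illustration). The three bytes of U+200B are 226, 128, 139 in decimal, which is how Lua string escapes are read, and 342, 200, 213 in octal. If the font file reads backslash escapes as octal, which is an assumption on my part but would be consistent with the garbled output above, then the text \226\128\139 decodes to byte 150, a newline, "8", a vertical tab and "9".

```lua
local zws = "\226\128\139" -- decimal escapes, as Lua reads them

-- Hypothetical helper: show each byte of a string as a 3-digit octal escape.
local function to_octal_escapes(s)
    return (s:gsub(".", function(c)
        return string.format("\\%03o", c:byte())
    end))
end

-- Hypothetical helper: decode backslash escapes the octal way
-- (up to three digits, each in the range 0-7).
local function decode_octal_escapes(s)
    return (s:gsub("\\([0-7][0-7]?[0-7]?)", function(digits)
        return string.char(tonumber(digits, 8))
    end))
end

print(to_octal_escapes(zws)) -- \342\200\213

-- The decimal digits typed into an octal-reading field:
local misread = decode_octal_escapes("\\226\\128\\139")
print(#misread) -- 5 bytes: 150, 10 (\n), "8", 11 (\v), "9"
```

Byte 150 on its own is not valid UTF-8, which would explain it being replaced by U+FFFD (\357\277\275 in octal) in the saved file.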

Something to be aware of is you cannot use the same fonts for Simplified Chinese and Traditional Chinese texts. They have their own set of glyphs.

COCO · December 8, 2021, 6:47am

I’m Chinese. I know of one rule about line breaking:
Punctuation usually does not appear at the start of a line.
That’s all.

Pkeod · December 8, 2021, 6:58am

COCO · December 8, 2021, 7:04am

Looks pretty good.

totebo · December 8, 2021, 10:50am

This is great! I bumped into this a while back with Japanese, but solved it by manually (and possibly incorrectly) inserting spaces. Next time I’ll give the module a spin instead.