Matching Emoji with Regular Expressions
2024-5-9 20:28:46 Author: taxodium.ink

Strings are represented fundamentally as sequences of UTF-16 code units. In UTF-16 encoding, every code unit is exact 16 bits long. This means there are a maximum of 2^16, or 65536 possible characters representable as single UTF-16 code units.

However, the entire Unicode character set is much, much bigger than 65536. The extra characters are stored in UTF-16 as surrogate pairs, which are pairs of 16-bit code units that represent a single character. To avoid ambiguity, the two parts of the pair must be between 0xD800 and 0xDFFF, and these code units are not used to encode single-code-unit characters. (More precisely, leading surrogates, also called high-surrogate code units, have values between 0xD800 and 0xDBFF, inclusive, while trailing surrogates, also called low-surrogate code units, have values between 0xDC00 and 0xDFFF, inclusive.) Each Unicode character, comprised of one or two UTF-16 code units, is also called a Unicode code point. Each Unicode code point can be written in a string with \u{xxxxxx} where xxxxxx represents 1–6 hex digits.

—— UTF-16 characters, Unicode code points, and grapheme clusters

In JavaScript, a String is encoded as UTF-16 (16-bit Unicode Transformation Format). Each code unit is 16 bits wide, so a single code unit can represent at most 65536 values (0x0000 - 0xFFFF).

These 65536 values cover most commonly used characters, such as letters, digits, Latin characters, and some East Asian scripts.
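As a minimal sketch of the point above, the snippet below lists the UTF-16 code units of a few common characters. The helper name `toCodeUnits` is hypothetical and exists only for this illustration; `charCodeAt` and `length` are standard String members.

```javascript
// Hypothetical helper: list the UTF-16 code units of a string as 4-digit hex.
function toCodeUnits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    // charCodeAt returns the 16-bit code unit at index i (0..65535).
    units.push(str.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0"));
  }
  return units;
}

console.log(toCodeUnits("A"));      // ["0041"]
console.log(toCodeUnits("\u4E2D")); // the Chinese character 中: ["4E2D"]
console.log("A7\u4E2D".length);     // 3, one code unit per character
```

Each of these characters occupies exactly one code unit, so `length` matches the number of visible characters.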

It later became clear that 65536 values cannot cover every character. Since 16 bits are not enough, Unicode needed a way to represent more characters.

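The mechanism chosen was the surrogate pair: two 16-bit code units that together carry 20 bits of payload.

The upper 10 bits go into the high surrogate, whose value lies in the range 0xD800 - 0xDBFF.

The lower 10 bits go into the low surrogate, whose value lies in the range 0xDC00 - 0xDFFF.

A high surrogate followed by a low surrogate forms a surrogate pair.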
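The two ranges can be checked directly on code units. The following is a small sketch, not part of the original post: `isHighSurrogate`, `isLowSurrogate`, and `countSurrogatePairs` are hypothetical names introduced here for illustration.

```javascript
// Hypothetical helpers: classify a code unit by the two surrogate ranges.
const isHighSurrogate = (unit) => unit >= 0xD800 && unit <= 0xDBFF;
const isLowSurrogate = (unit) => unit >= 0xDC00 && unit <= 0xDFFF;

// Count well-formed pairs: a high surrogate immediately followed by a low one.
function countSurrogatePairs(str) {
  let pairs = 0;
  for (let i = 0; i < str.length - 1; i++) {
    if (isHighSurrogate(str.charCodeAt(i)) && isLowSurrogate(str.charCodeAt(i + 1))) {
      pairs++;
      i++; // skip the low surrogate that was just consumed
    }
  }
  return pairs;
}

console.log(countSurrogatePairs("abc"));                  // 0
console.log(countSurrogatePairs("a\u{1F600}b\u{1F601}")); // 2
```

Because the two ranges do not overlap each other or any single-code-unit character, a scanner can tell from one code unit alone whether it starts or ends a pair.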

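With 20 bits of payload, surrogate pairs can represent 2^20 = 1048576 additional characters on top of the original 65536. These are the code points U+10000 through U+10FFFF.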
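The 20-bit split can be written out as arithmetic. This is a sketch of the standard UTF-16 mapping, with hypothetical function names `toSurrogatePair` and `fromSurrogatePair`: subtract 0x10000 from the code point, put the upper 10 bits into the high surrogate and the lower 10 bits into the low surrogate.

```javascript
// Hypothetical helper: split a supplementary code point into a surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("expected a code point in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;    // 20-bit value, 0..0xFFFFF
  const high = 0xD800 + (offset >> 10);  // upper 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // lower 10 bits
  return [high, low];
}

// Hypothetical inverse: rebuild the code point from a surrogate pair.
function fromSurrogatePair(high, low) {
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

const [high, low] = toSurrogatePair(0x1F600);        // U+1F600, a smiley emoji
console.log(high.toString(16), low.toString(16));    // d83d de00
console.log(String.fromCharCode(high, low) === "\u{1F600}"); // true
console.log(fromSurrogatePair(high, low).toString(16));      // 1f600
console.log(1 << 20);                                // 1048576 possible offsets
```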

For why Unicode was designed this way, see Why does code points between U+D800 and U+DBFF generate one-length string in ECMAScript 6?

For why the high and low surrogates use these particular ranges, see How was the position of the Surrogates Area (UTF-16) chosen?

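In short, commonly used characters in a JavaScript String (such as letters, digits, and Chinese characters) are represented by 1 UTF-16 code unit.

Characters beyond 65535 (0xFFFF, U+FFFF, \uFFFF), such as emoji, are represented by a surrogate pair (high surrogate + low surrogate, 2 UTF-16 code units).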
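The difference is visible from plain String and RegExp operations. The sketch below uses only standard built-ins (`length`, `charCodeAt`, `codePointAt`, string iteration, and the `u` regex flag); the choice of U+1F600 as the sample emoji is an assumption made for illustration.

```javascript
const emoji = "\u{1F600}"; // a single emoji, written as a code point escape

console.log(emoji.length);                      // 2, counted in code units
console.log(emoji.charCodeAt(0).toString(16));  // d83d, the high surrogate
console.log(emoji.charCodeAt(1).toString(16));  // de00, the low surrogate
console.log(emoji.codePointAt(0).toString(16)); // 1f600, the full code point
console.log([...emoji].length);                 // 1, iteration walks code points

// Without the u flag, "." matches one code unit; with it, one code point.
console.log(/^.$/.test(emoji));  // false
console.log(/^.$/u.test(emoji)); // true
```

This is why a regular expression meant to match emoji generally needs the `u` flag: without it, the pattern sees the two halves of the pair as separate characters.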


Source: https://taxodium.ink/post/emoji-regexp/