
This implements the requirements in the following proposals, which dictate
how std::format deals with non-ASCII strings:
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1868r1.html
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2572r1.html
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2675r1.pdf

There are two parts to this. The width estimation for strings must only
count the width of the first character in an extended grapheme cluster.
That requires implementing the algorithm for detecting cluster breaks,
which requires a number of lookup tables of the grapheme cluster break
properties (and Indic_Conjunct_Break and Extended_Pictographic properties)
of every code point. Additionally, some characters have a field width of
2, which requires another lookup table of field widths for every code
point.

The tables added in this commit do not contain entries for every code
point from 0 to 0x10FFFF as that would be very inefficient and use too
much memory. Instead the tables only contain the code points that form an
"edge" for a property, omitting all the code points that have the same
property as the preceding one. We can use a binary search to find the
closest code point in the table that is not greater than the one we're
looking for.

The tables are generated by a new Python script added to the
contrib/unicode directory, and a new data file downloaded from the Unicode
Consortium website.

The rules for extended grapheme cluster breaking are implemented for the
latest Unicode standard, version 15.1.0.

libstdc++-v3/ChangeLog:

	* include/Makefile.am: Add new headers.
	* include/Makefile.in: Regenerate.
	* include/bits/unicode.h: New file.
	* include/bits/unicode-data.h: New file.
	* include/std/format: Include <bits/unicode.h>.
	(__literal_encoding_is_utf8): Move to <bits/unicode.h>.
	(_Spec::_M_fill): Change type to char32_t.
	(_Spec::_M_parse_fill_and_align): Read a Unicode scalar value
	instead of a single character.
	(__write_padded): Change __fill_char parameter to char32_t and
	encode it into the output.
	(__formatter_str::format): Use new __unicode::__field_width and
	__unicode::__truncate functions.
	* include/std/ostream: Adjust namespace qualification for
	__literal_encoding_is_utf8.
	* include/std/print: Likewise.
	* src/c++23/print.cc: Add [[unlikely]] attribute to error path.
	* testsuite/ext/unicode/view.cc: New test.
	* testsuite/std/format/functions/format.cc: Add missing examples
	from the standard demonstrating alignment with non-ASCII
	characters. Add examples checking correct handling of extended
	grapheme clusters.

contrib/ChangeLog:

	* unicode/README: Add notes about generating libstdc++ tables.
	* unicode/GraphemeBreakProperty.txt: New file.
	* unicode/emoji-data.txt: New file.
	* unicode/gen_libstdcxx_unicode_data.py: New file.
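
As a rough illustration of the "edge" tables and binary search described in
the commit message, here is a minimal Python sketch. The names build_edges
and lookup, the default width of 1, and the two sample ranges are assumptions
made up for this example; they are not taken from
gen_libstdcxx_unicode_data.py or <bits/unicode.h>, and the real tables may be
laid out differently.

```python
from bisect import bisect_right

MAX_CODE_POINT = 0x10FFFF


def build_edges(ranges, default=1):
    """Collapse sorted, non-overlapping (first, last, value) ranges into edges.

    Each edge is a (code_point, value) pair meaning "value applies from
    code_point up to, but not including, the next edge". Code points that no
    range covers get `default`. (Hypothetical helper for illustration.)
    """
    edges = []

    def add(code_point, value):
        # Only record a code point where the property actually changes.
        if not edges or edges[-1][1] != value:
            edges.append((code_point, value))

    pos = 0  # first code point not yet described by an edge
    for first, last, value in ranges:
        if first > pos:
            add(pos, default)  # gap before this range
        add(first, value)
        pos = last + 1
    if pos <= MAX_CODE_POINT:
        add(pos, default)  # tail after the last range
    return edges


def lookup(edges, code_point):
    """Return the value of the closest edge not greater than code_point."""
    # (code_point, inf) sorts after any real edge at code_point, so
    # bisect_right lands just past the edge we want.
    index = bisect_right(edges, (code_point, float("inf"))) - 1
    return edges[index][1]


# Illustrative sample only: two ranges treated as width 2.
SAMPLE_WIDTH_RANGES = [(0x1100, 0x115F, 2), (0x2E80, 0x303E, 2)]
SAMPLE_EDGES = build_edges(SAMPLE_WIDTH_RANGES)
```

With these two sample ranges the table holds five edges rather than one
entry per code point, and lookup(SAMPLE_EDGES, 0x1100) finds the width-2 edge
that starts at U+1100.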
81 lines
3.5 KiB
Text
This directory contains a mechanism for GCC to have its own internal
implementation of wcwidth functionality (cpp_wcwidth () in libcpp/charset.c),
as well as a mechanism to update the information about codepoints permitted in
identifiers, which is encoded in libcpp/ucnid.h, and mapping between Unicode
names and codepoints, which is encoded in libcpp/uname2c.h.

The idea is to produce the necessary lookup tables
(../../libcpp/{ucnid.h,uname2c.h,generated_cpp_wcwidth.h}) in a reproducible
way, starting from the following files that are distributed by the Unicode
Consortium:

  ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
  ftp://ftp.unicode.org/Public/UNIDATA/EastAsianWidth.txt
  ftp://ftp.unicode.org/Public/UNIDATA/PropList.txt
  ftp://ftp.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt
  ftp://ftp.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
  ftp://ftp.unicode.org/Public/UNIDATA/NameAliases.txt

Two additional files are needed for lookup tables in libstdc++:

  ftp://ftp.unicode.org/Public/UNIDATA/auxiliary/GraphemeBreakProperty.txt
  ftp://ftp.unicode.org/Public/UNIDATA/emoji/emoji-data.txt

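Property files such as the two listed above share a simple line layout: a
code point or a "first..last" range in hexadecimal, a semicolon, the property
name, and an optional "#" comment. The following Python sketch shows one way
to read that layout. The function name parse_property_lines is hypothetical
and the sample lines merely imitate the layout; this is not the parser used
by gen_libstdcxx_unicode_data.py.

```python
def parse_property_lines(lines):
    """Yield (first, last, property) for each data line.

    Hypothetical helper: comments start at '#', fields are separated by ';',
    and a code point field is either 'XXXX' or 'XXXX..YYYY' in hexadecimal.
    """
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue  # blank line or comment-only line
        fields = [field.strip() for field in data.split(";")]
        code_points, prop = fields[0], fields[1]
        first_text, _, last_text = code_points.partition("..")
        first = int(first_text, 16)
        last = int(last_text, 16) if last_text else first
        yield first, last, prop


# Sample lines imitating the file layout (illustrative only).
SAMPLE_LINES = [
    "# comment-only line",
    "",
    "000D          ; CR # Cc       <control-000D>",
    "000A          ; LF # Cc       <control-000A>",
    "0600..0605    ; Prepend # Cf   [6] ARABIC NUMBER SIGN..",
]
SAMPLE_ENTRIES = list(parse_property_lines(SAMPLE_LINES))
```

Each parsed entry is a (first, last, property) triple, which is a convenient
input for building compact range tables.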

All these files have been added to source control in this directory;
please see unicode-license.txt for the relevant copyright information.

In order to keep in sync with glibc's wcwidth as much as possible, it is
desirable for the logic that processes the Unicode data to be the same as
glibc's. To that end, we also put in this directory, in the from_glibc/
directory, the glibc python code that implements their logic. This code was
copied verbatim from glibc, and it can be updated at any time from the glibc
source code repository. The files copied from that repository are:

  localedata/unicode-gen/unicode_utils.py
  localedata/unicode-gen/utf8_gen.py

And the most recent versions added to GCC are from glibc git commit:
  71de3aead9fffe89556e80ebc94aa918d8ee7bca

The script gen_wcwidth.py found here contains the GCC-specific code to
map glibc's output to the lookup tables we require. This script should not need
to change, unless there are structural changes to the Unicode data files or to
the glibc code. Similarly, makeucnid.cc in ../../libcpp contains the logic to
produce ucnid.h.

The procedure to update GCC's Unicode support is the following:

1.  Update the Unicode data files from the above URLs (the six used by
    libcpp and the two additional files used by libstdc++).

2.  Update the two glibc files in from_glibc/ from glibc's git. Update
    the commit number above in this README.

3.  Run ./gen_wcwidth.py X.Y > ../../libcpp/generated_cpp_wcwidth.h
    (where X.Y is the version of the Unicode standard corresponding to the
    Unicode data files being used, most recently, 15.1.0).

4.  Update Unicode Copyright years in libcpp/makeucnid.cc and in
    libcpp/makeuname2c.cc up to the year in which the Unicode
    standard has been released.

5.  Compile makeucnid, e.g. with:
      g++ -O2 ../../libcpp/makeucnid.cc -o ../../libcpp/makeucnid

6.  Generate ucnid.h as follows:
      ../../libcpp/makeucnid ../../libcpp/ucnid.tab UnicodeData.txt \
        DerivedNormalizationProps.txt DerivedCoreProperties.txt \
        > ../../libcpp/ucnid.h

7.  Read the corresponding version of the Unicode standard and update the
    generated_ranges table in libcpp/makeuname2c.cc accordingly (in
    Unicode 15 all the needed information was in Table 4-8).

8.  Compile makeuname2c, e.g. with:
      g++ -O2 ../../libcpp/makeuname2c.cc -o ../../libcpp/makeuname2c

9.  Generate uname2c.h as follows:
      ../../libcpp/makeuname2c UnicodeData.txt NameAliases.txt \
        > ../../libcpp/uname2c.h

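The command-line steps of the procedure (3, 5, 6, 8 and 9) can be collected
into a small driver. The Python sketch below only assembles the commands
quoted in those steps and prints them by default. The names update_commands
and run_update are hypothetical, no such script exists in this directory, and
the manual steps (1, 2, 4 and 7) are not covered.

```python
import subprocess

LIBCPP = "../../libcpp"


def update_commands(unicode_version):
    """Return (argv, output_path) pairs for steps 3, 5, 6, 8 and 9.

    output_path is the file that receives stdout, or None when the command
    writes its own output (the two compile steps).
    """
    return [
        (["./gen_wcwidth.py", unicode_version],
         f"{LIBCPP}/generated_cpp_wcwidth.h"),
        (["g++", "-O2", f"{LIBCPP}/makeucnid.cc", "-o", f"{LIBCPP}/makeucnid"],
         None),
        ([f"{LIBCPP}/makeucnid", f"{LIBCPP}/ucnid.tab", "UnicodeData.txt",
          "DerivedNormalizationProps.txt", "DerivedCoreProperties.txt"],
         f"{LIBCPP}/ucnid.h"),
        (["g++", "-O2", f"{LIBCPP}/makeuname2c.cc", "-o",
          f"{LIBCPP}/makeuname2c"],
         None),
        ([f"{LIBCPP}/makeuname2c", "UnicodeData.txt", "NameAliases.txt"],
         f"{LIBCPP}/uname2c.h"),
    ]


def run_update(unicode_version, dry_run=True):
    """Print the commands, or run them from this directory if dry_run is False."""
    for argv, output in update_commands(unicode_version):
        if dry_run:
            print(" ".join(argv) + (f" > {output}" if output else ""))
        elif output is None:
            subprocess.run(argv, check=True)
        else:
            # Like a shell redirection, this truncates the target first.
            with open(output, "w") as out:
                subprocess.run(argv, check=True, stdout=out)
```

Calling run_update with the Unicode version string and the default
dry_run=True prints the five commands in order, so they can be reviewed
before anything is regenerated.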

See gen_libstdcxx_unicode_data.py for instructions on updating the lookup
tables in libstdc++.