mirror of
git://git.code.sf.net/p/cdesktopenv/code
synced 2025-03-09 15:50:02 +00:00
Fix UTF-8 shellquoting for xtrace, printf %q, etc.
This fixes an annoying issue in the shell's quoting algorithm (used for xtrace (set -x), printf %q, and other things) for UTF-8 locales, that caused it to encode perfectly printable UTF-8 characters unnecessarily and inconsistently. For example: $ (set -x; : 'aeu aéu') + : $'aeu a\u[e9]u' $ (set -x; : 'aéu aeu') + : 'aéu aeu' $ (set -x; : '正常終了 aeu') + : '正常終了 aeu' $ (set -x; : 'aeu 正常終了') + : $'aeu \u[6b63]\u[5e38]\u[7d42]\u[4e86]' This issue was originally reported by lijo george in May 2017: https://www.mail-archive.com/ast-developers@lists.research.att.com/msg01958.html src/cmd/ksh93/sh/string.c: - Add is_invisible() function that returns true if a character is a Unicode invisible (non-graph) character, excluding ASCII space. Ref.: https://unicode.org/charts/PDF/U2000.pdf - Use a fallback in is_invisible() if we cannot use the system's iswprint(3); this is the case for the ksh C.UTF-8 locale if the OS doesn't support that. Fall back to a hardcoded blacklist of invisible and control characters and put up with not encoding nonexistent characters into \u[xxxx] escapes. Ref.: https://unicode.org/charts/PDF/U2000.pdf - When deciding whether to switch to $'...' quoting mode (state=2), use is_invisible() instead of testing for ASCII 0-127 range. - In $'...' quoting mode, use is_invisible() to decide whether to encode wide characters into \u[xxxx] escapes. src/cmd/ksh93/tests/builtins.sh: - Add regression tests for shellquoting Arabic, Japanese and Latin UTF-8 characters, to be run only in a UTF-8 locale. The Arabic sample text[*] contains a couple of direction markers that are expected to be encoded into \u[xxxx] escapes. [*] source: https://r12a.github.io/scripts/tutorial/summaries/arabic
This commit is contained in:
parent
588a1ff7ca
commit
f9d28935bb
3 changed files with 52 additions and 2 deletions
|
@ -318,6 +318,24 @@ LC_CTYPE=POSIX true # on buggy ksh, a locale re-init via temp assignment res
|
|||
[[ $actual == "$expect" ]] || err_exit 'shell-quoting corrupted after interrupted processing of UTF-8 char' \
|
||||
"(expected $expect; got $actual)"
|
||||
|
||||
# shell-quoting UTF-8 characters: check for unnecessary encoding
|
||||
case ${LC_ALL:-${LC_CTYPE:-${LANG:-}}} in
|
||||
( *[Uu][Tt][Ff]8* | *[Uu][Tt][Ff]-8* )
|
||||
expect=$'$\'عندما يريد العالم أن \\u[202a]يتكلّم \\u[202c] ، فهو يتحدّث بلغة يونيكود.\''
|
||||
actual=$(printf %q 'عندما يريد العالم أن يتكلّم ، فهو يتحدّث بلغة يونيكود.')
|
||||
[[ $actual == "$expect" ]] || err_exit 'shell-quoting: Arabic UTF-8 characters' \
|
||||
"(expected $expect; got $actual)"
|
||||
expect="'正常終了 正常終了'"
|
||||
actual=$(printf %q '正常終了 正常終了')
|
||||
[[ $actual == "$expect" ]] || err_exit 'shell-quoting: Japanese UTF-8 characters' \
|
||||
"(expected $expect; got $actual)"
|
||||
expect="'aeu aéu'"
|
||||
actual=$(printf %q 'aeu aéu')
|
||||
[[ $actual == "$expect" ]] || err_exit 'shell-quoting: Latin UTF-8 characters' \
|
||||
"(expected $expect; got $actual)"
|
||||
;;
|
||||
esac
|
||||
|
||||
# ======
|
||||
# we won't get hit by the one second boundary twice, right?
|
||||
expect= actual=
|
||||
|
|
Loading…
Add table
Add a link
Reference in a new issue