1
0
Fork 0
mirror of git://git.code.sf.net/p/cdesktopenv/code synced 2025-03-09 15:50:02 +00:00

printf: Fix HTML and URI encoding (%H, %#H)

This applies a number of fixes to the printf formatting directives
%H and %#H (as well as their equivalents %(html)q and %(url)q):
1. Both formatters have been made multibyte/UTF-8 aware, and no
   longer delete multibyte characters. Invalid UTF-8 byte sequences
   are rendered as ASCII question marks.
2. %H no longer wrongly encodes spaces as non-breaking spaces
   ( ) and instead correctly encodes the UTF-8 non-breaking
   space as such.
3. %H now converts the single quote (') to '%#39;' instead of
   ''' which is not a valid entity in all HTML versions.
4. %#H failed to encode some reserved characters (e.g. '?') while
   encoding some unreserved ones (e.g. '~'). It now percent-encodes
   all characters except those 'unreserved' as per RFC3986 (ASCII
   alphanumeric plus -._~).

Prior discussion:
ce8d1467-4a6d-883b-45ad-fc3c7b90e681%40inlv.org

src/cmd/ksh93/include/defs.h:
src/cmd/ksh93/sh/string.c:
- defs.h: If compiling without SHOPT_MULTIBYTE, redefine the
  mbwide() macro (which tests if we're in a multibyte locale) as 0.
  This lets the compiler optimiser do the work that would otherwise
  require a lot of tedious '#if SHOPT_MULTIBYTE' directives.
- string.c: Remove some now-unneeded '#if SHOPT_MULTIBYTE' stuff.
- defs.h, string.c: Rename is_invisible() to sh_isprint(), invert
  the boolean return value, and make it an extern for use in
  fmthtml() -- see below. If compiling without SHOPT_MULTIBYTE,
  simply #define sh_isprint() as equivalent to isprint(3).
- defs.h: Add URI_RFC3986_UNRESERVED macro for fmthtml() containing
  the characters "unreserved" for purposes of URI percent-encoding.

src/cmd/ksh93/bltins/print.c: fmthtml():
- Remove kludge that skipped all multibyte characters (!).
- Complete rewrite to implement fixes described above.
- Don't bother with '#if SHOPT_MULTIBYTE' directives (see above).

src/cmd/ksh93/data/builtins.c:
- sh_optprintf[]: %H: Add single quote to encoded chars doc.
- Edit credits and bump version date.

src/cmd/ksh93/tests/builtins.sh:
- Update and tweak old regression tests.
- Add a number of new tests for UTF-8 HTML and URI encoding, which
  are only run when running tests in a UTF-8 locale (shtests -u).
This commit is contained in:
Martijn Dekker 2020-08-10 22:15:53 +01:00
parent aff63e382d
commit 8477d2ce22
7 changed files with 149 additions and 61 deletions

View file

@ -1180,7 +1180,7 @@ USAGE_LICENSE
;
const char sh_optprintf[] =
"[-1c?\n@(#)$Id: printf (AT&T Research) 2009-02-02 $\n]"
"[-1c?\n@(#)$Id: printf (AT&T Research/ksh93) 2020-08-10 $\n]"
USAGE_LICENSE
"[+NAME?printf - write formatted output]"
"[+DESCRIPTION?\bprintf\b writes each \astring\a operand to "
@ -1211,7 +1211,7 @@ USAGE_LICENSE
"[+%B?Treat the argument as a variable name and output the value "
"without converting it to a string. This is most useful for "
"variables of type \b-b\b.]"
"[+%H?Output \astring\a with characters \b<\b, \b&\b, \b>\b, "
"[+%H?Output \astring\a with characters \b<\b, \b&\b, \b>\b, \b'\b, "
"\b\"\b, and non-printable characters properly escaped for "
"use in HTML and XML documents. The alternate flag \b#\b "
"formats the output for use as a URI.]"