1
0
Fork 0
mirror of git://git.code.sf.net/p/cdesktopenv/code synced 2025-03-09 15:50:02 +00:00

printf: Fix HTML and URI encoding (%H, %#H)

This applies a number of fixes to the printf formatting directives
%H and %#H (as well as their equivalents %(html)q and %(url)q):
1. Both formatters have been made multibyte/UTF-8 aware, and no
   longer delete multibyte characters. Invalid UTF-8 byte sequences
   are rendered as ASCII question marks.
2. %H no longer wrongly encodes spaces as non-breaking spaces
   ( ) and instead correctly encodes the UTF-8 non-breaking
   space as such.
3. %H now converts the single quote (') to '%#39;' instead of
   ''' which is not a valid entity in all HTML versions.
4. %#H failed to encode some reserved characters (e.g. '?') while
   encoding some unreserved ones (e.g. '~'). It now percent-encodes
   all characters except those 'unreserved' as per RFC3986 (ASCII
   alphanumeric plus -._~).

Prior discussion:
ce8d1467-4a6d-883b-45ad-fc3c7b90e681%40inlv.org

src/cmd/ksh93/include/defs.h:
src/cmd/ksh93/sh/string.c:
- defs.h: If compiling without SHOPT_MULTIBYTE, redefine the
  mbwide() macro (which tests if we're in a multibyte locale) as 0.
  This lets the compiler optimiser do the work that would otherwise
  require a lot of tedious '#if SHOPT_MULTIBYTE' directives.
- string.c: Remove some now-unneeded '#if SHOPT_MULTIBYTE' stuff.
- defs.h, string.c: Rename is_invisible() to sh_isprint(), invert
  the boolean return value, and make it an extern for use in
  fmthtml() -- see below. If compiling without SHOPT_MULTIBYTE,
  simply #define sh_isprint() as equivalent to isprint(3).
- defs.h: Add URI_RFC3986_UNRESERVED macro for fmthtml() containing
  the characters "unreserved" for purposes of URI percent-encoding.

src/cmd/ksh93/bltins/print.c: fmthtml():
- Remove kludge that skipped all multibyte characters (!).
- Complete rewrite to implement fixes described above.
- Don't bother with '#if SHOPT_MULTIBYTE' directives (see above).

src/cmd/ksh93/data/builtins.c:
- sh_optprintf[]: %H: Add single quote to encoded chars doc.
- Edit credits and bump version date.

src/cmd/ksh93/tests/builtins.sh:
- Update and tweak old regression tests.
- Add a number of new tests for UTF-8 HTML and URI encoding, which
  are only run when running tests in a UTF-8 locale (shtests -u).
This commit is contained in:
Martijn Dekker 2020-08-10 22:15:53 +01:00
parent aff63e382d
commit 8477d2ce22
7 changed files with 149 additions and 61 deletions

View file

@ -29,6 +29,11 @@
#define defs_h_defined
#include <ast.h>
#if !SHOPT_MULTIBYTE
# undef mbwide
# define mbwide() (0) /* disable multibyte without need for further '#if SHOPT_MULTIBTYE' */
#endif
#include <sfio.h>
#include <error.h>
#include "FEATURE/externs"
@ -441,6 +446,14 @@ extern int sh_whence(char**,int);
extern Namval_t *sh_fsearch(Shell_t*,const char *,int);
#endif /* SHOPT_NAMESPACE */
#if SHOPT_MULTIBYTE
extern int sh_isprint(int);
#else
# define sh_isprint(c) isprint(c)
#endif /* SHOPT_MULTIBYTE */
#define URI_RFC3986_UNRESERVED "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
#ifndef ERROR_dictionary
# define ERROR_dictionary(s) (s)
#endif

View file

@ -17,4 +17,4 @@
* David Korn <dgk@research.att.com> *
* *
***********************************************************************/
#define SH_RELEASE "93u+m 2020-08-09"
#define SH_RELEASE "93u+m 2020-08-10"