File: pymailgui-products/unzipped/PyMailGui-PP4E/fixTkBMP.py

"""
================================================================================
[4.0] Sanitize text for GUI display, replacing characters outside Tk's
BMP code-point range with the standard Unicode replacement character.
Without this, GUIs may be left half-drawn or hung.  This code has also been
deployed in frigcal (for calendar content); mergeall (for filenames in
scrolled messages); and pyedit (for both standalone use and embedded roles here).

DISCUSSION:

At least through Tk 8.6, Tk cannot display Unicode characters outside the
U+0000..U+FFFF BMP (UCS-2) code-point range.  This issue has been popping
up increasingly in PyMailGUI, as people have begun sending newer emojis in:

  -Main message text    (displayed in view windows and texteditor popups)
  -Header line text     (displayed in view and possibly list windows)
  -Text attachments     (displayed in texteditor popups)
  -Attachment filenames (displayed in view window part buttons)
  -HTML part text       (displayed in view windows and texteditor popups)
  -Info message boxes   (when filenames are included in the display)
  -Open dialogs         (when tkinter saves a prior choice having emojis)

Any one of these display contexts can disable a Tk-based GUI if unhandled.
Specifically, an uncaught exception is raised by Python's tkinter module,
which is displayed on the console (if one exists) and causes the GUI's
currently-running code to exit and return to the GUI event loop:
_tkinter.TclError:
    character U+1f60a is above the range (U+0000-U+FFFF) allowed by Tcl

Some of the above contexts were addressed individually in the past with
'try' statements, but an emoji in an important attachment's filename that
rendered it unreadable finally escalated this issue to global status.
This can also impact other programs, including mergeall (filenames in
scrolled output) and frigcal (calendar content from other programs).

To address this, call fixTkBMP() below to sanitize all text passed to the GUI
for display.  It replaces any non-BMP characters with the standard Unicode
replacement character U+FFFD, which Tk displays as a highlighted question
mark diamond on Windows (and the same or similar elsewhere).
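
As a rough illustration of the replacement rule just described, the following
standalone sketch applies it without importing tkinter.  The names
sanitize_bmp and has_non_bmp are hypothetical stand-ins for this module's
fixTkBMP and isNonBMP, and the Tk version test is deliberately omitted:

```python
REPLACEMENT = chr(0xFFFD)   # U+FFFD, the Unicode replacement character
BMP_MAX = 0xFFFF            # highest code point in the BMP

def sanitize_bmp(text):
    # hypothetical stand-in for fixTkBMP, minus the TkVersion test
    return ''.join(ch if ord(ch) <= BMP_MAX else REPLACEMENT for ch in text)

def has_non_bmp(text):
    # hypothetical stand-in for isNonBMP, minus the TkVersion test
    return any(ord(ch) > BMP_MAX for ch in text)

smiley = chr(0x1F60A)       # the emoji named in the TclError shown earlier
subject = 'Lunch? ' + smiley
cleaned = sanitize_bmp(subject)
```

Here cleaned ends with U+FFFD in place of the emoji and has the same length
as subject, because each non-BMP character is a single code point in a
Python 3.3+ str.  A check like has_non_bmp can be used where replacement is
not wanted, such as dropping a remembered filename before an Open dialog.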

This is not ideal and slows and clutters code, but email providers seem
intent on rushing to proprietary characters not supported by other clients
written just a few years ago, and replacements are better than exceptions.
That is, emojis kill programs!  They impact potentially every text display
program ever written.  Were Unicode jack-o-lanterns really that important?
And wouldn't embedded images in HTML mails have achieved the very same goal?
Alas, those who show up for standards meetings set the standards...

Note: this workaround assumes Tk 8.7 will lift the BMP restriction in 2017
or later, per a dev rumor; if not, the code below should be updated (TBD).

ABOUT INVALID EMAILS:

Caveat: at least one email source has been seen sending UTF-16 header
text having embedded UTF-16 surrogate-pair bytes for emojis, as raw unmarked
bytes without the required MIME encoding.  Such invalid text is not and
cannot be decoded from bytes to Unicode characters.  Its UTF-16 surrogate
bytes are properly interpreted as ASCII here, and display as odd fraction
character symbols (the glyph of the 16-bit value used to mark surrogate
pairs when encoded per UTF-16) instead of the standard Unicode replacement
character.  Mail clients cannot "guess" that ASCII text isn't ASCII.

On the other hand, such invalid text will not crash the GUI, and PyMailGUI
cannot fix broken mailers (this same mailer has sent UTF-16 text incorrectly
encoded as quoted-printable UTF-8: decoding per MIME and Unicode yields only
raw UTF-16 bytes!).  When properly MIME-encoded, UTF-16 surrogate pairs that
encode Unicode emoji characters will be correctly decoded to their Unicode
code points here, and be accurately detected as outside Tk's BMP range.
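
The difference between decoded and undecoded text can be sketched as follows.
This is an illustration only: the sample emoji is the one from the TclError
above, and reading the raw bytes one per character (Latin-1 here) is an
assumption standing in for whatever a client does with unmarked bytes:

```python
smiley = chr(0x1F60A)                      # one non-BMP code point
raw = smiley.encode('utf-16-be')           # 4 bytes: a UTF-16 surrogate pair
high = int.from_bytes(raw[:2], 'big')      # 0xD83D, the high surrogate
low = int.from_bytes(raw[2:], 'big')       # 0xDE0A, the low surrogate

# Properly decoded: one character above U+FFFF, so the BMP test detects it
decoded = raw.decode('utf-16-be')
detected = any(ord(ch) > 0xFFFF for ch in decoded)

# Assumption: bytes read one per character all stay inside the BMP,
# so nothing is flagged or replaced, and Tk displays them as-is
undecoded = raw.decode('latin-1')
missed = not any(ord(ch) > 0xFFFF for ch in undecoded)
```

In this sketch the decoded text is flagged for replacement, while the
undecoded bytes pass through as four ordinary characters and display as
unrelated symbols rather than raising an error.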
================================================================================
"""
from tkinter import TkVersion


def fixTkBMP(text):
    """
    Change characters outside Tk's BMP range to the Unicode replacement char.
    Used to sanitize all text for display in the GUI, else tkinter fails.
    """
    if TkVersion <= 8.6:
        text = ''.join((ch if ord(ch) <= 0xFFFF else '\uFFFD') for ch in text)
    return text


def isNonBMP(text):
    """
    Return true if any character (codepoint) in text is outside Tk's BMP range.
    Used by Open dialogs to force initialfile=None when True for prior choice.
    """
    if TkVersion <= 8.6:
        return any(ord(ch) > 0xFFFF for ch in text)
    else:
        return False   # and assume Tk 8.7 will make this better...


