Minutes of the June 21
           Message format extensions working group.


Attendees
---------

Phill Gross              pgross@nis.ans.net
Peter Svanberg           psu@nada.kth.se
Byungnam Chung           bnchung.sokri.etra.re.kr
Bob Kummerfeld           bob@ca.pn.oz.au
Jonny Eriksson           bygg@sunet.se
Jan Michael Rynning      jmr@nada.kth.se
Keld Simonsen            keld.simonsen@dkuug.dk
Greg Vaudreuil           gvaudre@nri.reston.va.us

Agenda
------

1) Character Set Selection

   - Status and Input to the ISO 10646 process
     o Unicode <=> ISO 10646 Union?
     o Use of C0 and C1 codespace

   - Selection of "Common" character sets or schemes
     o ISO 8859-1, ISO 8859-n, Profiles for the use of ISO 2022?
     o Specifying "requiredness"

   - Specification of 8 bit character sets in headers

Minutes
-------

1) Character Set Issues

   a) Unified character set

     1) Administrative

At last word, the ISO DIS 10646 received 9 YES votes and 14 NO
votes, and work is proceeding to resolve the remaining issues.  An
unofficial but promising effort is the work underway to unify ISO
DIS 10646 and Unicode, another scheme for a global character set.
This effort is being conducted outside the normal ISO
process. This working group was asked to discuss this effort and
endorse it if possible.  The working group discussed this effort,
and agreed that the efforts to combine Unicode and 10646 were in
fact positive.  

     2) Technical

The unification of ISO DIS 10646 and Unicode requires the
resolution of several technical issues.  The primary issue,
tentatively resolved, involves "Han unification", a scheme that
re-uses many of the graphics of the various Kanji character sets.
Other issues involve the use of C0 and C1 codespace.  The use of
C0 and C1 codespace involves transport issues, and this working
group was asked for its input.

C0 codespace consists of the positions 0 through 31 and 127,
traditionally used for control characters.  There is a proposal
to use this space in the second octet of a multi-byte character for
graphic characters.  The working group discussed this and rejected
the use of this space.  A graphic character in the C0 space will
likely be interpreted by a transport protocol as a control
character.  Many transport protocols which interpret in-band data,
such as SMTP, may behave unpredictably in this situation.  One
example is where a sequence of graphics legally sent by an 8 bit
sender is misinterpreted by a 7 bit receiver, after bit stripping,
as a 13-10-46-13-10 sequence (CR LF "." CR LF), terminating the
SMTP session prematurely.  Other related anomalies were envisioned.
Unless all transport protocols are made aware of the multi-byte
nature of the data, an unlikely occurrence any time soon, reuse of
C0 space is not recommended.
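
The bit stripping hazard described above can be illustrated with a
short Python sketch.  The octet values below are hypothetical and were
chosen only so that the collision occurs; they do not come from any
proposed encoding, and the function names are this sketch's own.

```python
# CR LF "." CR LF: the octets 13-10-46-13-10 mentioned in the minutes.
END_OF_DATA = bytes([13, 10, 46, 13, 10])


def strip_high_bit(octets: bytes) -> bytes:
    """Clear the high-order bit of every octet, as a 7 bit path might."""
    return bytes(b & 0x7F for b in octets)


def looks_like_end_of_data(octets: bytes) -> bool:
    """Report whether a 7 bit receiver would see CR LF "." CR LF."""
    return END_OF_DATA in strip_high_bit(octets)


# Hypothetical two-octet graphic characters (illustrative values only).
# The second octet of the first two characters lies in C0 space.
graphics = bytes([0x8D, 0x0A,   # character 1
                  0xAE, 0x0D,   # character 2
                  0x8A, 0xC1])  # character 3

print(END_OF_DATA in graphics)           # what the 8 bit sender sent
print(looks_like_end_of_data(graphics))  # what the 7 bit receiver sees
```

The sender's octets do not contain the terminating sequence, but the
stripped octets do, which is the anomaly the working group had in mind.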

C1 codespace consists of the positions 128 through 159, space that
may be interpreted as control characters if the high order bit is
stripped.  ISO 8859-n character sets and the current 10646
proposal reserve this space for control characters only, with an
eye toward backward compatibility with 7 bit systems.  The working
group discussed this and concluded that C1 codespace could be used
for graphics if transport protocols could be relied upon never to
strip the high order bit and interpret the resulting characters as
control sequences.  The working group did not make a specific
recommendation, noting only that the use of C1 space to compact
a character set was a positive thing, and that the future evolution
of transport protocols should support the use of this space for
graphics.
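
As a rough illustration of why C1 is sensitive to bit stripping, the
following Python sketch classifies octet values by region, taking C1 as
positions 128 through 159.  The region labels G0 and G1 for the graphic
areas are this sketch's own shorthand, not terms used at the meeting.

```python
def region(octet: int) -> str:
    """Name the code region of a single 8 bit octet value."""
    if not 0 <= octet <= 255:
        raise ValueError("octet must be in 0..255")
    if octet <= 31 or octet == 127:
        return "C0"   # control characters: 0-31 and 127
    if octet <= 126:
        return "G0"   # 7 bit graphics (label assumed for this sketch)
    if octet <= 159:
        return "C1"   # 128-159
    return "G1"       # 160-255 (label assumed for this sketch)


def region_after_stripping(octet: int) -> str:
    """Region a 7 bit receiver would see once the high bit is cleared."""
    return region(octet & 0x7F)


# Every C1 position lands in C0 once the high-order bit is removed.
print({region_after_stripping(o) for o in range(128, 160)})
```

A graphic placed in C1 is therefore only safe on paths that preserve
all eight bits, which matches the condition the working group stated.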


   b) Common Character Sets

In the absence of a single international standard character set, the
working group needs to profile the use of a limited number of the
200+ character sets in use worldwide to facilitate interoperation.
Keld S. gave an overview of the character sets currently in use.

ISO 7 bit family:
     ASCII
     National Versions
       10 National use
       2 Alternate rep # $
     ECMA registry
       7, 8, 16 bit
       ISO 2022 shifts

ISO 8 bit 8859 family:
     1 char = 1 octet
     ASCII in pos 0-127
     Pos 160-255
       Latin sets (5)
       Cyrillic
       Greek
       Arabic
       Hebrew

ISO 6937-2 family 8/16 bit:
     6937-2, T.61
     Non-Spacing accents
     1 char = 1 or 2 bytes
     about 330 graphical chars

Vendor 8 bit sets
     DEC-MCS
     HP Roman8
     IBM PC codepages (5)
       Uses also 128-159 (C1)
     IBM EBCDIC
       Many versions
       Not ASCII Compatible

16 bit char sets
     Japanese: JIS 0208, 0212
     Chinese: GB 1980
     Korean:
     Japanese 8/16 bit: Shift JIS
     Unicode: New vendor charset unifies CN, JP, KO sets
        Incompatible with ISO

Multi-byte:
     EUC: Extended UNIX code
       ISO 2022 shifting
       SS1 SS2 SS3
       4 char sets
       8/16/24 bits   

32 Bits:
     ISO 10646
       Also usable in 8, 16, or 24 bit compaction methods
       Proper encoding subsets: ASCII and ISO 8859-1

Control Character Sets:
     ISO 646: 0-31, 127
     ISO 6429: 0-31, 127-159
     EBCDIC: as ISO 646  
     
Several ideas were batted around, including strict use of ISO 2022,
profiling language to character set mapping, and the use of
"preferred" character sets.  The working group felt that the best
approach was to codify existing practice in the interim, pending
adoption of an "international" character set.  This existing
practice was reduced to the following.

If possible, use ISO 8859, with the lowest version number possible,
i.e., use 8859-1 (Latin 1) over 8859-10? (Latin 5?). If the
characters needed are not in the 8859 sets (e.g., Kanji), use the
ISO 2022 character switching standard, declaring 2022 in the header
of the document.  While this may lead to the use of any of the many
character sets in the ECMA registry, the WG felt that in practice
only the current Oriental mail systems will use the 2022 system, and
only with limited character sets.
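
The selection rule above can be sketched in Python as follows.  This is
only an illustration under stated assumptions: text is held as a Python
string, the candidate list is limited to ISO 8859 parts 1 through 9,
and a single ISO 2022 based codec shipped with Python (iso2022_jp)
stands in for "use 2022".  None of these choices were made by the
working group.

```python
from typing import Optional

# ISO 8859 parts to try, lowest part number first (assumed list).
ISO_8859_PARTS = ["iso8859_%d" % n for n in range(1, 10)]

# Stand-in for "use ISO 2022": one 2022-based codec, chosen arbitrarily.
ISO_2022_FALLBACK = "iso2022_jp"


def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def pick_charset(text: str) -> Optional[str]:
    """Prefer the lowest-numbered ISO 8859 part, else fall back to 2022."""
    for codec in ISO_8859_PARTS:
        if can_encode(text, codec):
            return codec
    if can_encode(text, ISO_2022_FALLBACK):
        return ISO_2022_FALLBACK
    return None  # no candidate covers the text


print(pick_charset("caf\u00e9"))                           # Latin text
print(pick_charset("\u041f\u0440\u0438\u0432\u0435\u0442"))  # Cyrillic text
print(pick_charset("\u65e5\u672c\u8a9e"))                  # Kanji text
```

Under these assumptions the Latin sample selects part 1, the Cyrillic
sample selects part 5, and the Kanji sample falls through to the 2022
stand-in.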

   c) Use of Non-ASCII character sets in headers


What a mess!  The attendees of this meeting spent over an hour
working on various schemes for indicating character sets other than
ASCII in the headers of a message.  It was identified as a
requirement that the fields defined as TEXT be able to have
variable character sets.  While this goal was stated, no mechanism
for its implementation was agreed upon.

A modification of the BNF notation was suggested by Keld S.     

CHAR-EIGHT     = <any Eight-bit character>; (0-377, 0.-255.)

qtext          = <any CHAR-EIGHT excepting <">, "\" & CR, and
               including linear-white-space>

quoted-pair    = "\" CHAR-EIGHT

text           = <any CHAR-EIGHT, including bare CR & bare LF but
               NOT including CRLF>


This notation was accepted by the attendees of the meeting; however,
several problems were identified and not resolved: 1)
identification of the header character set and the need for
conversion, and 2) encoding the header character sets in 7 bit
transport format.
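
To make the modified rules concrete, here is a minimal Python sketch of
checks for the "text" and "qtext" rules as suggested.  The function
names are hypothetical, and the qtext check is a simplification: it
rejects every CR and does not attempt to recognise folded
linear-white-space.

```python
# Octets that qtext excludes: the double quote, the backslash and CR.
QTEXT_EXCLUDED = b'"\\\r'


def is_text(octets: bytes) -> bool:
    """text: any CHAR-EIGHT; bare CR and bare LF allowed, CRLF is not."""
    return b"\r\n" not in octets


def is_qtext(octets: bytes) -> bool:
    """qtext: any CHAR-EIGHT except the quote, the backslash and CR."""
    return not any(b in QTEXT_EXCLUDED for b in octets)


print(is_text(b"caf\xe9 with a bare \r and a bare \n"))  # 8 bit is fine
print(is_text(b"line one\r\nline two"))                  # CRLF is not
print(is_qtext(b"caf\xe9"))
print(is_qtext(b'say "hi"'))
```

Because both rules are defined over CHAR-EIGHT, octets above 127 pass
unchanged; only the structural characters are restricted.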

It was not clear how a conversion gateway would know that the
header was 8 bit and needed encoding.  A suggestion accepted by the
group was that the use of the new BNF requires the use of a header-
charset header line.  This additional header adds complexity to
user agents and conversion gateways by requiring two passes of the
header to determine and convert the header into a passable or
readable form.  It was felt that this was inelegant but do-able.
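
The two passes mentioned above might look like the following Python
sketch.  Only the header name "header-charset" comes from the
discussion; its value syntax (a bare character set name), the sample
message, and the function names are assumptions made for illustration,
and header folding is ignored.

```python
from typing import List, Tuple


def split_headers(raw: bytes) -> List[Tuple[bytes, bytes]]:
    """Split an unfolded header block into (name, value) octet pairs."""
    pairs = []
    for line in raw.split(b"\r\n"):
        if not line:
            continue
        name, _, value = line.partition(b":")
        pairs.append((name.strip(), value.strip()))
    return pairs


def decode_headers(raw: bytes, default: str = "ascii") -> List[Tuple[str, str]]:
    """Decode header values using the charset named in header-charset."""
    pairs = split_headers(raw)
    # Pass 1: find the header-charset line, wherever it appears.
    charset = default
    for name, value in pairs:
        if name.lower() == b"header-charset":
            charset = value.decode("ascii")
    # Pass 2: decode every value with the charset found in pass 1.
    return [(n.decode("ascii"), v.decode(charset)) for n, v in pairs]


# Hypothetical message: the 8 bit field precedes the charset declaration.
raw = (b"From: J\xf6rgen <j@example.org>\r\n"
       b"Header-Charset: iso-8859-1\r\n"
       b"Subject: Hej\r\n")
for name, value in decode_headers(raw):
    print(name, "=", value)
```

Since the declaring line may follow the fields it describes, a single
pass cannot decode the earlier fields, which is the added complexity
the group noted.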

Several proposals were discussed for encoding the 8 bit text
strings when 7 bit transport was required.  It was accepted that
this was a hard requirement.  

1) Variable Substitution

One proposal for the insertion of 8 bit text was to substitute
a variable name in the header for each text string needing 8 bit
characters. The variable could then be defined elsewhere in the
header, including the encoded actual string and a token indicating
the character set.  This was rejected as messy and difficult to
implement in current user agents.

2) Message Encapsulation

Encapsulate the mail message using the message type body part and
a suitable transport encoding, preferably quoted-printable.  This
proposal is controversial: at least one implementor of the
message format standard regards it as having excessive complexity
for the user agent.  It is not clear that the encapsulated message
will be permitted to have a transport encoding.

3) Encoded Text Fields

This proposal would specify a standard encoding for the header
fields, possibly quoted-readable or quoted-printable, and identify
this fact in a header-transport-encoding header or the header-
character-set header.
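
As an illustration of the third proposal, the sketch below encodes a
field value with quoted-printable using Python's standard quopri
module.  The choice of ISO 8859-1 as the default character set and the
function names are assumptions; the minutes do not fix either, and the
companion header that would announce the encoding is not modelled.

```python
import quopri


def encode_field(value: str, charset: str = "iso-8859-1") -> bytes:
    """Encode a field value so that only 7 bit octets remain."""
    return quopri.encodestring(value.encode(charset))


def decode_field(encoded: bytes, charset: str = "iso-8859-1") -> str:
    """Reverse encode_field for a reader that knows the charset."""
    return quopri.decodestring(encoded).decode(charset)


encoded = encode_field("Caf\u00e9")
print(encoded)                # the 8 bit octet becomes =E9
print(decode_field(encoded))  # original text recovered
```

The encoded form survives 7 bit transport and bit stripping unchanged,
at the cost of requiring every reader to decode it.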

Conclusions

While no one was happy, the group tentatively agreed not to permit
8 bit text in the headers. The only reasonable way to carry 8 bit
text over 7 bit transport was to encode the text fields and insert
a new header line.  Given this overhead, the group agreed that,
while not ideal, extended character sets should always be encoded,
eliminating the need for intermediate gateways to parse and convert
the headers.