Rationale
Since Everything2 pages do not contain explicit encoding tags (and the user cannot specify them), the default character set on Everything2 is ISO 8859-1 (aka Latin-1). This is great for English and also sufficient for most Western European languages, since their accented characters (é, ô, ñ, ä...) will show up just fine, but anything beyond that 8-bit range will run into problems. There is exactly one acceptable solution: using Unicode as HTML character entities.
As you may know, Unicode is a character set that will cover
every single script on the planet (and beyond).
Characters on the main plane of Unicode (U+0000 to U+FFFF), which almost certainly includes everything you will ever need, can be accessed in HTML with the escape sequence &#xcode;, where code is the character's hexadecimal code point. There are several distinct and unique advantages to this approach:
- No character set switching. Characters encoded this way
are instantly visible, without the user tweaking his encodings,
fonts, etc. This is by far the most important single reason
to use Unicode on E2.
- Multiple languages in one page. Unicode characters are
distinct and unique, so they can be mixed and matched freely.
There is no
other way to use both, say, Hebrew and Arabic in the same writeup.
- Guaranteed E2 support. Character entities are interpreted
and stored as ordinary text, so they will never be mangled by EDB.
- Graceful failover. If the user's browser does not support
Unicode (or the subset in question), the user will see question
marks or little squares, instead of random 8-bit garbage, which may
include control codes that wreak havoc on formatting.
(Unfortunately, some very old and/or broken browsers may refuse to
recognize the existence of two-byte entities and print the entity
string in full, which will look horrible.)
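To make the escape sequence concrete, here is a short sketch (in Python, purely for illustration -- in a writeup you simply type the entity text itself):

```python
import html

# Build the &#xcode; entity for a single character.
def to_entity(ch: str) -> str:
    return "&#x{:04X};".format(ord(ch))

print(to_entity("é"))   # &#x00E9;
print(to_entity("東"))  # &#x6771;

# Browsers decode entities back to characters; html.unescape does the same.
print(html.unescape("&#x6771;&#x4EAC;"))  # 東京
```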
There are, of course, a few downsides:
- Inefficiency. Each coded character entity takes up seven
bytes, whereas a national character set encoding may squeeze down
to one or two. For small quantities of text, this is not really an
issue.
- Difficulty of entry. Only a few programs can generate HTML-encoded characters automatically -- but there are some tips on working around this in the next section.
- Lack of support. Older browsers typically do not support extended character entities at all, or require painful manual configuration (especially of fonts) for them. Both Mozilla and later versions of Internet Explorer support them quite well right out of the box, though, and this problem will gradually solve itself. (Also bear in mind that most older systems that cannot display Unicode without tinkering will not support any other encoding without tinkering either.)
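The inefficiency is easy to quantify; a quick Python comparison (the encodings below are just examples -- each hex entity is eight ASCII bytes, a four-digit decimal one seven):

```python
text = "東京"

# Entity-encoded: eight ASCII bytes per character in &#xcode; form.
entities = "".join("&#x{:04X};".format(ord(c)) for c in text)
print(entities, len(entities))        # &#x6771;&#x4EAC; 16

# National and Unicode encodings are far more compact.
print(len(text.encode("utf-8")))      # 6
print(len(text.encode("shift_jis")))  # 4
```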
When to Use Unicode
Unicode character entities are at their best when you have to refer to
small bits of other languages in writeups written mostly in English.
For example, a writeup on
Chinese astrology may
want to mention the original characters
(天干) for what are in English dubbed the
Heavenly Stems.
Speakers of
Hebrew may want to trace how
בית לחם
became
Bethlehem, while those of
Arabic may wonder how
غزة became
Gaza. A writeup on
Budapest's metro system can't spell
Kőbánya-Kispest
properly without using a character entity for
ő.
Students of
Japanese can
find out what
Tokyo (東京)
really means.
And the list goes on! I recommend putting the Unicode in
parentheses after the transliteration or translation, so people who do not speak the language or whose
browsers do not support Unicode will still have some idea of what you are talking about.
When Not to Use Unicode
Material written
entirely in non-Latin1 languages, on the other
hand, is probably best written with some other encoding;
Unicode's own
UTF-8 might not be a bad choice. As an experiment,
I did node the
Three Gates of Tosotsu (a
Zen text dating back
to 600 AD or so) in the original using character entities, but I
got a few complaints about screwy formatting -- Chinese doesn't
use spaces between words, so even a short line written as an unbroken
string of entities will stretch into hundreds of characters on systems
that do not fully support Unicode.
Using Unicode characters in node titles is also a bit of an iffy business, since they're usually pretty tough to enter and also because EDB doesn't realize that &#xhex;, &#Xhex;, &#dec; and &#0dec; are all the same character. Then again, for "non-transcriptable" languages like Hebrew and Arabic, entering the words in Unicode is pretty much the only way to get a unique and identifiable name. But until the search code gets tweaked for better support of non-Latin-1 characters, I would have to recommend keeping Unicode out of titles.
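The title-matching problem comes from entity spelling, not from the characters themselves: a numeric entity has several equivalent spellings that all decode to the same code point. A quick Python check of that equivalence (the stdlib html module follows HTML's numeric-entity rules):

```python
import html

# Four spellings of U+5929 (天): lowercase hex, uppercase hex,
# decimal, and zero-padded decimal.
forms = ["&#x5929;", "&#X5929;", "&#22825;", "&#022825;"]

decoded = {html.unescape(f) for f in forms}
print(decoded)  # {'天'} -- one character, four spellings
```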
Notes on Composed, Right-To-Left and Other Odd Scripts
Some scripts, like
Devanagari and
Hangul, compose words from
individual letters. Some scripts, like
Hebrew, write from right
to left. A few scripts, like
Arabic, are both. Fortunately,
Unicode hides all the hellishly complex details of implementation, so
غزة (Gaza) is written in Unicode as ghain-zain-teh marbuta, غزة, and your browser's rendering engine will automatically reverse the order and join the letters as script, so that ghain takes its initial form, zain its final form and teh marbuta its isolated form.
As these computations are left to the user's display engine, it is possible that the browser does not know the proper rendering method, or that there are bugs in the rendering code -- for example, Mozilla (at the time of writing) still has some difficulties with bidirectional scripts.
There is
nothing you can do about this, but again, browsers that dig Unicode
will usually get these right and the issue is irrelevant for
systems that don't support Unicode at all.
Manual Entry
Unicode character entities can be written by hand by looking up the code in a character table and entering it as &#xcode;. Tables of codes
can be found at
www.unicode.org, the authoritative source,
and
www.hclrss.demon.co.uk/unicode, which gives the
characters packaged more conveniently as HTML tables.
This method is, however, intensely painful for anything more complex than a single name. And while it is OK for alphabetic or syllabic scripts, finding a Japanese kanji or Chinese hanzi (漢字) by browsing through 5000 characters is not fun.
Automated Conversion
Some tools can generate character entities on the fly, most
notably perhaps
Microsoft Word, which converts any script
into entities if you
Save As... HTML. Alas, this is accompanied by lots of other HTML mangling, so for E2 you'll have to pick out the entities by hand from the generated junk and paste them back into the original. This is OK for one-off operations, but soon becomes painful.
A better option is Java, which includes a remarkable set of
tools that can convert almost any encoding into Unicode and back.
Once the text is Unicode, it's a simple matter to extract the hex
code and pad it, and that's what my little utility J2U does.
You'll need a working Java environment to run J2U; writing an applet interface to the tool is on my TODO list.
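J2U itself is Java, but the core conversion is only a few lines in any language with decoding support. A hypothetical Python equivalent (the EUC-JP example encoding is my choice for illustration, not part of J2U):

```python
# Decode bytes from a national encoding, then emit non-ASCII
# characters as padded hexadecimal HTML entities.
def to_entities(raw: bytes, encoding: str) -> str:
    text = raw.decode(encoding)
    return "".join(
        c if ord(c) < 128 else "&#x{:04X};".format(ord(c))
        for c in text
    )

print(to_entities("東京".encode("euc-jp"), "euc-jp"))  # &#x6771;&#x4EAC;
```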
For Japanese, you can cut and paste strings in any encoding into
XJDIC or WWWJDIC
(at http://www.csse.monash.edu.au/~jwb/wwwjdic.html), after which performing an "Examine Kanji" on the word
gives the Unicode as Uxxxx. unicode.org's Unihan
database search provides similar facilities for all languages
that use 漢字.
A few more tools and tips sent in by kind noders:
- GNU Recode, for converting anything to anything else
- Mozilla's Composer, for realtime conversion of native IME input into HTML entities
Cheers to Gorgonzola, lj, Oolong, tres equis and WWWWolf for corrections and additions.