Tuesday, May 12, 2009

Unicode and Oracle

In my work for a software company, I get a lot of questions and problems relating to Oracle and Unicode. The following series of posts will be a summary of how these two not-particularly-easy-to-use-or-understand components fit together.

Before I get into how Oracle and Unicode fit together, I really feel that it's instructive to understand both what Unicode is and how it came to be. Therefore I'm starting with...

A potted history of the precursors to Unicode

Even before ASCII... or even proper computers

To really understand Unicode, you first need to understand a little about the background to what came before Unicode. The most detailed history I know actually gives the history of ASCII, and doesn't even touch on Unicode, but if you're into that sort of thing (I am!) then it should be interesting.

Basically, the first non-time-based encoding scheme (i.e. not Morse code) was used in telegraphy and was called the Baudot Code. It was created by Emile Baudot in 1876 and was itself inspired by the Gauss-Weber telegraph alphabet. The code was initially represented by 6 bits, but it was later reduced to a 5-bit encoding scheme (32 characters). This patented encoding was the basis of a later code called the Murray Code, invented by Donald Murray, who used it to print telegrams automatically on a punched-tape printer he had also invented. Murray's code was also a 5-bit encoding, but it introduced control characters we still see today - characters such as Carriage Return (CR), Line Feed (LF), NULL, BLANK and DEL.

Murray's patent was later sold to the Western Union Telegraph Company in 1912. Although the code was patented, a number of incompatible variations of it were already in use in other telegraph systems by this time - an issue that not even Murray was too worried about in the beginning. However, as reliable communications became more and more important, governments and businesses started to realise that they needed to standardize the code so that they could communicate more easily with each other.

The French actually work with (most of) the rest of the world

Thus in 1925 the French set up the Comité Consultatif International Télégraphique (CCIT, or in English the "International Telegraph Consultative Committee") for the purpose of creating an internationally recognized standardized encoding. This proved to be no mean feat - especially given that they only had 32 codes to work with! As it turns out, standardizing the coding scheme was particularly difficult because the Russians objected to what everyone else initially agreed on. Rather than bore you with what went on, just understand that eventually two encoding schemes were formed - the International Telegraph Alphabet 1 (ITA-1), which the Russians were particularly fond of, and the International Telegraph Alphabet 2 (ITA-2), which everyone else quite liked.

As you can imagine, with almost everyone but the Russians using ITA-2, use of ITA-1 soon fell by the wayside. In 1948 ITA-2 was adopted (with reservations) by the United States as its standard for telegraphy, and once that happened it was pretty much all over for ITA-1. What I find interesting is that evidently in those days they were more cautious about adopting standards, as it took 19 years to adopt the ITA-2 encoding - enough time for World War I and World War II to have been and gone for some time!

The Americans take over

Now while it was grand that the world had agreed that ITA-2 was the one everyone should use, 5 bits is really quite limiting, so it was inevitable that someone would decide the encoding set should be expanded. Thus in the 1960s Captain William F. Luebbert of the U.S. Army Signal Research and Development Lab invented FIELDATA, a 6-bit encoding that was used extensively by the U.S. Army in its communications. I only really mention FIELDATA because it is historically important: it inspired the committee that invented ASCII. More on ASCII later though.

In the meantime, another 800-pound gorilla was also inventing its own encoding scheme. That gorilla was IBM, which in 1962 created the 6-bit Binary Coded Decimal Interchange Code (BCDIC) for use in its punched card machines. When punched cards gave way to IBM's System/360 mainframes, Charles E. Mackenzie of IBM extended the code into what became known as the Extended Binary Coded Decimal Interchange Code (EBCDIC). Of course, it took IBM a few more years to standardize it across their own machines... but it is still in wide use to this day. In fact, EBCDIC was long the rival of ASCII, but the X3.2 subcommittee rejected it. Don't worry, we'll get to this very shortly!
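
Just to make the rivalry concrete - this is my own little illustration, not anything from IBM or the committee - the same text comes out as completely different bytes in EBCDIC and ASCII, which is why mainframe data still needs converting before anything else can read it. Python happens to ship a codec for one EBCDIC variant, cp037:

    # A quick sketch: the same text under an EBCDIC code page (cp037, one
    # variant Python ships) versus plain ASCII - completely different bytes.
    text = "ABC 123"

    print(text.encode("cp037").hex())  # c1c2c340f1f2f3
    print(text.encode("ascii").hex())  # 41424320313233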

The birth of ASCII

Given the number of different encodings that had started to proliferate, the American Standards Association (ASA; now known as ANSI) decided that it was time to form a new standard to represent characters. Thus on June 17, 1963 the ASA's X3 Committee for Computer and Processing Standards formed the X3.2 Subcommittee for Coded Character Standards and Data Formats. Within this subcommittee was formed a core group of nerds known as the X3.2.4 taskgroup. To cut a long story short, our small band of heroes spent so long arguing over such things as how many characters they should represent, whether to use control characters and where in the code chart the characters should go that they never met any girls and thus never produced any offspring. Whether that was because they were off the nerd Richter scale, or whether it was because ASCII ruined them, is a point hotly debated. Whatever it was, we shall never have a generation of men like them again. The important thing here is that after all that incredibly dull discussion, you would have thought the resulting standard would include such innovations as lowercase letters, umlauts and grave accents. Strangely, this was not to be, and the world was initially stuck with a bunch of upper-case letters, numbers, punctuation symbols and a lot of obscure control characters.

Incredibly, it took another four years for the X3.2.4 working group to decide on a final 7-bit encoding scheme. This time, however, someone must have told them that nobody wants to COMMUNICATE BY SHOUTING. It was only at this point that the members got a clue and decided that the world should actually be allowed to write electronically with proper punctuation and lower-case letters. Incidentally, it appears that they didn't keep their controversies in-house, and managed to anger another set of (albeit cooler) nerds. What happened was that they were just about to release the final standard that just about every standards body in the world had approved - including ECMA and ISO - when who should come along but the president of the IBM User Group, otherwise known as SHARE. For their troubles, the X3.2.4 working group was broadsided by a vitriolic letter in which the SHARE president threatened that the ASA could go to hell unless changes were made to the draft standard. If they were not, he warned, the programmers of the world would create their own competing standard and ASCII as we know it would be in peril due to a lack of adoption. Faced with this unpalatable situation, the X3.2.4 working group decided to pull the wool over the SHARE president's eyes by moving a few characters into different positions and changing the form of a few others - for instance, they added a break in the middle of the "|" character (have a look at your current keyboard to see what I'm talking about).

Thus in 1967, with SHARE suitably triumphant and mollified - not to mention feeling very smug about forcing changes to a standard that every country in the world had agreed to - the ASA officially released the American Standard Code for Information Interchange (ASCII), or more formally X3.4-1967. This was actually a joint release with the European Computer Manufacturers Association (ECMA) and the International Organization for Standardization (ISO). ECMA released ASCII as ECMA-6, and ISO released a slightly modified version of ASCII as ISO-646. The ISO standard differs from ASCII because it replaced the dollar sign with the international symbol for currency, which is ¤. Personally, I think this seems a good idea in theory, but once you realise that currencies fluctuate against each other you start to understand that unless you show the actual symbol of the currency being used, you could either make quite a bit of money or lose quite a bit of it. Possibly this is why nobody has ever heard of the universal symbol of currency, which ironically is not used universally. What was the ISO thinking?
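
Since the end result was a 7-bit code, here's a tiny sketch of my own (nothing to do with the committee's documents): everything in ASCII lives below code point 128, and anything outside that range - the currency sign included - simply can't be encoded.

    # A small sketch (mine, not the author's): ASCII is a 7-bit code, so
    # every character has a code point below 128; anything else won't encode.
    print("dollar: $".encode("ascii"))   # works - every byte is below 0x80

    try:
        "currency: ¤".encode("ascii")    # the currency sign is not one of the 128
    except UnicodeEncodeError as err:
        print(err)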

As an aside, you'd think that after all that careful consideration and endless argument amongst nations (and even within nations - IBM's EBCDIC was categorically rejected by the X3.2.4 committee) people wouldn't want to fiddle with the standardized characterset. But no, what we got was ATASCII, PETSCII, the ZX Spectrum characterset, Galaksija and YUSCII. Some people are never happier than when they are buggering up a perfectly good standard.

8-bits of confusion - Extended ASCII

It soon became apparent to almost everyone that while 7 bits certainly gave everyone a lot more characters to play with, it still wasn't enough to represent more than a tiny fraction of the world's characters. Computers by this stage all used the 8-bit byte as their smallest unit of storage anyway, so there was no real need to represent characters with only 7 bits. Also, by this time the IBM PC had been unleashed on an unsuspecting public, and with it came an 8-bit extended ASCII variant called PC-US (sometimes also known as OEM-US or DOS-US). This extended ASCII characterset was burnt into the ROM of every IBM and IBM-compatible PC that was sold, which obviously made it very popular, especially as it included characters that let you do things like this:

╔═══════════════════════════╗
║ SWEET BOX WITH TEXT IN IT ║
╚═══════════════════════════╝

(The box would be even sweeter if Blogger honoured non-breaking spaces).

That one extra bit allowed for a whopping 256 characters, 128 more than were available before. The rest of the world soon cottoned on, and a large number of extended ASCII variants appeared; these included two versions of the Greek alphabet, many variants for the Cyrillic languages, Arabic, Chinese, etc, etc, etc. Of course they all used the same code points to represent their various languages, so eventually IBM organized these into what are now known as code pages - the original being code page 437. Soon thereafter, Microsoft made it big with Windows and decided to add to the confusion with their own set of code pages.

For a while, confusion reigned. In particular, BASIC programmers were annoyed because when they switched their graphics card to a mode that used a non-437 code page, their sweet graphics would turn into a big mess of characters. And you never want to make a BASIC programmer angry...
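
To see just how sweet graphics become a big mess, here's a little sketch of my own (Python's codecs standing in for the video hardware's code page switch): the same bytes that draw the top of a box under code page 437 turn into accented letters under Windows-1252.

    # A small sketch (mine): the same bytes under two different code pages.
    # Code page 437 sees box-drawing characters; Windows-1252 sees accented
    # letters and punctuation instead.
    top_of_box = bytes([0xC9, 0xCD, 0xCD, 0xCD, 0xBB])

    print(top_of_box.decode("cp437"))    # ╔═══╗  - the sweet box
    print(top_of_box.decode("cp1252"))   # ÉÍÍÍ»  - the big mess of characters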

ISO to the rescue!

Evidently realising that angry BASIC programmers aren't a good thing, ISO (now known as ISO/IEC) decided to remedy the situation. Unfortunately for the BASIC programmers, this is the same organization that nearly caused financial chaos by replacing the dollar symbol with the universal currency symbol, so they didn't get any ANSI graphics symbols out of it. And hence BASIC died, to give birth to the hell-spawn known as Visual BASIC... sorry, I digress.

Basically, ISO/IEC sat down and decided to use the 8th bit of extended ASCII to form a proper standard. Thus were born the ISO 8859 charactersets, of which there were eventually 15 different parts:

  • ISO 8859-1; this is the most well known. It's also called Latin-1, and you'll often see reference to it in databases and in formats such as MIME encoded emails. It covers characters from most Western European languages.
  • ISO 8859-2; (Latin-2) covers characters from Central and Eastern Europe
  • ISO 8859-3; (Latin-3) covers the Esperanto and Maltese languages
  • ISO 8859-4; (Latin-4) covers Baltic languages
  • ISO 8859-5; (Cyrillic) covers Bulgarian, Byelorussian, Macedonian, Russian, and Serbian languages
  • ISO 8859-6; (Arabic) covers basic Arabic, but not the additional letters needed for Persian or Urdu
  • ISO 8859-7; (Greek)
  • ISO 8859-8; (Hebrew)
  • ISO 8859-9; (Latin-5) has Turkish characters
  • ISO 8859-10; (Latin-6) covers Nordic languages
  • ISO 8859-11; (Thai)
  • ISO 8859-12; sorry, those who use Devanagari missed out - this never eventuated.
  • ISO 8859-13; (Latin-7) covers languages written in the Baltic Rim
  • ISO 8859-14; (Latin-8) covers Celtic languages
  • ISO 8859-15; (Latin-9) Latin-1 on steroids; includes the Euro symbol and a few other obscure characters
Now there are only really two places I know of where you can get a list of the ISO 8859 character maps. The first is the following Debian page, and of course the second is Wikipedia, which not only lets you copy and paste the characters but gives you an excruciating amount of info on the charactersets. Sort of like this blog post, only without the sarcasm.
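
If you want a quick feel for how the parts differ, here's a tiny sketch of my own: one and the same byte value decodes to a different letter depending on which part of ISO 8859 you read it with.

    # A small sketch (mine): one byte, three ISO 8859 parts, three letters.
    b = b"\xe4"

    print(b.decode("iso-8859-1"))   # ä  (Latin-1, Western European)
    print(b.decode("iso-8859-5"))   # ф  (Cyrillic)
    print(b.decode("iso-8859-7"))   # δ  (Greek)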

Yet more charactersets

Of course, this now meant that there were more charactersets than you could poke a stick at. In actual fact there are even more charactersets than this, because ISO-646 had dozens of national variations. Not only that, but another class of characterset encodings appeared, called the Double-Byte Character Set (DBCS). DBCSes were, to my mind, the real precursor to Unicode, because they were the first attempt to use more than one byte to represent characters. You can read more about DBCSes on Wikipedia. While you are there, if you are interested, it's worthwhile reading about another ISO/IEC standard - ISO/IEC 2022 - which uses variable-sized encodings to represent characters and is mainly used for East Asian languages.
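
As a quick illustration of the double-byte idea - my own sketch, using Shift JIS simply because it's a classic DBCS that Python ships a codec for - ASCII characters still take a single byte each, while Japanese characters take two.

    # A small sketch (mine): in a DBCS such as Shift JIS, ASCII characters
    # take one byte each while Japanese characters take two.
    text = "ABC" + "日本語"            # three ASCII letters plus three kanji

    encoded = text.encode("shift_jis")
    print(len(text), len(encoded))     # 6 characters, but 3 + 3*2 = 9 bytes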

More to come later

Well, I'll write more later, as it's quite late here and writing about the precursors to Unicode is actually quite tiring.
