Introduction to Decoding

Bits of History, Words of Advice

This text is fiction. It's got some truth in it, but its main purpose is to entertain you with bits of history of electronic communication, and to show you the problems that still exist in the world today.


Prehistoric


Once upon a time, in a not so far away land, hardware engineers thought of a basic calculation unit for their computers, packaged 8 bits together and called it a byte. It could store 256 different values that could be used for machine instructions. There were some machines that broke with this convention and had 6, 7 or 9 bits per byte, but the 8-bit byte eventually prevailed and was accepted as a good choice.

Note that memory was at first only used for storing numbers. Nobody thought of storing text strings; computer memory was far too limited and too precious to be filled with stupid messages like "The Result is" or "Hello World". Input was in numbers, and output was in numbers. People were content with receiving a punchcard with 3.14157354 coded into it after 5 hours of computation time.

It was years later that memory got so cheap that computers could be equipped with a kilobyte or more. Progress also had it that teletype terminals could be hooked up to these computers. Instead of feeding the machine with ticker tape, you now had a keyboard and screen. It was probably the executives who felt that millions' worth of hardware should be able to communicate in a more human-readable fashion. Computer scientists didn't need such a feature; they all could read punchcards and were busy fixing their Pi approximation until it got the fifth digit correct.

With terminals, the computer could read commands in more or less plain English. However, terminal and computer had to communicate, for the digits and characters had to be transported from the keyboard to the computer, and then from the computer to the screen. For this transmission, letters and digits had to be encoded. The obvious choice would have been to use one byte to encode one character. However, the links between the terminal and the computer were incredibly slow, so using 8 bits to encode one character was thought to be overkill. After all, there are only 26 letters and 10 digits. A few more if you also want to use lowercase letters.

Instead, they limited themselves to 7 bits, which could encode 128 different values. The eighth bit was sometimes still used, but as a "parity bit," to detect transmission failures. The human-readable characters were then spread over the 128 possible values. The empty spots were then filled up with some special characters that were necessary for communicating with a terminal, so there were special values for "end of input" or "ring bell". You can guess the executives were enthusiastic when the machine accepted their inputs with a gentle but firm ping!
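To make the parity idea concrete, here is a minimal sketch in Python. It assumes "even parity" (the eighth bit is chosen so that the total number of 1 bits is even); the text above does not say which flavor a given terminal used, and the function names are made up for illustration.

```python
def add_even_parity(code: int) -> int:
    """Put an even-parity bit into bit 7 of a 7-bit character code."""
    if not 0 <= code < 128:
        raise ValueError("not a 7-bit value")
    ones = bin(code).count("1")
    return code | ((ones % 2) << 7)   # set bit 7 only if the count of 1 bits is odd


def parity_ok(byte: int) -> bool:
    """A received byte passes the check if it holds an even number of 1 bits."""
    return bin(byte).count("1") % 2 == 0


sent = add_even_parity(ord("C"))   # 'C' is 67, binary 1000011, three 1 bits
damaged = sent ^ 0b0000100         # one bit flipped on the line

print(parity_ok(sent))      # True
print(parity_ok(damaged))   # False
```

A single parity bit only catches an odd number of flipped bits; if two bits are damaged in the same byte, the check passes anyway.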

This 7-bit encoding was eventually standardized as the ASCII character set (American Standard Code for Information Interchange), internationally known by the code ISO 646 (ISO being the International Organization for Standardization).


The Dawn of The Net


This solution was perfect for a couple of years, yes, even for more than a decade, until the price for computers dropped below a couple of hundred thousand dollars, and some institutions found they had some money left for a second computer. Wow, to have two computers, that was power!

Sooner or later, students with too much time on their hands started to connect machines together, and wrote software to communicate from the first one to the other one, and to send messages over that link. For that, they adopted the established protocol of a machine talking with its terminal. For years, this worked rather well, and they were happy to send little pieces of mail from here to there, and eventually, after discovering the advantages of a telephone line, across the country.

This was still way before the Internet became widespread. At that time, you had UUCPNet, Bitnet, the still autonomous Usenet, FIDO and other BBS systems, which were individual machines connected by modem links. They dialed each other up several times a day to exchange messages.

The trouble started when computer companies managed to sell more than one computer of the same model. Until then, computers had rarely been compatible with each other, and software that worked here usually didn't run elsewhere. But now, it became possible to exchange software, and again there were a couple of students who wanted to exchange software not on bulky magnetic tapes, but by sending it with their usual E-Mail software.

However, this did not work. All the software that was already in place to transport and forward E-Mail was programmed to expect text messages only. But software, like sound files and images later on, is not text but "binary", meaning that it is composed of all 256 possible values of a byte instead of the very limited set of characters that plain text uses. To see what I mean, try to open an image within a text editor. When they started to send binary files, the files were rejected by their mail software. If the software was dumb enough to accept such a file, the message was likely to break the communication link, for example if the binary program contained the special character for "end of input" at some point.

They were stonewalled by this only for a moment, but soon thought of ways around that problem, and invented a classic workaround. Since the existing software that delivered E-Mail messages worked so well, they didn't want to change it. So instead, they chose to encode binary files into plain text on the sending side, and to decode plain text back into binary files on the receiving end.

This was how "uuencoding" was born. Uuencoding takes three bytes of a binary file (3 bytes are 24 bits) and splits them up into four 6-bit values. Each 6-bit value is then encoded into a plain text character using a character table. 6 bits have 64 possible combinations, so you need to choose 64 characters for encoding.
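The three-bytes-to-four-characters step can be sketched in a few lines of Python. This is only an illustration of the core idea, not a complete encoder: it ignores the length character at the start of each line and the "begin" and "end" lines, and it assumes the original character table (value v becomes the character with ASCII code 32 + v). The function names are made up for this example.

```python
def uuencode_group(three_bytes: bytes) -> str:
    """Encode exactly three bytes as four characters from ASCII 32..95."""
    if len(three_bytes) != 3:
        raise ValueError("expected exactly 3 bytes")
    # Join the three bytes into one 24-bit integer.
    bits = int.from_bytes(three_bytes, "big")
    # Cut it into four 6-bit values, most significant first.
    sextets = [(bits >> shift) & 0x3F for shift in (18, 12, 6, 0)]
    # Look each value up in the table: here simply ASCII code 32 + value.
    return "".join(chr(32 + v) for v in sextets)


def uudecode_group(four_chars: str) -> bytes:
    """Reverse the step above: four characters back into three bytes."""
    if len(four_chars) != 4:
        raise ValueError("expected exactly 4 characters")
    bits = 0
    for ch in four_chars:
        bits = (bits << 6) | ((ord(ch) - 32) & 0x3F)
    return bits.to_bytes(3, "big")


print(uuencode_group(b"GIF"))   # 1TE&
print(uudecode_group("1TE&"))   # b'GIF'
```

Every GIF image starts with the letters GIF, which is why uuencoded GIF files begin with the characters 1TE& right after the length character of the first line.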

Using uuencoding, you could take a binary file, for example an image, and encode it into gibberish like this:


begin 644 Frank.gif
M1TE&.#=A'``E`/4``#`T,#@X.#@\.$!`0$!$0$A(2$A,2%!04%!44%A86%A<
M6&!@8&!D8&AH:&AL:'!P<'!T<'AX>'A\>("`@("$@(B(B(B,B)"0D)"4D)B8
MF)B<F*"@H*"DH*BHJ*BLJ+"PL+"TL+BXN+B\N,#`P,#$P,C(R,C,R-#0T-#4
MT-C8V```````````````````````````````````````````````````````
M`````````````````````````````````"P`````'``E```&_D!#X6`X'!(9
MR(.@T$`T$H7%\P$I$HK&HZ$@"@T13\:1P%@$$LJD$L%$3)P%@M%P7!/&1\3P
M$"SD!1<1$AD8&1T='Q8)6'\*#(P)"`4&"!D%!`4#DQ4,EAH8$A4'"%@,#`L&
M#0,$K`0"KP*L!0P"`@X7#Q.E#`I_"P<""!8#M@(!Q\C%M@00#`^26:4)KP,*
ML,C'M@'<V0T'"PD%<P@("LO;VL;(V0*8`P\&!EWFR^WLW-O*!9@11N4'%,1*
MU@U9`5G:-K6"@$>!@E(#"G8;8"!!@R*L!C"((.Z(0R[G)-I*`*'D`SD.(DRP
MD,$#DP4,)#!P4!#6(P<0(NA468%"_@4+&CH86/!(0@0(!;K!>O``Y\D%6TKN
M(=`@P9\$1HU"R\;4044$K0@D8,"MF0-("29,,'JV%9\MXF@-.-!@E`(*&QY!
M4J,FPKDA#AHT0%45`=U<$2QL&%QU[025@2%DF9E32:\&%#I8`*H!)H*=$2BH
M5'#VF9*C#18T>/)3P^)>%B(PG4!A`83`='!*@$#T69@'%S!L66!!@H,'.0%<
M%MQ4]L--!RK0EM#4@58Z$#`X**VEZ>2'#6BKW0V!^G80K7#3V2)XL.JF!D3K
MA#"!??D`GGJI/IL@`*4#!4BP$G)J4?``*@]T(0\IOOB"2B6GE$>=6A.4E%J"
M""Q8"186Y?72T`.Q/?"328)!4H0\!@Q`R2.I623!BTRE85Q@#7Q@!!&4:#+`
M`&-EP51L1PDHVQ8A5$`-$5X0$!8",`VVVW@1;/<`""944)407F"B)"EIB6""
M=6I4V``)5-8GAP$"G"C$BA,,`$$=$@["0`@B4"!!:EV\8M@7\U!PQ"EU;.>`
M!R.,4$%3#92#B3#_D/(`!@4D^L=Q"X@``@9*\`?004,8@45%=WRT0`E)\"88
M'HP8\,H_Y0!@``,$%#%-"<75=V`O5QA6C#REG+.`DD-<T4$X#<26"E$',$"$
-BK$FX`X$JE)B6!``.P``
`
end


This basically worked very well. The only problem was the choice of the character table. Uuencoding took a naive approach and chose, for its original character table, the ASCII values 32 to 95. It was quite a simple choice, done without much reflection, and this soon became a problem.

One problem was the choice of the space character (ASCII 32). Some email software, in an effort to shorten transmission time, cut off space characters at the end of a line; other software even tried to compress multiple space characters within a line, or replaced them with tabs, each time making the result undecodable.

Then there were computers from IBM, which did not use the ASCII character set, but, in an IBM effort to make users dependent on their terminals and other hardware as well, used the IBM-specific EBCDIC character set, which was missing several characters from the uuencoding character table.

Variants of uuencoding soon started to spread. Uuencoding itself remained the most popular, once a special exception had been made to use the backtick ` (ASCII value 96) instead of the space character to encode the value 0; the similar xxencoding followed. But the damage had been done. From then on, you could never be sure whether you would be able to decode a file that you had received.
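The space problem and its workaround can be tried out with Python's standard binascii module, whose b2a_uu function encodes one line of up to 45 bytes and takes a backtick flag (Python 3.7 or later) to write the character with ASCII value 96 in place of the space. The payload and the "trimming" step below are made up for illustration; they only imitate what an over-eager mail gateway might do.

```python
import binascii

data = b"\x00\x00\x00Hi\x00"   # hypothetical payload that contains zero bytes

# One encoded line each: a length character followed by the encoded data.
classic = binascii.b2a_uu(data).decode("ascii").rstrip("\n")
safer = binascii.b2a_uu(data, backtick=True).decode("ascii").rstrip("\n")

print(repr(classic))   # '&    2&D '
print(repr(safer))     # '&````2&D`'

# A gateway that cuts off blanks at the end of a line shortens the classic form ...
trimmed = classic.rstrip(" ")
print(len(classic), len(trimmed))   # 9 8

# ... while the variant has no blanks that could get lost.
print(safer.rstrip(" ") == safer)   # True
```

Some decoders, Python's a2b_uu among them, quietly assume that missing characters at the end of a line were spaces. A strict decoder cannot do that, and nothing helps once spaces in the middle of a line have been compressed or turned into tabs.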

If you are interested in a technical description of uuencoding and its brothers, you can find one in the appendix to the UUDeview decoding library documentation.


The Multimedia Age


Over time, new problems appeared; the most prominent was a problem with large files. When people started to exchange multimedia data like images, sounds, or videos, the files could grow to huge sizes up into the megabytes.

But again, some mail transport agents were not up to it. Some had limited internal buffers, others refused to accept huge mails for security reasons.

Again, the solution was simple. The encoded data could be split up into multiple chunks that could be sent separately. The user on the other end would have to take care that these pieces were glued back together in correct order -- a task made even more difficult by the fact that e-mail messages do not always arrive in the same order as they were sent.
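Here is a rough sketch of both sides of that job in Python. It assumes that every chunk carries a "part n of total" label, represented here as a hypothetical (number, total, body) tuple; in real life that information was often buried somewhere in the subject line, if it was there at all.

```python
def split_into_parts(encoded: str, lines_per_part: int) -> list[tuple[int, int, str]]:
    """Sender: cut encoded text into numbered chunks of a few lines each."""
    lines = encoded.splitlines()
    chunks = [lines[i:i + lines_per_part] for i in range(0, len(lines), lines_per_part)]
    total = len(chunks)
    return [(n, total, "\n".join(chunk)) for n, chunk in enumerate(chunks, start=1)]


def reassemble(parts: list[tuple[int, int, str]]) -> str:
    """Recipient: put the chunks back in order, whatever order they arrived in."""
    if not parts:
        raise ValueError("no parts received")
    total = parts[0][1]
    by_number = {number: body for number, _, body in parts}
    missing = [n for n in range(1, total + 1) if n not in by_number]
    if missing:
        raise ValueError(f"missing parts: {missing}")
    return "\n".join(by_number[n] for n in range(1, total + 1))


encoded = "\n".join(f"line {i}" for i in range(7))   # stand-in for encoded data
parts = split_into_parts(encoded, lines_per_part=3)
arrived = [parts[2], parts[0], parts[1]]             # mail arrives out of order
print(reassemble(arrived) == encoded)                # True
```

The sender's half is trivial. All the checking, sorting, and complaining about missing pieces lands on the recipient.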

So this was a rather easy solution for the sender, but the recipient had more work to do than before.


The Problem (reprise)


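Let's summarize the problems that we've found out about. Mail software transports plain text only, so binary files must be encoded into text on the sending side and decoded on the receiving end. There are several encodings and character tables around, and mail software or a different character set like EBCDIC may damage some of the characters on the way. And large files must be split into several messages, which the recipient has to glue back together in the correct order.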


The Final Act


Over the years, people have had numerous problems with the encodings mentioned so far, uuencoding and xxencoding. In 1993, some people finally put their heads together to find a lasting solution. Borenstein and Freed eventually published RFC 1521, called "Multipurpose Internet Mail Extensions", in which they carefully crafted new methods to encode data.

In this groundbreaking document, they defined a layout for mail messages that could contain multiple parts, they defined headers to describe a part's contents, and they defined the Base64 encoding for binary data and the Quoted-Printable encoding for non-ASCII plain text (to ease the problems of us Western Europeans with our umlauts and accents).
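Both encodings are easy to try out with Python's standard library (the base64 and quopri modules). The sample bytes and the choice of ISO-8859-1 as the character set for the umlauts are assumptions made for this example.

```python
import base64
import quopri

# Base64: like uuencoding, three bytes become four characters, but the
# table avoids the space and other characters that tend to get damaged.
binary = bytes([0x47, 0x49, 0x46, 0x00, 0xFF])
b64 = base64.b64encode(binary)
print(b64)                               # b'R0lGAP8='
print(base64.b64decode(b64) == binary)   # True

# Quoted-Printable: plain ASCII stays readable, everything else is
# written as an equals sign followed by two hexadecimal digits.
text = "Grüße".encode("iso-8859-1")
qp = quopri.encodestring(text)
print(qp)                                    # b'Gr=FC=DFe'
print(quopri.decodestring(qp) == text)       # True
```

This shows the division of labor: Base64 output is gibberish no matter what goes in, while Quoted-Printable leaves a mostly-ASCII text legible even for a recipient whose software knows nothing about MIME.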

Binary files could, according to MIME, simply be included in a message by way of an attachment. They even defined some headers to declare that separate mail messages belong together, so that MIME-aware software on the recipient's side could piece them back together.
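As an illustration of what MIME-aware software does on both ends, here is a small sketch using Python's standard email package. The addresses and the payload are placeholders; the payload is not a real image, just a few bytes that cover all 256 values.

```python
import email
from email import policy
from email.message import EmailMessage

payload = b"GIF89a" + bytes(range(256))   # stand-in for real image data

msg = EmailMessage()
msg["From"] = "sender@example.org"
msg["To"] = "recipient@example.org"
msg["Subject"] = "A picture"
msg.set_content("The image is attached.")
msg.add_attachment(payload, maintype="image", subtype="gif", filename="Frank.gif")

# What travels over the wire is plain ASCII text, headers and all.
wire = msg.as_string()
print(wire.isascii())                     # True

# The receiving side reads the headers and undoes the encoding.
received = email.message_from_string(wire, policy=policy.default)
part = next(received.iter_attachments())
print(part.get_filename())                # Frank.gif
print(part["Content-Transfer-Encoding"])  # base64
print(part.get_content() == payload)      # True
```

Printing the wire variable shows the multipart layout, the headers describing each part, and the Base64 block that stands in for the binary data.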

(Note for the reader with technical interest: RFC 1521 was replaced in November 1996 by RFC 2045 and its companions, RFC 2046 through RFC 2049.)


And Still ...


MIME is a nice way out of the earlier dilemma, and these days, it is pretty well supported by all sorts of mail software. Why, a lot of users aren't even aware of such problems of the past, because their software is so effective in hiding the encoding and decoding of attachments. But let me repeat: attachments are always encoded and decoded behind your back, because the protocols for transporting mail are text only.

Today, users only notice this if things go wrong. And indeed, in a brave new MIME world, you do not need special software like UUDeview.

However, it didn't prove very easy to convince all the world of MIME's merits. Even after all the years since 1993, there is still mail software around that is not MIME compliant, especially on legacy systems. And even where the software is compliant, users need time to take advantage of these new features.

Even in 2001, some files are being sent not as MIME attachments, but as uuencoded messages, or even using other encodings. And then there is also still broken software, which claims to be MIME compliant but gets some headers wrong, again causing difficulties if the recipient tries to read that attachment with a MIME compliant program.

And that is why software like UUDeview is still necessary.

UUDeview is a program that has seen a lot of encoded data over the years, and knows how to handle a great many unexpected cases. Therefore, if you receive a mail with a lot of gibberish like the sample above, you should try to open it with UUDeview. The chances are good that UUDeview will be able to handle it.

Have Fun!


Other Transports


Note that all the text above applies to E-Mail and Newsgroup transmission of binary files only. SMTP, the Simple Mail Transfer Protocol (the protocol used to transport E-Mail), was first written down in RFC 821 in August 1982; NNTP, the Network News Transfer Protocol (used to distribute newsgroup articles), dates from February 1986 and is documented in RFC 977.

Neither SMTP nor NNTP took binary data into account, driving us into the mess above. Suggestions for extending SMTP were made in RFC 1426, "SMTP Service Extension for 8bit-MIMEtransport", but they were never implemented widely enough for mail software to rely on them.

Other Internet service protocols were designed from the beginning to carry binary data. The most popular protocol these days is HTTP, the Hypertext Transfer Protocol (version 1.1 was defined in RFC 2068 and revised in RFC 2616), and it is binary capable. This is why you do not have a problem downloading images or other files using a Web browser; accordingly, you do not need software like UUDeview for browsing.

The same is true of other protocols like FTP (File Transfer Protocol). FTP is a dinosaur in itself, but the more modern protocols like CORBA's IIOP, or even Napster or Gnutella, do not have a problem with binary data either.


Frank Pilhofer <fp -AT- fpx.de>
Last modified: Thu Jun 7 15:56:29 2001