UTF-8 history (2003) (doc.cat-v.org)
68 points by mikecarlton 3 days ago | 20 comments
bikeshaving 2 hours ago [-]
There are socio-economic reasons why the early computing boom (ENIAC, UNIVAC, IBM mainframes, early programming languages like Fortran and COBOL) was dominated by the US: massive wartime R&D, university infrastructure, and a large domestic market. But I wonder if the Anglophone world also had an orthographic advantage. English uses 26 letters with no diacritics, compared to languages like Chinese (thousands of characters), Hindi (50+ letters), or French/German (Latin script with diacritics).

That simplicity made early character encodings like 7-bit ASCII feasible, which in turn lowered the hardware and software barriers for building computers, keyboards, and programming languages. In other words, the Latin alphabet’s compactness may have given English-speaking engineers a “low-friction” environment for both computation and communication. And now it’s the lingua franca for most computing, on top of which support for other languages is built.

It’s very interesting to think about how written scripts give different cultures advantages in computing and elsewhere. I wonder, for instance, how scripts and AI interact: LLMs trained on Chinese are working with a high-density orthography backed by a stable, 3,500-year dataset.

jcranmer 14 minutes ago [-]
> English uses 26 letters with no diacritics, compared to other languages like Chinese (thousands of characters), Hindi (50+ letters), or French/German (latin with diacritics).

The English language has diacritics (see words like naïve, façade, résumé, or café). It's just that the English language uses them so rarely that they are largely dropped in any context where they are hard to introduce. Note that this adaptation to lack-of-diacritic can be found in other Latin script languages: French similarly is prone to loss-of-diacritic (especially in capital letters), whereas German has alternative spelling rules (e.g., Schroedinger instead of Schrödinger).

pklausler 34 minutes ago [-]
We got lots done with 6-bit pre-ASCII encodings, actually, like CDC Display Code and Univac's Fieldata. It's more than enough for 26 letters, 10 digits, and lots of punctuation. And there are faint echoes of these early character sets remaining in ASCII -- a zero byte is ^@, for example, because @ was the zero-valued Fieldata "master space" character, which distinguished EXEC 8 control cards from source code and data cards.
duskwuff 21 minutes ago [-]
> a zero byte is ^@, for example, because...

A zero byte is ^@ because 0x00 + 64 = '@'. The same pattern holds for all C0 control codes.
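A quick sketch of that arithmetic (Python, purely for illustration; the helper name is made up):

    # Caret notation: a C0 control code c prints as '^' plus the ASCII
    # character 64 positions later (equivalently, c XOR 0x40).
    def caret(c):
        assert 0x00 <= c <= 0x1F
        return "^" + chr(c + 64)

    print(caret(0x00))  # ^@  (NUL)
    print(caret(0x1B))  # ^[  (ESC)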

ummonk 1 hour ago [-]
The same logic explains why China had all the building blocks (pun intended) of the printing press, yet it was perfected by Gutenberg in Europe, where the number of glyphs was much more manageable.
zahlman 38 minutes ago [-]
Indeed. Even if you try to split hanzi into parts it's far more unwieldy (https://en.wikipedia.org/wiki/Kangxi_radicals).
kps 31 minutes ago [-]
Computer character codes descended directly from pre-computer codes, either teletype or punched card. The advantage extends back through printing to writing itself; having a small, fixed set of glyphs that can represent anything is just better.
flohofwoe 2 hours ago [-]
WinNT missing out on UTF-8 and instead going with UCS-2 for its UNICODE text encoding might have been "the other" billion-dollar mistake in the history of computing ;)

There was a roughly ten-month window between the invention of UTF-8 and the first release of WinNT (Sep 1992 to Jul 1993).

But ok fine, UTF-8 didn't really become popular until the web became popular.

But then missing the other opportunity to make the transition with the release of the first consumer version of WinNT (WinXP) nearly a decade later is inexcusable.

anonymars 1 hour ago [-]
"UTF-8 was first officially presented at the USENIX conference in San Diego, from January 25 to 29, 1993" (https://en.wikipedia.org/wiki/UTF-8)

Hey team, we're working to release an ambitious new operating system in about 6 months, but I've decided we should burn the midnight oil to rip out and redo all of the text handling we worked on to replace it with something that was just introduced at a conference..

Oh and all the folks building their software against the beta for the last few months, well they knew what they were getting themselves into, after all it is a beta (https://books.google.com/books?id=elEEAAAAMBAJ&pg=PA1#v=onep...)

As for Windows XP, so now we're going to add a third version of the A/W APIs?

More background: https://devblogs.microsoft.com/oldnewthing/20190830-00/?p=10...

nostrademons 52 minutes ago [-]
Interestingly, there is another story on the HN front page about Steve Wozniak doing exactly that for the Apple I:

https://news.ycombinator.com/item?id=45265240

toast0 5 minutes ago [-]
The 6502 and the 6800 are pretty similar. The 6501 was pin compatible with the 6800, but not software compatible; the 6501 was dropped as part of a settlement with Motorola.

Changing an in-progress system design to a similar chip that was much less expensive ($25 at the convention vs $175 for a 6800, dropped to $69 the month after the convention) is a leap of faith, but the difference in cost is obvious justification, and the Apple I had no legacy to work with.

It would have been great if Windows NT could have picked up UTF-8, but it's a bigger leap and the benefit wasn't as clear; variable-width encoding is painful in a lot of ways, and 16 bits per code point seemed like it would be enough for anybody.
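For a rough sense of that trade-off, a small Python sketch (illustrative only) comparing byte counts under UTF-8 and UTF-16; note that anything outside the BMP needs a surrogate pair in UTF-16 anyway:

    # Byte counts for the same strings under UTF-8 and UTF-16 (LE, no BOM).
    for s in ["hello", "héllo", "日本語", "😀"]:
        print(s, len(s.encode("utf-8")), len(s.encode("utf-16-le")))
    # hello: 5 vs 10, héllo: 6 vs 10, 日本語: 9 vs 6, 😀: 4 vs 4 (surrogate pair)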

ivanjermakov 1 hour ago [-]
Windows using CP-125X encodings by default in many countries instead of UTF-8 did a lot of damage, at least in my experience.
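A minimal sketch of that kind of damage (Python, using the standard cp1251 and cp1252 codecs): text written under one single-byte code page and read back under another turns into mojibake.

    text = "Привет"
    garbled = text.encode("cp1251").decode("cp1252")
    print(garbled)  # Ïðèâåò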
andrewl-hn 25 minutes ago [-]
For many European languages like French or German, the switch from local CP encodings meant that only a few characters (å, ñ, ç, etc.) would require extra bytes, and thus the switch to UTF-8 was a no-brainer.

On the other hand, Cyrillic and Greek are two examples of short alphabets that could be combined with ASCII into a single-byte encoding for countries like Greece, Bulgaria, Russia, etc. For those locations, switching to UTF-8 meant extra bytes for every character in the local language, and thus higher storage, memory, and bandwidth requirements across the board. So non-Unicode encodings stuck around there for a lot longer.
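A rough size comparison (Python sketch; cp1251 stands in here for the legacy single-byte Cyrillic code pages):

    s = "Привет, мир"
    print(len(s.encode("cp1251")))  # 11 bytes: one per character
    print(len(s.encode("utf-8")))   # 20 bytes: two per Cyrillic letter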

bobmcnamara 1 hour ago [-]
Can't imagine they would've wanted to change encoding between Win3.1 and NT3.1.
masfuerte 44 minutes ago [-]
But they did?
rfl890 2 hours ago [-]
And nowadays developers have to deal with the "A/W" suffix bullshit.
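For anyone who hasn't hit this: most Win32 entry points come in an -A (ANSI/code-page) flavor and a -W (UTF-16) flavor. A Windows-only Python/ctypes sketch, purely for illustration:

    import ctypes  # Windows only
    # MessageBoxW takes UTF-16 strings; MessageBoxA takes bytes in the ANSI code page.
    ctypes.windll.user32.MessageBoxW(None, "Hello", "W flavor", 0)
    ctypes.windll.user32.MessageBoxA(None, b"Hello", b"A flavor", 0)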
gpvos 26 minutes ago [-]
"as told by Rob Pike" is an essential part of the title, can that be added/reinstated?
theologic 2 hours ago [-]
Great story, and one that comes up on Hacker News on a regular cycle: https://news.ycombinator.com/item?id=26735958

While I love the Hacker News purity (it takes me back to Usenet), it makes me wonder if a little AI could take a repost and automatically insert links to the previous postings so people can see the earlier discussions.

xenadu02 2 hours ago [-]
It is worth reading the history of the proposal. The final form is superior to the others, so someone was doing a lot of editing!

Compare the second form and the final one: the use of multiple different letters was eliminated in favor of a single "v" to indicate the bits of the encoded character.

I also chuckle at the initial implementation's note about the desire to delete support for the 4/5/6-byte versions. Someone was still laboring under the UCS/UTF-16 delusion that 16 bits was sufficient.
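For reference, a hand-rolled sketch of the "v"-bit templates as they ended up (Python, checked against the built-in codec; surrogates and out-of-range values are not handled):

    # UTF-8 byte templates, with v marking payload bits:
    #   0vvvvvvv
    #   110vvvvv 10vvvvvv
    #   1110vvvv 10vvvvvv 10vvvvvv
    #   11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
    def utf8_encode(cp):
        if cp < 0x80:
            return bytes([cp])
        if cp < 0x800:
            return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        if cp < 0x10000:
            return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
        return bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])

    for ch in "Aé€😀":
        assert utf8_encode(ord(ch)) == ch.encode("utf-8")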

Rendello 21 minutes ago [-]
They pretty much got their wish: the 5- and 6-byte sequences are gone, along with half of the 4-byte range!

The RFC that restricted it: https://www.rfc-editor.org/rfc/rfc3629#page-11

A UTF-8 playground: https://utf8-playground.netlify.app/
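A quick check of that restriction (Python sketch): with code points capped at U+10FFFF, the longest UTF-8 sequence is four bytes and tops out at F4 8F BF BF.

    print("\U0010FFFF".encode("utf-8").hex())  # f48fbfbf
    # Python won't even construct a code point past the cap:
    # chr(0x110000) raises ValueError.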