Why Plain Text

Posted on May 21, 2012 by Luke Maciak tein.co/12033

Recently Chris Wellons shared some really interesting thoughts on why a lot of programmers tend to flock to certain kinds of tools – powerful text editors, plain text formats, markup over WYSIWYG, etc. Here is what he said on the topic:

In my experience, software developers generally prefer some flavor of programmer’s tools when it comes to getting things done. We like plain text, text editors, command line programs, source control, markup, and shells. In contrast, non-developer computer users generally prefer WYSIWYG word processors and GUIs. Developers often have somewhere between a distaste and a revulsion to WYSIWYG editors.

Why is this? What are programmers looking for that other users aren’t? What I believe it really comes down to is one simple idea: clean state transformations. I’m talking about modifying data, text or binary, in a precise manner with the possibility of verifying the modification for correctness in the future.

File formats generated by WYSIWYG word processors tend to be opaque to source control tools – they like to store data in binary format, or lumps of compressed XML soup – where each edit introduces perturbations throughout the file. Tools that let you search and compare such files are few and far between – most are built into the bloated word processors themselves and come burdened with finicky, crippled GUIs. There is nothing as elegant and flexible as the Unix diff command that would work for Microsoft Word or ODT documents. Changes in plain text files, on the other hand, are easily tracked – be it manually or via source control. They produce clean, human readable diffs. I agree with Chris on this – clean state transformations are a killer feature, but they are not the only reason why many of us are repulsed by WYSIWYG.
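To make the "clean, human readable diffs" point concrete, here is a minimal sketch using Python's standard difflib module, which produces the same kind of output as `diff -u`. The two revisions and the file labels are made up for illustration.

```python
import difflib

# Two hypothetical revisions of the same plain text note.
before = ["Plain text is easy to track.\n", "Diffs are readable.\n"]
after = ["Plain text is easy to track.\n", "Diffs are clean and readable.\n"]

# unified_diff yields header lines plus one line per kept, removed or added line.
diff_lines = list(
    difflib.unified_diff(
        before, after, fromfile="notes.txt (old)", tofile="notes.txt (new)"
    )
)
print("".join(diff_lines))
```

Only the changed line shows up, prefixed with `-` and `+`, which is exactly what a source control tool stores and displays. Nothing comparable falls out of two revisions of a binary document.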

Rich text formats actually have many disadvantages compared to plain text, whereas they offer only a single advantage: a promise of user friendliness (a promise which they fail to deliver, but more on that later). Overall, the choice of plain text over rich text is a pragmatic one – it boils down to handling and maintenance. Plain text files are simply much easier and more straightforward to work with – especially when it comes to collaboration or maintaining a large set of data over a long period of time.

Size – Plain text files are typically much smaller in size. WYSIWYG formats are complex and store a lot more than just your data – they also store garbage, and endlessly compounding collections of metadata that can’t be easily pruned from the file. Yes, we currently live in a day and age when Terabyte drives are quite affordable, so storing a few extra megabytes here and there is not a problem. The problem is moving these things around.

I have seen Excel files ballooning to over 30MB in size. Such monstrous files are virtually impossible to transfer over standard email connections. Modern businesses bravely move towards paperless workflows only to realize their network infrastructure cannot support their multi-Gigabyte zip files. Granted this is a problem that would have existed without Office and WYSIWYG formats, but they exacerbate the issue.

How does this affect me personally? Well, I prefer my data to be backed up on a regular basis. Storing a lot of my writing in plain text helps to make backups faster and more space efficient. I’m sure someone will scoff at these meager space savings, but hey – I’ll take what I’ve got – especially since size savings are not the only benefit of using plain text.

Compression – Plain text files compress well. MS Office OOXML files, on the other hand, don’t compress at all – they are already compressed and still huge. This is almost directly linked to the previous point and it is worrisome for the same reasons. Large data dumps from accounting systems can be compressed quite well, but users want these dumps in Excel. A lot of times simply taking such a CSV dump and re-saving it in XLSX format causes a 200-300% increase in size, without the ability to compress it further. Many such files become impossible to email after the conversion.
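The general effect is easy to reproduce. The sketch below is only an illustration, not a measurement of real Office files: it builds a made-up CSV-like dump from a hypothetical vocabulary, compresses it with zlib, and then tries to compress the already-compressed bytes again.

```python
import random
import zlib

random.seed(12033)  # fixed seed so the sketch is repeatable

# Made-up vocabulary standing in for an accounting export.
words = ["invoice", "payment", "customer", "balance", "account", "credit",
         "debit", "ledger", "total", "pending", "approved", "quarter"]
rows = [",".join(random.choice(words) for _ in range(4)) for _ in range(5000)]
csv_bytes = "\n".join(rows).encode("ascii")

packed = zlib.compress(csv_bytes, 9)   # plain text squeezes down nicely
repacked = zlib.compress(packed, 9)    # a second pass gains next to nothing

print(len(csv_bytes), len(packed), len(repacked))
```

The first pass shrinks the text to a fraction of its size, while the second pass leaves the size essentially unchanged. That is the position you are in when the format you were handed is already a compressed container.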
Parsing – Plain text files are easy to work with programmatically. Reading from and writing to text files is very easy in most programming languages. Most of the time all you need is 2-3 lines of code, unless you are working with Java (in which case you need about 200 lines of boilerplate and class declarations). Doing the same with MS Office or ODF files is a complex task that usually requires third party libraries or plugins. Thankfully, these are readily available but they do create dependencies in your code. Not to mention that many such libraries slide out of compatibility as Microsoft keeps tweaking its file formats between Office versions, and not all maintainers can keep up with these changes.

Why would a geek and a weekend hacker like me want programmatic access to notes and text documents? Duh, think about it. I hope you can figure this out on your own.
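As a rough illustration of the "2-3 lines" claim, here is a sketch in Python using only the standard library. The file name and the note contents are hypothetical, and the file lives in a temporary directory so the example runs anywhere.

```python
import tempfile
from pathlib import Path

# Hypothetical notes file in a throwaway directory.
notes = Path(tempfile.mkdtemp()) / "notes.txt"

# Writing is one line...
notes.write_text("buy milk\ncall Chris about diffs\n", encoding="utf-8")

# ...and so are reading and filtering.
lines = notes.read_text(encoding="utf-8").splitlines()
todo = [line for line in lines if "Chris" in line]
print(todo)
```

No third party library is involved, so there is nothing to fall out of compatibility later.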
Search – Plain text files can be easily searched using simple tools such as the Unix command line utilities grep and find. Windows people don’t usually realize this, but these tiny applications can iterate over hundreds of text files in mere seconds. Searching within rich text documents is a much more complex issue and there are few built-in OS level search tools that can accomplish this task. Those that can are unable to do it very fast, because opening and scanning these files for relevant strings is a complex process. Modern desktop users often utilize powerful indexing engines (such as Google Desktop Search or Windows Search) to get around this problem. These are usually custom tailored to the task of parsing specific rich text formats and they slowly scan your drives in the background.

I’m not saying that desktop search is bad – I’m saying that it is overkill if all you want is a quick way to search your notes for specific keywords or sentences. And even if you do desire a database driven search index of your files, generating one for plain text documents takes almost no time, whereas indexing large collections of Word and Excel files will take many hours.

A lot of my old notes (from ancient times) are still locked up in proprietary WYSIWYG formats. Sometimes I want to search through them but I quickly realize I can’t grep that far into the past. A constant reminder of the mistakes of my youth I guess.
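For anyone without grep at hand, the same idea fits in a few lines of Python. This is a minimal sketch, not a replacement for the real tool: the `grep` helper, the directory layout and the file contents are all made up for the example.

```python
import tempfile
from pathlib import Path


def grep(root, needle):
    """Return (file name, line number, line) for every line containing needle."""
    hits = []
    for path in sorted(Path(root).rglob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((path.name, number, line))
    return hits


# Build a throwaway notes directory so the sketch runs anywhere.
root = Path(tempfile.mkdtemp())
(root / "2004.txt").write_text("first draft of the story\n", encoding="utf-8")
(root / "2012.txt").write_text("todo\nwhy plain text: draft\n", encoding="utf-8")

results = grep(root, "draft")
for name, number, line in results:
    print("{}:{}: {}".format(name, number, line))
```

There is no format-specific parser anywhere in that loop, which is why plain text search stays fast and why it cannot reach into old binary documents.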
Resilience – Plain text files are fairly resistant to damage. Even if they become corrupted, large amounts of data can still be recovered. Word and Excel documents become corrupted quite often, to the point where there exists an entire software industry branch for “Microsoft Office File Repair tools” that provide data recovery services to the business sector. And I’m not just talking about actual data corruption – due to disk failure or bad network transfer. I’m talking about corruption that stems from the internal implementation of these files – like for example the “too many formats” Excel issue.
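A small sketch of the difference, under a loud assumption: a zlib stream stands in here for a compressed container format, and a transfer cut off halfway stands in for corruption. Real Office files and real repair tools behave in more complicated ways, and the note contents are made up.

```python
import zlib

# A made-up plain text note, plus a compressed copy of the same bytes.
note = "".join(
    "line {}: remember the milk\n".format(i) for i in range(100)
).encode("ascii")
packed = zlib.compress(note)

# Simulate a transfer that was cut off halfway.
half_text = note[: len(note) // 2]
half_packed = packed[: len(packed) // 2]

# The truncated text is still readable up to the point of damage.
recovered_lines = half_text.decode("ascii", errors="replace").splitlines()

# The one-shot decompress call rejects the truncated stream outright.
try:
    zlib.decompress(half_packed)
    packed_survived = True
except zlib.error:
    packed_survived = False

print(len(recovered_lines), packed_survived)
```

Roughly half of the plain text lines come back intact, while the naive read of the compressed copy yields nothing without extra recovery work.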
Privacy – Microsoft Office products sometimes inadvertently include sensitive information in their files. There are methods to remove that information, but the default is to preserve it. A lot of people have been burned by this “feature”, including high profile British politicians.

It’s astounding that this is even a problem, but somehow, someone down the line made a decision that rich text formats ought to carry a lot of meta-data with them, and it became a de-facto standard. While I’m not against meta-data on principle (it is beneficial when you want to index and categorize content for fast searching – which as I mentioned is a problem in the rich text world), I am very much for privacy. I always wonder how much information business organizations leak out by simply emailing each other Word documents.

It is even better when two security conscious companies exchange AES encrypted zip files, via PGP encrypted emails, while at the same time shedding all kinds of confidential metadata, because the file they exchanged was previously used on 16 other unrelated projects.

Future Compatibility – Plain text is the safest way to preserve data. Software companies go out of business, file formats fall into disuse (RealPlayer videos, Lotus Notes files, etc.) and standards change. Microsoft may seem an indestructible corporate powerhouse today, but 20 years from now they may no longer exist. And if they do exist, you can bet your sweet ass that they won’t support the Office 2003 format anymore. Locking up your data in proprietary formats is foolish and most people outside the intellectually stunted corporate ghetto of business school graduates realize this.

Open standards are great, and open standards that use plain text file formats are even greater. Why? Because even if the spec is lost, plain text is relatively easy to figure out. Future digital archaeologists will only need to stumble upon or figure out the concept of an ASCII table to be able to decode most plain text documents just by examining their contents in a hex editor.

And yes, I do understand that there are many encodings, and the world of plain text is far from uniform. Still, figuring out the few dozen encodings and their quirks ought to be easier than trying to work out how data is stored in a binary .doc file, without access to a copy of Microsoft Word to reverse engineer the damn thing. Or figuring out the “a thousand zipped XML files” format of OOXML.
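Both halves of that argument can be sketched in a few lines of Python. The hex dump is an illustrative sample, and `guess_encoding` is a hypothetical helper that simply tries a short, hand-picked list of encodings in order; it is not a real detection algorithm.

```python
# Bytes as they might be read off a hex editor (an illustrative sample).
hex_dump = "57 68 79 20 50 6c 61 69 6e 20 54 65 78 74"
raw = bytes.fromhex(hex_dump)

# Decoding ASCII is nothing more than a table lookup per byte.
decoded = "".join(chr(byte) for byte in raw)


def guess_encoding(data, candidates=("ascii", "utf-8", "cp1250")):
    """Return the first candidate encoding that decodes data without errors."""
    for name in candidates:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None


# A Polish word stored in a legacy Windows code page.
legacy = "za\u017c\u00f3\u0142\u0107".encode("cp1250")
guess = guess_encoding(legacy)
print(decoded, guess)
```

The order of candidates matters, since a permissive single-byte code page will happily decode almost anything. Even so, trial and error over a short list is a far smaller problem than reverse engineering a binary container.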

I suspect that a few centuries from now, historians will assume that the majority of the people living in this day and age were Linux nerds, because open source software, open standards and plain text files will be all that remains after us. Scholars will continuously argue about what our people might have been hiding in the Terabytes of inaccessible binary data they can’t decode.

We don’t have to go that far into the future to see the effects of this though. There are ancient folders on my drive that survived my youthful nonchalance towards backup. They contain notes and short stories saved in Corel WordPerfect format – software I no longer own, use or need. Back then I used it, because that’s what was available. I did not know enough to make an educated decision. I could not have known that many years later I would be sitting there, staring at incomprehensible files that I cannot open without downloading some sort of viewer tool from the vendor’s website. Fortunately Corel still exists, and still supports their product. But that did not necessarily have to be the case.

Clarity – If you want to have “formatting” in plain text files you need to use explicit markup. Whether you are using HTML, Markdown or LaTeX you essentially type in your formatting commands as textual blocks. Whoever is working on your file next can probably figure out what you intended to do from the markup even if you messed it up, or if parts were deleted while reformatting. You can easily tweak or clean up said formatting code by hand.

WYSIWYG rich text documents purposefully hide that stuff from the user. The claim is that this is done for clarity and user friendliness, but this is really a matter of perspective. When you are typing up a paper, you may not want to deal with markup. When you are editing and proofreading something for release however, you want to have full control over the document. You want to be able to tweak and correct every aspect of it. WYSIWYG tools wrest that control away from you, and abstract it into a series of useless toolbars and context items.

If you have ever written a serious paper in Word you know the frustration of trying to make it behave – especially if you are trying to make it do some very specific things: for example, to place a figure just so, to lay out tables or images side by side, to have some pages in a different orientation to accommodate large charts. The more you try to do, the harder it becomes to keep it all together, and a small change on one page may have catastrophic effects on everything below. What’s worse, there is usually no way to predict these issues. Editing documents in Word is a game of chance – every time you make a change you have to inspect the entire document for problems, and keep your Undo button ready to back-track.

I have some really neat examples of how the transparency of markup improves the user experience in my LaTeX vs Office post. Please check it out to see animated examples of some of the WYSIWYG issues I mentioned above. This is the failed promise of user friendliness I mentioned. Yes, typing up a Word document is pretty user friendly. Opening said document later, deleting a single character and seeing the entire document collapse upon itself and contort into an unimaginable mess is not friendly at all. Tools like Microsoft Word are only user friendly up to a point.

That last point especially is very, very important. I would say it is more important than clean state transformations. I would say that hidden markup is the root of about 70% of the issues you will run into when working with Word and Excel. I just made that number up, so feel free to substitute another one if you are so inclined. But trust me – a lot of the things you see Word users complain about stem from their inability to comprehend and anticipate WYSIWYG paradigm concepts such as invisible control characters. You simply can’t comprehend why Word does certain things until you understand markup languages – especially opening/closing tags, and what happens when you leave them open.

Here is an interesting thing I discovered: learning markup makes people better at Word. I teach kids HTML. Very basic stuff, mind you – no CSS or anything like that. The most complex thing they create is a table, and for the sake of simplicity I show them the font tag (I know it’s wrong, but it makes more sense than shoe-horning inline CSS there). Then I go back to Microsoft Word and explain to them how it puts these invisible tags in the documents. And it is pretty good at keeping them matched up, but sometimes it fails. So that’s why once in a blue moon you delete something, and it causes the rest of the document to barf up upon itself. I swear to you, you can usually see a little light bulb appear over the heads of like one or two students in the class. The rest of them remain ignorant, but that’s just par for the course. You can’t win them all, but seeing a few select individuals have this world view shattering epiphany is what teaching is all about. All of a sudden these few students instantly understand why Word sucks, and are armed with knowledge on how to battle and tame that software beast. They are no longer baffled by the mysterious “computer doing stupid things” but instead realize that society sold them a $300 piece of shit, and requires them to use it.

So there you have it. I’m pretty sure there are more compelling reasons to use plain text over rich text than these, but this is all that I can think of right now. On the other hand, I can’t think of a single argument for the other side, other than the old and tired “end users won’t understand plain text in notepad” mantra. I honestly don’t know a single logical reason why a self respecting geek would subject himself to a WYSIWYG purgatory. Other than ignorance of course. And there are a lot of ignorant geeks out there.

These guys are almost worse than your common garden variety of a luser (technophobus ignoramus luserati if you want to use fake Latin). Lusers simply don’t know any better, and they don’t understand logic. I have learned to accept that – them creatures are slaves to emotion, and impervious to facts and empirical evidence. We geeks operate a little bit differently – we like to think of ourselves as rational beings. But alas, we don’t always behave that way. There are otherwise wonderfully clever individuals out there who know that MS Word sucks, and love to complain about it, but won’t take the plunge and peel themselves away from WYSIWYG. You show them LaTeX and they scoff at the syntax. You give them Markdown, and they go “Yea, but no.” Then they go and rant for hours about how no one can make a word processing tool that works. They see the problem, and yet they refuse to accept the solution. WYSIWYG simply does not work.

Over the years I have learned that smart people do stupid things all the time. One can be an expert in one field and a complete idiot in a related one. The most valuable traits in a human being are an open mind, willingness to try new things and an ability to admit and learn from mistakes. If you find a person who exhibits all three of these traits, hire them immediately. Or marry them – whichever seems more appropriate at the moment.

Chris – this is not a refutation of your claim. I do agree that clean state transformations are important, beneficial and definitely a factor. They are just one of many reasons why WYSIWYG sucks. Maybe one day we will have a better way of editing documents that is neither like WYSIWYG, nor like markup. If it’s indeed superior to both, then I’ll switch. But until then, I’m going plain.


16 Responses to Why Plain Text

Victoria says:

May 21, 2012 at 10:32 am

while I’m still afraid of command line at my ripe old age, I do prefer plain text or xml simply because it gives me control. I can easily see what’s inside and edit it without any special tools.

One of the graphic programs I use recently got redesigned and they switched to a dark interface (which I hate), but what’s worse, they don’t offer it as an option, only as the default. So I found the xml file with the new color scheme and changed it back to how I prefer it. No biggie. If it were in some dll, it would’ve been impossible for me.

Greg says:

May 21, 2012 at 11:41 am

Is there a good WYSIWYG editor for LaTeX? It seems like that would be the best of both worlds: user-friendliness combined with an ability to edit the underlying text and markup by hand.

MrJones2015 says:

May 21, 2012 at 2:57 pm

@ Greg:

Yes, and we will call that one Fronttex (in memory of the glorious microsoft frontpage)

IceBrain says:

May 21, 2012 at 3:25 pm

@ Greg:
Good is subjective, but there’s LyX. It’s not exactly WYSIWYG, since it only gives you an approximation of the final result, but I think it’s more than enough (take a look at the screenshots page if you want to see for yourself). It’s also cross-platform, but I haven’t used it on anything other than Linux, where it’s just an apt-get away.

Parminder says:

May 21, 2012 at 3:35 pm

@Greg:
+1 for LyX. Although I haven’t used it in recent years, I successfully used LyX and the LaTeX template provided by my University to format and submit my master’s thesis. Compared to my peers who were using MS Word, I spent almost no time on formatting; I could simply focus on the content.

Chris Wellons says:

May 21, 2012 at 4:03 pm

If only I could get my co-workers to understand all this so that I could stop having to e-mail Office documents back and forth!

Such monstrous files are virtually impossible to transfer over standard email connections.

Heh, I felt bad about e-mailing a 20KB patch last night twice, thinking it’s slightly excessive as far as e-mail on a mailing list goes.

Maybe one day we will have a better way of editing documents that is neither like WYSIWYG, nor like markup.

IceBrain already mentioned it but I’m going to mention it again anyway: LyX is the only thing I’m aware of that’s neither markup nor WYSIWYG but rather something in the middle. You’re already aware of LyX, of course. LyX is not something I’m interested in using, but it’s worth mentioning as a weird in-between tool.

Chris – this is not a refutation of your claim. […] just one of many reasons why WYSIWYG sucks.

I understand. I figure we were due for another WYSIWYG rant from you anyway! :-) When I was linking to you for the word “revulsion” I had to pick which of your several WYSIWYG articles I was going to use.

Luke Maciak says:

May 21, 2012 at 4:28 pm

@ Victoria:

Well, editing DLLs is a nightmare no matter how geeky you are. But you bring up a very interesting, tangentially related topic: how to store application configuration. Hard coding settings is obviously bad, but what is a good format to store your config files in?

XML? YAML? JSON? Generic config format?

@ Greg, @ MrJones2015, @ IceBrain, @ Parminder:

Good point about LyX. The wonderful thing about it is that it gives you full access to the LaTeX source it auto-generates while at the same time giving you a WYSIWYM style abstraction layer:

http://i.imgur.com/6U08m.png

It’s a good compromise I guess, but most of the time I prefer to write my own LaTeX by hand. Plus, LyX has no inline spell checker. Like none whatsoever. In a day and age where my browser has that feature I consider it a huge flaw.

@ Chris Wellons:

20KB? The average size of an email here is close to a meg – including the three color rich text email footer and the obligatory embedded images and backgrounds. :P

IceBrain says:

May 21, 2012 at 4:47 pm

Plus, Lyx has no inline spell checker.

Apparently, they fixed that recently: http://wiki.lyx.org/LyX/NewInLyX20#inline-spelling

Morghan says:

May 21, 2012 at 6:54 pm

I always find it amusing to see all the HTML code in my emails. I have my client set to text only as GNUPG integration for HTML messages was buggy. There is often more formatting and image links than actual content in messages.

It’s not just programmers. I know a little BASIC and some outdated HTML from playing around back in the day, a bit of bash scripting, and some LaTeX but I’m definitely not a programmer. It comes down to the same reason I prefer my own server to using the cloud. If my stuff gets lost or screwed up I want to have only myself to blame.

Trusting someone else with my data and its security is really not something I’m comfortable with. I do have some services, like my Kindle, but I crack the DRM and back it up to my server and external media as soon as I buy something from them.

Luke Maciak says:

May 21, 2012 at 7:29 pm

@ IceBrain:

Well, look at that! LyX 2.0 was finally released. It took them quite a while. I remember seeing that wiki page back in 2010 but 2.0 was nowhere near release then. Hell, I think the first feature requests for inline spellcheck in their bug tracker date back to like 2004 or something.

Ah, the joys of Open Source. :)

@ Morghan:

See, I like the kind of cloud that “syncs up” with the local file system – like Dropbox. It gives me a redundant copy somewhere else. By keeping data in both places I minimize the possibility of loss. If I fuck up, the cloud can bail me out. If the cloud fucks up, I have local copies. The likelihood of both me and the cloud fucking up at the same time is small – not impossible, but smaller than the probability of an individual fuck up on either side.

Adrian says:

May 25, 2012 at 2:38 pm

Word would be so much better if it just offered a ‘source’ view where you could edit all the formatting tags just like you can for LaTeX or HTML.

Janek Warchoł says:

February 6, 2013 at 1:44 pm

Hi Luke&errbody,

you might want to add another advantage of plain text (it may be considered as a manifestation of resilience you already mentioned, but i consider it important enough on its own): recoverability.
Several months ago i accidentally wiped out my home folder. I had a backup, but it was 3 weeks old, and there was some important stuff created during that time, so i turned to data recovery; surprisingly, my data appeared to be more fragmented than i expected (i suspect that’s because my laptop has an SSD). I had luck and recovered some pictures and some spreadsheets, but there was a significant amount of corrupted binary/XML files – on the other hand, when it came to text files, i was able to rescue information even from the ones that were partially overwritten.
Actually, since the majority of recovered files lost their names, i had to grep their contents to find some data i needed. I was quite surprised – it worked fairly well. The worst problem when searching was… lots of xmlish garbage showing in results of the search ;)
Oh, and don’t forget about duplicates! Since SSDs copy all files around, i got up to 10 (maybe even more?) duplicates of some files. In case of plain text ones, it was easy to clean this up, but the other ones…. Some still wait unexamined.

Tim says:

January 31, 2015 at 2:19 am

I may be misapplying the concept of “Clean State Changes” but most of Luke’s points seem to be beneficial consequences of clean state changes. Size, Compression, and Resilience come because there is no extraneous or over-encoded data. Parsing and Search are side-effect free state transformations, the output of a search is a new subset of the file for example.

Privacy and Clarity are kinda linked concepts: private data can be included in a plain text file, but it can’t be hidden from most plain text editing programs. Future Compatibility is pretty much just another flavor of Clarity. These are more about the clean state changes inside the user’s thought process as they reason about a given file.

Of course, until you wrote about it I hadn’t really been able to express why I switched to Notepad++ for all my writing as soon as inline spell check showed up. I’ve pretty much always understood that there was a lot of stuff lurking in those Word files, but never connected that with my dislike of the beast, since I never try to do anything complicated with it.
