Recently Chris Wellons shared some really interesting thoughts on why a lot of programmers tend to flock to certain kinds of tools – powerful text editors, plain text formats, markup over WYSIWYG, etc. Here is what he said on the topic:
In my experience, software developers generally prefer some flavor of programmer’s tools when it comes to getting things done. We like plain text, text editors, command line programs, source control, markup, and shells. In contrast, non-developer computer users generally prefer WYSIWYG word processors and GUIs. Developers often have somewhere between a distaste and a revulsion to WYSIWYG editors.
Why is this? What are programmers looking for that other users aren’t? What I believe it really comes down to is one simple idea: clean state transformations. I’m talking about modifying data, text or binary, in a precise manner with the possibility of verifying the modification for correctness in the future.
File formats generated by WYSIWYG word processors tend to be opaque to source control tools – they like to store data in binary format, or lumps of compressed XML soup – where each edit introduces perturbations throughout the file. Tools that allow you to search and compare such files are few and far between – most are built into the bloated word processors and come burdened with finicky, crippled GUIs. There is nothing as elegant and flexible as the Unix diff command that would work for Microsoft Word or ODT documents. Changes in plain text files, on the other hand, are easily tracked – be it manually or via source control. They produce clean, human readable diffs. I agree with Chris on this – clean state transformations are a killer feature, but they are not the only reason why many of us are repulsed by WYSIWYG.
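To make the "clean, human readable diffs" point concrete, here is a minimal sketch using Python's standard difflib module. The two revisions and the file names old.txt and new.txt are made up for illustration; any two versions of a plain text file would do.

```python
import difflib

# Two hypothetical revisions of the same plain text note.
old = ["Plain text is easy to track.", "Diffs are readable.", "The end."]
new = ["Plain text is easy to track.", "Diffs are human readable.", "The end."]


def text_diff(before, after):
    """Return a unified diff of two lists of lines as a list of strings."""
    return list(
        difflib.unified_diff(
            before, after, fromfile="old.txt", tofile="new.txt", lineterm=""
        )
    )


# Only the changed line shows up, prefixed with "-" (old) and "+" (new).
for line in text_diff(old, new):
    print(line)
```

The same comparison against a binary word processor file would only tell you that the two files differ, not where or how.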
Rich text formats actually have many disadvantages compared to plain text, whereas they offer only a single advantage: a promise of user friendliness (a promise which they fail to deliver, but more on that later). Overall, the choice of plain text over rich text is a pragmatic one – it boils down to handling and maintenance. Plain text files are simply much easier and more straightforward to work with – especially when it comes to collaboration or maintaining a large set of data over a long period of time.
Size – Plain text files are typically much smaller in size. WYSIWYG formats are complex and store a lot more than just your data – they also store garbage, and endlessly compounding collections of metadata that can’t be easily pruned from the file. Yes, we currently live in a day and age when Terabyte drives are quite affordable, so storing a few extra megabytes here and there is not a problem. The problem is moving these things around.
I have seen Excel files ballooning to over 30MB in size. Such monstrous files are virtually impossible to transfer over standard email connections. Modern businesses bravely move towards paperless workflows only to realize their network infrastructure cannot support their multi-Gigabyte zip files. Granted this is a problem that would have existed without Office and WYSIWYG formats, but they exacerbate the issue.
How does this affect me personally? Well, I prefer my data to be backed up on a regular basis. Storing a lot of my writing in plain text helps to make backups faster and more space efficient. I’m sure someone will scoff at these meager space savings, but hey – I’ll take what I’ve got – especially since size savings are not the only benefit of using plain text.
Compression – Plain text files compress well. MS Office OOXML files, on the other hand, barely compress at all – they are already compressed and still huge. This is directly linked to the previous point, and it is worrisome for the same reasons. Large data dumps from accounting systems can be compressed quite well, but users want these dumps in Excel. A lot of the time, simply taking such a dump as CSV and re-saving it in XLSX format causes a 200-300% increase in size, without the ability to compress it further. Many such files become impossible to email after the conversion.
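A rough way to see why already-compressed formats gain nothing from zipping is to compress the same data twice. The sketch below uses Python's zlib on a made-up CSV-style dump (the "ACME Corp" rows and the fixed random seed are assumptions for illustration, not real accounting data), so the exact numbers will differ from any real export.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical accounting dump: 5000 repetitive CSV rows.
rows = [
    "%d,ACME Corp,2010-01-01,%d.%02d"
    % (i, rng.randrange(100000), rng.randrange(100))
    for i in range(5000)
]
text = "\n".join(rows).encode("ascii")

once = zlib.compress(text, 9)   # plain text -> compressed
twice = zlib.compress(once, 9)  # compressed -> compressed again

print("plain text:       ", len(text), "bytes")
print("compressed once:  ", len(once), "bytes")
print("compressed twice: ", len(twice), "bytes")
```

The first pass shrinks the text dramatically. The second pass has almost nothing left to squeeze, which is the situation you are in when you try to zip a file that is already a compressed archive internally.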
Parsing – Plain text files are easy to work with programmatically. Reading from and writing to text files is very easy in most programming languages. Most of the time all you need is 2-3 lines of code, unless you are working with Java (in which case you need about 200 lines of boilerplate and class declarations). Doing the same with MS Office or ODF files is a complex task that usually requires third party libraries or plugins. Thankfully, these are readily available, but they do create dependencies in your code. Not to mention that many such libraries slide out of compatibility, as Microsoft always tweaks their file formats between Office versions, and not all maintainers can keep up with these changes.
Why would a geek and a weekend hacker like me want programmatic access to notes and text documents? Duh, think about it. I hope you can figure this out on your own.
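For the record, here is roughly what those 2-3 lines look like in Python. The helper names append_note and read_notes, and the notes.txt file in a temporary directory, are hypothetical and only there to keep the sketch self-contained.

```python
import tempfile
from pathlib import Path


def append_note(path, line):
    """Append a single line to a plain text file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")


def read_notes(path):
    """Return the file contents as a list of lines."""
    return Path(path).read_text(encoding="utf-8").splitlines()


# Hypothetical notes file in a throwaway directory.
notes_file = Path(tempfile.mkdtemp()) / "notes.txt"
append_note(notes_file, "buy milk")
append_note(notes_file, "fix backup script")
print(read_notes(notes_file))
```

No third party library, no format version to worry about. The only thing you have to decide is the encoding.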
Search – Plain text files can be easily searched using simple tools such as the Unix command line utilities grep and find. Windows people don’t usually realize this, but these tiny applications can iterate over hundreds of text files in mere seconds. Searching within rich text documents is a much more complex issue, and there are few built-in OS-level search tools that can accomplish this task. Those that can are unable to do it very fast, because opening and scanning these files for relevant strings is a complex task. Modern desktop users often utilize powerful indexing engines (such as Google Desktop Search or Windows Search) to get around this problem. These are usually custom tailored to the task of parsing specific rich text formats, and they slowly scan your drives in the background.
I’m not saying that desktop search is bad – I’m saying that it is overkill if all you want is a quick way to search your notes for specific keywords or sentences. And even if you do desire a database driven search index of your files, generating one for plain text documents takes almost no time, whereas indexing large collections of Word and Excel files will take many hours.
A lot of my old notes (from ancient times) are still locked up in proprietary WYSIWYG formats. Sometimes I want to search through them but I quickly realize I can’t grep that far into the past. A constant reminder of the mistakes of my youth I guess.
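To show how little machinery a grep-style search over plain text actually needs, here is a minimal sketch in Python. It is not a replacement for grep, just an illustration; the function name grep, the *.txt pattern and the two sample files are assumptions made up for the example.

```python
import tempfile
from pathlib import Path


def grep(root, needle, pattern="*.txt"):
    """Yield (path, line number, line) for each line under root containing needle."""
    for path in sorted(Path(root).rglob(pattern)):
        # errors="replace" keeps the search going even if a file has bad bytes.
        with open(path, encoding="utf-8", errors="replace") as f:
            for number, line in enumerate(f, start=1):
                if needle in line:
                    yield path, number, line.rstrip("\n")


# Hypothetical notes directory with two small files.
notes_dir = Path(tempfile.mkdtemp())
(notes_dir / "todo.txt").write_text("buy milk\nfix backup script\n", encoding="utf-8")
(notes_dir / "ideas.txt").write_text("write a backup tool\n", encoding="utf-8")

hits = [(p.name, n, text) for p, n, text in grep(notes_dir, "backup")]
for name, number, text in hits:
    print("%s:%d: %s" % (name, number, text))
```

Doing the same for a folder of word processor documents means unpacking or parsing each file with a format-specific library before you can even look at a single string.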
Resilience – Plain text files are fairly resistant to damage. Even if they become corrupted, large amounts of data can still be recovered. Word and Excel documents become corrupted quite often, to the point where there exists an entire software industry branch for “Microsoft Office File Repair tools” that provide data recovery services to the business sector. And I’m not just talking about actual data corruption – due to disk failure or bad network transfer. I’m talking about corruption that stems from the internal implementation of these files – like for example the “too many formats” Excel issue.
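Here is a small sketch of what "large amounts of data can still be recovered" means in practice. It simulates damage by overwriting four bytes in the middle of a UTF-8 text (the sample lines and the damaged byte range are made up), then decodes what is left with Python's errors="replace" option.

```python
def recover_text(raw):
    """Decode bytes as UTF-8, substituting U+FFFD for anything unreadable."""
    return raw.decode("utf-8", errors="replace")


original = "line one\nline two\nline three\n".encode("utf-8")

# Simulate corruption: clobber the word "line" at the start of the second line.
damaged = bytearray(original)
damaged[9:13] = b"\xff\xfe\xfd\xfc"

lines = recover_text(bytes(damaged)).splitlines()
for line in lines:
    print(line)
```

The damaged bytes turn into replacement characters, and everything around them survives untouched. In a compressed container format, the same four bad bytes can make the rest of the stream unreadable.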
Privacy – Microsoft Office products sometimes inadvertently include sensitive information in their files. There are methods to remove that information, but the default is to preserve it. A lot of people have been burned by this “feature”, including high profile British politicians.
It’s astounding that this is even a problem, but somehow, someone down the line made a decision that rich text formats ought to carry a lot of meta-data with them, and it became a de-facto standard. While I’m not against meta-data on principle (it is beneficial when you want to index and categorize content for fast searching – which as I mentioned is a problem in the rich text world), I am very much for privacy. I always wonder how much information business organizations leak out by simply emailing each other Word documents.
It is even better when two security conscious companies exchange AES encrypted zip files, via PGP encrypted emails, while at the same time shedding all kinds of confidential metadata, because the file they exchanged was previously used on 16 other unrelated projects.
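To give a feel for how this metadata travels, here is a hedged sketch. OOXML documents are zip packages, and author details are typically kept in a core properties part, conventionally named docProps/core.xml. The sketch does not touch a real document: make_fake_docx builds a tiny stand-in package in memory containing only that one part, and the names "Alice" and "Bob" are invented. Real files contain many more parts and properties.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def make_fake_docx(author, last_editor):
    """Build a minimal stand-in package holding only a core properties part."""
    core = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<cp:coreProperties xmlns:cp="%s" xmlns:dc="%s">'
        "<dc:creator>%s</dc:creator>"
        "<cp:lastModifiedBy>%s</cp:lastModifiedBy>"
        "</cp:coreProperties>" % (NS["cp"], NS["dc"], author, last_editor)
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as package:
        package.writestr("docProps/core.xml", core)
    return buf.getvalue()


def leaked_names(docx_bytes):
    """Pull the author fields out of the package without opening any editor."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    return {
        "creator": root.findtext("dc:creator", namespaces=NS),
        "last_modified_by": root.findtext("cp:lastModifiedBy", namespaces=NS),
    }


print(leaked_names(make_fake_docx("Alice", "Bob")))
```

None of this is visible on the page when the recipient opens the document, which is exactly why people forget it is there.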
Future Compatibility – Plain text is the safest way to preserve data. Software companies go out of business, file formats fall into disuse (RealPlayer videos, Lotus Notes files, etc.) and standards change. Microsoft may seem an indestructible corporate powerhouse today, but 20 years from now they may no longer exist. And if they do exist, you can bet your sweet ass that they won’t support the Office 2003 format anymore. Locking up your data in proprietary formats is foolish, and most people outside the intellectually stunted corporate ghetto of business school graduates realize this.
Open standards are great, and open standards that use plain text file formats are even greater. Why? Because even if the spec is lost, plain text is relatively easy to figure out. Future digital archaeologists will only need to stumble upon or figure out the concept of an ASCII table to be able to decode most plain text documents just by examining their contents in a hex editor.
And yes, I do understand that there are many encodings, and the world of plain text is far from uniform. Still, figuring out the few dozen encodings and their quirks ought to be easier than trying to work out how data is stored in the binary .doc file, without access to a copy of Microsoft Word to reverse engineer the damn thing. Or figuring out the “a thousand zipped XML files” format of OOXML.
I suspect that a few centuries from now, historians will assume that the majority of the people living in this day and age were Linux nerds, because open source software, open standards and plain text files will be all that remains after us. Scholars will continuously argue about what our people might have been hiding in the Terabytes of inaccessible binary data they can’t decode.
We don’t have to go that far into the future to see the effects of this though. There are ancient folders on my drive that survived my youthful nonchalance towards backup. They contain notes and short stories saved in Corel WordPerfect format – software I no longer own, use or need. Back then I used it because that’s what was available. I did not know enough to make an educated decision. I could not have known that many years later I would be sitting there, staring at incomprehensible files that I cannot open without downloading some sort of viewer tool from the vendor’s website. Fortunately Corel still exists, and still supports their product. But that did not necessarily have to be the case.
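As a crude illustration of how tractable the encoding problem is, the sketch below simply tries a short list of candidate encodings until one decodes cleanly. The candidate list and the function name guess_decode are assumptions for the example; real detection is fuzzier, and latin-1 accepts any byte sequence, so it only works as a last resort.

```python
# Candidate encodings, strictest first. latin-1 never fails, so it goes last.
CANDIDATES = ["ascii", "utf-8", "cp1252", "latin-1"]


def guess_decode(raw, candidates=CANDIDATES):
    """Return (encoding, text) for the first candidate that decodes raw cleanly."""
    for encoding in candidates:
        try:
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")


print(guess_decode(b"hello"))
print(guess_decode("na\u00efve".encode("utf-8")))
print(guess_decode(b"caf\xe9"))
```

A digital archaeologist with nothing but a hex editor could run the same loop by hand. There is no equivalent shortcut for an undocumented binary format.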
Clarity – If you want to have “formatting” in plain text files you need to use explicit markup. Whether you are using HTML, Markdown or LaTeX, you essentially type in your formatting commands as textual blocks. Whoever is working on your file next can probably figure out what you intended to do from the markup, even if you messed it up, or if parts were deleted while reformatting. You can easily tweak or clean up said formatting code by hand.
WYSIWYG rich text documents purposefully hide that stuff from the user. The claim is that this is for clarity and user friendliness, but this is really a matter of perspective. When you are typing up a paper, you may not want to deal with markup. When you are editing and proofreading something for release however, you want to have full control over the document. You want to be able to tweak and correct every aspect of it. WYSIWYG tools wrest that control away from you, and abstract it into a series of useless toolbars and context items.
If you have ever written a serious paper in Word you know the frustration of trying to make it behave – especially if you are trying to make it do some very specific things: for example, to place a figure just so, to lay out tables or images side by side, to have some pages in a different orientation to accommodate large charts. The more you try to do, the harder it becomes to keep it all together, and a small change on one page may have catastrophic effects on everything below. What’s worse, there is usually no way to predict these issues. Editing documents in Word is a game of chance – every time you make a change you have to inspect the entire document for problems, and keep your Undo button ready to back-track.
I have some really neat examples of how the transparency of markup improves the user experience in my LaTeX vs Office post. Please check it out to see animated examples of some of the WYSIWYG issues I mentioned above. This is the failed promise of user friendliness I mentioned. Yes, typing up a Word document is pretty user friendly. Opening said document later, deleting a single character and seeing the entire document collapse upon itself and contort into an unimaginable mess is not friendly at all. Tools like Microsoft Word are only user friendly up to a point.
That last point especially is very, very important. I would say it is more important than clean state transformations. I would say that hidden markup is the root of about 70% of the issues you will run into when working with Word and Excel. I just made that number up, so feel free to substitute another one if you are so inclined. But trust me – a lot of the things you see Word users complain about stem from their inability to comprehend and anticipate WYSIWYG paradigm concepts such as invisible control characters. You simply can’t comprehend why Word does certain things until you understand markup languages – especially opening/closing tags, and what happens when you leave them open.
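With explicit markup, an open tag is something you can see and even check for mechanically. Below is a minimal sketch of such a check built on Python's html.parser. The class name TagBalanceChecker, the short VOID list and the sample snippets are made up for illustration, and this is nowhere near a real HTML validator.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag (deliberately incomplete list).
VOID = {"br", "hr", "img", "meta", "link", "input"}


class TagBalanceChecker(HTMLParser):
    """Track which tags were opened and never closed."""

    def __init__(self):
        super().__init__()
        self.stack = []      # tags currently open, innermost last
        self.problems = []   # closing tags that did not match

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.problems.append("unexpected </%s>" % tag)


def unclosed_tags(markup):
    """Return (still-open tags, mismatched closing tags) for a markup string."""
    checker = TagBalanceChecker()
    checker.feed(markup)
    checker.close()
    return checker.stack, checker.problems


print(unclosed_tags("<p><b>bold</b> text</p>"))
print(unclosed_tags("<p><b>bold text</p>"))
```

In the second snippet the bold tag is never closed, and the checker says so. In a WYSIWYG editor the equivalent mistake is invisible until the rest of your document suddenly turns bold.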
Here is an interesting thing I discovered: learning markup makes people better at Word. I teach kids HTML. Very basic stuff, mind you – no CSS or anything like that. The most complex thing they create is a table, and for the sake of simplicity I show them the font tag (I know it’s wrong, but it makes more sense than shoe-horning inline CSS in there). Then I go back to Microsoft Word and explain to them how it puts these invisible tags in the documents. And it is pretty good at keeping them matched up, but sometimes it fails. So that’s why once in a blue moon you delete something, and it causes the rest of the document to barf up upon itself. I swear to you, you can usually see a little light bulb appear over the heads of like one or two students in the class. The rest of them remain ignorant, but that’s just par for the course. You can’t win them all, but seeing a few select individuals have this world view shattering epiphany is what teaching is all about. All of a sudden these few students instantly understand why Word sucks, and are armed with knowledge on how to battle and tame that software beast. They are no longer baffled by mysterious “computer doing stupid things” but instead realize that society sold them a $300 piece of shit, and requires them to use it.
So there you have it. I’m pretty sure there are more compelling reasons to use plain text over rich text than these, but this is all that I can think of right now. On the other hand, I can’t think of a single argument for the other side, other than the old and tired “end users won’t understand plain text in notepad” mantra. I honestly don’t know a single logical reason why a self-respecting geek would subject himself to a WYSIWYG purgatory. Other than ignorance of course. And there are a lot of ignorant geeks out there.
These guys are almost worse than your common garden variety of a luser (technophobus ignoramus luserati if you want to use fake Latin). Lusers simply don’t know any better, and they don’t understand logic. I have learned to accept that – them creatures are slaves to emotion, and impervious to facts and empirical evidence. We geeks operate a little bit differently – we like to think of ourselves as rational beings. But alas, we don’t always behave that way. There are otherwise wonderfully clever individuals out there who know that MS Word sucks, and love to complain about it, but won’t take the plunge and peel themselves away from WYSIWYG. You show them LaTeX and they scoff at the syntax. You give them Markdown, and they go “Yea, but no.” Then they go and rant for hours about how no one can make a word processing tool that works. They see the problem, and yet they refuse to accept the solution. WYSIWYG simply does not work.
Over the years I have learned that smart people do stupid things all the time. One can be an expert in one field and a complete idiot in a related one. The most valuable traits in a human being are an open mind, willingness to try new things and an ability to admit and learn from mistakes. If you find a person who exhibits all three of these traits, hire them immediately. Or marry them – whichever seems more appropriate at the moment.
Chris – this is not a refutation of your claim. I do agree that clean state transformations are important, beneficial and definitely a factor. They are just one of many reasons why WYSIWYG sucks. Maybe one day we will have a better way of editing documents that is neither like WYSIWYG, nor like markup. If it’s indeed superior to both, then I’ll switch. But until then, I’m going plain.