In our LaTeX discussion Craig made a very good observation: Office is abused on a daily basis. People use Word and Excel for things for which plain text would surely suffice. Let’s talk about the benefits of storing your random notes, raw numerical data and informal writing in plain text format. I’ll do this the same way I did with LaTeX – let’s count the reasons why you should use plain text over more complex formats such as doc or PDF.
True Read/Write Portability
Plain text is not tied to a single editor, office suite or IDE. You may think that doc files are portable, because there are many applications that can read and write them. But they are not – how many applications can write them well? How many applications let you edit PDF files on the fly? When you find yourself on some strange bare-bones system that does not belong to you, where you have no admin privileges and cannot install anything – will you be able to access your notes? Let’s say it’s an old Linux box with no X, or an “out of the box” install of Windows 2000. Will you be able to access your doc, xls and pdf files on it?
Every OS ships with a plain text editor – even DOS had one. Text is the only file format that you are guaranteed to be able to read on just about any system. This is why README files are always written in plain text.
And yes, switching between one OS and another might pose a slight inconvenience due to the way different systems encode line breaks. But your data will still be there – just may look a little scrambled or spaced out. Most decent text editors can deal with this issue.
Searchability
There are many mature and efficient tools for parsing and searching text. Unix grep and DOS find are good examples here. You can use them to search for matches across directories, employ regular expressions (at least with grep), and build very complex queries. The algorithms behind them have been perfected and optimized over the years. With more complex file formats this is no longer so easy. You may need to search them via built-in editor tools (i.e. one file at a time), parse binary files, or use cryptic APIs.
The popularity of the so-called desktop search engines such as Google Desktop, Beagle, the new Windows Desktop Search and Spotlight is rising these days, because we can no longer efficiently search our data. You used to be able to simply grep your home directory for a keyword and get relevant hits. But those days are long gone. Our data is locked up in complex, often binary and/or proprietary formats that require a lot of processing to be parsed. We can no longer efficiently scan through hundreds of files in real time, which is why we are moving toward indexing.
But it doesn’t have to be that way. You shouldn’t need to run a background indexing daemon just to be able to search inside your files. If you store most of your notes in plain text, you may notice that you hardly need those resource-stealing engines.
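To make the grep idea concrete, here is a rough sketch of a recursive regex search over a directory of notes. It is only an illustration: the function name `grep_tree` and the `*.txt` default are my own assumptions, and real grep is far faster and more capable.

```python
import re
from pathlib import Path


def grep_tree(root, pattern, glob="*.txt"):
    """Yield (path, line_number, line) for every line under root matching pattern.

    Hypothetical helper for illustration; not a replacement for grep.
    """
    regex = re.compile(pattern)
    for path in sorted(Path(root).rglob(glob)):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a file contains odd bytes.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if regex.search(line):
                    yield path, number, line.rstrip("\n")
```

Something like `for path, number, line in grep_tree("notes", r"invoice\s+\d+"): print(path, number, line)` would then list every hit, assuming a `notes` directory exists.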
Ease of Modification
There is nothing easier than doing search and replace on plain text files. You can do it in batches using sed, awk, perl or hundreds of other tools. It’s easy, fast and efficient. How fast can you do that with other types of files? Can you do it in batches? Perhaps. But plain text is just that – plain. You don’t need additional modules, you don’t need specialized software. You just grab the first tool that can run regexps on files, and write simple scripts. Nothing can beat that.
Even text-based formats such as HTML, XML or LaTeX are harder to parse, because you need to watch for the markup. Good luck quickly changing your company or product name in documentation composed of a few dozen files in different proprietary and open formats.
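The batch search and replace described above could look roughly like this in a script. Treat it as a sketch under assumptions: `replace_in_tree` is a made-up name, the files are assumed to be UTF-8, and in practice a sed one-liner would do the same job.

```python
import re
from pathlib import Path


def replace_in_tree(root, pattern, replacement, glob="*.txt"):
    """Apply a regex substitution to every matching file; return the files changed.

    Hypothetical helper; assumes the files are UTF-8 encoded text.
    """
    regex = re.compile(pattern)
    changed = []
    for path in sorted(Path(root).rglob(glob)):
        if not path.is_file():
            continue
        original = path.read_text(encoding="utf-8")
        updated, count = regex.subn(replacement, original)
        if count:  # only rewrite files that actually contained a match
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed
```

Renaming a product across a docs folder would then be a single call such as `replace_in_tree("docs", r"\bOldName\b", "NewName")`, where both names are placeholders.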
Ease of Parsing and Filtering
Plain text is also extremely easy to parse and filter. In most programming languages reading a file into memory is a two-liner. First line opens a character stream to the file, the other one reads from it. In some environments you may need a loop to get the whole file into memory, and in some cases (like Perl for example) you can just slurp the whole thing up into an array, splitting it on a given delimiter (say space, or a newline). You can write a plain text parser for your specific file in a matter of minutes.
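For illustration, here is what that two-liner might look like in Python, plus a tiny parser for delimiter-separated lines. The function names and the tab delimiter are assumptions made for the example, not anything prescribed above.

```python
def read_lines(path):
    """Read a whole text file into memory and split it into lines."""
    with open(path, encoding="utf-8") as handle:  # line one: open a character stream
        return handle.read().splitlines()         # line two: read it all and split


def parse_records(lines, delimiter="\t"):
    """Split each non-blank line into fields on the given delimiter."""
    return [line.split(delimiter) for line in lines if line.strip()]
```

Calling `parse_records(read_lines("contacts.txt"))` on a hypothetical tab-separated contacts file would give back a list of field lists, one per line.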
Writing an XML, HTML or PDF parser – now that’s a project. You will probably want to import a class or a module written by someone else, because doing it from scratch would be neither easy nor straightforward.
Ease of Concatenation and Splitting
How do you concatenate doc files? How do you append all the columns from one Excel file to another? In most cases this involves opening both files in separate windows and some copying and pasting. With text files this is unimaginably simple. To concatenate files simply cat and append. To combine column-wise use paste. There is no need to open files – it can all be done in batch, from within a script, or even programmatically from within your application.
Same with splitting files – you can split them line-wise, or use some parsing and filtering to subdivide them based on content, or some other defined rules. No other file format can be manipulated this easily, by such a wide variety of tools.
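If you want to do the same from inside an application, the three operations fit in a few lines. This is a sketch that only approximates cat, paste and split -l on in-memory strings; the function names are mine.

```python
from itertools import zip_longest


def concatenate(texts):
    """Rough equivalent of `cat`: join file contents end to end."""
    return "".join(texts)


def paste_columns(left, right, delimiter="\t"):
    """Rough equivalent of `paste`: join two texts line by line."""
    # zip_longest pads the shorter text with empty fields, as paste does.
    rows = zip_longest(left.splitlines(), right.splitlines(), fillvalue="")
    return "".join(f"{a}{delimiter}{b}\n" for a, b in rows)


def split_lines(text, lines_per_chunk):
    """Rough equivalent of `split -l`: cut a text into chunks of N lines."""
    lines = text.splitlines(keepends=True)
    return ["".join(lines[i:i + lines_per_chunk])
            for i in range(0, len(lines), lines_per_chunk)]
```

Working on strings keeps the example short; for real files you would read them first and write the results back out.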
Ease of Indexing
All the current desktop search engines can index a wide variety of file formats. But guess which files are indexed fastest, with minimal drain on system resources? If you said text files, you guessed it right!
Efficient Storage
There are few data formats more efficient at storing information than plain text. It’s also one of the formats that compresses best. I don’t think I need to convince anyone of that, but let me just make a quick comparison of the popular document formats used these days. I generated random Lorem Ipsum text with 9662 words and saved it as text, a Word document and a PDF. Here are the file sizes:
txt – 65 KB (17 KB zipped)
doc – 96 KB (21 KB zipped)
pdf – 137 KB (129 KB zipped)
Of course these savings may seem minuscule these days. Terabyte drives are currently affordable for the average consumer, so who cares about a few bytes here and there? But then again, you have to admit that files roughly 30% and 50% smaller than doc and pdf respectively are pretty good.
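To check the percentages, here is the arithmetic on the sizes listed above. The `saving` helper is just a name I picked for the example.

```python
# File sizes in KB, taken from the comparison above.
sizes_kb = {"txt": 65, "doc": 96, "pdf": 137}
zipped_kb = {"txt": 17, "doc": 21, "pdf": 129}


def saving(smaller, larger):
    """Fraction of space saved by storing `smaller` instead of `larger`."""
    return 1 - smaller / larger


for fmt in ("doc", "pdf"):
    plain = saving(sizes_kb["txt"], sizes_kb[fmt])
    packed = saving(zipped_kb["txt"], zipped_kb[fmt])
    print(f"txt vs {fmt}: {plain:.0%} smaller, {packed:.0%} smaller when zipped")
```

Unzipped, the text file comes out about 32% smaller than the doc and about 53% smaller than the PDF, which matches the rounded figures.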
With all of the benefits described above, why do we insist on storing documents in other formats? If you don’t need any rich text formatting in your document, why use Word? If all you are storing is a long list of names and phone numbers, why do you need Excel if a tab-separated or comma-separated file would work just as well? Why take time to write out a short memo in Word and send it out as an attachment, if you could simply type the same few sentences in the body of the email? There are many cases in which you do want your documents to look good – and then Word or typesetting software is appropriate.
It’s sad that with the exception of readmes and various log files, plain text is seldom used anymore. And yet, most of the documents we work with on a daily basis are mostly simple text with a few purely cosmetic embellishments. We write, edit, copy and paste plain text. We move it from one editor to another, lift it from one medium, and publish it on a different one. We simply choose to save it in the most inefficient and complex form we can find.
Great article! However, I read somewhere that an ODF text document can be smaller than the same text in a text file when zipped. This is because ODF uses internal compression already, so once you put it in a zip it can be slightly smaller for larger files.
Interesting. Although from my experience, compressing files that already use some sort of compression does not produce good results. When you compress files you usually eliminate redundancy, wasted space etc. If the file is already free of all those things, then there is not much you can do to compress it further…
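A quick way to see this for yourself is to compress some text twice and compare sizes. This sketch uses zlib rather than the zip tool, and the filler vocabulary is made up, so take it as an approximation of what happens inside a zip or ODF file.

```python
import random
import zlib


def sample_text(words=5000, seed=0):
    """Build repeatable filler text from a small made-up vocabulary."""
    vocabulary = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur",
                  "adipiscing", "elit", "sed", "eiusmod", "tempor", "magna"]
    rng = random.Random(seed)
    return " ".join(rng.choice(vocabulary) for _ in range(words)).encode("ascii")


def compression_sizes(data):
    """Return (raw, compressed once, compressed twice) sizes in bytes."""
    once = zlib.compress(data, 9)
    twice = zlib.compress(once, 9)  # compress the already compressed bytes
    return len(data), len(once), len(twice)


raw, once, twice = compression_sizes(sample_text())
print(f"raw: {raw} bytes, once: {once} bytes, twice: {twice} bytes")
```

The first pass shrinks the text a lot, while the second pass has almost no redundancy left to remove.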
DITTO
Some of us even go as far as to use ASCII art to make drawings! I can’t tell you how many times I have made network and computer diagrams using this method.
Now, there is one little hitch with using ASCII. It seems Microsoft went its own way and uses an extra CR (carriage return, control-m, ASCII code 13, whatever) to end a line instead of just an LF (line feed). This doesn’t mean anything unless you need to move the file between Windows and some other OS (I think Windows is the only one that does this). The magic fix is to use the unix2dos or dos2unix commands. However, if you moved a UNIX file to Windows, you will have to use some other method. A good way is to open the file with a web browser and save it. PRESTO! File fixed!
Yep. I think the way different OSes specify line endings is this:
Windows: \r\n (carriage return, line feed)
Linux/Unix: \n (line feed)
Pre-OS X Mac: \r (carriage return)
Almost every editor that has syntax highlighting will also be able to convert between unix and windows line endings.
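If you ever need to do the conversion yourself, it is only a couple of byte replacements. This is a sketch that assumes Windows files end lines with a carriage return followed by a line feed; the function names are mine, and dos2unix or unix2dos remain the usual tools.

```python
def to_unix(data: bytes) -> bytes:
    """Convert CRLF (Windows) and bare CR (old Mac) line endings to LF."""
    # Handle CRLF first so each pair becomes one LF, not two.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def to_windows(data: bytes) -> bytes:
    """Convert any line ending to CRLF without doubling existing CRs."""
    return to_unix(data).replace(b"\n", b"\r\n")
```

Working on bytes avoids any automatic newline translation that text-mode file handling might apply.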
I often have complaints regarding our project managers using Word format for copy decks for the web sites we create. When a user writes a copy deck in Word and gives it to me to put on the web, I often have to do a search and replace to get rid of “MSified” characters, like weird quotes, apostrophes and dashes that MS seems to alter from plain ASCII. If I just copy and paste directly from Word to HTML, the page will produce “?” (question marks) where these special characters were. For example, Word will often turn normal quotes (") into angled quotes, and that doesn’t gel in HTML.
So I usually have to go WORD > NOTEPAD > HTML and then find / replace certain characters.
It’s a real pain, and could be settled by the project managers putting decks into plain text.
mcm
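That find / replace step is easy to script. Here is a minimal sketch that maps a handful of typographic characters back to ASCII; the character list is my own guess at the usual offenders and is certainly not complete.

```python
# Assumed list of common Word "smart" punctuation and ASCII stand-ins.
SMART_TO_ASCII = str.maketrans({
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
})


def asciify(text: str) -> str:
    """Replace typographic punctuation with plain ASCII equivalents."""
    return text.translate(SMART_TO_ASCII)
```

Running pasted copy through `asciify` before dropping it into a page should get rid of most of the stray question marks.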
I hate angled quotes. The only thing they do is cause issues.
Btw, do they actually type HTML in Word? Anyone who actually types any kind of code in Word really needs to have a clue LART’ed into them.
I keep a lot of notes on my computers and most are in plain text format, some are in HTML (because it’s easy to put images into it), and since I started using Ubuntu I’ve been playing some with Tomboy. So yeah, I agree text is usually overlooked and very flexible. And btw I totally hate the DOS/Windows \r\n thing, it complicated my life back when I used to write a lot of C code. Admittedly not much, but it is unnecessary and never made sense to me. And for the record, back in the day of MS Word 95 I created some HTML files using it; needless to say I freaked out when I looked at the files in a text editor. Never repeated that one again, took me longer to clean up the files than it would’ve taken me to code them directly. I knew even then that Word did a poor job of it, I just had to see how poor for myself.
Creating web pages in Word is not a good idea. But people still think that using Front Page is acceptable – even though the code quality is not much better. That’s the scary part. There are many Front Page developers out there who do not realize just how horrible the code they produce is.
Nothing really beats an old-fashioned text editor and a good web browser for writing HTML. Altho I’m kinda fond of the free edition of 1st Page on Windows, cause I ain’t paying for nothing, lol. A couple of months ago I used Screem in Ubuntu and it did OK. Of course neither of these is a true WYSIWYG editor. Still, if you are going to code in HTML, learn it (and CSS while you’re at it), and the best way to do that is in a text editor, IMHO :)
Text editor or an IDE.
For example, Dreamweaver is actually a really good tool for HTML and CSS. It has a really nice editor with syntax highlighting, auto-completion, etc. It’s horribly bloated though – but it works.
But it’s impossible to make good looking and standards compliant pages with a WYSIWYG tool.
Yeah, lots of people love Dreamweaver, but it costs $400, which is simply unacceptable to me. If I decided I must USE it, it would have to be a pirated copy. Pricing software like that FORCES the poor and “the working class” into either piracy, doing without, or using other, perhaps inferior, tools. But anyway, I’ve never seen a WYSIWYG HTML editor that produced good, clean, pretty code. Not really opposed to people using them tho, just if you truly are trying to learn how to code in HTML, a tool like that is not really going to help you. And if you care about standards, a tool like that might even hurt you.
Yup – this is one of the reasons I don’t use Dreamweaver. Too damn expensive. For my HTML, PHP and Perl coding needs I currently use Komodo Edit. :)
For coding html – if you’re a web developer and you’re not using the Firefox extension Firebug, you should slap yourself.
Dynamically update any code and see immediately what it does, mousing over code highlights the parts of the page that code applies to, stuff and more stuff, and then a few more goodies. It does this dynamically for any page you visit, not just your own. Nice. I’m not a web developer myself, but friends of mine use it, and it’s a very sexy piece of software.
I agree, vacri, Firebug is a great FF extension. I’ve used it, tho I’m not using it now. I’m not really a web developer, tho I set up a web site for a local business and for a friend of mine selling organic cookies. That was for free and for fun tho :)
And ah yeah, while we are on the topic of Firefox add-ons, another cool one for web developers is the Web Developer toolbar. It’s very popular and I suppose needs no introduction.
And Luke, I haven’t used Komodo Edit yet. I was planning on trying it out when ya first blogged about it but kinda spaced it out. Kinda too busy right now to play with it, but it certainly looks cool.
I do use Firebug, and yes – it is pretty sweet. Then again, a lot of my work happens on the back end so I don’t use it all that much.
Nothing wrong with plain text – I have notepad pinned to my start menu for when I want to make a quick note of something. MS Word or OOo Writer take too long to load up when all I want to do is write 3 lines of stuff to remind myself of something later.
Although from the sound of it, if I ever want to use plain text for anything “serious” like coding or suchlike then I’d want to get a better text editor than notepad :wink:
Matt`, if you like Notepad, you should look into Notepad2. It has similar load times, a similar look and feel, and a similar memory footprint – but it does have syntax highlighting, and doesn’t mind opening really big text files.
You can actually swap notepad.exe with its executable so that it acts like a seamless drop-in replacement.
There are a ton of Notepad replacements; Notepad++ is the last one I played with. It is similar to Notepad2 but supports tabs and has a few more features. Naturally it probably doesn’t have as small a footprint, but I never checked, I’m just assuming.
Or you could use Vim. :mrgreen: Or Emacs.
There is also a registry hack somewhere that will make Vim the default Windows text editor (it will make it the editor that opens when you do View Source in IE, etc.)
Another small Notepad like text editor is Metapad.
You didn’t mention one of my big reasons for going with text files: proper version control.
If your document is in text form, version control systems will properly handle it. Sure, you can use binary files in version control, but the version control system won’t understand the structure of the information in the file. It will work especially poorly for compressed formats like ODF and OOXML, because even a small change in the document can change the entire file, which is misleading to version control.
Without understanding the content of the file, the version control system can’t do many things, like merging and blame/praise. The repository will also tend to grow in size by O(n), since commits can’t be delta-compressed well (due to the document format compression).
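To show why line-based formats play so well with these tools, here is a small sketch of a line diff between two versions of a text. It uses Python’s difflib purely as a stand-in (Git’s own diff machinery is different), and `changed_lines` is a name made up for the example.

```python
import difflib
from itertools import islice


def changed_lines(old: str, new: str):
    """Return only the added and removed lines between two versions of a text."""
    diff = difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="old", tofile="new", lineterm="",
    )
    # The first two lines of a non-empty unified diff are the file headers.
    body = islice(diff, 2, None)
    return [line for line in body if line.startswith(("+", "-"))]
```

A one-line edit to a text file shows up as one removed and one added line, which is the kind of small, readable change that merge and blame build on.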
For example, I keep my resume in LaTeX and have it checked into a Git repository. If I need to spin off custom versions for some special use, I make a branch and commit the specific changes to it. If I later update my main generic resume I can trivially merge these updates into any custom version branches I have.
I love doing it this way.
A word processor may have some kind of change tracking, but compared to a real, powerful version control system like Git, it’s going to be a sucky half-assed implementation.
I am another one who ‘gets’ this article.
I am a big advocate of using the simplest software possible and sticking to one, monospaced font too.
It makes me wince a little when I look on the drives at work and see people have saved the simplest of notes in spreadsheets or .doc files.
It is even worse when they use tables in Word!
My site [shameless plug] is all about the beauty and simplicity of the plain text file.