I’m not sure whether I ever ran this experiment here. I might have touched on the subject in the past, but I don’t think I ever devoted a full post to it. But let’s start from the beginning…
I was messing around with a relatively small output file from an old scientific experiment. It was a few hundred lines long, with the data organized into 4 tab-separated columns. I kept it in basic ASCII format because that was how the application dumped its data, and because it is the best file format for manipulation. I can use the whole wealth of Unix text processing tools to sort, edit, filter and parse it, or even graph it with tools like GNUPlot. Naturally, to exchange this data with “normal” people (and I’m using the word normal in the most derogatory, condescending way possible) I have to put it in a different format. Lusers just don’t seem to be able to read plain text files these days for some reason. If the data is not in Excel, it is un-parsable and scary. This is what I noticed:
My plain text data file was 11 KB. When I opened it up and saved it with Excel, it became 28 KB. Yes, this simple conversion more than doubled the size of my file, and all I did was convert from one format to another. The additional 17 KB is pure metadata that stores file formatting and the like. Naturally, as you would expect, the file size doesn’t always double. I took another file, this time a much larger one weighing in at 7.9 MB, and repeated the same process. It grew to nearly 8.3 MB.
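For the record, here is the arithmetic behind those numbers as a quick Python sketch. It assumes the rounded sizes quoted above and 1 KB = 1024 bytes (my assumption, the way a file manager would report them), so treat the output as approximate.

```python
# Sizes as quoted in the post; rounded, so the results are approximate.
KB = 1024          # assumption: binary kilobytes
MB = 1024 * KB

def overhead(text_size, excel_size):
    """Return (extra bytes, extra bytes as a percentage of the text file)."""
    extra = excel_size - text_size
    return extra, 100.0 * extra / text_size

small_extra, small_pct = overhead(11 * KB, 28 * KB)
large_extra, large_pct = overhead(7.9 * MB, 8.3 * MB)

print(f"small file: +{small_extra / KB:.0f} KB ({small_pct:.0f}% larger)")
print(f"large file: +{large_extra / KB:.0f} KB ({large_pct:.0f}% larger)")
```

With these rounded inputs, the absolute overhead of the large file comes out many times bigger than that of the small one, even though it is a much smaller fraction of the original.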
That growth of roughly 0.4 MB is much more than the 17 KB of metadata we needed in the first case, so the overhead grows as a function of the file size. This makes me wonder: how much space do we really waste keeping our data in the inefficient MS Office formats? Let’s take it one step further. The image below shows what happens when we attempt to zip compress all these files:
My first text file shrank down to 5 KB, yielding 82% compression. The compression for the bigger text file was even better: it went down to 58 KB, losing over 99% of its former size. The compression on the Excel files, on the other hand, was disappointing. The first file went down to 26 KB, shrinking by only 7%. The second file did even worse, shrinking by less than 100 KB and losing only 2% of its size.
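If you want to try this kind of measurement yourself, here is a minimal sketch. It uses Python’s `zlib` module, which implements Deflate, the method zip tools normally use. The table it compresses is made-up random data that stands in for my experiment output, and `percent_saved` is just a helper name I picked, so your numbers will differ from mine.

```python
import random
import zlib

def percent_saved(data: bytes) -> float:
    """Deflate `data` at the highest level and return the percentage saved."""
    compressed = zlib.compress(data, 9)
    return 100.0 * (1 - len(compressed) / len(data))

# Made-up stand-in for the experiment output: 4 tab-separated numeric columns.
rng = random.Random(42)  # fixed seed so the run is repeatable
rows = (
    "\t".join(f"{rng.uniform(0, 100):.3f}" for _ in range(4))
    for _ in range(500)
)
table = "\n".join(rows).encode("ascii")

print(f"{len(table)} bytes of text, {percent_saved(table):.0f}% saved by deflate")
```

Random digits are close to a worst case for plain text. Real measurement data with repeated values or slowly changing columns should compress noticeably better.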
There is naturally a reason why the Excel files barely shrink. In case you didn’t know, the OpenXML files are really zip-compressed directory trees full of verbose XML. They are already compressed, so there is not much we can do about it! These files are and will be huge for many reasons.
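The “already compressed” effect is easy to reproduce. The sketch below builds a made-up stand-in for an OpenXML file (a zip archive holding one verbose XML document, not a valid workbook) and then compresses that archive a second time. The element names and the path inside the archive are only there for illustration.

```python
import io
import random
import zipfile
import zlib

# Made-up stand-in for an OpenXML file: a zip archive holding verbose XML.
# It is NOT a valid workbook, just the same kind of container.
rng = random.Random(0)  # fixed seed so the run is repeatable
cells = "".join(
    f"<row><c><v>{rng.random():.6f}</v></c></row>\n" for _ in range(5000)
)
xml = f"<?xml version='1.0'?>\n<sheetData>\n{cells}</sheetData>\n".encode("ascii")

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("xl/worksheets/sheet1.xml", xml)  # illustrative path
archive_bytes = buffer.getvalue()

# Second pass: compress the archive again, as zipping the file would.
second_pass = zlib.compress(archive_bytes, 9)

print("is a zip archive:", zipfile.is_zipfile(io.BytesIO(archive_bytes)))
print("raw XML:        ", len(xml), "bytes")
print("inside archive: ", len(archive_bytes), "bytes")
print("zipped again:   ", len(second_pass), "bytes")
```

The first pass squeezes the repetitive XML down a lot, while the second pass has almost nothing left to work with. You can run the same `zipfile.is_zipfile` check on a real .xlsx file to confirm that it is a zip archive.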
The striking realization here is that I can aggressively compress my plain text data for huge space savings. I’m talking about terabytes of data across the world. The MS Office files just do not shrink this way. They pretty much stay the same size no matter what you do to them. And the MS Office format is unfortunately the preferred way of storing information for millions of companies out there. Go to any office out there and ask to have a peek at their archival files. I bet they will be mostly Word and Excel documents scattered all over the place. Most people use these incredibly wasteful file formats and think nothing of it. It’s actually kinda scary to think how much space is wasted this way.
Think of the savings we could get if we moved to storing plain text data as a rule. Think about the accessibility benefits! Think about portability. Plain text is the simplest, most portable and easiest format to work with. And yet it is used very rarely for anything serious these days. If we could just cut down on the gratuitous use of the Office programs, we could see dramatic changes all around. And no, it wouldn’t be a step backwards. It would be a step in the right direction: the first step back onto the path from which we strayed around the time Microsoft figured out they could squeeze money out of pointy-haired managers and directors by ruthlessly locking them into their platform.
But this is a moot point. Most of these poor vendor-locked-in souls will fight tooth and nail to maintain the status quo. After all, change is scary. :P