File Format Overhead for Data Storage
I’m not sure if I ever did this experiment here or not. I might have touched on this subject in the past but I don’t think I ever committed a full post to the issue. But let’s start from the begging…
I was messing around with a relatively small output file from an old scientific experiment. It was few hundred lines, and the data was organized into 4 tab separated columns. I kept it in basic ASCII format because that was how the application dumped it’s data, and because it is the best file format for manipulation. I can use all the wealth of unix based text processing tools to sort, edit, filter and parse it, or even graph it with tools like GNUPlot. Naturally to exchange this data with “normal” people (and I’m using the word normal in the most derogatory, condescending way possible) I have to put it in a different format. Lusers just don’t seem to be able to read plain text files these days for some reason. If the data is not in Excel it is un-parsable and scary. This is what I noticed:

My plain text data file was 11 KB. When I opened it up and saved it with Excel it became 28 KB. Yes, this simple conversion more than doubled the size of my file. And all I did here was just simple conversion from one format to another. The additional 17 KB are just pure meta data that stores file formating and the like. Naturally as you could expect the file size doesn’t always double. I took another file, this time much larger weighing in at 7.9 MB and repeated the same process. It grew to nearly 8.3 MB.

This is much more than 17 KB of metadata we needed in the first case. So the overhead grows as the function of the file size. Thia makes me wonder. How much space do we really waste keeping our data in the inefficient MS Office formats? Let’s take it one step further. The image below shows us what happens when we attempt to zip compress all these files:

My first text file shrunk down to 5 KB yielding 82% compression. The compression for the bigger text file was even better. It went down to 58 KB losing over 99% of it’s former size. The compression on the excel files on the other hand was disappointing. First file wend down to 26 KB shrinking by only 7%. The second file did even worse shrinking by less than 100 KB and loosing only 2% of it’s capacity.
There is naturally a reason for this. In case you didn’t know the OpenXML files are really zip compressed directory trees full of verbose MSXML. They are already compressed - there is not much we can do about it! This files are and will be huge for many reasons.
The striking realization here is that I can aggressively compress my plain text data for huge space savings. I’m talking about Terabytes of data across the world. The MS office files just do not shrink this way. They pretty much stay the same size no matter what you do to them. And the MS Office format is unfortunately the proffered way of storing information for millions of companies out there. Go to any office out them and ask them to have a peek into their archival files. I bet they will be mostly word and excel documents scattered all over the place. Most people are using these incredibly wasteful file formats around and think nothing off it. It’s actually kida scarry to think how much space is wasted this way.
Think of the savings we could get if we moved to storing plain text data as a rule. Think about the accessibility benefits! Think about portability. Plain text is the simplest, most portable and easiest format to work with. And yet it is used very rarely for anything serious these days. If we could just cut down on the gratuitous use of the Office programs we could see dramatic changes all around. And no - it wouldn’t be step backwards. It would be a step in the right direction. First step back on the path from which we strayed around the time Microsoft figured out they can squeeze money out of pointy haired managers and directors by ruthlessly locking them into their platform.
But this is a moot point. Most of these poor vendor locked in souls will fight tooth and nail to to maintain status quo. After all, change is scary.
Related Posts:

February 7th, 2008 at 12:17 pm (8002) [Quote]
Wow! Nice experiment you got there. I presume the text files are in CSV format?
I know that the same data kept in various formats will definitely have differing file sizes due to the metadata, but that compressed text file really got me thinking. How much space have I wasted and could have saved on my office PC? The 7861kB text file compressed to 58kB is just too cool. It also kinda shows the inefficiency of Excel 2007 though. Wonder how efficient is OpenOffice’s Calc? Gotta try this out someday!
Posted usingFebruary 7th, 2008 at 12:37 pm (8005) [Quote]
And what about XML files?
I think it’s even more convenient to store data in xml files.
Especially these days, when lots of programming languages can handle XML.
I did a quick test with some Xml file (13 Mb)
Posted usingIt compressed to a smooth 300Kb
February 7th, 2008 at 1:00 pm (8009) [Quote]
XML is good too. But then again both ODF and OpenXML formats are XML based. But you see what happened when I tried compressing it. So it really depends on how do you use the XML.
Posted usingFebruary 7th, 2008 at 2:04 pm (8012) [Quote]
This being why I use notepad when all I want is a quick, plain little file. Waiting for Office (be it Open or MS) to load up just feels like dragging my feet for no good reason. Plus the file sizes are so much smaller, and the program interface so much simpler - just type, save and go back to whatever without worrying about any of the formatting guff
Posted usingFebruary 7th, 2008 at 2:55 pm (8013) [Quote]
OpenOffice does only slightly better (it uses ISO/IEC 26300 OpenDocument format which is also compressed XML files): raw .cvs data: 232 kB; .ods file: 182 kB; .zip of the raw csv data: 65 kB; zip of the .ods file: 170 kB.
Posted usingFebruary 7th, 2008 at 4:30 pm (8014) [Quote]
Heh, it seems that ODF is superior!
God to know.
Posted usingFebruary 7th, 2008 at 6:52 pm (8016) [Quote]
I have to admit that I’m a little suspicious about a file that can be compressed to below 1% of it’s original file size. Was the bigdata file a realworld text file or was it a mocked up file that inadvertently could be easily compressed? What kind of data was it?
Although I think it speaks for itself that the .xls didn’t really compress at all.
Posted usingFebruary 7th, 2008 at 7:52 pm (8018) [Quote]
The file was essentially rows of floating point numbers. They were readings taken at each iteration and they were supposed to converge and stop changing at some point. Here is the sample of the data:
It’s all numeric, no letters and the values are relatively close to each other. It compresses very well. For this test I zipped all the files it using WinRar with the “Best Compression” option.
Posted usingFebruary 8th, 2008 at 2:35 pm (8030) [Quote]
What is this mudkips captcha, I don’t get it.
A blog I used to read (when entries were still forthcoming), 3monkeys, did a file size comparison between doc, xml, txt, & odt about a year ago. They came up with essentially similar same results as you. They also compared file sizes across different apps but the same filetype, different results. OOXML wasn’t out at that time, so no comparison there.
Thanks for this, it is pretty interesting to me. If you ever try it again, maybe run some odf comparisons too.
Posted usingFebruary 8th, 2008 at 2:39 pm (8031) [Quote]
BTW - saving your data in ascii is a great idea for a few reasons: it is more searchable, more easily manipulated, smaller size (as you demonstrated), and you can bet your donkey you’ll be able to open a txt file in 30 years and import it into whatever you need. If your data was only in xls/xlsx or other proprietary formats (see stats program for more examples), in 30 years you’re probably hosed.
Posted usingFebruary 8th, 2008 at 3:38 pm (8033) [Quote]
Dito! You hit the nail on the head. I totally glossed over that, but the availability of your information in the future is paramount. Locking all your data in proprietary formats is not a smart thing to do.
Posted usingFebruary 8th, 2008 at 3:54 pm (8034) [Quote]
Mudkips == 4chan meme. In other words, you probably don’t want to know.
I added it to the word list because it is short and random word that is not really in a dictionary, but it is easy to spell and type in and also acts as a silly inside joke for some.
Posted usingFebruary 11th, 2008 at 5:55 pm (8064) [Quote]
Luke,
The xlsx files are already zip compressed. Rename one to end with zip & you’ll be able to decompress it, then you can see how (un)cooperative m$ has been with their new ‘open’ standard.
Posted usingFebruary 11th, 2008 at 7:26 pm (8069) [Quote]
Will Sheldon said:
Will, I know that. This is why I said this in my post:
Funny thing is that the XML inside is convoluted and it can’t be readily edited. I tried to unzip a Word file, slightly change some text and then zip it back but naturally this doesn’t work. They have checksums and hashes of the content stowed away in more than one place to ensure that modifying OOXML files is as convoluted as possible.
Posted usingMarch 19th, 2008 at 9:58 pm (8555) [Quote]
This may be slightly off-topic, but since you mention it…
There is now a book on Gnuplot: “Gnuplot in Action”. You can pre-order it directly from the publisher: Manning: Gnuplot in Action.
The book assumes no previous familiarity with Gnuplot. It introduces the most basic concepts rapidly, and then moves on to explain Gnuplot’s advanced concepts and power features in detail.
If you want to learn more about the book and the author, check out my book page at Principal Value - Gnuplot in Action.
Let me know if you are interested in a review copy.
Posted usingMarch 20th, 2008 at 12:43 am (8556) [Quote]
Nice! I’d love a review copy if you can spare one.
I’ll shoot you an email.
Posted using