File Format Overhead for Data Storage

I’m not sure if I ever ran this experiment here before. I might have touched on this subject in the past, but I don’t think I ever committed a full post to the issue. So let’s start from the beginning…

I was messing around with a relatively small output file from an old scientific experiment. It was a few hundred lines, and the data was organized into 4 tab-separated columns. I kept it in basic ASCII format because that was how the application dumped its data, and because it is the best file format for manipulation: I can use all the wealth of Unix-based text processing tools to sort, edit, filter and parse it, or even graph it with tools like GNUPlot.
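To illustrate what I mean, here is a minimal Python sketch of the kind of quick filtering plain text allows (data.txt is a placeholder name for my file):

    import csv

    # Load the 4-column tab-separated data file into floats.
    with open("data.txt", newline="") as f:
        rows = [[float(x) for x in line] for line in csv.reader(f, delimiter="\t")]

    # Sort the rows and print the five with the smallest first-column readings.
    for row in sorted(rows)[:5]:
        print("\t".join(f"{x:.9f}" for x in row))

Naturally, to exchange this data with “normal” people (and I’m using the word normal in the most derogatory, condescending way possible) I have to put it in a different format. Lusers just don’t seem to be able to read plain text files these days for some reason. If the data is not in Excel it is unparsable and scary. This is what I noticed: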

TXT and XLSX Comparison

My plain text data file was 11 KB. When I opened it up and saved it with Excel it became 28 KB. Yes, this simple conversion more than doubled the size of my file, and all I did was convert from one format to another. The additional 17 KB is pure metadata that stores file formatting and the like. Naturally, as you could expect, the overhead is not always a simple doubling. I took another file, this time much larger, weighing in at 7.9 MB, and repeated the same process. It grew to nearly 8.3 MB.
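By the way, you don’t need to click through Excel to reproduce this conversion. Here is a minimal sketch, assuming Python with the third-party openpyxl package installed; data.txt and data.xlsx are placeholder names:

    import csv
    import os

    from openpyxl import Workbook  # third party: pip install openpyxl

    # Read the tab-separated text file and write it back out as .xlsx.
    wb = Workbook()
    ws = wb.active
    with open("data.txt", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            ws.append([float(x) for x in row])
    wb.save("data.xlsx")

    # Compare the on-disk sizes of the two representations.
    for name in ("data.txt", "data.xlsx"):
        print(f"{name}: {os.path.getsize(name)} bytes")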

Another Comparison

The roughly 400 KB the second file gained is much more than the 17 KB of metadata we needed in the first case, so the overhead grows as a function of the file size. This makes me wonder: how much space do we really waste by keeping our data in the inefficient MS Office formats? Let’s take it one step further. The image below shows what happens when we attempt to zip compress all these files:

Compression Comparison

My first text file shrunk down to 5 KB, yielding 82% compression. The compression for the bigger text file was even better: it went down to 58 KB, losing over 99% of its former size. The compression on the Excel files, on the other hand, was disappointing. The first file went down to 26 KB, shrinking by only 7%. The second file did even worse, shrinking by less than 100 KB and losing only 2% of its size.
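If you want to repeat the comparison, here is a minimal sketch of doing the same thing with Python’s standard zipfile module instead of a GUI archiver (the file names are placeholders):

    import os
    import zipfile

    # Zip each file individually and report the compression ratio.
    for name in ("data.txt", "data.xlsx"):
        archive = name + ".zip"
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as z:
            z.write(name)
        before = os.path.getsize(name)
        after = os.path.getsize(archive)
        print(f"{name}: {before} -> {after} bytes, "
              f"{100 * (before - after) / before:.0f}% smaller")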

There is naturally a reason the Excel files refuse to shrink. In case you didn’t know, OpenXML files are really zip-compressed directory trees full of verbose MSXML. They are already compressed – there is not much we can do about it! These files are, and will remain, huge for many reasons.
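You can verify this yourself. A short sketch, assuming an Excel file named data.xlsx (a placeholder):

    import zipfile

    # An .xlsx file is just a zip archive, so the standard zipfile module opens it.
    with zipfile.ZipFile("data.xlsx") as z:
        for info in z.infolist():
            # compress_size below file_size shows each member is already deflated.
            print(info.filename, info.file_size, info.compress_size)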

The striking realization here is that I can aggressively compress my plain text data for huge space savings. I’m talking about terabytes of data across the world. The MS Office files just do not shrink this way; they pretty much stay the same size no matter what you do to them. And the MS Office format is unfortunately the preferred way of storing information for millions of companies out there. Go to any office out there and ask to have a peek into their archival files. I bet they will be mostly Word and Excel documents scattered all over the place. Most people use these incredibly wasteful file formats and think nothing of it. It’s actually kinda scary to think how much space is wasted this way.

Think of the savings we could get if we moved to storing plain text data as a rule. Think about the accessibility benefits! Think about portability. Plain text is the simplest, most portable and easiest format to work with. And yet it is used very rarely for anything serious these days. If we could just cut down on the gratuitous use of the Office programs we could see dramatic changes all around. And no, it wouldn’t be a step backwards. It would be a step in the right direction: the first step back onto the path from which we strayed around the time Microsoft figured out they could squeeze money out of pointy-haired managers and directors by ruthlessly locking them into their platform.

But this is a moot point. Most of these poor vendor-locked-in souls will fight tooth and nail to maintain the status quo. After all, change is scary. :P

[tags]plain text, excel, xlsx, xls, word, ms office, office[/tags]




17 Responses to File Format Overhead for Data Storage

  1. Muhammad says:

    Wow! Nice experiment you got there. I presume the text files are in CSV format?

    I know that the same data kept in various formats will definitely have differing file sizes due to the metadata, but that compressed text file really got me thinking. How much space have I wasted and could have saved on my office PC? The 7861 kB text file compressing to 58 kB is just too cool. It also kinda shows the inefficiency of Excel 2007 though. I wonder how efficient OpenOffice’s Calc is? Gotta try this out someday! :)

  2. Wikke says:

    And what about XML files?

    I think it’s even more convenient to store data in XML files, especially these days, when lots of programming languages can handle XML.

    I did a quick test with an XML file (13 MB); it compressed to a smooth 300 KB :)

  3. Luke Maciak says:

    [quote post=”2282″]And what about XML files?[/quote]

    XML is good too. But then again, both the ODF and OpenXML formats are XML based, and you saw what happened when I tried compressing them. So it really depends on how you use the XML. :)

  4. Matt` says:

    This is why I use Notepad when all I want is a quick, plain little file. Waiting for Office (be it Open or MS) to load up just feels like dragging my feet for no good reason. Plus the file sizes are so much smaller, and the program interface so much simpler – just type, save and go back to whatever you were doing, without worrying about any of the formatting guff.

  5. I says:

    OpenOffice does only slightly better (it uses the ISO/IEC 26300 OpenDocument format, which is also compressed XML): raw .csv data: 232 kB; .ods file: 182 kB; .zip of the raw .csv data: 65 kB; .zip of the .ods file: 170 kB.

  6. Luke Maciak says:

    Heh, it seems that ODF is superior! :) Good to know.

  7. vacri says:

    I have to admit that I’m a little suspicious about a file that can be compressed to below 1% of its original size. Was the big file a real-world text file, or was it a mocked-up file that inadvertently could be easily compressed? What kind of data was it?

    Although I think it speaks for itself that the .xls didn’t really compress at all.

  8. Luke Maciak says:

    The file was essentially rows of floating point numbers. They were readings taken at each iteration, and they were supposed to converge and stop changing at some point. Here is a sample of the data:

    0.002479512	0.003447492	0.004360885	0.00417913
    0.002243689	0.003142163	0.003983163	0.003899171
    0.002080678	0.002927045	0.003714036	0.00369954
    0.001953793	0.002756688	0.003503308	0.003538016
    0.001857135	0.002622514	0.003341999	0.003409899
    0.001786521	0.002518304	0.003223567	0.003311575
    0.001739003	0.002439494	0.003143176	0.003240258
    0.001712549	0.002382714	0.003097226	0.003193814
    0.001705819	0.002345494	0.003083038	0.003170632
    0.001718007	0.002326056	0.003098644	0.003169526
    0.001748726	0.00232317	0.003142633	0.003189665
    0.001797924	0.002336054	0.003214054	0.003230512
    0.001865815	0.002364298	0.003312335	0.003291781
    0.001952821	0.002407811	0.003437223	0.003373395
    0.002059526	0.002466784	0.003588746	0.00347545
    0.002186619	0.002541665	0.003767171	0.003598183
    0.002334844	0.002633143	0.003972968	0.003741938
    0.002504938	0.002742141	0.004206782	0.003907128
    0.002697553	0.002869812	0.00446939	0.004094192
    0.002913174	0.003017547	0.004761655	0.004303547
    0.003151996	0.003186983	0.005084467	0.004535527
    0.003413796	0.003380021	0.005438662	0.004790304
    0.003697758	0.003598848	0.005824921	0.005067787

    It’s all numeric, no letters, and the values are relatively close to each other. It compresses very well. For this test I zipped all the files using WinRAR with the “Best Compression” option.
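    If you want a rough feel for why such data shrinks so much, here is a little sketch that deflates a block of similar slowly-varying numbers (my own synthetic approximation, not the actual experiment output):

        import zlib

        # Build a few hundred rows of slowly varying floats, like the sample above.
        rows = []
        for i in range(500):
            rows.append("\t".join(f"{0.002 + 0.000001 * i * (c + 1):.9f}" for c in range(4)))
        data = "\n".join(rows).encode("ascii")

        packed = zlib.compress(data, 9)  # level 9 is roughly "best compression"
        print(f"{len(data)} -> {len(packed)} bytes")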

  9. jambarama says:

    What is this mudkips captcha? I don’t get it.

    A blog I used to read (back when entries were still forthcoming), 3monkeys, did a file size comparison between doc, xml, txt, & odt about a year ago. They came up with essentially the same results as you. They also compared file sizes across different apps using the same file type, with differing results. OOXML wasn’t out at that time, so no comparison there.

    Thanks for this, it is pretty interesting to me. If you ever try it again, maybe run some ODF comparisons too. :)

  10. jambarama says:

    BTW – saving your data in ASCII is a great idea for a few reasons: it is more searchable, more easily manipulated, smaller in size (as you demonstrated), and you can bet your donkey you’ll be able to open a txt file in 30 years and import it into whatever you need. If your data is only in xls/xlsx or other proprietary formats (see stats programs for more examples), in 30 years you’re probably hosed.

  11. Luke Maciak says:

    [quote post=”2282″]you can bet your donkey you’ll be able to open a txt file in 30 years and import it into whatever you need. If your data was only in xls/xlsx or other proprietary formats (see stats program for more examples), in 30 years you’re probably hosed. [/quote]

    Ditto! You hit the nail on the head. I totally glossed over that, but the availability of your information in the future is paramount. Locking all your data in proprietary formats is not a smart thing to do.

  12. Luke Maciak says:

    [quote post=”2282″]What is this mudkips captcha, I don’t get it.[/quote]

    Mudkips == 4chan meme. In other words, you probably don’t want to know. ;) I added it to the word list because it is a short, random word that is not really in a dictionary, but it is easy to spell and type, and it also acts as a silly inside joke for some.

  13. Will Sheldon says:

    Luke,

    The xlsx files are already zip compressed. Rename one to end with .zip and you’ll be able to decompress it; then you can see how (un)cooperative m$ has been with their new ‘open’ standard.

  14. Luke Maciak says:

    [quote comment=”8064″]Luke,

    The xlsx files are already zip compressed. Rename one to end with .zip and you’ll be able to decompress it; then you can see how (un)cooperative m$ has been with their new ‘open’ standard.[/quote]

    Will, I know that. This is why I said this in my post:

    [quote post=”2282″]There is naturally a reason the Excel files refuse to shrink. In case you didn’t know, OpenXML files are really zip-compressed directory trees full of verbose MSXML. They are already compressed – there is not much we can do about it! These files are, and will remain, huge for many reasons.[/quote]

    The funny thing is that the XML inside is convoluted and can’t be readily edited. I tried to unzip a Word file, slightly change some text, and then zip it back up, but naturally this doesn’t work. They have checksums and hashes of the content stowed away in more than one place to ensure that modifying OOXML files is as convoluted as possible.

  15. This may be slightly off-topic, but since you mention it…

    There is now a book on Gnuplot: “Gnuplot in Action”. You can pre-order it directly from the publisher: Manning: Gnuplot in Action.

    The book assumes no previous familiarity with Gnuplot. It introduces the most basic concepts rapidly, and then moves on to explain Gnuplot’s advanced concepts and power features in detail.

    If you want to learn more about the book and the author, check out my book page at Principal Value – Gnuplot in Action.

    Let me know if you are interested in a review copy.

  16. Luke Maciak says:

    Nice! I’d love a review copy if you can spare one. :)

    I’ll shoot you an email.

  17. Glenna James says:

    I tried to send an email to the address listed on your contact page, but it was rejected. So I’m taking a chance and asking this question here.

    You helped me with an Excel problem a while back. I went back to your site to get help with this problem, but I didn’t know how to look it up in your archives, so I’m asking you here.

    BACKGROUND

    I am an accountant and I work in governmental accounting. (I am not employed by the government; I just do work for them.)

    I have been working for months (on and off) on a spreadsheet that will allow me to enter information as few times as possible. I have numerous formulas where I enter the information once in a workbook and it pulls to multiple other sheets in the workbook. I work with three different areas, so I did the master work on one area (Central), and a week or so ago, when I finally finished my work, I copied that workbook (Central’s) and made duplicates for the other two areas (Western and Southwest). After copying the original workbook I went into Western and Southwest and entered their data into their workbooks.

    PROBLEM

    With Central only, many of my cells are not pulling the data I enter in the original without help from me. I have to go to the sheet, click in the cell, and then click up at the top in that space (I don’t know what you call it, but immediately to the left of it, it says “fx”).

    Then I hit Enter, and the data is pulled to that cell.

    I checked everything I could think of under Tools and Options. It is set to calculate automatically.

    Of course having to do this defeats the whole purpose of having the formula in the first place.

    The thing that puzzles me is that I have the problem only in the original workbook and not in the two I copied.

    Since I consider you to be a genius, I hope you will be able to help me. Thanks.

    Glenna James

