File Format Overhead for Data Storage

I’m not sure if I ever ran this experiment here before. I might have touched on this subject in the past, but I don’t think I ever committed a full post to the issue. So let’s start from the beginning…

I was messing around with a relatively small output file from an old scientific experiment. It was a few hundred lines, and the data was organized into 4 tab-separated columns. I kept it in basic ASCII format because that was how the application dumped its data, and because it is the best file format for manipulation: I can use all the wealth of Unix-based text processing tools to sort, edit, filter and parse it, or even graph it with tools like GNUPlot.
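To illustrate what I mean, here is a minimal Python sketch of the kind of quick filtering plain text allows (data.txt is a placeholder name for my file):

    import csv

    # Load the 4-column tab-separated data file into floats.
    with open("data.txt", newline="") as f:
        rows = [[float(x) for x in line] for line in csv.reader(f, delimiter="\t")]

    # Sort the rows and print the five with the smallest first-column readings.
    for row in sorted(rows)[:5]:
        print("\t".join(f"{x:.9f}" for x in row))

Naturally, to exchange this data with “normal” people (and I’m using the word normal in the most derogatory, condescending way possible) I have to put it in a different format. Lusers just don’t seem to be able to read plain text files these days for some reason. If the data is not in Excel it is unparsable and scary. This is what I noticed: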

TXT and XLSX Comparison

My plain text data file was 11 KB. When I opened it up and saved it with Excel it became 28 KB. Yes, this simple conversion more than doubled the size of my file, and all I did was convert from one format to another. The additional 17 KB is pure metadata that stores file formatting and the like. Naturally, as you could expect, the overhead is not always a simple doubling. I took another file, this time much larger, weighing in at 7.9 MB, and repeated the same process. It grew to nearly 8.3 MB.
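By the way, you don’t need to click through Excel to reproduce this conversion. Here is a minimal sketch, assuming Python with the third-party openpyxl package installed; data.txt and data.xlsx are placeholder names:

    import csv
    import os

    from openpyxl import Workbook  # third party: pip install openpyxl

    # Read the tab-separated text file and write it back out as .xlsx.
    wb = Workbook()
    ws = wb.active
    with open("data.txt", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            ws.append([float(x) for x in row])
    wb.save("data.xlsx")

    # Compare the on-disk sizes of the two representations.
    for name in ("data.txt", "data.xlsx"):
        print(f"{name}: {os.path.getsize(name)} bytes")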

Another Comparison

The roughly 400 KB the second file gained is much more than the 17 KB of metadata we needed in the first case, so the overhead grows as a function of the file size. This makes me wonder: how much space do we really waste by keeping our data in the inefficient MS Office formats? Let’s take it one step further. The image below shows what happens when we attempt to zip compress all these files:

Compression Comparison

My first text file shrunk down to 5 KB, yielding 82% compression. The compression for the bigger text file was even better: it went down to 58 KB, losing over 99% of its former size. The compression on the Excel files, on the other hand, was disappointing. The first file went down to 26 KB, shrinking by only 7%. The second file did even worse, shrinking by less than 100 KB and losing only 2% of its size.
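If you want to repeat the comparison, here is a minimal sketch of doing the same thing with Python’s standard zipfile module instead of a GUI archiver (the file names are placeholders):

    import os
    import zipfile

    # Zip each file individually and report the compression ratio.
    for name in ("data.txt", "data.xlsx"):
        archive = name + ".zip"
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as z:
            z.write(name)
        before = os.path.getsize(name)
        after = os.path.getsize(archive)
        print(f"{name}: {before} -> {after} bytes, "
              f"{100 * (before - after) / before:.0f}% smaller")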

There is naturally a reason the Excel files refuse to shrink. In case you didn’t know, OpenXML files are really zip-compressed directory trees full of verbose MSXML. They are already compressed – there is not much we can do about it! These files are, and will remain, huge for many reasons.
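You can verify this yourself. A short sketch, assuming an Excel file named data.xlsx (a placeholder):

    import zipfile

    # An .xlsx file is just a zip archive, so the standard zipfile module opens it.
    with zipfile.ZipFile("data.xlsx") as z:
        for info in z.infolist():
            # compress_size below file_size shows each member is already deflated.
            print(info.filename, info.file_size, info.compress_size)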

The striking realization here is that I can aggressively compress my plain text data for huge space savings. I’m talking about terabytes of data across the world. The MS Office files just do not shrink this way; they pretty much stay the same size no matter what you do to them. And the MS Office format is unfortunately the preferred way of storing information for millions of companies out there. Go to any office out there and ask to have a peek into their archival files. I bet they will be mostly Word and Excel documents scattered all over the place. Most people use these incredibly wasteful file formats and think nothing of it. It’s actually kinda scary to think how much space is wasted this way.

Think of the savings we could get if we moved to storing plain text data as a rule. Think about the accessibility benefits! Think about portability. Plain text is the simplest, most portable and easiest format to work with. And yet it is used very rarely for anything serious these days. If we could just cut down on the gratuitous use of the Office programs we could see dramatic changes all around. And no, it wouldn’t be a step backwards. It would be a step in the right direction: the first step back onto the path from which we strayed around the time Microsoft figured out they could squeeze money out of pointy-haired managers and directors by ruthlessly locking them into their platform.

But this is a moot point. Most of these poor vendor-locked-in souls will fight tooth and nail to maintain the status quo. After all, change is scary. :P

[tags]plain text, excel, xlsx, xls, word, ms office, office[/tags]




17 Responses to File Format Overhead for Data Storage

  1. Muhammad says:

    Wow! Nice experiment you got there. I presume the text files are in CSV format?

    I know that the same data kept in various formats will definitely have differing file sizes due to the metadata, but that compressed text file really got me thinking. How much space have I wasted and could have saved on my office PC? The 7861 kB text file compressing to 58 kB is just too cool. It also kinda shows the inefficiency of Excel 2007 though. I wonder how efficient OpenOffice’s Calc is? Gotta try this out someday! :)

  2. Wikke says:

    And what about XML files?

    I think it’s even more convenient to store data in XML files, especially these days, when lots of programming languages can handle XML.

    I did a quick test with an XML file (13 MB); it compressed to a smooth 300 KB :)

  3. Luke Maciak says:

    [quote post=”2282″]And what about XML files?[/quote]

    XML is good too. But then again, both the ODF and OpenXML formats are XML based, and you saw what happened when I tried compressing them. So it really depends on how you use the XML. :)

  4. Matt` says:

    This is why I use Notepad when all I want is a quick, plain little file. Waiting for Office (be it Open or MS) to load up just feels like dragging my feet for no good reason. Plus the file sizes are so much smaller, and the program interface so much simpler – just type, save and go back to whatever you were doing, without worrying about any of the formatting guff.

  5. I says:

    OpenOffice does only slightly better (it uses the ISO/IEC 26300 OpenDocument format, which is also compressed XML): raw .csv data: 232 kB; .ods file: 182 kB; .zip of the raw .csv data: 65 kB; .zip of the .ods file: 170 kB.

  6. Luke Maciak says:

    Heh, it seems that ODF is superior! :) Good to know.

  7. vacri says:

    I have to admit that I’m a little suspicious about a file that can be compressed to below 1% of its original size. Was the big file a real-world text file, or was it a mocked-up file that inadvertently could be easily compressed? What kind of data was it?

    Although I think it speaks for itself that the .xls didn’t really compress at all.

  8. Luke Maciak says:

    The file was essentially rows of floating point numbers. They were readings taken at each iteration, and they were supposed to converge and stop changing at some point. Here is a sample of the data:

    0.002479512	0.003447492	0.004360885	0.00417913
    0.002243689	0.003142163	0.003983163	0.003899171
    0.002080678	0.002927045	0.003714036	0.00369954
    0.001953793	0.002756688	0.003503308	0.003538016
    0.001857135	0.002622514	0.003341999	0.003409899
    0.001786521	0.002518304	0.003223567	0.003311575
    0.001739003	0.002439494	0.003143176	0.003240258
    0.001712549	0.002382714	0.003097226	0.003193814
    0.001705819	0.002345494	0.003083038	0.003170632
    0.001718007	0.002326056	0.003098644	0.003169526
    0.001748726	0.00232317	0.003142633	0.003189665
    0.001797924	0.002336054	0.003214054	0.003230512
    0.001865815	0.002364298	0.003312335	0.003291781
    0.001952821	0.002407811	0.003437223	0.003373395
    0.002059526	0.002466784	0.003588746	0.00347545
    0.002186619	0.002541665	0.003767171	0.003598183
    0.002334844	0.002633143	0.003972968	0.003741938
    0.002504938	0.002742141	0.004206782	0.003907128
    0.002697553	0.002869812	0.00446939	0.004094192
    0.002913174	0.003017547	0.004761655	0.004303547
    0.003151996	0.003186983	0.005084467	0.004535527
    0.003413796	0.003380021	0.005438662	0.004790304
    0.003697758	0.003598848	0.005824921	0.005067787

    It’s all numeric, no letters, and the values are relatively close to each other. It compresses very well. For this test I zipped all the files using WinRAR with the “Best Compression” option.
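    If you want a rough feel for why such data shrinks so much, here is a little sketch that deflates a block of similar slowly-varying numbers (my own synthetic approximation, not the actual experiment output):

        import zlib

        # Build a few hundred rows of slowly varying floats, like the sample above.
        rows = []
        for i in range(500):
            rows.append("\t".join(f"{0.002 + 0.000001 * i * (c + 1):.9f}" for c in range(4)))
        data = "\n".join(rows).encode("ascii")

        packed = zlib.compress(data, 9)  # level 9 is roughly "best compression"
        print(f"{len(data)} -> {len(packed)} bytes")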

  9. jambarama says:

    What is this mudkips captcha? I don’t get it.

    A blog I used to read (back when entries were still forthcoming), 3monkeys, did a file size comparison between doc, xml, txt, & odt about a year ago. They came up with essentially the same results as you. They also compared file sizes across different apps using the same file type, with differing results. OOXML wasn’t out at that time, so no comparison there.

    Thanks for this, it is pretty interesting to me. If you ever try it again, maybe run some ODF comparisons too. :)

  10. jambarama says:

    BTW – saving your data in ASCII is a great idea for a few reasons: it is more searchable, more easily manipulated, smaller in size (as you demonstrated), and you can bet your donkey you’ll be able to open a txt file in 30 years and import it into whatever you need. If your data is only in xls/xlsx or other proprietary formats (see stats programs for more examples), in 30 years you’re probably hosed.

  11. Luke Maciak says:

    [quote post=”2282″]you can bet your donkey you’ll be able to open a txt file in 30 years and import it into whatever you need. If your data was only in xls/xlsx or other proprietary formats (see stats program for more examples), in 30 years you’re probably hosed. [/quote]

    Ditto! You hit the nail on the head. I totally glossed over that, but the availability of your information in the future is paramount. Locking all your data in proprietary formats is not a smart thing to do.

  12. Luke Maciak says:

    [quote post=”2282″]What is this mudkips captcha, I don’t get it.[/quote]

    Mudkips == 4chan meme. In other words, you probably don’t want to know. ;) I added it to the word list because it is a short, random word that is not really in a dictionary, but it is easy to spell and type, and it also acts as a silly inside joke for some.

  13. Will Sheldon says:

    Luke,

    The xlsx files are already zip compressed. Rename one to end with .zip and you’ll be able to decompress it; then you can see how (un)cooperative m$ has been with their new ‘open’ standard.

  14. Luke Maciak says:

    [quote comment=”8064″]Luke,

    The xlsx files are already zip compressed. Rename one to end with .zip and you’ll be able to decompress it; then you can see how (un)cooperative m$ has been with their new ‘open’ standard.[/quote]

    Will, I know that. This is why I said this in my post:

    [quote post=”2282″]There is naturally a reason the Excel files refuse to shrink. In case you didn’t know, OpenXML files are really zip-compressed directory trees full of verbose MSXML. They are already compressed – there is not much we can do about it! These files are, and will remain, huge for many reasons.[/quote]

    The funny thing is that the XML inside is convoluted and can’t be readily edited. I tried to unzip a Word file, slightly change some text, and then zip it back up, but naturally this doesn’t work. They have checksums and hashes of the content stowed away in more than one place to ensure that modifying OOXML files is as convoluted as possible.

  15. This may be slightly off-topic, but since you mention it…

    There is now a book on Gnuplot: “Gnuplot in Action”. You can pre-order it directly from the publisher: Manning: Gnuplot in Action.

    The book assumes no previous familiarity with Gnuplot. It introduces the most basic concepts rapidly, and then moves on to explain Gnuplot’s advanced concepts and power features in detail.

    If you want to learn more about the book and the author, check out my book page at Principal Value – Gnuplot in Action.

    Let me know if you are interested in a review copy.

  16. Luke Maciak says:

    Nice! I’d love a review copy if you can spare one. :)

    I’ll shoot you an email.

  17. Glenna James says:

    I tried to send an email to the address listed on your contact page, but it was rejected. So I’m taking a chance and asking this question here.

    You helped me with an Excel problem a while back. I went back to your site to get help with this problem, but I didn’t know how to look it up in your archives, so I’m asking you here.

    BACKGROUND

    I am an accountant and I work in governmental accounting. (I am not employed by the government; I just do work for them.)

    I have been working for months (on and off) on a spreadsheet that will allow me to enter information as few times as possible. I have numerous formulas where I enter the information once in a workbook and it pulls to multiple other sheets in the workbook. I work with three different areas, so I did the master work on one area (Central), and a week or so ago, when I finally finished my work, I copied that workbook (Central’s) and made duplicates for the other two areas (Western and Southwest). After copying the original workbook I went into Western and Southwest and entered their data into their workbooks.

    PROBLEM

    With Central only, many of my cells are not pulling the data I enter in the original without help from me. I have to go to the sheet, click in the cell, and then click up at the top in that space (I don’t know what you call it, but immediately to the left of it, it says “fx”).

    Then I hit Enter, and the data is pulled to that cell.

    I checked everything I could think of under Tools and Options. It is set to calculate automatically.

    Of course having to do this defeats the whole purpose of having the formula in the first place.

    The thing that puzzles me is that I have the problem only in the original workbook and not in the two I copied.

    Since I consider you to be a genius, I hope you will be able to help me. Thanks.

    Glenna James

