<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress/2.0.5" -->
<rss version="2.0" 
	xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
	<title>Comments on: File Format Overhead for Data Storage</title>
	<link>http://www.terminally-incoherent.com/blog/2008/02/07/file-format-overhead-for-data-storage/</link>
	<description>Utterly random, incoherent and disjointed rants and ramblings...</description>
	<pubDate>Sat, 10 Jan 2009 01:49:24 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.0.5</generator>

	<item>
		<title>by: Luke Maciak</title>
		<link>http://www.terminally-incoherent.com/blog/2008/02/07/file-format-overhead-for-data-storage/#comment-8556</link>
		<pubDate>Thu, 20 Mar 2008 04:43:52 +0000</pubDate>
		<guid>http://www.terminally-incoherent.com/blog/2008/02/07/file-format-overhead-for-data-storage/#comment-8556</guid>
					<description>Nice! I'd love a review copy if you can spare one. :)

I'll shoot you an email.</description>
		<content:encoded><![CDATA[<p>Nice! I&#8217;d love a review copy if you can spare one. <img src="http://www.terminally-incoherent.com/blog/wp-includes/images/smilies/icon_smile.gif" alt=")" class="wp-smiley" /> </p>
<p>I&#8217;ll shoot you an email.
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Philipp K Janert</title>
		<link>http://www.terminally-incoherent.com/blog/2008/02/07/file-format-overhead-for-data-storage/#comment-8555</link>
		<pubDate>Thu, 20 Mar 2008 01:58:47 +0000</pubDate>
		<guid>http://www.terminally-incoherent.com/blog/2008/02/07/file-format-overhead-for-data-storage/#comment-8555</guid>
					<description>This may be slightly off-topic, but since you mention it...

There is now a book on Gnuplot: "Gnuplot in Action". You can pre-order it directly from the publisher: &lt;a href='http://www.manning.com/janert/' rel="nofollow"&gt;Manning: Gnuplot in Action&lt;/a&gt;.

The book assumes no previous familiarity with Gnuplot. It introduces the most basic concepts rapidly, and then moves on to explain Gnuplot's advanced concepts and power features in detail.

If you want to learn more about the book and the author, check out my book page at &lt;a href="http://principal-value.com/my-book.php" rel="nofollow"&gt;Principal Value - Gnuplot in Action&lt;/a&gt;. 

Let me know if you are interested in a review copy.</description>
		<content:encoded><![CDATA[<p>This may be slightly off-topic, but since you mention it&#8230;</p>
<p>There is now a book on Gnuplot: &#8220;Gnuplot in Action&#8221;. You can pre-order it directly from the publisher: <a href='http://www.manning.com/janert/' rel="nofollow">Manning: Gnuplot in Action</a>.</p>
<p>The book assumes no previous familiarity with Gnuplot. It introduces the most basic concepts rapidly, and then moves on to explain Gnuplot&#8217;s advanced concepts and power features in detail.</p>
<p>If you want to learn more about the book and the author, check out my book page at <a href="http://principal-value.com/my-book.php" rel="nofollow">Principal Value - Gnuplot in Action</a>. </p>
<p>Let me know if you are interested in a review copy.
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Luke Maciak</title>
		<link>http://www.terminally-incoherent.com/blog/2008/02/07/file-format-overhead-for-data-storage/#comment-8069</link>
		<pubDate>Tue, 12 Feb 2008 00:26:47 +0000</pubDate>
		<guid>http://www.terminally-incoherent.com/blog/2008/02/07/file-format-overhead-for-data-storage/#comment-8069</guid>
					<description>[quote comment="8064"]Luke,

The xlsx files are already zip compressed. Rename one to end with zip &#38; you'll be able to decompress it, then you can see how (un)cooperative m$ has been with their new 'open' standard.[/quote]

Will, I know that. This is why I said this in my post:


[quote post="2282"]There is naturally a reason for this. In case you didn’t know, the OpenXML files are really zip compressed directory trees full of verbose MSXML. They are already compressed - there is not much we can do about it! These files are and will be huge for many reasons.[/quote]

Funny thing is that the XML inside is convoluted and it can't be readily edited. I tried to unzip a Word file, slightly change some text and then zip it back but naturally this doesn't work. They have checksums and hashes of the content stowed away in more than one place to ensure that modifying OOXML files is as convoluted as possible.</description>
		<content:encoded><![CDATA[<p><span style="padding-left: 10px;"><strong>Will Sheldon</strong> said:</span></p>
<blockquote cite="http://www.terminally-incoherent.com/blog/2008/02/07/file-format-overhead-for-data-storage/#comment-8064"><p>
Luke,</p>
<p>The xlsx files are already zip compressed. Rename one to end with zip &amp; you&#8217;ll be able to decompress it, then you can see how (un)cooperative m$ has been with their new &#8216;open&#8217; standard.</p>
</blockquote>
<p>Will, I know that. This is why I said this in my post:</p>
<blockquote cite="http://www.terminally-incoherent.com/blog/2008/02/07/file-format-overhead-for-data-storage/"><p>
There is naturally a reason for this. In case you didn’t know, the OpenXML files are really zip compressed directory trees full of verbose MSXML. They are already compressed - there is not much we can do about it! These files are and will be huge for many reasons.</p>
</blockquote>
<p>Funny thing is that the XML inside is convoluted and it can&#8217;t be readily edited. I tried to unzip a Word file, slightly change some text and then zip it back but naturally this doesn&#8217;t work. They have checksums and hashes of the content stowed away in more than one place to ensure that modifying OOXML files is as convoluted as possible.
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Will Sheldon</title>
		<link>http://www.terminally-incoherent.com/blog/2008/02/07/file-format-overhead-for-data-storage/#comment-8064</link>
		<pubDate>Mon, 11 Feb 2008 22:55:18 +0000</pubDate>
		<guid>http://www.terminally-incoherent.com/blog/2008/02/07/file-format-overhead-for-data-storage/#comment-8064</guid>
					<description>Luke,

The xlsx files are already zip compressed. Rename one to end with .zip &#38; you'll be able to decompress it, then you can see how (un)cooperative m$ has been with their new 'open' standard.</description>
		<content:encoded><![CDATA[<p>Luke,</p>
<p>The xlsx files are already zip compressed. Rename one to end with .zip &amp; you&#8217;ll be able to decompress it, then you can see how (un)cooperative m$ has been with their new &#8216;open&#8217; standard.
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Luke Maciak</title>
		<link>http://www.terminally-incoherent.com/blog/2008/02/07/file-format-overhead-for-data-storage/#comment-8034</link>
		<pubDate>Fri, 08 Feb 2008 20:54:48 +0000</pubDate>
		<guid>http://www.terminally-incoherent.com/blog/2008/02/07/file-format-overhead-for-data-storage/#comment-8034</guid>
					<description>[quote post="2282"]What is this mudkips captcha, I don’t get it.[/quote]

Mudkips == 4chan meme. In other words, you probably don't want to know. ;) I added it to the word list because it is a short, random word that is not really in a dictionary, but it is easy to spell and type in, and it also acts as a silly inside joke for some.</description>
		<content:encoded><![CDATA[<blockquote cite="http://www.terminally-incoherent.com/blog/2008/02/07/file-format-overhead-for-data-storage/"><p>
What is this mudkips captcha, I don’t get it.</p>
</blockquote>
<p>Mudkips == 4chan meme. In other words, you probably don&#8217;t want to know. <img src="http://www.terminally-incoherent.com/blog/wp-includes/images/smilies/icon_wink.gif" alt=")" class="wp-smiley" />  I added it to the word list because it is a short, random word that is not really in a dictionary, but it is easy to spell and type in, and it also acts as a silly inside joke for some.
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Luke Maciak</title>
		<link>http://www.terminally-incoherent.com/blog/2008/02/07/file-format-overhead-for-data-storage/#comment-8033</link>
		<pubDate>Fri, 08 Feb 2008 20:38:37 +0000</pubDate>
		<guid>http://www.terminally-incoherent.com/blog/2008/02/07/file-format-overhead-for-data-storage/#comment-8033</guid>
					<description>[quote post="2282"]you can bet your donkey you’ll be able to open a txt file in 30 years and import it into whatever you need. If your data was only in xls/xlsx or other proprietary formats (see stats program for more examples), in 30 years you’re probably hosed. [/quote]

Ditto! You hit the nail on the head. I totally glossed over that, but the availability of your information in the future is paramount. Locking all your data in proprietary formats is not a smart thing to do.</description>
		<content:encoded><![CDATA[<blockquote cite="http://www.terminally-incoherent.com/blog/2008/02/07/file-format-overhead-for-data-storage/"><p>
you can bet your donkey you’ll be able to open a txt file in 30 years and import it into whatever you need. If your data was only in xls/xlsx or other proprietary formats (see stats program for more examples), in 30 years you’re probably hosed. </p>
</blockquote>
<p>Ditto! You hit the nail on the head. I totally glossed over that, but the availability of your information in the future is paramount. Locking all your data in proprietary formats is not a smart thing to do.
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: jambarama</title>
		<link>http://www.terminally-incoherent.com/blog/2008/02/07/file-format-overhead-for-data-storage/#comment-8031</link>
		<pubDate>Fri, 08 Feb 2008 19:39:30 +0000</pubDate>
		<guid>http://www.terminally-incoherent.com/blog/2008/02/07/file-format-overhead-for-data-storage/#comment-8031</guid>
					<description>BTW - saving your data in ASCII is a great idea for a few reasons: it is more searchable, more easily manipulated, smaller in size (as you demonstrated), and you can bet your donkey you'll be able to open a txt file in 30 years and import it into whatever you need.  If your data was only in xls/xlsx or other proprietary formats (see stats program for more examples), in 30 years you're probably hosed.</description>
		<content:encoded><![CDATA[<p>BTW - saving your data in ASCII is a great idea for a few reasons: it is more searchable, more easily manipulated, smaller in size (as you demonstrated), and you can bet your donkey you&#8217;ll be able to open a txt file in 30 years and import it into whatever you need.  If your data was only in xls/xlsx or other proprietary formats (see stats program for more examples), in 30 years you&#8217;re probably hosed.
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: jambarama</title>
		<link>http://www.terminally-incoherent.com/blog/2008/02/07/file-format-overhead-for-data-storage/#comment-8030</link>
		<pubDate>Fri, 08 Feb 2008 19:35:08 +0000</pubDate>
		<guid>http://www.terminally-incoherent.com/blog/2008/02/07/file-format-overhead-for-data-storage/#comment-8030</guid>
					<description>What is this mudkips captcha, I don't get it.  

A blog I used to read (when entries were still forthcoming), 3monkeys, did a file size comparison between doc, xml, txt, &#38; odt about a year ago.  They came up with essentially &lt;a href="http://3monkeyweb.com/3monkeys/2006/12/29/-odfodt-compared-to-microsoft-word-doc/" rel="nofollow"&gt;the same&lt;/a&gt; results as you.  They also compared file sizes across different apps for the same filetype, and got different results.  OOXML wasn't out at that time, so no comparison there.  

Thanks for this, it is pretty interesting to me.  If you ever try it again, maybe run some odf comparisons too.  :)</description>
		<content:encoded><![CDATA[<p>What is this mudkips captcha, I don&#8217;t get it.  </p>
<p>A blog I used to read (when entries were still forthcoming), 3monkeys, did a file size comparison between doc, xml, txt, &amp; odt about a year ago.  They came up with essentially <a href="http://3monkeyweb.com/3monkeys/2006/12/29/-odfodt-compared-to-microsoft-word-doc/" rel="nofollow">the same</a> results as you.  They also compared file sizes across different apps for the same filetype, and got different results.  OOXML wasn&#8217;t out at that time, so no comparison there.  </p>
<p>Thanks for this, it is pretty interesting to me.  If you ever try it again, maybe run some odf comparisons too.  <img src="http://www.terminally-incoherent.com/blog/wp-includes/images/smilies/icon_smile.gif" alt=")" class="wp-smiley" />
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Luke Maciak</title>
		<link>http://www.terminally-incoherent.com/blog/2008/02/07/file-format-overhead-for-data-storage/#comment-8018</link>
		<pubDate>Fri, 08 Feb 2008 00:52:09 +0000</pubDate>
		<guid>http://www.terminally-incoherent.com/blog/2008/02/07/file-format-overhead-for-data-storage/#comment-8018</guid>
					<description>The file was essentially rows of floating point numbers. They were readings taken at each iteration and they were supposed to converge and stop changing at some point. Here is a sample of the data:

&lt;pre lang="txt"&gt;0.002479512	0.003447492	0.004360885	0.00417913
0.002243689	0.003142163	0.003983163	0.003899171
0.002080678	0.002927045	0.003714036	0.00369954
0.001953793	0.002756688	0.003503308	0.003538016
0.001857135	0.002622514	0.003341999	0.003409899
0.001786521	0.002518304	0.003223567	0.003311575
0.001739003	0.002439494	0.003143176	0.003240258
0.001712549	0.002382714	0.003097226	0.003193814
0.001705819	0.002345494	0.003083038	0.003170632
0.001718007	0.002326056	0.003098644	0.003169526
0.001748726	0.00232317	0.003142633	0.003189665
0.001797924	0.002336054	0.003214054	0.003230512
0.001865815	0.002364298	0.003312335	0.003291781
0.001952821	0.002407811	0.003437223	0.003373395
0.002059526	0.002466784	0.003588746	0.00347545
0.002186619	0.002541665	0.003767171	0.003598183
0.002334844	0.002633143	0.003972968	0.003741938
0.002504938	0.002742141	0.004206782	0.003907128
0.002697553	0.002869812	0.00446939	0.004094192
0.002913174	0.003017547	0.004761655	0.004303547
0.003151996	0.003186983	0.005084467	0.004535527
0.003413796	0.003380021	0.005438662	0.004790304
0.003697758	0.003598848	0.005824921	0.005067787&lt;/pre&gt;

It's all numeric, no letters, and the values are relatively close to each other. It compresses very well. For this test I zipped all the files using WinRAR with the "Best Compression" option.</description>
		<content:encoded><![CDATA[<p>The file was essentially rows of floating point numbers. They were readings taken at each iteration and they were supposed to converge and stop changing at some point. Here is a sample of the data:</p>

<div class="wp_syntax"><div class="code"><pre>0.002479512	0.003447492	0.004360885	0.00417913
0.002243689	0.003142163	0.003983163	0.003899171
0.002080678	0.002927045	0.003714036	0.00369954
0.001953793	0.002756688	0.003503308	0.003538016
0.001857135	0.002622514	0.003341999	0.003409899
0.001786521	0.002518304	0.003223567	0.003311575
0.001739003	0.002439494	0.003143176	0.003240258
0.001712549	0.002382714	0.003097226	0.003193814
0.001705819	0.002345494	0.003083038	0.003170632
0.001718007	0.002326056	0.003098644	0.003169526
0.001748726	0.00232317	0.003142633	0.003189665
0.001797924	0.002336054	0.003214054	0.003230512
0.001865815	0.002364298	0.003312335	0.003291781
0.001952821	0.002407811	0.003437223	0.003373395
0.002059526	0.002466784	0.003588746	0.00347545
0.002186619	0.002541665	0.003767171	0.003598183
0.002334844	0.002633143	0.003972968	0.003741938
0.002504938	0.002742141	0.004206782	0.003907128
0.002697553	0.002869812	0.00446939	0.004094192
0.002913174	0.003017547	0.004761655	0.004303547
0.003151996	0.003186983	0.005084467	0.004535527
0.003413796	0.003380021	0.005438662	0.004790304
0.003697758	0.003598848	0.005824921	0.005067787</pre></div></div>

<p>It&#8217;s all numeric, no letters, and the values are relatively close to each other. It compresses very well. For this test I zipped all the files using WinRAR with the &#8220;Best Compression&#8221; option.
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: vacri</title>
		<link>http://www.terminally-incoherent.com/blog/2008/02/07/file-format-overhead-for-data-storage/#comment-8016</link>
		<pubDate>Thu, 07 Feb 2008 23:52:31 +0000</pubDate>
		<guid>http://www.terminally-incoherent.com/blog/2008/02/07/file-format-overhead-for-data-storage/#comment-8016</guid>
					<description>I have to admit that I'm a little suspicious about a file that can be compressed to below 1% of its original file size. Was the bigdata file a real-world text file or was it a mocked-up file that inadvertently could be easily compressed? What kind of data was it?

Although I think it speaks for itself that the .xls didn't really compress at all.</description>
		<content:encoded><![CDATA[<p>I have to admit that I&#8217;m a little suspicious about a file that can be compressed to below 1% of its original file size. Was the bigdata file a real-world text file or was it a mocked-up file that inadvertently could be easily compressed? What kind of data was it?</p>
<p>Although I think it speaks for itself that the .xls didn&#8217;t really compress at all.
</p>
]]></content:encoded>
				</item>
</channel>
</rss>

