Is it ok that my file is only 1KB?

Yesterday I blogged about my experience teaching LOGO to non-CS students, but I forgot to mention an interesting observation I made. Most students did not believe me when I told them to save their code as a plain text file when submitting. I pretty much had to stop at each desk and show them how to do “Save As”, then type in logo_lab.txt in the box and hit enter. Yes, I know it opens as a text file. That’s precisely how I want it!

It’s probably worth mentioning that we were using Tortue as our LOGO interpreter. It is a very simple tool – it does not even support the whole LOGO language. But for my purposes it is perfect, since the subset it does support is enough for my class. An added bonus is that it ships as a self-contained JAR file with no classpath dependencies. Students simply need to download it and double-click it to run the nice GUI interpreter.

Most students wanted to send me the JAR file. Some probably did; I haven’t checked the Backboard dropbox tool yet to see how many people were confused. The impression I got from most people was that they were really bewildered at the size of the files they were submitting. One student even asked me explicitly if it is normal that his file is 1KB in size.

Yes, it is!

Really?

Yes! It’s text! Each character takes up about a byte of space on the disk. You have 15 lines of code, and each line is less than 10 characters long. How much space were you expecting?
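
The arithmetic is easy to check. Below is a minimal Python sketch, not part of the original lab: the 15-line LOGO program is made up for illustration, and the helper name size_on_disk is hypothetical. It writes the program out as ASCII text and asks the operating system how many bytes the file occupies.

```python
import os
import tempfile

# Hypothetical 15-line LOGO program; every line is under 10 characters.
program_lines = ["FD 50", "RT 90"] * 7 + ["FD 50"]
source = "\n".join(program_lines) + "\n"


def size_on_disk(text: str) -> int:
    """Write text to a temporary ASCII file and return its size in bytes."""
    with tempfile.NamedTemporaryFile(
        "w", encoding="ascii", newline="\n", suffix=".txt", delete=False
    ) as handle:
        handle.write(text)
        path = handle.name
    try:
        return os.path.getsize(path)
    finally:
        os.remove(path)


byte_count = size_on_disk(source)
# 15 lines x (5 characters + 1 newline) = 90 bytes
print(len(program_lines), "lines,", byte_count, "bytes")
```

With these made-up lines the file comes to 90 bytes, one byte per character including the line breaks, which is well under a single kilobyte.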

This is not Microsoft Word, my friends, which needs to keep a few kilobytes of metadata per paragraph. I have written before about the ridiculous overhead of common modern file formats. I don’t know when it happened, but at some point we turned away from plain text and started encoding our data in more and more complex formats. I can guess why – users wanted rich text and formatting, and they didn’t want to learn a markup language. But instead of developing working WYSIWYG editors to run on top of a markup language, we somehow moved towards pure WYSIWYG environments which encode text whichever way they want.

Btw, have you ever tried exporting a Word document to an HTML page? It is a horrible mess, and to me it is indicative of how Word actually stores its markup internally. The frightening forests of opening and closing span and font tags that surround blank spaces are artifacts left by Word’s own file format. This is why these files take so much space. It’s not just metadata, but also garbage markup which is invisible to the GUI user, and thus almost impossible to detect and delete. Pure WYSIWYG editors are absolutely horrible at doing markup. And it’s not like it can’t be done right. Last time I checked, the WYSIWYG feature in Dreamweaver had no such problems and spat out fairly clean HTML. But that’s because it was designed to do so, because web developers were expected to switch to the HTML tab all the time to make corrections and adjustments. In Word and Excel, on the other hand, no one ever sees the internal markup. So no one cares that it wastes space and causes odd WYSIWYG behavior or bizarre error conditions. But that is what we have right now.
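
To make the overhead concrete, here is a rough Python sketch that compares the visible text in a snippet of HTML with the total size of the markup around it. The "bloated" snippet is a hand-written imitation of the span and font nesting described above, not actual Word output, and the names TextExtractor and visible_text are hypothetical helpers built on the standard library's html.parser module.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text a reader would see, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Hand-written imitation of over-nested export markup (an assumption,
# not real Word output): even the single blank space gets its own tags.
bloated = (
    '<p class="MsoNormal"><span style="font-size:12.0pt">'
    '<font face="Times New Roman">Hello</font></span>'
    '<span style="font-size:12.0pt"><font face="Times New Roman"> </font></span>'
    '<span style="font-size:12.0pt"><font face="Times New Roman">world</font></span></p>'
)
clean = "<p>Hello world</p>"

for label, html in (("bloated", bloated), ("clean", clean)):
    text = visible_text(html)
    markup = len(html) - len(text)
    print(f"{label}: {len(html)} characters total, "
          f"{len(text)} of text, {markup} of markup")
```

Both snippets render the same eleven characters; the only difference is how much invisible markup is wrapped around them.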

The students were confused because by now they expect their average school paper to approach a megabyte or more in size. They see something smaller and they are worried it did not save properly. This saddens me. It just goes to show how far we have gone with this pure WYSIWYG idea, and how deeply flawed it really is. I predict that this issue will only escalate and the overhead for storing data in these formats will inflate as time progresses. Our only hope is to scrap this paradigm and go back to sane markup languages which can be edited by hand. I’m not saying we should scrap WYSIWYG. I’m saying we need to see what these editors do, and be able to correct pathological behaviors by directly modifying the markup, and then bugging the developer to fix the markup generation code. But we can’t do that if the file formats are a semi-proprietary mess. ODF was a good idea and a step forward. It was not enough though. And Microsoft cleverly took us a full 12 steps back with its “open” but incomprehensible, undocumented and half-magical OOXML bullshit. The dirty little secret of that format is that there is no existing implementation of it. Office 2007 doesn’t actually implement the OOXML spec – at least not the way it was published. So the files you save in Word and Excel are not really OOXML, and a perfect OOXML implementation will not be 100% compatible with MS Office. Fun, eh?

But I digress. I want to close this post on a more positive note. As I explained to my students, the funny thing about programming is that you can write a program whose source code takes less than 10KB on the disk, compile it into a binary that is more than 10MB (static linking, etc.) and see it eat up more than 100MB of RAM when you run it (with its data structures, reserved memory locations, dynamically linked dependencies and all that jazz).

5 Responses to Is it ok that my file is only 1KB?

Ajzimm3rman says:

December 9, 2008 at 10:13 pm

I don’t.
Linux has become a part of a staples magazine.
Open-source is a known term and will come to be more so.

Proprietary will become ‘obvious’ to the average user.
When the benefits outweigh the consequences, the ten-minute user will find it worthwhile to step out of the boxed protection of closed source monogamy.

Chris says:

December 10, 2008 at 5:30 am

A few years ago I was doing customer support for one of the major mobile/cell phone companies here in the UK. The names and addresses of the companies around the country that could service the phones were kept in HTML documents accessible over the internal network. I couldn’t figure out why this page took *so* darned long (> 10 seconds) every time I loaded it. During a lull in the calls I investigated and found that this ‘simple’ HTML document, which only contained one large table with no styling, had been saved out from MS Word – filesize? 1.2MB!! I promptly rewrote the page in plain hand-written HTML – filesize? 38KB. My new version loaded pretty much instantaneously over the network, and me and my team were able to drop valuable seconds off each call we received simply because we didn’t have to wait those 10 odd seconds for the page to load (whilst saying to the customer, ‘just waiting for that information to come up…’). Of course, big companies being big companies, we weren’t allowed to let everyone use the new smaller filesize document; the original MS bloatware version had been prepared *personally* by one of the bosses…

Pingback: Terminally Incoherent » Blog Archive » Text Files Are Mysterious
Lily says:

July 18, 2010 at 8:12 am

I’ve got into the habit of saving most things as text files. Especially notes and so on. I just don’t need all that bloat (I imagine Word WYSIWYG editing is rather like the HTML you get out of applications like Dreamweaver – it sort of works, but the code’s ugly as hell). Ironically, I keep getting told by teachers at school etc. not to do so even when a document requires no formatting, Just Because. I’m gonna stick to .txt anyway, because the amount of space students are allocated to save files in is ridiculously small and it can always be copied and pasted into a Word document if absolutely necessary.

Luke Maciak says:

July 18, 2010 at 1:28 pm

@ Lily:

It only takes a few seconds to copy and paste a paper into Word, format it appropriately and submit. There is really no reason to keep it in the bloated Word format. Good on ya!
