Yesterday I blogged about my experience teaching LOGO to non-CS students, but I forgot to mention an interesting observation I made. Most students did not believe me when I told them to save their code as a plain text file when submitting. I pretty much had to stop at each desk and show them how to do “Save As”, then type in logo_lab.txt in the box and hit enter. Yes, I know it opens as a text file. That’s precisely how I want it!
It’s probably worth mentioning that we were using Tortue as our LOGO interpreter. It is a very simple tool – it does not even support the whole LOGO language. But for my purposes it is perfect, since the subset it does support is enough for my class. An added bonus is that it ships as a self-contained JAR file with no classpath dependencies. Students simply need to download it and double-click it to run the nice GUI interpreter.
Most students wanted to send me the JAR file. Some probably did; I haven’t checked the Blackboard dropbox tool yet to see how many people were confused. The impression I got from most people was that they were really bewildered by the size of the files they were submitting. One student even asked me explicitly whether it was normal that his file was 1KB in size.
Yes, it is!
Yes! It’s text! Each character takes up about a byte of space on the disk. You have 15 lines of code, and each line is less than 10 characters long. How much space were you expecting?
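To put rough numbers on that, here is a minimal Python sketch of the arithmetic. The 15-line program is made up for illustration (it is not the actual lab code), and the round-up-to-whole-kilobytes rule is an assumption about how a typical file manager displays sizes.

```python
import math

# Hypothetical 15-line LOGO program (illustrative commands, not the real lab code).
PROGRAM_LINES = ["FD 50", "RT 90"] * 7 + ["FD 50"]


def text_size_bytes(lines, newline="\n"):
    """Return the size in bytes of the lines saved as plain ASCII text."""
    text = "".join(line + newline for line in lines)
    return len(text.encode("ascii"))


def displayed_kilobytes(size_bytes):
    """Round up to whole kilobytes (assumed file manager behavior)."""
    return math.ceil(size_bytes / 1024)


size = text_size_bytes(PROGRAM_LINES)
print(len(PROGRAM_LINES), "lines,", size, "bytes, shown as", displayed_kilobytes(size), "KB")
# prints: 15 lines, 90 bytes, shown as 1 KB
```

Even with Windows-style line endings (pass `newline="\r\n"`) the same file comes to only 105 bytes, so a listing of 1KB is, if anything, generous.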
This is not Microsoft Word, my friends, which needs to keep a few kilobytes of metadata per paragraph. I wrote about the ridiculous overhead of common modern file formats before. I don’t know when it happened, but at some point we turned away from plain text and started encoding our data in more and more complex formats. I can guess why – because users wanted rich text and formatting, and they didn’t want to learn a markup language. But instead of developing working WYSIWYG editors to run on top of a markup language, we somehow moved towards pure WYSIWYG environments which encode text whichever way they want.
Btw, have you ever tried exporting a Word document to an HTML page? It is a horrible mess, and to me indicative of how Word actually stores its markup internally. The frightening forests of opening and closing span and font tags that surround blank spaces are artifacts left by Word’s own file format. This is why these files take so much space. It’s not just metadata, but also garbage markup which is invisible to the GUI user, and thus almost impossible to detect and delete. Pure WYSIWYG editors are absolutely horrible at doing markup. And it’s not like it can’t be done right. Last time I checked, the WYSIWYG feature in Dreamweaver had no such problems and spat out fairly clean HTML. But that’s because it was designed to do so: web developers were expected to switch to the HTML tab all the time to make corrections and adjustments. In Word and Excel, on the other hand, no one ever sees the internal markup. So no one cares that it wastes space and causes odd WYSIWYG behavior or bizarre error conditions. But that is what we have right now.
The students were confused because by now they expect their average school paper to approach a megabyte or more in size. They see something smaller and they worry that it did not save properly. This saddens me. It just goes to show how far we have gone with this pure WYSIWYG idea, and how deeply flawed it really is. I predict that this issue will only escalate, and the overhead of storing data in these formats will inflate as time progresses. Our only hope is to scrap this paradigm and go back to sane markup languages which can be edited by hand. I’m not saying we should scrap WYSIWYG. I’m saying we need to see what these editors do, and be able to correct pathological behaviors by directly modifying the markup, and then bug the developer to fix the markup generation code. But we can’t do that if the file formats are a semi-proprietary mess. ODF was a good idea and a step forward. It was not enough though. And Microsoft cleverly took us a full 12 steps back with its “open” but incomprehensible, undocumented and half-magical OOXML bullshit. The dirty little secret of that format is that there is no existing implementation of it. Office 2007 doesn’t actually implement the OOXML spec – at least not the way it was published. So the files you save in Word and Excel are not really OOXML, and a perfect OOXML implementation will not be 100% compatible with MS Office. Fun, eh?
But I digress. I want to close this post on a more positive note. As I explained to my students, the funny thing about programming is that you can write a program whose source code takes less than 10KB on the disk, compile it into a binary that is more than 10MB (static linking, etc.) and see it eat up more than 100MB of RAM when you run it (with its data structures, reserved memory locations, dynamically linked dependencies and all that jazz).