Remember how I always talk about redundant backups? Let me tell you a story about what happens when you don’t have them.
A few months ago I purchased a 2TB external drive to replace an older LaCie drive that I suspected would fail soon. Why? Well, it was becoming progressively louder, and sometimes it would take up to 10 minutes to “spin up” before the system could detect it. I wasn’t actively using that drive for anything vital, but it still contained about 200GB of random crap – mostly movies, random torrents and the like. Most of that stuff was either easily recoverable or something I didn’t care about that much, so I didn’t see much reason to back it up. Still, as long as it was there I did not want to delete any of it. So I moved it all to the new drive, and after a few days wiped the failing device clean, intending to use it as an extended /tmp directory.
Fast forward a few weeks, and the new 2TB drive failed hard, without warning. One day it just stopped working, taking all my files with it. I was mildly annoyed, as I would have liked a chance to wade through that garbage to see if there was anything worth saving. But I accepted the failure as a lesson for the future: unless you are going to delete it within the next 2-3 hours, it is worth saving in at least two places.
Fast forward to the present day, and I experienced a sudden epiphany. I hadn’t really been using the old, failing drive for anything all these months. The data I deleted from it might not all be gone. I promptly fired up Recuva and managed to recover a few thousand files from that drive. Unfortunately it was not everything, because I did use the drive a few times to save random garbage. What’s worse, while Recuva managed to pull a lot of my files back from the brink of oblivion, it did lose the folder structure. It more or less just unceremoniously dumped everything into a huge, disorganized pile in a single directory, leaving me to wade through the wreckage trying to decide what goes where. Things I actually cared about (movies, ebooks, iso files) were lost in a sea of random jpegs, dat files and corrupted file fragments.
I started sifting through this junk but quickly realized it would take hours to organize. So I started writing a script that would knock out the lion’s share of the random garbage. Initially I planned to do it by extension – just delete everything that was not a movie, a pdf and so on. But the drive was littered with a vast number of different file formats, so making a list (black or white) was going to be tedious. Then it dawned on me – I could just do it by size. Chances were that anything smaller than 10MB was either not worth my attention or not worth saving.
Small problem? I don’t think it is possible to write a Windows batch script that deletes files based on their size. Or at least I couldn’t figure out a way. I was about to whip out some Unix to deal with this but then I stumbled over a PowerShell icon on my way there.
“Hey, why not…” – I said to myself, and got to work. It turns out that PowerShell was made for stuff like this. Observe:
ls | where {$_.Length -lt 10mb} | Remove-Item
The keyword ls here is not the Unix command, but rather a built-in alias (like gci) for Get-ChildItem, which is essentially the PowerShell version of ls. So in a way it is ls, I guess.
The where keyword (itself an alias for Where-Object) filters the listed files. Perl hackers will probably recognize the $_, which here stands for the current item. The -lt is PowerShell speak for “less than”, and Remove-Item should be self-explanatory.
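One more thing worth knowing: Remove-Item honors the standard -WhatIf switch, so you can do a dry run first and see what would get deleted without actually touching anything:

ls | where {$_.Length -lt 10mb} | Remove-Item -WhatIf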
Verdict: PowerShell ain’t so bad. It is a little bit weird at first, but you get used to it.
Oh, and if you are about to say “why didn’t you just order the files by size in Explorer and hit delete”, don’t. I could have done that, sure. But I wouldn’t have a blog post if I did that, would I? So take that Mr. or Ms. Hindsight.
You could say that while all the file data on the drive was easily replaceable, there was still some important, hard-to-replace data on the drive: the file listing. It’s like losing all your bookmarks. The data is all out there but you lost the directions to get to it.
This information could be captured and backed up with a simple “find /mnt/archive > listing.txt”.

@ Chris Wellons:
That’s actually an excellent point. I didn’t think about that, but yeah – if you don’t want to back something up, it is a good idea to keep an up-to-date file index, just so that in the event of a catastrophic data failure you know what you have lost.
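For the Windows side of the fence, something along these lines (run from the root of the drive in question) should produce a similar listing:

gci -Recurse | % { $_.FullName } > listing.txt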
I was once asked to recover files from a failed drive. They had managed to run some sort of autorecovery tool which not only cleared the folder structure, like yours did, but also the filenames. Every single file had a hash-like name, and they were all in the same directory.
I wrote a script that tries to guess the file type (via $(file)), creates new directories for the file types (“/images”), and gives each file a ‘better’ name and an extension (“/images/image_001.jpg”) so that the owners could open the files without guessing the file types themselves.
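Something along these lines, roughly speaking (only a sketch, recast in PowerShell to match the post; the folder names, the naming scheme, and the assumption that the Unix file utility is on the PATH are illustrative rather than the actual script):

# guess each file's type, sort into per-type folders, and hand out readable names
$counts = @{}
gci .\recovered | ? { -not $_.PSIsContainer } | % {
    $mime = & file --brief --mime-type $_.FullName   # e.g. "image/jpeg"
    $type, $ext = $mime -split '/'                   # crude extension guess: "jpeg"
    $dir = Join-Path .\sorted $type
    if (-not (Test-Path $dir)) { mkdir $dir | Out-Null }
    $counts[$type] = 1 + $counts[$type]
    Move-Item $_.FullName (Join-Path $dir ("{0}_{1:d3}.{2}" -f $type, $counts[$type], $ext))
}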
Probably would have been slower anyway. Explorer’s search/sort functionality is a bear to work with and this way you have to wait twice rather than once.
I’ve done this kind of work before as well. I had to reconcile 20TiB of video files, including backups, backups of backups, and incremental work files. I wrote a lisp program which hashed the files to determine uniqueness and grouped them based on tag and path information. I was able to quarter the filespace to be archived. They went ahead and archived the whole damn thing anyway. Librarians…
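For what it is worth, the uniqueness half of that idea fits in a few lines of PowerShell as well (a sketch only, assuming a PowerShell recent enough to ship Get-FileHash; the tag and path grouping is left out):

# hash everything, then group by hash; any group with more than one member is a set of duplicates
gci -Recurse | ? { -not $_.PSIsContainer } |
    % { Get-FileHash $_.FullName } |
    group Hash | ? { $_.Count -gt 1 } |
    % { $_.Group | select -ExpandProperty Path }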
You stopped short of using all the available default aliases:
ls | ? {$_.Length -lt 10mb} | del
Also, doing it recursively is 3 more chars:
ls -r | ? {$_.Length -lt 10mb} | del