Download Website for Offline Reading with Wget

My boss had a slightly unusual request the other day. Like me, he lives a double life: working at this company by day and teaching at a university by night. It turns out that, for some reason, he can't connect to the internet in his lecture room, but he wanted to use a nice resource/tutorial site in class.

Seemed like an easy solution – mirror the site locally with wget, use it in the presentation and then give the students the URL for the real thing. So I went back to my desk and did:

wget --mirror -p --html-extension --convert-links -v http://example.com

If you are unfamiliar with wget, let me go over the options really quickly:

--mirror tells wget to recurse through the linked documents, creating a local mirror.

-p forces wget to download all the additional files a page needs, such as images, CSS, JS, sounds, etc.

--html-extension is needed because our website was dynamically generated, so a lot of URLs looked like this:

http://example.com/?p=1337

This parameter ensures that these URLs are saved locally with file names ending in .html:

http://example.com/?p=1337.html

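This way you can easily browse them locally without worrying about browsers not recognizing the file type or something like that. (Since wget 1.12 this option is called --adjust-extension; the old name is kept as a deprecated alias.)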

--convert-links is the parameter that actually makes it possible to browse the site locally: it converts all the absolute links that include the website's domain into relative ones.

-v is of course a purely optional parameter that forces verbose output.
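For reference, here is roughly the same command with the short flags spelled out in their long forms (-p is --page-requisites, -v is --verbose), wrapped in a small shell function. Treat it as a sketch: mirror_site and the DRY_RUN variable are names I made up for illustration, not part of wget.

```shell
#!/bin/sh
# mirror_site is a hypothetical helper, not a wget feature.
# It assembles the mirror command from the post using long option names.
# With DRY_RUN=1 it only prints the command instead of running it.
mirror_site() {
    url=$1
    # Replace the positional parameters with the full wget command line.
    set -- wget \
        --mirror \
        --page-requisites \
        --html-extension \
        --convert-links \
        --verbose \
        "$url"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Calling mirror_site http://example.com runs the download; setting DRY_RUN=1 first lets you check the flags before hitting the server.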

You can pretty much use this trick on just about any website out there. Might be useful if you are going to be on a plane for several hours without internet access. You can easily mirror some of the resource sites you often use. Then again, it’s probably not the best idea to mirror Wikipedia or something like that. :P

But there is a catch. If you are mirroring a dynamic website that has a ? in a lot of its URLs, you need to be careful. I ran wget on my Ubuntu box, where ? is a perfectly valid character for file names. Then I moved the files over to a mounted Windows network share so that my boss could get them… And it stopped working. Why?

The question mark is an illegal file name character under Windows. So all the URLs above became something like:

http://example.com/@p=1337.html

Of course the links (converted, relative links) pointed to the original file names that included the ? character. How to solve this? You need to tell wget to use Windows mode:

wget --mirror -p --restrict-file-names=windows --html-extension --convert-links -v http://example.com

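The --restrict-file-names=windows parameter tells wget to properly escape characters that are illegal in Windows file names, such as \, |, /, :, ?, ", *, < and >. In this mode wget also uses @ instead of ? to separate the query portion of the file name, and the converted links point to the escaped names. Of course, if you use wget compiled for Windows, this will be the default behavior.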
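If you already have a mirror that was made on Linux and are not sure whether it will survive the trip to a Windows share, a quick check like the one below can tell you before you copy anything. It is only a sketch: find_windows_unsafe is a hypothetical helper name, and it assumes no file names contain newlines.

```shell
#!/bin/sh
# find_windows_unsafe is a hypothetical helper, not part of wget.
# It lists every path under the given directory whose last component
# contains a character that Windows refuses in file names: \ | : ? " * < >
# grep exits non-zero when nothing matches, so the function does too.
find_windows_unsafe() {
    find "$1" -print | grep '[\\|:?"*<>][^/]*$'
}
```

Running find_windows_unsafe example.com on the mirror directory prints the offending paths. If anything shows up, re-running wget with --restrict-file-names=windows is safer than renaming files by hand, because the converted links have to match the new names.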





4 Responses to Download Website for Offline Reading with Wget

  1. jambarama says:

    Wget is a great download manager, it handles multi-page multi-part downloads, following links, etc. Just like cURL (though I prefer wget as more capable). It does have the capability to be a site mirror-er, but httrack is a pretty dang good web mirrorer too. It has a nice step-by-step gui – webhttrack – that runs through your browser.

  2. Aren’t Windows filename restrictions just ridiculous? And people like to think Windows is something other than just a toy.

  3. hammou says:

    Thanks for this

    but me i want to get PHP and ASP code

    I want this website http://www.storynike.com

  4. Luke Maciak says:

    @ hammou:

    Sorry buddy. That can’t be done. Server side code gets processed before it gets sent to the browser.
