Let’s say you are surfing the web and you come across a cool website with a great collection of pictures. Some of them are located on that page and many more can be reached through various links on that page. There are hundreds of pictures and you want to download all of them. How would you do it? Would you click and save each image separately? Let’s say you really like the design of that site and you want to download the whole thing along with the source code. How would you do it automatically without wasting time?
Can we do it automatically?
The usual way people download an image from a website is by right-clicking and saving it. We cannot afford to do the same when it comes to a large collection. A website is nothing but a collection of various kinds of files. So we need a tool which can take input from the user and download those particular files.
A little Wget magic is all we need
There is a command line tool named Wget which lets us do exactly that. You can download the latest version from the GNU Wget project page. There might be many ways to achieve this, but this post is about Wget. It’s a very popular, flexible and powerful tool. Many of you might already know it. If you don’t, it’s definitely worth learning. In fact, you can write a neat little script to do exactly what you want, how you want and when you want.
This installation procedure is for *nix-style operating systems (Linux, Unix and Mac). Windows users can check this link for the procedure. After unpacking the archive, go to its directory on the command line and type:
$ ./configure
If the configure step shows an error, it is probably defaulting to GnuTLS, which you don’t have installed. In that case, use OpenSSL instead. You have to specify it explicitly using:
$ ./configure --with-ssl=openssl
Notice the two hyphens before “with-ssl”. It might not be very clear from the font. After that step, type the following two commands:
$ make
$ sudo make install
And you are done! You have wget installed on your machine.
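If you want to double-check the installation, here is a minimal sketch of a check you could paste into your shell. The function name check_wget is just a name I made up for illustration; the only thing it relies on is that wget prints its release on the first line of --version.

```shell
# Hypothetical helper: confirm that wget is reachable and show its version.
check_wget() {
    if command -v wget >/dev/null 2>&1; then
        # The first line of the --version output names the installed release.
        wget --version | head -n 1
    else
        echo "wget not found on PATH" >&2
        return 1
    fi
}
```

Run check_wget after installing. If it prints a version line, you are good to go.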
Wget is one of the most powerful tools available out there to download stuff from the internet. It can download files using HTTP, HTTPS and FTP. These are the most widely used internet protocols. Let’s dive in, shall we?
To download a file:
$ wget http://the-url/to-your/file
To download an entire website:
$ wget -r http://website.com
The above command follows the links on the website recursively and downloads everything it finds along the way. By default, it stops after five levels of links. You have to be careful with this because it can put a large load on the servers.
To mirror a site on your computer:
$ wget -m http://website.com
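A plain mirror keeps the links exactly as they are on the server, so the copy may not browse well from your disk. Here is a rough sketch of a mirror that is meant for offline viewing. The function name mirror_site is my own, and the exact set of flags is one reasonable choice, not the only one.

```shell
# Hypothetical helper: mirror a site so it can be browsed offline.
# $1 = starting URL
mirror_site() {
    url=$1
    # --mirror            recursive download with timestamping
    # --convert-links     rewrite links so they point at the local copies
    # --page-requisites   also fetch the images and stylesheets pages need
    # --no-parent         never climb above the starting directory
    wget --mirror --convert-links --page-requisites --no-parent "$url"
}
```

You would call it as mirror_site http://website.com and then open the downloaded index page in your browser.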
If you want to limit the number of levels to dig into while downloading:
$ wget -r -l4 http://website.com
The above command will limit itself to 4 levels. The level is the number of links followed from the starting page. Files linked directly from the starting page are level 1, files linked from those pages are level 2, and so on.
If you want to download a particular type of files and discard everything else:
$ wget -r -l1 -A.jpg http://website.com
This will download the files ending in .jpg that are linked from the given page, and it will limit itself to one level. This is very useful when you want to download a set of images, PDF documents, PowerPoint presentations etc.
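Picture collections rarely stick to a single extension, so here is a small sketch that accepts a few image types at once and drops them all into one folder. The function name fetch_images, the default folder name images and the list of extensions are my own assumptions; adjust them to the site you are dealing with.

```shell
# Hypothetical helper: grab the images linked from one page.
# $1 = page URL, $2 = folder to save into (defaults to "images")
fetch_images() {
    url=$1
    dest_dir=${2:-images}
    # -r -l1   follow links one level deep
    # -nd      do not recreate the site's directory tree locally
    # -A       comma-separated list of file suffixes to keep
    # -P       folder to save the files into
    wget -r -l1 -nd -A "jpg,jpeg,png" -P "$dest_dir" "$url"
}
```

For example, fetch_images http://website.com holiday-pics would collect the matching images in a folder named holiday-pics.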
A bit on the hacky side
Some sites don’t allow download managers to automatically download files. They do it because automated downloads put large loads on the servers and the site might crash. Also, some people might deliberately do this to crash a website. Hence sites protect themselves by blocking such download managers. To get around this, you can present yourself as a web browser using the -U option, which sets the User-Agent string that wget sends.
$ wget -r -U Mozilla http://the-url/to-your/file
Sites also limit the load by publishing instructions for automated tools in a file named robots.txt, and wget obeys those instructions during recursive downloads. That is why a recursive download with wget also fetches robots.txt from the site. You can check it out to see what it contains. But if you want to get around that, you can make sure the rules in robots.txt are not followed:
$ wget -r -U Mozilla -erobots=off http://the-url/to-your/file
If you download continuously at full speed, webmasters may notice and block you. To avoid that, you should wait between downloading successive files. You can wait using the -w flag:
$ wget -r -U Mozilla -erobots=off -w 10 http://the-url/to-your/file
This tells wget to wait 10 seconds before downloading the next file.
Wget uses the full available bandwidth to download files for you. If you want to do some other work on the same connection, it might slow to a crawl. So to limit the download speed:
$ wget -r -U Mozilla -erobots=off -w 10 --limit-rate=50k http://the-url/to-your/file
It will limit the rate to 50 KBps. The k suffix matters here, because a plain number is read as bytes per second.
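If you would rather be gentle with the server than fool it, the waiting and rate limiting flags can be wrapped into one small function. This is only a sketch: the name polite_fetch and the default values are my own picks. It also adds --random-wait, which makes wget vary each pause between 0.5 and 1.5 times the -w value.

```shell
# Hypothetical helper: recursive download that goes easy on the server.
# $1 = URL, $2 = seconds to wait between files, $3 = rate cap
polite_fetch() {
    url=$1
    wait_seconds=${2:-10}
    # A bare number means bytes per second; k and m suffixes are allowed.
    rate=${3:-50k}
    wget -r -w "$wait_seconds" --random-wait --limit-rate="$rate" "$url"
}
```

Calling polite_fetch http://website.com 5 200k would wait around 5 seconds between files and cap the speed at 200 kilobytes per second.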
There are many more things you can do with wget. Some of them include resuming incomplete downloads, checking the validity of scheduled downloads, retrying failed attempts, downloading from multiple URLs, rejecting certain types of files, stopping when the download exceeds a certain size etc. Just give it a try and you will get hooked on it!
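To give you a taste of a few of those extras, here is a minimal sketch that downloads every URL listed in a text file. The function name fetch_list, the file name urls.txt and the numbers used are all made up for illustration, so treat them as placeholders.

```shell
# Hypothetical helper: download every URL listed in a text file.
# $1 = file with one URL per line, $2 = folder to save into
fetch_list() {
    list_file=$1
    dest_dir=${2:-downloads}
    # -c   resume files that were only partly downloaded
    # -t   number of times to retry each file
    # -i   read the URLs from a file
    # -Q   stop starting new downloads once this much has been fetched
    # -P   folder to save the files into
    wget -c -t 5 -i "$list_file" -Q 500m -P "$dest_dir"
}
```

Put your links in urls.txt, one per line, and run fetch_list urls.txt. If the connection drops, run it again and it will pick up where it left off.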