Automatic Downloading

Let’s say you are surfing the web and you come across a cool website with a great collection of pictures. Some of them are located on that page and many more can be reached through various links on that page. There are hundreds of pictures and you want to download all of them. How would you do it? Would you click and save each image separately? Let’s say you really like the design of that site and you want to download the whole thing along with the source code. How would you do it automatically without wasting time?

Can we do it automatically?

The usual way people download an image from a website is by right-clicking and saving it. We cannot afford to do the same when it comes to a large collection. A website is nothing but a collection of various kinds of files. So we need a tool which can take input from the user and download those particular files.

A little Wget magic is all we need

There is a command line tool named Wget which lets us do exactly that. You can download the latest version from here. There might be many ways to achieve this, but this post is about Wget. It’s a very popular, flexible and powerful tool. Many of you might already be aware of it; if you aren’t, it’s definitely worth knowing. In fact, you can write a neat little script to do exactly what you want, how you want and when you want.

Installation

This installation procedure is for *nix style operating systems (Linux, Unix and Mac). Windows users can check this link for the procedure. After unpacking the source archive, go to its directory in the command line and type:

$ ./configure

If this fails with an SSL-related error, configure is probably defaulting to GnuTLS, which is not installed on your machine. Use OpenSSL instead by specifying it explicitly:

$ ./configure --with-ssl=openssl

Notice the two hyphens before “with-ssl”. It might not be very clear from the font. After that step, type the following two commands:

$ make
$ sudo make install

And you are done! You have wget installed on your machine.
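
Before moving on, it can help to confirm that your shell actually finds the new binary. Here is a minimal sketch, assuming a POSIX-style shell; have_tool is a helper name made up for this example and is not part of wget:

```shell
# Report whether a command-line tool is on the PATH.
# have_tool is a hypothetical helper name, not something wget provides.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool wget; then
    # The first line of the version output carries the version number.
    wget --version | head -n 1
else
    echo "wget not found on PATH; re-check the install step"
fi
```

If the second message shows up right after installing, opening a new terminal usually fixes it, since the shell may have cached the old PATH lookup.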

Usage

Wget is one of the most powerful tools available out there to download stuff from the internet. It can download files using HTTP, HTTPS and FTP. These are the most widely used internet protocols. Let’s dive in, shall we?

To download a file:

$ wget http://the-url/to-your/file

To download an entire website:

$ wget -r http://website.com

The above command follows the links on the site recursively (up to five levels deep by default) and downloads everything it finds. Be careful with this, because it can put a large load on the server.

To mirror a site on your computer:

$ wget -m http://website.com
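
If the goal is a copy you can browse offline, two more standard flags are worth combining with -m. The sketch below wraps them in a small function; mirror_site is a name made up for this post, and the URL you pass in is up to you:

```shell
# Mirror a site so that it can be browsed offline.
# mirror_site is a hypothetical wrapper name used only in this sketch.
mirror_site() {
    # -m : mirror (recursive, with timestamp checks on re-runs)
    # -k : rewrite links in the saved pages to point at the local copies
    # -p : also fetch images, stylesheets and other files the pages need
    wget -m -k -p "$1"
}
```

You would call it as mirror_site http://website.com and then open the saved index page in a browser.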

If you want to limit the number of levels to dig into while downloading:

$ wget -r -l4 http://website.com

The above command will limit itself to 4 levels. A level is one hop along a link: the files the starting page links to directly are level 1, the files those pages link to are level 2, and so on.

If you want to download only a particular type of file and discard everything else:

$ wget -r -l1 -A.jpg http://website.com

This will download all the files ending in .jpg that are linked from the webpage at the given URL, going only one level deep. This is very useful when you want to download a set of images, PDF documents, PowerPoint presentations etc.
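
The -A option also accepts a comma-separated list, so several file types can be collected in one run. Here is a rough sketch of a reusable wrapper; fetch_types and the default folder name downloads are assumptions made for this example:

```shell
# Fetch only certain file types from the pages one level below a URL.
# fetch_types and the "downloads" folder are made-up names for this sketch.
fetch_types() {
    # $1 : page to start from
    # $2 : comma-separated suffixes, e.g. jpg,png,pdf
    # $3 : optional target folder (defaults to "downloads")
    #
    # -r -l1 : recurse one level      -A  : keep only these suffixes
    # -nd    : no directory tree      -np : never climb to the parent directory
    # -P     : save everything under this folder
    wget -r -l1 -nd -np -A "$2" -P "${3:-downloads}" "$1"
}
```

For example, fetch_types http://website.com jpg,png,pdf would put every matching file into a single downloads folder instead of recreating the site's directory layout.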

A bit on the hacky side

Some sites don’t allow download managers to automatically download files. They do it because automated downloads put a large load on the servers and the site might crash. Also, some people might deliberately do this to crash a website. Hence sites protect themselves by blocking such download managers. To get around this, you can make wget identify itself as a web browser by setting the user agent with the -U option.

$ wget -r -U Mozilla http://the-url/to-your/file

Sites publish their crawling rules in a file named robots.txt, and wget obeys those rules by default during recursive downloads. That is why a recursive wget run fetches robots.txt first; you can open it to see what the site allows. But if you want to get around that, you can tell wget to ignore the rules in robots.txt:

$ wget -r -U Mozilla -erobots=off http://the-url/to-your/file

If you download continuously with full speed, webmasters will detect and stop it. To fool them, you should wait between downloading successive files. You can wait using the following flags:

$ wget -r -U Mozilla -erobots=off -w 10 http://the-url/to-your/file

This tells wget to wait 10 seconds before downloading the next file.

Wget uses the full available bandwidth to download files for you. If you want to do some other work in the meantime, your other network traffic may slow to a crawl. So to limit the download speed:

$ wget -r -U Mozilla -erobots=off -w 10 --limit-rate=50k http://the-url/to-your/file

This limits the rate to 50 KB/s. The k suffix matters: a bare 50 would mean 50 bytes per second.
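
Typing the waiting and throttling flags every time gets tedious, so they can be tucked into a small wrapper. This is only a sketch: polite_wget and the WAIT and RATE variables are names invented here, while --random-wait is a standard wget flag that varies each pause around the -w value.

```shell
# Recursive download with a pause between files and a bandwidth cap.
# polite_wget, WAIT and RATE are made-up names for this sketch.
polite_wget() {
    # WAIT : seconds to pause between files (default 10)
    # RATE : speed cap; k = kilobytes/s, m = megabytes/s (default 50k)
    # --random-wait varies each pause so requests are not evenly spaced
    wget -r -w "${WAIT:-10}" --random-wait --limit-rate="${RATE:-50k}" "$@"
}
```

Calling polite_wget http://the-url/to-your/file uses the defaults, and WAIT=5 RATE=1m polite_wget http://the-url/to-your/file overrides them for a single run. Extra wget options such as -U Mozilla can be passed before the URL.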

There are many more things you can do with wget. Some of them include resuming incomplete downloads, checking the validity of scheduled downloads, retrying failed attempts, downloading from multiple URLs, rejecting certain types of files, and stopping when the download exceeds a certain size. Just give it a try and you will get hooked on it!
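
To give a taste of a few of these, here is a hedged sketch that puts several URLs into a list file and defines a function to download them with resume, retry and a size cap. The example.com URLs, the list file location and the name run_batch are all placeholders made up for illustration:

```shell
# Put the URLs to fetch into a text file, one per line.
# The file location and the example.com URLs are placeholders.
list_file="${TMPDIR:-/tmp}/wget-urls.txt"
printf '%s\n' \
    'http://example.com/first.iso' \
    'http://example.com/second.iso' > "$list_file"

# run_batch is a hypothetical name for this sketch.
run_batch() {
    # -c      : resume files that were only partly downloaded
    # -t 5    : try each file up to 5 times
    # -Q 500m : stop starting new files after about 500 MB in total
    # -i      : read the URLs from the list file
    wget -c -t 5 -Q 500m -i "$list_file"
}
```

Running run_batch starts the download. If it gets interrupted, running it again picks up where it left off thanks to -c.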

————————————————————————————————-

4 thoughts on “Automatic Downloading”

SpeedFart says:

August 10, 2012 at 12:43 PM

Would you recommend it as an alternative Download Accelerator? Is it possible to get over sites like Rapidshare and other file hosting services and download the file?

Your response is much appreciated. Great post. Keep up the good work.

1. Prateek Joshi says:
  
  August 10, 2012 at 12:56 PM
  
  Thanks a lot. Wget is more of a download manager than a download accelerator. In fact, most download managers use download acceleration to increase the speed of download by using several simultaneous connections. If you are not hell-bent on features, there are other lightweight download accelerators like Axel, Curl, Aget etc which can be used to download large files quickly.
  
  Regarding Rapidshare, I haven’t tried downloading files from it using wget. The usage actually depends on individual websites and their constraints (number of connections per request, wait time etc). You can set the flags accordingly to optimize your download.
  
SpeedFart says:

August 10, 2012 at 2:19 PM

One of the things Zuckerberg used to breach the Harvard Network!

1. Prateek Joshi says:
  
  August 10, 2012 at 2:58 PM
  
  Yeah … depending on the site, he used different flags with wget to get all the images.


Published by Prateek Joshi
