How To Build A Web Crawler?

I was reading an article the other day and came across the term “web crawler”. The context in which it was used got me curious about how a web crawler is designed. A web crawler is a program that scans or “crawls” through web pages to build an index of the data it is looking for. It has several uses, perhaps the most popular being search engines, which use it to point web surfers to relevant websites. Google has refined the art of crawling over the years! Pretty much anyone who wants to search for information on the Internet in an organized manner can use a web crawler. It goes by different names, such as web spider, bot, and indexer. Anyway, that article got me thinking about building one. I just wanted to fiddle with it and see how long it would take to get something working on my machine. It turned out to be quite easy!

Where is it used?

Web crawlers can be used in many different ways, but they are usually used by someone who wants to collect information from the Internet. Search engines frequently use web crawlers to find out what is available on public web pages, and the speed and accuracy of a search engine depend heavily on the design of its crawler. The crawler’s primary purpose is to collect data so that when Internet surfers enter a search term, the search engine can quickly point them to relevant websites. Linguists may use a web crawler to perform text and language analysis. Market researchers may use one to identify and assess trends in a given market. The possibilities are endless!

How does it actually work?

When a web crawler visits a web page, it reads the content of that page: the visible text, the hyperlinks, and the content of the various tags used on the page. It then follows the links on that page, treating the crawl as a graph search problem. Each linked page is treated as a node, and the traversal uses depth-first search or breadth-first search depending on the application. Using the information the crawler gathers, a search engine then determines what the site is about and indexes it. The website is then included in the search engine’s database and its page ranking process.
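To make the graph search idea concrete, here is a minimal sketch of a breadth-first crawl. It is not the code from crawler.py. The names TOY_WEB and crawl_bfs are made up for illustration, and the “web” is just a small dictionary so that the example runs without any network access.

```python
from collections import deque

# Hypothetical toy "web": each page maps to the pages it links to.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}


def crawl_bfs(start, get_links, max_depth):
    """Visit pages breadth first, up to max_depth links away from start."""
    visited = [start]            # pages in the order they were discovered
    seen = {start}               # fast membership check to avoid revisits
    queue = deque([(start, 0)])  # (page, distance from the start page)
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                visited.append(link)
                queue.append((link, depth + 1))
    return visited


order = crawl_bfs("a", lambda page: TOY_WEB.get(page, []), max_depth=2)
print(order)
```

Swapping the queue for a stack (popping from the same end you push to) would turn this into a depth-first crawl. In a real crawler, get_links would download the page and pull the links out of its HTML.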

Most of the time, web crawlers are designed to do one specific thing. If we want a general-purpose crawler (like the one behind a search engine), it should be programmed to crawl through the Internet periodically and check for any significant changes. This keeps the results up to date.
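One simple way to spot changes between two crawls is to store a fingerprint of each page and compare it on the next visit. The sketch below is only an illustration under that assumption: fingerprint and changed_pages are hypothetical names, the URLs are placeholders, and a hash comparison flags any change at all, so deciding whether a change is “significant” would need extra logic.

```python
import hashlib


def fingerprint(content):
    """Return a SHA-256 hex digest of the page content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def changed_pages(previous, current):
    """Return URLs that are new or whose content differs from the last crawl.

    previous maps URL -> fingerprint stored during the last crawl.
    current maps URL -> page content fetched in this crawl.
    """
    return sorted(
        url for url, body in current.items()
        if previous.get(url) != fingerprint(body)
    )


# Placeholder data standing in for two crawls of the same site.
last_crawl = {
    "http://website.com/": fingerprint("<h1>Home</h1>"),
    "http://website.com/news": fingerprint("<h1>Old news</h1>"),
}
this_crawl = {
    "http://website.com/": "<h1>Home</h1>",
    "http://website.com/news": "<h1>Fresh news</h1>",
    "http://website.com/contact": "<h1>Contact</h1>",
}
print(changed_pages(last_crawl, this_crawl))
```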

I’m ready! Show me how to build a web crawler.

Now that we know how it works, we are ready to build a web crawler. I will show you how to get a basic Python web crawler working on your machine. Given a link, you will be able to crawl through the page and get all the links on it. You can then crawl through those pages and get more links. This process continues until the desired depth is reached. In the end, you will get a list of all the links. To do anything further (like natural language processing or content parsing), you will have to write your own code.
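The “get all the links” step can be sketched as follows. The crawler in this post relies on BeautifulSoup for parsing; this sketch swaps in Python’s standard html.parser module so that it runs without installing anything. LinkCollector, extract_links and SAMPLE_HTML are hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links like "/about" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all links found in an HTML string, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Placeholder page content; a real crawler would download this.
SAMPLE_HTML = (
    '<p><a href="/about">About</a> '
    '<a href="http://other.example/x">X</a> '
    '<a name="top">no href here</a></p>'
)
print(extract_links(SAMPLE_HTML, "http://website.com/index.html"))
```

Calling a function like this on every page you visit, and feeding the results back into the crawl, gives you the list of links described above.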

Prerequisites: The first thing you need to do is make sure you have Python installed. If you don’t have Python, install it before you proceed further. To run this code, you also need BeautifulSoup, a Python library for parsing HTML documents. The steps to install BeautifulSoup are given below:

  1. Go to this site and download the file “beautifulsoup4-4.1.3.tar.gz”
  2. Unpack the file into a convenient location
  3. Open the terminal and go to the unpacked folder
  4. Execute the following commands:
    $ python setup.py build 
    $ python setup.py install
  5. If the installation is successful, you will not see any errors on the terminal.

Running the crawler: Go to a convenient directory in your terminal and download the web crawler code from my GitHub account using the following command:

$ git clone https://github.com/prateekvjoshi/Python-WebCrawler.git

The above command will create a folder called “Python-WebCrawler” on your machine and get all the required files in one go. The code is a modified version of James Mills’ original recipe. If you don’t have git, then just download “crawler.py” from this link. The good thing about Python is that almost any function we would ever need is either built in or already written by someone. You just have to arrange them properly so that they can do the dance for you!

Now we are ready to do some crawling! You can use “crawler.py” to crawl various websites. I have listed a few use cases below:

  • The following command will display the total number of links found on a particular website after crawling:
    $ python crawler.py http://website.com
  • If you want to crawl only up to a particular depth (in this case, 2 levels), then:
    $ python crawler.py -d 2 http://website.com
  • If you want only the links found on this particular URL:
    $ python crawler.py -l http://website.com
  • There are many other options you can explore. Execute the following in the terminal and you will see a bunch of options:
    $ python crawler.py --help
