I was reading an article the other day and I came across the term “web crawler”. The context in which it was used got me a little curious about the design of a web crawler. A web crawler is a simple program that scans or “crawls” through web pages to create an index of the data it’s looking for. There are several uses for the program, perhaps the most popular being search engines using it to provide web surfers with relevant websites. Google has refined the art of crawling over the years! A web crawler can pretty much be used by anyone who is trying to search for information on the Internet in an organized manner. It also goes by other names, such as web spider, bot, or indexer. Anyway, that article got me thinking about building a web crawler. I just wanted to fiddle with it and see how much time it would take to get something working on my machine. It turned out to be quite easy!
Where is it used?
Web crawlers can be used in many different ways, but they are usually used by someone seeking to collect information on the Internet. Search engines frequently use web crawlers to collect information about what is available on public web pages, and the speed and accuracy of a search engine depend heavily on the design of its crawler. The crawler’s primary purpose is to collect data so that when Internet surfers enter a search term, the search engine can quickly provide them with relevant web sites. Linguists may use a web crawler to perform text and language analysis. Market researchers may use a web crawler to determine and assess trends in a given market. The possibilities are endless!
How does it actually work?
When a web crawler visits a web page, it reads the page’s content: the visible text, the hyperlinks, and the contents of the various tags used in the page. It then follows the links it found, which turns crawling into a graph search problem. Each page is treated as a node and each link as an edge, and the traversal is done using depth first search or breadth first search, depending on the application. Using the information gathered by the crawler, a search engine then determines what the site is about and indexes the information. The website is then included in the search engine’s database and its page ranking process.
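The graph search idea can be sketched in a few lines of Python 3. This is only an illustration under some assumptions: the function name crawl_bfs, the get_links callback and the toy link table are made up for this example, and a real crawler would download and parse each page instead of looking it up in a dictionary.

```python
from collections import deque

def crawl_bfs(start_url, get_links, max_depth=2):
    """Breadth first traversal of the link graph.

    get_links(url) must return the list of URLs linked from that page.
    Pages more than max_depth links away from start_url are not visited.
    """
    visited = {start_url}
    order = [start_url]                # pages in the order they were discovered
    queue = deque([(start_url, 0)])    # (url, depth) pairs waiting to be expanded
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue                   # deep enough: do not follow this page's links
        for link in get_links(url):
            if link not in visited:    # skip pages we have already seen
                visited.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order

# Hypothetical toy site: each URL maps to the links found on that page.
TOY_SITE = {
    "http://website.com/": ["http://website.com/a", "http://website.com/b"],
    "http://website.com/a": ["http://website.com/c", "http://website.com/"],
    "http://website.com/b": ["http://website.com/c"],
    "http://website.com/c": ["http://website.com/d"],
}

def toy_get_links(url):
    return TOY_SITE.get(url, [])

if __name__ == "__main__":
    for page in crawl_bfs("http://website.com/", toy_get_links, max_depth=2):
        print(page)
```

Swapping the queue for a stack (taking items from the same end they were added) would turn this into a depth first search. The visited set is what keeps the crawler from looping forever when pages link back to each other.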
Most of the time, web crawlers are designed to do one specific thing. If we want a general purpose crawler (like the one behind a search engine), it should be programmed to crawl through the Internet periodically to check for any significant changes. This is very useful in keeping the results up to date.
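One simple way to notice that a page has changed between two crawls is to store a fingerprint of its content and compare it the next time around. The sketch below assumes a SHA-256 hash as the fingerprint, and the names fingerprint and changed_pages are made up for this example. Note that a hash flags any change at all, even a trivial one, so deciding whether a change is significant would need extra logic.

```python
import hashlib

def fingerprint(html):
    """Return a short, stable fingerprint of a page's content."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def changed_pages(previous, current_pages):
    """Compare a new crawl against the fingerprints from the last one.

    previous:      dict mapping url -> fingerprint from the last crawl
    current_pages: dict mapping url -> HTML text from this crawl
    Returns the sorted list of URLs that are new or whose content differs.
    """
    changed = []
    for url, html in current_pages.items():
        if previous.get(url) != fingerprint(html):
            changed.append(url)
    return sorted(changed)

if __name__ == "__main__":
    last_crawl = {"http://website.com/": fingerprint("<p>old text</p>")}
    this_crawl = {
        "http://website.com/": "<p>new text</p>",
        "http://website.com/a": "<p>a new page</p>",
    }
    print(changed_pages(last_crawl, this_crawl))
```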
I’m ready! Show me how to build a web crawler.
Now that we know how it works, we are ready to build a web crawler. I will show you how to get a basic Python web crawler working on your machine. Given a link, you will be able to crawl through the page and get all the links. You can then crawl through those pages and get more links. This process will continue until the desired depth is reached. You will get a list of all the links in the end. To do anything further (like natural language processing, content parsing, etc.), you will have to write your own code.
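To make the “get all the links” step concrete, here is a minimal Python 3 sketch that pulls the links out of a page using only the standard library. It is an illustration, not the code of the crawler used later in this post, and the names LinkCollector and extract_links are made up for this example. Relative links are resolved against the address of the page they were found on.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links like "/about" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    """Return all links found in the given HTML text, in page order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links

if __name__ == "__main__":
    page = '<p><a href="/about">About</a> <a href="http://example.org/x">X</a></p>'
    for link in extract_links(page, "http://website.com/index.html"):
        print(link)
```

To crawl a live site, the HTML text would come from downloading the page first (for example with urllib.request.urlopen), and the returned links would be fed back into the crawl.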
Prerequisites: The first thing you need to do is make sure you have Python installed. If you don’t have Python, install it before you proceed further. To run this code, you also need BeautifulSoup, a Python library for parsing HTML documents. The steps to install BeautifulSoup are given below:
- Go to this site and download the file “beautifulsoup4-4.1.3.tar.gz”
- Unpack the file into a comfortable location
- Open the terminal and go to the unpacked folder
- Execute the following commands:
$ python setup.py build
$ python setup.py install
- If the installation is successful, you will not see any errors on the terminal.
Running the crawler: Go to a comfortable location on your terminal and download the web crawler code from my github account using the following command:
$ git clone https://github.com/prateekvjoshi/Python-WebCrawler.git
The above command will create a folder called “Python-WebCrawler” on your machine and get all the required files in one go. It is a modified version of James Mills’ original recipe. If you don’t have git, then just download “crawler.py” from this link. The good thing about Python is that almost any function we would ever need is either built in or already written by someone. You just have to arrange them properly so that they can do the dance for you!
Now we are ready to do some crawling! You can use “crawler.py” to crawl various websites. I have listed a few use cases below:
- The following command will display the total number of links found on a particular website after crawling:
$ python crawler.py http://website.com
- If you want to crawl only up to a particular depth (in this case, 2 levels), then:
$ python crawler.py -d 2 http://website.com
- If you want only the links found on this particular URL:
$ python crawler.py -l http://website.com
- There are many other options you can explore. Execute the following in the terminal and you will see a bunch of options:
$ python crawler.py --help
———————————————————————————————————–
Here is another way of creating a web crawler, using Java and crawler4j:
http://www.buggybread.com/2013/01/create-your-own-email-and-image.html
Thanks for the link Vikas. Good to know!
HEY!!!! do you know anything about the necessary steps in creating a search engine??? if so could you please email me A.S.A.P.??? your response is greatly appreciated!!!
Hi!
I must admit that until recently I didn’t care much for the site, but with the latest posts I am reading it frequently and I have started to like it.
Keep it up!
Thanks, Paula.
When I run the command crawler.py http://www.mtv.com, the output is always “Error:HTTP Error 400: Bad Request”. What could be causing the error?