Project 7 Option 2
Web Sucker
Due December 12
by 5pm
60 points
Web site sucker/verifier
Our goal is to write a single program which will perform the following tasks:
- Given a URL, verify all links leading from that URL and all
links leading from those links etc (see below for instructions
on how this process is to be terminated)
- Given a URL, download and store in files that page and pages to
which it refers (inlined images can/should be downloaded also)
You may write this program in any language you like.
Common attributes
Both modes should support identification of circular references (links
which were already visited) and understand relative vs absolute href's.
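
One possible way to detect circular references is to record every URL before it is fetched and to skip any URL that has already been recorded. The sketch below assumes a simple growable array with linear search; the names visited_set, visited_contains, visited_add and visited_free are hypothetical, and a hash table would be a better fit for large sites.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper type: a growable array of URLs already visited. */
struct visited_set {
    char **urls;
    size_t count;
    size_t capacity;
};

/* Returns 1 if url is already in the set, 0 otherwise. */
int visited_contains(const struct visited_set *set, const char *url)
{
    assert(set != NULL && url != NULL);
    for (size_t i = 0; i < set->count; i++) {
        if (strcmp(set->urls[i], url) == 0)
            return 1;
    }
    return 0;
}

/* Records url. Returns 1 if it was newly added, 0 if it was already
   present, and -1 if memory could not be allocated. */
int visited_add(struct visited_set *set, const char *url)
{
    if (visited_contains(set, url))
        return 0;
    if (set->count == set->capacity) {
        size_t new_cap = set->capacity ? set->capacity * 2 : 16;
        char **grown = realloc(set->urls, new_cap * sizeof *grown);
        if (grown == NULL)
            return -1;
        set->urls = grown;
        set->capacity = new_cap;
    }
    size_t len = strlen(url) + 1;
    char *copy = malloc(len);
    if (copy == NULL)
        return -1;
    memcpy(copy, url, len);
    set->urls[set->count++] = copy;
    return 1;
}

/* Releases every stored URL and resets the set to empty. */
void visited_free(struct visited_set *set)
{
    for (size_t i = 0; i < set->count; i++)
        free(set->urls[i]);
    free(set->urls);
    set->urls = NULL;
    set->count = set->capacity = 0;
}
```

A set is started with "struct visited_set set = {0};". Relative links should be converted to absolute URLs before they are added, otherwise the same page can be recorded under two different spellings.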
Both modes should utilize the command line arguments listed below
(see "man 3 getopt" for info on how to parse command line options;
long options such as these are handled by getopt_long).
In the descriptions below, the text before the colon (:) is the option
text, the text after the colon is the description. User specified
arguments are enclosed in angle brackets.
- --depth <depth>: indicates the maximum depth of link
following (default to 3). The initial URL specified should be
considered "depth 0".
- --ignore-file <ignore-file>: the name of a file which contains POSIX
format regular expressions (see "info regex") for URL's which
should not be followed (used to avoid CGI scripts, ASP's
etc). (default to no ignored URLs)
- --this-host: stay on the specified host only. Do not follow
links which lead to a different host
- --help: print a useful help message
The general format of the command should be
webbeast [options] <start-url>
where start-url is any valid http URL. Examples of running this
command include:
webbeast --depth 10 --ignore-file ignores.txt --suck <start-url>
webbeast --depth 1 --verify <start-url>
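
The options above could be parsed roughly as follows. This is a minimal sketch that assumes getopt_long (a GNU extension that is also available on the BSDs) is acceptable; struct options, parse_options and the single-letter codes returned for each long option are hypothetical choices, not part of the assignment.

```c
#include <assert.h>
#include <getopt.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical container for the parsed command line. */
struct options {
    int depth;               /* --depth, defaults to 3 */
    const char *ignore_file; /* --ignore-file, NULL when not given */
    int this_host;           /* --this-host */
    int suck;                /* 1 for --suck, 0 for --verify (the default) */
    int help;                /* --help */
    const char *start_url;   /* the single non-option argument */
};

/* Returns 0 on success and -1 on a usage error. */
int parse_options(int argc, char **argv, struct options *opts)
{
    static const struct option long_opts[] = {
        {"depth",       required_argument, NULL, 'd'},
        {"ignore-file", required_argument, NULL, 'i'},
        {"this-host",   no_argument,       NULL, 't'},
        {"verify",      no_argument,       NULL, 'v'},
        {"suck",        no_argument,       NULL, 's'},
        {"help",        no_argument,       NULL, 'h'},
        {NULL, 0, NULL, 0}
    };
    int c, verify_seen = 0;

    assert(argv != NULL && opts != NULL);
    *opts = (struct options){ .depth = 3 };
    optind = 1; /* restart scanning so the function can be called again */

    while ((c = getopt_long(argc, argv, "", long_opts, NULL)) != -1) {
        switch (c) {
        case 'd': {
            char *end;
            long value = strtol(optarg, &end, 10);
            if (end == optarg || *end != '\0' || value < 0 || value > INT_MAX)
                return -1; /* depth must be a non-negative integer */
            opts->depth = (int)value;
            break;
        }
        case 'i': opts->ignore_file = optarg; break;
        case 't': opts->this_host = 1;        break;
        case 'v': verify_seen = 1;            break;
        case 's': opts->suck = 1;             break;
        case 'h': opts->help = 1;             break;
        default:  return -1; /* unknown option or missing argument */
        }
    }
    if (verify_seen && opts->suck)
        return -1; /* --verify and --suck are mutually exclusive */
    if (opts->help)
        return 0;  /* caller prints the help text; no URL required */
    if (optind != argc - 1)
        return -1; /* exactly one start URL is expected */
    opts->start_url = argv[optind];
    return 0;
}
```

main() would call parse_options(argc, argv, &opts) and print the help message when it returns -1 or when opts.help is set.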
The verifier
The verifier is activated by the --verify option. It should produce
output on stdout indicating all of the links which were bad and the
page on which they occurred or the full path leading to the bad link
(try to format this nicely). Links followed should be any of those
available from anchor tags <a href="...">...</a>
for which the transport specified is "http" and which do not
match a regular expression in the ignore file, if one was specified.
Do not try to follow "ftp" or other types of links.
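
Deciding whether a link should be followed, and turning a relative href into an absolute URL, might look like the sketch below. It is a simplification: resolve_http_link and scheme_length are hypothetical names, "../" and "./" segments are not normalized, and a query-only href is simply appended to the base directory. The ignore file and --this-host checks would be applied to the result afterwards.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>

/* Returns the length of a leading "scheme:" prefix (colon included),
   or 0 when s does not start with a scheme. */
static size_t scheme_length(const char *s)
{
    size_t i = 0;
    if (!isalpha((unsigned char)s[0]))
        return 0;
    while (isalnum((unsigned char)s[i]) || s[i] == '+' || s[i] == '-' || s[i] == '.')
        i++;
    return s[i] == ':' ? i + 1 : 0;
}

/* Writes the absolute form of href (found on page base) into out.
   Returns 1 if the link is an http link that fits in out, 0 otherwise. */
int resolve_http_link(const char *base, const char *href, char *out, size_t out_size)
{
    assert(base != NULL && href != NULL && out != NULL);
    size_t href_len = strcspn(href, "#"); /* drop any fragment */
    if (href_len == 0 || strncasecmp(base, "http://", 7) != 0)
        return 0;

    size_t host_end = 7 + strcspn(base + 7, "/?#");              /* end of "http://host" */
    size_t path_end = host_end + strcspn(base + host_end, "?#"); /* end of the path */
    size_t dir_end = path_end;                                   /* just past the last '/' */
    while (dir_end > host_end && base[dir_end - 1] != '/')
        dir_end--;

    int n;
    if (scheme_length(href) > 0) {
        if (strncasecmp(href, "http://", 7) != 0)
            return 0; /* ftp:, mailto: and other transports are not followed */
        n = snprintf(out, out_size, "%.*s", (int)href_len, href);
    } else if (href[0] == '/' && href[1] == '/') {
        /* "//host/path" inherits the scheme of the base */
        n = snprintf(out, out_size, "http:%.*s", (int)href_len, href);
    } else if (href[0] == '/') {
        /* path relative to the host root */
        n = snprintf(out, out_size, "%.*s%.*s", (int)host_end, base, (int)href_len, href);
    } else if (dir_end > host_end) {
        /* path relative to the directory of the base page */
        n = snprintf(out, out_size, "%.*s%.*s", (int)dir_end, base, (int)href_len, href);
    } else {
        /* the base has no path at all, e.g. "http://host" */
        n = snprintf(out, out_size, "%.*s/%.*s", (int)host_end, base, (int)href_len, href);
    }
    return n > 0 && (size_t)n < out_size;
}
```

With the placeholder base http://example.com/docs/index.html, the href "page.html" resolves to http://example.com/docs/page.html and "/top.html" resolves to http://example.com/top.html, while an ftp link is rejected.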
The sucker
The sucker is activated by the --suck option (the --verify and --suck
options are mutually exclusive and --verify is the default). It
should operate in a manner identical to the verifier but whenever a
link is followed, the contents of the link should be stored in a file.
A directory structure reflecting the tree structure of each web site
visited should be created and the files placed in each directory
should correspond to the html documents obtained at that location.
The top level of the directory structure will always be the name of
the host. Do not nest hosts within the directory structure of other
hosts as this will lead to all kinds of problems.
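
The mapping from a URL to a file under the host's directory could be sketched as below. The name url_to_local_path is hypothetical, and so are two choices the assignment does not make: a URL ending in "/" is stored as index.html, and any path containing ".." is refused so that nothing is written outside the host directory. Creating the intermediate directories is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes "host/path" for an http URL into out.
   Returns 1 on success, 0 if the URL is rejected or does not fit. */
int url_to_local_path(const char *url, char *out, size_t out_size)
{
    assert(url != NULL && out != NULL);
    if (strncmp(url, "http://", 7) != 0)
        return 0;

    const char *host = url + 7;
    size_t host_len = strcspn(host, "/?#");
    const char *path = host + host_len;
    size_t path_len = strcspn(path, "?#"); /* ignore query and fragment */
    if (host_len == 0)
        return 0;

    /* Conservative check: refuse anything that contains "..". */
    for (size_t i = 0; i + 1 < path_len; i++) {
        if (path[i] == '.' && path[i + 1] == '.')
            return 0;
    }

    /* Assumption: a directory URL is saved as index.html. */
    int needs_index = (path_len == 0 || path[path_len - 1] == '/');
    int n = snprintf(out, out_size, "%.*s%.*s%s%s",
                     (int)host_len, host, (int)path_len, path,
                     path_len == 0 ? "/" : "",
                     needs_index ? "index.html" : "");
    return n > 0 && (size_t)n < out_size;
}
```

Because the host name always comes first in the result, pages from different hosts end up in sibling directories rather than nested inside one another.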
Requirements
- Your program should make use of one of the three sockets
support libraries discussed in class or one with suitably similar
functionality.
- Your program must compile and perform all of the duties
described above. Failure to produce a running program will
result in a zero on this project. It is better to hand in a
partial assignment than one that does not work at all.
- Your program must compile by typing "make" in its home directory.
Resources
Learning about the HTTP protocol
- If you're going to use C, study the libwww
library provided by w3. I recommend that you use this
library. They also provide examples
of C robots...check them out!
- If you plan to use Java, take a look at the class
HttpURLConnection in the java.net
package.
- If you're going to take the blood and guts approach, I'd suggest
that you take my DoNothingServer
run it (it will listen on port 8080 by default) then hit it with
a web browser to see how a web browser requests
pages from an HTTP server. The DoNothingServer will echo all
of the requests on the console so you can see how they are
formatted. Unfortunately it doesn't respond so you will have to
kill it (ctrl-C) to free your web browser. Also you should look
at a web server from the client end. Try
telnetting to a web server and sending the commands that you
learned above. What does the response look like? What if you
request an image?
- Since the ad-hoc methods above are generally considered bad
practice you
might consider reading the HTTP RFC (Request For
Comments) which is the de-facto standard for the protocol
(yawn). Keep in mind that in its simplest version your program
only has to support .html files (mime type text/html) although
if you structure your program correctly the mime type shouldn't
make much of a difference.
- There are many abbreviated descriptions of this protocol
available on the web.
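
For the blood and guts approach, the request you will see echoed by the DoNothingServer and the first line of a server's reply can be handled with very little code. The sketch below assumes a plain HTTP/1.0 GET with a Host header; build_get_request and parse_status_code are hypothetical names, and sending the request over a socket is not shown.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Formats a minimal GET request into out.
   Returns the request length, or -1 if it does not fit. */
int build_get_request(const char *host, const char *path, char *out, size_t out_size)
{
    assert(host != NULL && path != NULL && out != NULL);
    int n = snprintf(out, out_size, "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n",
                     path[0] != '\0' ? path : "/", host);
    return (n > 0 && (size_t)n < out_size) ? n : -1;
}

/* Extracts the numeric status from a status line such as
   "HTTP/1.0 200 OK". Returns -1 if the line is not a status line. */
int parse_status_code(const char *response)
{
    int code;
    assert(response != NULL);
    if (strncmp(response, "HTTP/", 5) != 0)
        return -1;
    if (sscanf(response, "HTTP/%*d.%*d %d", &code) != 1)
        return -1;
    return code;
}
```

A verifier might treat any status of 400 or above as a bad link; that threshold is a design choice for you to make, since the assignment does not define "bad" precisely.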
Learning about HTML
- Take a look at the
definitive HTML reference.
- There are lots of abbreviated HTML descriptions available on the
web.
- Again, the w3 folks have some helpful HTML libraries (although a
bit of overkill for us).
Advice
This is an ambitious project. You will receive most of the credit for
getting the "verifier" done correctly. I suggest that you
build the project in "modules" where each module is a
separate .c file:
- webifier.c -- main(), parsing the command line options etc
- ignore_file.c -- functions to manage the ignore file
- http.c -- functions to deal with the http protocol
- html.c -- functions to parse an html file and pull out the
anchor tags
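
As an illustration of what html.c might contain, here is a minimal sketch of pulling href values out of anchor tags. The function name next_href is hypothetical. It is a plain string scanner, not a real HTML parser: it will also pick up anchors inside comments, and it is confused by a ">" inside a quoted attribute value.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>
#include <strings.h>

/* Finds the next <a ... href=...> tag at or after *cursor, copies the
   href value into out and moves *cursor past the tag. Returns 1 when a
   link was found and 0 when there are no more. Values that do not fit
   in out are skipped. */
int next_href(const char **cursor, char *out, size_t out_size)
{
    assert(cursor != NULL && *cursor != NULL && out != NULL);
    const char *p = *cursor;

    while ((p = strchr(p, '<')) != NULL) {
        p++;
        if ((p[0] != 'a' && p[0] != 'A') || !isspace((unsigned char)p[1]))
            continue; /* not an anchor tag */
        const char *end = strchr(p, '>');
        if (end == NULL)
            break;    /* unterminated tag: give up */

        /* Look for whitespace followed by "href" inside the tag. */
        for (const char *q = p + 1; end - q > 4; q++) {
            if (!isspace((unsigned char)q[0]) || strncasecmp(q + 1, "href", 4) != 0)
                continue;
            const char *v = q + 5;
            while (v < end && isspace((unsigned char)*v))
                v++;
            if (v >= end || *v != '=')
                continue; /* an attribute that merely starts with "href" */
            v++;
            while (v < end && isspace((unsigned char)*v))
                v++;

            /* The value may be quoted with " or ', or unquoted. */
            char quote = '\0';
            if (v < end && (*v == '"' || *v == '\''))
                quote = *v++;
            const char *stop = v;
            while (stop < end && (quote ? *stop != quote : !isspace((unsigned char)*stop)))
                stop++;

            size_t len = (size_t)(stop - v);
            if (len >= out_size)
                break; /* too long for the buffer: skip this tag */
            memcpy(out, v, len);
            out[len] = '\0';
            *cursor = end + 1;
            return 1;
        }
    }
    return 0;
}
```

A caller sets a cursor to the start of the page text and calls next_href in a loop until it returns 0.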
Develop and test each module separately. I would be happy to help
guide you in selecting the functions and structures which belong in
each module. Do not try to implement the ignore file support right
away since learning POSIX regular expressions takes a little time.
Simply place stub function calls in the appropriate locations such
as
void parse_ignore_file(char *fileName); /* parse the
contents of the ignore file for later use by should_ignore */
int should_ignore(char *url); /* returns 1 if a URL
should not be followed and 0 if it should */
Then you can add the actual versions of these functions once the
networking part of your project is coming closer to completion.
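
When you reach that point, the two functions might be filled in along these lines using the POSIX regcomp and regexec calls. This is a sketch under several assumptions: the ignore file holds one pattern per line, patterns use the extended syntax (REG_EXTENDED), at most 64 patterns are kept, and add_ignore_pattern is a hypothetical helper introduced to keep the file reading separate from the compiling.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

#define MAX_IGNORE_PATTERNS 64 /* arbitrary limit for this sketch */

static regex_t ignore_patterns[MAX_IGNORE_PATTERNS];
static int ignore_count = 0;

/* Hypothetical helper: compiles one pattern and stores it.
   Returns 1 on success, 0 if the pattern is invalid or the table is full. */
int add_ignore_pattern(const char *pattern)
{
    assert(pattern != NULL);
    if (ignore_count >= MAX_IGNORE_PATTERNS)
        return 0;
    if (regcomp(&ignore_patterns[ignore_count], pattern,
                REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    ignore_count++;
    return 1;
}

/* Parses the contents of the ignore file for later use by should_ignore.
   Assumes one regular expression per line; blank lines are skipped. */
void parse_ignore_file(char *fileName)
{
    char line[1024];
    FILE *fp = fopen(fileName, "r");
    if (fp == NULL) {
        perror(fileName);
        return;
    }
    while (fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\r\n")] = '\0'; /* strip the line ending */
        if (line[0] != '\0' && !add_ignore_pattern(line))
            fprintf(stderr, "skipping bad ignore pattern: %s\n", line);
    }
    fclose(fp);
}

/* Returns 1 if a URL should not be followed and 0 if it should. */
int should_ignore(char *url)
{
    for (int i = 0; i < ignore_count; i++) {
        if (regexec(&ignore_patterns[i], url, 0, NULL, 0) == 0)
            return 1;
    }
    return 0;
}
```

With no ignore file loaded, should_ignore returns 0 for every URL, which matches the default of no ignored URLs.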
Possible advanced extensions
- Process the robots.txt file to avoid following dangerous links.
- Add support for inline image (<img> tags) downloading in the
sucker and image link checking in the verifier.
- Add support for publishing, which can upload an entire directory
structure to a web server via the http "PUT" command. This
might be a separate program but it could certainly benefit from the
libraries you developed in this project. Pitfall: You will have to
find a server on which you have publishing privileges.
Page maintained by:
David Shaffer cdshaffer@acm.org
Last modified: Mon Nov 19 09:57:26 EST 2001