Project 7 Option 2
Web Sucker
Due December 12 by 5pm
60 points


Web site sucker/verifier

Our goal is to write a single program which will perform the following tasks:
  1. Given a URL, verify all links leading from that URL, all links leading from those links, and so on (see below for instructions on how this process is to be terminated)
  2. Given a URL, download and store in files that page and pages to which it refers (inlined images can/should be downloaded also)
You may write this program in any language you like.

Common attributes

Both modes should detect circular references (links which were already visited) and handle both relative and absolute hrefs.
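As a rough illustration of these two shared requirements, here is a minimal sketch in C. The names already_visited and resolve_href are hypothetical, the visited set is a fixed-size array searched linearly, and the resolver is deliberately simplified: it assumes the base is an absolute "http://" URL and does not collapse ".." segments or recognize other schemes such as "ftp:" or "mailto:".

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_VISITED 1024   /* arbitrary limit chosen for this sketch */

static char *visited[MAX_VISITED];
static int visited_count = 0;

/* Returns 1 if url was seen before; otherwise records it and returns 0. */
int already_visited(const char *url)
{
    for (int i = 0; i < visited_count; i++)
        if (strcmp(visited[i], url) == 0)
            return 1;
    if (visited_count < MAX_VISITED) {
        char *copy = malloc(strlen(url) + 1);
        if (copy != NULL) {
            strcpy(copy, url);
            visited[visited_count++] = copy;
        }
    }
    return 0;
}

/* Resolve href against base and write the absolute URL into out.
 * Assumes base starts with "http://"; ".." segments are not collapsed. */
void resolve_href(const char *base, const char *href, char *out, size_t outlen)
{
    if (strncmp(href, "http://", 7) == 0) {        /* already absolute */
        snprintf(out, outlen, "%s", href);
        return;
    }
    const char *host = base + 7;
    const char *path = strchr(host, '/');          /* NULL if base has no path */
    if (href[0] == '/') {                          /* relative to the host root */
        size_t prefix = path ? (size_t)(path - base) : strlen(base);
        snprintf(out, outlen, "%.*s%s", (int)prefix, base, href);
    } else if (path == NULL) {                     /* base is just the host */
        snprintf(out, outlen, "%s/%s", base, href);
    } else {                                       /* relative to base's directory */
        const char *slash = strrchr(path, '/');
        size_t prefix = (size_t)(slash - base) + 1;
        snprintf(out, outlen, "%.*s%s", (int)prefix, base, href);
    }
}
```

A crawler would typically resolve each href first and then pass the absolute URL to already_visited, so that the same page reached through a relative and an absolute link is only processed once.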

Both modes should utilize the command line arguments listed below (see "man 3 getopt" for info on how to parse command line options). In the descriptions below, the text before the colon (:) is the option text, the text after the colon is the description. User specified arguments are enclosed in angle brackets.

The general format of the command should be
webbeast [options] <start-url>
where start-url is any valid http URL. Examples of running this command include:
webbeast --depth 10 --ignore-file ignores.txt --suck <start-url>
webbeast --depth 1 --verify <start-url>
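One possible way to parse a command line of this shape is getopt_long, the long-option companion to getopt found in glibc and the BSDs. The sketch below only covers the four options that appear in the examples; the struct options type, the parse_options function, and the default depth of 1 are assumptions made for illustration, not part of the assignment.

```c
#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical container for the parsed command line. */
struct options {
    int depth;                 /* value of --depth */
    const char *ignore_file;   /* value of --ignore-file, or NULL */
    int suck;                  /* 1 for --suck, 0 for --verify (the default) */
    const char *start_url;     /* first non-option argument */
};

/* Returns 0 on success, -1 on a usage error. */
int parse_options(int argc, char **argv, struct options *opts)
{
    static const struct option long_opts[] = {
        {"depth",       required_argument, NULL, 'd'},
        {"ignore-file", required_argument, NULL, 'i'},
        {"suck",        no_argument,       NULL, 's'},
        {"verify",      no_argument,       NULL, 'v'},
        {NULL, 0, NULL, 0}
    };
    int c, saw_suck = 0, saw_verify = 0;

    opts->depth = 1;           /* assumed default; the handout does not give one */
    opts->ignore_file = NULL;
    opts->suck = 0;
    opts->start_url = NULL;

    while ((c = getopt_long(argc, argv, "", long_opts, NULL)) != -1) {
        switch (c) {
        case 'd': opts->depth = atoi(optarg); break;
        case 'i': opts->ignore_file = optarg; break;
        case 's': saw_suck = 1; break;
        case 'v': saw_verify = 1; break;
        default:  return -1;   /* unknown option or missing argument */
        }
    }
    if (saw_suck && saw_verify)
        return -1;             /* the two modes are mutually exclusive */
    if (optind >= argc)
        return -1;             /* no start-url given */
    opts->suck = saw_suck;
    opts->start_url = argv[optind];
    return 0;
}
```

In main() this would be called once as parse_options(argc, argv, &opts), printing a usage message and exiting when it returns -1.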

The verifier

The verifier is activated by the --verify option. It should produce output on stdout indicating all of the links which were bad and the page on which they occurred or the full path leading to the bad link (try to format this nicely). Links followed should be any of those available from anchor tags <a href="...">...</a> for which the transport specified is "http" and which do not match a regular expression in the ignore file, if one was specified. Do not try to follow "ftp" or other types of links.
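To make the link-selection rule concrete, here is a small sketch of pulling href values out of anchor tags and filtering on the "http" transport. The functions next_href and is_http_url are hypothetical names. The scanner is intentionally naive: it only handles double-quoted href values, and it treats the first '>' as the end of the tag, so it is a starting point rather than a full HTML parser.

```c
#include <stdio.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Find the next <a ... href="..."> at or after *cursor.
 * On success copies the href value into out, moves *cursor past the tag,
 * and returns 1. Returns 0 when no further anchor with an href is found. */
int next_href(const char **cursor, char *out, size_t outlen)
{
    const char *p = *cursor;
    while ((p = strchr(p, '<')) != NULL) {
        p++;   /* step past '<' */
        /* Only accept tags named exactly "a" (any case). */
        if ((p[0] != 'a' && p[0] != 'A') ||
            (p[1] != ' ' && p[1] != '\t' && p[1] != '\n'))
            continue;
        const char *end = strchr(p, '>');
        if (end == NULL)
            break;   /* unterminated tag: give up */
        for (const char *q = p; q + 6 <= end; q++) {
            if (strncasecmp(q, "href=\"", 6) == 0) {
                const char *val = q + 6;
                const char *close = memchr(val, '"', (size_t)(end - val));
                if (close == NULL)
                    break;   /* no closing quote inside this tag */
                size_t n = (size_t)(close - val);
                if (n >= outlen)
                    n = outlen - 1;   /* truncate to fit the buffer */
                memcpy(out, val, n);
                out[n] = '\0';
                *cursor = end + 1;
                return 1;
            }
        }
        p = end + 1;
    }
    return 0;
}

/* Returns 1 if the (already absolute) URL uses the http transport. */
int is_http_url(const char *url)
{
    return strncasecmp(url, "http://", 7) == 0;
}
```

A verifier loop would call next_href repeatedly on a downloaded page, resolve each result to an absolute URL, and skip anything for which is_http_url returns 0.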

The sucker

The sucker is activated by the --suck option (the --verify and --suck options are mutually exclusive and --verify is the default). It should operate in a manner identical to the verifier but whenever a link is followed, the contents of the link should be stored in a file. A directory structure reflecting the tree structure of each web site visited should be created and the files placed in each directory should correspond to the html documents obtained at that location. The top level of the directory structure will always be the name of the host. Do not nest hosts within the directory structure of other hosts as this will lead to all kinds of problems.
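The mapping from a URL to a local file can be sketched as below. The function name url_to_path is hypothetical, and saving directory-style URLs as "index.html" is an assumption (the handout does not say what to call them). The sketch does not handle port numbers, query strings, or ".." segments, which a real implementation should reject or normalize so that files cannot land outside the host's directory.

```c
#include <stdio.h>
#include <string.h>

/* Map an absolute http URL to a relative file path whose top-level
 * directory is the host name. Returns 0 on success, -1 on error. */
int url_to_path(const char *url, char *out, size_t outlen)
{
    if (strncmp(url, "http://", 7) != 0)
        return -1;                       /* only http URLs are stored */
    const char *rest = url + 7;          /* "host/path..." */
    if (*rest == '\0' || *rest == '/')
        return -1;                       /* empty host name */

    const char *path = strchr(rest, '/');
    int n;
    if (path == NULL)                    /* "http://host" */
        n = snprintf(out, outlen, "%s/index.html", rest);
    else if (path[strlen(path) - 1] == '/')   /* "http://host/dir/" */
        n = snprintf(out, outlen, "%sindex.html", rest);
    else                                 /* "http://host/dir/file.html" */
        n = snprintf(out, outlen, "%s", rest);

    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Before writing the file, each directory component of the resulting path would still need to be created, for example with mkdir(2), ignoring the error when a directory already exists.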

Requirements

  1. Your program should make use of one of the three sockets support libraries discussed in class or one with suitably similar functionality.
  2. Your program must compile and perform all of the duties described above. Failure to produce a running program will result in a zero on this project. It is better to hand in a partial assignment than one that does not work at all.
  3. Your program must compile by typing "make" in its home directory.

Resources

Learning about the HTTP protocol

Learning about HTML

Advice

This is an ambitious project. You will receive most of the credit for getting the "verifier" done correctly. I suggest that you build the project in "modules" where each module is a separate .c file:
  1. webifier.c -- main(), parsing the command line options etc
  2. ignore_file.c -- functions to manage the ignore file
  3. http.c -- functions to deal with the http protocol
  4. html.c -- functions to parse an html file and pull out the anchor tags
Develop and test each module separately. I would be happy to help guide you in selecting the functions and structures which belong in each module. Do not try to implement the ignore file support right away since learning POSIX regular expressions takes a little time. Simply place stub function calls in the appropriate locations such as
void parse_ignore_file(char *fileName); /* parse the contents of the ignore file for later use by should_ignore */
int should_ignore(char *url); /* returns 1 if a URL should not be followed and 0 if it should */
Then you can add the actual versions of these functions once the networking part of your project is coming closer to completion.
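When the time comes to replace the stubs, the two functions might look roughly like the sketch below, built on the POSIX regcomp and regexec calls. The ignore file format is not specified here, so one extended regular expression per line is an assumption, as are the helper add_ignore_pattern and the fixed limit of 64 patterns.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

#define MAX_PATTERNS 64   /* arbitrary limit chosen for this sketch */

static regex_t patterns[MAX_PATTERNS];
static int pattern_count = 0;

/* Hypothetical helper: compile one pattern. Returns 0 on success, -1 on failure. */
int add_ignore_pattern(const char *pattern)
{
    if (pattern_count >= MAX_PATTERNS)
        return -1;
    if (regcomp(&patterns[pattern_count], pattern,
                REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    pattern_count++;
    return 0;
}

/* Read the ignore file, assuming one regular expression per line. */
void parse_ignore_file(char *fileName)
{
    char line[512];
    FILE *fp = fopen(fileName, "r");
    if (fp == NULL) {
        perror(fileName);
        return;
    }
    while (fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';   /* strip the line ending */
        if (line[0] != '\0' && add_ignore_pattern(line) != 0)
            fprintf(stderr, "skipping bad pattern: %s\n", line);
    }
    fclose(fp);
}

/* Returns 1 if url matches any compiled pattern, 0 otherwise. */
int should_ignore(char *url)
{
    for (int i = 0; i < pattern_count; i++)
        if (regexec(&patterns[i], url, 0, NULL, 0) == 0)
            return 1;
    return 0;
}
```

With an ignore file containing the line \.pdf$, should_ignore would return 1 for any URL ending in ".pdf" and 0 for ordinary HTML pages. REG_NOSUB is used because only a yes/no answer is needed, not the match position.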

Possible advanced extensions

Process the robots.txt file to avoid following dangerous links.

Add support for inline image (<img> tags) downloading in the sucker and image link checking in the verifier.

Add support for publishing which can upload an entire directory structure to a web server via the http "PUT" command. This might be a separate program but it could certainly benefit from the libraries you developed in this project. Pitfall: You will have to find a server on which you have publishing privileges.


Page maintained by:

David Shaffer
cdshaffer@acm.org
Last modified: Mon Nov 19 09:57:26 EST 2001