Network Programming Project 3
Due Monday February 21, 2000


Web site sucker/verifier

Our goal is to write a single program which will perform the following tasks:
  1. Given a URL, verify all links leading from that URL, all links leading from those links, and so on (see below for instructions on how this process is to be terminated).
  2. Given a URL, download and store in files that page and the pages to which it refers (inlined images can/should be downloaded also).

Common attributes

Both modes should detect circular references (links which were already visited) and handle relative vs absolute hrefs as discussed in class.
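
One way to detect circular references is to keep a set of URLs that have already been visited and to check it before following a link. The following is only a minimal sketch: the names url_set, url_set_add, url_set_contains and url_set_free are hypothetical, and it assumes every href has already been resolved to an absolute URL so that the same page always yields the same string.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical visited-URL set: a growable array of heap-allocated strings.
 * A linear search is assumed to be fast enough for a class project. */
struct url_set {
    char **urls;
    size_t count;
    size_t capacity;
};

/* Returns 1 if url is already in the set, 0 otherwise. */
int url_set_contains(const struct url_set *set, const char *url)
{
    for (size_t i = 0; i < set->count; i++) {
        if (strcmp(set->urls[i], url) == 0)
            return 1;
    }
    return 0;
}

/* Returns 1 if url was newly added, 0 if it was already present
 * (a circular reference), and -1 if memory could not be allocated. */
int url_set_add(struct url_set *set, const char *url)
{
    if (url_set_contains(set, url))
        return 0;
    if (set->count == set->capacity) {
        size_t new_cap = set->capacity ? set->capacity * 2 : 16;
        char **grown = realloc(set->urls, new_cap * sizeof *grown);
        if (grown == NULL)
            return -1;
        set->urls = grown;
        set->capacity = new_cap;
    }
    char *copy = malloc(strlen(url) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, url);
    set->urls[set->count++] = copy;
    return 1;
}

/* Releases every stored URL and resets the set to empty. */
void url_set_free(struct url_set *set)
{
    for (size_t i = 0; i < set->count; i++)
        free(set->urls[i]);
    free(set->urls);
    set->urls = NULL;
    set->count = set->capacity = 0;
}
```

A set declared as "struct url_set visited = {0};" starts out empty; a link would be followed only when url_set_add returns 1.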

Both modes should use the command line arguments listed below (see "man 3 getopt" for info on how to parse command line options). In the descriptions below, the text before the colon (:) is the option text and the text after the colon is the description. User-specified arguments are enclosed in angle brackets.

The general format of the command should be
webbeast [options] <start-url>
where start-url is any valid http URL. Examples of running this command include:
webbeast --depth 10 --ignore-file ignores.txt --suck <start-url>
webbeast --depth 1 --verify <start-url>
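
The command line above could be parsed with getopt_long, which is covered by the same manual page as getopt. This is a sketch only: the option names are taken from the example commands, while the struct options type, the parse_options function and the default depth of 1 are assumptions that are not part of the assignment.

```c
#include <getopt.h>
#include <stdlib.h>

enum mode { MODE_VERIFY, MODE_SUCK };

/* Hypothetical container for the parsed command line. */
struct options {
    enum mode mode;
    int depth;
    const char *ignore_file;  /* NULL when --ignore-file was not given */
    const char *start_url;
};

/* Fills *opts from argv. Returns 0 on success and -1 on a usage error. */
int parse_options(int argc, char **argv, struct options *opts)
{
    static const struct option long_opts[] = {
        { "depth",       required_argument, NULL, 'd' },
        { "ignore-file", required_argument, NULL, 'i' },
        { "suck",        no_argument,       NULL, 's' },
        { "verify",      no_argument,       NULL, 'v' },
        { NULL, 0, NULL, 0 }
    };
    int seen_suck = 0, seen_verify = 0, c;

    opts->mode = MODE_VERIFY;   /* --verify is the default */
    opts->depth = 1;            /* assumed default, not given in the handout */
    opts->ignore_file = NULL;
    opts->start_url = NULL;

    while ((c = getopt_long(argc, argv, "", long_opts, NULL)) != -1) {
        switch (c) {
        case 'd': opts->depth = atoi(optarg); break;
        case 'i': opts->ignore_file = optarg; break;
        case 's': seen_suck = 1; break;
        case 'v': seen_verify = 1; break;
        default:  return -1;    /* unknown option or missing argument */
        }
    }
    if (seen_suck && seen_verify)
        return -1;              /* the two modes are mutually exclusive */
    if (seen_suck)
        opts->mode = MODE_SUCK;
    if (optind != argc - 1)
        return -1;              /* exactly one start URL is expected */
    opts->start_url = argv[optind];
    return 0;
}
```

main() would call parse_options(argc, argv, &opts) once and print a usage message when it returns -1.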

The verifier

The verifier is activated by the --verify option. It should produce output on stdout listing every bad link together with either the page on which it occurred or the full path leading to it (try to format this nicely). Links followed should be any of those available from anchor tags <a href="...">...</a> for which the transport specified is "http" and which do not match a regular expression in the ignore file, if one was specified. Do not try to follow "ftp" or other types of links.
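
The link selection rule above can be sketched as two small functions: one that pulls the next href out of an anchor tag, and one that rejects links whose transport is not "http". Both names (next_href, is_followable) are hypothetical. The scanner is deliberately naive: it assumes the tag ends at the first '>' character and it does not skip HTML comments. Links with no scheme are treated as relative links that still have to be resolved against the current page.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Copies the href value of the next <a ...> tag found in html into out
 * (out_size must be at least 1; longer values are truncated). Returns a
 * pointer just past that tag, or NULL when no further anchor has an href. */
const char *next_href(const char *html, char *out, size_t out_size)
{
    const char *p = html;

    while ((p = strchr(p, '<')) != NULL) {
        p++;
        if ((*p != 'a' && *p != 'A') || !isspace((unsigned char)p[1]))
            continue;                       /* not an anchor tag */
        const char *end = strchr(p, '>');
        if (end == NULL)
            return NULL;                    /* unterminated tag */
        for (const char *q = p + 1; q + 4 < end; q++) {
            if (!isspace((unsigned char)q[-1]) || strncasecmp(q, "href", 4) != 0)
                continue;
            q += 4;
            while (q < end && isspace((unsigned char)*q)) q++;
            if (q >= end || *q != '=')
                break;                      /* attribute without a value */
            q++;
            while (q < end && isspace((unsigned char)*q)) q++;
            char quote = (*q == '"' || *q == '\'') ? *q++ : '\0';
            size_t n = 0;
            while (q < end && n + 1 < out_size &&
                   (quote ? *q != quote : !isspace((unsigned char)*q)))
                out[n++] = *q++;
            out[n] = '\0';
            return end + 1;
        }
        p = end;                            /* anchor without href, e.g. <a name=...> */
    }
    return NULL;
}

/* Returns 1 for "http:" links and for links with no scheme (relative
 * links), and 0 for any other scheme such as "ftp:" or "mailto:". */
int is_followable(const char *href)
{
    const char *p = href;

    while (isalnum((unsigned char)*p) || *p == '+' || *p == '-' || *p == '.')
        p++;
    if (*p != ':' || p == href)
        return 1;                           /* no scheme present */
    return (p - href) == 4 && strncasecmp(href, "http", 4) == 0;
}
```

A page would be scanned by calling next_href repeatedly, passing the returned pointer back in until it returns NULL.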

The sucker

The sucker is activated by the --suck option (the --verify and --suck options are mutually exclusive and --verify is the default). It should operate in a manner identical to the verifier, but whenever a link is followed, the contents of the link should be stored in a file. A directory structure reflecting the tree structure of each web site visited should be created, and the files placed in each directory should correspond to the html documents obtained at that location.
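
The mapping from a URL to a local file might look like the sketch below. The function names url_to_path and make_parent_dirs are hypothetical, and two choices are assumptions rather than requirements: the host name becomes the top-level directory, and a URL that names a directory is saved as "index.html" inside it. The sketch does not handle query strings or ".." path components, which a real implementation would need to sanitize.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Converts "http://host/dir/file" into the relative path "host/dir/file".
 * Returns 0 on success, -1 if url is not an http URL or out is too small. */
int url_to_path(const char *url, char *out, size_t out_size)
{
    static const char prefix[] = "http://";
    const char *suffix = "";

    if (strncmp(url, prefix, sizeof prefix - 1) != 0)
        return -1;
    const char *rest = url + sizeof prefix - 1;
    size_t len = strlen(rest);

    if (strchr(rest, '/') == NULL)
        suffix = "/index.html";             /* bare host: http://host */
    else if (len > 0 && rest[len - 1] == '/')
        suffix = "index.html";              /* directory: http://host/dir/ */

    int n = snprintf(out, out_size, "%s%s", rest, suffix);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}

/* Creates every directory leading up to path, like "mkdir -p" applied to
 * the parent of path. Returns 0 on success, -1 on failure. */
int make_parent_dirs(const char *path)
{
    char buf[1024];

    if (path[0] == '\0')
        return 0;
    if (strlen(path) >= sizeof buf)
        return -1;
    strcpy(buf, path);
    for (char *p = buf + 1; *p != '\0'; p++) {
        if (*p != '/')
            continue;
        *p = '\0';                          /* temporarily cut at this component */
        if (mkdir(buf, 0755) != 0 && errno != EEXIST)
            return -1;
        *p = '/';
    }
    return 0;
}
```

After url_to_path succeeds, the sucker would call make_parent_dirs on the result and then open that path for writing.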

Requirements

  1. Your program must make use of one of the three sockets support libraries discussed in class or one with suitably similar functionality.
  2. Your program may not make use of an "HTTP" support library of any kind (unless you write it yourself, of course).
  3. Your program must compile and perform all of the duties described above. Failure to produce a running program will result in a zero on this project. It is better to hand in a partial assignment than one that does not work at all.
  4. Your program must compile by typing "make" in its home directory.
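
Since no HTTP support library is allowed (requirement 2), the request text and the reply's status line have to be handled by hand. The sketch below shows one possible shape for that part of http.c. The names build_get_request and parse_status_code are hypothetical, and the choice of HTTP/1.0 with a Host header is an assumption. Sending the request and reading the reply through your sockets library is left out.

```c
#include <stdio.h>

/* Writes a minimal GET request for path on host into out.
 * Returns the request length, or -1 if out is too small. */
int build_get_request(const char *host, const char *path,
                      char *out, size_t out_size)
{
    int n = snprintf(out, out_size,
                     "GET %s HTTP/1.0\r\n"
                     "Host: %s\r\n"
                     "\r\n", path, host);
    return (n < 0 || (size_t)n >= out_size) ? -1 : n;
}

/* Extracts the numeric status code from a reply whose first line looks
 * like "HTTP/1.0 404 Not Found". Returns -1 if the line is malformed. */
int parse_status_code(const char *response)
{
    int major, minor, code;

    if (sscanf(response, "HTTP/%d.%d %d", &major, &minor, &code) != 3)
        return -1;
    return code;
}
```

The verifier could treat a connection failure or a malformed reply (parse_status_code returning -1) as a bad link, along with whichever status codes you decide count as failures, such as 404.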

Items to hand in/e-mail

  1. You must e-mail a ".tar.gz" version of your project directory in uuencode or base64 encoding in a MIME formatted message to: DavidShaffer@psu.edu.
  2. You must hand in a printed version of all of your source code (you may exclude printouts of the libraries as long as you used one of those mentioned in class).

Resources

Learning about the HTTP protocol

Learning about HTML

Advice

This is an ambitious project. You will receive most of the credit for getting the "verifier" done correctly. I suggest that you build the project in "modules" where each module is a separate .c file:
  1. webifier.c -- main(), parsing the command line options etc
  2. ignore_file.c -- functions to manage the ignore file
  3. http.c -- functions to deal with the http protocol
  4. html.c -- functions to parse an html file and pull out the anchor tags
Develop and test each module separately. I would be happy to help guide you in selecting the functions and structures which belong in each module. Do not try to implement the ignore file support right away since learning POSIX regular expressions takes a little time. Simply place stub function calls in the appropriate locations such as
void parse_ignore_file(char *fileName); /* parse the contents of the ignore file for later use by should_ignore */
int should_ignore(char *url); /* returns 1 if a URL should not be followed and 0 if it should */
Then you can add the actual versions of these functions once the networking part of your project is coming closer to completion.
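
When you do replace the stubs, the real versions might look roughly like the sketch below, built on the POSIX regcomp and regexec functions. It assumes the ignore file holds one extended regular expression per line, which the assignment does not specify. The add_ignore_pattern helper and the MAX_PATTERNS limit are hypothetical additions.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

#define MAX_PATTERNS 64   /* arbitrary limit chosen for this sketch */

static regex_t patterns[MAX_PATTERNS];
static int pattern_count = 0;

/* Compiles one pattern and stores it. Returns 0 on success, -1 if the
 * table is full or the pattern is not a valid extended regex. */
int add_ignore_pattern(const char *pattern)
{
    if (pattern_count >= MAX_PATTERNS)
        return -1;
    if (regcomp(&patterns[pattern_count], pattern,
                REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    pattern_count++;
    return 0;
}

/* Parses the contents of the ignore file for later use by should_ignore.
 * Assumed format: one regular expression per line, blank lines skipped. */
void parse_ignore_file(char *fileName)
{
    char line[1024];
    FILE *fp = fopen(fileName, "r");

    if (fp == NULL) {
        perror(fileName);
        return;
    }
    while (fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';   /* strip the line ending */
        if (line[0] == '\0')
            continue;
        if (add_ignore_pattern(line) != 0)
            fprintf(stderr, "skipping bad pattern: %s\n", line);
    }
    fclose(fp);
}

/* Returns 1 if a URL should not be followed and 0 if it should. */
int should_ignore(char *url)
{
    for (int i = 0; i < pattern_count; i++) {
        if (regexec(&patterns[i], url, 0, NULL, 0) == 0)
            return 1;
    }
    return 0;
}
```

With no ignore file loaded, should_ignore returns 0 for every URL, so it behaves exactly like the stub until patterns are added.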

This is a long project so START TODAY.

Possible advanced extensions

Add support for inline image (<img> tags) downloading in the sucker and image link checking in the verifier.

Add support for publishing, which can upload an entire directory structure to a web server via the HTTP "PUT" method. This might be a separate program, but it could certainly benefit from the libraries you developed in this project. Pitfall: You will have to find a server on which you have publishing privileges.


Page maintained by:

David Shaffer
DavidShaffer@psu.edu
Last modified: Mon Feb 7 14:50:20 EST 2000