Network Programming Project 3
Due Monday, February 21, 2000
Web site sucker/verifier
Our goal is to write a single program which will perform the following tasks:
- Given a URL, verify all links leading from that URL and all
links leading from those links, and so on (see below for how
this process is terminated)
- Given a URL, download and store in files that page and pages to
which it refers (inlined images can/should be downloaded also)
Common attributes
Both modes should support identification of circular references (links
which were already visited) and understand relative vs absolute href's
as discussed in class.
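As a rough illustration of both requirements, here is a minimal sketch of a visited-URL list and a simplified href resolver. The names (struct visited, was_visited, mark_visited, resolve_href) are made up for this example and are not part of the assignment. The resolver is deliberately incomplete: it does not collapse "." or ".." segments, and it treats any href containing "://" that is not http as a foreign transport.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical visited-URL set; a linked list is enough for a sketch. */
struct visited {
    char *url;
    struct visited *next;
};

/* Returns 1 if url was already recorded, 0 otherwise. */
int was_visited(const struct visited *head, const char *url)
{
    for (; head != NULL; head = head->next)
        if (strcmp(head->url, url) == 0)
            return 1;
    return 0;
}

/* Records url at the front of the list and returns the new head.
   On allocation failure the old head is returned unchanged. */
struct visited *mark_visited(struct visited *head, const char *url)
{
    struct visited *node = malloc(sizeof *node);
    if (node == NULL)
        return head;
    node->url = malloc(strlen(url) + 1);
    if (node->url == NULL) {
        free(node);
        return head;
    }
    strcpy(node->url, url);
    node->next = head;
    return node;
}

/* Resolves href against the absolute http URL base and writes the
   result to out. Returns 0 on success, -1 if the link is not http
   or the result does not fit. Simplified: no "." or ".." handling. */
int resolve_href(const char *base, const char *href, char *out, size_t outlen)
{
    const char *path, *last_slash;
    int n;

    if (strncmp(href, "http://", 7) == 0) {
        n = snprintf(out, outlen, "%s", href);        /* already absolute */
        return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
    }
    if (strstr(href, "://") != NULL || strncmp(href, "mailto:", 7) == 0)
        return -1;                                    /* other transport */
    if (strncmp(base, "http://", 7) != 0)
        return -1;

    path = strchr(base + 7, '/');     /* start of the path in base, if any */
    if (href[0] == '/') {
        /* Host-relative: keep only "http://host" from the base. */
        size_t hostlen = path ? (size_t)(path - base) : strlen(base);
        n = snprintf(out, outlen, "%.*s%s", (int)hostlen, base, href);
    } else if (path == NULL) {
        /* Base is just "http://host". */
        n = snprintf(out, outlen, "%s/%s", base, href);
    } else {
        /* Document-relative: keep the base up to its last '/'. */
        last_slash = strrchr(path, '/');
        n = snprintf(out, outlen, "%.*s%s",
                     (int)(last_slash - base + 1), base, href);
    }
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

A crawler built this way would resolve each href first, then call was_visited on the absolute form before following it, so that two different relative spellings of the same page are recognized as one.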
Both modes should utilize the command line arguments listed below
(see "man 3 getopt" for info on how to parse command line options).
In the descriptions below, the text before the colon (:) is the option
text, the text after the colon is the description. User specified
arguments are enclosed in angle brackets.
- --depth <depth>: indicates the maximum depth of link
following (default to 3). The initial URL specified should be
considered "depth 0".
- --ignore-file <ignore-file>: the name of a file which contains POSIX
format regular expressions (see "info regex") for URLs which
should not be followed (used to avoid CGI scripts, ASPs,
etc.). (default to no ignored URLs)
- --help: print a useful help message
The general format of the command should be
webbeast [options] <start-url>
where start-url is any valid http URL. Examples of running this
command include:
webbeast --depth 10 --ignore-file ignores.txt --suck <start-url>
webbeast --depth 1 --verify <start-url>
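Since the options are long ("--depth" rather than "-d"), getopt_long is the natural fit; it is described on the same manual page as getopt on most systems. The following is only a sketch of how the parsing might look. The struct options type and the parse_options function are invented for this example, and the upper bound on the depth is an arbitrary sanity limit, not a requirement.

```c
#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical container for the parsed command line. */
struct options {
    int depth;                /* --depth, default 3 */
    const char *ignore_file;  /* --ignore-file, default NULL (none) */
    int suck;                 /* 1 for --suck, 0 for --verify (default) */
    const char *start_url;    /* the single non-option argument */
};

/* Returns 0 on success, 1 if --help was given, -1 on a usage error. */
int parse_options(int argc, char **argv, struct options *opts)
{
    static const struct option long_opts[] = {
        {"depth",       required_argument, NULL, 'd'},
        {"ignore-file", required_argument, NULL, 'i'},
        {"verify",      no_argument,       NULL, 'v'},
        {"suck",        no_argument,       NULL, 's'},
        {"help",        no_argument,       NULL, 'h'},
        {NULL, 0, NULL, 0}
    };
    int c, saw_verify = 0, saw_suck = 0;

    opts->depth = 3;
    opts->ignore_file = NULL;
    opts->suck = 0;
    opts->start_url = NULL;

    optind = 1;   /* restart scanning so the function can be called again */
    while ((c = getopt_long(argc, argv, "", long_opts, NULL)) != -1) {
        switch (c) {
        case 'd': {
            char *end;
            long v = strtol(optarg, &end, 10);
            if (end == optarg || *end != '\0' || v < 0 || v > 1000)
                return -1;    /* not a sensible depth */
            opts->depth = (int)v;
            break;
        }
        case 'i': opts->ignore_file = optarg; break;
        case 'v': saw_verify = 1; break;
        case 's': saw_suck = 1; break;
        case 'h': return 1;
        default:  return -1;  /* getopt_long already printed a message */
        }
    }
    if (saw_verify && saw_suck)
        return -1;            /* the two modes are mutually exclusive */
    opts->suck = saw_suck;

    if (optind != argc - 1)
        return -1;            /* exactly one start URL is expected */
    opts->start_url = argv[optind];
    return 0;
}
```

A main() built on this would print the help text when the return value is 1 and a short usage line when it is -1.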
The verifier
The verifier is activated by the --verify option. It should produce
output on stdout indicating all of the links which were bad and the
page on which they occurred or the full path leading to the bad link
(try to format this nicely). Links followed should be any of those
available from anchor tags <a href="...">...</a>
for which the transport specified is "http" and which do not
match a regular expression in the ignore file, if one was specified.
Do not try to follow "ftp" or other types of links.
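To make the anchor-tag requirement concrete, here is one possible sketch of a scanner that pulls href values out of a buffer of HTML. The function name next_href is hypothetical. It is not a real HTML parser: it ignores comments and scripts, and it assumes no '>' appears before the href attribute inside the tag.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Finds the next <a ... href=...> at or after html and copies the href
   value (truncated if necessary) into out. Returns a pointer just past
   the value so the caller can continue scanning, or NULL when no more
   anchors are found. */
const char *next_href(const char *html, char *out, size_t outlen)
{
    const char *p = html;

    if (outlen == 0)
        return NULL;
    while ((p = strchr(p, '<')) != NULL) {
        const char *end;

        p++;
        /* Only tags named "a" (any case) followed by whitespace. */
        if ((p[0] != 'a' && p[0] != 'A') || !isspace((unsigned char)p[1]))
            continue;
        end = strchr(p, '>');
        if (end == NULL)
            return NULL;              /* unterminated tag */

        for (p += 2; p + 4 <= end; p++) {
            const char *v;
            char quote;
            size_t len = 0;

            if (strncasecmp(p, "href", 4) != 0 ||
                !isspace((unsigned char)p[-1]))
                continue;
            v = p + 4;
            while (v < end && isspace((unsigned char)*v))
                v++;
            if (v >= end || *v != '=')
                continue;             /* "href" was not an attribute name */
            v++;
            while (v < end && isspace((unsigned char)*v))
                v++;
            quote = (*v == '"' || *v == '\'') ? *v++ : '\0';

            /* Quoted values end at the matching quote; unquoted values
               end at whitespace or '>'. */
            while (v[len] != '\0' &&
                   (quote ? v[len] != quote
                          : (!isspace((unsigned char)v[len]) && v[len] != '>')))
                len++;
            if (len >= outlen)
                len = outlen - 1;
            memcpy(out, v, len);
            out[len] = '\0';
            return v + len;
        }
        p = end;                      /* anchor without href; keep looking */
    }
    return NULL;
}
```

Calling next_href in a loop, feeding the returned pointer back in as the new starting point, walks through every anchor on the page. Each value would still need to be resolved against the page URL and checked for the http transport before it is followed.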
The sucker
The sucker is activated by the --suck option (the --verify and --suck
options are mutually exclusive and --verify is the default). It
should operate in a manner identical to the verifier but whenever a
link is followed, the contents of the link should be stored in a file.
A directory structure reflecting the tree structure of each web site
visited should be created and the files placed in each directory
should correspond to the html documents obtained at that location.
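One way to lay out the files is to use the host name as the top-level directory and the URL path beneath it. The sketch below shows that mapping under several assumptions of mine: the function names url_to_path and make_parent_dirs are invented, "index.html" is an arbitrary choice for URLs that name a directory, query strings and fragments are dropped, and any URL containing ".." is refused so that a page cannot write outside the tree.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Maps an absolute http URL to a relative file path such as
   "www.example.com/docs/a.html". Returns 0 on success, -1 otherwise. */
int url_to_path(const char *url, char *out, size_t outlen)
{
    const char *rest;
    size_t len;
    int n;

    if (strncmp(url, "http://", 7) != 0)
        return -1;
    rest = url + 7;
    len = strcspn(rest, "?#");              /* drop query and fragment */
    if (len == 0 || rest[0] == '/' || strstr(rest, "..") != NULL)
        return -1;                          /* no host, or unsafe path */

    if (memchr(rest, '/', len) == NULL)     /* "http://host" */
        n = snprintf(out, outlen, "%.*s/index.html", (int)len, rest);
    else if (rest[len - 1] == '/')          /* "http://host/dir/" */
        n = snprintf(out, outlen, "%.*sindex.html", (int)len, rest);
    else
        n = snprintf(out, outlen, "%.*s", (int)len, rest);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/* Creates every directory leading up to the file named by path,
   like "mkdir -p" on the parent. Returns 0 on success, -1 on error. */
int make_parent_dirs(const char *path)
{
    char buf[1024];
    char *p;

    if (path[0] == '\0')
        return 0;
    if (strlen(path) >= sizeof buf)
        return -1;
    strcpy(buf, path);
    for (p = buf + 1; *p != '\0'; p++) {
        if (*p != '/')
            continue;
        *p = '\0';                          /* temporarily cut the path */
        if (mkdir(buf, 0777) != 0 && errno != EEXIST)
            return -1;
        *p = '/';
    }
    return 0;
}
```

With this layout the sucker would call url_to_path, then make_parent_dirs on the result, then fopen the path for writing and copy the response body into it.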
Requirements
- Your program must make use of one of the three sockets support
libraries discussed in class or one with suitably similar
functionality.
- Your program may not make use of an "HTTP" support library of
any kind (unless you write it yourself, of course).
- Your program must compile and perform all of the duties
described above. Failure to produce a running program will
result in a zero on this project. It is better to hand in a
partial assignment than one that does not work at all.
- Your program must compile by typing "make" in its home directory.
Items to hand in/e-mail
- You must e-mail a ".tar.gz" version of your project
directory in uuencode or base64 encoding in a MIME formatted
message to:
DavidShaffer@psu.edu.
- You must hand in a printed version of all of your source code
(you may exclude printouts of the libraries as long as you used
one of those mentioned in class)
Resources
Learning about the HTTP protocol
- First, I'd suggest that you take my DoNothingServer and
run it (it will listen on port 8080 by default), then hit it with
a web browser to see how a web browser requests
pages from an HTTP server. The DoNothingServer will echo all
of the requests on the console so you can see how they are
formatted. Unfortunately it doesn't respond so you will have to
kill it (ctrl-C) to free your web browser.
- Second you should look at a web server from the client end. Try
telneting to a web server and sending the commands that you
learned above. What does the response look like? What if you
request an image?
- Since the methods above are generally considered bad practice, you
should consider reading the HTTP RFC (Request For
Comments) which is the de-facto standard for the protocol
(yawn). Keep in mind that in its simplest version your program
only has to support .html files (mime type text/html) although
if you structure your program correctly the mime type shouldn't
make much of a difference.
- There are many abbreviated descriptions of this protocol
available on the web.
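Once you have seen a request and a response on the wire, the text handling is small. Below is a minimal sketch, assuming an HTTP/1.0-style exchange, of building a GET request and reading the status code from the first line of the response. The function names are hypothetical, and treating every code of 400 or above as a "bad" link is my own simplification; you may want to handle redirects (3xx) separately.

```c
#include <stdio.h>

/* Writes a minimal GET request for path on host into buf.
   Returns the request length, or -1 if buf is too small. */
int build_get_request(const char *host, const char *path,
                      char *buf, size_t buflen)
{
    int n = snprintf(buf, buflen,
                     "GET %s HTTP/1.0\r\n"
                     "Host: %s\r\n"
                     "\r\n",              /* blank line ends the headers */
                     path, host);
    return (n < 0 || (size_t)n >= buflen) ? -1 : n;
}

/* Extracts the numeric status code from a status line such as
   "HTTP/1.0 200 OK". Returns -1 if the line is not a status line. */
int parse_status_code(const char *response)
{
    int major, minor, code;

    if (sscanf(response, "HTTP/%d.%d %d", &major, &minor, &code) != 3)
        return -1;
    return code;
}

/* Simplified policy: unparseable responses and 4xx/5xx codes are bad. */
int link_is_bad(int status_code)
{
    return status_code < 0 || status_code >= 400;
}
```

The request buffer is what you would write to the connected socket; the first line read back from the socket is what you would hand to parse_status_code.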
Learning about HTML
Advice
This is an ambitious project. You will receive most of the credit for
getting the "verifier" done correctly. I suggest that you
build the project in "modules" where each module is a
separate .c file:
- webifier.c -- main(), parsing the command line options etc
- ignore_file.c -- functions to manage the ignore file
- http.c -- functions to deal with the http protocol
- html.c -- functions to parse an html file and pull out the
anchor tags
Develop and test each module separately. I would be happy to help
guide you in selecting the functions and structures which belong in
each module. Do not try to implement the ignore file support right
away since learning POSIX regular expressions takes a little time.
Simply place stub function calls in the appropriate locations such
as
void parse_ignore_file(char *fileName); /* parse the contents of the
    ignore file for later use by should_ignore */
int should_ignore(char *url); /* returns 1 if a URL should not be
    followed and 0 if it should */
Then you can add the actual versions of these functions once the
networking part of your project is coming closer to completion.
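When you do get to the real versions, the POSIX regcomp and regexec calls do most of the work. The following is only a sketch of one possible implementation, with assumptions of my own: one pattern per line in the ignore file, extended (rather than basic) regular expression syntax, a fixed limit of 64 patterns, and a helper named add_ignore_pattern that is not part of the interface above.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

#define MAX_PATTERNS 64   /* arbitrary limit for this sketch */

static regex_t patterns[MAX_PATTERNS];
static int pattern_count = 0;

/* Hypothetical helper: compiles one pattern and stores it.
   Returns 0 on success, -1 if it is invalid or the table is full. */
int add_ignore_pattern(const char *pattern)
{
    if (pattern_count >= MAX_PATTERNS)
        return -1;
    if (regcomp(&patterns[pattern_count], pattern,
                REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    pattern_count++;
    return 0;
}

/* Reads the ignore file, one regular expression per line. */
void parse_ignore_file(char *fileName)
{
    char line[512];
    FILE *fp = fopen(fileName, "r");

    if (fp == NULL) {
        perror(fileName);
        return;
    }
    while (fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';   /* strip the line ending */
        if (line[0] == '\0')
            continue;                         /* skip blank lines */
        if (add_ignore_pattern(line) != 0)
            fprintf(stderr, "skipping bad pattern: %s\n", line);
    }
    fclose(fp);
}

/* Returns 1 if url matches any stored pattern, 0 otherwise. */
int should_ignore(char *url)
{
    int i;

    for (i = 0; i < pattern_count; i++)
        if (regexec(&patterns[i], url, 0, NULL, 0) == 0)
            return 1;
    return 0;
}
```

An ignore file containing the two lines /cgi-bin/ and \.asp(\?|$) would then skip CGI scripts and ASP pages while leaving ordinary .html links alone.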
This is a long project, so START TODAY.
Possible advanced extensions
Add support for inline image (<img> tags) downloading in the
sucker and image link checking in the verifier.
Add support for publishing which can upload an entire directory
structure to a web server via the http "PUT" command. This
might be a separate program but it could certainly benefit from the
libraries you developed in this project. Pitfall: You will have to
find a server on which you have publishing privileges.