Cobweb client/server design

Cobweb is a web crawler, for those of you who didn't know.

The operation of this package is designed around running the server (cobweb) process in a console-like state: you basically run it and never touch it again; it's just looked at for status and similar things.
If you need to change anything in the server, you do it via the client (bunny). If you need the server to quit, you send it a quit message via the bunny; if you need to load a new module, you do it via the bunny using loadmod, etc. The command-line nature allows you to do whatever you wish to the server, even remotely (reasons explained later).

Typically you run cobweb, then run bunnies to load all your modules, then run a bunny with a URL as its first argument; this starts the crawling. If you run 2 bunnies, you will clean out the queue twice as fast, and add to it twice as fast (theoretically). There is a catch, though: bunnies get lagged by DNS lookups, routing problems, and similar IP slow-downs, so if you run a few or so, you'll always have some that are crawling and some that are waiting for connect()s or gethostbyaddr()s to time out.
What's neat is that you can run, say, 30 bunnies on a really fast net connection and the cobweb server over a slower link: the data is kept on the slower link (possibly local), while the major downloading and speed-intensive work is done remotely at the fast location. Since very little data moves between cobweb and the bunnies, the slower link doesn't hurt you as much, although a faster link is always better, of course.

I'll talk more about this later, and maybe organize this rant. The stuff below is how it all works internally, details left out.


Bunny (client)

  1. Bunny runs with a URL as argv[1], or in a more minor mode with other args:
       show              Sends SHOWDATA to cobweb
       quit              Sends QUIT to cobweb
       modstat           Sends MODSTAT to cobweb
       log filename      Sends LOGDATA to cobweb
       read filename     Sends READDATA to cobweb
       loadmod filename  Sends NEWMOD to cobweb
     If one of the choices above is specified, it sends the appropriate packet and then exits. Otherwise it assumes argv[1] is a URL and continues with this list.
  2. Bunny parses the URL from argv[1] and adds it to cobweb's queue (ENQUEUE).
  3. Bunny requests a parsed URL from cobweb's queue (DEQUEUE); if no URL is available, the bunny commits suicide.
  4. Bunny tells cobweb to mark the URL it received in the previous step as visited (STOREURL).
  5. Bunny grabs the web page and parses it; for every link it runs into, it parses the URL and adds it to the cobweb queue (ENQUEUE).
  6. Goto step 2.

Cobweb (server)

  1. Cobweb runs with no args.
  2. Cobweb listens for connections and grabs packets, only scanning a packet once it has been fully received.
  3. Cobweb does its packet-specific work, then goes back to step 2.

Packets

MODSTAT
Prints the modules loaded and closes connection.

NEWMOD
The packet data is the module name as an ASCIIZ string. Loads the new module after making sure it isn't already there: the module MODULENAME.so is loaded from the current directory cobweb is running from, then the function _MODULENAME() is run with 0 for all params as a test to make sure it runs; you should probably do some diagnostic print there and watch for it. Whether everything succeeds or something fails, it closes the connection.

MODINFO
This is information returned to the main cobweb process from the child processes that run modules. After the module's results are output on the server, it closes the connection.

SHOWDATA
Prints the visited sites (STOREURL) and the directories for each site as well as the entries in the queue. After completion, it closes connection.

READDATA
Reads in a log file as saved by LOGDATA, essentially emulating that the sites in the log were already hit. The number of directories is set to 1, "/". After the read, it closes the connection.

LOGDATA
Logs the visited sites (STOREURL) and the module information for each site to the specified file. Data is appended to the file. After completion, it closes the connection.

DEQUEUE
If there is nothing in the queue, it returns a packet of 0 length. Otherwise it dequeues and runs a couple of tests on the URL:
  • "Probability"
    Basically a back-off: after X hits to that server, only hit it 1/2 the time; after Y hits, only 1/3rd; after Z hits, 1/4th; etc.
  • "Already hit"
    Check whether the ip/port/dir has already been STOREURL'd. Don't want to run in circles.
If any test fails, it starts this step over; otherwise it sends back the URL and does not close the connection.

ENQUEUE
It first runs a couple of tests on the URL it receives:
  • "Bad extension"
    A substring search of the dir; if the dir includes the string, it's an invalid URL for queuing. Uses include filtering out certain file types such as images and movies, or skipping CGI-BINs and user home directories.
  • "Already hit"
    Check whether the ip/port/dir has already been STOREURL'd. Don't want to run in circles.
If any test fails, the URL is dropped and cobweb moves on; if all pass, the URL is queued. Neither case closes the connection.

STOREURL
If the ip/port combination has already been hit, it adds the dir of the URL to that ip/port combo's directory list and increments the count of URLs hit on that site. If it hasn't been hit, it adds the ip/port as a new site, storing the dir of the URL under it; the URL count is set to 1 and the module information is cleared. Modules are then executed on that site in order (see Modules). The connection is not closed.

QUIT
A log is written to a file called cobweb.log using the same method as LOGDATA (though the connection is not closed yet); any existing log is overwritten. Then cobweb exits.

Modules

When a new site is visited (STOREURL), all the modules are executed on that site. First, tests are done to see whether the probe should happen at all. Then, for each module, cobweb forks, runs the module, and returns the response to its parent (the main cobweb process) the same way bunnies do (via another connection), with a MODINFO packet.


Problems/Bugs/Issues

not done yet


DEQUEUE -> format:
	2 bytes -- 3 
	1 byte --  DEQUEUE

DEQUEUE <- format:
	2 bytes -- 10 + strlen(hostname) + strlen(filename)
	4 bytes -- addr
	2 bytes -- port
	X bytes -- hostname + NULL
	X bytes -- filename + NULL

ENQUEUE -> format:
	2 bytes -- 11 + strlen(hostname) + strlen(filename)
	1 byte --  ENQUEUE
	4 bytes -- addr
	2 bytes -- port
	X bytes -- hostname + NULL
	X bytes -- filename + NULL

STOREURL -> format:
	2 bytes -- 11 + strlen(hostname) + strlen(filename)
	1 byte --  STOREURL
	4 bytes -- addr
	2 bytes -- port
	X bytes -- hostname + NULL
	X bytes -- filename + NULL

LOGDATA -> format:
	2 bytes -- 4 + strlen(filename)
	1 byte --  LOGDATA
	X bytes -- filename + NULL

MODSTAT -> format:
	2 bytes -- 3
	1 byte --  MODSTAT

SHOWDATA -> format:
	2 bytes -- 3
	1 byte --  SHOWDATA

READDATA -> format:
	2 bytes -- 4 + strlen(filename)
	1 byte --  READDATA
	X bytes -- filename + NULL

NEWMOD -> format:
	2 bytes -- 4 + strlen(modname)
	1 byte --  NEWMOD
	X bytes -- modname + NULL

QUIT -> format:
	2 bytes -- 3 
	1 byte --  QUIT

The bunny does the name lookup, since bunnies can do multiple name lookups at once if you run multiple bunnies.