Cobweb client/server design

Cobweb is a web crawler, for those of you that didn't know.

This package is designed so that the server (cobweb) process runs in a console-like state: you basically start it and never touch it again. It's just looked at for status and similar things.

If you need to change anything in the server, you do it via the client (bunny). If you need the server to quit, you send it a quit message via the bunny; if you need to load a new module, you do it via the bunny using loadmod, etc. The command-line nature allows you to do whatever you wish to the server, even remotely (reasons explained later).

Typically you run cobweb, then run bunnies to load all your modules, then run a bunny with a URL as its first argument. This starts the crawling. If you run 2 bunnies, you will clean out the queue twice as fast, as well as add to it twice as fast (theoretically). There is a catch, though: bunnies get lagged by DNS lookups, routing problems, and similar IP slow-downs, so if you run several, you'll always have a few that are crawling and a few that are waiting for connect()s or gethostbyaddr()s to time out.

What's neat is that you can run like 30 bunnies on a really fast net connection and the cobweb server over a slower link. The data is kept on the slower link (possibly local), while the major downloading and speed-intensive stuff is done remotely at a fast location. Since very little data is transmitted between cobweb and the bunnies, the slower link doesn't hurt you much, although a faster link is always better, of course.

I'll talk more later on about this, and maybe organize this rant. The stuff below is how it all works internally, details left out.

Bunny (client)

Bunny runs with a URL as argv[1], or in a minor mode with other args:

show
	Sends SHOWDATA to cobweb
quit
	Sends QUIT to cobweb
modstat
	Sends MODSTAT to cobweb
log filename
	Sends LOGDATA to cobweb
read filename
	Sends READDATA to cobweb
loadmod filename
	Sends NEWMOD to cobweb
If one of the choices above is specified, bunny sends the appropriate packet and then exits. Otherwise it assumes argv[1] is a URL and continues with this list:
1. Bunny parses the URL from argv[1] and adds it to the cobweb queue.
2. Bunny requests a parsed URL from the cobweb queue (DEQUEUE). If no URL is available, bunny exits.
3. Bunny tells cobweb to mark the URL it received in the previous step as visited (STOREURL).
4. Bunny grabs the web page and parses it. For every link it runs into, it parses the URL and adds it to the cobweb queue.
5. Goto step 2.
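
The crawl loop above can be sketched roughly as follows. This is an in-memory illustration only, not the real bunny: `FakeCobweb`, `run_bunny`, `fetch_links`, and the `(host, port, path)` URL tuples are hypothetical stand-ins for the network protocol and the HTML parser.

```python
from collections import deque


class FakeCobweb:
    """Hypothetical in-memory stand-in for the cobweb server."""

    def __init__(self):
        self.queue = deque()
        self.visited = set()

    def enqueue(self, url):
        # ENQUEUE: skip URLs that were already stored as visited.
        if url not in self.visited:
            self.queue.append(url)

    def dequeue(self):
        # DEQUEUE: None plays the role of the zero-length reply.
        while self.queue:
            url = self.queue.popleft()
            if url not in self.visited:
                return url
        return None

    def storeurl(self, url):
        # STOREURL: mark the URL as visited.
        self.visited.add(url)


def run_bunny(cobweb, start_url, fetch_links):
    """Run one bunny until the queue is empty; return how many pages it crawled."""
    cobweb.enqueue(start_url)          # add the URL from argv[1]
    crawled = 0
    while True:
        url = cobweb.dequeue()         # ask cobweb for the next URL
        if url is None:
            return crawled             # nothing left: the bunny exits
        cobweb.storeurl(url)           # mark it visited before fetching
        for link in fetch_links(url):  # grab the page and queue its links
            cobweb.enqueue(link)
        crawled += 1
```

Several bunnies sharing one cobweb would each run this same loop, which is why adding bunnies drains (and fills) the queue faster.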

Cobweb (server)

1. Cobweb runs with no args.
2. Cobweb listens for connections and grabs packets. It only scans a packet once it's fully received.
3. Cobweb does its packet-specific portion, and then goes to step 2.
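
The "only scan fully received packets" rule could look something like the sketch below. It assumes every request starts with a 2-byte length that counts the whole packet, length field included (which is what the format listings at the end suggest), and that the length is in network byte order, which is a guess. `split_packets` is a hypothetical helper name.

```python
import struct


def split_packets(buffer):
    """Split a receive buffer into complete packets and the unread remainder.

    Assumes each packet starts with a 2-byte big-endian length that counts
    the whole packet, length field included. The byte order is an assumption.
    """
    packets = []
    while len(buffer) >= 2:
        (length,) = struct.unpack("!H", buffer[:2])
        if length < 2:
            raise ValueError("length field smaller than the header itself")
        if len(buffer) < length:
            break                          # packet not fully received yet
        packets.append(buffer[2:length])   # type byte plus data, header stripped
        buffer = buffer[length:]
    return packets, buffer
```

The server would append whatever recv() returns to the per-connection buffer, call this, handle the complete packets, and keep the remainder for the next read.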

Packets

MODSTAT

Prints the loaded modules and closes the connection.

NEWMOD

The packet data is the ASCIIZ module name. Cobweb loads the module with that name after making sure it's not already loaded; the file is MODULENAME.so in the directory cobweb is running from. It then runs the function _MODULENAME() with 0 for all params as a test to make sure it runs; you should probably do some diagnostic print there and watch for it. Once everything succeeds or something fails, it closes the connection.

MODINFO

This is information returned to the main cobweb process from child processes that are running modules. After printing the module's results on the server, it closes the connection.

SHOWDATA

Prints the visited sites (STOREURL) and the directories for each site, as well as the entries in the queue. After completion, it closes the connection.

READDATA

Reads in a log file as saved by LOGDATA, essentially pretending that the sites in the log were already hit. The number of directories for each site is set to 1, "/". After reading, it closes the connection.

LOGDATA

Logs the visited sites (STOREURL) and the module information for each site to the specified file. Data is appended to the file. After completion, it closes the connection.

DEQUEUE

If there is nothing in the queue, it returns a packet of 0 length. If there is, it dequeues a URL and runs a couple of tests on it:

"Probability"
	Basically a back-off: after X hits to that server, only hit it 1/2 the time; after Y hits, only 1/3 of the time; after Z hits, 1/4; etc.
"Already hit"
	Check to see if the ip/port/dir has already been STOREURL'd. Don't want to run in circles.

If any test fails, it starts this step over; otherwise it sends back the URL and does not close the connection.
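
A rough sketch of the two DEQUEUE tests follows. The thresholds standing in for X, Y, and Z are made up (the real values aren't given here), and `hit_probability`, `dequeue`, and the tuple/dict layout are hypothetical. The sketch also assumes that a URL which fails a test is simply dropped.

```python
import random

# Hypothetical thresholds standing in for X, Y, Z; the real values are not given.
BACKOFF_THRESHOLDS = (10, 50, 100)


def hit_probability(hits, thresholds=BACKOFF_THRESHOLDS):
    """Return the chance of accepting another URL for a site with `hits` hits."""
    passed = sum(1 for t in thresholds if hits >= t)
    return 1.0 / (passed + 1)


def dequeue(queue, visited, hit_counts, rng=random.random):
    """Pop URLs until one passes both tests; return None if the queue runs dry.

    `queue` is a list of (ip, port, dir) tuples, `visited` a set of the same,
    and `hit_counts` maps (ip, port) to the number of URLs stored for it.
    """
    while queue:
        url = queue.pop(0)
        ip, port, _dir = url
        if url in visited:
            continue                       # "Already hit"
        if rng() >= hit_probability(hit_counts.get((ip, port), 0)):
            continue                       # "Probability" back-off
        return url
    return None
```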

ENQUEUE

It first runs a couple of tests on the URL it receives:

"Bad extension"
	A substring search of the dir. If the dir includes one of the strings, it's an invalid URL for queuing. Uses include detecting certain types of files, such as images and movies, or skipping CGI-BINs and user home directories.
"Already hit"
	Check to see if the ip/port/dir has already been STOREURL'd. Don't want to run in circles.

If any test fails, the URL is not queued; if all succeed, it is. In both cases cobweb carries on without closing the connection.
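
The ENQUEUE tests amount to a substring filter plus a visited check. A minimal sketch, with a made-up substring list (the real list isn't given in this document) and a hypothetical `should_enqueue` helper:

```python
# Hypothetical substrings; the real list is not given in this document.
BAD_SUBSTRINGS = (".gif", ".jpg", ".mpg", "/cgi-bin/", "/~")


def should_enqueue(url, visited, bad_substrings=BAD_SUBSTRINGS):
    """Return True if an (ip, port, dir) URL passes both ENQUEUE tests."""
    _ip, _port, directory = url
    if any(s in directory for s in bad_substrings):
        return False          # "Bad extension"
    if url in visited:
        return False          # "Already hit"
    return True
```

Because this is a plain substring search, a pattern like ".gif" would also reject a directory named "/my.gifts/"; that may or may not matter for the real list.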

STOREURL

If the ip/port combination has already been hit, it adds the dir of the URL to that ip/port combo's directory list and increments the count of URLs it has hit on that site. If it hasn't been hit, it adds the ip/port as a new site, storing the dir of the URL under it. The URL count is set to 1, and the module information is cleared. Modules are then executed on that site in order (see Modules). The connection is not closed.
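
The site bookkeeping described here might be modelled like this. It's only a sketch: `store_url`, the dict layout, and the field names `dirs`, `count`, and `modinfo` are all hypothetical.

```python
def store_url(sites, ip, port, directory):
    """Record a visited URL; return True if this is a new site.

    `sites` maps (ip, port) to a dict with the site's directory list, URL
    count, and module information. The field names are hypothetical.
    """
    key = (ip, port)
    site = sites.get(key)
    if site is None:
        # New site: count starts at 1 and the module information is empty.
        sites[key] = {"dirs": [directory], "count": 1, "modinfo": {}}
        return True       # the caller would run the modules on the site here
    site["dirs"].append(directory)
    site["count"] += 1
    return False
```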

QUIT

A log is written to a file called cobweb.log using the same method as in LOGDATA (except that the connection is not closed yet). An existing log is overwritten. Then cobweb exits.

Modules

When a new site is visited (STOREURL), all the modules are executed on that site. First, tests are done to decide whether to probe the site at all.

"Bad probe site"
	The hostname is tested against a list of substrings. If it contains any, the test fails. This is good for not probing sites that know about you ;-)

Then, for each module, cobweb forks, runs the module, and returns the response to its parent (the main cobweb process) the same way bunnies talk to it (via another connection), with a MODINFO.

Problems/Bugs/Issues

Since cobweb does the module stuff, the probes are done from the slower link if you choose to run the server on a slow link and the clients on a fast one.
Module response is limited to 1 byte per module right now.
DNS lookups are repeated over and over, even if you hit the same site over and over. Results should be cached locally or something, possibly even shared between bunnies.
READDATA doesn't support directories, only ips/ports.
No cobweb config file to automatically load modules on startup.
The module calling interface is weak; it only passes the ip and port of the web server. A module should be given access to read the entire structure for its site, including the module data, so stacking of modules is possible.
Although cobweb does back off on servers, it still lets bunnies pound them to death in the beginning. It would be better if that could be spread out through the queue somehow. Pounding a server should only be done if that server is one of the only things left in the queue.

not done yet


DEQUEUE -> format:
	2 bytes -- 3 
	1 byte --  DEQUEUE

DEQUEUE <- format:
	2 bytes -- 10 + strlen(hostname) + strlen(filename)
	4 bytes -- addr
	2 bytes -- port
	X bytes -- hostname + NULL
	X bytes -- filename + NULL

ENQUEUE -> format:
	2 bytes -- 11 + strlen(hostname) + strlen(filename)
	1 byte --  ENQUEUE
	4 bytes -- addr
	2 bytes -- port
	X bytes -- hostname + NULL
	X bytes -- filename + NULL

STOREURL -> format:
	2 bytes -- 11 + strlen(hostname) + strlen(filename)
	1 byte --  STOREURL
	4 bytes -- addr
	2 bytes -- port
	X bytes -- hostname + NULL
	X bytes -- filename + NULL

LOGDATA -> format:
	2 bytes -- 4 + strlen(filename)
	1 byte --  LOGDATA
	X bytes -- filename + NULL

MODSTAT -> format:
	2 bytes -- 3
	1 byte --  MODSTAT

SHOWDATA -> format:
	2 bytes -- 3
	1 byte --  SHOWDATA

READDATA -> format:
	2 bytes -- 4 + strlen(filename)
	1 byte --  READDATA
	X bytes -- filename + NULL

NEWMOD -> format:
	2 bytes -- 4 + strlen(modname)
	1 byte --  NEWMOD
	X bytes -- modname + NULL

QUIT -> format:
	2 bytes -- 3 
	1 byte --  QUIT
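
To make the layouts above concrete, here is a sketch that builds and parses an ENQUEUE/STOREURL-style packet. Two things are assumptions, not facts from these notes: the multi-byte fields are taken to be in network byte order, and the numeric type code `ENQUEUE = 1` is made up. `pack_url_packet` and `unpack_url_packet` are hypothetical helper names.

```python
import socket
import struct

# Hypothetical type code; the real value of ENQUEUE is not given here.
ENQUEUE = 1


def pack_url_packet(ptype, addr, port, hostname, filename):
    """Build an ENQUEUE- or STOREURL-style packet (network byte order assumed)."""
    body = (
        struct.pack("!B4sH", ptype, socket.inet_aton(addr), port)
        + hostname.encode("ascii") + b"\0"
        + filename.encode("ascii") + b"\0"
    )
    # The length field counts itself, so the total is 11 + both string lengths.
    return struct.pack("!H", 2 + len(body)) + body


def unpack_url_packet(packet):
    """Parse a packet built by pack_url_packet; return its fields as a tuple."""
    length, ptype, raw_addr, port = struct.unpack("!HB4sH", packet[:9])
    if length != len(packet):
        raise ValueError("length field does not match packet size")
    hostname, filename, _rest = packet[9:].split(b"\0", 2)
    return (ptype, socket.inet_ntoa(raw_addr), port,
            hostname.decode("ascii"), filename.decode("ascii"))
```

The filename-only packets (LOGDATA, READDATA, NEWMOD) would follow the same pattern with just the type byte and one NUL-terminated string.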

bunny does the name lookup since bunnies can do multiple name
lookups at once if you run multiple bunnies.

add a new packet type that does a STOREURL and DEQUEUE at the same time
no need to dequeue the hostname, ip is fine
specify a flag in the STOREURL which says whether the site has already been hit.. easy to do by checking in the url parser whether the url is relative or absolute.
possibly add a DNS lookup check between cobweb and bunny.. before the hostname is looked up, bunny can ask cobweb for the hostname's ip; if cobweb has it, it gives it up. it can also return stuff such as an error saying I've hit this site too many times, skip it and take this one from the queue instead (while returning one from the queue)
take advantage of HTTP/1.1 keep-alive