What's neat is that you can run like 30 bunnies on a really fast
net connection, and the cobweb server over a slower link, so the data is
kept on the slower link (possibly local) while the major downloading
and speed-intensive stuff is done remotely at the fast location. Since
there is very little data being transmitted between cobweb and the bunnies,
the slower link doesn't hurt you as much, although a faster link is always
better, of course.
I'll talk more about this later, and maybe organize this rant.
The stuff below is how it all works internally, with some details left out.
Bunny (client)
- Bunny runs with a URL as argv[1], or in a more minor mode
with one of these other args:
- show
- Sends SHOWDATA to cobweb
- quit
- Sends QUIT to cobweb
- modstat
- Sends MODSTAT to cobweb
- log filename
- Sends LOGDATA to cobweb
- read filename
- Sends READDATA to cobweb
- loadmod filename
- Sends NEWMOD to cobweb
If one of the choices above is specified, it sends the appropriate
packet and then exits. Otherwise it assumes argv[1]
is a URL and continues with this list.
- Bunny adds the URL from argv[1] to cobweb (ENQUEUE), after parsing it.
- Bunny requests a parsed URL from cobweb's queue (DEQUEUE);
if no URL is available, bunny commits suicide.
- Bunny tells cobweb to mark the URL it received in the
previous step as visited (STOREURL).
- Bunny grabs the web page and parses it; for every link it runs
into, it adds the link to the cobweb queue (ENQUEUE), after parsing
the URL.
- Goto step 2 (a sketch of this loop is below).
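Here's a minimal sketch of that loop in C. The parse/fetch and
cobweb_*() helpers are hypothetical wrappers around the packet formats
listed near the end of this document, not functions from the actual
source; they're declared here just to show the control flow.

    /* bunny loop sketch -- helper names are illustrative only */
    struct url { char host[256]; int port; char file[1024]; };

    int  parse_url(const char *s, struct url *u);     /* split host/port/file */
    int  cobweb_dequeue(struct url *u);               /* DEQUEUE; 0 if queue empty */
    void cobweb_storeurl(const struct url *u);        /* STOREURL */
    void cobweb_enqueue(const struct url *u);         /* ENQUEUE */
    const char *fetch_next_link(const struct url *u); /* GET page, yield links */

    int main(int argc, char **argv)
    {
        struct url u, link;
        const char *l;

        if (argc != 2)
            return 1;
        parse_url(argv[1], &u);
        cobweb_enqueue(&u);              /* step 1: add argv[1]'s URL */

        for (;;) {
            if (!cobweb_dequeue(&u))     /* step 2: queue empty, so */
                return 0;                /* bunny commits suicide */
            cobweb_storeurl(&u);         /* step 3: mark as visited */
            while ((l = fetch_next_link(&u)) != NULL) {
                parse_url(l, &link);     /* step 4: queue every link */
                cobweb_enqueue(&link);
            }
        }                                /* step 5: goto step 2 */
    }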
Cobweb (server)
- Cobweb runs with no args.
- Cobweb listens for connections and grabs packets, only scanning
a packet once it has been fully received.
- Cobweb does its packet-specific portion (see Packets), then goes
back to step 2.
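A minimal sketch of that receive step, assuming the 2-byte length
prefix shown in the packet formats near the end of this document (the
length counts itself; network byte order is my assumption, the
original doesn't say):

    #include <stdlib.h>
    #include <unistd.h>

    /* Read one length-prefixed packet from fd. Returns a malloc'd
       buffer holding the type byte plus data and sets *len, or NULL
       on EOF/error. Nothing is scanned until it's all here. */
    unsigned char *recv_packet(int fd, size_t *len)
    {
        unsigned char hdr[2], *buf;
        size_t got = 0, total;
        ssize_t n;

        while (got < 2) {                      /* the 2-byte length first */
            if ((n = read(fd, hdr + got, 2 - got)) <= 0)
                return NULL;
            got += n;
        }
        total = (hdr[0] << 8) | hdr[1];        /* assumed big-endian */
        if (total < 3)                         /* smallest packet is 3 */
            return NULL;
        *len = total - 2;                      /* length field counts itself */
        if ((buf = malloc(*len)) == NULL)
            return NULL;
        got = 0;
        while (got < *len) {                   /* wait for the whole packet */
            if ((n = read(fd, buf + got, *len - got)) <= 0) {
                free(buf);
                return NULL;
            }
            got += n;
        }
        return buf;
    }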
Packets
- MODSTAT
- Prints the modules loaded and closes connection.
- NEWMOD
- Data of the packet is the module name as an asciiz string.
Loads the new module with that name after making sure
it's not already there. Loads the module MODULENAME.so
from the current directory that cobweb is running in.
Then it runs the function _MODULENAME() with 0 for all
params as a test to make sure it runs; you should probably
do some diagnostic print there and watch for it (see the
dlopen() sketch after the Modules section). After it
all succeeds or something fails, it closes the connection.
- MODINFO
- This is information returned to the main cobweb process
from the child processes that are running modules. After
the module's results are output on the server, it
closes the connection.
- SHOWDATA
- Prints the visited sites (STOREURL) and the directories
for each site as well as the entries in the queue.
After completion, it closes connection.
- READDATA
- Reads in a log file as saved by LOGDATA, essentially
emulating that the sites in the log were already hit.
The number of directories is set to 1, "/". After the
read, it closes the connection.
- LOGDATA
- Logs the visited sites (STOREURL) and the module
information for each site to the file specified. Data is
appended to the file. After completion, it closes the
connection.
- DEQUEUE
- If there is nothing in the queue, it returns a packet of 0
length. If there is, it will dequeue a URL and run a couple
of tests on it (a rough sketch of these tests follows this
packet list):
- "Probability"
- Basically a back-off: after X number of hits to
that server, only hit it 1/2 the time; after Y hits,
only 1/3rd; after Z hits, 1/4th, etc.
- "Already hit"
- Check to see if the ip/port/dir has already been
STOREURL'd. Don't want to run in circles.
If any test fails, it starts this step over;
otherwise, it sends back the URL and does not close the connection.
- ENQUEUE
- It first runs a couple of tests on the URL it receives:
- "Bad extension"
- A substring search of the dir; if the dir includes the
string, it's an invalid URL for queuing. Uses include
detecting certain types of files such as images
and movies, or skipping CGI-BINs and user home
directories.
- "Already hit"
- Check to see if the ip/port/dir has already been
STOREURL'd. Don't want to run in circles.
If any test fails, the URL is dropped; if all succeed,
it is queued. In both cases it does not close the
connection.
- STOREURL
- If the ip/port combination has already been hit, it adds
the dir of the URL to that ip/port combo's directory
list and increments the count of URLs it has hit on that site.
If it hasn't been hit, it adds the ip/port as a new
site, storing the dir of the URL under it. The URL
count is set to 1, and the module information is cleared.
Modules are then executed on that site in order (see Modules).
The connection is not closed. (The site table is sketched
after this packet list.)
- QUIT
- A log is written using the same method as LOGDATA
(though the connection is not closed yet), to a file called
cobweb.log. Any existing log is overwritten. Then cobweb
exits.
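Here's a rough sketch of those DEQUEUE/ENQUEUE tests and the site
table STOREURL maintains. The struct layout, the names, the extension
list, and the back-off schedule are all guesses from the descriptions
above, not the real cobweb structures:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    struct site {
        uint32_t addr;            /* the ip/port combo is the key */
        uint16_t port;
        char   **dirs;            /* directory list for this site */
        int      ndirs;
        int      count;           /* URLs hit on this site */
        char     modinfo[32];     /* per-module results, 1 byte each */
        struct site *next;
    };

    /* "Already hit": has this ip/port/dir been STOREURL'd? */
    int already_hit(struct site *sites, uint32_t a, uint16_t p,
                    const char *dir)
    {
        struct site *s;
        int i;

        for (s = sites; s != NULL; s = s->next)
            if (s->addr == a && s->port == p)
                for (i = 0; i < s->ndirs; i++)
                    if (strcmp(s->dirs[i], dir) == 0)
                        return 1;
        return 0;
    }

    /* "Bad extension": substring search of the dir against a
       block list (the entries here are examples). */
    static const char *bad[] = { ".gif", ".jpg", ".mpg", "cgi-bin",
                                 "/~", NULL };

    int bad_extension(const char *dir)
    {
        int i;

        for (i = 0; bad[i] != NULL; i++)
            if (strstr(dir, bad[i]))
                return 1;
        return 0;
    }

    /* "Probability": back off as a site's hit count grows. One extra
       denominator every BACKOFF_STEP hits is a guessed schedule; the
       text only gives the 1/2, 1/3, 1/4 shape. */
    #define BACKOFF_STEP 50

    int should_hit(int count)
    {
        int denom = 1 + count / BACKOFF_STEP;  /* 1, 2, 3, ... */
        return rand() % denom == 0;            /* true 1/denom of the time */
    }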
Modules
When a new site is visited (STOREURL), all the modules
are executed on that site. First, a test is
done to see if it should do the probe or not:
- "Bad probe site"
- The hostname is tested against a list of
substrings. If it contains any of them, the test
fails. This is good for not probing sites
that know about you ;-)
Then, for each module, cobweb will fork, run the module,
and then return the response to its parent (the main
cobweb process) the same way bunnies do (via another
connection), with a MODINFO.
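Here's a minimal sketch of the NEWMOD loading convention with
dlopen(), assuming module MODNAME lives in ./MODNAME.so and exports
_MODNAME(), and borrowing the (ip, port) prototype from the Problems
section below; the names are illustrative (link with -ldl):

    #include <stdio.h>
    #include <dlfcn.h>

    typedef int (*modfunc)(unsigned long ip, unsigned short port);

    modfunc load_module(const char *name)
    {
        char path[512], sym[256];
        void *h;
        modfunc f;

        snprintf(path, sizeof(path), "./%s.so", name);  /* MODULENAME.so */
        snprintf(sym, sizeof(sym), "_%s", name);        /* _MODULENAME() */

        if ((h = dlopen(path, RTLD_NOW)) == NULL) {
            fprintf(stderr, "dlopen: %s\n", dlerror()); /* diagnostic print */
            return NULL;
        }
        if ((f = (modfunc)dlsym(h, sym)) == NULL) {
            fprintf(stderr, "dlsym: %s\n", dlerror());
            dlclose(h);
            return NULL;
        }
        f(0, 0);   /* test call with 0 for all params, as described */
        return f;
    }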
Problems/Bugs/Issues
- Since cobweb does the module stuff, the probes are done
from the slower link, if you choose to run the server and
client on a slow and a fast link, respectively.
- Module response is limited to 1 byte per module right now.
- DNS lookups are repeated over and over, even if you hit
a site over and over; they should be cached locally or something,
possibly even shared between bunnies.
- READDATA doesn't support directories, only ips/ports.
- No cobweb config file to automatically load modules on startup.
- The module calling interface is weak; it only passes the ip and
port of the web server. A module should be given access to read
the entire structure for its site, including the module data, so
stacking of modules is possible.
- Although cobweb does back off on servers, it still lets
bunnies pound them to death in the beginning; it would be better
if that could be spread out through the queue somehow. Pounding a
server should only happen if that server is one of the only things
left in the queue.
Packet formats (not done yet)
DEQUEUE -> format:
2 bytes -- 3
1 byte -- DEQUEUE
DEQUEUE <- format:
2 bytes -- 10 + strlen(hostname) + strlen(filename)
4 bytes -- addr
2 bytes -- port
X bytes -- hostname + NULL
X bytes -- filename + NULL
ENQUEUE -> format:
2 bytes -- 11 + strlen(hostname) + strlen(filename)
1 byte -- ENQUEUE
4 bytes -- addr
2 bytes -- port
X bytes -- hostname + NULL
X bytes -- filename + NULL
STOREURL -> format:
2 bytes -- 11 + strlen(hostname) + strlen(filename)
1 byte -- STOREURL
4 bytes -- addr
2 bytes -- port
X bytes -- hostname + NULL
X bytes -- filename + NULL
LOGDATA -> format:
2 bytes -- 4 + strlen(filename)
1 byte -- LOGDATA
X bytes -- filename + NULL
MODSTAT -> format:
2 bytes -- 3
1 byte -- MODSTAT
SHOWDATA -> format:
2 bytes -- 3
1 byte -- SHOWDATA
READDATA -> format:
2 bytes -- 4 + strlen(filename)
1 byte -- READDATA
X bytes -- filename + NULL
NEWMOD -> format:
2 bytes -- 4 + strlen(modname)
1 byte -- NEWMOD
X bytes -- modname + NULL
QUIT -> format:
2 bytes -- 3
1 byte -- QUIT
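As a sanity check on those lengths, here's a sketch that builds an
ENQUEUE packet. The opcode value and the byte order of the length,
addr, and port fields are assumptions (network order here); only the
layout comes from the format above. buf must have room for the whole
packet:

    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>

    #define ENQUEUE 0x06   /* hypothetical opcode value */

    size_t build_enqueue(unsigned char *buf, uint32_t addr, uint16_t port,
                         const char *hostname, const char *filename)
    {
        size_t hl = strlen(hostname) + 1;        /* include the NULL */
        size_t fl = strlen(filename) + 1;
        uint16_t len = (uint16_t)(9 + hl + fl);  /* == 11 + the strlens */
        uint16_t nlen = htons(len);
        uint16_t nport = htons(port);
        uint32_t naddr = htonl(addr);
        unsigned char *p = buf;

        memcpy(p, &nlen, 2);  p += 2;        /* length, self included */
        *p++ = ENQUEUE;                      /* packet type */
        memcpy(p, &naddr, 4); p += 4;        /* addr */
        memcpy(p, &nport, 2); p += 2;        /* port */
        memcpy(p, hostname, hl); p += hl;    /* hostname + NULL */
        memcpy(p, filename, fl); p += fl;    /* filename + NULL */
        return (size_t)(p - buf);            /* == len */
    }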
- bunny does the name lookup, since bunnies can do multiple name
lookups at once if you run multiple bunnies.
- add a new packet type that does a STOREURL and DEQUEUE at the same time
- no need to dequeue the hostname, the ip is fine
- specify a flag in the STOREURL which determines if the site
has already been hit.. easy to do by checking in the url parser
whether the url is relative or absolute.
- possibly add a DNS lookup check between cobweb and bunny..
before the hostname is parsed, bunny can ask cobweb for the
hostname's ip; if cobweb has it, it gives it up. It can also return
stuff such as an error saying I've hit this site too many times,
skip it and take this one from the queue instead (while returning
one from the queue).
- take advantage of HTTP/1.1 keep-alive