Hi Günther,
I want to build a spider application in Java and after
some very simple prototyping, I started looking around how
others have solved the problem and I discovered j-spider.
I spent a few hours browsing the code and I like the
architecture very much! I like the overall architecture and
I like the way you use visitors, strategies, events and
factories. Quite a bit of javadoc is still missing, and for
some classes and methods it's not easy to figure out what
they mean. I have seen a lot of code (in my
professional life) and j-spider belongs to the best stuff
I've seen. It has a clear architecture, clear separation of
interfaces and implementations, many junit tests and it
seems you have focused on the important stuff first.
Congratulations!
Unless I find anything better (which I really doubt ;-), I
want to use j-spider for my application. Fortunately, you
have chosen LGPL, so I can use it for my commercial
application. And I would volunteer to contribute and help
you :-D.
I wonder what application you had in mind, when you created
j-spider...
Unfortunately, I cannot say too much about the application I'm
working on, because it is still "top secret". But I can say
this much: it is a kind of interactive spider. It spiders the
web and rates pages, links and resources by some algorithm,
and the spider should then investigate sites with a higher
score with a higher probability (this means it's not
depth-first or breadth-first, but a kind of probabilistic
scanning).
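To make the idea concrete, such score-weighted URL selection could be sketched like this (the class and its scoring inputs are hypothetical illustrations, not part of j-spider):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Picks the next URL to visit with probability proportional to its score,
// instead of strict depth-first or breadth-first order.
class ProbabilisticFrontier {
    private final List<String> urls = new ArrayList<String>();
    private final List<Double> scores = new ArrayList<Double>();
    private final Random random = new Random();

    public void add(String url, double score) {
        urls.add(url);
        scores.add(score);
    }

    // Roulette-wheel selection: higher-scored URLs are chosen more often.
    public String next() {
        double total = 0;
        for (double s : scores) total += s;
        double r = random.nextDouble() * total;
        for (int i = 0; i < urls.size(); i++) {
            r -= scores.get(i);
            if (r <= 0) {
                scores.remove(i);
                return urls.remove(i);
            }
        }
        // Floating-point fallback: return the last entry.
        scores.remove(scores.size() - 1);
        return urls.remove(urls.size() - 1);
    }

    public boolean isEmpty() { return urls.isEmpty(); }
}
```

The scoring algorithm itself stays pluggable: whatever rates the pages just feeds scores into `add()`.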
There is also a requirement to access sites with
authentication. I need good cookie management. In some
cases it needs to submit some forms. In other cases it needs
to access pages that are only accessible via javascript. I
have not yet looked deeply enough into j-spider to see how it
could be extended in those directions.
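For illustration, a minimal per-host cookie jar to use alongside plain Java HTTP requests might look like this (a simplified sketch that ignores cookie paths, domains and expiry; none of this is existing j-spider code):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal cookie jar: remembers name=value pairs per host and
// replays them as a single Cookie request header.
class SimpleCookieJar {
    private final Map<String, Map<String, String>> store =
            new HashMap<String, Map<String, String>>();

    // Feed it every Set-Cookie response header value, e.g. "SID=abc; Path=/".
    public void store(String host, String setCookieValue) {
        String pair = setCookieValue.split(";")[0];  // drop the attributes
        int eq = pair.indexOf('=');
        if (eq < 0) return;
        Map<String, String> jar = store.get(host);
        if (jar == null) {
            jar = new HashMap<String, String>();
            store.put(host, jar);
        }
        jar.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
    }

    // Build the Cookie header value to send on the next request to this host.
    public String cookieHeader(String host) {
        Map<String, String> jar = store.get(host);
        if (jar == null || jar.isEmpty()) return null;
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : jar.entrySet()) {
            if (sb.length() > 0) sb.append("; ");
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }
}
```

The spider would call `store()` for each Set-Cookie header it receives and set the result of `cookieHeader()` on each outgoing request.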
There are two other projects that implement the above
requirements quite well, but they are not spiders but web
site checkers:
http://sourceforge.net/projects/htmlunit
http://sourceforge.net/projects/httpunit
Both support authentication, cookies and javascript.
Because they are website checkers, they throw errors in cases
where something goes wrong, and it seems pretty hard to me
to convert them into "robust" spiders. But I need both: an
efficient multithreaded spider and support of the above
listed features. So, I would be interested to work on
extensions for j-spider that allow integration of
authentication, cookies and javascript, maybe "peeking"
into those implementations to "borrow" some ideas (or even
creating a super-application that combines both)....
A few questions:
- Is what I am asking for compatible with your plan/vision
of j-spider?
- I have never run j-spider: can you say anything about how
efficient j-spider is?
- Does it use HTTP/1.1 connections?
- htmlunit and httpunit, both support "fake web connection"
to make testing simpler. Would that make sense for
j-spider?
- Do you know any other java spider frameworks? And: how
does j-spider compare with them?
I hope I'm not asking too much ;-)
Michael
Hi,
first of all, thanks for your interest and your nice comments about the project and the code :)
Actually, JSpider started off as a simple java-link checker program I developed to check a customer's site for errors (it was a HUGE site with many malformed and dead links, some resources on other hosts, and other stuff other link checkers didn't seem to be able to deal with).
What I needed was a program that spidered this huge site in a few hours and generated simple reports of all missing resources (and the pages that linked to them), all malformed URLs, all scripts that resulted in 500 (internal server) errors, etc...
After the project, I left the code as it was (VERY alpha - it was just written to save me some months of checking links manually).
Another thing I needed some time later was a tool to download the images and text from a website. I picked up the spider code again, and made a simple mod that wrote everything it found to disk.
Now I started thinking. I wanted to separate the functionality (link checking, resource downloading) from the rules and configuration, and certainly from the spidering engine (which is now JSpider).
Instead of writing the spider from scratch again, I decided to train my java refactoring capabilities and refactored the old program to what JSpider is today.
If you check out the earliest versions of JSpider in the sourceforge CVS, you will find many traces of the old (badly designed) stuff!
So much for the history.
Now, your questions:
* JSpider can, and in most cases will, be used as a depth-first scanner that spiders complete sites. You can, however, configure some Rule implementations (simple decision-making classes that you implement and assign in the configuration files, and that decide which resources should be spidered).
There are already a bunch of implementations that come with JSpider.
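As a rough illustration of the idea (the real JSpider Rule interface has a different, richer signature), such a decision class could look like:

```java
import java.net.URL;

// Hypothetical sketch of a configurable rule, in the spirit of JSpider's
// Rule implementations: the engine asks the rule whether a discovered
// URL should be spidered.
interface Rule {
    boolean accept(URL url);
}

// Example rule: only follow URLs that stay on a given host.
class SameHostRule implements Rule {
    private final String host;

    SameHostRule(String host) {
        this.host = host;
    }

    public boolean accept(URL url) {
        return host.equalsIgnoreCase(url.getHost());
    }
}
```

A probabilistic spider would plug its scoring logic into such a rule (or a chain of them) without touching the engine itself.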
>- Is what I am asking for compatible with your plan/vision
>of j-spider?
Yeah, sure. Why not? By creating your own plugins and rule implementations, you can have JSpider do what you want. And if it's not possible, this means the engine isn't as generic as it should be and we should file a bug report :)
>- I have never run j-spider: can you say anything about how efficient j-spider is?
For the moment, no optimizations have been done yet. If you looked through the source code, you might have noticed that every single thing JSpider has to do is a Task implementation. There are spider tasks (go out on the net and fetch a web page, robots.txt, an image, ...).
There are also thinker tasks (decide whether a certain URL should be fetched, decide whether a certain page should be interpreted and searched for new URLs, ...).
One thread pool is dedicated to thinker tasks, one for spider tasks.
The performance of JSpider is mainly influenced by the Throttle configuration: in order to prevent the flooding of a webserver with requests, JSpider 'throttles' the requests to 1 every x milliseconds (other strategies can be applied too).
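The "1 request every x milliseconds" idea can be sketched roughly like this (a simplified stand-in, not JSpider's actual Throttle implementation):

```java
// Sketch of request throttling: every fetch thread must pass through
// acquire() before hitting the server, so requests are spaced at least
// intervalMillis apart regardless of how many worker threads run.
class SimpleThrottle {
    private final long intervalMillis;
    private long lastRequest = 0;

    public SimpleThrottle(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    // Blocks until at least intervalMillis have passed since the last request.
    public synchronized void acquire() throws InterruptedException {
        long wait = lastRequest + intervalMillis - System.currentTimeMillis();
        if (wait > 0) Thread.sleep(wait);
        lastRequest = System.currentTimeMillis();
    }
}
```

Because `acquire()` is synchronized, the throttle also serializes the spider threads at the point of contact with the web server, which is exactly the flood protection described above.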
>- Does it use HTTP/1.1 connections?
Yes. Since it simply uses the Java HTTP libs, it is HTTP/1.1 capable.
>- htmlunit and httpunit, both support "fake web connection"
>to make testing simpler. Would that make sense for
>j-spider?
Hmmm... maybe not a bad idea. I should look into this in detail to see if it's applicable.
The way I test today, is:
1) simple technical class-level JUnit tests (called 'testTechnical' in the ANT script)
2) 'functional' tests, that spider resources put on our web site, designed for the unit tests.
These resources are constructed to test certain aspects: redirection, errors, malformed urls, ...
After the spidering process, a check is done whether JSpider reported everything as we thought it should be.
>- Do you know any other java spider frameworks? And: how
>does j-spider compare with them?
Actually, no!
There are some link checkers around, there are some site downloaders around, but two important things were lacking in each of the projects I found:
* Complete configurability (a configurable rule system)
* Extensibility (if the system doesn't do what you want, write a plugin)
Hope this answers your questions.
regards,
Günther.
TIP: Keep an eye on the site in the coming days/weeks. Version 0.5.0-DEV of JSpider is underway, and I've written a complete User Manual (120+ pages) that explains the installation, configuration and usage of JSpider.
Hi Günther,
thank you for your reply! I'll spend some more time reading
the j-spider code and maybe running some tests. If you need
a "beta-tester" for your manual, I'm willing to read and
comment it :-)
Hmmm, maybe I should wait for the manual, because it might
take me much more time to figure out what j-spider is doing
by reading the bare code without any manual.... When can I
expect to see a first version of the manual?
> Instead of writing the spider from scratch again, I
> decided to train my java refactoring capabilities and
> refactored the old program to what JSpider is today.
What IDE are you using? I use eclipse (www.eclipse.org); it
has fantastic refactoring capabilities built in (plus CVS
support and JUnit test support...):
http://dev.eclipse.org/help21/topic/org.eclipse.jdt.doc.user/tasks/tasks-80.htm
Michael
Hi Günther, Michael,
While searching for a Spidering framework, I came across JSpider. The project looks very exciting. I've read the docs and plan to look at the source this weekend. My only concern is future support and development. Is anyone currently working on this project? Since May, have either of you run across any comparable projects?
Thank you very much for your time and hard work.
Anthony
Anthony, I am preparing a 1.0 release to appear in the coming weeks. Many improvements. I plan to support this as long as I can, and to leave it in a clean state for when I can't.
EJP