
Is j-spider the library for my application?

2003-04-30
2021-07-27
  • Michael Scharf

    Michael Scharf - 2003-04-30

    Hi Günther,

    I want to build a spider application in Java, and after
    some very simple prototyping I started looking around at how
    others have solved the problem and discovered j-spider.

    I spent a few hours browsing the code and I like the
    architecture very much! I like the overall architecture and
    I like the way you use visitors, strategies, events and
    factories. There is obviously quite some javadoc missing, and
    for some classes and methods it's not easy to figure out
    what they mean. I have seen a lot of code (in my
    professional life) and j-spider belongs to the best stuff
    I've seen. It has a clear architecture, clear separation of
    interfaces and implementations, many junit tests and it
    seems you have focused on the important stuff first.
    Congratulations!

    Unless I find anything better (which I really doubt ;-), I
    want to use j-spider for my application. Fortunately, you
    have chosen LGPL, so I can use it for my commercial
    application. And I would volunteer to contribute and help
    you :-D.

    I wonder what application you had in mind, when you created
    j-spider...

    Unfortunately, I cannot say too much about the application I
    am working on, because it is still "top secret". But this
    much I can say: it is a kind of interactive spider. It
    spiders the web and rates pages, links and resources by some
    algorithm, and the spider should then investigate sites with a
    higher score with a higher probability (this means it's not
    depth-first or breadth-first, but a kind of probabilistic
    scanning).

    There is also a requirement to access sites with
    authentication. I need good cookie management. In some
    cases it needs to submit forms. In other cases it needs
    to access pages that are only reachable via JavaScript. I
    have not yet looked deeply enough into j-spider to see how it
    could be extended in those directions.

    There are two other projects that implement the above
    requirements quite well, but they are not spiders but web
    site checkers:
      http://sourceforge.net/projects/htmlunit
      http://sourceforge.net/projects/httpunit

    Both support authentication, cookies and JavaScript.
    Because they are website checkers, they throw errors in
    cases where something goes wrong, and it seems pretty hard
    to me to convert them into "robust" spiders. But I need
    both: an efficient multithreaded spider and support for the
    features listed above. So I would be interested in working
    on extensions for j-spider that allow integration of
    authentication, cookies and JavaScript, maybe "peeking"
    into those implementations to "borrow" some ideas (or even
    creating a super-application that combines both)....

    A few questions:
    - Is what I am asking for compatible with your plan/vision
      of j-spider?
    - I have never run j-spider: can you say anything about how
      efficient j-spider is?
    - Does it use HTTP/1.1 connections?
    - htmlunit and httpunit, both support "fake web connection"
      to make testing simpler. Would that make sense for
      j-spider?
    - Do you know any other java spider frameworks? And: how
      does j-spider compare with them?

    I hope I'm not asking too much ;-)

    Michael

     
    • Günther Van Roey

      Hi,

      first of all, thanks for your interest and your nice comments about the project and the code :)

      Actually, JSpider started off as a simple java-link checker program I developed to check a customer's site for errors (it was a HUGE site with many malformed and dead links, some resources on other hosts, and other stuff other link checkers didn't seem to be able to deal with).

      What I needed was a program that spidered this huge site in a few hours, and generated simple reports of all missing resources (and the pages that linked to them, all malformed URLs, all scripts that resulted in 500 (internal server) errors, etc...

      After the project, I let the code as it was (VERY alpha - it was just written to save me some months of checking links manually).

      Another thing I needed some time later was a tool to download the images and text from a website.  I picked up the spider code again, and made a simple mod that wrote everything it found to disk.

      Now I started thinking.  I wanted to separate the functionality (link checking, resource downloading) from the rules and configuration, and certainly from the spidering engine (which is now JSpider).

      Instead of writing the spider from scratch again, I decided to train my java refactoring capabilities and refactored the old program to what JSpider is today.

      If you checked out the earliest versions of JSpider in the sourceforge CVS, you would find many traces of the old (badly designed) stuff!

      So, that's for the history.
      Now, your questions:

      * JSpider can, and in most cases will, be used as a depth-first scanner that spiders complete sites.  You can, however, configure some Rule implementations (simple decision-taking classes you implement and assign in the configuration files) that decide which resources should be spidered.
      There are already a bunch of implementations that come with JSpider.
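      To make the idea concrete, here is a minimal sketch of such a decision-taking class. The `Rule` interface and the class names below are hypothetical illustrations of the concept, not JSpider's actual API:

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical counterpart of JSpider's rule concept: a small class
// that decides whether a given resource should be spidered.
interface Rule {
    boolean accept(URL url);
}

// Example rule: only spider resources on one host (illustrative name).
class SameHostRule implements Rule {
    private final String host;

    SameHostRule(String host) {
        this.host = host;
    }

    public boolean accept(URL url) {
        return host.equalsIgnoreCase(url.getHost());
    }
}

public class RuleDemo {
    public static void main(String[] args) throws MalformedURLException {
        Rule rule = new SameHostRule("j-spider.sourceforge.net");
        // Accepted: same host as the one we are spidering.
        System.out.println(rule.accept(new URL("http://j-spider.sourceforge.net/index.html")));
        // Rejected: resource lives on another host.
        System.out.println(rule.accept(new URL("http://other.example.com/page.html")));
    }
}
```

      In JSpider itself, such classes are assigned in the configuration files rather than instantiated by hand, so the engine stays generic and the decisions stay pluggable.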

      >- Is what I am asking for compatible with your plan/vision
      >of j-spider?

      Yeah, sure.  Why not?  By creating your own plugins and rule implementations, you can have JSpider do what you want.  And if it's not possible, this means the engine isn't as generic as it should be and we should file a bug report :)

      >- I have never run j-spider: can you say anything about how efficient j-spider is?
      For the moment, no optimizations have been done yet.  If you looked through the sourcecode, you moght have noticed that every single thing JSpider has to do is a Task implementation.  There are spider tasks (go out on the net and fetch a web page, robots.txt, image, ...
      There are also thinker tasks (decide whether a certain URL should be fetched, decide whether a certain page should be interpreted and sought for new URLs, ...)
      One thread pool is dedicated to thinker tasks, one for spider tasks.
      The performance of JSpider is mainly influenced by the Throttle configuration:  in order to prevent the flooding of a webserver with requests, JSpider 'throttles' the requests to 1 every x milliseconds (other strategies can be applied too).
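      The "1 request every x milliseconds" idea can be sketched in a few lines. This is only an illustration of the throttling strategy described above, under assumed names, not JSpider's actual Throttle implementation:

```java
// Hypothetical sketch of a throttle that allows at most one request
// every intervalMillis milliseconds, shared by all spider threads.
class SimpleThrottle {
    private final long intervalMillis;
    private long lastRequest = 0;

    SimpleThrottle(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    // Blocks the calling spider thread until the next request is allowed.
    synchronized void throttle() throws InterruptedException {
        long wait = lastRequest + intervalMillis - System.currentTimeMillis();
        if (wait > 0) {
            Thread.sleep(wait);
        }
        lastRequest = System.currentTimeMillis();
    }
}

public class ThrottleDemo {
    public static void main(String[] args) throws InterruptedException {
        SimpleThrottle t = new SimpleThrottle(100);
        long start = System.currentTimeMillis();
        t.throttle(); // first call passes immediately
        t.throttle(); // waits ~100 ms
        t.throttle(); // waits another ~100 ms
        System.out.println("elapsed ms: " + (System.currentTimeMillis() - start));
    }
}
```

      Because every fetch goes through a task, inserting such a gate between the spider threads and the network is enough to keep the target webserver from being flooded, whatever the thread pool sizes are.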

      >- Does it use HTTP/1.1 connections?
      Yes.  Since it simply uses the Java HTTP libs, it is HTTP/1.1 capable.

      >- htmlunit and httpunit, both support "fake web connection"
      >to make testing simpler. Would that make sense for
      >j-spider?
      Hmmm... maybe not a bad idea.  I should look into this in detail to see if it's applicable.
      The way I test today, is:
      1) simple technical class-level JUnit tests (called 'testTechnical' in the ANT script)
      2) 'functional' tests, that spider resources put on our web site, designed for the unit tests.
      These resources are constructed to test certain aspects: redirection, errors, malformed urls, ...
      After the spidering process, a check is done whether JSpider reported everything as we thought it should be.

      >- Do you know any other java spider frameworks? And: how
      >does j-spider compare with them?
      Actually, no!
      There are some link checkers around, there are some site downloaders around, but two important things were lacking in each of the projects I found:

      * Complete configurability (a configurable rule system)
      * Extensibility (if the system doesn't do what you want, write a plugin)

      Hope this answers your questions.

      regards,

      Günther.

      TIP:  Keep an eye on the site in the coming days/weeks.  Version 0.5.0-DEV of JSpider is underway, and I've written a complete User Manual (120+ pages) that explains the installation, configuration and usage of JSpider.

       
    • Michael Scharf

      Michael Scharf - 2003-05-01

      Hi Günther,

      thank you for your reply! I'll spend some more time reading
      the j-spider code and maybe running some tests. If you need
      a "beta-tester" for your manual, I'm willing to read and
      comment on it :-)

      Hmmm, maybe I should wait for the manual, because it might
      take me much more time to figure out what j-spider is doing
      by reading the bare code without any manual.... When can I
      expect to see a first version of the manual?

      > Instead of writing the spider from scratch again, I
      > decided to train my java refactoring capabilities and
      > refactored the old program to what JSpider is today.

      What IDE are you using? I use eclipse (www.eclipse.org), it
      has fantastic refactoring capabilities built in (plus CVS
      support and JUnit test support...):
        http://dev.eclipse.org/help21/topic/org.eclipse.jdt.doc.user/tasks/tasks-80.htm

      Michael

       
    • Anthony Yam

      Anthony Yam - 2003-11-26

      Hi Günther, Michael,

      While searching for a Spidering framework, I came across JSpider.  The project looks very exciting.  I've read the docs and plan to look at the source this weekend.  My only concern is future support and development.  Is anyone currently working on this project?  Since May, have either of you run across any comparable projects?

      Thank you very much for your time and hard work.
      Anthony

       
  • EJP

    EJP - 2021-07-27

    Anthony, I am preparing a 1.0 release to appear in the coming weeks. Many improvements. I plan to support this as long as I can, and to leave it in a clean state for when I can't.

    EJP

     
