Logging revisions
Sitemap plugin improvements
Rationalize Rules subsystem
Sitemap parser
Set If-Modified-Since header and observe 304 status
Use HEAD request when spidering: only fetch content if actually required
Rewrite cookie handling to use JDK facilities
Rewrite authentication to use standard JDK facilities
Rewrite proxying to use standard JDK facilities
Use 'default' configuration as the default for all other general configurations
Accept Content-Encoding: gzip, deflate
Lucene Plug-In
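The "rewrite cookie handling to use JDK facilities" item above presumably refers to java.net.CookieManager. A minimal sketch, assuming nothing about JSpider's own code (the host and cookie values are made up; the class name is illustrative):

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpCookie;
import java.net.URI;
import java.util.List;

public class JdkCookieSketch {
    // Build a CookieManager of the kind HttpURLConnection would consult
    // after CookieHandler.setDefault(cm); here the store is exercised
    // directly to show a cookie surviving a round trip.
    public static List<HttpCookie> roundTrip() throws Exception {
        CookieManager cm = new CookieManager();
        cm.setCookiePolicy(CookiePolicy.ACCEPT_ALL);
        URI uri = new URI("http://example.com/");
        cm.getCookieStore().add(uri, new HttpCookie("session", "abc123"));
        return cm.getCookieStore().get(uri);
    }
}
```

In a spider, calling CookieHandler.setDefault(cm) once would make every subsequent HttpURLConnection send and record cookies automatically, which is the point of delegating to the JDK.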
Deleted this code again in favour of using a java.net.Authenticator as God intended.
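The java.net.Authenticator approach mentioned above can be sketched as follows; the credentials, host, and class name are placeholders, not JSpider code:

```java
import java.net.Authenticator;
import java.net.InetAddress;
import java.net.PasswordAuthentication;

public class JdkAuthSketch {
    // Install a process-wide authenticator; HttpURLConnection consults it
    // automatically when a server answers 401 (or a proxy answers 407).
    public static void install(final String user, final String password) {
        Authenticator.setDefault(new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                return new PasswordAuthentication(user, password.toCharArray());
            }
        });
    }

    // Demonstration helper: ask the installed authenticator directly,
    // the way the JDK's HTTP client would on a challenge.
    public static PasswordAuthentication probe() {
        return Authenticator.requestPasswordAuthentication(
                "example.com", (InetAddress) null, 80, "http", "realm", "basic");
    }
}
```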
Lucene Plug-In
Accepted for the upcoming 1.0 release. Configurability remains an issue, mainly of Analyzers. At present you can name your own Analyzer class if it only needs no-args construction, with StandardAnalyzer as the default; if you need a more complex one, you override a getAnalyzer() method. Not sure what else can be done. It's a long time since I used Lucene, and they keep changing the API, which doesn't help. Also, it introduces a dependency on SLF4J, grrr.
All this is being addressed in the upcoming 1.0 release.
Anthony, I am preparing a 1.0 release to appear in the coming weeks. Many improvements. I plan to support this as long as I can, and to leave it in a clean state for when I can't. EJP
Lucene Plug-In
Fixed in upcoming 1.0 release.
JSpider reusability
Events not visited
no favicon.ico support
String index out of range: -1
Spider CSS and JS files
Good idea. A CSS parser and a more flexible MIME-type system have been added to the upcoming 1.0 release.
Make Caching site specific property
This patch also requires adding the following to SpiderContextImpl.registerNewSite(), after all the other 'sitei' configurations:

    sitei.setCacheControl(siteProps.getString(ConfigConstants.SITE_CACHE_CONTROL, Constants.CACHE_CONTROL));
    sitei.setUseCache(siteProps.getBoolean(ConfigConstants.SITE_CACHE_USE, false));
Added functionality to Storage
OnDiskStorage implementation
Incorporated in upcoming 1.0 release.
Add basic http auth (with patch)
Incorporated in upcoming 1.0 release.
new to search engine
Certainly.
Fixes problem with redirects not specifying a full URL
Java 1.5.0 API Changes Thread.getState()
WorkerThread int getState method clashes with Java 1.5
JSpider crash on Web site
The StringIndexOutOfBoundsException is a duplicate of #5. The HTTP 500 status isn't a problem with JSpider. The notfound.htm error should have been fixed.
PANIC! Task net.javacoding.jspider.core.task.work.FetchRobot
Fixed in upcoming 1.0 release.
No activity besides start page when robotstxt.fetch is false
TextHtmlMimeTypeOnlyRule: Exception not caught
The only exception that can be thrown is InvalidStateForActionException, which indicates a coding bug. Unclear why this should be fixed.
Fixed in upcoming 1.0 release. Plugins can now have any of four constructors: (), (PropertySet config), (String name), (String name, PropertySet config).
DiskWriterPlugin constructor missing parameter
Fixed in upcoming 1.0 release. Plugins can now have any of three constructors: (), (PropertySet config), (String name, PropertySet config).
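Supporting several constructor forms implies reflective constructor selection. A minimal self-contained sketch of that idea (PropertySet is omitted so the example compiles on its own; all class and method names here are illustrative, not JSpider's actual API):

```java
import java.lang.reflect.Constructor;

public class PluginInstantiator {
    // Prefer a (String name) constructor and fall back to the no-arg form;
    // the real mechanism would also try the PropertySet variants.
    public static Object instantiate(Class<?> cls, String name) throws Exception {
        try {
            Constructor<?> c = cls.getConstructor(String.class);
            return c.newInstance(name);
        } catch (NoSuchMethodException e) {
            return cls.getConstructor().newInstance();
        }
    }

    // Sample "plugins" for demonstration only.
    public static class NamedPlugin {
        public final String name;
        public NamedPlugin(String name) { this.name = name; }
    }

    public static class PlainPlugin {
        public PlainPlugin() {}
    }
}
```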
Port is not included in the DiskWriterPlugin folder.
After consideration it seems to me that the present behaviour is correct. A site is identified by a hostname. A different protocol (http/https) or port doesn't make it a different site.
Two sites on same physical machine considered equal
Fixed in upcoming 1.0 release. SiteDAOImpl now keys from the hostname, not the URL, which solves other problems as well, e.g. different protocols or ports.
Improper resolving of relative URLs
Fixed in upcoming 1.0 release.
I solve a spider exit jdbc bug
Fixed in both setError() methods in upcoming 1.0 release.
Spider follows commented out links
Fixed in upcoming 1.0 release. A proper HTML parser is now used and links are identified via XPath expressions. Comments are therefore ignored, as are element/attribute pairs that cannot contain links, even if their values look like URLs.
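The XPath-based extraction can be sketched with the JDK's own XML facilities. This sketch requires well-formed XHTML input (a real HTML parser, as the fix describes, would tolerate tag soup); the class and method names are illustrative. Note how content inside a comment never becomes an element, so commented-out links drop out for free:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class LinkExtractSketch {
    // Pull href values from <a> elements only; markup inside comments is
    // plain comment text to the DOM, so it can never match //a/@href.
    public static List<String> extractLinks(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//a/@href", doc, XPathConstants.NODESET);
        List<String> links = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            links.add(nodes.item(i).getNodeValue());
        }
        return links;
    }
}
```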
Query part is trimmed from URLs
Query removal has been made optional in the upcoming 1.0 release.
CookieDAOImpl uses incorrect SQL
Apparent URLs in HTML PRE elements are checked
Plugin compatibility with libgcj
The GNU CLASSPATH project to which libgcj belongs is defunct. It never really progressed beyond Java 1.2: it hasn't been updated since 2012, and the companion GCJ Java compiler project was terminated some years ago. Closing as out of date.
CookieDAOImpl uses incorrect SQL
Fixed in upcoming 1.0 release.
You can get it by listening to the SpideringStartedEvent event, i.e. by providing a visit(SpideringStartedEvent event) method.
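The visit-method dispatch described above can be modelled in a few lines. These are not JSpider's real classes, just a self-contained illustration of a listener providing an overload for the event it cares about:

```java
public class EventVisitSketch {
    // Stand-in for the real SpideringStartedEvent; field and accessor
    // names are made up for the illustration.
    public static class SpideringStartedEvent {
        private final String baseUrl;
        public SpideringStartedEvent(String baseUrl) { this.baseUrl = baseUrl; }
        public String getBaseUrl() { return baseUrl; }
    }

    public static class MyPlugin {
        String observedBase;
        // Overload invoked when spidering starts; record the base URL here.
        public void visit(SpideringStartedEvent event) {
            observedBase = event.getBaseUrl();
        }
    }
}
```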
MySQL BLOB columns hold at most 2^16-1 bytes. Change the column to a MEDIUMBLOB (up to 2^24-1 bytes).
It could be argued that the current behaviour is correct, given that, say, you don't want to treat http://localhost:80 and https://localhost:443 as different sites. Maybe it should distinguish different ports within the same protocol. I will ponder on this for the upcoming release.