| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 2004 |     |     | 3   |     | 1   |     |     |     |     |     | 2   |     |
| 2005 |     |     |     |     |     |     |     |     | 1   |     | 1   |     |
| 2007 |     |     |     |     |     |     |     |     |     |     |     | 1   |
---

From: Sigletos G. <sig...@te...> - 2007-12-12 13:42:32

Hello. Is this list currently active?
I get the following error message when compiling jspider:
[javac] Compiling 58 source files to
/projects/home/jspider-src-0.5.0-dev/src/stage/compiled/main
[javac]
/projects/home/jspider-src-0.5.0-dev/src/stage/src/main/net/javacoding/jspider/core/threading/WorkerThread.java:109:
getState() in net.javacoding.jspider.core.threading.WorkerThread cannot
override getState() in java.lang.Thread; attempting to use incompatible
return type
[javac] found : int
[javac] required: java.lang.Thread.State
[javac] public int getState ( ) {
[javac] ^
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 1 error
Any ideas? Thanks
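The failure above comes from a change in Java 5: java.lang.Thread gained a public getState() method returning java.lang.Thread.State, so a subclass can no longer declare a getState() that returns int. A minimal sketch of one possible fix is to rename the subclass accessor; the class body and constant below are illustrative, not JSpider's actual WorkerThread code:

```java
// Illustrative sketch only: shows the Java 5 signature clash and a rename fix.
public class WorkerThread extends Thread {

    public static final int WORKERTHREAD_IDLE = 0; // hypothetical state constant

    private int workerState = WORKERTHREAD_IDLE;

    // This is what breaks under Java 5+: java.lang.Thread.getState() returns
    // java.lang.Thread.State, so an int-returning override is incompatible.
    // public int getState() { return workerState; }

    // Renaming the accessor avoids the clash with java.lang.Thread.getState().
    public int getWorkerState() {
        return workerState;
    }
}
```

Callers of the old method would then switch to getWorkerState(), while the inherited Thread.getState() remains available for the JVM-level thread state.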
---

From: Phil S. <ph...@mk...> - 2005-11-03 13:10:40

A formal announcement follows, but I would like to take this opportunity to thank Gunther for his original work on JSpider. MKSearch uses a number of custom plug-ins to extract metadata from Web pages, which others may be interested in.

Best regards,
Phil

MKDoc Ltd. would like to announce the first beta release of MKSearch, under the GNU General Public Licence. Source and pre-compiled binary downloads are available from the project Web site.

http://www.mksearch.mkdoc.org/downloads/

MKSearch is a metadata search engine that indexes structured metadata in Web documents, not free text in the document body. The data acquisition system:

* Conforms to the Dublin Core metadata in HTML recommendations [1]
* Supports other application profiles, such as the UK e-Government Metadata Standard [2]
* Indexes native RDF formats, including RSS 1.0

The MKSearch system has five major components:

1. A Web crawler based on JSpider [3]
   * Multi-threaded processing
   * Per-site throttle, user agent, depth and linking rules
   * Respects the robots.txt exclusion policy
   * Extensible plug-in based content handling
2. An HTML document validator and formatter based on JTidy [4]
   * Cleans up and corrects HTML syntax errors
   * Converts HTML to XHTML
3. A set of custom indexers based on the Simple API for XML (SAX)
   * Extracts metadata from HTML meta and link elements
   * Converts metadata to RDF triple statements
   * Configurable application profiles
4. An RDF storage and query system based on Sesame [5]
   * XML/RDF file-based storage
   * Database storage using PostgreSQL or MySQL
   * Sophisticated Sesame RDF Query Language (SeRQL) queries
   * Scope for more semantically rich queries with inferencing
5. A public query interface, provided through a standard servlet container
   * Simple, expandable query builder form
   * Configurable application profile-based presentation
   * Wildcard query handling
   * Phrase searches
   * Paged HTML results
   * Standing RSS results

The two main elements of the MKSearch system can be used independently. The data acquisition system can be used to gather large quantities of metadata from the Web and store it as RDF. The query system can be used to provide a typical search engine-style interface to existing RDF content.

The MKSearch beta 1 distribution includes sample configurations that crawl a Web site and create:

* A mirror of the site on the local file system in valid XHTML
* An RDF N-Triple record for each page on the local file system
* UK e-Government metadata in a Sesame file-based repository (XML/RDF)

This distribution also includes a demonstration of the MKSearch query interface, in the form of a Web Application Archive (WAR) that can be deployed directly to an existing servlet container. The sample search content is from an index of the MKSearch project Web site on 2 November 2005. See the site documentation below:

http://www.mksearch.mkdoc.org/documentation/tomcat-on-fc4/
http://www.mksearch.mkdoc.org/howto/
http://www.mksearch.mkdoc.org/plans/beta-1-release-tasks/mksearch-beta-1-release-notes/

System requirements and licence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

MKSearch is written in the Java programming language and is designed to run on any platform that supports a Java environment equivalent to the Sun Java 2 language specification. The system has specifically been designed, developed and tested to run on GNU/Linux systems using the GNU Compiler for Java (GCJ) [6] and the Apache Tomcat 5 servlet container, as available on Fedora Core 4 [7]. This means that MKSearch can be built and run on software systems that are entirely open source and free from proprietary licencing. The system has also been tested extensively using the Sun Java SDK 1.5 on Microsoft Windows 2000. JUnit test suites for the MKSearch code base cover 99% of all code branches.

If you have any comments or questions about the MKSearch system, please join us on the project mailing list.

http://www.email-lists.org/mailman/listinfo/mksearch-dev

References
~~~~~~~~~~

[1] http://dublincore.org/documents/2003/11/30/dcq-html/
[2] http://www.govtalk.gov.uk/schemasstandards/metadata_document.asp?docnum=805
[3] http://j-spider.sourceforge.net/
[4] http://jtidy.sourceforge.net/
[5] http://www.openrdf.org/
[6] http://gcc.gnu.org/java/
[7] http://fedora.redhat.com/

--
MKSearch (beta)
http://www.mksearch.mkdoc.org/
Free, open source metadata search engine with RDF storage and query.
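The SAX-based meta-element indexing described in the announcement (component 3) can be sketched in a few lines of Java. This is not MKSearch code; the class name, subject URI, and DC.*-to-Dublin-Core mapping are assumptions for illustration:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class MetaTripleSketch {

    /** Extracts DC.* meta elements from XHTML and emits N-Triple-style statements. */
    public static List<String> extract(final String subject, String xhtml) throws Exception {
        final List<String> triples = new ArrayList<>();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    String name = atts.getValue("name");
                    String content = atts.getValue("content");
                    if ("meta".equals(qName) && name != null && name.startsWith("DC.") && content != null) {
                        // Map e.g. DC.title to the Dublin Core elements namespace.
                        triples.add("<" + subject + "> <http://purl.org/dc/elements/1.1/"
                            + name.substring(3).toLowerCase() + "> \"" + content + "\" .");
                    }
                }
            });
        return triples;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head>"
            + "<meta name=\"DC.title\" content=\"Example Page\"/>"
            + "</head><body></body></html>";
        extract("http://example.org/page", page).forEach(System.out::println);
    }
}
```

A real indexer would also handle link elements, application profiles, and character escaping, but the meta-to-triple step itself is this small.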
---

From: Bernardo <lis...@gm...> - 2005-09-16 13:10:27

Hello people, are you doing ok?

I'm using JSpider to blacklist some porn sites, but it's still not working 100% as I would like. What I'm trying to do is to list only external links (i.e. links to other sites) from a given website.

Example: given, let's say, www.heleninha.com.br, I want JSpider to list only external addresses like:

http://www.carlinha.com.br/
http://celebrazil.gratishost.com/

And not:

http://www.heleninha.com.br/thumbs/b17.jpg
http://www.heleninha.com.br/thumbs/c23.jpg

I've tried to use "jspider-tool findlinks" with some rules in jspider.properties but had no luck at all. Also, I want to scan the entire site for external links, not only the URL I gave as a parameter. Can anyone give me some hints?

Thanks in advance,
Bernardo
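Independent of JSpider's rule configuration, the filtering described above is at heart a host comparison between the crawl's base URL and each discovered link. A minimal sketch (the class and method names are my own, not part of jspider-tool):

```java
import java.net.URL;

public class ExternalLinkFilter {

    /** True when the candidate link points to a different host than the base site. */
    public static boolean isExternal(URL base, URL candidate) {
        return !base.getHost().equalsIgnoreCase(candidate.getHost());
    }

    public static void main(String[] args) throws Exception {
        URL base = new URL("http://www.heleninha.com.br/");
        // External host: different host string.
        System.out.println(isExternal(base, new URL("http://www.carlinha.com.br/")));
        // Internal resource: same host, so it would be excluded from the listing.
        System.out.println(isExternal(base, new URL("http://www.heleninha.com.br/thumbs/b17.jpg")));
    }
}
```

Note that a plain host comparison treats www.example.com and example.com as different sites; a production filter would normalize hosts or compare registrable domains.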
---

From: Phil S. <ph...@mk...> - 2004-11-12 08:04:56

On 11 Nov 2004, at 17:33, j-s...@li... wrote:

> I attach my notes as a text file to avoid any formatting corruptions.
> If it doesn't come through successfully, I'll post this as a "bug"
> report.

This attachment didn't make it to the message display, so I have posted it as a bug item, ID 1065003.

Phil
---

From: Phil S. <ph...@mk...> - 2004-11-11 17:34:09

I am preparing to use JSpider for a project I am developing and have been studying the user manual in detail. Having a manual is a great thing in itself, and I thought a useful contribution would be some detailed feedback on it. This may be more detail than you wanted, but please take it in the constructive spirit in which it is offered.

I attach my notes as a text file to avoid any formatting corruptions. If it doesn't come through successfully, I'll post this as a "bug" report.

Best regards,
Phil
---

From: <ben...@id...> - 2004-05-25 08:07:40

Dear Open Source developer,

I am doing a research project on "Fun and Software Development" in which I kindly invite you to participate. You will find the online survey at http://fasd.ethz.ch/qsf/. The questionnaire consists of 53 questions and you will need about 15 minutes to complete it.

With the FASD project (Fun and Software Development) we want to define the motivational significance of fun when software developers decide to engage in Open Source projects. What is special about our research project is that a similar survey is planned with software developers in commercial firms. This procedure allows an immediate comparison between the involved individuals and the production conditions of these two development models. Thus we hope to obtain substantial new insights into the phenomenon of Open Source development.

With many thanks for your participation,
Benno Luthiger

PS: The results of the survey will be published at http://www.isu.unizh.ch/fuehrung/blprojects/FASD/. We have set up the mailing list fa...@we... for this study. Please see http://fasd.ethz.ch/qsf/mailinglist_en.html for registration to this mailing list.

_______________________________________________________________________
Benno Luthiger
Swiss Federal Institute of Technology Zurich
8092 Zurich
Mail: benno.luthiger(at)id.ethz.ch
_______________________________________________________________________
---

From: Sigbert K. <si...@wi...> - 2004-03-24 15:47:19

gun...@pa... wrote:

> Hi,
>
> are you sure of this?
> have a look at the log files to verify that the images are not spidered.

I know I have broken links to pictures and PDF files, but they are not reported in 404.out or error-report.out. The reports only list the broken links to HTML pages. That is why I suspect that these links are not tested, or is it a question of reporting?

Sigbert Klinke
---

From: <gun...@pa...> - 2004-03-24 14:27:12

Hi,

are you sure of this? Have a look at the log files to verify that the images are not spidered.

Anyway, the two config entries you mention:

* jspider.rules.spider.(x).class (set to HttpOnlyProtocolRule)
  Means that each and every HTTP resource (HTML, images, PDFs, etc.) should be spidered. URLs pointing to, for example, an FTP server are skipped.
* jspider.rules.parser.(x).class
  Defines that only HTML resources should be parsed to find new links inside them (we wouldn't want JSpider to open an image file and look inside for links).

So, by default the 'checkErrors' config should check images and PDFs. (I've used this config on different sites.)

If you can't find a trace in any of the files of your output folder, please send them to me and we'll dig further into this issue.

regards,
Günther.

> ----- Original message -----
> From: Sigbert Klinke [mailto:si...@wi...]
> Sent: Wednesday, March 24, 2004 11:49 AM
> To: j-s...@li...
> Subject: [JSpider-user] checkErrors
>
> Hi,
>
> I have used the preconfigured checkErrors method to check my Website. It
> found two mistakes with external HTML pages. But in this configuration
> it does not check if images are there or PDF files. I tried to modify
> in conf/checkErrors/jspider.properties the entries
> jspider.rules.spider.1.class/jspider.rules.parser.1.class to
> AcceptAllRule, but the images and PDF files are still not checked for
> presence. What can I do?
>
> Thanks a lot
> Sigbert
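For reference, the two rule chains Günther describes live in conf/checkErrors/jspider.properties and might look roughly as below. The spider rule name is the one quoted in this thread; the key layout and the parser rule name (HtmlOnlyParserRule) are assumptions for illustration, so check the actual 0.5.0 config before copying anything:

```properties
# Sketch, not a verified copy of the shipped checkErrors config.

# Spider rules: decide which URLs are fetched at all.
# HttpOnlyProtocolRule (named in this thread) fetches every HTTP resource
# (HTML, images, PDFs, ...) and skips other protocols such as FTP.
jspider.rules.spider.1.class=HttpOnlyProtocolRule

# Parser rules: decide which fetched resources are opened for link extraction.
# Only HTML should be parsed; images and PDFs are still fetched (so broken
# links to them surface) but never opened. "HtmlOnlyParserRule" is a
# hypothetical name standing in for whatever rule the distribution uses.
jspider.rules.parser.1.class=HtmlOnlyParserRule
```

The key point of the thread is that fetching and parsing are separate rule chains: loosening the parser chain to AcceptAllRule does not change which resources get fetched and error-checked.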
---

From: Sigbert K. <si...@wi...> - 2004-03-24 11:47:58

Hi,

I have used the preconfigured checkErrors method to check my Website. It found two mistakes with external HTML pages. But in this configuration it does not check whether images or PDF files are there. I tried to modify in conf/checkErrors/jspider.properties the entries jspider.rules.spider.1.class/jspider.rules.parser.1.class to AcceptAllRule, but the images and PDF files are still not checked for presence. What can I do?

Thanks a lot
Sigbert