I am having an issue spidering sites with URL GET parameters in the links. It looks as though it's not parsing those links. I don't have the NoURLParamsRule in place, either. Any ideas?
The NoURLParamsRule will simply reject all URLs that have parameters.
The problem you're facing, however, is the hard-coded behaviour of JSpider: it removes all URL parameters.
This should become configurable in the future.
For now, you can adapt the class net.javacoding.jspider.core.util.URLUtil and comment out the call to 'normalizeStripQuery'.
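To illustrate what that call effectively does (the real URLUtil works on java.net.URL objects and differs in detail; this string-based sketch and its names are my own):

```java
// Hypothetical sketch of the stripping behaviour, NOT JSpider source:
// everything from the '?' onward is dropped, so every parameterised
// variant of a page collapses to the same bare URL.
public class URLNormalizeSketch {

    public static String stripQuery(String url) {
        int q = url.indexOf('?');
        return q < 0 ? url : url.substring(0, q);
    }

    public static void main(String[] args) {
        System.out.println(stripQuery("http://example.com/article.jsp?ymd=20090218"));
        // -> http://example.com/article.jsp
        // The parameter is lost, which is why parameterised links
        // never show up as distinct resources.
    }
}
```

Commenting out the call means the query string survives, so article.jsp?id=1 and article.jsp?id=2 stay distinct resources.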
However, you now need to be careful what you spider, because a site can present a virtually unlimited URL space (the same pages over and over, with different parameters), and JSpider will not see that these are actually the same!
Hope this helps.
Thanks for getting back. I noticed the URLUtil yesterday and commented out the normalizeStripQuery call. I have also set a maximum on resources downloaded so it will not run to infinity. Excellent spider. I have written my own Perl one, but am now moving to a Java-only infrastructure, and JSpider seems to be a great tool for the job.
I am still having the problem of not being able to crawl a site, even after commenting out the mentioned method. The site I am trying to download is http://www.gasnatural.com
(It is a FatWire site; that is, most of the links use parameters.)
Please help.
Thanks in advance,
Miguel Tato
I vote for this "bug" too (I think it's really an important feature request ;)
This issue makes it very hard (if not impossible at the moment) to fetch data from some CMSes, which address all of their content with something like blabla.com/index.php?node=123.
It would be great if this were implemented soon.
Otherwise, this tool is of superb quality, and the user documentation is also great. Keep it up! :)
When JSpider writes out a downloaded resource, the file name is taken from the URL. If the URL contains a parameter, that name will be something like "article.jsp?ymd=20090218"; since file names don't allow "?", pages whose URLs contain parameters won't appear in your output folder.
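A common workaround (my own sketch, not JSpider code) is to sanitise the URL-derived name before writing to disk, replacing characters that are illegal or awkward in file names:

```java
// Sketch only: map characters that are reserved on common file
// systems ('?', '&', '=', ':', '*', '"', '<', '>', '|', '\', '/')
// to '_' so parameterised pages can still be saved.
public class FileNameSanitiser {

    public static String toFileName(String urlFileName) {
        return urlFileName.replaceAll("[?&=:*\"<>|\\\\/]", "_");
    }

    public static void main(String[] args) {
        System.out.println(toFileName("article.jsp?ymd=20090218"));
        // -> article.jsp_ymd_20090218
    }
}
```

Note that two different URLs can sanitise to the same file name, so in practice you'd also want to disambiguate collisions (e.g. with a counter or hash).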
All this is being addressed in the upcoming 1.0 release.