I am having an issue spidering sites with URL GET parameters in the links. It looks as though it's not parsing those links. I don't have the NoURLParamsRule in place, either. Any ideas?
The NoURLParamsRule will simply reject all URLs that have parameters.
The problem you're facing, however, is the hard-coded behaviour of JSpider: it removes all URL parameters.
This should become configurable in the future.
For now, you can adapt the class net.javacoding.jspider.core.util.URLUtil and comment out the call to 'normalizeStripQuery'.
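To illustrate what that call effectively does (the real URLUtil works on java.net.URL objects and differs in detail; this string-based sketch and its names are my own):

```java
// Hypothetical sketch of the stripping behaviour, NOT JSpider source:
// everything from the '?' onward is dropped, so every parameterised
// variant of a page collapses to the same bare URL.
public class URLNormalizeSketch {

    public static String stripQuery(String url) {
        int q = url.indexOf('?');
        return q < 0 ? url : url.substring(0, q);
    }

    public static void main(String[] args) {
        System.out.println(stripQuery("http://example.com/article.jsp?ymd=20090218"));
        // -> http://example.com/article.jsp
        // The parameter is lost, which is why parameterised links
        // never show up as distinct resources.
    }
}
```

Commenting out the call means the query string survives, so article.jsp?id=1 and article.jsp?id=2 stay distinct resources.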
However, you now need to be careful what you spider, because a site can present a virtually unlimited URL space (the same pages over and over, with different parameters), and JSpider will not see that these are actually the same!
Hope this helps.
Thanks for getting back. I noticed the URLUtil yesterday and commented out the normalizeStripQuery call. I have also set a maximum on resources downloaded so it will not run to infinity. Excellent spider. I have written my own Perl one, but am now moving to a Java-only infrastructure, and JSpider seems to be a great tool for the job.
I am still having the problem of not being able to crawl a site, even after commenting out the mentioned method. The site I am trying to download is http://www.gasnatural.com
(It is a FatWire site; that is, most of the links use parameters.)
Please help.
Thanks in advance,
Miguel Tato
I vote for this "bug" too (I think it's really an important feature request ;)
This issue makes it very hard (if not impossible at the moment) to fetch data from some CMSes, which address all of their content with something like blabla.com/index.php?node=123.
It would be great if this were implemented soon.
Otherwise, this tool is of superb quality, and the user documentation is also great. Keep it up! :)
When JSpider writes out a downloaded resource, the file name is taken from the URL. If the URL contains a parameter, that name will be something like "article.jsp?ymd=20090218"; since file names don't allow "?", pages whose URLs contain parameters won't appear in your output folder.
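A common workaround (my own sketch, not JSpider code) is to sanitise the URL-derived name before writing to disk, replacing characters that are illegal or awkward in file names:

```java
// Sketch only: map characters that are reserved on common file
// systems ('?', '&', '=', ':', '*', '"', '<', '>', '|', '\', '/')
// to '_' so parameterised pages can still be saved.
public class FileNameSanitiser {

    public static String toFileName(String urlFileName) {
        return urlFileName.replaceAll("[?&=:*\"<>|\\\\/]", "_");
    }

    public static void main(String[] args) {
        System.out.println(toFileName("article.jsp?ymd=20090218"));
        // -> article.jsp_ymd_20090218
    }
}
```

Note that two different URLs can sanitise to the same file name, so in practice you'd also want to disambiguate collisions (e.g. with a counter or hash).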
All this is being addressed in the upcoming 1.0 release.