URL Get Parameters problem

  • jIRleeCH

    jIRleeCH - 2004-04-12

    I am having an issue spidering sites that have URL GET parameters in their links. It looks as though JSpider is not parsing those links. I don't have the NoURLParamsRule in place either. Any ideas?

     
    • Günther Van Roey

      The NoURLParamsRule simply rejects all URLs that have parameters.

      The problem you're facing, however, is the hard-coded behaviour of JSpider to remove all URL parameters.

      This should become configurable in the future.

      For now, you can adapt the class net.javacoding.jspider.core.util.URLUtil and comment out the call to 'normalizeStripQuery'.

      However, you then need to be careful about what you spider: a site can expose a virtually unlimited URL space (the same pages over and over, with different parameters), and JSpider will not see that these are actually the same page.

      Hope this helps.
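
To illustrate the duplicate-URL risk described above, here is a minimal sketch of one way to reduce it once query strings are kept: sort the query parameters so that URLs differing only in parameter order compare equal. The class `QueryCanonicalizer` and its method are hypothetical and are not part of JSpider; the sketch assumes absolute http-style URLs and does not attempt to detect parameters that do not affect the page content (session ids, for example).

```java
import java.net.URI;
import java.util.Arrays;

// Hypothetical helper for illustration only; not JSpider code.
final class QueryCanonicalizer {

    // Drops the fragment and sorts the raw query parameters, so that
    // "?b=2&a=1" and "?a=1&b=2" produce the same string.
    static String canonicalize(String url) {
        URI uri = URI.create(url);
        // Assumption: only absolute, hierarchical URLs are rewritten.
        if (uri.getScheme() == null || uri.getRawAuthority() == null) {
            return url;
        }
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme()).append("://").append(uri.getRawAuthority());
        if (uri.getRawPath() != null) {
            sb.append(uri.getRawPath());
        }
        String query = uri.getRawQuery();
        if (query != null && !query.isEmpty()) {
            String[] params = query.split("&");
            Arrays.sort(params); // plain lexicographic order
            sb.append('?').append(String.join("&", params));
        }
        return sb.toString();
    }
}
```

With this sketch, `index.php?node=123&lang=en` and `index.php?lang=en&node=123#top` on the same host both map to `index.php?lang=en&node=123`, so a visited-URL set would store them once. A cap on downloaded resources is still advisable, since parameter values themselves can be unbounded.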

       
    • jIRleeCH

      jIRleeCH - 2004-04-13

      Thanks for getting back to me. I noticed URLUtil yesterday and commented out the normalizeStripQuery call. I have also set a maximum number of downloaded resources so the crawl will not go on forever. Excellent spider. I had written my own in Perl, but I am now moving to a Java-only infrastructure, and JSpider seems to be a great tool for the job.

       
    • Miguel Tato

      Miguel Tato - 2005-06-01

      Hi all,

      I am still unable to crawl a site after commenting out the mentioned method call. The site I am trying to download is http://www.gasnatural.com
      (It is a FatWire site, so most of its links use parameters.)
      Please help.

      Thanks in advance,

      Miguel Tato

       
    • Daniel

      Daniel - 2006-10-24

      I vote for this "bug", too (I think it's more of an important feature request ;)

      This issue makes it very hard (if not impossible at the moment) to fetch data from some CMSs, which address all of their content through URLs like blabla.com/index.php?node=123

      It would be great if this were implemented soon.

      Otherwise this tool is of superb quality, and the user documentation is also very good. Keep it up! :)

       
    • donnakaren

      donnakaren - 2009-02-19

      When JSpider writes a downloaded file to disk, it uses the file name from the URL. If the URL contains parameters, that name will be something like "article.jsp?ymd=20090218". Since a file name cannot contain "?", pages whose URLs contain parameters will not be saved to your folder.
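
A minimal sketch of one possible workaround for the file-name problem described above: replace the characters that are not allowed in file names before writing to disk. The class `SafeFileName` is hypothetical and not part of JSpider, and the character set used here (the one Windows rejects) is an assumption about the target file system.

```java
// Hypothetical helper for illustration only; not JSpider code.
final class SafeFileName {

    // Replaces each of \ / : * ? " < > | with an underscore.
    // Assumption: these are the characters the target file system rejects.
    static String fromUrlFile(String urlFile) {
        return urlFile.replaceAll("[\\\\/:*?\"<>|]", "_");
    }
}
```

For example, `SafeFileName.fromUrlFile("article.jsp?ymd=20090218")` returns `article.jsp_ymd=20090218`. Different URLs can collide after replacement (`a?b` and `a_b`), so a real implementation would need to handle that case as well.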

       
  • EJP

    EJP - 2021-07-27

    All this is being addressed in the upcoming 1.0 release.

     
