Activity for Jericho HTML Parser

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Hi Ethan, Thank you for the suggestion. Yes I got a request for this already last year: https://sourceforge.net/p/jerichohtml/bugs/93/ The biggest barrier at the moment is the fact that I implemented a new major feature a few years ago (a web crawler API) but it remains poorly documented, and could probably use a couple of minor enhancements before it is officially released. That means all bug fixes since then have just gone into the DEV release: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip...

  • Ethan McCue Ethan McCue posted a comment on discussion Help

    Hi there! Would it be possible to add a module-info to the jar distributed for jericho? A few of us in the wider community would like to use the jlink/jpackage flow more often and to be able to recommend others do the same. Without a module-info, if Jericho is a dependency (no matter how deep down the tree) then folks need to manually figure out what modules it depends on + configure their build tool to account for that. The technical solution here is pretty straightforward - compile a module-info.java...

  • Martin Jericho Martin Jericho modified ticket #95

    Typo in StreamEncodingDetector.isDifinitive

  • Martin Jericho Martin Jericho posted a comment on ticket #95

    Haha thanks!

  • Lolo101 Lolo101 created ticket #95

    Typo in StreamEncodingDetector.isDifinitive

  • Martin Jericho Martin Jericho posted a comment on ticket #93

    Hi Samuel. Thank you for doing all of this. Unfortunately this library is way down my priority list these days. I still use it in heaps of projects, and it still works well, and I still fix an occasional reported bug, but I haven't done an official release for years, and I can't see myself getting to it in the near future. When I do eventually release a new version, I will definitely incorporate your suggestions. Cheers Martin

  • Samael Samael posted a comment on ticket #93

    Attaching an updated build.sh script that will also compile test sources and run unit tests

  • Samael Samael posted a comment on ticket #93

    Sorry for the delay in looking into this. I ended up using standard Java instead of relying on Jericho. However I did checkout the source code and take a look. First off I was surprised to find it's using bazaar, if you do intend to migrate to git it should be as simple as initialising a git repo in the same directory and running brz fast-export -b main | git fast-import assuming the branch is called main. See https://jugmac00.github.io/blog/migrate-a-repository-from-bazaar-to-git/ for details. As...

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    P.S. When you want to include HTML in your post, you need to enclose it in a code block, otherwise the HTML is parsed and doesn't show properly. For example, your sample document should look like this: <html> <head> <meta http-equiv=&quot;Content-Type&quot; content=&quot;html; charset=UTF-8&quot;> </head> </html>

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Hi Davy, The sample HTML you are feeding it doesn't specify a valid encoding and is therefore parsed correctly. Because the quotes are encoded, they are included in the value of the content attribute, which is why the end quote is interpreted as part of the encoding name. You say that the sample content occurs when it is "inserted into an iframe". I assume you mean it appears as the value of the iframe srcdoc attribute. In that case, your sample document should be the HTML containing the iframe,...

  • Davy Jones Davy Jones posted a comment on discussion Help

    Hello, The parser seems to incorrectly parse certain htmls. Example: <meta http-equiv=""Content-Type"" content=""html;" charset="UTF-8""> This type of html happens when a whole web page is inserted into an iframe, so there is an html document inside of an html document as a string and all quote characters are converted into their html counterparts. For the html above, the parser tries to assume that the encoding is UTF-8" (with double quotes at the end). An example of a correctly parsed meta tag:...

  • Samael Samael modified a comment on ticket #93

    It's possible to have multi release jars but it's not trivial. You'd need to compile the existing source with JDK 7 and use JDK 9 or above to compile the module-info.java and bundle it into the same jar. See here for some info about multi-release jars: https://nipafx.dev/multi-release-jars-multiple-java-versions/ I don't really get the desire to support JDK 7 these days, or anything below Java 11 tbh, but if it's needed perhaps it's worth continuing a 3.* branch for JDK 7/8 support and starting a...

  • Samael Samael posted a comment on ticket #93

    It's possible to have multi release jars but it's not trivial. You'd need to compile the existing source with JDK 7 and use JDK 9 or above to compile the module-info.java and bundle it into the same jar. See here for some info about multi-release jars: https://nipafx.dev/multi-release-jars-multiple-java-versions/ I don't really get the desire to support JDK 7 these days, or anything below Java 11 tbh, but if it's needed perhaps it's worth continuing a 3. branch for JDK 7/8 support and starting a...

  • Remi Rosenthal Remi Rosenthal posted a comment on discussion Help

    Hi Martin, Thanks a lot for patching this. I now get expected behaviour! Kind regards, Remi

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Hi Remi, I didn't document anywhere why I made the decision to remove the content of button elements. In general I was copying the behaviour of how some email clients create pure text versions of HTML emails. Maybe I just thought they should be removed because all other form elements (INPUT, TEXTAREA etc) are removed. Or maybe I just didn't think much about it! I've modified the Renderer class in version 3.5 to include the content of BUTTON elements. Until version 3.5 is officially released, the development...

  • Remi Rosenthal Remi Rosenthal modified a comment on discussion Help

    Hi, I've noticed that the Jericho Renderer doesn't include Button elements in its toString(). This is presumably because button is mapped to a RemoveElementHandler in Renderer. I would be interested to hear the rationale behind this, but more importantly, is there a way to override this behaviour on my end? You can reproduce with something as simple as: <html><body><button>My Button</button></body></html> Which will result in an empty string. Many thanks

  • Remi Rosenthal Remi Rosenthal posted a comment on discussion Help

    Hi, I've noticed that the Jericho Renderer doesn't include Button elements in its toString(). This is presumably because button is mapped to a RemoveElementHandler in Renderer. I would be interested to hear the rationale behind this, but more importantly, is there a way to override this behaviour on my end? You can reproduce with something as simple as: <button>My Button</button> Which will result in an empty string. Many thanks

  • Chris Santos-Lang Chris Santos-Lang posted a comment on ticket #94

    Thanks. I get it now :)

  • Martin Jericho Martin Jericho posted a comment on ticket #94

    No you don't need to create multiple copies of the Source object. The important thing is just that you find the server tags and call source.ignoreWhenParsing() on them before searching for the HTML tags that contain them. This implies that you need to call it before a full sequential parse is performed, but that may never even be called if you don't call any of the methods getAllTags(), getAllStartTags(), getAllElements(), getChildElements(), iterator() or Segment.getNodeIterator() method on the...

  • Chris Santos-Lang Chris Santos-Lang posted a comment on ticket #94

    Thanks for the quick response! :) I can avoid the practice of using xml-style server tags inside HTML tags in my own code, but I am running into this difficulty when processing legacy code. Can I ask some follow-up questions? 1. I would be happy to ignore all tags starting with <nested: or <logic: and I'd prefer to use Jericho to find them, rather than write my own parser to find them, but I am confused by the instruction that ignoreWhenParsing() must be called before fullSequentialParse() occurs....

  • Martin Jericho Martin Jericho modified ticket #93

    please support java modules in the next release

  • Martin Jericho Martin Jericho posted a comment on ticket #93

    Hi Samael, Thanks for raising this issue. Sorry I'm not very familiar with the java modules system so I'd appreciate if you could give me some advice and assistance. I'm still targeting java 1.7 to maximise compatibility with older programs. In order to support modules I'd have to compile targeting java 9, right? I'm not sure whether that would mean the library doesn't work with projects targeting Java 8, which I believe is still very common. Do you know whether that would be an issue? Have you already...

  • Martin Jericho Martin Jericho posted a comment on discussion Open Discussion

    Hi Andrew, The release.txt file does mention "minor changes to Renderer behaviour" for version 3.5. The new behaviour is more consistent with browser behaviour so it is most likely an intended change. Cheers Martin

  • Martin Jericho Martin Jericho modified a comment on ticket #94

    Hi Chris, The situation you've encountered is an unavoidable consequence of using xml-style server tags inside HTML tags, and the reason I consider it to be bad practice. It is not possible for a parser to know which tags are server tags unless you tell it, so the solution you proposed does not work in all cases. If you can't avoid the use of xml-style server tags in your HTML, the way you tell the parser which tags are server tags is to explicitly search for them first. For each server tag you find,...

  • Martin Jericho Martin Jericho modified ticket #94

    Incorrectly parsing attribute values containing tags containing quotes

  • Martin Jericho Martin Jericho posted a comment on ticket #94

    Hi Chris, The situation you've encountered is an unavoidable consequence of using xml-style server tags inside HTML tags, and the reason I consider it to be bad practice. It is not possible for a parser to know which tags are server tags unless you tell it, so the solution you proposed does not work in all cases. If you can't avoid the use of xml-style server tags in your HTML, the way you tell the parser which tags are server tags is to explicitly search for them first. For each server tag you find,...

  • Chris Santos-Lang Chris Santos-Lang created ticket #94

    Incorrectly parsing attribute values containing tags containing quotes

  • Andrew Smith Andrew Smith posted a comment on discussion Open Discussion

    Hi Martin, Thanks for responding so quickly. Since my last message, I've been trying out 3.5-dev as I was hoping to take advantage of the memory consumption improvements, but have come across a behaviour difference for the Renderer between 3.4 and 3.5. For example, <p>Hello</p><p><br></p><p>There</p> used to output Hello\r\n\r\nThere But now in 3.5-dev it outputs Hello\r\n\r\n\r\nThere Is this an expected behaviour change? I have attached a screenshot of the Renderer configured

  • Samael Samael created ticket #93

    please support java modules in the next release

  • Martin Jericho Martin Jericho posted a comment on discussion Open Discussion

    Hi Andrew. Version 3.5 hasn't been officially released yet because the newest feature, a web crawler API, has not been fully documented yet. The project is not dead, and minor improvements continue to make their way into the DEV version, but other time commitments have prevented the completion of the documentation and an official release for years. The 3.5-dev version (http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip) is always a release candidate and can be used as a reliable substitute...

  • Andrew Smith Andrew Smith posted a comment on discussion Open Discussion

    Hello, Are there any plans to release 3.5 onto Maven? I've seen the version being developed mentioned on here a few times

  • Martin Jericho Martin Jericho modified ticket #22

    Renderer Always Adds Brackets Around Links

  • Martin Jericho Martin Jericho posted a comment on ticket #22

    Hi Ryan, You can customise this by overriding the renderHyperlinkURL method. An example is provided in the documentation: http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/Renderer.html#renderHyperlinkURL(net.htmlparser.jericho.StartTag) If you only want to remove the brackets if the URL is the same as the element contents, you can check for startTag.getAttributeValue("href").equals(startTag.getElement().getTextExtractor().toString()) or if you want it a bit more efficient and disregard...

  • Ryan Holdren Ryan Holdren created ticket #22

    Renderer Always Adds Brackets Around Links

  • Remi Rosenthal Remi Rosenthal posted a comment on ticket #92

    Thanks for clearing this up, Martin!

  • Martin Jericho Martin Jericho modified ticket #92

    Query parameter names in hyperlinks being incorrectly decoded

  • Martin Jericho Martin Jericho posted a comment on ticket #92

    I just confirmed that character references are decoded inside link elements, at least in Chrome. But the character reference in your example is not terminated, meaning it is missing the final semicolon. When this parser was first written, most browsers still decoded unterminated character references, but each browser behaved differently. So I created a CompatibilityMode class to encapsulate the decoding behaviour of unterminated character references. To configure the parser not to decode any unterminated...

  • Martin Jericho Martin Jericho posted a comment on ticket #92

    Hi Remi, I'm not aware that HTML character references shouldn't be decoded inside link elements. Do you have a source for that? Cheers Martin

  • Remi Rosenthal Remi Rosenthal created ticket #92

    Query parameter names in hyperlinks being incorrectly decoded

  • Daniel Gonzalez Daniel Gonzalez posted a comment on ticket #91

    Ok thanks, I understand and agree with your point. I'll give it some thought of how to handle this scenario with customers that have this issue and decide whether fixing this up silently for them by customising the Renderer class or leave it as is, as that is what a non styled representation of their page would look like. Thanks for your help Danny

  • Martin Jericho Martin Jericho posted a comment on ticket #91

    The HTML should reflect the fact that these links are on separate lines. CSS should not be used to change the meaning of the content, only to style it. Correct HTML would wrap these links in some sort of block element such as li or div, or ideally the HTML5 menuitem element. It's an important aspect of making the web content accessible.

  • Daniel Gonzalez Daniel Gonzalez posted a comment on ticket #91

    Hi Martin, The customer who reported this issue gave this web page as an example that showed this issue: https://leadsforward.com/generating-the-best-solar-leads-before-the-end-of-2019/ The Renderer class joins the words "Do" and "Lead", which are on different lines in the visible page. I've attached a screenshot that shows the visible page and the source code below. Thanks Danny

  • Martin Jericho Martin Jericho posted a comment on ticket #91

    Hi Daniel, In your example you're using relative URLs (foo.com), which the default Renderer class doesn't render at all. This is documented in the Renderer.renderHyperlinkURL(String) method. Because the URL isn't rendered, the content inside the hyperlink is just rendered as normal text. Configuration options such as setHyperlinkContentDelimiters only affect the output when the URL is rendered. I think your only option is going to be to create a copy of the Renderer class and customise the A_ElementHandler,...

  • Daniel Gonzalez Daniel Gonzalez posted a comment on ticket #91

    Hi Martin, Thanks for your response. The renderer.setHyperlinkContentDelimiters(null," ") method would work for us. However, this doesn't seem to work for me. We have Jericho-html-3.5-dev-3. The following code, which includes the above setHyperlinkContentDelimiters call, still concatenates the words "some" and "text": public void testIssue() { final String html = "\n" + " \n" + " sometext\n" + " \n" + ""; final Source source = new Source(html); final Renderer renderer = new Renderer(source); renderer.setHyperlinkContentDelimiters(null,...

  • Martin Jericho Martin Jericho posted a comment on ticket #91

    Hi Daniel, It is very uncommon for CSS to style a hyperlink to add whitespace around it. Two consecutive A elements without whitespace in between will almost always render as a single word in a browser. Therefore the behaviour of the Renderer class is correct and your end user probably does want to know about the lack of a space between the two words when running the spell checker in your example above. Note that the current release version 3.4 contains a bug relating to the rendering of hyperlinks....

  • Daniel Gonzalez Daniel Gonzalez created ticket #91

    Renderer class joins words in consecutive anchor tags text

  • Daniel Gonzalez Daniel Gonzalez posted a comment on ticket #90

    Thanks for the fix. Danny

  • Martin Jericho Martin Jericho modified ticket #90

    Renderer class picks out content from within a script tag

  • Martin Jericho Martin Jericho posted a comment on ticket #90

    Thanks for the bug report! Fixed in version 3.5. Although the parser was already designed to ignore other tags inside SCRIPT elements, there was a bug triggered by the presence of server tags inside the script element. In your example it was the <%- data.price %> tag causing the problem. Until version 3.5 is officially released, the development version is available here: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip Although it has been 5 years since the last official release, version...

  • Martin Jericho Martin Jericho modified ticket #89

    ArrayIndexOutOfBoundsException from Renderer: negative left margin.

  • Martin Jericho Martin Jericho modified ticket #88

    HTML5 parsing problems - links without quotes

  • Daniel Gonzalez Daniel Gonzalez created ticket #90

    Renderer class picks out content from within a script tag

  • David Cockbill David Cockbill posted a comment on ticket #89

    Thanks Martin, Dev release looks to do the job! I can use this until 3.5 is officially released.

  • Martin Jericho Martin Jericho posted a comment on ticket #89

    Thanks for the bug report! Fixed in version 3.5. Negative margins and padding are now treated as zero margin. Until version 3.5 is officially released, the development version is available here: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip Although it has been almost 5 years since the last official release, version 3.5 includes a new major feature that requires significant time to document, and I don't envisage having spare time in the foreseeable future. So the official 3.5 release...

  • David Cockbill David Cockbill created ticket #89

    ArrayIndexOutOfBoundsException from Renderer: negative left margin.

  • Jiří Hak Jiří Hak modified a comment on discussion Help

    I am developing a text editor that tokenizes text containing HTML. I then work with it, searching for references and so on. Basically it is a kind of advanced full-text filtering. I need valid HTML for it. It is quite hard to explain. I also need a library that does not rely on third parties (API, "internet" ...). Your library has great reviews on the internet. The HTML can be simple like I use for example to same "hell"code. Thanks for the answer, I will try it.

  • Jiří Hak Jiří Hak posted a comment on discussion Help

    I am developing a text editor that tokenizes text containing HTML. I then work with it, searching for references and so on. Basically it is a kind of advanced full-text filtering. I need valid HTML for it. It is quite hard to explain. I also need a library that does not rely on third parties (API, "internet" ...). Your library has great reviews on the internet. The HTML can be simple like I use for example to same "hell"code.

  • Martin Jericho Martin Jericho modified a comment on discussion Help

    If you only need to check for matching start and end tags it's easy. Create a stack, iterate through each tag, if it's a start tag add it to the stack, if it's an end tag check that the tag at the top of the stack has the same name. If not, check whether the start tag has an optional end tag. But there are many more potential problems with HTML than mismatched tags, and instead of writing your own code to detect and fix problems, why not use an existing library?
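
    The stack approach described above can be sketched in plain Java. This is only an illustration under stated assumptions: it does not use the Jericho API, the regular expression is a deliberately simplified stand-in for a real tag tokenizer, and the class name TagBalanceSketch and the VOID and OPTIONAL_END sets are hypothetical and incomplete.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: checks that start and end tags match using a stack.
final class TagBalanceSketch {
    // Simplified tag pattern (assumption): optional "/", a name, anything up to ">",
    // and an optional trailing "/" for empty-element syntax such as <br/>.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*?(/?)>");

    // Elements that never have an end tag (partial list, for illustration only).
    private static final Set<String> VOID =
            Set.of("br", "hr", "img", "input", "meta", "link");

    // Elements whose end tag may be omitted (partial list, for illustration only).
    private static final Set<String> OPTIONAL_END =
            Set.of("p", "li", "td", "th", "tr", "dd", "dt", "option");

    static boolean isBalanced(String html) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase();
            boolean isEndTag = !m.group(1).isEmpty();
            boolean isEmptyElement = !m.group(3).isEmpty();
            if (!isEndTag) {
                // Start tag: remember it unless it can never have an end tag.
                if (!VOID.contains(name) && !isEmptyElement) {
                    open.push(name);
                }
                continue;
            }
            // End tag: first drop unclosed elements whose end tag is optional.
            while (!open.isEmpty() && !open.peek().equals(name)
                    && OPTIONAL_END.contains(open.peek())) {
                open.pop();
            }
            // The tag on top of the stack must now have the same name.
            if (open.isEmpty() || !open.peek().equals(name)) {
                return false;
            }
            open.pop();
        }
        // Anything left open must be an element with an optional end tag.
        while (!open.isEmpty() && OPTIONAL_END.contains(open.peek())) {
            open.pop();
        }
        return open.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("<html><body> test </test> </body></html>"));
    }
}
```

    With the sample from this thread, the stray </test> end tag is reported as a mismatch because the tag on top of the stack at that point is body. Comments, CDATA sections, script content and server tags are not handled by this sketch. </body></html>")) throw new AssertionError("stray end tag should be rejected");
if (!TagBalanceSketch.isBalanced("<html><body><p>one<p>two</body></html>")) throw new AssertionError("optional end tags should be accepted");
if (!TagBalanceSketch.isBalanced("<div><br><img src=\"a.png\"/></div>")) throw new AssertionError("void and empty-element tags should be accepted");
if (TagBalanceSketch.isBalanced("<div><span></div>")) throw new AssertionError("unclosed span should be rejected");
</test>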

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    If you only need to check for matching start and end tags it's easy. Create a stack, iterate through each tag, if it's a start tag add it to the stack, if it's an end tag check that the tag at the top of the stack has the same name. If not, check whether the start tag has an optional end tag. But there are many more potential problems with HTML than mismatched tags, and instead of writing your own code to detect and fix problems, why not use an existing library?

  • Jiří Hak Jiří Hak modified a comment on discussion Help

    I need validation. When I use it, all the checks are OK, but this is the only "fail" I have. I have my own custom parser for HTML, but I need something for validation, so I use Jericho. An end tag without a start tag is a big issue and crashes my app. Sorry if my English is bad.

  • Jiří Hak Jiří Hak posted a comment on discussion Help

    I need validation. When I use it, all the checks are OK, but this is the only "fail" I have. I have my own custom parser for HTML, but I need something for validation, so I use Jericho. An end tag without a start tag is a big issue and crashes my app.

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Hi Jiří, The Source Formatter is a tool for formatting valid HTML with indentation, not for fixing broken HTML. If you need to fix broken HTML you'll need to try a library that specialises in that task, such as HtmlCleaner https://sourceforge.net/projects/htmlcleaner/ (I haven't used it myself, but it's under active development) Cheers, Martin

  • Jiří Hak Jiří Hak modified a comment on discussion Help

    I need to format my HTML. I tested an example with http://jerichohtmlparser.appspot.com/samples/FormatSource.jsp and I found a small problem. The formatter skips an end tag that has no start tag. Example: <html><body> test </test> </body></html> and the formatter does this: <html> <body> test </test> </body> Where is start tag for tag *test*????? </html> Is there some setting I can turn on for this? This is a big issue for me. I also tested this with version 3.4 in code, with the same result.

  • Jiří Hak Jiří Hak modified a comment on discussion Help

    I need to format my HTML. I tested an example with http://jerichohtmlparser.appspot.com/samples/FormatSource.jsp and I found a small problem. The formatter skips an end tag that has no start tag. Example: " test " and the formatter does this: " test Where is start tag for tag test????? " Is there some setting I can turn on for this? This is a big issue for me. I also tested this with version 3.4 in code, with the same result.

  • Jiří Hak Jiří Hak posted a comment on discussion Help

    I need to format my HTML. I tested an example with http://jerichohtmlparser.appspot.com/samples/FormatSource.jsp and I found a small problem. The formatter skips an end tag that has no start tag. Example: test and the formatter does this: test Where is start tag for tag test????? Is there some setting I can turn on for this? This is a big issue for me. I also tested this with version 3.4 in code, with the same result.

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Hi Wise Mike, I'm not sure why it's not working for you, but I suspect it will work if you download the Java JDK instead of the Java JRE. If that doesn't work, download the latest development version of the parser from the link below: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip Then open a command prompt in the folder samples/console and enter the following command: Encoding.bat arabic-test-file.html > out.txt (replace arabic-test-file.html with the full pathname of your arabic HTML...

  • wise_mike wise_mike posted a comment on discussion Help

    Hi Martin, Thanks for the prompt reply. I tried adding the file and restarting, but nothing changed. I also tried to re-download Java (I currently have Version 8 Update 211 (build 1.8.0_211-b12)), but I didn't find an option during setup or install to choose the "international version", nor could I find a download link for it. Any suggestions? Thanks,

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Hi Wise Mike, According to this page: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html the problem is that you have a "European languages" version of Java installed on your computer, which doesn't support the Windows-1256 encoding. This seems to be the default if the Java installer "recognizes that the host operating system only supports European languages". To fix the problem, you could either install the international version (apparently no need to download again, just...
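
    One way to confirm whether a given Java installation supports an encoding such as Windows-1256 is to ask the runtime directly. The sketch below uses only the standard java.nio.charset API; the class name CharsetSupportCheck and its helper method are hypothetical names chosen for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper: reports whether the running JVM supports a charset.
final class CharsetSupportCheck {
    static boolean isAvailable(String charsetName) {
        try {
            return Charset.isSupported(charsetName);
        } catch (IllegalCharsetNameException e) {
            // The name itself is malformed, so it cannot be supported.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the encoding discussed in this thread.
        String name = args.length > 0 ? args[0] : "windows-1256";
        System.out.println(name + " supported: " + isAvailable(name));
    }
}
```

    If this reports that the charset is not supported, the installed Java runtime lacks it and no HTML parser running on that runtime will be able to decode such files.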

  • wise_mike wise_mike modified a comment on discussion Help

    An example file attached. Thanks

  • wise_mike wise_mike posted a comment on discussion Help

    An example file attached

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Hi Wise Mike, Could you please attach the HTML file that isn't doing what you expect? Cheers, Martin

  • wise_mike wise_mike posted a comment on discussion Help

    I tried to use DocFetcher to index some of my local Arabic HTML files, however the indexer didn't work, and I was told that the issue might be related to Jericho HTML Parser: https://sourceforge.net/p/docfetcher/discussion/702424/thread/1b89958212/ Any idea how to fix this issue? Or make the parser read the Arabic text that has this tag: <meta content="text/html; charset=windows-1256" http-equiv="Content-Type"> Thanks,

  • Code Buddy Code Buddy posted a comment on ticket #88

    Thanks Martin!

  • Martin Jericho Martin Jericho posted a comment on ticket #88

    I added a WebBot class for crawling/downloading websites but I want to document it before releasing. There's a fair bit to document so it's taking a while, but I'm plugging away at it in my spare time. Still probably months rather than weeks away though, I don't have a lot of spare time.

  • Code Buddy Code Buddy posted a comment on ticket #88

    Hi Martin - do you have a timeline in place for version 3.5?

  • Tobias Schwarz Tobias Schwarz modified a comment on ticket #88

    Hi Martin! Our tests are not really in a format that we can share and you could easily adapt. Our use case is kind of specialized. With Audisto we operate a service for technical website audits, and most of our internal test cases work in such a way that we look at reproducible results for crawls of an internal test environment. Most of our tests refer to our so-called hints, which often come with additional logic on top of parsing. If you want to create a better test suite I suggest you start looking...

  • Tobias Schwarz Tobias Schwarz posted a comment on ticket #88

    Hi Martin! Our tests are not really in a format that we can share and you could easily adapt. Our use case is kind of specialized. With audisto (https://audisto.com/) we operate a service for technical website audits, and most of our internal test cases work in such a way that we look at reproducible results for crawls of an internal test environment. Most of our tests refer to our so-called hints (https://audisto.com/help/crawler/features/hints/), which often come with additional logic on top of parsing...

  • Martin Jericho Martin Jericho posted a comment on ticket #88

    You have a test suite for parsing? Would you mind sharing that with me? I only have a handful of unit tests at present.

  • Tobias Schwarz Tobias Schwarz posted a comment on ticket #88

    Hi Martin. Looks great. Our test for this is green now. Thank you very much for the fast reaction.

  • Martin Jericho Martin Jericho posted a comment on ticket #88

    Fixed in version 3.5. Until version 3.5 is officially released, the development version is available here: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip

  • Martin Jericho Martin Jericho posted a comment on ticket #88

    Thanks Tobias. Strange that I wrote the parser to interpret the closing slash as an empty element tag when no browsers interpret it that way. Maybe browser behaviour has changed in that respect over the years. There's a fair bit of code and documentation to update to fix this but I'll see if I can get it done tonight.

  • Tobias Schwarz Tobias Schwarz created ticket #88

    HTML5 parsing problems - links without quotes

  • Ritesh Bhambhani Ritesh Bhambhani posted a comment on discussion Open Discussion

    Thanks so much

  • Martin Jericho Martin Jericho posted a comment on discussion Open Discussion

    Your example code didn't make it into the post, but try something like this:

        OutputDocument outputDocument=new OutputDocument(source);
        outputDocument.remove(source.getAllElements(HTMLElementName.STYLE));
        for (StartTag startTag : source.getAllStartTags("style",null)) { // iterate all tags with a style attribute
            outputDocument.remove(startTag.getAttributes().get("style"));
        }
        String newHTML=outputDocument.toString();

  • Ritesh Bhambhani Ritesh Bhambhani posted a comment on discussion Open Discussion

    Hello, I was wondering if I can use Jericho for removing all inline CSS. The need is this: if an HTML document comes with tags that have embedded styles, then all inline styles should be removed before it is sent to any other application that may want to apply its own styles. Ex: Ritesh Bhambhani Output: Ritesh Bhambhani

  • Greg K. Greg K. posted a comment on discussion Open Discussion

    OK, if you feel this way, I deleted the GitHub project. I converted it to IntelliJ so that I could better understand it, fix bugs myself if I find any, trace under the debugger what happens, etc. I tend not to trust something I cannot build myself. And to me an IntelliJ IDEA build is simpler than Windows .bat files. I do, however, respect your wish and I'm grateful for the excellent code you so generously provide. I'm still testing its usefulness for my purpose (splitting very long HTML files for processing...

  • Martin Jericho Martin Jericho posted a comment on discussion Help

    Yes use the Reader constructor for either Source or StreamedSource. For Source you can also pass a String to the constructor that accepts a CharSequence argument. The Source class always generates a String containing the entire document anyway so there's no advantage using the Reader constructor instead.

  • Greg K. Greg K. modified a comment on discussion Help

    Or I guess, I could use StreamedSource(final Reader reader) constructor, where reader is an instance of InputStreamReader, which I can create with InputStreamReader(InputStream in, String charsetName)... Guess will use this method, thanks! Few minutes later: OK, it worked, created my source like: streamedSource = new StreamedSource(new InputStreamReader(new FileInputStream(sourceFileName), "Windows-1251")); and all is fine. Now need to connect to this my ICU encoding detector. Thank you again for...
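
    The key step in the approach above is that the bytes are decoded with an explicitly named charset before any parser sees them. A minimal sketch of that step using only the standard library follows; the class name ExplicitCharsetReaderSketch and its methods are hypothetical, and the resulting Reader is what would be handed to the StreamedSource(Reader) constructor mentioned in the post.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical sketch: decode a byte stream using a caller-supplied charset name.
final class ExplicitCharsetReaderSketch {
    // Wraps the stream in a Reader that decodes with the named charset.
    static Reader open(InputStream in, String charsetName) {
        return new InputStreamReader(in, Charset.forName(charsetName));
    }

    // Convenience for checking the result: decodes a whole byte array to a String.
    static String decode(byte[] bytes, String charsetName) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        try (Reader reader = open(new ByteArrayInputStream(bytes), charsetName)) {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

    The charset name can come from any external detector, such as the ICU-based detection mentioned in this thread, since the parser only ever receives already-decoded characters.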

  • Greg K. Greg K. posted a comment on discussion Help

    Or I guess, I could use StreamedSource(final Reader reader) constructor, where reader is an instance of InputStreamReader, which I can create with InputStreamReader(InputStream in, String charsetName)... Guess will use this method, thanks!

  • Greg K. Greg K. posted a comment on discussion Help

    I have some HTML files without encoding specification, written e.g. in Russian Windows encoding (Windows-1251), where Jericho parser detects cp1252 encoding, so nothing works correctly in processed output. All the Source and StreamedSource constructors where either encoding or EncodingDetector can be specified, as well as setEncoding() method, are either private or package private. My app actually uses ICU library to detect encodings in text and HTML, so I would prefer either to provide encoding...

  • Martin Jericho Martin Jericho posted a comment on discussion Open Discussion

    Hi Greg, Thanks for using the library and for taking an interest in hosting it on github. I'm curious as to why you thought it necessary to put it on github just to use it in an android app. Why not just use the jar file or the public maven repository? http://repo1.maven.org/maven2/net/htmlparser/jericho/jericho-html/3.4/ I haven't done any android development for a while but I believe it's still possible to specify a jar file as a dependency in an android project. If there is no real need for the...

  • Greg K. Greg K. posted a comment on discussion Open Discussion

    Don't know if this will be of any interest to the author of Jericho HTML Parser or anyone else in this community, but considering usage of the Parser in my Android app project, I converted it into IntelliJ Idea project, with the main library build under Maven. Posted it on GitHub, see: https://github.com/gregko/jericho-html-parser If the author of this Parser would like to take over this conversion - improve it and maintain, will happily transfer it or remove my port and let the author post his....

  • Laurie Poon Laurie Poon posted a comment on ticket #87

    Thank you for the quick response. I have added conditions to check for degenerate...

  • Martin Jericho Martin Jericho posted a comment on ticket #87

    The program works fine for me. The likely reason it is "hanging" when you run it...

  • Martin Jericho Martin Jericho modified ticket #87

    getParentElement hangs

  • Laurie Poon Laurie Poon created ticket #87

    getParentElement hangs

  • Code Buddy Code Buddy posted a comment on ticket #86

    Awesome, thanks Martin, much appreciated!
