Hi Ethan, Thank you for the suggestion. Yes I got a request for this already last year: https://sourceforge.net/p/jerichohtml/bugs/93/ The biggest barrier at the moment is the fact that I implemented a new major feature a few years ago (a web crawler API) but it remains poorly documented, and could probably use a couple of minor enhancements before it is officially released. That means all bug fixes since then have just gone into the DEV release: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip...
Hi there! Would it be possible to add a module-info to the jar distributed for jericho? A few of us in the wider community would like to use the jlink/jpackage flow more often and to be able to recommend others do the same. Without a module-info, if Jericho is a dependency (no matter how deep down the tree) then folks need to manually figure out what modules it depends on + configure their build tool to account for that. The technical solution here is pretty straightforward - compile a module-info.java...
Typo in StreamEncodingDetector.isDifinitive
Haha thanks!
Hi Samuel. Thank you for doing all of this. Unfortunately this library is way down my priority list these days. I still use it in heaps of projects, and it still works well, and I still fix an occasional reported bug, but I haven't done an official release for years, and I can't see myself getting to it in the near future. When I do eventually release a new version, I will definitely incorporate your suggestions. Cheers Martin
Attaching an updated build.sh script that will also compile test sources and run unit tests
Sorry for the delay in looking into this. I ended up using standard Java instead of relying on Jericho. However, I did check out the source code and take a look. First off, I was surprised to find it's using Bazaar. If you do intend to migrate to Git, it should be as simple as initialising a Git repo in the same directory and running brz fast-export -b main | git fast-import (assuming the branch is called main). See https://jugmac00.github.io/blog/migrate-a-repository-from-bazaar-to-git/ for details. As...
P.S. When you want to include HTML in your post, you need to enclose it in a code block, otherwise the HTML is parsed and doesn't show properly. For example, your sample document should look like this: <html> <head> <meta http-equiv="Content-Type" content="html; charset=UTF-8"> </head> </html>
Hi Davy, The sample HTML you are feeding it doesn't specify a valid encoding and is therefore parsed correctly. Because the quotes are encoded, they are included in the value of the content attribute, which is why the end quote is interpreted as part of the encoding name. You say that the sample content occurs when it is "inserted into an iframe". I assume you mean it appears as the value of the iframe srcdoc attribute. In that case, your sample document should be the HTML containing the iframe,...
Hello, The parser seems to parse certain HTML incorrectly. Example: <meta http-equiv=""Content-Type"" content=""html;" charset="UTF-8""> This kind of HTML occurs when a whole web page is inserted into an iframe, so there is an HTML document inside of an HTML document as a string and all quote characters are converted into their HTML counterparts. For the HTML above, the parser assumes that the encoding is UTF-8" (with a double quote at the end). An example of a correctly parsed meta tag:...
It's possible to have multi release jars but it's not trivial. You'd need to compile the existing source with JDK 7 and use JDK 9 or above to compile the module-info.java and bundle it into the same jar. See here for some info about multi-release jars: https://nipafx.dev/multi-release-jars-multiple-java-versions/ I don't really get the desire to support JDK 7 these days, or anything below Java 11 tbh, but if it's needed perhaps it's worth continuing a 3.* branch for JDK 7/8 support and starting a...
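For anyone following along, a module descriptor for the library might look roughly like the sketch below. The module name net.htmlparser.jericho is an assumption taken from the library's package name, and any requires clauses (for example for logging libraries the source references) are deliberately left out because they depend on the actual build. The descriptor would be compiled separately with JDK 9 or above and bundled into the jar alongside the classes compiled for the older target.

```java
// module-info.java (hypothetical sketch, not the project's actual descriptor)
// The module name below is an assumption derived from the package name.
module net.htmlparser.jericho {
    // Make the library's public API package visible to consumers.
    exports net.htmlparser.jericho;
}
```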
Hi Martin, Thanks a lot for patching this. I now get expected behaviour! Kind regards, Remi
Hi Remi, I didn't document anywhere why I made the decision to remove the content of button elements. In general I was copying the behaviour of how some email clients create pure text versions of HTML emails. Maybe I just thought they should be removed because all other form elements (INPUT, TEXTAREA etc) are removed. Or maybe I just didn't think much about it! I've modified the Renderer class in version 3.5 to include the content of BUTTON elements. Until version 3.5 is officially released, the development...
Hi, I've noticed that the Jericho Renderer doesn't include Button elements in its toString(). This is presumably because button is mapped to a RemoveElementHandler in Renderer. I would be interested to hear the rationale behind this, but more importantly, is there a way to override this behaviour on my end? You can reproduce with something as simple as: <html><body><button>My Button</button></body></html> Which will result in an empty string. Many thanks
Thanks. I get it now :)
No you don't need to create multiple copies of the Source object. The important thing is just that you find the server tags and call source.ignoreWhenParsing() on them before searching for the HTML tags that contain them. This implies that you need to call it before a full sequential parse is performed, but that may never even be called if you don't call any of the methods getAllTags(), getAllStartTags(), getAllElements(), getChildElements(), iterator() or Segment.getNodeIterator() method on the...
Thanks for the quick response! :) I can avoid the practice of using xml-style server tags inside HTML tags in my own code, but I am running into this difficulty when processing legacy code. Can I ask some follow-up questions? 1. I would be happy to ignore all tags starting with <nested: or <logic: and I'd prefer to use Jericho to find them, rather than write my own parser to find them, but I am confused by the instruction that ignoreWhenParsing() must be called before fullSequentialParse() occurs....
please support java modules in the next release
Hi Samael, Thanks for raising this issue. Sorry I'm not very familiar with the java modules system so I'd appreciate if you could give me some advice and assistance. I'm still targeting java 1.7 to maximise compatibility with older programs. In order to support modules I'd have to compile targeting java 9, right? I'm not sure whether that would mean the library doesn't work with projects targeting Java 8, which I believe is still very common. Do you know whether that would be an issue? Have you already...
Hi Andrew, The release.txt file does mention "minor changes to Renderer behaviour" for version 3.5. The new behaviour is more consistent with browser behaviour so it is most likely an intended change. Cheers Martin
Hi Chris, The situation you've encountered is an unavoidable consequence of using xml-style server tags inside HTML tags, and the reason I consider it to be bad practice. It is not possible for a parser to know which tags are server tags unless you tell it, so the solution you proposed does not work in all cases. If you can't avoid the use of xml-style server tags in your HTML, the way you tell the parser which tags are server tags is to explicitly search for them first. For each server tag you find,...
Incorrectly parsing attribute values containing tags containing quotes
Hi Martin, Thanks for responding so quickly. Since my last message, I've been trying out 3.5-dev as I was hoping to take advantage of the memory consumption improvements, but have come across a behaviour difference for the Renderer between 3.4 and 3.5. For example, <p>Hello</p><p><br></p><p>There</p> used to output Hello\r\n\r\nThere But now in 3.5-dev it outputs Hello\r\n\r\n\r\nThere Is this an expected behaviour change? I have attached a screenshot of the Renderer configured
Hi Andrew. Version 3.5 hasn't been officially released yet because the newest feature, a web crawler API, has not been fully documented yet. The project is not dead, and minor improvements continue to make their way into the DEV version, but other time commitments have prevented the completion of the documentation and an official release for years. The 3.5-dev version (http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip) is always a release candidate and can be used as a reliable substitute...
Hello, Are there any plans to release 3.5 onto Maven? I've seen the version being developed mentioned on here a few times
Renderer Always Adds Brackets Around Links
Hi Ryan, You can customise this by overriding the renderHyperlinkURL method. An example is provided in the documentation: http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/Renderer.html#renderHyperlinkURL(net.htmlparser.jericho.StartTag) If you only want to remove the brackets if the URL is the same as the element contents, you can check for startTag.getAttributeValue("href").equals(startTag.getElement().getTextExtractor().toString()) or if you want it a bit more efficient and disregard...
Thanks for clearing this up, Martin!
Query parameter names in hyperlinks being incorrectly decoded
I just confirmed that character references are decoded inside link elements, at least in Chrome. But the character reference in your example is not terminated, meaning it is missing the final semicolon. When this parser was first written, most browsers still decoded unterminated character references, but each browser behaved differently. So I created a CompatibilityMode class to encapsulate the decoding behaviour of unterminated character references. To configure the parser not to decode any unterminated...
Hi Remi, I'm not aware that HTML character references shouldn't be decoded inside link elements. Do you have a source for that? Cheers Martin
Ok thanks, I understand and agree with your point. I'll give some thought to how to handle this scenario with customers that have this issue, and decide whether to fix it silently for them by customising the Renderer class or to leave it as is, since that is what an unstyled representation of their page would look like. Thanks for your help Danny
The HTML should reflect the fact that these links are on separate lines. CSS should not be used to change the meaning of the content, only to style it. Correct HTML would wrap these links in some sort of block element such as li or div, or ideally the HTML5 menuitem element. It's an important aspect of making the web content accessible.
Hi Martin, The customer who reported this issue gave this web page as an example that shows it: https://leadsforward.com/generating-the-best-solar-leads-before-the-end-of-2019/ The Renderer class joins the words "Do" and "Lead", which are on different lines in the visible page. I've attached a screenshot that shows the visible page and the source code below. Thanks Danny
Hi Daniel, In your example you're using relative URLs (foo.com), which the default Renderer class doesn't render at all. This is documented in the Renderer.renderHyperlinkURL(String) method. Because the URL isn't rendered, the content inside the hyperlink is just rendered as normal text. Configuration options such as setHyperlinkContentDelimiters only affect the output when the URL is rendered. I think your only option is going to be to create a copy of the Renderer class and customise the A_ElementHandler,...
Hi Martin, Thanks for your response. The renderer.setHyperlinkContentDelimiters(null," ") method would work for us. However, it doesn't seem to work for me. We have Jericho-html-3.5-dev-3. The following code, which includes the setHyperlinkContentDelimiters call above, still concatenates the words "some" and "text": public void testIssue() { final String html = "\n" + " \n" + " sometext\n" + " \n" + ""; final Source source = new Source(html); final Renderer renderer = new Renderer(source); renderer.setHyperlinkContentDelimiters(null,...
Hi Daniel, It is very uncommon for CSS to style a hyperlink to add whitespace around it. Two consecutive A elements without whitespace in between will almost always render as a single word in a browser. Therefore the behaviour of the Renderer class is correct and your end user probably does want to know about the lack of a space between the two words when running the spell checker in your example above. Note that the current release version 3.4 contains a bug relating to the rendering of hyperlinks....
Renderer class joins words in consecutive anchor tags text
Thanks for the fix. Danny
Renderer class picks out content from within a script tag
Thanks for the bug report! Fixed in version 3.5. Although the parser was already designed to ignore other tags inside SCRIPT elements, there was a bug triggered by the presence of server tags inside the script element. In your example it was the <%- data.price %> tag causing the problem. Until version 3.5 is officially released, the development version is available here: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip Although it has been 5 years since the last official release, version...
ArrayIndexOutOfBoundsException from Renderer: negative left margin.
HTML5 parsing problems - links without quotes
Thanks Martin, Dev release looks to do the job! I can use this until 3.5 is officially released.
Thanks for the bug report! Fixed in version 3.5. Negative margins and padding are now treated as zero margin. Until version 3.5 is officially released, the development version is available here: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip Although it has been almost 5 years since the last official release, version 3.5 includes a new major feature that requires significant time to document, and I don't envisage having spare time in the foreseeable future. So the official 3.5 release...
I am developing a text editor that tokenizes text containing HTML. I then work with the result, searching for references and so on. Basically it is a kind of advanced full-text filtering, and I need valid HTML for it. It is quite hard to explain. I also need a library that does not rely on third parties (an API, the "internet" ...). Your library has great reviews on the internet. The HTML can be simple, like the example I used, or it can be some "hell" code. Thanks for the answer, I will try it.
If you only need to check for matching start and end tags it's easy. Create a stack, iterate through each tag, if it's a start tag add it to the stack, if it's an end tag check that the tag at the top of the stack has the same name. If not, check whether the start tag has an optional end tag. But there are many more potential problems with HTML than mismatched tags, and instead of writing your own code to detect and fix problems, why not use an existing library?
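The stack approach described above can be sketched in plain Java roughly as follows. This is only an illustration under simplifying assumptions: it finds tags with a naive regular expression instead of Jericho's tag iterator, it ignores comments, CDATA and server tags, and the class name TagBalanceChecker and the lists of void and optional-end-tag elements are hypothetical and incomplete.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jericho: reports unmatched start and end tags.
final class TagBalanceChecker {
    // Naive tag matcher; comments, CDATA sections and server tags are not handled.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    // Elements that never have an end tag (partial list, assumption).
    private static final Set<String> VOID = Set.of(
            "area", "base", "br", "col", "embed", "hr", "img", "input", "link", "meta", "source", "track", "wbr");

    // Elements whose end tag may be omitted (partial list, assumption).
    private static final Set<String> OPTIONAL_END = Set.of(
            "li", "p", "td", "th", "tr", "dd", "dt", "option", "tbody", "thead", "tfoot");

    static List<String> findProblems(String html) {
        List<String> problems = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>(); // stack of currently open element names
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            boolean isEndTag = !m.group(1).isEmpty();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (!isEndTag) {
                if (!VOID.contains(name)) open.push(name);
                continue;
            }
            if (!open.contains(name)) {
                problems.add("End tag </" + name + "> at position " + m.start() + " has no start tag");
                continue;
            }
            // Implicitly close elements with optional end tags that sit above the match.
            while (!open.peek().equals(name) && OPTIONAL_END.contains(open.peek())) {
                open.pop();
            }
            if (open.peek().equals(name)) {
                open.pop();
            } else {
                problems.add("End tag </" + name + "> at position " + m.start()
                        + " does not match open <" + open.peek() + ">");
            }
        }
        // Anything left open at the end is a problem unless its end tag is optional.
        for (String name : open) {
            if (!OPTIONAL_END.contains(name)) problems.add("Start tag <" + name + "> is never closed");
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(findProblems("<html><body> test </test> </body></html>"));
    }
}
```

Even with this in place, the point above still stands: mismatched tags are only one of many things that can be wrong with HTML, so a dedicated library is likely the better route for real validation. </body></html>").size() != 1) throw new AssertionError("expected one unmatched end tag");
if (!TagBalanceChecker.findProblems("<ul><li>a<li>b</ul>").isEmpty()) throw new AssertionError("optional end tags should be accepted");
if (!TagBalanceChecker.findProblems("<p>a<br>b</p>").isEmpty()) throw new AssertionError("void elements should be ignored");
if (TagBalanceChecker.findProblems("<div><span>x</div>").size() != 2) throw new AssertionError("expected mismatch plus unclosed elements");
</test>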
I need validation. When I use it, all the checks are OK, and this is the only "fail" I have. I have my own custom parser for HTML, but I need something for validation, so I use Jericho. An end tag without a start tag is a big issue and crashes my app. Sorry if my English is bad.
Hi Jiří, The Source Formatter is a tool for formatting valid HTML with indentation, not for fixing broken HTML. If you need to fix broken HTML you'll need to try a library that specialises in that task, such as HtmlCleaner https://sourceforge.net/projects/htmlcleaner/ (I haven't used it myself, but it's under active development) Cheers, Martin
I need to format my HTML. I tested an example with http://jerichohtmlparser.appspot.com/samples/FormatSource.jsp and found a small problem: the formatter skips an end tag that has no start tag. Example: <html><body> test </test> </body></html> and the formatter produces this: <html> <body> test </test> </body> Where is the start tag for the tag *test*????? </html> Is there a setting that needs to be turned on for this? This is a big issue for me. I also tested this in code with version 3.4 and got the same result.
Hi Wise Mike, I'm not sure why it's not working for you, but I suspect it will work if you download the Java JDK instead of the Java JRE. If that doesn't work, download the latest development version of the parser from the link below: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip Then open a command prompt in the folder samples/console and enter the following command: Encoding.bat arabic-test-file.html > out.txt (replace arabic-test-file.html with the full pathname of your arabic HTML...
Hi Martin, Thanks for the prompt reply. I tried adding the file and restarting, but nothing changed. I also tried to re-download Java (I currently have Version 8 Update 211 (build 1.8.0_211-b12)), but I didn't find an option during setup or install to choose the "international version", nor could I find a download link for it. Any suggestions? Thanks,
Hi Wise Mike, According to this page: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html the problem is that you have a "European languages" version of Java installed on your computer, which doesn't support the Windows-1256 encoding. This seems to be the default if the Java installer "recognizes that the host operating system only supports European languages". To fix the problem, you could either install the international version (apparently no need to download again, just...
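A quick way to confirm whether a given Java installation supports the encoding is to ask the runtime directly. The snippet below is a small sketch using only the standard java.nio.charset API; the class name CharsetSupportCheck is made up for illustration, and the result depends entirely on the Java runtime it is run with.

```java
import java.nio.charset.Charset;

// Hypothetical helper: reports whether the running JVM supports a named charset.
final class CharsetSupportCheck {
    static boolean isSupported(String charsetName) {
        try {
            return Charset.isSupported(charsetName);
        } catch (IllegalArgumentException e) {
            // Thrown for null or syntactically illegal charset names.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the encoding discussed in this thread.
        String name = args.length > 0 ? args[0] : "windows-1256";
        System.out.println(name + " supported: " + isSupported(name));
    }
}
```

If it reports that windows-1256 is not supported, that points to the Java installation rather than the parser.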
An example file attached. Thanks
Hi Wise Mike, Could you please attach the HTML file that isn't doing what you expect? Cheers, Martin
I tried to use DocFetcher to index some of my local Arabic HTML files. However, the indexer didn't work, and I was told that the issue might be related to Jericho HTML Parser: https://sourceforge.net/p/docfetcher/discussion/702424/thread/1b89958212/ Any idea how to fix this issue? Or make the parser read the Arabic text that has this tag: <meta content="text/html; charset=windows-1256" http-equiv="Content-Type"> Thanks,
Thanks Martin!
I added a WebBot class for crawling/downloading websites but I want to document it before releasing. There's a fair bit to document so it's taking a while, but I'm plugging away at it in my spare time. Still probably months rather than weeks away though, I don't have a lot of spare time.
Hi Martin - do you have a timeline in place for version 3.5?
Hi Martin! Our tests are not really in a format that we can share or that you could easily adapt. Our use case is kind of specialized. With Audisto we operate a service for technical website audits, and most of our internal test cases work by checking for reproducible results from crawls of an internal test environment. Most of our tests refer to our so-called hints, which often come with additional logic on top of parsing. If you want to create a better test suite I suggest you start looking...
You have a test suite for parsing? Would you mind sharing that with me? I only have a handful of unit tests at present.
Hi Martin. Looks great. Our test for this is green now. Thank you very much for the fast reaction.
Fixed in version 3.5. Until version 3.5 is officially released, the development version is available here: http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip
Thanks Tobias. Strange that I wrote the parser to interpret the closing slash as an empty element tag when no browsers interpret it that way. Maybe browser behaviour has changed in that respect over the years. There's a fair bit of code and documentation to update to fix this but I'll see if I can get it done tonight.
Thanks so much
Your example code didn't make it into the post, but try something like this: OutputDocument outputDocument=new OutputDocument(source); outputDocument.remove(source.getAllElements(HTMLElementName.STYLE)); for (StartTag startTag : source.getAllStartTags("style",null)) { // iterate all tags with a style attribute outputDocument.remove(startTag.getAttributes().get("style")); } String newHTML=outputDocument.toString();
Hello, I was wondering if I can use Jericho to remove all inline CSS. The requirement is that if some HTML arrives containing a tag with embedded styles, all inline styles should be removed before it is sent on to any other application that may want to apply its own styles. Ex: Ritesh Bhambhani Output: Ritesh Bhambhani
OK, if you feel this way, I have deleted the GitHub project. I converted it to IntelliJ so that I could better understand it, fix bugs myself if I find any, trace what happens under the debugger etc. I tend not to trust something I cannot build myself. And to me an IntelliJ IDEA build is simpler than Windows .bat files. I do, however, respect your wish and I'm grateful for the excellent code you so generously provide. I'm still testing its usefulness for my purpose (splitting very long HTML files for processing...
Yes use the Reader constructor for either Source or StreamedSource. For Source you can also pass a String to the constructor that accepts a CharSequence argument. The Source class always generates a String containing the entire document anyway so there's no advantage using the Reader constructor instead.
Or I guess, I could use StreamedSource(final Reader reader) constructor, where reader is an instance of InputStreamReader, which I can create with InputStreamReader(InputStream in, String charsetName)... Guess will use this method, thanks! Few minutes later: OK, it worked, created my source like: streamedSource = new StreamedSource(new InputStreamReader(new FileInputStream(sourceFileName), "Windows-1251")); and all is fine. Now need to connect to this my ICU encoding detector. Thank you again for...
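For readers with the same problem, the Reader-based approach can be sketched with the standard library alone. The helper below is hypothetical (the class name ExplicitCharsetReader is made up) and it leaves Jericho out so that it runs on its own; in the real application the Reader returned by open() would be passed to the StreamedSource(Reader) constructor mentioned above, and the charset name would come from the external detector.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper: decodes a byte stream with an explicitly chosen charset.
final class ExplicitCharsetReader {
    // Wraps the stream in a Reader that uses the given charset instead of auto-detection.
    static Reader open(InputStream in, String charsetName) {
        return new InputStreamReader(in, Charset.forName(charsetName));
    }

    // Reads everything from the reader into a String and closes it.
    static String readAll(Reader reader) {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        try (Reader r = reader) {
            int count;
            while ((count = r.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Sample Russian text written with Unicode escapes, encoded as Windows-1251 bytes.
        String original = "<p>\u041f\u0440\u0438\u0432\u0435\u0442</p>";
        byte[] bytes = original.getBytes(Charset.forName("windows-1251"));
        String decoded = readAll(open(new ByteArrayInputStream(bytes), "windows-1251"));
        System.out.println(decoded.equals(original));
    }
}
```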
I have some HTML files without encoding specification, written e.g. in Russian Windows encoding (Windows-1251), where Jericho parser detects cp1252 encoding, so nothing works correctly in processed output. All the Source and StreamedSource constructors where either encoding or EncodingDetector can be specified, as well as setEncoding() method, are either private or package private. My app actually uses ICU library to detect encodings in text and HTML, so I would prefer either to provide encoding...
Hi Greg, Thanks for using the library and for taking an interest in hosting it on github. I'm curious as to why you thought it necessary to put it on github just to use it in an android app. Why not just use the jar file or the public maven repository? http://repo1.maven.org/maven2/net/htmlparser/jericho/jericho-html/3.4/ I haven't done any android development for a while but I believe it's still possible to specify a jar file as a dependency in an android project. If there is no real need for the...
Don't know if this will be of any interest to the author of Jericho HTML Parser or anyone else in this community, but considering usage of the Parser in my Android app project, I converted it into IntelliJ Idea project, with the main library build under Maven. Posted it on GitHub, see: https://github.com/gregko/jericho-html-parser If the author of this Parser would like to take over this conversion - improve it and maintain, will happily transfer it or remove my port and let the author post his....
Thank you for the quick response. I have added conditions to check for degenerate...
The program works fine for me. The likely reason it is "hanging" when you run it...
getParentElement hangs
Awesome, thanks Martin, much appreciated!