Hi Simon, I think it's a good time to talk about the future of the library in general. I'm aware that I don't have a lot of time available to maintain it, and these days I'm doing 99% Python development in my day job, so each time I have to review a patch my first task is to get a Java setup working again! Certainly moving the project over to Github is something that's been on my mind for a while as there are clear advantages there. I think in general I'd be happier if people who depend on HtmlCleaner...
Hello, I recently started trying to contribute directly to the code of HtmlCleaner to fix a bug I reported, and I discovered that it wasn't that easy to contribute and get feedback on SourceForge. At least, not as easy as it can be on GitHub (or GitLab). And I'm wondering whether moving the project could help attract more contributions and get bugs fixed more quickly. To give a bit of context, I'm one of the core committers of XWiki (https://www.xwiki.org), whose code is available on GitHub (https://github.com/xwiki)....
I forgot to mention, this is with HTML 5 tag definitions, with HTML 4 the example input is kept as-is.
Wrong children of dl incorrectly wrapped in div
Behaviour on unknown tags depends on capitalization of letters
I tried to work on that issue, and I think I actually made too many changes. In particular, I saw that XmlSerializer#dontEscape is used both to decide whether the content needs to be escaped and to decide whether CDATA should be added, which is a problem here because we still don't want to escape the content even without a CDATA. I think the same problem might apply to DomSerializer, in which case my code is probably wrong and I may be missing a unit test somewhere.
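One way to picture the problem described in this comment is to treat "skip escaping" and "wrap in CDATA" as two separate decisions instead of one flag. The sketch below is purely illustrative: the class and method names are hypothetical, the rule for application/json is an assumption based on the ticket title, and none of this is HtmlCleaner's actual serializer code.

```java
// Hypothetical sketch: two independent decisions instead of a single dontEscape-style flag.
class ScriptSerializationDemo {

    // Should the element's text be written verbatim, without entity escaping?
    static boolean shouldSkipEscaping(String tagName) {
        return "script".equalsIgnoreCase(tagName) || "style".equalsIgnoreCase(tagName);
    }

    // Should the verbatim text additionally be wrapped in a CDATA section?
    // Assumed rule: data blocks such as application/json are left without CDATA.
    static boolean shouldWrapInCdata(String tagName, String typeAttribute) {
        if (!shouldSkipEscaping(tagName)) {
            return false;
        }
        return typeAttribute == null
                || !"application/json".equalsIgnoreCase(typeAttribute.trim());
    }

    public static void main(String[] args) {
        // A JSON script is still written unescaped, but gets no CDATA wrapper.
        System.out.println(shouldSkipEscaping("script"));
        System.out.println(shouldWrapInCdata("script", "application/json"));
    }
}
```

With the two questions separated, a JSON script block can stay unescaped while the CDATA wrapper is dropped, which is the combination a single flag cannot express.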
CDATA added for any kind of scripts even for application/json ones
Various tags incorrectly not marked as phrasing content in HTML5
@scottwilson it seems you forgot to close that one: I can see a commit related to it before release 2.28, see https://sourceforge.net/p/htmlcleaner/code/595/
Sure, let me see what I can do.
PrettyHtmlSerialiser drops trailing space in tags
Duplicate of 162.
Building fails on JDK < 8 because of -Xdoclint:none flag to Maven Javadoc plugin
As Java 7 is no longer supported I guess this is no longer so important.
It's definitely something I keep coming back to, though usually there's something more pressing! If you wanted to submit a patch upgrading some of the more problematic classes that would be fantastic. I suspect most users will be directly interacting with TagNode the most, or possibly the serialisers, so those would probably be the first to apply updated patterns to.
Thank you for your quick response!
To modify my original suggestion, it's probably best to just return the object wrapped via an appropriate Collections method, e.g., Collections.unmodifiableList.
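A minimal sketch of that suggestion, using a hypothetical `Node` class as a stand-in (it is not HtmlCleaner's `TagNode`, and the method names are assumptions): the getter returns a read-only view created with `Collections.unmodifiableList`, so callers cannot alter the internal list directly.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class UnmodifiableViewDemo {
    public static void main(String[] args) {
        Node root = new Node("div");
        root.addChild(new Node("p"));

        List<Node> view = root.getChildren();
        try {
            view.add(new Node("span")); // rejected: the view is read-only
        } catch (UnsupportedOperationException e) {
            System.out.println("children cannot be modified through the getter");
        }

        // The view is backed by the internal list, so it reflects later changes.
        root.addChild(new Node("span"));
        System.out.println(view.size());
    }
}

// Hypothetical stand-in for a tree node; not HtmlCleaner's actual TagNode.
class Node {
    private final String name;
    private final List<Node> children = new ArrayList<>();

    Node(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    // Mutation goes through the owning object.
    void addChild(Node child) {
        children.add(child);
    }

    // Callers only receive a read-only view of the internal list.
    List<Node> getChildren() {
        return Collections.unmodifiableList(children);
    }
}
```

The wrapper does not copy the list, so it is cheap, but existing callers that relied on mutating the returned list would start getting `UnsupportedOperationException`.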
customizing javadocExecutable in pom.xml breaks the build
Fixed in 2.29
This has a fix for CVE-2023-34624: Stack overflow with excessive nested tags. Also a fix for bug 237: customizing javadocExecutable in pom.xml breaks the build. Thanks to niol, PoppingSnack, and Ralf Purnhagen for bug reports and contributions for this release! Note the addition of a maxDepth cleaner property that defines the maximum nested tag depth. The default is 1000.
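To illustrate the general idea of a nesting limit, here is a rough sketch that tracks open elements on an explicit stack and refuses to go deeper than a configured maximum. Only the default of 1000 comes from the release note; the class, method, and simplified token format are hypothetical, and this is not how HtmlCleaner itself implements maxDepth.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class NestingLimitDemo {
    // Default taken from the release note; everything else here is illustrative.
    static final int DEFAULT_MAX_DEPTH = 1000;

    // Returns true if the tag sequence never nests deeper than maxDepth.
    // Tokens are simplified: "<name>" opens an element, "</name>" closes the innermost one.
    static boolean withinDepth(List<String> tokens, int maxDepth) {
        Deque<String> open = new ArrayDeque<>(); // explicit stack instead of recursion
        for (String token : tokens) {
            if (token.startsWith("</")) {
                if (!open.isEmpty()) {
                    open.pop();
                }
            } else if (token.startsWith("<")) {
                if (open.size() >= maxDepth) {
                    return false; // stop here rather than nest any deeper
                }
                open.push(token);
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(withinDepth(Arrays.asList("<div>", "<p>", "</p>", "</div>"), 2));
        System.out.println(withinDepth(Arrays.asList("<div>", "<div>", "<div>"), 2));
    }
}
```

Bounding the depth up front means deeply nested input is rejected or truncated predictably instead of exhausting the call stack in later recursive processing.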
OK, fixed. There's a new release, 2.29, that implements an arbitrary maximum nesting depth. It's configurable via cleaner properties.
[maven-release-plugin] prepare for next development iteration
[maven-release-plugin] copy for tag htmlcleaner-2.29
[maven-release-plugin] prepare release htmlcleaner-2.29
fake commit
fix: Removed JavaDoc extension code - see bug #237. Thanks to niol for the report.
fix: Updated POM version in GUI
fix: Implemented a nesting limit to address CVE-2023-34624
Hi Ralf, Slightly puzzling that it points to the Github fork from 10 years ago while referring to v2.28! I'll take a look at it.
A couple of days ago CVE-2023-34624 was published. Do you plan to release a fix for this CVE in the next version of HtmlCleaner? Best regards, Ralf
customizing javadocExecutable in pom.xml breaks the build
Thanks - I'll remove it and re-release
customizing javadocExecutable in pom.xml breaks the build
That was weird, I definitely uploaded them at the time! Oh well, I've redone the files.
Hi, I cannot find the release zip following the links in the download page. Am I missing something?
"<b>Hello</b>" becomes " **Hello** " using PrettyHtmlSerialiser
added whitespace when using tags
🎉 I just realised it's been nearly 10 years since I took over the reins of HtmlCleaner, making my first release as the 'new' maintainer back in May 2013. The project itself was started by Vladimir Nikic back in 2006! I've not always been as responsive as I'd like to be over the years as I've had a lot of other things going on, and I've more or less given up on the idea of doing a major architecture overhaul and rewrite for 'v3.0', but sometimes slow change is good. I'm also still using HC as a command...
April 29, 2023: HtmlCleaner release 2.28
228: svg incorrectly not marked as phrasing content in HTML5
229: style-tag should not be allowed in body in HTML5
230: Div element wrongly filtered out from dl children when using HTML 5
231: SVG moved after <p> elements
Thanks to jlacour31, Simon Urli and Michael Hamann for bug reports and contributions for this release!
[maven-release-plugin] prepare for next development iteration
[maven-release-plugin] copy for tag htmlcleaner-2.28
[maven-release-plugin] prepare release htmlcleaner-2.28
doc: completed missing javadoc comments
fix: remove accidental import!
Fix: Added newline to manifest.
Fix: Implemented test for bug 230. Thanks to Simon Urli for the report.
Fixed; will be in 2.28 release.
Fixed and will be in v2.28 release.
228 is now fixed and will be in release version 2.28. Odd coincidence!
Fix: Implemented fix for bug 228, where SVG was not marked as phrasing content. Thanks to Michael Hamann for the report.
Wrong parsing
Div element wrongly filtered out from dl children when using HTML 5
SVG moved after <p> elements
Fixed; will be in 2.28 release
Fix: Implemented fix for bug 229, removing STYLE tags from the document body in HTML5 as per WHATWG. Thanks to Michael Hamann for the report.
Fix: Implemented fix for bug 234. Thanks to Michael Hamann for the report and fix.
Patches to build on openjdk 15
[Patch] CVE-2021-33813 vulnerability update
Building fails on JDK < 8 because of -Xdoclint:none flag to Maven Javadoc plugin
style-tag should not be allowed in body in HTML5
svg incorrectly not marked as phrasing content in HTML5
Wrong parsing
Support for modern JDK?
Hi both! Always happy to help. I'll take a look at these next week. If it's just a case of tweaking the tag provider and going through unit tests that shouldn't take much work. I was away from the project for a while (lots happening in the day job) and then found it hard to get back into it again as new versions of Eclipse were making things rather difficult. Since I moved over to IntelliJ I'm feeling a lot more productive, and that I can now dip back into HC and start fixing things again with a...
Similar bugs also affect the math, embed, img, data, object, picture, video, iframe, and q tags. All of them are phrasing content in the current HTML standard but not allowed as phrasing content in HtmlCleaner (at least not as children of tags like strong). See https://github.com/xwiki/xwiki-commons/blob/ce23e117d1cd1515250855eab9bcd7226e66a72f/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/XWikiHTML5TagProvider.java for how we currently modify the HTML 5 tag definitions...
Hi @scottwilson. Long time no speak! How are you? We have several issues reported quite a while ago, like this one (also reported at https://sourceforge.net/p/htmlcleaner/bugs/231/) or https://sourceforge.net/p/htmlcleaner/bugs/230/ and are wondering if we could expect some fixes. Anything we could do to help out? Thank you very much, you've always been very helpful to the XWiki project. -Vincent
The open and close tags, rather than self-closing tags, are how HTML is typically serialised. If you want to omit the XML tag, use props.setOmitXmlDeclaration(true);
No, I got something like this:
<p> Text for text, <custom><custom/>, <another><another/> 1>0 5<10 </p>
But I fixed this using the following code:
HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();
props.setTreatUnknownTagsAsContent(true);
props.setNamespacesAware(false);
props.setOmitHtmlEnvelope(true);
props.setOmitXmlDeclaration(false);
TagNode node = cleaner.clean(input);
String result = new CompactHtmlSerializer(props).getAsString(node);
return result.replaceAll("<\\?xml.*\\?>",...
Unless I'm missing something (Sourceforge does odd things with formatting), I think that's a reasonable interpretation of the snippet. I presume what you get is something like this? <p> Text for text, <custom/>, <another/> 1>0 5<10 </p>
I have some input like this: <p> Text for text, <custom>, <another/> 1>0 5<10 </p> But it only replaces the brackets for "1>0 5<10" and creates <custom> with a closing slash.
Thanks again, @scottwilson! Yes, I can confirm that the build works fine now.
That'll teach me to read my own release docs - I'd constructed the release hierarchy incorrectly. I've been meaning to write a proper release script for some time, maybe this will give me the incentive to actually do it. Hopefully you can get a package build from src now.
Hmm, when I do a clean package from the project root all is well, but not from the releases/src generated from it. I'll investigate further.
Thanks for the quick response! Unfortunately now it's org.htmlcleaner.CommandLine that's missing. I inspected the jarfile and noticed that no classes are present. Any ideas? I noticed some warnings in the build which may be useful: [INFO] Scanning for projects... [WARNING] The project net.sourceforge.htmlcleaner:htmlcleaner:bundle:2.27 uses prerequisites which is only intended for maven-plugin projects but not for non maven-plugin projects. For such purposes you should use the maven-enforcer-plugin....
Added location of sonatype repo to release docs so I don't have to keep googling it
Update version of gui
Update license dates
OK, give it another try. It's the first time I've created a release for a long time, so I was bound to do something wrong :D
I'll fix that now!
Argh, that's entirely my fault. I included the wrong MANIFEST.MF when building the zips
That is very strange!
Hi @scottwilson, we've encountered a test failure in https://github.com/Homebrew/homebrew-core/pull/126548 . Below is the output: ==> /opt/homebrew/Cellar/htmlcleaner/2.27/bin/htmlcleaner src=/private/tmp/htmlcleaner-test-20230324-62611-2muxl1/index.html Picked up _JAVA_OPTIONS: -Duser.home=/Users/brew/Library/Caches/Homebrew/java_cache -Djava.io.tmpdir=/private/tmp Error: Could not find or load main class org.htmlcleaner.GUI Caused by: java.lang.ClassNotFoundException: org.htmlcleaner.GUI We are building...
Thanks, @scottwilson! I opened https://github.com/Homebrew/homebrew-core/pull/126548. I'll let you know how it goes.
I've updated the POM to target 1.8 and made a release (2.27). Hopefully that solves the problem!