Hi,
we recently discovered a difference in how HtmlCleaner handles a div inside a dl element, depending on whether it is configured for HTML 4 or HTML 5.
Basically this kind of hierarchy:
<dl>
  <div>
    <dt>Foo</dt>
    <dd>Bar</dd>
  </div>
</dl>
is not handled the same way in both versions. With HTML 4, HtmlCleaner keeps the hierarchy; with HTML 5, it filters out the div element, even though div is accepted inside a dl (reference: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/dl).
Here's a unit test snippet:
@Test
public void cleanDl() throws Exception
{
    CleanerProperties cleanerProperties = new CleanerProperties();
    cleanerProperties.setHtmlVersion(5);
    HtmlCleaner cleaner = new HtmlCleaner(cleanerProperties);
    TagNode tagNode = cleaner.clean("<?xml version=\"1.0\" encoding=\"UTF-8\"?><!DOCTYPE html>\n"
        + "<html><body>"
        + "<dl>"
        + "<div>"
        + "<dt>Foo</dt>"
        + "<dd>bar</dd>"
        + "</div>"
        + "</dl>"
        + "</body></html>");
    // The dl itself must survive cleaning.
    List<? extends TagNode> dlList = tagNode.getElementListByName("dl", true);
    assertEquals(1, dlList.size());
    // The div wrapper inside the dl should survive as well.
    List<? extends TagNode> divList = dlList.get(0).getElementListByName("div", true);
    // This assertion fails when the HTML version is set to 5.
    assertEquals(1, divList.size());
}
@scottwilson sorry, I attached this to the wrong version, and it looks like I cannot edit the ticket. I hit the bug with HtmlCleaner 2.24.
Thanks Simon, I'll take a look. I think the spec version I used only had dt/dd as permitted elements, but reading the current WHATWG spec, div is permitted as well:
https://html.spec.whatwg.org/multipage/grouping-content.html#the-dl-element
I'll add DIV to the list of permitted elements for dl in the HC metadata and see if we can pass your test case.
Fixed; this will be in the 2.28 release.
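For anyone reading this later, here is a rough sketch of the idea behind the fix described above: a per-tag table of permitted children, where a child that is not listed for its parent gets filtered out. This is not HtmlCleaner's actual metadata code. The class DlContentModelSketch, the methods permittedChildren and filterChildren, and the allowDivInDl flag are all made up for illustration, and the real cleaner may relocate or unwrap a disallowed element rather than simply dropping it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names come from HtmlCleaner.
public class DlContentModelSketch {

    // Builds a toy "parent tag -> permitted child tags" table.
    // allowDivInDl = false mimics the older rule (dt/dd only);
    // allowDivInDl = true mimics the current WHATWG rule (div also permitted).
    public static Map<String, Set<String>> permittedChildren(boolean allowDivInDl) {
        Map<String, Set<String>> table = new HashMap<>();
        Set<String> dlChildren = new HashSet<>(Arrays.asList("dt", "dd"));
        if (allowDivInDl) {
            dlChildren.add("div");
        }
        table.put("dl", dlChildren);
        return table;
    }

    // Returns the child tag names that are permitted under the given parent.
    // Parents without an entry in the table are treated as unrestricted.
    public static List<String> filterChildren(Map<String, Set<String>> table,
                                              String parent,
                                              List<String> children) {
        Set<String> allowed = table.get(parent);
        List<String> kept = new ArrayList<>();
        for (String child : children) {
            if (allowed == null || allowed.contains(child)) {
                kept.add(child);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList("div");
        // Old rule: div is not permitted inside dl, so it is filtered out.
        System.out.println(filterChildren(permittedChildren(false), "dl", children)); // prints []
        // With div added to the permitted list, the wrapper is kept.
        System.out.println(filterChildren(permittedChildren(true), "dl", children));  // prints [div]
    }
}
```

The point of the sketch is only that the reported behaviour is consistent with a missing entry in a permitted-children list, which is why adding div for dl is enough for the test case to pass.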