#230 Div element wrongly filtered out from dl children when using HTML 5

Milestone: v2.30
Status: open-accepted
Owner: nobody
Labels: None
Priority: 5
Updated: 2023-06-19
Created: 2022-03-03
Creator: Simon Urli
Private: No

Hi,

We recently discovered that HtmlCleaner handles a div inside a dl element differently depending on whether it is configured for HTML 4 or HTML 5.
Basically this kind of hierarchy:

<dl>
  <div>
    <dt>Foo</dt>
    <dd>Bar</dd>
  </div>
</dl>

is not handled the same way in both versions. In HTML 4 mode, HtmlCleaner keeps the hierarchy; in HTML 5 mode, it filters out the div element, even though div is permitted inside a dl (reference: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/dl).

Here's a unit test snippet:

    @Test
    public void cleanDl() throws Exception
    {
        CleanerProperties cleanerProperties = new CleanerProperties();

        // Switch the cleaner to HTML 5 mode; the hierarchy is preserved in HTML 4 mode.
        cleanerProperties.setHtmlVersion(5);

        HtmlCleaner cleaner = new HtmlCleaner(cleanerProperties);
        TagNode tagNode = cleaner.clean("<?xml version=\"1.0\" encoding=\"UTF-8\"?><!DOCTYPE html>\n"
            + "<html><body>"
            + "<dl>"
            + "<div>"
            + "<dt>Foo</dt>"
            + "<dd>Bar</dd>"
            + "</div>"
            + "</dl>"
            + "</body></html>");
        List<? extends TagNode> dlList = tagNode.getElementListByName("dl", true);
        assertEquals(1, dlList.size());

        List<? extends TagNode> divList = dlList.get(0).getElementListByName("div", true);

        // This assertion fails in HTML 5 mode because the div has been filtered out.
        assertEquals(1, divList.size());
    }

Discussion

  • Simon Urli

    Simon Urli - 2022-03-03

    @scottwilson sorry, I attached this to the wrong version, and it looks like I cannot edit the ticket. I hit the bug with HtmlCleaner 2.24.

     
  • Scott Wilson

    Scott Wilson - 2022-03-04

    Thanks Simon, I'll take a look. I think the spec version I used only had dt/dd as permitted elements, but the current WHATWG spec gives this content model for dl:

    Either: Zero or more groups each consisting of one or more dt elements followed by one or more dd elements, optionally intermixed with script-supporting elements.
    Or: One or more div elements, optionally intermixed with script-supporting elements.

    https://html.spec.whatwg.org/multipage/grouping-content.html#the-dl-element

    I'll add DIV to the list of permitted elements in the HC metadata and see if we can pass your test case.
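
The two alternatives in the WHATWG content model quoted above can be sketched as a standalone check over the names of a dl element's direct children. This is only an illustration of the rule, not HtmlCleaner's metadata or implementation: the class `DlContentModel` and its methods are hypothetical names, tag names are assumed to be lowercase, and "script-supporting elements" is taken to mean `script` and `template`.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical helper illustrating the WHATWG content model for dl children. */
public final class DlContentModel {

    // Assumption: script-supporting elements are script and template.
    private static final Set<String> SCRIPT_SUPPORTING = Set.of("script", "template");

    private DlContentModel() {
    }

    /** True if the child names match either alternative of the content model. */
    public static boolean isValidDlChildren(List<String> childNames) {
        return isDtDdGroups(childNames) || isDivOnly(childNames);
    }

    /** Alternative 1: zero or more groups of (one or more dt, then one or more dd). */
    public static boolean isDtDdGroups(List<String> childNames) {
        String previous = null; // last dt/dd seen, ignoring script-supporting elements
        for (String name : childNames) {
            if (SCRIPT_SUPPORTING.contains(name)) {
                continue;
            }
            if (name.equals("dt")) {
                previous = "dt";
            } else if (name.equals("dd")) {
                if (previous == null) {
                    return false; // a dd must be preceded by at least one dt
                }
                previous = "dd";
            } else {
                return false; // any other element breaks this alternative
            }
        }
        // Either no groups at all, or the last group was closed by a dd.
        return previous == null || previous.equals("dd");
    }

    /** Alternative 2: one or more div elements, plus script-supporting elements. */
    public static boolean isDivOnly(List<String> childNames) {
        int divCount = 0;
        for (String name : childNames) {
            if (SCRIPT_SUPPORTING.contains(name)) {
                continue;
            }
            if (!name.equals("div")) {
                return false;
            }
            divCount++;
        }
        return divCount > 0;
    }
}
```

Under this sketch, the children of the dl in the reported test case (a single div) satisfy the second alternative, which is the case the HTML 5 mode was rejecting. Mixing div with bare dt/dd children at the same level satisfies neither alternative.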

     
  • Scott Wilson

    Scott Wilson - 2023-04-29
    • Group: v 2.7 --> v2.29
     
  • Scott Wilson

    Scott Wilson - 2023-04-29
    • status: open --> open-accepted
    • Group: v2.29 --> v2.28
     
  • Scott Wilson

    Scott Wilson - 2023-04-29

    Fixed; will be in the 2.28 release.

     
  • Scott Wilson

    Scott Wilson - 2023-06-19
    • Group: v2.28 --> v2.30
     
