Ad

How to sanitize full HTML 5 documents with HTMLPurifier

@machinateur Mar 01, 2024 15 min read

When it comes to sanitization of HTML content, HTMLPurifier (ezyang/htmlpurifier) is a popular library. Sanitization is commonly used in content management systems and really all kinds of software that deals with web content.

Not long ago I had to find out the hard way, that it has its drawbacks, in terms of usability. That's because it does not come with HTML5 support.

That said, let me just mention that this article is quite shopware 6 specific.

The use-case

For better understanding, here is the example case:

In shopware 6, the library is used to sanitize HTML content from the administration, among other uses, for mail templates. And when it comes to e-mails, there are scenarios, when you would want to send a complete HTML 5 document, along with its <html/>, <head/> and <body/> elements.

As I found, the purifier does not know about these tags in any way. Same goes for the <style/> or <meta/>. So if they're used in an e-mail template, they would be purged from the content, along with their content. But the library will also throw errors when there are unsupported tags in the markup in some cases.

Adding support for HTML5 elements

To my luck, there is a configuration option built-in to shopware, which allows to extend the options for the purifier. And this is what it looks like in the default yaml config:

shopware:

    # ...

    html_sanitizer:
        sets:
            - name: HTML5
              tags: ["article", "aside", "audio", "bdi", "canvas", "datalist", "details", "dialog", "embed", "figcaption", "figure", "footer", "header", "main", "mark", "meter", "nav", "progress", "rp", "rt", "ruby", "section", "summary", "time", "wbr", "output", "canvas", "svg", "track", "video", "source", "input"]
              attributes: ["controls", "open", "min", "max", "datetime", "for", "type", "kind", "srclang", "label", "value", "placeholder", "autoplay", "loop", "muted", "preload", "low", "high", "optimum", "default", "poster", "media", "maxlength", "minlength", "pattern", "required", "autocomplete", "autofocus", "disabled", "readonly", "multiple", "formaction", "formenctype", "formmethod", "formnovalidate", "formtarget", "list", "step", "checked", "accept"]
            - name: basic
              tags: ["a", "abbr", "acronym", "address", "b", "bdo", "big", "blockquote", "br", "caption", "center", "cite", "code", "col", "colgroup", "dd", "del", "dfn", "dir", "div", "dl", "dt", "em", "font", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "i", "ins", "kbd", "li", "menu", "ol", "p", "pre", "q", "s", "samp", "small", "span", "strike", "strong", "sub", "sup", "table", "tbody", "td", "tfoot", "th", "thead", "tr", "tt", "u", "ul", "var", "img"]
              attributes: ["align", "bgcolor", "border", "cellpadding", "cellspacing", "cite", "class", "clear", "color", "colspan", "dir", "face", "frame", "height", "href", "id", "lang", "name", "noshade", "nowrap", "rel", "rev", "rowspan", "scope", "size", "span", "start", "style", "summary", "title", "type", "valign", "value", "width", "target", "src", "alt"]
              options:
                  - key: Attr.AllowedFrameTargets
                    values: ['_blank', '_self', '_parent', '_top']
                  - key: Attr.AllowedRel
                    values: ['nofollow', 'print']
                  - key: Attr.EnableID
                    value: true
            - name: media
              tags: ["img"]
              attributes: ["src", "alt"]
            - name: script
              tags: ["script"]
              options:
                  - key: HTML.Trusted
                    value: true
            - name: tidy
              options:
                  - key: Output.TidyFormat
                    value: true
            - name: bootstrap
              tags: ["a", "span"]
              attributes: ["role", "aria-label", "aria-labelledly", "aria-current", "aria-expanded", "aria-controls", "aria-hidden", "aria-describedby", "tabindex", "aria-modal", "data-bs-toggle", "data-bs-target", "data-bs-dismiss", "data-bs-slide", "data-bs-slide-to", "data-bs-parent", "data-bs-config", "data-bs-content", "data-bs-spy"]
              custom_attributes:
                  - tags: ["a", "span"]
                    attributes: ["href", "role", "aria-label", "aria-labelledly", "aria-current", "aria-expanded", "aria-controls", "aria-hidden", "aria-describedby", "tabindex", "aria-modal", "data-bs-toggle", "data-bs-target", "data-bs-dismiss", "data-bs-slide", "data-bs-slide-to", "data-bs-parent", "data-bs-config", "data-bs-content", "data-bs-spy"]
            - name: snippet
              tags: ["a"]
              attributes: ["data-url", "data-ajax-modal", "data-prev-url"]
              custom_attributes:
                  - tags: ["a"]
                    attributes: ["data-url", "data-ajax-modal", "data-prev-url"]

        fields:
            - name: product_translation.description
              sets: ["basic", "media", "HTML5"]
            - name: app_cms_block.template
              sets: ["basic", "media", "tidy", "HTML5"]
            - name: snippet.value
              sets: ["basic", "media", "bootstrap", "snippet", "HTML5"]

(source)

That lead me to the following lines of code, where the configuration is applied.

        // some processing of the yaml structure

        $config->set('HTML.AllowedElements', $allowedElements);
        $config->set('HTML.AllowedAttributes', $allowedAttributes);

        $definition = $config->getHTMLDefinition(true);

        if ($definition === null) {
            return $config;
        }

        $this->addHTML5Tags($definition);

        foreach ($customAttributes as $tag => $attributes) {
            foreach ($attributes as $attribute) {
                $definition->addAttribute($tag, $attribute, 'Text');
            }
        }

(source)

It is then used to construct the \HTMLPurifier instance.

If you're interested in what configurations exactly are necessary to support HTML5 or have a look at source code in their repository.


private function addHTML5Tags(\HTMLPurifier_HTMLDefinition $definition): \HTMLPurifier_HTMLDefinition
{
    $definition->addElement('section', 'Block', 'Flow', 'Common');
    $definition->addElement('nav', 'Block', 'Flow', 'Common');
    $definition->addElement('article', 'Block', 'Flow', 'Common');
    $definition->addElement('aside', 'Block', 'Flow', 'Common');
    $definition->addElement('header', 'Block', 'Flow', 'Common');
    $definition->addElement('footer', 'Block', 'Flow', 'Common');
    $definition->addElement('canvas', 'Block', 'Flow', 'Common', [
        'width' => 'Length',
        'height' => 'Length',
    ]);
    $definition->addElement('bdi', 'Block', 'Flow', 'Common');
    $definition->addElement('audio', 'Block', 'Flow', 'Common', [
        'src' => 'URI',
        'preload' => 'Enum#auto,metadata,none',
        'autoplay' => new \HTMLPurifier_AttrDef_HTML_Bool(true),
        'loop' => new \HTMLPurifier_AttrDef_HTML_Bool(true),
        'muted' => new \HTMLPurifier_AttrDef_HTML_Bool(true),
        'controls' => new \HTMLPurifier_AttrDef_HTML_Bool(true),
    ]);
    $definition->addElement('datalist', 'Block', 'Flow', 'Common', [
        'id' => 'ID',
    ]);
    $definition->addElement('dialog', 'Block', 'Flow', 'Common', [
        'open' => new \HTMLPurifier_AttrDef_HTML_Bool(true),
    ]);
    $definition->addElement('embed', 'Block', 'Flow', 'Common', [
        'src' => 'URI',
        'type' => 'Text',
        'width' => 'Length',
        'height' => 'Length',
    ]);
    $definition->addElement('main', 'Block', 'Flow', 'Common');
    $definition->addElement('menu', 'Block', 'Flow', 'Common');
    $definition->addElement('meter', 'Block', 'Flow', 'Common', [
        'form' => 'ID',
        'value' => 'Text',
        'min' => 'Length',
        'max' => 'Length',
        'low' => 'Text',
        'high' => 'Text',
        'optimum' => 'Text',
    ]);
    $definition->addElement('progress', 'Block', 'Flow', 'Common', [
        'value' => 'Number',
        'max' => 'Number',
    ]);
    $definition->addElement('rp', 'Block', 'Flow', 'Common');
    $definition->addElement('rt', 'Block', 'Flow', 'Common');
    $definition->addElement('ruby', 'Block', 'Flow', 'Common');
    $definition->addElement('summary', 'Block', 'Flow', 'Common');
    $definition->addElement('time', 'Block', 'Flow', 'Common', [
        'datetime' => 'Text',
    ]);
    $definition->addElement('output', 'Block', 'Flow', 'Common', [
        'for' => 'ID',
        'form' => 'ID',
        'name' => 'CDATA',
    ]);
    $definition->addElement('svg', 'Block', 'Flow', 'Common', [
        'width' => 'Length',
        'height' => 'Length',
    ]);
    $definition->addElement('track', 'Block', 'Flow', 'Common', [
        'default' => new \HTMLPurifier_AttrDef_HTML_Bool(true),
        'kind' => 'Enum#subtitles,captions,descriptions,chapters,metadata',
        'label' => 'Text',
        'src' => 'URI',
        'srclang' => 'LanguageCode',
    ]);
    $definition->addElement(
        'details',
        'Block',
        'Flow',
        'Common',
        [
            'open' => new \HTMLPurifier_AttrDef_HTML_Bool(true),
        ]
    );

    $definition->addElement('figure', 'Block', 'Optional: (figcaption, Flow) | (Flow, figcaption) | Flow', 'Common');
    $definition->addElement('figcaption', 'Inline', 'Flow', 'Common');
    $definition->addElement('video', 'Block', 'Optional: (source, Flow) | (Flow, source) | Flow', 'Common', [
        'src' => 'URI',
        'width' => 'Length',
        'height' => 'Length',
        'poster' => 'URI',
        'preload' => 'Enum#auto,metadata,none',
        'controls' => new \HTMLPurifier_AttrDef_HTML_Bool(true),
        'autoplay' => new \HTMLPurifier_AttrDef_HTML_Bool(true),
        'loop' => new \HTMLPurifier_AttrDef_HTML_Bool(true),
        'muted' => new \HTMLPurifier_AttrDef_HTML_Bool(true),
    ]);
    $definition->addElement('source', 'Block', 'Flow', 'Common', [
        'src' => 'URI',
        'type' => 'Text',
        'media' => 'Text',
        'sizes' => 'Text',
        'srcset' => 'Text',
        'crossorigin' => 'Enum#anonymous,use-credentials',
    ]);

    $definition->addElement('mark', 'Inline', 'Inline', 'Common');
    $definition->addElement('wbr', 'Inline', 'Empty', 'Core');

    // Add new HTML5 input types
    $definition->addElement(
        'input',
        'Form',
        'Empty',
        'Common',
        [
            'accept' => 'Text',
            'alt' => 'Text',
            'autocomplete' => 'Enum#on,off',
            'autofocus' => new \HTMLPurifier_AttrDef_HTML_Bool(true),
            'checked' => new \HTMLPurifier_AttrDef_HTML_Bool(true),
            'dirname' => 'Text',
            'disabled' => new \HTMLPurifier_AttrDef_HTML_Bool(true),
            'form' => 'ID',
            'formaction' => 'URI',
            'formenctype' => 'Enum#application/x-www-form-urlencoded,multipart/form-data,text/plain',
            'formmethod' => 'Enum#get,post',
            'formnovalidate' => new \HTMLPurifier_AttrDef_HTML_Bool(true),
            'formtarget' => 'Enum#_blank,_self,_parent,_top',
            'height' => 'Length',
            'list' => 'ID',
            'max' => 'Text',
            'maxlength' => 'Number',
            'min' => 'Text',
            'minlength' => 'Number',
            'multiple' => new \HTMLPurifier_AttrDef_HTML_Bool(true),
            'name' => 'CDATA',
            'pattern' => 'Text',
            'placeholder' => 'Text',
            'readonly' => new \HTMLPurifier_AttrDef_HTML_Bool(true),
            'required' => new \HTMLPurifier_AttrDef_HTML_Bool(true),
            'size' => 'Number',
            'src' => 'URI',
            'step' => 'Text',
            'type' => 'Enum#text,password,checkbox,radio,submit,reset,file,hidden,image,button,date,time,datetime-local,week,month,number,email,url,search,tel,color,range',
            'value' => 'Text',
            'width' => 'Length',
        ]
    );

    return $definition;
}

Support for full documents

Following the documentation on the matter of adding custom elements available on the website, I came up with the following approach:

        $definition = $config->getHTMLDefinition(true);

        \assert($definition instanceof \HTMLPurifier_HTMLDefinition);

        // Add dummy document root element
        $definition->addElement('__document_root', false,
            'required: html',
            null,
        );
        // Add <html/> element (document root)
        $definition->addElement('html', 'Document',
            // The head/body are required in the <html/> tag
            'required: head | body',
            null,
            [
                'lang' => 'Text',
                // 'charset' => 'Charsets',
            ],
        );
        // Add <head/> element
        $definition->addElement('head', false,
            // These three elements are allowed as children, but not required
            'optional: meta | link | style',
            null,
        );
        // Add <body/> element
        $definition->addElement('body', false,
            'Flow',
            // Add common attributes to the <body/> tag
            'Common',
        );

        // Add <meta/> element
        $definition->addElement('meta', false,
            'Empty',
            null,
            [
                'name'    => 'NMTOKENS',
                'content' => 'Text',
            ],
        );
        // Add <link/> element
        $definition->addElement('link', false,
            'Empty',
            null,
            [
                // The "href" is a URL
                'href' => 'URI',
                // Force "rel" to stylesheet
                'rel' => 'Enum#stylesheet',
                // Force the "type" value to text/css
                'type' => 'Enum#text/css',
            ],
        );
        // Add <style/> element
        $definition->addElement('style', 'Inline',
            // Not Empty to not allow to auto-close the <style/> tag
            'required: #PCDATA',
            null,
            [
                // Force the "type" attribute value to text/css
                'type' => 'Enum#text/css',
            ],
        );

As you can see, I've added a __document_root element, which is a dummy used for the parent of html, which is the required child of that element. The html in turn has head and body as required children (one of 'em). The body then again allows all Flow content and so on. It's safe, since it is not allowed in any of the other elements. It will only be used as HTML.Parent option value for the lexer to understand our bidding.

In the above example, I also defined style, link and meta tags, which should suffice for most use-cases.

What I've not shown here, is how these settings are injected into the configuration. It's impossible to use inheritance, which leaves reflection and closure scopes. There are lots of examples on the internet on how that can be done.

At last, the \Shopware\Core\Framework\Util\HtmlSanitizer should be decorated in the service definition like this:

services:
    # ...

    Machinateur\Shopware\Core\Framework\Util\HtmlSanitizer:
        decorates: 'Shopware\Core\Framework\Util\HtmlSanitizer'

See the symfony docs for more information on how to decorate services.

But this is only one part of the puzzle...

Shopware configuration for full documents

Utilizing the YAML configuration, we can now allow a new set of elements. And this is a crucial part for this extension to work, since the options are very specific for different use-cases. You can read about the options at the official documentation.

See the configuration below:

shopware:

    # ...

    html_sanitizer:
        sets:
            # Allow wrapping document elements in mail header/footer content
            - name: full_document
              tags: ['html', 'head', 'body', 'meta', 'link']
              attributes: ['lang', 'name', 'content', 'rel', 'href', 'type']
              options:
                  # All options are documented at http://htmlpurifier.org/live/configdoc/plain.html
                  - key: CSS.AllowImportant
                    value: true
                  # We are more permissive regarding the allowed "rel" values, to properly support links in head
                  - key: Attr.AllowedRel
                    values: ['nofollow', 'print', 'stylesheet']
                  # Avoid removing the outer document, to keep that content intact
                  - key: Core.ConvertDocumentToFragment
                    value: false
                  # Set up the dummy element as top-most parent element
                  - key: HTML.Parent
                    value: '__document_root'
                  # We do not wish to loose any unclosed tags, as header/footer are separated
                  - key: Core.EscapeInvalidTags
                    value: true
                  # Set the lexer implementation explicitly, as the wrong lexer might break full documents
                  - key: Core.LexerImpl
                    value: 'DirectLex'
            # Allow inline style tags in mail body (template) content
            - name: style
              tags: ['style']
              attributes: ['type']
              options:
                  - key: CSS.Trusted
                    value: true

        fields:
            # Set up special rule for mail template
            - name: mail_template_translation.contentHtml
              sets: ['basic', 'style', 'HTML5', 'full_document']

Setting this up for e-mail header/footer

So, as you might imagine, it could be a bad idea to use this with header and footer on mails. That's, to state the obvious, because the template will be enclosed in header and footer template.

So you should take measures to ensure there is no possibility to set these templates on sales-channels using full document e-mails. Everything else on your own risk.

It's a bit tricky to make full documents work for the header-footer use-case. You'd want the mail template itself to only allow basic HTML content, and the header-footer set to full documents.

The configuration or that would be this:

shopware:

    # ...

    html_sanitizer:
        fields:
            # Set up special rule for mail header/footer template
            - name: mail_header_footer_translation.contentHtml
              sets: ['basic', 'style'] # maybe also HTML5

That field does not exist, but I'll come to that shortly.

Apart from that, there's a not insignificant part involved in sticking the both parts of the document template together.

To do so, first the validation has to be switched off for the definition. SUch thing can be achieved by changing the class of the service definition for the entity definition in the service container. Override the fields by just re-adding the exiting ones to the defined fields from parent. The new class should inherit all other parts of the definition. The new fields for header/footer HTML should also get the AllowHtml flag, with parameter false. That way, no validation is done through the sanitizer (which gets called by the field serializer).

<?php

declare(strict_types=1);

namespace Machinateur\Shopware\Core\Content\MailTemplate\Aggregate;

use Shopware\Core\Content\MailTemplate\Aggregate\MailHeaderFooterTranslation\MailHeaderFooterTranslationDefinition as MailHeaderFooterTranslationDefinitionBase;
use Shopware\Core\Framework\DataAbstractionLayer\Field\Flag\AllowHtml;
use Shopware\Core\Framework\DataAbstractionLayer\Field\LongTextField;
use Shopware\Core\Framework\DataAbstractionLayer\FieldCollection;

class MailHeaderFooterTranslationDefinition extends MailHeaderFooterTranslationDefinitionBase
{
    public function defineFields(): FieldCollection
    {
        $def = parent::defineFields();
        $def->add(
            (new LongTextField('header_html', 'headerHtml'))
                ->addFlags(new AllowHtml(false))
        );
        $def->add(
            (new LongTextField('footer_html', 'footerHtml'))
                ->addFlags(new AllowHtml(false))
        );

        return $def;
    }
}

With the following YAML service definition:

services:
    Machinateur\Shopware\:
        # the path is relative to `src/Resources/config/services.yaml`
        resource: '../../'
        exclude:
            - '../../Core/Content/MailTemplate/Aggregate/MailHeaderFooterTranslation/MailHeaderFooterTranslationDefinition.php'
            # ...

    Shopware\Core\Content\MailTemplate\Aggregate\MailHeaderFooterTranslation\MailHeaderFooterTranslationDefinition:
        class: 'Machinateur\Shopware\Core\Content\MailTemplate\Aggregate\MailHeaderFooterTranslationDefinition'

The definition class is used instead of the original one from shopware, while the same new class is excluded from the resource definition at the top. That is necessary to avoid issues with duplicate entity names, as the new definition would otherwise be discovered, tagged and registered (autowiring process). This is a general way to overcome the strict entity extension rules, put up by shopware. Doing this should be the last resort, when the usual entity extension capabilities are not sufficient, like in this case.

Now we have to account for the validation ourselves, whenever the entity data of the MailHeaderFooterTranslationEntity is updated. This in turn can be achieved using the event system of shopware. The docs explain how it's possible to access and change data.

<?php

declare(strict_types=1);

namespace Machinateur\Shopware\Core\Content\MailTemplate\Aggregate;

// ...

class MailHeaderFooterWriteSubscriber implements EventSubscriberInterface
{
    public function __construct(
        protected readonly HtmlSanitizer $sanitizer,
        protected readonly EntityRepository $mailHeaderFooterTranslationRepository,
    ) {
    }

    public static function getSubscribedEvents(): array
    {
        return [
            PreWriteValidationEvent::class => 'beforeWrite',
        ];
    }

    public function beforeWrite(PreWriteValidationEvent $event): void
    {
        foreach ($event->getCommands() as $command) {
            $definition = $command->getDefinition();

            // Only apply to the specific header/footer entity.
            if ('mail_header_footer_translation' !== $definition->getEntityName()) {
                continue;
            }

            \assert($definition instanceof MailHeaderFooterTranslationDefinition);

            // Skip if no relevant fields is changed.
            if (!$command->hasField('header_html')
                && !$command->hasField('footer_html')) {
                continue;
            }

            // Load the current entity data.
            $entity = $this->getExistingEntity($command, $event->getContext());
            if (null === $entity) {
                continue;
            }

            // Random divider UUID.
            $divider = Uuid::randomHex();

            $payload = $command->getPayload();
            $payload['header_html'] ??= $entity->get('headerHtml');
            $payload['footer_html'] ??= $entity->get('footerHtml');

            $text = sprintf('%s%s%s',
                $payload['header_html'], $divider, $payload['footer_html']
            );
            $text = $this->sanitizer->sanitize($text, field: 'mail_header_footer_translation.contentHtml');
            // This is where the dummy name comes in again.  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

            [$headerHtml, $footerHtml] = explode($divider, $text, 2);

            $command->addPayload('header_html', $headerHtml);
            $command->addPayload('footer_html', $footerHtml);
        }
    }

    // ...
}

The key element is, that our custom logic should concatenate both parts, with a unique separator in between, then sanitize the contents, which should make a full document, when valid. That separator should be as random and unique as possible to avoid wrong splitting, when the sanitized text is taken apart again. A good separator could be a random Uuid in HEX format.

The HtmlSanitizer::sanitize() method allows to supply the $field, which should be a field we've previously set up in the YAML config. In the example above, that's mail_header_footer_translation.contentHtml, a non-existent field on that entity, but existence does not really matter here. We could also abuse the specific header/footer fields, but since both will be parsed as one document, it makes absolutely no difference, except more config boilerplate would be required.

When all that is done and in place, we should be able to handle full documents in the e-mail template generation, which on top of that can contain twig directives. Let me know what you think.

Happy coding!

Topics

php html shopware sanitization