How to easily parse and alter HTML markup with WP_HTML_Tag_Processor

By Ian Svoboda

When you’re building with WordPress, you will almost certainly need to modify HTML output at some point. This is pretty easy to do with JavaScript (which has built-in tools for manipulating the DOM), but historically it’s been much more challenging in PHP. In this post I’ll show you how you can use the WP_HTML_Tag_Processor class to easily make changes to an HTML string in PHP.

In my career, I’ve often seen developers reach for strpos (find a string in another string) or preg_match (regex) to pattern match the things they’re looking for. This can be OK for simpler use cases (e.g., changing all occurrences of the same string no matter where it appears). But the moment your needs get more complicated, that approach becomes increasingly unreliable.
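To make that concrete, here is a minimal sketch in plain PHP of how a pattern-matching approach can quietly stop working. The markup strings and the bazinga class are made up for illustration; the point is only that a pattern written for one exact form of a tag misses the same tag once it carries an attribute.

```php
<?php
// Naive approach: look for the literal "<summary>" tag and swap in a version with a class.
$pattern     = '/<summary>/';
$replacement = '<summary class="bazinga">';

// Case 1: the tag has no attributes, so the pattern matches and the class is added.
$simple        = '<details><summary>Question</summary><p>Answer</p></details>';
$simple_result = preg_replace( $pattern, $replacement, $simple );

// Case 2: the same tag with an id attribute. The pattern no longer matches,
// so preg_replace returns the string unchanged and no class is added.
$with_id        = '<details><summary id="q1">Question</summary><p>Answer</p></details>';
$with_id_result = preg_replace( $pattern, $replacement, $with_id );
```

You could keep extending the pattern to cover attributes, quoting styles, and whitespace, but each extension adds another way for it to match the wrong thing.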

Additionally, I’ve seen (and personally used) DOMDocument, which is a native PHP class for parsing and manipulating HTML markup. DOMDocument is a powerful, full-featured markup parser that even has methods like getElementById that are more comparable to JavaScript. However, all that power comes with slower speeds and higher memory usage, and for most use cases the changes are too simple to justify the performance hit.[1]

Thankfully WordPress has a new PHP class available since version 6.2 that makes this much easier (and faster) to do: WP_HTML_Tag_Processor. Let’s explore a real world reason you might want to use this and how you can do it.

Modifying block markup

Let’s examine a hypothetical situation where we want to change the markup of a Details block. The markup for this block would normally look like this:

<details class="wp-block-details is-layout-flow wp-container-core-details-is-layout-49beafc5 wp-block-details-is-layout-flow">    
  <summary>What languages does MindSpace support?</summary>
  <p>MindSpace currently supports English, Spanish, French, German, Italian, Portuguese, and Japanese. We’re continuously adding more <a href="https://example.com">languages</a> based on user demand. The AI transcription and analysis work in the primary language of your meeting, with automatic language detection for mixed-language conversations.
  </p>
    <p>You can read more about this in our <a href="https://thebomb.com">support</a> documentation.
  </p>
</details>
HTML

But let’s say we need to add an attribute to the summary tag (such as a data-* attribute or class). This is a fantastic use case for the tag processor and we can easily do this using the render_block filter.

The render_block filter applies to any block that renders and you can use some simple conditional logic to only target the block you’re looking to modify. Here’s how that might look:

functions.php
<?php

/**
 * Modify the core/details block markup
 * 
 * @param string $block_content The block HTML markup to be rendered.
 * @param array $block The parsed block.
 * 
 * @return string The modified or original block content.
 */
function update_details_block_markup( $block_content, $block ) {

	// Return the original content if the block name doesn't match.
	if ( $block['blockName'] !== 'core/details' ) {
		return $block_content;
	}

	// ....

	return $block_content;
}
add_filter( 'render_block', 'update_details_block_markup', 10, 2 );
PHP

Since this filter is called on every block on the page we need to do some checking to see if it’s a core/details block. The second parameter of our callback function is the parsed block, which is an associative array with several keys, including blockName.

So this code first checks if the block name of the filtered block content matches the one we want. If it doesn’t match, we just return the $block_content without any changes. This is important for a few reasons:

  1. You should always avoid modifying things you don’t mean to modify.
  2. A filter must always return a value.

In computer science terms, this kind of condition is called a guard clause and is an alternative to wrapping all of your code in a condition. While either one is valid, I’d encourage you to use a guard clause pattern where possible as it makes the code much easier to read and understand.
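For reference, here is a rough, abbreviated sketch of what the parsed block array can look like for a Details block. The values are illustrative rather than copied from a real page, and the inner paragraph blocks are left out to keep it short.

```php
<?php
// Hypothetical, abbreviated parsed block array for a Details block.
$example_block = array(
	// The namespaced block name that the guard clause compares against.
	'blockName'    => 'core/details',
	// Attributes saved in the block's comment delimiter (empty in this sketch).
	'attrs'        => array(),
	// Parsed child blocks, such as the paragraphs (omitted in this sketch).
	'innerBlocks'  => array(),
	// The block's own saved HTML.
	'innerHTML'    => '<details class="wp-block-details"><summary>Question</summary></details>',
	// The same HTML split into chunks around any inner blocks.
	'innerContent' => array( '<details class="wp-block-details"><summary>Question</summary></details>' ),
);

// The guard clause boils down to this comparison.
$is_details_block = ( 'core/details' === $example_block['blockName'] );
```

Content that isn’t wrapped in a block at all is parsed with a blockName of null, and the strict comparison in the guard clause handles that case without any extra checks.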

So now that we are sure we’re working with the right block, let’s start using the WP_HTML_Tag_Processor class.

Loading the markup to process

To start using the tag processor, you need to give it the string of HTML you want it to parse. In our example, the $block_content param is a string that contains all of the HTML the block is going to render. So we just need to instantiate the class like so:

functions.php
<?php 
/**
 * Modify the core/details block markup
 * 
 * @param string $block_content The block HTML markup to be rendered.
 * @param array $block The parsed block.
 * 
 * @return string The modified or original block content.
 */
function update_details_block_markup( $block_content, $block ) {

	// Return the original content if the block name doesn't match.
	if ( $block['blockName'] !== 'core/details' ) {
		return $block_content;
	}


	// Load the HTML we want the tag processor to use.
	$html = new WP_HTML_Tag_Processor( $block_content );


	return $block_content;
}
add_filter( 'render_block', 'update_details_block_markup', 10, 2 );
PHP

That was easy! Now the tag processor is ready to look through that HTML and find the element(s) we want. Since we want to change the summary element, we need to find that element in the provided HTML.

Selecting elements

To find a given element in the provided HTML, we’ll use the next_tag method and give it the name of the tag we want, like so:

functions.php
<?php 
/**
 * Modify the core/details block markup
 * 
 * @param string $block_content The block HTML markup to be rendered.
 * @param array $block The parsed block.
 * 
 * @return string The modified or original block content.
 */
function update_details_block_markup( $block_content, $block ) {

	// Return the original content if the block name doesn't match.
	if ( $block['blockName'] !== 'core/details' ) {
		return $block_content;
	}


	// Load the HTML we want the tag processor to use.
	$html = new WP_HTML_Tag_Processor( $block_content );

	// Select the summary tag.
	$html->next_tag( 'summary' );

	return $block_content;
}
add_filter( 'render_block', 'update_details_block_markup', 10, 2 );
PHP

The next_tag method is roughly analogous to querySelector in JavaScript, except the way you select elements is a bit different. For a basic query like this where we just need to find a given HTML tag, you can just pass a string like in the above example.

But if you need to do more advanced things, like find an element by class name, you can use an array for the first param of next_tag instead and pass the class_name key. While you can’t use CSS selectors like you can with document.querySelector, you can look for elements in a few helpful ways:

  • tag_name – Match the tag name (‘div’, ‘details’, ‘summary’, ‘button’, etc)
  • class_name – Match the class name
  • match_offset – Match the Nth tag that meets the other criteria, counted by order of appearance in the markup (note: this includes nested tags)

Note:
The last bullet there is especially important to understand so I want to take a quick moment to explain that before we move on.


The tag processor scans tags in the order they appear in the markup, without regard to nesting, so you can think of the tags as a flat array. In the context of our example, that might look like this (using only the tag names for brevity):

['details', 'summary', 'p', 'a', 'p', 'a']
PHP

So now let’s say we want to target the second link tag. Here we’d need to use the tag_name and match_offset like so:

// Load the HTML we want the tag processor to use.
$html = new WP_HTML_Tag_Processor( $block_content );

// Tell the tag processor to find the tag that matches our selector/query criteria.
$html->next_tag( [ 'tag_name' => 'a', 'match_offset' => 2 ] );
PHP

The match_offset key lets us say “I want the second link tag you find” even if that tag is inside another tag. If you were trying to target the second link in CSS, you’d have to do something like p:nth-of-type(2) a or similar to indicate you wanted the anchor inside the last p tag.

In our case we just need to select the summary tag and there’s only ever going to be one of those in a details element. So we can use the simple selector in the original example.
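As a minimal sketch of an array-style query, the snippet below looks for a paragraph by class name. The helper name example_highlight_notice and the notice and is-highlighted classes are hypothetical and only there to illustrate the shape of the query. It also uses add_class and get_updated_html, which write the change and read back the resulting HTML string.

```php
<?php
/**
 * Hypothetical helper: highlights the first paragraph that has a "notice" class.
 *
 * @param string $html_string The HTML to search.
 *
 * @return string The modified or original HTML.
 */
function example_highlight_notice( $html_string ) {
	$processor = new WP_HTML_Tag_Processor( $html_string );

	// Combine criteria: the tag must be a <p> AND have the "notice" class.
	$found = $processor->next_tag(
		array(
			'tag_name'   => 'p',
			'class_name' => 'notice',
		)
	);

	if ( $found ) {
		$processor->add_class( 'is-highlighted' );
	}

	// Returns the HTML with any changes applied (or the original HTML if nothing matched).
	return $processor->get_updated_html();
}
```

When more than one key is passed, a tag has to satisfy all of them to match, so a paragraph without the notice class is skipped.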

Adding attributes to an element

When you call the next_tag method, the tag processor will designate the matching element as the current tag being processed if it exists. So once we call next_tag, the $html variable is now pointing to the summary element we selected.

This means that to change the element we’ve selected with next_tag, we call the modifying method(s) on the tag processor instance itself (i.e. the $html variable).

So if we wanted to add a class to the summary we can change up our previous code a little to first check if the summary was found and then add a class to it if it exists.

functions.php
<?php 
/**
 * Modify the core/details block markup
 * 
 * @param string $block_content The block HTML markup to be rendered.
 * @param array $block The parsed block.
 * 
 * @return string The modified or original block content.
 */
function update_details_block_markup( $block_content, $block ) {

	// Return the original content if the block name doesn't match.
	if ( $block['blockName'] !== 'core/details' ) {
		return $block_content;
	}


	// Load the HTML we want the tag processor to use.
	$html = new WP_HTML_Tag_Processor( $block_content );

	// Check if a summary tag is found.
	$has_summary = $html->next_tag( 'summary' );

	if ( $has_summary ) {
		$html->add_class( 'bazinga' );

		// Return the modified HTML (as a string) if we changed it.
		return $html->get_updated_html();
	}

	return $block_content;
}
add_filter( 'render_block', 'update_details_block_markup', 10, 2 );
PHP

And voila! The markup is now modified for every instance of the block.
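Earlier I mentioned data-* attributes as another thing you might want to add. Here is a minimal sketch of that using the set_attribute method. The helper name and the data-faq attribute are hypothetical and chosen purely for illustration.

```php
<?php
/**
 * Hypothetical helper: adds a data attribute to the first summary tag.
 *
 * @param string $html_string The HTML to modify.
 *
 * @return string The modified or original HTML.
 */
function example_add_summary_data_attribute( $html_string ) {
	$processor = new WP_HTML_Tag_Processor( $html_string );

	// Only write the attribute if a summary tag was actually found.
	if ( $processor->next_tag( 'summary' ) ) {
		// set_attribute adds the attribute, or replaces its value if it already exists.
		$processor->set_attribute( 'data-faq', 'true' );
	}

	return $processor->get_updated_html();
}
```

Inside a render_block callback you would use the same two calls on the $html instance, right where add_class is called in the example above.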

Getting fancy

The above is a pretty straightforward example and with the above code we are modifying every details block on a page and giving all of the summary elements the same class name. But let’s say Stu the content editor really needs a way to add any class name he wants to the summary elements for each block individually.

We can’t let Stu down, so let’s explore an example of doing this.

For the sake of our example, let’s assume you’ve already added an additional attribute to the core/details block called summaryClass. If that attribute is registered to this block and doesn’t have a source of attribute,[2] then the attribute value will be visible inside the attrs key of the $block param.
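To see where that value comes from, here is a small sketch that assumes the summaryClass attribute has been saved with the block. Attributes without a source are stored as JSON in the block’s comment delimiter, and parse_blocks turns that JSON into the attrs array. The post content string and the faq-question value are made up for illustration.

```php
<?php
// Hypothetical saved post content: the attribute lives in the opening block comment.
$post_content = '<!-- wp:details {"summaryClass":"faq-question"} -->'
	. '<details class="wp-block-details"><summary>Question</summary></details>'
	. '<!-- /wp:details -->';

// parse_blocks returns one parsed block array per top-level block.
$blocks = parse_blocks( $post_content );

// The JSON from the comment delimiter ends up under the attrs key.
$summary_class = $blocks[0]['attrs']['summaryClass'] ?? '';
```

The render_block filter hands your callback this same kind of array, which is why the attribute can be read from $block['attrs'].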

You can use some simple conditional logic to get that attribute’s value and then only modify the block markup if there’s actually a value set. Here’s how that might look:

functions.php
<?php
/**
 * Modify the core/details block markup
 * 
 * @param string $block_content The block HTML markup to be rendered.
 * @param array $block The parsed block.
 * 
 * @return string The modified or original block content.
 */
function update_details_block_markup( $block_content, $block ) {
	// Get the value of the summaryClass attribute.
	$summary_class = $block['attrs']['summaryClass'] ?? '';

	/**
	 * Return the original content if the block name doesn't match or there's no 
	 * summary class to add.
	 */ 
	if ( $block['blockName'] !== 'core/details' || !$summary_class ) {
		return $block_content;
	}

	// Load the HTML we want the tag processor to use.
	$html = new WP_HTML_Tag_Processor( $block_content );

	// Check if a summary tag is found.
	$has_summary = $html->next_tag( 'summary' );

	if ( $has_summary ) {
		$html->add_class( $summary_class );

		// Return the modified HTML (as a string) if we changed it.
		return $html->get_updated_html();
	}

	return $block_content;
}
add_filter( 'render_block', 'update_details_block_markup', 10, 2 );
PHP

Next Steps

Now that you’ve gotten a taste of what the tag processor can do, I encourage you to look at the documentation to learn more about how you can use this handy class. There are additional considerations to make if you’re looping over elements or doing other more involved selection and manipulation and you can see some more examples and details about how that works there.
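As a small taste of looping, here is a minimal sketch that visits every link in a chunk of HTML. The helper name and the details-link class are hypothetical. It relies on next_tag returning false once there are no more matching tags.

```php
<?php
/**
 * Hypothetical helper: adds a class to every link in the given HTML.
 *
 * @param string $html_string The HTML to modify.
 *
 * @return string The modified or original HTML.
 */
function example_mark_all_links( $html_string ) {
	$processor = new WP_HTML_Tag_Processor( $html_string );

	// Each call moves forward to the next <a> tag; the loop ends when none are left.
	while ( $processor->next_tag( 'a' ) ) {
		$processor->add_class( 'details-link' );
	}

	return $processor->get_updated_html();
}
```

One thing to keep in mind is that the processor only moves forward through the markup, so changes to a tag need to happen while it is the current tag.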

Happy coding!

Further Reading

  1. See the WP_HTML_Tag_Processor docs section Design and Limitations for more details on how the class differs from DOMDocument.
  2. Block attributes with a source value of attribute get their value from an HTML tag attribute (e.g., the href of a link). This is a pattern most commonly used when you’re authoring your own blocks, and you’re unlikely to ever need it if you’re adding attributes to an existing block like core/details.

