php – Find text offset into document for DOM attribute

Exception or error:

How can I find the offset of a particular node or attribute using the PHP DOM extension (or another extension or library if necessary)?

For example, say I have this HTML document:

<html><a href="/foo">bar</a></html>

And using the following code (with appropriate modifications):

$dom = new DOMDocument;
$dom->loadHTML('<html><a href="/foo">bar</a></html>');
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a/@href');
foreach($nodes as $href) {
    // Find start of $href attribute here
    echo $href->something;
}

I’d expect to see the output 15 or something to that effect, to indicate that the attribute starts at character 15 into the document.

There seems to be the method DOMNode::getLineNo() which returns the line number – this is similar to what I want but I can’t find an alternative for the general offset into the text.

How to solve:

After finding the attribute you want,

  • replace its value with a unique value that never occurs in the document
  • dump the DOMDocument to HTML again
  • search for the unique value's position in the string
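$html = <<<HTML
<html><a href="/foo">bar</a></html>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a/@href');

$mySecretId = 'abc123';
foreach($nodes as $href) {
    // swap the attribute value for the unique marker
    $href->value = $mySecretId;
}

// serialize the document and locate the marker in the resulting string
$html = $dom->saveHTML();
echo strpos($html, $mySecretId) . "\n";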

strpos() gives you the first occurrence of the replaced value in the serialized HTML, which is the position you want.
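One thing that may be worth checking is which string the offset refers to: the marker is located in the output of saveHTML(), and that output is not necessarily identical to the markup that was loaded. The following is a minimal sketch, assuming loadHTML() is called with its default options; the variable names $source and $serialized are only illustrative.

```php
<?php
// The markup as originally written (taken from the question).
$source = '<html><a href="/foo">bar</a></html>';

$dom = new DOMDocument;
$dom->loadHTML($source);

// The markup as DOMDocument writes it back out.
$serialized = $dom->saveHTML();

// Offset of the href value in the original string.
echo strpos($source, '/foo'), "\n";

// Offset of the same value in the serialized string; with default options
// this is larger, because a DOCTYPE and a <body> wrapper have been added.
echo strpos($serialized, '/foo'), "\n";

// Print the serialized document to see what changed.
echo $serialized;
```

Passing LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD to loadHTML() may reduce these differences, but serialization can still normalize the markup, so the offset is best treated as relative to the serialized string.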


If you want to find all positions of the matched elements, do:

foreach($nodes as $href) {
    // temporarily replace the value, locate it, then restore it
    $previousValue = $href->value;
    $href->value = $mySecretId;
    $html = $dom->saveHTML();
    echo strpos($html, $mySecretId) . "\n";
    $href->value = $previousValue;
}
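The swap-and-restore loop could also be wrapped in a small helper that picks its own marker, so that the "never occurs in the document" condition is checked rather than assumed. This is only a sketch: the function name attributeOffsets and the marker format are hypothetical, and the offsets refer to the string returned by saveHTML().

```php
<?php
// Hypothetical helper: returns the offset of each attribute value within
// the string produced by $dom->saveHTML().
function attributeOffsets(DOMDocument $dom, DOMNodeList $attributes): array
{
    // Generate markers until one is found that does not occur in the document.
    do {
        $marker = 'marker_' . bin2hex(random_bytes(8));
    } while (strpos($dom->saveHTML(), $marker) !== false);

    $offsets = [];
    foreach ($attributes as $attribute) {
        $previousValue = $attribute->value;
        $attribute->value = $marker;          // temporarily swap in the marker
        $offsets[] = strpos($dom->saveHTML(), $marker);
        $attribute->value = $previousValue;   // restore the original value
    }

    return $offsets;
}

$dom = new DOMDocument;
$dom->loadHTML('<html><a href="/foo">bar</a><a href="/baz">qux</a></html>');
$xpath = new DOMXPath($dom);
$offsets = attributeOffsets($dom, $xpath->query('//a/@href'));

// Slice the serialized document at each offset to confirm it points at a value.
$html = $dom->saveHTML();
foreach ($offsets as $offset) {
    echo $offset, ' => ', substr($html, $offset, 4), "\n";
}
```

Because every value is restored before the next one is replaced, each offset is measured against the same serialized document.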



The following is based on some assumptions:

  • a.href attributes are the only candidates to be handled; supporting more attributes might make the regular expression pattern (too) complicated
  • a.href attribute values are always enclosed in double quotes " and must not be empty
  • if an href attribute occurs multiple times in the very same tag, the last occurrence takes precedence

Code using preg_match_all with offset-capture

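// define some HTML, could be retrieved by e.g. file_get_contents() as well
$html = <<< HTML
<!DOCTYPE html>
<html lang="en">
<a href="">Google</a><div><a href=
<A HREF="" href="">
Google URL</a>
HTML;

// search href attributes in anchor tags (case insensitive & multi-line)
// NOTE: this pattern is an assumption derived from the rules listed above
preg_match_all(
    '#<a\b[^>]*\shref\s*=\s*"([^"]+)"[^>]*>#i',
    $html,
    $matches,
    PREG_OFFSET_CAPTURE
);

$positions = array_map(
    function (array $match) {
        // offsets are reported in bytes, so measure the length in bytes too
        $length = strlen($match[0]);
        return [
            'value' => $match[0],
            'length' => $length,
            'start' => $match[1],
            'end' => $match[1] + $length,
        ];
    },
    $matches[1]
);
var_dump($positions);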

This will output position information like the following (note: the last position uses the second href attribute, which has been defined twice for the very same anchor tag):

array(3) {
  [0] => array(4) {
    'value' => string(19) ""
    'length' => int(19)
    'start' => int(49)
    'end' => int(68)
  }
  [1] => array(4) {
    'value' => string(26) ""
    'length' => int(26)
    'start' => int(95)
    'end' => int(121)
  }
  [2] => array(4) {
    'value' => string(15) ""
    'length' => int(15)
    'start' => int(183)
    'end' => int(198)
  }
}
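To try the regular-expression approach end to end, a self-contained sketch might look like the one below. The example.com, example.org and example.net URLs are placeholders chosen for illustration, the function name findHrefPositions is hypothetical, and the pattern is one possible reading of the three assumptions listed above rather than a definitive implementation.

```php
<?php
// Hypothetical helper: returns value, length, start and end (byte offsets)
// for the href of every anchor tag in $html.
function findHrefPositions(string $html): array
{
    // The greedy [^>]* makes the last href inside a tag win; the value must
    // be double-quoted and non-empty. The "i" flag ignores case.
    preg_match_all(
        '#<a\b[^>]*\shref\s*=\s*"([^"]+)"[^>]*>#i',
        $html,
        $matches,
        PREG_OFFSET_CAPTURE
    );

    return array_map(
        function (array $match) {
            $length = strlen($match[0]); // bytes, like the captured offset
            return [
                'value' => $match[0],
                'length' => $length,
                'start' => $match[1],
                'end' => $match[1] + $length,
            ];
        },
        $matches[1]
    );
}

// Placeholder markup: an href split across lines and a duplicated href.
$html = <<<HTML
<!DOCTYPE html>
<html lang="en">
<body>
<a href="https://example.com/">Example</a><div><a href=
"https://example.org/docs">Docs</a></div>
<A HREF="https://example.com/" href="https://example.net/">
Twice</a>
</body>
</html>
HTML;

$positions = findHrefPositions($html);

foreach ($positions as $position) {
    // Slicing the source at the reported offsets should give back the value.
    $slice = substr($html, $position['start'], $position['length']);
    printf("%d-%d %s\n", $position['start'], $position['end'], $slice);
}
```

Unlike the DOM-based approach, these offsets refer to the original source string, since the markup is never parsed and re-serialized.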
