php – Find text offset into document for DOM attribute

Exception or error:

How can I find the offset of a particular node or attribute using the PHP DOM extension (or another extension or library if necessary)?

For example, say I have this HTML document:

<html><a href="/foo">bar</a></html>

And using the following code (with appropriate modifications):

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a/@href');
foreach($nodes as $href) {
    // Find start of $href attribute here
    echo $href->something;
}

I’d expect to see the output 15 or something to that effect, to indicate that the attribute starts at character 15 into the document.

There seems to be the method DOMNode::getLineNo() which returns the line number – this is similar to what I want but I can’t find an alternative for the general offset into the text.

How to solve:

After finding the attribute you want,

  • replace its value with a unique value that never occurs anywhere else in the document
  • dump the DOMDocument to HTML again
  • search for the position of the unique value in the resulting string
$html = <<<HTML
<html><a href="/foo">bar</a></html>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a/@href');

$mySecretId = 'abc123';
foreach($nodes as $href) {
    $href->value = $mySecretId;
}

$html = $dom->saveHTML();
echo strpos($html, $mySecretId) . "\n";

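strpos() returns the byte offset of the first occurrence of the replaced value, which is the position you want. Keep in mind that this offset refers to the string produced by saveHTML(), which can differ from the original source if libxml normalizes the markup.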

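Note the flags LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD: they stop libxml from adding implied html/body elements and a default doctype, so the serialized output stays close to the input.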
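To illustrate why the flags matter for the offset, here is a minimal sketch that serializes the same input with and without them. The variable names are made up for this example; only the DOM calls and flags come from the answer above.

```php
<?php
$html = '<html><a href="/foo">bar</a></html>';

// Default behaviour: libxml adds a doctype and the implied <body> element,
// which shifts every offset in the serialized output.
$withDefaults = new DOMDocument;
$withDefaults->loadHTML($html);
$defaultOutput = $withDefaults->saveHTML();

// With both flags, no doctype or implied elements are added.
$withFlags = new DOMDocument;
$withFlags->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$flaggedOutput = $withFlags->saveHTML();

echo $defaultOutput, "\n", $flaggedOutput;
```

Comparing the two strings shows how much the extra markup would move the reported position.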

If you want to find all positions of the matched elements, do:

foreach($nodes as $href) {
    $previousValue = $href->value;
    $href->value = $mySecretId;
    $html = $dom->saveHTML();
    echo strpos($html, $mySecretId) . "\n";
    $href->value = $previousValue;
}
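The steps above could be wrapped into a reusable function. The following is only a sketch under some assumptions: the function name attributeOffsets() and the marker format are hypothetical, and the marker is generated randomly and re-generated until it does not occur in the input, so that it really is unique. The returned offsets are byte offsets into the string produced by saveHTML(), not into the original source.

```php
<?php
// Hypothetical helper: returns the byte offset of each matched attribute
// value in the re-serialized HTML (not in the original source string).
function attributeOffsets(string $html, string $query): array
{
    $dom = new DOMDocument;
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($dom);

    // Generate a marker and retry until it does not occur in the input.
    do {
        $marker = 'marker' . bin2hex(random_bytes(8));
    } while (strpos($html, $marker) !== false);

    $offsets = [];
    foreach ($xpath->query($query) as $attr) {
        $previousValue = $attr->value;
        $attr->value = $marker;                  // temporarily tag this attribute
        $offsets[] = strpos($dom->saveHTML(), $marker);
        $attr->value = $previousValue;           // restore before the next one
    }

    return $offsets;
}

$result = attributeOffsets(
    '<html><a href="/foo">bar</a><a href="/baz">qux</a></html>',
    '//a/@href'
);
var_dump($result);
```

Because the document is serialized once per matched attribute, this gets slow for large documents with many matches.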

Answer:

Assumptions

The following is based on some assumptions:

  • only a.href attributes are handled; covering more attributes would make the regular expression pattern (too) complicated
  • a.href attributes are always enclosed in double quotes (") and the attribute value is not empty
  • if an href attribute occurs more than once in the same tag, the last occurrence takes precedence

Code using preg_match_all with offset-capture

<?php
// define some HTML, could be retrieved by e.g. file_get_contents() as well
$html = <<< HTML
<!DOCTYPE html>
<html lang="en">
<body>
<a href="https://google.com/">Google</a><div><a href=
"https://stackoverflow.com/">StackOverflow</a></div>
<A HREF="https://google.com/" href="https://goo.gl/">
Google URL</a>
</body>
</html>
HTML;

// search href attributes in anchor tags (case insensitive);
// \b keeps the pattern from matching other tags that start with "a", e.g. <abbr>
preg_match_all(
    '#<a\b[^>]*\s+href\s*=\s*"(?P<value>[^"]*)"[^>]*>#mis',
    $html,
    $matches,
    PREG_OFFSET_CAPTURE
);

$positions = array_map(
    function (array $match) {
        // PREG_OFFSET_CAPTURE reports byte offsets, so measure the length in bytes too
        $length = strlen($match[0]);
        return [
            'value' => $match[0],
            'length' => $length,
            'start' => $match[1],
            'end' => $match[1] + $length,
        ];
    },
    $matches['value']
);

var_dump($positions);

This outputs position information like the following. Note that the last entry uses the second href attribute of an anchor tag that defines href twice, and that all offsets are byte offsets into $html.

array(3) {
  [0] => array(4) {
    'value' => string(19) "https://google.com/"
    'length' => int(19)
    'start' => int(49)
    'end' => int(68)
  }
  [1] => array(4) {
    'value' => string(26) "https://stackoverflow.com/"
    'length' => int(26)
    'start' => int(95)
    'end' => int(121)
  }
  [2] => array(4) {
    'value' => string(15) "https://goo.gl/"
    'length' => int(15)
    'start' => int(183)
    'end' => int(198)
  }
}
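Since PREG_OFFSET_CAPTURE reports byte offsets, the numbers differ from character positions as soon as the document contains multi-byte characters. A minimal sketch of a conversion, assuming valid UTF-8 input and the mbstring extension; the helper name byteToCharOffset() is hypothetical:

```php
<?php
// Hypothetical helper: convert a byte offset into a character offset,
// assuming $text is valid UTF-8 and the offset lies on a character boundary.
function byteToCharOffset(string $text, int $byteOffset): int
{
    return mb_strlen(substr($text, 0, $byteOffset), 'UTF-8');
}

// "\u{e9}" is the two-byte UTF-8 character e-acute.
$html = '<p>caf' . "\u{e9}" . '</p><a href="/foo">bar</a>';
preg_match('#href="(?P<value>[^"]*)"#', $html, $match, PREG_OFFSET_CAPTURE);

$byteOffset = $match['value'][1];                   // 21
$charOffset = byteToCharOffset($html, $byteOffset); // 20

echo $byteOffset, ' ', $charOffset, "\n";
```

For pure ASCII documents both values are identical, so the conversion only matters when the markup contains non-ASCII text.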
