PHP multibyte preg_split() with PREG_SPLIT_OFFSET_CAPTURE

Exception or error:

I want to use preg_split() with its PREG_SPLIT_OFFSET_CAPTURE option to capture both the word and the index where it begins in the original string.

However, my string contains multibyte characters, which throws off the counts. There doesn’t seem to be an mb_ equivalent of preg_split(). What are my options?

Example:

$text = "Hello world — goodbye";

$words = preg_split("/(\w+)/x",
                    $text,
                    -1,
                    PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_OFFSET_CAPTURE);

foreach($words as $word) {
    print("$word[0]: $word[1]<br>");
}

This outputs:

Hello: 0
: 5
world: 6
— : 11
goodbye: 16

preg_split() reports offsets in bytes, not characters. Because the dash is an em-dash rather than a standard hyphen, it’s a multibyte character (3 bytes in UTF-8), so the offset of “goodbye” comes out as 16 instead of 14.
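To make the discrepancy concrete, here is a small sketch comparing byte-based and character-based measurements of the same string. It assumes the source file is saved as UTF-8 and the mbstring extension is loaded; byteToCharOffset() is a hypothetical helper name, not a built-in.

```php
<?php
// Assumes this file is saved as UTF-8 and the mbstring extension is loaded.
$text = "Hello world — goodbye";

$dashBytes = strlen("—");               // 3: bytes in the em-dash
$dashChars = mb_strlen("—", "UTF-8");   // 1: characters in the em-dash

$byteOffset = strpos($text, "goodbye");                 // 16
$charOffset = mb_strpos($text, "goodbye", 0, "UTF-8");  // 14

// Hypothetical helper (not a built-in): count the characters in the
// prefix that ends at the given byte offset.
function byteToCharOffset(string $text, int $byteOffset): int
{
    return mb_strlen(substr($text, 0, $byteOffset), "UTF-8");
}

print(byteToCharOffset($text, $byteOffset)); // 14
```

The same prefix-counting idea could in principle be applied to each offset returned by preg_split(), since those offsets always fall on character boundaries.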

How to solve:

This is kind of a hack, but it seems to work. Use str_replace() to replace the multi-byte character with a single-byte character, then run preg_split() on the resulting string.

$text = 'Hello world — goodbye';
$mb = '—';
$rplmnt = "X";

// Replace the multi-byte character with a single-byte one so that byte
// offsets equal character offsets, then split and print each piece.
function chkPlc($text, $mb, $rplmnt){
    // str_replace() leaves the text unchanged when $mb is absent,
    // so no strpos() check is needed.
    $rpl = str_replace($mb, $rplmnt, $text);
    $words = preg_split("/(\w+)/x",
                    $rpl,
                    -1,
                    PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_OFFSET_CAPTURE);

    foreach($words as $word) {
        print("$word[0]: $word[1]<br>");
    }

    // Return the string with the multi-byte character replaced.
    return $rpl;
}

chkPlc($text, $mb, $rplmnt);

OUTPUTS:

Hello: 0
: 5
world: 6
: 11
X: 12
: 13
goodbye: 14

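A more in-depth function could first check that the single-byte replacement character does not already occur in the string, so that it can be told apart from the original text. Note that a word character such as X is itself matched by \w+ and comes back as its own token; a non-word replacement such as - keeps the tokens aligned with the original. Again, kind of a hack, but it works.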
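A minimal sketch of that check might look like the following. The function name pickReplacement() and the candidate list are assumptions made for illustration; any set of single-byte characters would do.

```php
<?php
// Hypothetical helper: return the first single-byte candidate that does
// not already occur in $text, or null if every candidate is taken.
function pickReplacement(string $text, array $candidates = ['-', '~', '#', '|', '^']): ?string
{
    foreach ($candidates as $candidate) {
        if (strpos($text, $candidate) === false) {
            return $candidate;
        }
    }
    return null;
}

$text = 'Hello world — goodbye';
$rplmnt = pickReplacement($text);   // '-' for this text

if ($rplmnt !== null) {
    // Same replacement step as above, with the chosen character.
    $rpl = str_replace('—', $rplmnt, $text);
    print($rpl);
}
```

All the default candidates are non-word characters, so the replacement stays inside the separator token instead of being split out as a word of its own.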

Answer:

Here’s another not-ideal solution: use mb_convert_encoding() to convert the text to a single-byte encoding such as ISO-8859-1, which gets rid of the multibyte characters. Characters that exist in ISO-8859-1 are kept as single bytes; anything else, such as the em-dash, is turned into a question mark.

So transforming $text before doing the preg_split() using this:

$text = mb_convert_encoding($text, "ISO-8859-1", "UTF-8");

Results in:

Hello: 0
: 5
world: 6
? : 11
goodbye: 14

Although it makes a mess of the text, you can still keep a copy of the original of course.
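Since every character becomes exactly one byte in the converted copy, its offsets and lengths can be used to cut the matching pieces out of the original. Here is a rough sketch of that idea; splitWithCharOffsets() is a hypothetical name, and it assumes UTF-8 input, the mbstring extension, and the default substitute character (a question mark) rather than a setting that drops unconvertible characters.

```php
<?php
// Hypothetical helper: split on a single-byte copy of the text, then
// recover each piece from the original UTF-8 string.
function splitWithCharOffsets(string $text): array
{
    // One byte per character, so byte offsets equal character offsets.
    $flat = mb_convert_encoding($text, "ISO-8859-1", "UTF-8");

    $parts = preg_split("/(\w+)/x",
                        $flat,
                        -1,
                        PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_OFFSET_CAPTURE);

    $result = [];
    foreach ($parts as [$token, $offset]) {
        // strlen($token) is the piece's length in characters as well.
        $result[] = [mb_substr($text, $offset, strlen($token), "UTF-8"), $offset];
    }
    return $result;
}

foreach (splitWithCharOffsets("Hello world — goodbye") as [$word, $offset]) {
    print("$word: $offset<br>");
}
```

One caveat: without the u modifier, \w may not treat accented ISO-8859-1 letters as word characters, so words containing them could be split differently than expected.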

I found it via this comment about the iconv() function.
