PHPDeveloper: PHP News, Views and Community

Matthew Weier O'Phinney's Blog:
mbstring comes to the rescue

by Chris Cornutt May 17, 2006 @ 10:49:23

Character encodings in PHP can be a pain, to say the least, especially when dealing with XML. Matthew Weier O'Phinney found this out first-hand when a script he was working with had a mixed character set in one of its strings, which caused problems for the XML parser behind SimpleXML.

I tried a number of solutions, hoping actually to automate it via mbstring INI settings; these schemes all failed. iconv didn't work properly. The only thing that did work was to convert the encoding to latin1 -- but this wreaked havoc with actual UTF-8 characters.

Then, through a series of trial-and-error, all-or-nothing shots, I stumbled on a simple solution.

The discovery was to detect the encoding of the string itself (not really the content) and convert everything in it to that encoding. How, you might ask? With the handy mb_detect_encoding and mb_convert_encoding functions. Of course, the mbstring extension has to be compiled into PHP, but it's well worth it if it's exactly what you need.
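As a rough illustration of the detect-then-convert idea, here is a minimal sketch. It is not Matthew's actual code: the function name normalizeToUtf8, the candidate encoding list, and the choice of UTF-8 as the target are all assumptions made for the example, and it handles a plain Latin-1 string rather than a truly mixed one. It needs the mbstring and SimpleXML extensions and PHP 7 or later for the type declarations.

```php
<?php
// Hypothetical helper: detect the encoding of a string and convert it to UTF-8.
// The candidate list is an assumption; adjust it to the encodings you expect.
function normalizeToUtf8(string $input, array $candidates = ['UTF-8', 'ISO-8859-1']): string
{
    // Strict mode rejects a candidate if the bytes are not valid in it.
    $detected = mb_detect_encoding($input, $candidates, true);
    if ($detected === false) {
        throw new RuntimeException('Could not detect the encoding');
    }
    return mb_convert_encoding($input, 'UTF-8', $detected);
}

// "café" with the e-acute stored as a single Latin-1 byte (invalid as UTF-8).
$latin1 = "<note>caf\xE9</note>";

$utf8 = normalizeToUtf8($latin1);
$xml  = simplexml_load_string($utf8);

echo (string) $xml, "\n";
```

Passing the candidate list explicitly avoids depending on the mbstring INI settings, which the quoted post reports as unreliable for this purpose.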

tagged: mbstring xml simplexml encoding utf-8 detect convert

David's Blog:
PHP, XML, and Unicode

by Chris Cornutt Mar 02, 2006 @ 00:22:42

David has posted the results from some of his testing with the Unicode support that's available in PHP, specifically in the context of XML.

A couple of weeks ago, Tim Bray posted about PHP and received a firestorm of comments.

As Tim updated his posting with comments, he linked to a two-year-old posting by Steve Minutillo about PHP4's inability to detect character encodings in XML files and other Unicode bugs. That caught me by surprise - after all, PHP uses the venerable Expat as its XML parsing engine (the same engine used in most programming environments other than Java), and if Expat wasn't getting things right, then the PHP people must have gone way out of their way to misconfigure it.

So, in the rest of the post, he sets about checking these results for himself, testing with PHP versions 4.4.0 and 5.0.5. He shows the code that he used to create the tests - it produces UTF-8 encoded text regardless of the encoding of the input. He ran into some issues along the way, but some of them are just due to the considerable ambiguity around XML creation and handling in PHP.

tagged: xml unicode utf-8 test support version 4.4.0 5.0.5

SitePoint PHP Blog:
PHP UTF-8 0.1

by Chris Cornutt Feb 28, 2006 @ 12:54:57

In this post from the SitePoint PHP Blog, Harry Fuecks introduces PHP UTF-8, a new package he has put together to make it possible for PHP to handle UTF-8 encoded strings.

Been messing around with bits of this code for a long time, in fact since first really getting to grips with Dokuwiki, but finally got the first release out.

PHP UTF-8 is intended to make it possible to handle UTF-8 encoded strings in PHP, without requiring the mbstring extension (although it uses mbstring if it's available). In short, it provides versions of PHP's string functions (pretty much everything you'll find on this list), prefixed with utf8_ and aware of UTF-8 encoding (that 1 character >= 1 byte). It also gives you some tools to help check UTF-8 strings for "well formedness", strip bad sequences and some "ASCII helpers".
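To make the "1 character >= 1 byte" point concrete, here is a small sketch of what UTF-8 awareness without mbstring can look like. The function names sketch_utf8_strlen and sketch_utf8_is_valid are made up for this example and are not the library's API; the approach (counting non-continuation bytes, and leaning on PCRE's /u modifier for validation) is just one possible technique and may differ from what the package does.

```php
<?php
// Hypothetical helpers for illustration only, not the PHP UTF-8 library's API.

// Count characters by counting every byte that is NOT a UTF-8
// continuation byte (continuation bytes have the bit pattern 10xxxxxx).
// Assumes the input is well-formed UTF-8.
function sketch_utf8_strlen(string $s): int
{
    $count = 0;
    $bytes = strlen($s);
    for ($i = 0; $i < $bytes; $i++) {
        if ((ord($s[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

// With the /u modifier, preg_match() returns false when the subject is
// not valid UTF-8, so an empty pattern works as a well-formedness check.
function sketch_utf8_is_valid(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

$word = "na\xC3\xAFve"; // "naïve": the i-diaeresis takes two bytes in UTF-8

echo strlen($word), "\n";             // byte count
echo sketch_utf8_strlen($word), "\n"; // character count
var_dump(sketch_utf8_is_valid($word));
var_dump(sketch_utf8_is_valid("caf\xE9")); // lone Latin-1 byte
```

The byte count and the character count differ for any string containing non-ASCII characters, which is exactly the gap the byte-oriented built-in functions do not see.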

He goes on to mention where some of the code came from and notes that the documentation exists but is sparse. He also warns against using the library "blindly": use it only where you need it, not as a wholesale replacement for PHP's standard string functions.

tagged: sitepoint utf-8 mbstring handle string encoded

SitePoint PHP Blog:
Living Dangerously with PHP and UTF-8

by Chris Cornutt Dec 07, 2005 @ 13:45:38

In this new post on the SitePoint PHP Blog today, Harry looks at why it's "living dangerously" to use PHP with UTF-8.

Quick one: knocked up a list of "dangerous" functions and functionality in PHP, in relation to the use of UTF-8, available at http://www.phpwact.org/php/i18n/utf-8. These are for a "default" PHP setup without the mbstring overloading or PHP6 (where charset problems "magically vanish" ;) ).

This follows on from (unfinished) stuff here on charsets (tending towards UTF-8), which should help explain some of this.

He also notes that you can't rely on mbstring to be there, so he offers an alternative...
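For a feel of what "dangerous" means here, the sketch below shows two byte-oriented built-ins damaging a UTF-8 string. The choice of substr and strrev is an assumption made for illustration, not a summary of Harry's list, and the mb_substr fallback only runs where mbstring happens to be loaded.

```php
<?php
// "été" in UTF-8: each e-acute is the two-byte sequence C3 A9.
$s = "\xC3\xA9t\xC3\xA9";

// substr() counts bytes, so taking "one character" splits the first
// e-acute and leaves a lone lead byte behind.
$cut = substr($s, 0, 1);

// strrev() reverses bytes, which swaps the lead and continuation bytes.
$reversed = strrev($s);

// preg_match() with /u returns false for malformed UTF-8 subjects.
$cutIsValid      = preg_match('//u', $cut) === 1;
$reversedIsValid = preg_match('//u', $reversed) === 1;

echo bin2hex($cut), "\n";
var_dump($cutIsValid, $reversedIsValid);

// Where mbstring is available, the character-aware variant keeps
// the first character intact.
if (function_exists('mb_substr')) {
    echo bin2hex(mb_substr($s, 0, 1, 'UTF-8')), "\n";
}
```

Neither call raises an error or warning; the damage only shows up later, for instance when the result is fed to an XML parser or echoed into a UTF-8 page.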

tagged: utf-8 mbstring functions
