Sean Coates has reposted an article that was originally published in php|architect magazine covering UTF-8 and proper Unicode encoding.
If I had to guess, I would estimate that I've spent somewhere in the range of 40 hours wrangling UTF-8 in the past 3 months, which is not only expensive for my employer, but also disheartening as a developer who's got real work to do. Admittedly, this number is inflated, due to the heavy development cycle we completed with the launch of our new site.
Sean goes on to talk about Unicode issues in general (partially supported in some places, too many points of failure) and some of his other experiences with "the UTF-8 monster" that have given him trouble over time.
In a new post to his blog, Vinu Thomas talks about a set of functions that can make your life easier when handling Unicode strings: the mb_* functions of the mbstring extension.
When dealing with multiple languages and internalization in PHP, some of the default functions in PHP end up mangling up the unicode characters in PHP. This is evident when you have a lot of funny looking characters coming up on your web page instead of the actual characters. [...] There is an extensions called mbstring which you can install in PHP which gives you a set of functions which are unicode ( actually multibyte ) ready.
He mentions some of the replacements, like mb_send_mail instead of mail and mb_strlen instead of the usual strlen. Thankfully, there's a simple way to make use of these functions without having to replace a lot of code: a setting in your php.ini (mbstring.func_overload) that tells PHP to swap in the multibyte versions behind the scenes.
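The difference between the byte-oriented functions and their multibyte replacements can be seen in a short sketch. It assumes the mbstring extension is loaded and that the source file is saved as UTF-8; the sample string and variable names are only illustrations. The mb_* functions are called explicitly here rather than through mbstring.func_overload, since that setting was removed in PHP 8.0.

```php
<?php
// Illustrative string: 5 characters, but "é" takes two bytes in UTF-8.
$word = "héllo";

// strlen() counts bytes, not characters.
$byteLength = strlen($word);

// mb_strlen() counts characters when told the string's encoding.
$charLength = mb_strlen($word, 'UTF-8');

// substr() works on bytes and can cut a multibyte character in half;
// mb_substr() works on whole characters.
$firstTwoBytes = substr($word, 0, 2);
$firstTwoChars = mb_substr($word, 0, 2, 'UTF-8');

printf("bytes: %d, characters: %d\n", $byteLength, $charLength);
printf("mb_substr: %s\n", $firstTwoChars);
printf("substr result is valid UTF-8: %s\n",
    mb_check_encoding($firstTwoBytes, 'UTF-8') ? 'yes' : 'no');
```

The broken half-character left behind by substr is the kind of thing that shows up as a "funny looking" character on a page.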
Wen Huang has made a quick post to his blog about some of the comments Andrei Zmievski made about the future of PHP, specifically on internationalization and UTF-8's place in it.
I attended the SF PHP Meetup last night where Andrei Zmievski (PHP 6 release manager and PHP core team member) gave a talk on PHP 6 and internationalization (i18n). [...] It was evident that Andrei and team have given quite a bit of thought into what i18n means for the PHP world, and as a result, PHP developers everywhere will soon be enjoying a new set of tools to enable faster development of multi-lingual sites.
He also mentions that several of these features will be back-ported to the upcoming PHP 5.3 release (along with the much-hyped namespace support). You can check out Andrei's talk on his website.
On the ThinkPHP blog, Florian Eibeck has posted an overview of some key things to consider when internationalizing your application/website.
The biggest problem is that most developers lack knowledge about Internationalisation, Localisation, Character encodings, Unicode and all those terms connected with multilingualism. The following article should give you a basic understanding and show you how to avoid those funny characters.
He defines a few terms: internationalization, ASCII, Unicode and the UTF-8/ISO-8859 character sets. He also covers how to accept a UTF-8 string into your application, how to work with it in PHP and how to store it in a MySQL database.
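As a rough illustration of the "accept a UTF-8 string" step, the sketch below checks whether incoming text is valid UTF-8 and converts it if not. It is not taken from Florian's article: the helper name toUtf8 is hypothetical, and the fallback to ISO-8859-1 is an assumption made only for this example. The mbstring extension is required.

```php
<?php
// Hypothetical helper: return $input as valid UTF-8. Any input that is
// not valid UTF-8 is assumed to be ISO-8859-1 (an assumption made for
// this sketch; a real application has to know its input encodings).
function toUtf8(string $input): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($input, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "caf\xE9";     // "café" encoded as ISO-8859-1
$utf8   = "caf\xC3\xA9"; // "café" encoded as UTF-8

var_dump(toUtf8($latin1) === $utf8);
var_dump(toUtf8($utf8) === $utf8);
```

On the storage side, the database connection has to use a matching character set as well (with mysqli, for instance, via set_charset), otherwise correctly encoded strings can still be mangled on the way into MySQL.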
In a new article on IBM developerWorks, Nathan Good takes a look at some of the features of the upcoming version of the PHP language, including namespaces, changes in the XML handling and a few things that have been taken out.
PHP's next edition, V6, includes new features and syntax improvements that will make it easier to use from an object-oriented standpoint. Other important features, such as Unicode support in many of the core functions, mean that PHP V6 is positioned for better international support and robustness.
New features he mentions include namespace support and improvements to the native Unicode support. He also covers a few things that will be permanently retired, like the php.ini settings for magic_quotes and register_globals.
The main PHP.net website has posted a list of people participating in this year's Google Summer of Code project on various PHP projects. These include:
PHP Optimizer by Samuel Graham Kelly IV, mentored by Derick Rethans
On the Make Me Pulse blog, there's a look at PHP6's support of Unicode in the SPL (Standard PHP Library) TextIterator handler.
I've just install the last version of PHP6 dev and I've decided to test the famous new feature, the PHP Unicode Support. I will not explain new things about PHP6 or Unicode or TextIterator, it's just my discoveries test on this features.
He steps through the process he followed - enabling Unicode support, testing various output methods (including just an echo and using the TextIterator) as well as some of the manipulation methods (next/first/current) that can be used to get certain characters out of a string.
Internationalizing a website can bring all sorts of challenges, as Markus Wolff found out when working on a recent project:
When you're building international websites, there's always something new to learn. Especially if one of the languages your website is available in uses a character set different from anything you're used to. For jimdo.com, the greatest challenge as of yet is the chinese version.
His focus isn't so much on the content of the page as on one small character that caused him headaches: the comma. Unfortunately, Unicode has its own comma characters that don't behave like the "normal" ASCII comma, which made them hard to work with (in his case, to split on with a regular expression). The fix was simple, though: adding a "u" modifier after the expression made it Unicode-aware and split the information correctly.
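The effect of the "u" modifier can be sketched as follows. The pattern and the sample string are hypothetical, since the post's actual expression isn't quoted here; the example assumes the source file is saved as UTF-8.

```php
<?php
// Illustrative list mixing a full-width comma (U+FF0C) and an ASCII comma.
$tags = "苹果,香蕉, 橙子";

// With the "u" modifier, pattern and subject are treated as UTF-8, so
// each comma in the character class is matched as one whole character.
// Without it, the class would be read byte by byte, and a split could
// land in the middle of a multibyte character.
$parts = preg_split('/[,,、]\s*/u', $tags);

print_r($parts);
```

The character class lists the ASCII comma, the full-width comma and the ideographic comma (U+3001); other separators could be added in the same way.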
Matthew Turland, like many in the PHP community these days, has been pondering PHP6 and what it might mean to both the developers and to the language itself:
Be that as it may, as a user with about five years under his belt as of now who has seen all qualities of PHP code ranging from pristine to pedantic, a thought or two has crossed my mind on the subject of the latest upcoming incarnation of the language.
Specifically, he talks about some of the hot topics like namespaces, the mysqlnd driver, Unicode support, and named parameters. Overall, though, his comments are positive, looking toward a brighter future for PHP with this upcoming edition.
The folks over at php|architect magazine have updated the free issue they're offering to anyone looking to get a taste of the great content inside each issue. Sean Coates writes:
We've recently updated our web site to offer a new free issue of php|architect magazine! The May 2007 edition of php|architect has proven to be extremely popular, and with PHP 6 on the horizon, we thought everyone should read the cover article on Unicode, so we're releasing it completely free (and without obligation) to registered users of our web site.
Other topics covered in the issue include working with server/client-side validation, preventing SQL injections, a look at the Model View Controller design pattern and dictionary attacks.