David Sklar:
Fixing Broken UTF-8
Aug 27, 2015 @ 10:48:29

David Sklar has a post on his site showing how to fix broken UTF-8 characters in content that has been passed through PHP's plain string functions.

When working on the i18n bits of Learning PHP 7, I had a problem. My example showing how plain string functions such as strtolower() and strtoupper() mangle multibyte UTF-8 characters was making the book formatting/rendering pipeline barf. The processing tools are expecting nicely formatted, valid, UTF-8 encoded HTMLBook files. It didn’t like the mangled invalid UTF-8 characters in my example output.

To fix this, I wrote the following function to replace invalid UTF-8 sequences with the Unicode Replacement Character (U+FFFD).

He includes the code for this function, which walks through the string character by character and checks the bytes of each one to decide whether it needs to be replaced. There are plenty of comments in it too, explaining what it's doing as it goes along.
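
As a rough illustration of the idea (not the function from the post), the sketch below replaces every byte that is not part of a well-formed UTF-8 sequence with U+FFFD. It swaps in a byte-oriented regular expression for the character-by-character walk described above, and the name replace_invalid_utf8() is made up for this example.

```php
<?php
// Hypothetical helper, not David Sklar's code: keeps well-formed UTF-8
// sequences and replaces every other byte with U+FFFD.
function replace_invalid_utf8(string $input): string
{
    // Byte patterns for well-formed UTF-8 (no overlong forms, no surrogates,
    // nothing above U+10FFFF). The pattern runs WITHOUT the /u modifier,
    // so PCRE treats the subject as raw bytes.
    $valid = '[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}';

    return preg_replace_callback(
        '/(' . $valid . ')|./s',
        function (array $m): string {
            // Group 1 is only set when a valid sequence matched;
            // otherwise the fallback "." consumed one stray byte.
            return (isset($m[1]) && $m[1] !== '') ? $m[0] : "\u{FFFD}";
        },
        $input
    );
}
```

Calling replace_invalid_utf8("a\xC3") should leave the "a" alone and turn the dangling lead byte into a single replacement character.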

tagged: fix broken utf8 character function example unicode replacement

Link: http://www.sklar.com/php/2015/08/25/fixing-broken-utf8/

Engine Yard Blog:
What to Expect When You're Expecting: PHP 7, Part 2
Apr 08, 2015 @ 11:07:08

The Engine Yard blog has posted the second part of Davey Shafik's "What to Expect When You're Expecting: PHP 7" series. In this new post he gets into the details of a few more of the upcoming PHP 7 features, including generator improvements and engine exceptions.

As you probably already know, PHP 7 is a thing, and it’s coming this year! Which makes this as good a time as any to go over what’s new and improved. In the first part of this series, we looked at some of the most important inconsistency fixes coming up in PHP 7 as well as two of the biggest new features. In this post, we take a look at another six big features to land in PHP 7 that you’ll want to know about.

The features he talks about this time are:

  • Unicode Codepoint Escape Syntax
  • Null Coalesce Operator
  • Bind Closure on Call
  • Group Use Declarations
  • Generator return expressions and delegation
  • Engine Exceptions

He also includes three things you can do to help out and get prepared for this upcoming release, including testing your code on a PHP 7 VM and helping with writing tests and documentation for PHP and its extensions.
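
For a quick feel of several of the listed features, here is a small sketch in PHP 7 syntax. It is not taken from the Engine Yard post, and the class, function and variable names are made up for illustration. Group use declarations appear only as a comment because they need real namespaces to be useful.

```php
<?php
// Group use declarations (shown as a comment; the namespace is hypothetical):
//   use App\Models\{User, Post, Comment};

// Unicode codepoint escape syntax: U+2603 written directly in a string.
$snowman = "\u{2603}"; // same bytes as "\xE2\x98\x83"

// Null coalesce operator: fall back when a key is missing or null.
$query = [];
$page = $query['page'] ?? 1;

// Bind closure on call: run a closure with $this bound to an object.
class Counter
{
    private $count = 41;
}
$next = function () {
    return $this->count + 1;
};
$answer = $next->call(new Counter());

// Generator return expressions and delegation via "yield from".
function inner()
{
    yield 1;
    yield 2;
    return 'inner done';
}
function outer()
{
    $result = yield from inner(); // receives inner()'s return value
    yield 3;
    return $result;
}
$gen = outer();
$values = iterator_to_array($gen, false); // false: ignore the yielded keys
$returned = $gen->getReturn();

// Engine exceptions: calling an undefined function throws a catchable Error.
try {
    undefined_function_for_demo();
} catch (Error $e) {
    $message = $e->getMessage();
}
```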

tagged: engineyard php7 feature list major unicode coalesceoperator bindclosure groupuse generator engineexception

Link: https://blog.engineyard.com/2015/what-to-expect-php-7-2

Halls of Valhalla:
From PHP 5 to 7
Sep 22, 2014 @ 10:56:32

On the "Halls of Valhalla" site there's a new post that tries to explain the jump from PHP 5 to PHP 7 and what it means for the language (and the community around it).

Since around 2005 we've heard talk about PHP 6 development. There have even been books sold about it. But where is it? As of July of this year it was decided that there won't be one and that PHP will skip directly to version 7. Why is it skipping to the next major version, and what ever happened with PHP 6? And if we're already jumping to PHP 7, what kinds of features will it have?

They start with a "brief history" of PHP since its inception back in the mid-1990s and follow its evolution at a high level through the years. Then comes the topic of PHP 6, covering the work that was already being put towards it and its integrated Unicode support. It talks about some of the difficulties of this conversion and the delays that ended up happening. Instead, it was decided that things would stay in the PHP 5.x series, and 5.3, 5.4 and 5.5 have been released since. The jump to PHP 7 came out of a vote, with several different reasons influencing the decision.

The post finishes with a look at some of the new things that will be coming in PHP 7, including major performance improvements, abstract syntax tree functionality and asynchronous programming that allows parallel tasks to run within the same request.

tagged: php5 php6 php7 community unicode language history features

Link: http://halls-of-valhalla.org/beta/news/from-php-5-to-7,146/

Three Devs & A Maybe Podcast:
Understanding Character Sets and Encodings
May 14, 2014 @ 13:12:06

The Three Devs & A Maybe podcast (with hosts Michael Budd, Fraser Hart, Lewis Cains and Edd Mann) has posted their latest episode (#24) talking about character sets and encodings.

Having only just recently been bit by the character encoding issue again, we thought it would be a good time to bring it up on the podcast. Starting from the beginning with ASCII, we move on to discuss how 8-bit compatible machines gave way to the ISO-8859-* standards. This leads us on to Unicode, with the goal to develop a single character-set encoding standard that could support all of the world's scripts. Finally, we discuss the de facto character encoding implementation used on the web today, 'UTF-8', and reasons why this is the case.

Lots of different topics are mentioned, including reversing a Unicode string in PHP using UTF-16BE/LE, Portable UTF-8 and a YouTube video covering Pragmatic Unicode. You can listen to this new episode through the in-page player, by downloading the mp3 or by subscribing to their feed.

tagged: threedevsandamaybe podcast ep24 unicode character set encoding utf8

Link: http://threedevsandamaybe.com/posts/understanding-character-sets-and-encodings/

Edd Mann:
Reversing a Unicode String in PHP using UTF-16BE/LE
May 12, 2014 @ 10:55:00

Edd Mann looks at an issue in his latest post that caused him problems in a recent project, reversing a Unicode string with UTF-16BE/LE.

Last week I was bit by the Unicode encoding issue when trying to naively manipulate a user's input using PHP's built-in string functions. PHP simply assumes that all characters are a single byte (octet) and the provided functions use this assumption when processing a string. [...] You should be aware that in 'Western Europe' we commonly only use the basic ASCII character-set (consisting of 7 bits). This makes the transition to the popular 'UTF-8' Unicode representation almost seamless, as the two map one-to-one. I wish to however, discuss how to reverse a Unicode string (UTF-8) using a combination of endianness magic and the 'strrev' function.

He provides two different approaches to the problem. The first he calls the "naive" approach because it corrupts characters needing more than the two-byte representation. His second solution, the "endianness" method, converts the string to big-endian UTF-16 first and then back to UTF-8 for more correct handling.
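
The endianness idea can be sketched roughly as follows. This is an illustration under assumptions rather than the code from the post: it relies on the iconv extension, the function name is made up, and it only behaves for characters that fit in a single 16-bit UTF-16 unit (characters outside the Basic Multilingual Plane would come back as swapped surrogates).

```php
<?php
// Hypothetical helper: reverse a UTF-8 string by going through UTF-16.
// Reversing the bytes of a UTF-16BE string yields the characters in reverse
// order with each 2-byte unit flipped, which is exactly UTF-16LE.
function utf8_strrev_utf16(string $text): string
{
    $utf16be = iconv('UTF-8', 'UTF-16BE', $text);
    if ($utf16be === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    $reversed = iconv('UTF-16LE', 'UTF-8', strrev($utf16be));
    if ($reversed === false) {
        throw new RuntimeException('Could not convert the reversed string back');
    }

    return $reversed;
}

$word = "h\u{00E9}llo";                  // "héllo", the é takes two bytes in UTF-8
$bytewise = strrev($word);               // splits the é and produces invalid UTF-8
$characterwise = utf8_strrev_utf16($word);
```

Here $characterwise holds the letters in reverse order with the é intact, while $bytewise contains the two bytes of the é in the wrong order.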

tagged: unicode string utf8 utf16 bigendian endian convert reverse string

Link: http://eddmann.com/posts/reversing-a-unicode-string-in-php-using-utf-16-be-le

SitePoint PHP Blog:
Bringing Unicode to PHP with Portable UTF-8
Sep 10, 2013 @ 11:19:05

On the SitePoint PHP blog there's a new tutorial showing you how to bring portable UTF-8 support to PHP with the Portable-UTF8 library. UTF-8 handling has long been desired in the core of PHP, but hasn't been introduced quite yet.

PHP’s lack of Unicode/multibyte support means that the standard string handling functions treat strings as a sequence of single-byte characters. In fact, the official manual defines a string in PHP as a “series of characters, where a character is the same as a byte.” PHP supports only 8-bit characters, while Unicode (and many other character sets) may require more than one byte to represent a character. This limitation of PHP affects almost all aspects of string manipulation, including (but not limited to) substring extraction, determining string lengths, string splitting, shuffling etc.

The article mentions some of the past efforts to introduce this functionality into the core, which were shelved at the time. Instead of waiting for the feature to arrive, they show you how to use the library to do things like check for UTF-8 strings, "clean" UTF-8 strings and do some validation on a string's contents.
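
To make the byte-versus-character problem concrete, here is a small sketch that uses only core PHP functions. It deliberately does not use the Portable-UTF8 library's API; the helper names are invented for this example and rely on PCRE's "/u" modifier for UTF-8 awareness.

```php
<?php
// Hypothetical helper: true when the string is well-formed UTF-8.
// With the /u modifier, preg_match() returns false for invalid UTF-8.
function is_valid_utf8(string $text): bool
{
    return preg_match('//u', $text) === 1;
}

// Hypothetical helper: split a UTF-8 string into characters, not bytes.
function utf8_chars(string $text): array
{
    $parts = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? [] : $parts; // invalid input yields an empty list
}

$text = "\u{00FC}ber";             // "über": 4 characters, 5 bytes in UTF-8

$byteCount = strlen($text);        // counts bytes
$charCount = count(utf8_chars($text));
$firstByte = str_split($text)[0];  // only half of the "ü"
$firstChar = utf8_chars($text)[0]; // the whole "ü"
```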

tagged: unicode portable utf8 library tutorial

Link: http://www.sitepoint.com/bringing-unicode-to-php-with-portable-utf8/

PHPMaster.com:
Working with Multibyte Strings
Jul 18, 2013 @ 10:12:55

On PHPMaster.com there's a tutorial posted that helps you understand how to work with multibyte strings in PHP. Multibyte strings hold characters that need more than one byte each, as is common in non-English text. They have to be treated differently than normal strings, using the mbstring functionality.

A written language, whether it’s English, Japanese, or whatever else, consists of a number of characters, so an essential problem when working with a language digitally is to find a way to represent each character in a digital manner. Back in the day we only needed to represent English characters, but it’s a whole different ball game today and the result is a bewildering number of character encoding schemes used to represent the characters of many different languages. How does PHP relate to and deal with these different schemes?

He goes through a bit of introduction to multibyte strings: how they're represented internally, character encoding schemes and Unicode. He also talks about PHP's support for these strings, noting that it's not really made to deal with them by default, and covers the two methods you might use, iconv and mbstring. He shows how to enable the latter and introduces some of the most common functions you'll use with it (complete with some code examples).
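
A few of the common mbstring calls look like this next to their byte-oriented counterparts. This is a minimal sketch that assumes the mbstring extension is enabled; the sample word is made up for illustration and is not from the tutorial.

```php
<?php
$word = "na\u{00EF}ve"; // "naïve": 5 characters, 6 bytes in UTF-8

// Byte-oriented function: counts bytes, not characters.
$byteLength = strlen($word);

// Multibyte-aware counterparts, with the encoding passed explicitly.
$charLength = mb_strlen($word, 'UTF-8');
$firstThree = mb_substr($word, 0, 3, 'UTF-8');
$upper      = mb_strtoupper($word, 'UTF-8');
```

Passing the encoding on every call avoids depending on whatever mb_internal_encoding() happens to be set to.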

tagged: multibyte strings tutorial mbstring introduction unicode

Link: http://phpmaster.com/working-with-multibyte-strings

Phil Sturgeon:
PHP 6: Pissing in the Wind
Jan 28, 2013 @ 10:42:16

With some of the recent talk about the consistency of method naming in PHP (or lack thereof), Phil Sturgeon has put together some ideas about why these changes (and Unicode support) aren't happening in the language.

PHP is well known for having an inconsistent API when it comes to PHP functions. Anyone with an anti-PHP point of view will use this as one of their top 3 arguments for why PHP sucks, while most PHP developers will point out that they don't really care. [...] Another big thing that anti-PHP folks laugh about is the lack of scalar objects, so instead of $string->length() you have to do strlen($string). ANOTHER thing that people often joke about is how PHP 6.0 just never happened, because the team were trying to bake in Unicode support but just came across so many issues that it never happened.

He shares an "obvious answer" to the problems and a theory as to why it's not happening: that no one is really working on it (outside of this POC), along with some of the handling of the recent property accessors RFC. He finishes off the post with three more points, all related to the results of the voting: little points seem to get voted in more easily, the representation of developers in the process, and that at least one of the "no" votes had to do with not wanting to maintain the results.

Making changes to this language should not be blocked just because a quiet minority of the core team don't like the idea of being asked to do stuff.

Be sure to check out the comments on the post. There are lots of them, so set aside some time to read.

tagged: opinion php6 unicode property accessors rfc voting

Link:

Reddit.com:
Let's talk Character Encoding
Mar 15, 2012 @ 11:07:07

On Reddit.com there's a recent post with a growing discussion about character encodings in PHP applications (with some various recommendations).

I would rather not have to convert these weird characters to the HTML character entities, if possible. I'd rather be able to use these characters directly on the web page. If this is for some reason a bad idea, let me know. This might be more of a general web design question (i already posted it there), but I figured it is still appropriate to post here as well since PHP is used to pull an entry from the database, and I figured a lot of you here would know the answer to the question.

The general consensus is to use UTF-8 in this case, but there are a few reminders for the poster too:

  • Don't forget to make the database UTF8 too
  • Be sure you're sending the right Content-Type for the UTF8 data
  • A link to an article about what "developers must know about unicode/charactersets"
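
The reminders above might translate into something like the following sketch. It is an assumption-heavy illustration: the thread does not say which database is in use, so MySQL via PDO is assumed here, and the helper names, host and database name are hypothetical.

```php
<?php
// Hypothetical helper: builds a PDO DSN for MySQL that requests a UTF-8
// connection (utf8mb4 is MySQL's name for full UTF-8).
function utf8_mysql_dsn(string $host, string $database): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $database);
}

// Hypothetical helper: escape markup but keep non-ASCII characters as they
// are, so no conversion to HTML character entities is needed.
function escape_utf8(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// In a web request, declare the encoding before any output is sent:
//   header('Content-Type: text/html; charset=utf-8');
// and connect with the UTF-8 DSN (credentials are placeholders):
//   $pdo = new PDO(utf8_mysql_dsn('localhost', 'example_db'), $user, $password);
```

The tables and columns themselves would also need a UTF-8 character set, which is a schema setting rather than something this snippet covers.
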

tagged: character encoding advice reddit utf8 contenttype unicode

Link:

Leonid Mamchenkov's Blog:
PHP regular expression to match English/Latin characters only
Aug 18, 2011 @ 12:11:44

Leonid Mamchenkov has a quick new post to his blog sharing a regular expression that can be used to check that a string contains only English or Latin characters (no Unicode allowed).

Today at work I came across a task which turned out to be much easier and simpler than I originally thought it would. We have a site with some user registration forms. The site is translated into a number of languages, but due to the regulatory procedures, we have to force users to input their registration details in English only. Using Latin characters, numbers, and punctuation.

Thankfully the PCRE regular expression engine bundled with PHP makes it simple: he uses a standard regular expression without anything special to accommodate Unicode characters. He notes that adding the "/u" modifier (which makes strings be treated as UTF-8) causes the expression to "totally malfunction". If you'd like an example of some of the tricks that go into supporting Unicode in a regex, see this comment in the PHP manual.
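
The summary does not reproduce the expression itself, so the following is only a guess at the general shape of such a check: a byte-oriented pattern, without the "/u" modifier, that accepts printable ASCII and nothing else. The function name and the exact character range are assumptions.

```php
<?php
// Hypothetical helper, not the expression from the post: true when the
// input consists solely of printable ASCII (letters, digits, punctuation
// and spaces). Any byte of a multibyte UTF-8 character falls outside
// \x20-\x7E, so non-Latin input is rejected.
function is_latin_only(string $input): bool
{
    return preg_match('/\A[\x20-\x7E]+\z/', $input) === 1;
}
```

The \A and \z anchors are used instead of ^ and $ so that a trailing newline is not silently accepted.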

tagged: regular expression example english latin unicode

Link: