
David Sklar:
Fixing Broken UTF-8
Aug 27, 2015 @ 10:48:29

David Sklar has a post on his site showing how to fix broken UTF-8 characters in content that has been passed through PHP's plain string functions.

When working on the i18n bits of Learning PHP 7, I had a problem. My example showing how plain string functions such as strtolower() and strtoupper() mangle multibyte UTF-8 characters was making the book formatting/rendering pipeline barf. The processing tools are expecting nicely formatted, valid, UTF-8 encoded HTMLBook files. It didn’t like the mangled invalid UTF-8 characters in my example output.

To fix this, I wrote the following function to replace invalid UTF-8 sequences with the Unicode Replacement Character (U+FFFD).

He includes the code for this function, which walks through the string character by character and checks the bytes it contains to see how each one needs to be translated. There are plenty of comments in it too, explaining what it's doing as it goes along.
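
As a rough illustration of the same goal (this is not Sklar's function, which inspects the bytes by hand), the sketch below leans on the mbstring extension instead. It sets the substitution character to U+FFFD and converts the string from UTF-8 to UTF-8 so that invalid sequences are swapped out. The name replace_invalid_utf8() is made up for this example, and the code assumes mbstring is available.

```php
<?php
// Hypothetical helper: replaces invalid UTF-8 sequences with U+FFFD.
// Assumes the mbstring extension is loaded.
function replace_invalid_utf8(string $text): string
{
    // Remember the current substitution setting so it can be restored.
    $previous = mb_substitute_character();

    // Use the Unicode Replacement Character for anything that cannot be decoded.
    mb_substitute_character(0xFFFD);

    // Converting UTF-8 to UTF-8 forces mbstring to validate every sequence.
    $clean = mb_convert_encoding($text, 'UTF-8', 'UTF-8');

    mb_substitute_character($previous);

    return $clean;
}

// "\xC3" starts a two-byte sequence but is followed by a space, so it is invalid.
$fixed = replace_invalid_utf8("caf\xC3 au lait");
```

How many replacement characters are emitted for one broken sequence may differ between PHP versions, so it is safer to check that the result is valid UTF-8 than to count them.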

tagged: fix broken utf8 character function example unicode replacement

Link: http://www.sklar.com/php/2015/08/25/fixing-broken-utf8/

Three Devs & A Maybe Podcast:
Understanding Character Sets and Encodings
May 14, 2014 @ 13:12:06

The Three Devs & A Maybe podcast (with hosts Michael Budd, Fraser Hart, Lewis Cains and Edd Mann) has posted their latest episode (#24) talking about character sets and encodings.

Having only just recently been bit by the character encoding issue again, we thought it would be a good time to bring it up on the podcast. Starting from the beginning with ASCII, we move on to discuss how 8-bit compatible machines brought way to the ISO-8859-* standards. This leads us on to Unicode, with the goal to develop a single character-set encoding standard that could support all of the world's scripts. Finally, we discuss the de facto character encoding implementation used on the web today 'UTF-8', and reasons why this is the case.

Lots of different topics are mentioned including reversing a Unicode string in PHP using UTF-16BE/LE, portable UTF-8 and a YouTube video covering Pragmatic Unicode. You can listen to this new episode through the in-page player, by downloading the mp3 or subscribing to their feed.
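
The UTF-16BE/LE reversal trick mentioned above can be sketched roughly as follows. This is one common way to write it, not necessarily the version discussed in the episode; reverse_utf8() is a name chosen for this example and mbstring is assumed to be available.

```php
<?php
// Hypothetical helper: reverses a UTF-8 string by way of UTF-16.
// Only reliable for characters in the Basic Multilingual Plane; surrogate
// pairs and combining marks would end up in the wrong order.
function reverse_utf8(string $text): string
{
    // Every BMP character becomes exactly two bytes, high byte first.
    $utf16be = mb_convert_encoding($text, 'UTF-16BE', 'UTF-8');

    // Reversing the bytes flips both the character order and the byte order
    // inside each character, so the result reads correctly as UTF-16LE.
    return mb_convert_encoding(strrev($utf16be), 'UTF-8', 'UTF-16LE');
}

$reversed = reverse_utf8("h\u{E9}llo"); // "h", e-acute, "llo"
```

A plain strrev() on the UTF-8 bytes would split the two-byte e-acute and produce invalid output, which is why the detour through a fixed-width encoding helps here.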

tagged: threedevsandamaybe podcast ep24 unicode character set encoding utf8

Link: http://threedevsandamaybe.com/posts/understanding-character-sets-and-encodings/

Christian Weiske:
PHP: Cannot access property started with '\0'
Nov 08, 2013 @ 09:59:13

Christian Weiske had an interesting situation pop up in one of his applications around access to a property with an unusual name.

Some days ago I saw the following fatal error for the first time in my life:

Fatal error: Cannot access property started with '\0' in file.php

After some debugging, I found out that the source of the error was not some strange BOM or UTF-8 characters in PHP source code files. No, it was a combination of protected class properties, object-to-array casting and automatic template property mappings.

As it turns out, there was a change in how object-to-array casting was done in PHP 5.3 that made this break (related to the prefixes added to private and protected property names in the resulting array keys). He includes a bit of sample code to illustrate the problem: a simple class converted from object to array with direct casting. He does point out that it doesn't happen with get_object_vars, though, as that doesn't do the casting, just extraction.
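
A minimal sketch of the behavior described above, using a made-up Account class rather than Weiske's own sample code. It shows the NUL-prefixed keys that a direct cast produces and the plain keys that get_object_vars() returns when called from outside the class.

```php
<?php
// Hypothetical class used only for this illustration.
class Account
{
    public $name = 'demo';
    protected $email = 'demo@example.org';
    private $id = 42;
}

$account = new Account();

// Casting keeps every property but marks visibility in the key:
// protected keys get "\0*\0" in front, private keys get "\0ClassName\0".
$castKeys = array_keys((array) $account);

// get_object_vars() only extracts what is accessible from the calling scope,
// so from out here it yields the public property with an unprefixed key.
$visibleKeys = array_keys(get_object_vars($account));
```

Feeding the cast array's keys back into something that sets properties by name is what triggers the fatal error, since the key for $email starts with a NUL byte.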

tagged: class property special character private protected casting

Link: http://cweiske.de/tagebuch/php-property-started-nul.htm

An Introduction to Ctype Functions
Apr 30, 2013 @ 11:38:32

On PHPMaster.com today David Shirey has written up a new tutorial introducing the ctype functions in PHP. This set of functions provides a handy way to check values and ensure they're valid (and contain what they should).

If you have a background in C, then you’re probably already familiar with the character type functions because that is where they come from (don’t forget that PHP is actually written in C). But if you’re into Python, then it’s only fair to point out that the PHP Ctype functions have absolutely nothing to do with the Python’s ctypes library. It’s just one of those tragic and totally unavoidable naming similarities.

He briefly explains how the functions work and at least one "gotcha" to watch out for if you're using them for input validation. He then goes through the list of the eleven ctype functions and briefly describes what they do. Some example code is also included showing how you can use them to validate a value based on the true/false return from the function call.
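
As a small sketch of that kind of validation (not taken from the tutorial), the example below wraps ctype_digit() in a helper. The name is_valid_quantity() is invented here, and the ctype extension is assumed to be enabled, which is the default in most builds.

```php
<?php
// Hypothetical validator: accepts only strings made up entirely of digits.
function is_valid_quantity(string $input): bool
{
    // ctype_digit() returns true only if every character is 0-9.
    // An empty string returns false, as do signs and decimal points.
    return ctype_digit($input);
}

$results = [
    '42'  => is_valid_quantity('42'),
    '4.2' => is_valid_quantity('4.2'),
    '-1'  => is_valid_quantity('-1'),
    ''    => is_valid_quantity(''),
];
```

Keeping the argument typed as a string matters: the ctype functions treat small integers as character codes rather than as numbers, so passing raw integers can give surprising answers.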

tagged: ctype function introduction tutorial character type

Link: http://phpmaster.com/an-introduction-to-ctype-functions

Gareth Heyes:
PHP nonalpha tutorial
Aug 22, 2012 @ 08:53:02

Gareth Heyes has another post to his site on the topic of "non-alpha PHP code", this time getting a bit more into the process and how his examples are parsed by PHP into more familiar functionality.

My first post on PHP non-alpha numeric code was a bit brief, in the excitement of the discovery I failed to detail in depth the process. I’ve decided to follow up with a tutorial and hopefully explain the process better for anyone wanting to learn or improve the technique. The basis of PHP non-alphanumeric code is to take advantage of the fact that PHP automatically converts Arrays into a string “Array” when using in a string context.

He includes some basic examples showing how, with just a combination of things like "+", "_" and "[" or "]" you can reproduce similar output to echoing out an array and use that "Array" output string to get to other strings (like the letter "B"). There's also a more lengthy example showing how to build up the string "print 1+1" and have it execute using this technique.
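
To make the underlying mechanics easier to follow, here is the same idea written with ordinary variable names instead of symbols. This is a readable sketch of the principle, not Heyes' actual non-alphanumeric code; the @ operator is only there to silence the "Array to string conversion" warning.

```php
<?php
$array = [];

// Using an array in a string context yields the literal string "Array".
$word = @('' . $array);

// Individual letters can then be picked out by offset...
$first = $word[0];

// ...and incrementing a letter steps to the next one in the alphabet.
$next = $first;
$next++;
```

In the non-alpha versions, each of these steps is expressed with variables named using underscores and with numbers built from expressions, but the conversions involved are the same.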

tagged: interpretation nonalpha tutorial special character


Special characters in Regular Expressions - Part 1
Jun 12, 2012 @ 08:39:11

On the Refulz.com site they've posted the first part of a series about the basics of using special characters in regular expressions (both in PHP and outside of it).

With this post, we continue to explore the Regular expressions. The first post of the Learning Regular Expression series introduced Regular Expressions. The first post covers the regular expression delimiters and the “i” pattern modifier. In the language of regular expression, there is a special meaning of certain characters.

In this article they show the use of characters like the caret, asterisk, dot and dollar symbol to modify your expressions to handle special cases, matching for more than one character and the start and end of strings.
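
A brief sketch of those four characters in action with preg_match(). The patterns and test strings are made up for this example and are not taken from the Refulz article.

```php
<?php
// Hypothetical helper that turns preg_match()'s 1/0 result into a boolean.
function matches(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

$startsWithPhp = matches('/^PHP/', 'PHP rocks');   // caret: anchor at the start
$endsWithPhp   = matches('/PHP$/', 'I like PHP');  // dollar: anchor at the end
$anyMiddleChar = matches('/P.P/', 'PxP');          // dot: exactly one character
$zeroOrMoreH   = matches('/PH*P/', 'PP');          // asterisk: zero or more "H"
```

Note that the dot needs a character to consume, so '/P.P/' does not match "PP", while the asterisk is happy with none at all.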

tagged: regularexpression tutorial introduction special character


Let's talk Character Encoding
Mar 15, 2012 @ 11:07:07

On Reddit.com there's a recent post with a growing discussion about character encodings in PHP applications (with various recommendations).

I would rather not have to convert these weird characters to the HTML character entities, if possible. I'd rather be able to use these characters directly on the web page. If this is for some reason a bad idea, let me know. This might be more of a general web design question (i already posted it there), but I figured it is still appropriate to post here as well since PHP is used to pull an entry from the database, and I figured a lot of you here would know the answer to the question.

The general consensus is to use UTF-8 in this case, but there are a few reminders for the poster too:

  • Don't forget to make the database UTF-8 too
  • Be sure you're sending the right Content-Type for the UTF-8 data
  • a link to an article about what "developers must know about unicode/charactersets"
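
The first two reminders might look something like the sketch below. The helper names, host and database name are placeholders invented for this example, and it assumes MySQL accessed through PDO, where utf8mb4 is the character set that covers all of UTF-8.

```php
<?php
// Hypothetical helper: the header value that tells the browser the page is UTF-8.
function utf8_content_type(string $mime = 'text/html'): string
{
    return "Content-Type: {$mime}; charset=UTF-8";
}

// Hypothetical helper: a PDO DSN that asks MySQL to talk UTF-8 on the connection.
function utf8_mysql_dsn(string $host, string $database): string
{
    return "mysql:host={$host};dbname={$database};charset=utf8mb4";
}

$headerLine = utf8_content_type();
$dsn = utf8_mysql_dsn('localhost', 'example_db'); // placeholder values

// In a real page these would be used roughly like this:
// header($headerLine);
// $pdo = new PDO($dsn, $user, $password);
```

The tables and columns themselves also need a UTF-8 character set; the connection setting only controls how data travels between PHP and the database.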

tagged: character encoding advice reddit utf8 contenttype unicode


James Cohen's Blog:
How to Avoid Character Encoding Problems in PHP
Apr 25, 2011 @ 14:13:14

James Cohen has a recent post to his blog looking at a way you can avoid some of the character encoding problems in PHP that can come with working with multiple character sets.

Character sets can be confusing at the best of times. This post aims to explain the potential problems and suggest solutions. Although this is applied to PHP and a typical LAMP stack you can apply the same principles to any multi-tier stack.

He includes a "boring history" section (and recommends skipping it if you just want the good stuff) that talks a bit about character sets and their history in computer systems. All that said, he recommends using UTF-8 to ease your character encoding woes. He talks about configuring your editor to support it, making sure your browsers understand it and setting up your MySQL database connection to use it.

tagged: character encoding issue mysql browser editor ide


Create a Localized Web Page with PHP
Oct 21, 2010 @ 13:21:23

On WebReference.com there's a new tutorial posted about localizing your website by defining a character set to use for your content.

The process of making your applications/websites usable in many different locales is called internationalization, while customizing your code for different locales is called localization. Localization is the process of making your applications or websites local to where it is being viewed. For example, you can make a website more local to a particular place by converting its text to the predominate language of that location and by displaying the local time (e.g. German for people living in Germany or French for people living in France).

They show how to define constants that can be used in your application for the character set and language encoding. They use two major encodings, UTF-8 and ISO-8859-1, in their examples, which show a sample "welcome" message in different languages. There's also a simple page to show you how to switch between languages if you'd like to give your visitors the option.
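
A minimal sketch of that approach, with constant names, language codes and translations chosen for this example rather than copied from the WebReference tutorial.

```php
<?php
// Hypothetical application-wide settings.
define('SITE_CHARSET', 'UTF-8');
define('DEFAULT_LANGUAGE', 'en');

// Hypothetical helper: returns the welcome text for a language code,
// falling back to the default language when the code is unknown.
function welcome_message(string $language): string
{
    $messages = [
        'en' => 'Welcome',
        'de' => 'Willkommen',
        'fr' => 'Bienvenue',
    ];

    return $messages[$language] ?? $messages[DEFAULT_LANGUAGE];
}

$german   = welcome_message('de');
$fallback = welcome_message('xx'); // unknown code, so the English text is used
```

A language switcher page would typically store the visitor's choice (in the session or a cookie, for instance) and pass it to a function like this on each request.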

tagged: localize tutorial language encoding character


Kevin Schroeder's Blog:
You want to do WHAT with PHP? Chapter 3
Aug 31, 2010 @ 13:44:32

Kevin Schroeder has posted another excerpt from his "You Want to Do WHAT with PHP?" book to his blog today. This time it's from the third chapter that looks at character encodings like UTF-8 or ISO-8859-1.

I realized that while this 3.5-year PHP consultant knew Unicode, UTF-8, character encodings such as ISO-8859-1 or ISO-8859-7, I didn't understand them as well as I thought I had. With that I threw this chapter in the book. Knowing about character encoding is what many developers have. Not as many truly understand it. In this chapter I try to de-mystify character encoding as a whole.

The excerpt introduces character encoding and what it really is: a translation that lets the computer handle human language. The problem comes in when multiple tools try to define the same sort of letters/characters in different ways. He gives an example of a "hello world" string in a normal ASCII format versus one from the EBCDIC format and how it would be rendered by an ASCII-understanding browser.
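
To give a feel for that ASCII versus EBCDIC mismatch, the sketch below compares the bytes for the word "Hello" in both encodings. It is not from the book excerpt; the EBCDIC bytes are written out by hand from the common code page 037 layout, and hex_bytes() is a helper made up for this example.

```php
<?php
// Hypothetical helper: shows a string's bytes as space-separated hex pairs.
function hex_bytes(string $bytes): string
{
    return implode(' ', str_split(bin2hex($bytes), 2));
}

$ascii  = "Hello";
$ebcdic = "\xC8\x85\x93\x93\x96"; // "Hello" in EBCDIC (code page 037 assumed)

$asciiHex  = hex_bytes($ascii);
$ebcdicHex = hex_bytes($ebcdic);

// None of the EBCDIC bytes fall in the 7-bit ASCII range, so software
// expecting ASCII cannot show them as the intended letters.
$looksLikeAscii = preg_match('/^[\x00-\x7F]*$/', $ebcdic) === 1;
```

The text is the same in both cases; only the agreed mapping between numbers and letters differs, which is exactly what an encoding is.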

tagged: character encoding book excerpt ascii example