Detecting the UTF-8 Signature in text files
Mike Peters, 06-21-2009
One common trap a lot of new developers fall into is editing PHP source files with an editor that saves text files as Unicode, specifically as UTF-8 with a signature (also called a byte order mark, or BOM).
Notepad, for example, may switch to saving a text file as Unicode once the file contains a character outside plain ASCII.
Other editors can be configured to save files either as Unicode or as plain ASCII text.
What's wrong with Unicode PHP source files?
A Unicode text file saved this way loads normally in your text editor, but three hidden bytes are added at the top of the file to indicate that it is a UTF-8 file.
These bytes are: 0xEF 0xBB 0xBF
The big problem is that PHP does not recognize these bytes as part of the script. Because they sit before the opening <?php tag, PHP treats them as ordinary page content and sends them to the browser. Since they are the very first thing output, the HTTP headers have already been sent by the time session_start() runs. Unless you have output buffering turned on, the script cannot set its session cookie and will be unable to store any session variables.
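To see the signature for yourself, the following is a minimal shell sketch, not part of the original post. It writes two small demo files (the names plain.php and bom.php and the helper first_bytes are made up for illustration) and prints the first three bytes of each as hex. It assumes head, od and tr are available, as they are on most Unix-like systems.

```shell
#!/bin/sh
# Hypothetical demo files in a temporary directory.
dir=$(mktemp -d)

# plain.php: starts directly with "<?php".
printf '<?php echo "hi";\n' > "$dir/plain.php"

# bom.php: same content, preceded by the UTF-8 signature EF BB BF
# (written as the octal escapes 357 273 277, which POSIX printf understands).
printf '\357\273\277<?php echo "hi";\n' > "$dir/bom.php"

# first_bytes FILE: print the first three bytes of FILE as lowercase hex.
first_bytes() {
    head -c 3 "$1" | od -An -tx1 | tr -d ' \n'
}

echo "plain.php: $(first_bytes "$dir/plain.php")"
echo "bom.php:   $(first_bytes "$dir/bom.php")"
```

The first line printed ends in 3c3f70 (the bytes of "<?p"), the second in efbbbf, even though both files look identical in most editors.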
Example
Consider this innocent PHP script, which displays a message on a user's first visit to the site:
<?php
session_start();
if (!isset($_SESSION['beenhere']))
{
    echo "Welcome! This is your first visit to the site";
}
$_SESSION['beenhere'] = "true";
Saving this PHP script in an editor that writes the UTF-8 signature results in a file that actually looks like this (the three signature bytes are shown as ^xEF^xBB^xBF):
^xEF^xBB^xBF<?php
session_start();
if (!isset($_SESSION['beenhere']))
{
    echo "Welcome! This is your first visit to the site";
}
$_SESSION['beenhere'] = "true";
The end user will see three strange-looking characters at the top of the page. More importantly, since these bytes are output before the call to session_start(), PHP has already sent the response headers. As a result, session_start() cannot send the session cookie (PHP typically warns that headers were already sent), and no session variables are carried over to the next request.
If you use session variables widely, this small thing can completely break the functionality of your pages.
How to detect UTF-8 Signatures in source files
To detect the UTF-8 signature, save this bash script as 'checkutf.sh':
#!/bin/bash
find . -type f -name '*.php' -print0 | while IFS= read -r -d '' f; do [ "$(hexdump -n 3 -e '3/1 "%02x"' "$f")" = "efbbbf" ] && echo "$f"; done
Be sure to chmod 755 checkutf.sh
Then you can run checkutf.sh (by its full path, or by name if it is on your PATH) from any folder where you'd like to verify that no files have the UTF-8 signature. The script recursively checks all .php files under the current directory and its subdirectories, and prints every file that starts with the UTF-8 signature. Only the first three bytes of each file are compared, and file names containing spaces are handled too.
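The same check can also be wrapped in a small function for specific files, for instance as a step in a deploy script. This is only a sketch under a few assumptions: the function names has_utf8_signature and check_files are invented here, and it uses od instead of hexdump purely because od is specified by POSIX.

```shell
#!/bin/sh
# has_utf8_signature FILE: succeed (exit status 0) only if FILE begins
# with the three bytes EF BB BF.
has_utf8_signature() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# check_files FILE...: print each file that carries the signature and
# return 1 if at least one was found, 0 otherwise.
check_files() {
    found=0
    for f in "$@"; do
        if has_utf8_signature "$f"; then
            echo "UTF-8 signature found: $f"
            found=1
        fi
    done
    return "$found"
}
```

A deploy script could then call something like check_files index.php lib/*.php and stop when the return status is non-zero.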
Mike Peters, 06-21-2009
Note: If you have output_buffering turned on in your php.ini, the UTF-8 signature will not get in the way of session_start(), because the three bytes are held in the buffer instead of being sent before the headers. They are still output at the top of the page, though.
I believe PHP 5 also has built-in support for ignoring it.
However, for backward compatibility and for the sake of keeping things clean, I suggest making sure your favorite editor saves all files as plain text without the UTF-8 signature, avoiding it altogether.
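If a file already carries the signature and re-saving it from an editor is not convenient, the three bytes can also be removed from the command line. The following is a rough sketch rather than a tested tool: the function name strip_utf8_signature is made up, and it assumes head, od, tr, tail and mktemp are available. Try it on a copy first.

```shell
#!/bin/sh
# strip_utf8_signature FILE: remove a leading EF BB BF from FILE, if present.
# Files without the signature are left untouched.
strip_utf8_signature() {
    if [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]; then
        tmp=$(mktemp) || return 1
        # tail -c +4 copies the file starting at its fourth byte,
        # i.e. everything after the three signature bytes.
        # Writing back with cat keeps the original file's permissions.
        tail -c +4 "$1" > "$tmp" && cat "$tmp" > "$1"
        rc=$?
        rm -f "$tmp"
        return "$rc"
    fi
}
```

Calling strip_utf8_signature on a PHP file that has the signature leaves it starting with <?php as its very first bytes.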