PHP - Encoding

Tags: charset

A PHP string is a series of characters, where a character is the same as a byte. This means that PHP only supports a 256-character set and hence does not offer native Unicode support.
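A small sketch of what "a character is a byte" means in practice. The variable names are illustrative only, and escape sequences are used so that the result does not depend on how this snippet's own file is encoded.

```php
<?php
// "\xC3\xA1" is the UTF-8 encoding of the single character "á".
$s = "\xC3\xA1";

$byteCount = strlen($s);      // strlen() counts bytes, not characters
$firstByte = bin2hex($s[0]);  // indexing returns one byte, not one character

echo $byteCount, "\n";  // 2
echo $firstByte, "\n";  // c3
```

One visible character therefore reports a length of 2, and `$s[0]` on its own is not a valid UTF-8 string.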

1. Encoding: a string is encoded in whatever fashion the script file is encoded

Given that PHP does not dictate a specific encoding for strings, one might wonder how string literals are encoded. For instance, is the string "á" equivalent to "\xE1" (ISO-8859-1), "\xC3\xA1" (UTF-8, C form), "\x61\xCC\x81" (UTF-8, D form) or any other possible representation? The answer is that the string will be encoded in whatever fashion it is encoded in the script file. Thus, if the script is written in ISO-8859-1, the string will be encoded in ISO-8859-1, and so on. However, this does not apply if Zend Multibyte is enabled; in that case the script may be written in an arbitrary encoding (explicitly declared or detected) and is then converted to an internal encoding, which is the encoding used for the string literals.
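To make the three candidate representations concrete, here is a minimal sketch that spells them out as byte escapes and compares them. It assumes the iconv extension is available; the variable names are made up for illustration.

```php
<?php
$latin1  = "\xE1";          // "á" in ISO-8859-1 (1 byte)
$utf8Nfc = "\xC3\xA1";      // "á" in UTF-8, precomposed, C form (2 bytes)
$utf8Nfd = "\x61\xCC\x81";  // "a" + combining acute accent, D form (3 bytes)

// PHP compares strings byte by byte, so all three are different strings.
$allDiffer = $latin1 !== $utf8Nfc
          && $utf8Nfc !== $utf8Nfd
          && $latin1 !== $utf8Nfd;
var_dump($allDiffer);  // bool(true)

// Converting the ISO-8859-1 byte to UTF-8 yields the C-form bytes.
$converted = iconv('ISO-8859-1', 'UTF-8', $latin1);
echo bin2hex($converted), "\n";  // c3a1
```

The same visible "á" can thus be three unequal PHP strings, which is why the encoding of the script file (and of any input data) matters.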

2. BOM issue

A BOM (byte order mark) is a marker placed at the beginning of some encoded files or streams. For example, the UTF-8 BOM is the byte sequence EF BB BF.

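PHP does not handle a BOM in a script file gracefully: the BOM bytes sit before the opening PHP tag, so they are sent as output, and functions that must run before any output (e.g. session_start) then fail because the headers have already been sent. You can remove the BOM with an editor, for example with the save-as options in UltraEdit.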
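If no such editor is at hand, the BOM can also be removed programmatically. The following is only a sketch: strip_utf8_bom() is a hypothetical helper name, not a built-in PHP function, and it handles the UTF-8 BOM only.

```php
<?php
// Hypothetical helper: remove a leading UTF-8 BOM (EF BB BF) if present.
function strip_utf8_bom(string $text): string
{
    // strncmp() compares only the first 3 bytes of both strings.
    if (strncmp($text, "\xEF\xBB\xBF", 3) === 0) {
        return substr($text, 3);
    }
    return $text;
}

$withBom = "\xEF\xBB\xBFhello";
echo strlen($withBom), "\n";                  // 8
echo strlen(strip_utf8_bom($withBom)), "\n";  // 5
```

The same check could be applied to a file's contents after reading them with file_get_contents(), before writing the cleaned text back.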

3. String functions

$chars = htmlentities($chars, ENT_QUOTES, 'UTF-8');
$chars = htmlspecialchars($chars, ENT_QUOTES, 'UTF-8');
$chars = html_entity_decode($chars, ENT_QUOTES, 'UTF-8');
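The three calls do different things, which a short round trip may make clearer. This sketch uses an illustrative input string with "café" written as UTF-8 bytes; the variable names are assumptions.

```php
<?php
$raw = "<a href='x'>caf\xC3\xA9</a>";  // "café" as UTF-8 bytes

// Escapes only the HTML special characters: & < > " '
$special = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

// Additionally converts every character that has a named entity (é becomes &eacute;).
$entities = htmlentities($raw, ENT_QUOTES, 'UTF-8');

// Reverses htmlentities(), producing UTF-8 bytes again.
$decoded = html_entity_decode($entities, ENT_QUOTES, 'UTF-8');

echo $special, "\n";
echo $entities, "\n";
var_dump($decoded === $raw);  // bool(true)
```

Passing the charset explicitly matters because these functions need to know where multibyte characters begin and end.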

$chars = iconv('Big5', 'UTF-8', $chars); // convert string from Big5 to UTF-8
iconv_strlen — Returns the character count of string
iconv_strpos — Finds position of first occurrence of a needle within a haystack
iconv_strrpos — Finds the last occurrence of a needle within a haystack
iconv_substr — Cut out part of a string
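A brief sketch of how the iconv_* string functions differ from their byte-based counterparts, assuming the iconv extension is enabled. The sample string is "a我是" written as UTF-8 byte escapes.

```php
<?php
$s = "a\xE6\x88\x91\xE6\x98\xAF";  // "a我是" as UTF-8 bytes

$bytes = strlen($s);                 // byte count
$chars = iconv_strlen($s, 'UTF-8');  // character count

// Position and substring are measured in characters, not bytes.
$pos = iconv_strpos($s, "\xE6\x98\xAF", 0, 'UTF-8');  // position of "是"
$sub = iconv_substr($s, 1, 1, 'UTF-8');               // second character, "我"

echo $bytes, "\n";         // 7
echo $chars, "\n";         // 3
echo $pos, "\n";           // 2
echo bin2hex($sub), "\n";  // e68891
```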

4. UTF-8: get_utf8_sub_string() for strings that contain Chinese characters

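A Chinese character typically occupies 3 bytes in UTF-8. The function below relies on that: it treats every non-ASCII character as exactly 3 bytes long.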

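echo get_utf8_sub_string("a我是标题", 8); // prints "a我是标..."

function get_utf8_sub_string($str, $max_length)
{
  // Truncate only when the string is longer than $max_length bytes.
  if (strlen($str) > $max_length)
  {
    // Count the bytes that belong to multibyte characters among the first
    // $max_length bytes. Every byte of a multibyte UTF-8 character is >= 0x80.
    $check_num = 0;
    for ($i = 0; $i < $max_length; $i++)
    {
      if (ord($str[$i]) >= 0x80)
        $check_num++;
    }

    // With 3-byte characters, $check_num % 3 is the number of bytes of a
    // partially included character; extend the cut so that character is complete.
    if ($check_num % 3 == 0)
      $str = substr($str, 0, $max_length)."...";
    else if ($check_num % 3 == 1)
      $str = substr($str, 0, $max_length + 2)."...";
    else if ($check_num % 3 == 2)
      $str = substr($str, 0, $max_length + 1)."...";
  }
  return $str;
}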
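The function above depends on every non-ASCII character being 3 bytes long, so it can still split 2-byte or 4-byte characters. A possible alternative, sketched here under the assumption that the input is valid UTF-8, steps back from the cut position to the nearest character boundary instead. Note that this is a different technique: it rounds the cut down rather than up, and utf8_truncate_bytes() is a hypothetical name.

```php
<?php
// Hypothetical helper: cut $str to at most $maxBytes bytes without
// splitting a UTF-8 character. Assumes $str is valid UTF-8.
function utf8_truncate_bytes(string $str, int $maxBytes, string $suffix = '...'): string
{
    if (strlen($str) <= $maxBytes) {
        return $str;
    }

    $cut = $maxBytes;
    // Continuation bytes have the bit pattern 10xxxxxx (0x80 to 0xBF).
    // If the first excluded byte is one of them, the cut falls inside a
    // character, so move back until it sits on a lead byte or ASCII byte.
    while ($cut > 0 && (ord($str[$cut]) & 0xC0) === 0x80) {
        $cut--;
    }

    return substr($str, 0, $cut) . $suffix;
}

// Assumes this file is saved as UTF-8.
echo utf8_truncate_bytes("a我是标题", 8), "\n";  // a我是...
```

With a limit of 8 bytes the cut would land inside "标", so the sketch drops that character entirely, whereas get_utf8_sub_string() extends the cut to include it.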