A text-truncating function that preserves both HTML tags and full words

Akshaya K Sahu’s answer to this question at Stack Overflow is a great example of a text-parsing function that truncates an HTML string to a given length while taking care of everything that matters, i.e.:

  • full words,
  • properly closed HTML tags and
  • correctly handled UTF-8 encoding (multi-byte characters!)

I have nothing to add to it, so I keep a copy of the code here purely for my own reference, and only because the original answer lacks comments.

/**
 * Truncates a string at the given length while keeping HTML tags closed and
 * full words intact.
 *
 * Note that keeping both HTML tags and full words intact in the truncated
 * string is a fairly involved process, so don't expect the string to be
 * truncated at exactly the given length. Treat this value as approximate.
 *
 * Source: http://stackoverflow.com/a/8741240/1469208
 * 
 * @param  string  $html      Input string containing HTML code.
 * @param  integer $maxLength Approximate length of string. See above notice.
 * 
 * @return string             Parsed string, truncated at given length.
 */
public static function htmlSafeStringTruncate($html, $maxLength = 150)
{
    mb_internal_encoding("UTF-8");
    $printedLength = 0;
    $position = 0;
    $tags = array();
    $newContent = '';

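    // Strip <img> tags first. A "~" delimiter is used because an unescaped "/"
    // inside the pattern would end a "/"-delimited regex early.
    $html = preg_replace('~<img\b[^>]*>~i', '', $html);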

    while($printedLength < $maxLength && preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position))
    {
        list($tag, $tagPosition) = $match[0];

        /**
         * Print text leading up to the tag.
         */
        $str = mb_strcut($html, $position, $tagPosition - $position);

        if($printedLength + mb_strlen($str) > $maxLength)
        {
            // Cut by characters (not bytes), since $printedLength counts characters.
            $newstr = mb_substr($str, 0, $maxLength - $printedLength);
            // Drop the trailing partial word together with the whitespace before it.
            $newstr = preg_replace('~\s+\S+$~', '', $newstr);
            $newContent .= $newstr;
            $printedLength = $maxLength;

            break;
        }

        $newContent .= $str;
        $printedLength += mb_strlen($str);

        if($tag[0] == '&')
        {
            /**
             * Handle the entity.
             */
            $newContent .= $tag;

            $printedLength++;
        }
        else
        {
            /**
             * Handle the tag.
             */
            $tagName = $match[1][0];

            if($tag[1] == '/')
            {
                /**
                 * This is a closing tag.
                 */
                $openingTag = array_pop($tags);

                assert($openingTag == $tagName);

                $newContent .= $tag;
            }
            else if(substr($tag, -2) == '/>')
            {
                /**
                 * This is a self-closing tag.
                 */
                $newContent .= $tag;
            }
            else
            {
                /**
                 * This is an opening tag.
                 */
                $newContent .= $tag;

                $tags[] = $tagName;
            }
        }

        /**
         * Continue after the tag. PREG_OFFSET_CAPTURE reports byte offsets,
         * so the tag length must be measured in bytes as well.
         */
        $position = $tagPosition + strlen($tag);
    }

    /**
     * Print any remaining text.
     */
    if ($printedLength < $maxLength && $position < strlen($html))
    {
        // $position is a byte offset; the remaining budget is in characters.
        $newstr = mb_substr(substr($html, $position), 0, $maxLength - $printedLength);
        // Drop the trailing partial word together with the whitespace before it.
        $newstr = preg_replace('~\s+\S+$~', '', $newstr);

        $newContent .= $newstr;
    }

    /**
     * Close any remaining open tags.
     */
    while (!empty($tags))
    {
        $newContent .= sprintf('</%s>', array_pop($tags));
    }

    return $newContent;
}
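
Two details of the function above are easy to get wrong, so here is a minimal sketch of both in isolation. It assumes PHP 7 or newer with the mbstring extension. The helper name trimToWholeWords and the sample strings are my own illustrations and are not part of the original answer. The first part shows the word-boundary trim. The second shows that PREG_OFFSET_CAPTURE reports offsets in bytes, not characters, which matters as soon as the text contains multi-byte characters.

```php
<?php
mb_internal_encoding('UTF-8');

/**
 * Hypothetical helper: cuts $text to at most $limit characters and then
 * drops the last (possibly partial) word with the whitespace before it.
 */
function trimToWholeWords(string $text, int $limit): string
{
    if (mb_strlen($text) <= $limit) {
        return $text;
    }

    // Character-based cut, so a multi-byte character is never split.
    $cut = mb_substr($text, 0, $limit);

    // \s+\S+$ matches the final whitespace run plus whatever follows it.
    return preg_replace('~\s+\S+$~u', '', $cut);
}

echo trimToWholeWords('Zażółć gęślą jaźń', 10), "\n"; // prints "Zażółć"

// Byte offsets versus character counts.
$html = 'żółw <b>x</b>';
preg_match('~<[a-z]+>~', $html, $match, PREG_OFFSET_CAPTURE);

$byteOffset  = $match[0][1];                    // 8: "ż", "ó" and "ł" take two bytes each
$leadingText = mb_strcut($html, 0, $byteOffset); // byte-based cut: "żółw "

echo $byteOffset, ' bytes, ', mb_strlen($leadingText), " characters\n"; // prints "8 bytes, 5 characters"
```

Note that the trim removes the last word even when the cut happens to land exactly at the end of a word, which is one more reason to treat the maximum length as approximate.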
