Text truncating function that secures both HTML tags and full words
Akshaya K Sahu’s answer to this question at Stack Overflow is a great example of a text-parsing function that can truncate any HTML-encoded string at a given length while taking care of all the necessary aspects, i.e.:
- full words,
- properly closed HTML tags and
- respected UTF-8 encoding (multi-byte characters!)
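To see why each of these three aspects matters, here is a minimal sketch of what naive truncation does to a small HTML string. The sample string and the cut positions are made up for illustration; only standard PHP string and mbstring functions are used.

```php
<?php
mb_internal_encoding('UTF-8');

// Hypothetical sample input with multi-byte characters and nested tags.
$html = '<p>Zażółć <strong>gęślą</strong> jaźń</p>';

// 1. Byte-based cut: byte 8 is only the first half of "ó",
//    so the result is no longer valid UTF-8.
$byteCut = substr($html, 0, 8);
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)

// 2. Character-based cut: valid UTF-8, but it stops inside the <strong> tag.
$charCut = mb_substr($html, 0, 14);
echo $charCut, "\n"; // <p>Zażółć <str

// 3. A cut that lands after a complete tag still splits the word "gęślą"
//    and leaves both <p> and <strong> unclosed.
$openCut = mb_substr($html, 0, 21);
echo $openCut, "\n"; // <p>Zażółć <strong>gęś
```

None of the three results is safe to print into a page as-is, which is the reason the truncating function has to track tags, words and encoding at the same time.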
I have nothing to add to that answer. I keep a copy of its code here only for my own reference, and only because the original answer lacks some comments.
/**
 * Truncates string at given length with both HTML tags and full words secured.
 *
 * Note that "securing" both HTML tags and full words in a truncated string is
 * quite a complicated process, so don't expect your string to be truncated at
 * exactly the given length. Treat this value as an approximate one.
 *
 * Source: http://stackoverflow.com/a/8741240/1469208
 *
 * @param string $html Input string containing HTML code.
 * @param integer $maxLength Approximate length of string. See above notice.
 *
 * @return string Parsed string, truncated at given length.
 */
public static function htmlSafeStringTruncate($html, $maxLength = 150)
{
    mb_internal_encoding("UTF-8");
    $printedLength = 0;
    $position = 0;
    $tags = array();
    $newContent = '';
    /**
     * Strip images first; they are dropped from the result entirely.
     */
    $html = preg_replace('/<img\b[^>]*>/i', '', $html);
    while($printedLength < $maxLength && preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position))
    {
        list($tag, $tagPosition) = $match[0];
        /**
         * Print text leading up to the tag. Offsets returned by preg_match()
         * are byte offsets, which is what mb_strcut() expects.
         */
        $str = mb_strcut($html, $position, $tagPosition - $position);
        if($printedLength + mb_strlen($str) > $maxLength)
        {
            $newstr = mb_strcut($str, 0, $maxLength - $printedLength);
            /**
             * Drop the last word, which may have been cut in half.
             */
            $newstr = preg_replace('~\s+\S+$~', '', $newstr);
            $newContent .= $newstr;
            $printedLength = $maxLength;
            break;
        }
        $newContent .= $str;
        $printedLength += mb_strlen($str);
        if($tag[0] == '&')
        {
            /**
             * Handle the entity. It counts as a single printed character.
             */
            $newContent .= $tag;
            $printedLength++;
        }
        else
        {
            /**
             * Handle the tag.
             */
            $tagName = $match[1][0];
            if($tag[1] == '/')
            {
                /**
                 * This is a closing tag.
                 */
                $openingTag = array_pop($tags);
                assert($openingTag == $tagName);
                $newContent .= $tag;
            }
            else if($tag[strlen($tag) - 2] == '/')
            {
                /**
                 * This is a self-closing tag. Byte indexing needs strlen().
                 */
                $newContent .= $tag;
            }
            else
            {
                /**
                 * This is an opening tag. Note that a void element written
                 * without a trailing slash (e.g. <br>) also lands here.
                 */
                $newContent .= $tag;
                $tags[] = $tagName;
            }
        }
        /**
         * Continue after the tag. $position is a byte offset, so strlen().
         */
        $position = $tagPosition + strlen($tag);
    }
    /**
     * Print any remaining text.
     */
    if ($printedLength < $maxLength && $position < strlen($html))
    {
        $newstr = mb_strcut($html, $position, $maxLength - $printedLength);
        $newstr = preg_replace('~\s+\S+$~', '', $newstr);
        $newContent .= $newstr;
    }
    /**
     * Close any remaining open tags.
     */
    while (!empty($tags))
    {
        $newContent .= sprintf('</%s>', array_pop($tags));
    }
    return $newContent;
}
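Two details of this function are easy to get wrong, so here is a small sketch of both, using made-up sample strings. The first part shows that `preg_match()` with `PREG_OFFSET_CAPTURE` reports byte offsets, and that `mb_strcut()` also works on byte positions while `mb_strlen()` counts characters. The second part shows how the word-trimming pattern behaves, which is one reason the resulting length is only approximate.

```php
<?php
mb_internal_encoding('UTF-8');

// Hypothetical sample: 'żółw ' is 5 characters but 8 bytes.
$html = 'żółw <b>idzie</b>';

preg_match('{</?([a-z]+)[^>]*>}', $html, $match, PREG_OFFSET_CAPTURE);
list($tag, $tagPosition) = $match[0];

var_dump($tagPosition);       // int(8): a byte offset
var_dump(mb_strlen('żółw ')); // int(5): a character count

// mb_strcut() takes byte positions, so it pairs with the offset above.
$lead = mb_strcut($html, 0, $tagPosition);             // 'żółw '
// Stepping over the tag must therefore use strlen(), not mb_strlen().
$rest = mb_strcut($html, $tagPosition + strlen($tag)); // 'idzie</b>'

// The trimming pattern removes the last whitespace-separated word.
$trimPartial = preg_replace('~\s+\S+$~', '', 'The quick brown fo');
// It cannot tell a cut word from a whole one, so a whole word goes too.
$trimWhole = preg_replace('~\s+\S+$~', '', 'The quick brown');
// Without any whitespace there is nothing to match; the text stays as-is.
$trimSingle = preg_replace('~\s+\S+$~', '', 'Supercalifragilistic');

var_dump($trimPartial, $trimWhole, $trimSingle);
```

Because the last word is always dropped after a cut, the output is usually somewhat shorter than the requested length.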