Example: using cURL to convert a webpage to plain text

The webpage2txt function takes a URL and returns a plain-text version of the page. It uses cURL to retrieve the page and a set of regular expressions, applied with preg_replace, to strip the markup and all unwanted whitespace.

The function also removes the contents of <style> and <script> elements. PHP's strip_tags would leave that text behind, because it strips only the tags themselves and keeps whatever sits between them.
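The difference is easy to see on a small piece of markup. The sketch below is only an illustration: the sample HTML and the variable names are made up, and the two patterns are the script and style patterns from the function.

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<p>Hello</p><script>var x = 1;</script><style>p { color: red; }</style>';

// strip_tags() removes the tags but keeps the text between them,
// so the JavaScript and CSS source end up in the output.
$viaStripTags = strip_tags($html);

// Removing whole <script> and <style> elements first drops their contents too.
$withoutBlocks = preg_replace(
    array('@<script[^>]*?>.*?</script>@si', '@<style[^>]*?>.*?</style>@si'),
    '',
    $html
);
$viaRegex = strip_tags($withoutBlocks);

echo $viaStripTags, "\n"; // still contains "var x = 1;" and the CSS rule
echo $viaRegex, "\n";     // Hello
```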

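The cleanup runs in two stages. The first replaces every tag and every run of two or more whitespace characters with a single newline, which removes blank lines and repeated linefeeds or spaces while leaving single line breaks alone. The second trims any whitespace left at the start and end of the text.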
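Here is a minimal sketch of the two stages on a short string. The input is invented for the example, and trim() stands in for the second-stage patterns. Note that a run of spaces inside a line also becomes a newline, since the pattern treats all whitespace alike.

```php
<?php
// Hypothetical input: text as it might look after the tags have been removed.
$raw = "  Title\n\n\n  First line\nSecond   line  \n\n";

// Stage 1: collapse every run of two or more whitespace characters
// into a single newline. Single spaces and single newlines are untouched.
$stage1 = preg_replace('/\s{2,}/', "\n", $raw);

// Stage 2: trim whatever whitespace is left at the very start and end.
$stage2 = trim($stage1);

echo $stage2, "\n";
```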

function webpage2txt($url)
{
    $user_agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";

    $ch = curl_init(); // initialize curl handle
    curl_setopt($ch, CURLOPT_URL, $url); // set URL
    curl_setopt($ch, CURLOPT_FAILONERROR, 1); // fail on HTTP errors (status >= 400)
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // allow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return into a variable
    // No CURLOPT_PORT: cURL picks the port from the URL scheme (80 or 443)
    curl_setopt($ch, CURLOPT_TIMEOUT, 15); // times out after 15s
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);

    $document = curl_exec($ch);

    if ($document === false) {
        return false; // request failed or timed out
    }

    $search = array
    (
        '@<script[^>]*?>.*?</script>@si', // Strip out javascript
        '@<style[^>]*?>.*?</style>@si', // Strip style tags properly
        '@<[/!]*?[^<>]*?>@si', // Strip out HTML tags
        '@<![\s\S]*?--[ \t\n\r]*>@', // Strip multi-line comments including CDATA
        '/\s{2,}/', // Collapse runs of whitespace
    );

    // First stage: each match becomes a single newline
    $text = preg_replace($search, "\n", html_entity_decode($document));

    // Second stage: remove leading and trailing whitespace
    $pat[0] = '/^\s+/';
    $pat[2] = '/\s+$/';
    $rep[0] = "";
    $rep[2] = " ";

    $text = preg_replace($pat, $rep, trim($text));

    return $text;
}

Potential uses of this function include extracting keywords from a webpage, counting words, and similar tasks. If you find it useful, drop us a comment and let us know where you used it.
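As a rough sketch of the word-counting idea, the helper below returns the most frequent words in a string. The name top_words and the sample sentence are made up for this example. In practice you would pass in the text returned by the function above. It assumes plain ASCII text; other alphabets would need a Unicode-aware pattern.

```php
<?php
// Hypothetical helper, not part of the original function.
function top_words($text, $limit = 5)
{
    // Lowercase the text and split on anything that is not a letter or digit.
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    // Count how often each word occurs, then sort with the highest counts first.
    $counts = array_count_values($words);
    arsort($counts);

    // Keep only the first $limit entries, preserving the word => count keys.
    return array_slice($counts, 0, $limit, true);
}

// Sample text standing in for the output of a page fetch.
$sample = "PHP and cURL: fetch a page with cURL, then parse the page in PHP.";
$top = top_words($sample, 3);
print_r($top);
```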

Source: a user comment on the preg_replace function page in the PHP manual
