Example: using cURL to convert a webpage to plain text

The webpage2txt function takes a URL and returns a plain-text version of the page. It uses cURL to retrieve the page and a combination of regular expressions (applied with preg_replace) to strip the markup and collapse unwanted whitespace.

This function also removes the text inside <style> and <script> tags, which PHP functions such as strip_tags leave behind (strip_tags removes only the tags themselves and keeps the text between them).
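
To make the difference concrete, here is a minimal sketch that compares strip_tags on its own with removing the whole <script> and <style> elements first. The sample markup and the variable names are made up for illustration, and only two of the patterns from the function are used.

```php
<?php
// Sample markup; the content is invented for this illustration.
$html = '<p>Hello</p>'
      . '<script type="text/javascript">var x = 1;</script>'
      . '<style>p { color: red; }</style>';

// strip_tags() removes the tags but keeps the text that was between them,
// so the JavaScript and CSS source end up in the result.
$naive = strip_tags($html);

// Deleting the complete <script> and <style> elements first drops their contents too.
$patterns = array(
    '@<script[^>]*?>.*?</script>@si',
    '@<style[^>]*?>.*?</style>@si',
);
$clean = strip_tags(preg_replace($patterns, '', $html));

echo $naive, "\n"; // still contains "var x = 1;" and the CSS rule
echo $clean, "\n"; // Hello
```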

The whitespace cleanup is split into two stages so that single line breaks are kept, while blank lines and runs of multiple linefeeds or spaces are still removed.

function webpage2txt($url)
{ 
    $user_agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";

    $ch = curl_init();                            // initialize curl handle
    curl_setopt($ch, CURLOPT_URL, $url);          // URL to fetch
    curl_setopt($ch, CURLOPT_FAILONERROR, 1);     // fail on HTTP errors (status >= 400)
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);  // follow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the page instead of printing it
    curl_setopt($ch, CURLOPT_PORT, 80);           // force port 80; drop this line for https:// URLs
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);        // time out after 15s

    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent); 

    $document = curl_exec($ch); 

    $search = array
    (
        '@<script[^>]*?>.*?</script>@si',  // Strip out javascript
        '@<style[^>]*?>.*?</style>@siU',   // Strip style tags properly
        '@<[\/\!]*?[^<>]*?>@si',           // Strip out HTML tags
        '@<![\s\S]*?--[ \t\n\r]*>@',       // Strip multi-line comments including CDATA
        '/\s{2,}/',                        // Collapse runs of two or more whitespace characters
    );

    $text = preg_replace($search, "\n", html_entity_decode($document));

    $pat[0] = '/^\s+/';
    $pat[2] = '/\s+$/';
    $rep[0] = '';
    $rep[2] = ' ';

    $text = preg_replace($pat, $rep, trim($text)); 

    return $text; 
}
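
The whitespace handling can be tried without a network request. The sketch below is a simplified stand-in for the two stages, not the function itself: it applies the same '/\s{2,}/' pattern and then trims the ends. The sample string is made up.

```php
<?php
// Invented text standing in for a page after its tags have been removed.
$raw = "  Title \n\n\n  First   line\nSecond line \t ";

// Stage 1: replace every run of two or more whitespace characters with one newline.
// A single space or a single line break is not a "run" and is left alone.
$text = preg_replace('/\s{2,}/', "\n", $raw);

// Stage 2: remove whatever whitespace is left at the very start and end.
$text = trim($text);

echo $text;
// Title
// First
// line
// Second line
```

Note that the three spaces inside "First   line" also become a line break, because the pattern does not distinguish spaces from linefeeds.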

Potential uses of this function include extracting keywords from a webpage and counting words. If you find it useful, drop us a comment and let us know where you used it.
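
As a rough illustration of the word-counting idea, the sketch below counts words and picks the most frequent ones from a piece of plain text. The helper name top_words and the sample sentence are hypothetical. In practice the input would be the string returned by webpage2txt.

```php
<?php
// Hypothetical helper, not part of the original function.
function top_words($text, $limit = 3)
{
    // Lower-case first so "cURL" and "curl" count as the same word.
    $words  = str_word_count(strtolower($text), 1); // format 1: return the words as an array
    $counts = array_count_values($words);           // word => number of occurrences
    arsort($counts);                                // highest counts first, keys preserved
    return array_slice($counts, 0, $limit, true);
}

// Invented sample; replace with the output of webpage2txt($url).
$text = "cURL fetches the page and cURL returns the page as text";

$total = str_word_count($text);
$top   = top_words($text);

echo $total, "\n"; // 11
print_r($top);     // the three words that occur twice: curl, the, page
```

str_word_count only recognises alphabetic words by default, so numbers and most non-ASCII text would need different handling.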

Source: one of the comments on the preg_replace function in the PHP manual
