Example of using cURL for converting a webpage to a pure text

May 7, 2013

The webpage2txt function below takes a URL and returns a plain-text version of the page. It uses cURL to retrieve the page and a set of regular expressions to strip the markup and collapse unwanted whitespace.

This function also removes the contents of <style> and <script> elements, which PHP functions such as strip_tags leave behind (strip_tags strips only the tags themselves and keeps the text between them).
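
To make the difference concrete, here is a minimal sketch comparing plain strip_tags with removing whole <script> and <style> elements first. The sample markup and the variable names ($html, $naive, $clean) are made up for illustration, and the two patterns are simplified versions of the ones used in the function.

```php
<?php
// Sample markup (illustrative only): one <style> block, one <script> block, one paragraph.
$html = '<style>p { color: red; }</style>'
      . '<script>alert("hi");</script>'
      . '<p>Hello <b>world</b></p>';

// strip_tags() removes only the tags; the CSS and JavaScript text stays behind.
$naive = strip_tags($html);

// Removing the whole <script> and <style> elements first drops their contents too.
$clean = preg_replace(
    array('@<script[^>]*?>.*?</script>@si', '@<style[^>]*?>.*?</style>@si'),
    '',
    $html
);
$clean = strip_tags($clean);

echo $naive, "\n"; // still contains the CSS rule and the alert() call
echo $clean, "\n"; // Hello world
```

With strip_tags alone, the CSS rule and the JavaScript call end up mixed into the extracted text, which is usually not what you want when counting words or looking for keywords.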

The cleanup runs in two stages so that single line breaks survive while blank lines and runs of multiple linefeeds or spaces are removed: the first pass strips the markup and collapses whitespace runs, and the second pass trims whatever whitespace is left at the start and end of the text.
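
The two whitespace stages can be tried in isolation on a literal string. This is only a sketch: the sample text and the variable names ($raw, $stage1, $stage2) are assumptions made for the demonstration.

```php
<?php
// Sample text with a single line break, a blank-line run, and runs of spaces.
$raw = "  first line\nsecond line\n\n\nthird    line  ";

// Stage 1: any run of two or more whitespace characters becomes one newline.
// A single newline or a single space is not matched, so it is left alone.
$stage1 = preg_replace('/\s{2,}/', "\n", $raw);

// Stage 2: remove the whitespace left at the very start and end of the text.
$stage2 = preg_replace(array('/^\s+/', '/\s+$/'), '', $stage1);

echo $stage2;
// first line
// second line
// third
// line
```

Note that a run of spaces inside a line is also turned into a newline, so "third    line" ends up on two lines. That is a side effect of using one replacement string for every pattern in the first stage.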

function webpage2txt($url)
{
    $user_agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";

    $ch = curl_init(); // initialize curl handle
    curl_setopt($ch, CURLOPT_URL, $url); // set URL
    curl_setopt($ch, CURLOPT_FAILONERROR, 1); // Fail on errors
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // allow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,1); // return into a variable
    curl_setopt($ch, CURLOPT_PORT, 80); // force port 80 (remove this line for HTTPS URLs)
    curl_setopt($ch, CURLOPT_TIMEOUT, 15); // times out after 15s

    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);

    $document = curl_exec($ch);

    $search = array
    (
        '@<script[^>]*?>.*?</script>@si', // Strip out JavaScript, including its contents
        '@<style[^>]*?>.*?</style>@si', // Strip style blocks, including their contents
        '@<[\/\!]*?[^<>]*?>@si', // Strip out HTML tags
        '@<![\s\S]*?--[ \t\n\r]*>@', // Strip multi-line comments
        '/\s{2,}/', // Collapse runs of two or more whitespace characters
    );

    $text = preg_replace($search, "\n", html_entity_decode($document));

    $pat[0] = '/^\s+/'; // leading whitespace
    $pat[2] = '/\s+$/'; // trailing whitespace
    $rep[0] = "";
    $rep[2] = " ";

    $text = preg_replace($pat, $rep, trim($text));

    return $text;
}

Potential uses of this function are extracting keywords from a webpage, counting words and things like that. If you find it useful, drop us a comment and let us know where you used it.
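
As a rough sketch of the word-counting and keyword ideas, the snippet below works on a hard-coded string standing in for the function's return value, so it runs without network access. The sample text and the variable names ($text, $wordCount, $frequency, $keywords) are hypothetical.

```php
<?php
// Hypothetical stand-in for the text returned by webpage2txt().
$text = "PHP and cURL\nUsing cURL with PHP to fetch a page\nPHP regular expressions";

// Total number of words in the extracted text.
$wordCount = str_word_count($text);

// How often each word occurs, case-insensitively, most frequent first.
$words = str_word_count(strtolower($text), 1);
$frequency = array_count_values($words);
arsort($frequency);

// Take the three most frequent words as rough keyword candidates.
$keywords = array_slice(array_keys($frequency), 0, 3);

echo $wordCount, "\n";            // 14
echo implode(', ', $keywords), "\n";
```

For real pages you would probably also want to filter out common stop words ("and", "to", "a") before picking keywords.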

Source: one of the user comments on the preg_replace function in the PHP manual

onezeronull.com
