
How to extract an HTML page title by URL

The task splits into two steps:

  • retrieve the data
  • extract information from it

Python’s standard library includes urllib2 for retrieving data over HTTP, and there are a number of libraries for parsing HTML. I’ll use html5lib in this example.

First iteration of retrieving data

import urllib2

def read_url(url):
    try:
        response = urllib2.urlopen(url)
    except urllib2.URLError:
        # Treat any network or HTTP error as "no data"
        return u''
    encoding = get_charset(response.headers)
    data = response.read()
    return unicode(data, encoding)

It relies on a small utility function, get_charset:

def get_charset(headers, default='utf-8'):
    try:
        content_type = headers['content-type'].lower()
        if content_type.find('charset=') > 0:
            return content_type.split('charset=')[-1].lower()
    except KeyError:
        pass
    return default
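The split on 'charset=' above keeps whatever follows the charset value, so a header such as "text/html; charset=utf-8; boundary=x" would yield "utf-8; boundary=x". As a hedged alternative, here is a minimal sketch that lets the standard email package parse the header value. The function name charset_from_content_type is my own, and it takes the raw header string rather than the headers object.

```python
from email.message import Message

def charset_from_content_type(content_type, default='utf-8'):
    """Return the lowercased charset parameter of a Content-Type value."""
    msg = Message()
    # Let the email package split the header into type and parameters
    msg['Content-Type'] = content_type
    # get_content_charset returns the default when no charset is present
    return msg.get_content_charset(default)
```

It could be called as charset_from_content_type(headers['content-type']) in place of the manual split.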

Now we can get data!

>>> d = read_url('http://python.org')
>>> d[:50]
u'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Trans'

Seems like that’s what we need.

Extracting title with html5lib

There are examples for it: http://www.sal.ksu.edu/faculty…

Here’s an extractor function based on those examples:

from html5lib import HTMLParser, treebuilders, treewalkers

parser = HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
walker = treewalkers.getTreeWalker("dom")

def extract_title(html):
    domtree = parser.parse(html)
    titleNode = False
    title = u''
    for token in walker(domtree):
        if token['type'] == 'StartTag' and token['name'] == 'title':
            titleNode = True
        elif titleNode:
            if token['type'] == 'EndTag' and token['name'] == 'title':
                break
            elif 'data' in token:
                title += token['data']
    return title.strip()

Let’s try!

>>> extract_title(d)
u'Python Programming Language -- Official Website'

Amazing! That’s working!
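If html5lib is not available, the same start-tag / data / end-tag walk can be approximated with the standard library parser. This is only a sketch of a swapped-in technique, not the html5lib approach above: it assumes Python 3, where the module is called html.parser (in Python 2 it is HTMLParser), and the names TitleParser and extract_title_stdlib are made up for illustration. The standard parser is also less forgiving of broken markup than html5lib.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; only the first title counts
        if tag == 'title' and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)

def extract_title_stdlib(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return ''.join(parser.parts).strip()
```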

Optimization, possibly

The one drawback of the extraction method above is that the page has to be completely downloaded and parsed just to extract the title. I’ve tried to optimize it by reading the HTTP response only until the title has been received.

Here’s the read_url function revisited. It reads the data in chunks until the specified string is found.

import re
import urllib2

def read_url(url, until=None, chunk=100):
    try:
        response = urllib2.urlopen(url)
    except urllib2.URLError:
        return u''

    encoding = get_charset(response.headers)

    if until:
        # Match the marker literally, ignoring case
        pattern = re.compile(re.escape(until), re.IGNORECASE)
        data = ''
        while True:
            piece = response.read(chunk)
            if not piece:
                break  # end of stream, marker not found
            data += piece
            until_match = pattern.search(data)
            if until_match:
                response.close()
                # Cut the raw bytes at the end of the marker, then decode
                return unicode(data[:until_match.end()], encoding)
    else:
        data = response.read()
    return unicode(data, encoding)

So, we can now read until </title>!

>>> d = read_url('http://python.org/', until='</title>')
>>> d
u'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n\n<head>\n  <meta http-equiv="content-type" content="text/html; charset=utf-8" />\n  <title>Python Programming Language -- Official Website</title>'
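The chunked loop can be tried without a network connection by pulling it out into a helper that works on any file-like object. The sketch below is an illustration under my own assumptions: the name read_until is hypothetical, it works on bytes, and it compares case-insensitively by lowercasing. It also only rescans the tail of the buffer, which is enough to catch a marker split across two chunks.

```python
import io

def read_until(stream, marker, chunk=100):
    """Read bytes from stream until marker has been seen (case-insensitive)."""
    marker = marker.lower()
    data = b''
    while True:
        piece = stream.read(chunk)
        if not piece:
            # End of stream: marker never appeared, return everything
            return data
        # Start far enough back to catch a marker split across chunks
        start = max(0, len(data) - len(marker) + 1)
        data += piece
        pos = data.lower().find(marker, start)
        if pos != -1:
            # Keep everything up to and including the marker
            return data[:pos + len(marker)]

page = io.BytesIO(b'<html><head><TITLE>Hi</TITLE></head><body>...</body></html>')
head = read_until(page, b'</title>', chunk=7)
```

With a chunk size of 7 the closing tag is split across reads, yet head still ends exactly at the end of the title element.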

Let’s test performance. A very basic test looks like this:

def test():
    from time import time

    t1 = time()
    d1 = read_url('http://python.org/', until='</title>')
    t2 = time()

    t3 = time()
    d2 = read_url('http://python.org/')
    t4 = time()

    print t2-t1
    print t4-t3

Results:

>>> test()
0.131000041962
0.31500005722
>>> test()
0.12700009346
0.318000078201
>>> test()
0.125999927521
0.31299996376

The optimized reader was considerably faster: roughly 0.13 s versus 0.31 s per request.
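Single wall-clock measurements over the network are noisy, so it can help to repeat each call a few times and keep the best run. Here is a minimal sketch of such a helper. The name best_of is hypothetical, and it assumes the callable takes no arguments.

```python
from timeit import default_timer

def best_of(func, repeat=3):
    """Call func repeat times and return the shortest elapsed time in seconds."""
    timings = []
    for _ in range(repeat):
        start = default_timer()
        func()
        timings.append(default_timer() - start)
    return min(timings)
```

For the test above it might be used as best_of(lambda: read_url('http://python.org/', until='</title>')), and the same way for the full download.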

That’s it

Other HTML parsing libraries are mentioned here.
