Wed, 11 Jun 2008

Encoding ampersands with Python

I need to replace ampersands in a text file with the HTML entity '&amp;'. I could simply use Python's string replace method, but that would mangle the text wherever an ampersand has already been turned into an HTML entity: '&amp;' would become '&amp;amp;'. The same is true if I use a regular expression that matches a single '&'. What I really need is to replace an ampersand only when it is not followed by 'amp;'.
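To make the problem concrete, here is a minimal sketch of the naive approach. The sample text is made up for illustration; it already contains one encoded ampersand and one bare one.

```python
# Naive approach: replace every ampersand, whether or not it is already encoded.
text = "Fish & Chips &amp; Peas"  # hypothetical sample text
naive = text.replace("&", "&amp;")

# The ampersand that was already encoded has now been encoded a second time.
print(naive)  # Fish &amp; Chips &amp;amp; Peas
```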

A negative lookahead assertion in the regular expression is the answer. Negative lookahead is used when you want to match something that is not followed by something else. It starts with (?! and ends at the closing ).

Our expression now becomes &(?!amp;), which means that the text inside the lookahead, amp;, must not follow the expression that precedes it.
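As a rough illustration of that expression at work, the sketch below runs a substitution twice over a made-up sample string. The names bare_amp, once and twice are mine, not part of any library. Because the lookahead skips ampersands that are already encoded, the second pass should change nothing.

```python
import re

# Match '&' only when it is not immediately followed by 'amp;'.
bare_amp = re.compile(r"&(?!amp;)")

text = "Fish & Chips &amp; Peas"  # hypothetical sample text
once = bare_amp.sub("&amp;", text)
twice = bare_amp.sub("&amp;", once)

print(once)           # Fish &amp; Chips &amp; Peas
print(once == twice)  # True
```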

In this example I also added a second lookahead, (?!#), so that HTML entity numbers such as &#039; are not matched either.

>>> import re
>>> s = "<Title>Eugene&#039;s Software Emporium & Arcade</Title>"
>>> # Match '&' unless it starts an entity number ('&#...') or is already '&amp;'
>>> pattern = re.compile(r'&(?!#)(?!amp;)')
>>> for match in pattern.finditer(s):
...     print(match.span())
... 
(39, 40)
>>> s[match.start():match.end()]
'&'
>>> 
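The session above only locates the bare ampersand. A minimal sketch of the replacement itself, using the same string and pattern, might look like this (the variable name encoded is my own):

```python
import re

s = "<Title>Eugene&#039;s Software Emporium & Arcade</Title>"
pattern = re.compile(r"&(?!#)(?!amp;)")

# Replace only the ampersands the pattern matches; the entity number is left alone.
encoded = pattern.sub("&amp;", s)
print(encoded)
# <Title>Eugene&#039;s Software Emporium &amp; Arcade</Title>
```

Note that this pattern only protects '&amp;' and entity numbers. A named entity such as '&lt;' would still be matched and turned into '&amp;lt;', so the pattern would need further lookaheads if the text contains other named entities.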


posted: 22:13


© 2008 PlatosCave.net