Encoding ampersands with Python

Published on June 11, 2008

I need to replace ampersands in a text file with the HTML entity '&amp;'. I could simply use Python's string replace method, but that would mess up my text if some of the ampersands have already been turned into HTML entities. The same is true if I use a regular expression that matches a single '&'. What I really need is to replace an ampersand only when it is not followed by 'amp;'.
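To make the problem concrete, here is a minimal sketch of the naive approach. The sample text is made up for illustration and is not from my file.

```python
# Naive approach: replace every ampersand, whether or not it is already encoded.
text = "Fish & Chips &amp; Peas"  # hypothetical sample text
naive = text.replace("&", "&amp;")
print(naive)  # Fish &amp; Chips &amp;amp; Peas
```

The ampersand that was already an entity gets encoded a second time and ends up as '&amp;amp;'.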

A negative lookahead assertion in the regular expression is the answer. Negative lookahead is used when you want to match something that is not followed by something else. It starts with (?! and ends at the closing ).

Our expression now becomes &(?!amp;), which means that the text inside the lookahead, amp;, must not follow the expression that precedes it.
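As a quick sketch of this expression in use, re.sub can do the replacement directly. The sample text here is made up for illustration.

```python
import re

text = "Fish & Chips &amp; Peas"  # hypothetical sample text
# Replace '&' only when it is not immediately followed by 'amp;'.
encoded = re.sub(r"&(?!amp;)", "&amp;", text)
print(encoded)  # Fish &amp; Chips &amp; Peas
```

Because the lookahead consumes no characters, only the bare '&' is replaced, and running the substitution a second time leaves the text unchanged.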

In this example I also added a second lookahead, (?!#), so that numeric HTML entities such as &#039; are not matched either.

>>> import re
>>> s = "<Title>Eugene&#039;s Software Emporium & Arcade</Title>"
>>> # Match '&' unless it starts a numeric entity ('&#...') or is already '&amp;'
>>> pattern = re.compile(r'&(?!#)(?!amp;)')
>>> for match in pattern.finditer(s):
...     print(match.span())
...
(39, 40)
>>> s[match.start():match.end()]
'&'
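The session above only locates the bare ampersand. A minimal sketch of the replacement itself, using the same pattern, could look like this. The names BARE_AMPERSAND and encode_ampersands are my own for illustration, not part of the re module.

```python
import re

# Same pattern as in the session above: an ampersand that is not
# followed by '#' (numeric entity) or by 'amp;' (already encoded).
BARE_AMPERSAND = re.compile(r"&(?!#)(?!amp;)")


def encode_ampersands(text):
    """Return text with every bare '&' replaced by '&amp;'."""
    return BARE_AMPERSAND.sub("&amp;", text)


s = "<Title>Eugene&#039;s Software Emporium & Arcade</Title>"
print(encode_ampersands(s))
# <Title>Eugene&#039;s Software Emporium &amp; Arcade</Title>
```

Note that this pattern only protects '&amp;' and numeric entities. Other named entities such as '&lt;' would still have their ampersand encoded, so the lookahead would need extending if the text contains them.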

Tags: programming, python
