Encoding ampersands with Python
I need to replace ampersands in a text file with the HTML entity '&amp;'. I could simply use Python's string replace
method, but this would mess up my text if some of the ampersands have already been turned into HTML entities: an
existing '&amp;' would become '&amp;amp;'. The same is true if I use a regular expression that matches a single '&'.
What I really need is to replace an ampersand only when it is not followed by 'amp;'.
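To make the problem concrete, here is a minimal sketch of the naive approach. The sample text is made up for illustration; it contains one bare ampersand and one that has already been encoded.

```python
# Made-up sample text: one bare ampersand, one already-encoded ampersand.
text = "Fish & Chips &amp; Peas"

# Naive approach: replace every ampersand, including the one inside "&amp;".
naive = text.replace("&", "&amp;")

print(naive)  # Fish &amp; Chips &amp;amp; Peas
```

The second ampersand has been encoded twice, so a browser would display the literal text "&amp;" instead of "&".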
A negative lookahead assertion in our regular expression is the answer. Negative lookahead is used when you want to
match something that is not followed by something else. It starts with (?! and finishes at the closing ). The
assertion consumes no characters, so only the & itself is matched.
Our expression now becomes: &(?!amp;) and means that the text the assertion contains, amp;, must not follow the
expression that precedes it.
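As a rough sketch of how the expression could be used for the replacement itself, the pattern can be passed to re.sub. The function name encode_ampersands and the sample text are assumptions made for this illustration.

```python
import re

# An ampersand that is NOT immediately followed by "amp;".
AMP_NOT_ENCODED = re.compile(r"&(?!amp;)")

def encode_ampersands(text):
    """Hypothetical helper: encode only the ampersands not already part of '&amp;'."""
    return AMP_NOT_ENCODED.sub("&amp;", text)

once = encode_ampersands("Fish & Chips &amp; Peas")
twice = encode_ampersands(once)

print(once)           # Fish &amp; Chips &amp; Peas
print(once == twice)  # True
```

Because the lookahead skips anything that is already '&amp;', running the function a second time leaves the text unchanged.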
In this example I also added a second assertion, (?!#), so that numeric HTML entities such as &#39; are not matched either.
>>> import re
>>> s = "<Title>Eugene&#39;s Software Emporium & Arcade</Title>"
>>> pattern = re.compile(r'&(?!#)(?!amp;)')
>>> if pattern.search(s):
...     iterator = pattern.finditer(s)
...     for match in iterator:
...         print(match.span())
...
(38, 39)
>>> s[match.start():match.end()]
'&'
>>>
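The session above only locates the bare ampersand. A minimal sketch of the final replacement step, using the same pattern with its sub method, might look like this. The helper name encode_bare_ampersands is an assumption for illustration, and the title string is assumed to contain a numeric entity for the apostrophe.

```python
import re

# Same pattern as in the session: an ampersand that starts neither a numeric
# entity (&#...) nor an existing "&amp;".
pattern = re.compile(r"&(?!#)(?!amp;)")

def encode_bare_ampersands(text):
    """Hypothetical helper: replace each match of the pattern with '&amp;'."""
    return pattern.sub("&amp;", text)

# Assumed sample: the apostrophe is written as the numeric entity &#39;.
title = "<Title>Eugene&#39;s Software Emporium & Arcade</Title>"
encoded = encode_bare_ampersands(title)

print(encoded)
# <Title>Eugene&#39;s Software Emporium &amp; Arcade</Title>
```

Note that this pattern only protects '&amp;' and numeric entities. Other named entities are still re-encoded, so encode_bare_ampersands("a &lt; b") returns 'a &amp;lt; b'. If the text may contain such entities, the lookahead would need to be widened.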
