Adam's blog: Replacing more than two occurrences with Python RegEx
Recently, I was faced with a fairly easy task: to use Python’s re.sub to replace all occurrences of one piece of text with another. The only catch was that it replaced only the first two of them. But why?
Of course, the situation was a little more complex. The project that I was modifying takes Markdown text as input and, with the help of Pandoc, returns its content as HTML code. After a few hours of debugging the whole transformation process, I tracked the issue down to a single function. Can you spot the mistake?
import re


def format_custom_tags(source: str) -> str:
    """
    Replaces all custom-defined tags with divs
    e.g. <ksi-tip> is replaced with <div class="ksi-custom ksi-tip">
    :param source: HTML to adjust
    :return: adjusted HTML
    """
    tags = ('ksi-tip',)
    for tag in tags:
        tag_escaped = re.escape(tag)
        source = re.sub(fr'<{tag_escaped}(.*?)>', fr'<div class="ksi-custom {tag}"\1>', source, re.IGNORECASE)
        source = re.sub(fr'</{tag_escaped}>', r"</div>", source, re.IGNORECASE)
    return source
Side note: Why is this function required? During the conversion, I want to replace each occurrence of the custom tag <ksi-tip> with <div class="ksi-custom ksi-tip">. If I did not do this, Pandoc would sometimes treat my custom tag as plain text, which broke the formatting in certain cases. When the tag is instead expressed as a class on a div, everything works as expected.
Solution
Funnily enough, after searching for “Why does Python re.sub replace only two occurrences”, I found out that I am not the only one who was losing their mind over this issue: someone had already asked exactly the same question. In Python, re.IGNORECASE equals 2, and the fourth positional parameter of re.sub is count, not flags. So each call was really asking for at most two case-sensitive replacements. That’s it. The whole issue was caused by passing the arguments in the wrong order, so, in the end, the fix was quite straightforward.
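To make the fix concrete, here is a minimal sketch of the corrected function, with the flag passed by keyword. The SAMPLE string and the `buggy` and `fixed` variables are made up for illustration and are not part of the original project. The buggy call is written as `count=int(re.IGNORECASE)` to spell out what the positional argument really meant.

```python
import re

# re.IGNORECASE is a flag whose integer value is 2.
assert int(re.IGNORECASE) == 2

# Hypothetical input with three custom tags (illustration only).
SAMPLE = "<ksi-tip>a</ksi-tip> <ksi-tip>b</ksi-tip> <ksi-tip>c</ksi-tip>"

# What the original call effectively did: the fourth positional argument
# is count, so re.IGNORECASE was read as "replace at most 2 matches".
buggy = re.sub(r"</ksi-tip>", "</div>", SAMPLE, count=int(re.IGNORECASE))


def format_custom_tags(source: str) -> str:
    """Replace custom tags with divs, passing the flag by keyword."""
    tags = ('ksi-tip',)
    for tag in tags:
        tag_escaped = re.escape(tag)
        # flags= makes sure re.IGNORECASE is not mistaken for count
        source = re.sub(fr'<{tag_escaped}(.*?)>',
                        fr'<div class="ksi-custom {tag}"\1>',
                        source, flags=re.IGNORECASE)
        source = re.sub(fr'</{tag_escaped}>', r"</div>",
                        source, flags=re.IGNORECASE)
    return source


fixed = format_custom_tags(SAMPLE)
print(buggy)  # the third closing tag is left untouched
print(fixed)  # every opening and closing tag is replaced
```

Passing count and flags by keyword is the safer habit in general. As far as I know, Python 3.13 even emits a DeprecationWarning when they are passed positionally to re.sub.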