Adam's blog: Replacing more than two occurrences with Python RegEx

14 Aug 2022, 433 words

Recently, I was faced with a fairly easy task: to use Python’s re.sub to replace all occurrences of one piece of text with another. The only catch was that only the first two of them were being replaced. But why?

Of course, the situation was a little more complex: the project I was modifying takes Markdown text as input and, with the help of Pandoc, returns its content as HTML. After a few hours of debugging the whole transformation process, I tracked the issue down to a single function. Can you spot the mistake?

import re


def format_custom_tags(source: str) -> str:
    """
    Replaces all custom-defined tags with divs
    e.g. <ksi-tip> is replaced with <div class="ksi-custom ksi-tip">
    :param source: HTML to adjust
    :return: adjusted HTML
    """
    tags = ('ksi-tip',)
    for tag in tags:
        tag_escaped = re.escape(tag)
        source = re.sub(fr'<{tag_escaped}(.*?)>', fr'<div class="ksi-custom {tag}"\1>', source, re.IGNORECASE)
        source = re.sub(fr'</{tag_escaped}>', r"</div>", source, re.IGNORECASE)
    return source

Side note: Why is this function required? During the conversion, I want to replace each occurrence of the custom tag <ksi-tip> with <div class="ksi-custom ksi-tip">. If I skipped this step, Pandoc would sometimes treat my custom tag as plain text, which broke the formatting in certain cases. When the tag is expressed as a class on a div, everything works as expected.

Solution

Funnily enough, after searching for "Why does Python re.sub replace only two occurrences", I found out that I was not the only one losing my mind over this issue: someone had already asked exactly the same question. In Python, re.IGNORECASE equals 2, and the fourth positional parameter of re.sub is count, not flags, so re.sub(pattern, repl, source, re.IGNORECASE) really means "replace at most 2 matches, with no flags at all". That’s it. The whole issue was caused by passing the arguments in the wrong order, so, in the end, the fix was quite straightforward.
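
For illustration, here is a minimal sketch of what the corrected function could look like, assuming the only change needed is to pass the flag by keyword. The tiny 'aaaa' demo and the sample input at the bottom are made up for this example and are not taken from the project.

```python
import re


def format_custom_tags(source: str) -> str:
    """Replace custom tags such as <ksi-tip> with <div class="ksi-custom ksi-tip">."""
    tags = ('ksi-tip',)
    for tag in tags:
        tag_escaped = re.escape(tag)
        # flags is passed by keyword, so it can no longer be mistaken for count
        source = re.sub(fr'<{tag_escaped}(.*?)>', fr'<div class="ksi-custom {tag}"\1>',
                        source, flags=re.IGNORECASE)
        source = re.sub(fr'</{tag_escaped}>', r'</div>', source, flags=re.IGNORECASE)
    return source


# The bug in miniature: re.IGNORECASE is an integer flag with the value 2,
# so in the count slot it caps the number of replacements at two.
assert int(re.IGNORECASE) == 2
buggy = re.sub(r'a', 'b', 'aaaa', count=re.IGNORECASE)  # what the positional call amounts to
fixed = re.sub(r'a', 'b', 'aaaa', flags=re.IGNORECASE)  # what was intended

# Hypothetical input with three custom tags; all three should be converted.
sample = '<ksi-tip>one</ksi-tip>' * 3
converted = format_custom_tags(sample)
```

Naming the argument (flags=...) is the design choice that matters here: any optional argument after the input string is safer passed by keyword, because a flag in the wrong slot is accepted as a perfectly valid integer and never raises an error.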