Improve speed of textwrap.dedent #131791

Closed

Marius-Juston opened this issue Mar 27, 2025 · 19 comments
Labels
performance (Performance or resource usage) · stdlib (Python modules in the Lib dir) · type-feature (A feature request or enhancement)

Comments

@Marius-Juston
Contributor

Marius-Juston commented Mar 27, 2025

Feature or enhancement

Proposal:

Current code:

import re

_whitespace_only_re = re.compile('^[ \t]+$', re.MULTILINE)
_leading_whitespace_re = re.compile('(^[ \t]*)(?:[^ \t\n])', re.MULTILINE)

def dedent(text):
    """Remove any common leading whitespace from every line in `text`.

    This can be used to make triple-quoted strings line up with the left
    edge of the display, while still presenting them in the source code
    in indented form.

    Note that tabs and spaces are both treated as whitespace, but they
    are not equal: the lines "  hello" and "\\thello" are
    considered to have no common leading whitespace.

    Entirely blank lines are normalized to a newline character.
    """
    # Look for the longest leading string of spaces and tabs common to
    # all lines.
    margin = None
    text = _whitespace_only_re.sub('', text)
    indents = _leading_whitespace_re.findall(text)
    for indent in indents:
        if margin is None:
            margin = indent

        # Current line more deeply indented than previous winner:
        # no change (previous winner is still on top).
        elif indent.startswith(margin):
            pass

        # Current line consistent with and no deeper than previous winner:
        # it's the new winner.
        elif margin.startswith(indent):
            margin = indent

        # Find the largest common whitespace between current line and previous
        # winner.
        else:
            for i, (x, y) in enumerate(zip(margin, indent)):
                if x != y:
                    margin = margin[:i]
                    break

    # sanity check (testing/debugging only)
    if 0 and margin:
        for line in text.split("\n"):
            assert not line or line.startswith(margin), \
                   "line = %r, margin = %r" % (line, margin)

    if margin:
        text = re.sub(r'(?m)^' + margin, '', text)
    return text
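
For reference, a quick illustration of the documented behavior (in particular, that space- and tab-indented lines share no common margin):

import textwrap

# A common four-space margin is removed from every line.
print(textwrap.dedent("    a\n    b"))      # -> "a\nb"

# "  hello" and "\thello" share no leading whitespace, so the
# text is returned unchanged.
print(textwrap.dedent("  hello\n\thello"))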

This can speed up the process for large files by up to 4x:

def dedent_faster(text: str, only_whitespace: bool = True) -> str:
    """Remove any common leading whitespace from every line in `text`.

    This can be used to make triple-quoted strings line up with the left
    edge of the display, while still presenting them in the source code
    in indented form.

    Note that tabs and spaces are both treated as whitespace, but they
    are not equal: the lines "  hello" and "\\thello" are
    considered to have no common leading whitespace.
    
    If `only_whitespace` is `True`, only common leading whitespace is
    removed. Otherwise, any common leading prefix (whitespace or not)
    is removed.

    Entirely blank lines are normalized to a newline character.
    """
    # Early return for empty input
    if not text:
        return text

    # Split into lines
    lines = text.splitlines(True)

    # Fast path for single line - but make sure we still dedent!
    if len(lines) == 1:
        line = lines[0]
        stripped = line.strip()
        if not stripped:  # Blank line
            return "\n" if line.endswith("\n") else ""

        # Find leading whitespace for a single line
        if only_whitespace:
            i = 0
            while i < len(line) and line[i] in " \t":
                i += 1
            if i > 0:  # Has leading whitespace to remove
                return line[i:]
        else:
            lead_size = len(line) - len(line.lstrip())
            if lead_size > 0:  # Has leading whitespace to remove
                return line[lead_size:]
        return line  # No whitespace to remove

    # Cache method lookups for faster access
    _strip = str.strip
    _startswith = str.startswith
    _endswith = str.endswith

    # Find first two non-blank lines
    non_blank = []
    for line in lines:
        if _strip(line):
            non_blank.append(line)
            if len(non_blank) == 2:
                break

    # All lines are blank
    if not non_blank:
        result = []
        append = result.append
        for line in lines:
            append("\n" if _endswith(line, "\n") else "")
        return "".join(result)

    # Calculate margin length efficiently
    if len(non_blank) == 1:
        # Single non-blank line
        line = non_blank[0]
        if only_whitespace:
            # Manually find leading whitespace (faster than regex)
            i = 0
            line_len = len(line)
            while i < line_len and line[i] in " \t":
                i += 1
            margin_len = i
        else:
            # Use built-in lstrip for non-whitespace case
            margin_len = len(line) - len(line.lstrip())
    else:
        # Find common prefix of first two non-blank lines
        a, b = non_blank
        min_len = min(len(a), len(b))
        i = 0

        if only_whitespace:
            # Advance while both lines agree on a space/tab character
            while i < min_len and a[i] == b[i] and a[i] in " \t":
                i += 1
        else:
            while i < min_len and a[i] == b[i]:
                i += 1

        margin_len = i

    # No margin to remove - return original with blank line normalization
    if margin_len == 0:
        result = []
        append = result.append
        for line in lines:
            if _strip(line):  # Non-blank line
                append(line)
            else:  # Blank line
                append("\n" if _endswith(line, "\n") else "")
        return "".join(result)

    # Get margin string once for repeated comparison
    margin = non_blank[0][:margin_len]

    # Accumulate output lines and join once at the end
    result = []
    append = result.append

    # Process all lines with optimized operations
    for line in lines:
        if not _strip(line):  # Blank line (including whitespace-only lines)
            append("\n" if _endswith(line, "\n") else "")
        elif _startswith(line, margin):  # Has margin
            # Slice operation is very fast in Python
            append(line[margin_len:])
        else:  # No matching margin
            append(line)

    # Single join is faster than incremental string building
    return "".join(result)

which has the following speed outputs:

import timeit
from textwrap import indent

if __name__ == '__main__':
    # Benchmark against a large file from the CPython source tree.
    with open("../Objects/unicodeobject.c") as f:
        raw_text = f.read()

    tests = []
    test_names = ['', "    ", "\t", "abc  \t", ' \t  abc', ' \t  abc  \t  ']

    temp = dedent(raw_text)

    for indent_v in test_names:
        text = indent(raw_text, indent_v)

        tests.append(text)


        output = dedent_faster(text, only_whitespace=False)

        print("Validating large text with not only whitespace indentation:", indent_v.encode("ascii"), output == temp)


    # Basic indented text with empty lines
    text = """
        def hello():
            print("Hello, world!")


        """

    tests.append(text)
    test_names.append("Basic indented text with empty lines")

    # Text with mixed indentation and blank lines
    text = """
        This is a test.

        Another line.

    """

    tests.append(text)
    test_names.append("Text with mixed indentation and blank lines")

    # No indentation (edge case)
    text = """No indents here.
    Just normal text.

    With empty lines."""

    tests.append(text)
    test_names.append("No indentation (edge case)")

    # Only blank lines (should preserve them)
    text = """


    """

    tests.append(text)
    test_names.append("Only blank lines")

    # Edge case: No common prefix to remove
    text = """hello
        world
    """

    tests.append(text)
    test_names.append("Edge case: No common prefix to remove")

    # Edge case: Single indented line
    text = """    Only one indented line"""

    tests.append(text)
    test_names.append("Edge case: Single indented line")

    # Edge case: Single indented line
    text = """    """

    tests.append(text)
    test_names.append("Edge case: Single indented line only")

    # Edge case: Single indented line
    text = """"""

    tests.append(text)
    test_names.append("Edge case: Empty text")

    for text, name in zip(tests, test_names):
        print(f"========= Case: {name.encode('ascii')} =========")

        a = dedent(text)

        for func in [dedent_faster, dedent]:
            single_test = func(text)

            print(func.__name__, a == single_test)

            it = timeit.Timer(lambda: func(text))
            result = it.repeat(number=10_000)
            result.sort()
            print(f"{func.__name__} Min: {result[0]:.4f}msec")

Returning the following:

Validating large text with not only whitespace indentation: b'' True
Validating large text with not only whitespace indentation: b'    ' True
Validating large text with not only whitespace indentation: b'\t' True
Validating large text with not only whitespace indentation: b'abc  \t' True
Validating large text with not only whitespace indentation: b' \t  abc' True
Validating large text with not only whitespace indentation: b' \t  abc  \t  ' True
========= Case: b'' =========
dedent_faster True
dedent_faster Min: 1.5848msec
dedent True
dedent Min: 6.6143msec
========= Case: b'    ' =========
dedent_faster True
dedent_faster Min: 2.5275msec
dedent True
dedent Min: 10.6884msec
========= Case: b'\t' =========
dedent_faster True
dedent_faster Min: 2.3215msec
dedent True
dedent Min: 9.9588msec
========= Case: b'abc  \t' =========
dedent_faster True
dedent_faster Min: 1.5026msec
dedent True
dedent Min: 6.7075msec
========= Case: b' \t  abc' =========
dedent_faster True
dedent_faster Min: 2.5997msec
dedent True
dedent Min: 10.6693msec
========= Case: b' \t  abc  \t  ' =========
dedent_faster True
dedent_faster Min: 2.6204msec
dedent True
dedent Min: 11.7285msec
========= Case: b'Basic indented text with empty lines' =========
dedent_faster True
dedent_faster Min: 0.0016msec
dedent True
dedent Min: 0.0022msec
========= Case: b'Text with mixed indentation and blank lines' =========
dedent_faster True
dedent_faster Min: 0.0016msec
dedent True
dedent Min: 0.0020msec
========= Case: b'No indentation (edge case)' =========
dedent_faster True
dedent_faster Min: 0.0008msec
dedent True
dedent Min: 0.0013msec
========= Case: b'Only blank lines' =========
dedent_faster True
dedent_faster Min: 0.0006msec
dedent True
dedent Min: 0.0005msec
========= Case: b'Edge case: No common prefix to remove' =========
dedent_faster True
dedent_faster Min: 0.0007msec
dedent True
dedent Min: 0.0008msec
========= Case: b'Edge case: Single indented line' =========
dedent_faster True
dedent_faster Min: 0.0004msec
dedent True
dedent Min: 0.0013msec
========= Case: b'Edge case: Single indented line only' =========
dedent_faster True
dedent_faster Min: 0.0002msec
dedent True
dedent Min: 0.0003msec
========= Case: b'Edge case: Empty text' =========
dedent_faster True
dedent_faster Min: 0.0001msec
dedent True
dedent Min: 0.0002msec

This thus gives a roughly 4x performance boost for large files. Another advantage is that this method allows people to remove all common prefixes from the file as an option, instead of just whitespace.
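
As a quick sketch of that option, using the dedent_faster above on illustrative data:

text = ">>> echo hello\n>>> ls\n"
print(dedent_faster(text))                          # unchanged: ">>> " is not whitespace
print(dedent_faster(text, only_whitespace=False))   # -> "echo hello\nls\n"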

(This was optimized iteratively using Claude + ChatGPT models.)

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

No response

Linked PRs

@Marius-Juston Marius-Juston added the type-feature A feature request or enhancement label Mar 27, 2025
@Marius-Juston
Contributor Author

#131792

@StanFromIreland
Contributor

StanFromIreland commented Mar 27, 2025

Firstly, it seems this issue and PR are fully generated by AI; please see the devguide.

Unacceptable uses
Maintainers may close issues and PRs that are not useful or productive, including those that are fully generated by AI. If a contributor repeatedly opens unproductive issues or PRs, they may be blocked.

Secondly, it seems you have both a proposal to optimize and a proposal for new functionality? These generally should not be mixed together. Discussions for new functionality should be held in the python-ideas section of Discourse.

Lastly, changes like this will need tests. I also doubt they pass the existing ones; for some reason the CI has not run on your PR.

When one applies the provided patch:

Traceback (most recent call last):
  File "/home/stan/PycharmProjects/cpython/Lib/test/test_textwrap.py", line 884, in test_dedent_preserve_margin_tabs
    self.assertEqual(expect, dedent(text))
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
AssertionError: " \thello there\n  \thow are you?\n\tI'm fine, thanks" != "\thello there\n \thow are you?\n \tI'm fine, thanks"
-  	hello there
? -
+ 	hello there
-   	how are you?
? -
+  	how are you?
- 	I'm fine, thanks
+  	I'm fine, thanks
? +



Ran 66 tests in 0.011s


FAILED (failures=1)

@donBarbos
Contributor

Looks like it's related to #130167.

@Marius-Juston
Contributor Author

Yeah, I was clearly overeager for the performance improvement, and clearly did not do my due diligence. I will probably just close everything and work on it offline until a proper solution is developed. In the linked PR I was made aware of these comments.

@donBarbos
Contributor

I don't think you should close this issue (as completed). I also think your PR could be split; for example, send something that replaces the use of re in #130167.

If you want to add a feature, for example a new argument, then first write about it separately and don't send it in the same PR, since it's a separate topic for discussion.

But your changes are good, I would be glad to see them.
It's better to start with small changes :)

@Marius-Juston
Contributor Author

Marius-Juston commented Mar 27, 2025

Appreciate the comment!

import os

def dedent(text):
    if not text:
        return text

    lines = text.split("\n")

    # Longest common prefix of all non-blank lines; commonprefix compares
    # character-wise and works on any strings, not just paths.
    splitting = os.path.commonprefix([line for line in lines if line.lstrip()])

    # Only the leading run of spaces/tabs of that prefix counts as margin.
    margin_len = 0
    for split in splitting:
        if split in ' \t':
            margin_len += 1
        else:
            break

    # Apply margin removal; whitespace-only lines are normalized to empty
    # (after split("\n") no line can end with "\n").
    return "\n".join(line[margin_len:] if line.strip() else "" for line in lines)

Another promising approach is simply the above (non-AI-generated this time, lol), which greatly simplifies everything with roughly the same performance gains.
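
The key trick is that os.path.commonprefix does a plain character-wise comparison and accepts any sequence of strings, not just paths:

import os.path

print(os.path.commonprefix(["    def f():", "        pass"]))   # -> "    "
print(os.path.commonprefix(["\tfoo", "    bar"]))               # -> "" (tab vs space)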

@Marius-Juston Marius-Juston reopened this Mar 28, 2025
@Marius-Juston
Contributor Author

Marius-Juston commented Mar 28, 2025

@donBarbos thank you for the suggestion, I will reopen the issue.

In addition:

margin_len = 0

for split in splitting:
    if split in ' \t':
        margin_len += 1
    else:
        break

has now been further optimized to just

margin_len = len(splitting) - len(splitting.lstrip())

with no noticeable performance impact.
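
This works because lstrip() removes the whole leading whitespace run, so the length difference is the margin width. A quick illustration (note that bare lstrip() strips any whitespace character, not only spaces and tabs):

splitting = " \t  abc"   # common prefix of the non-blank lines
margin_len = len(splitting) - len(splitting.lstrip())
print(margin_len)                     # 4
print(repr(splitting[:margin_len]))   # ' \t  '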

@Marius-Juston
Contributor Author

Marius-Juston commented Mar 28, 2025

I feel like there should be a good way to cache the line.lstrip() result, since it is computed twice: once in the list comprehension and once in the filter. After trying a bunch of things, I have not been able to improve the performance beyond this new implementation.

My current best implementation of that:

def dedent(text):
    if not text:
        return text

    lines = text.split("\n")

    if len(lines) == 1:
        return text.lstrip()

    early_split = False

    min_margin = None
    max_margin = None

    first = True

    only_empty = True

    for i in range(len(lines)):
        line = lines[i]

        lstripped = line.lstrip()

        if lstripped:
            only_empty = False
            if early_split:
                continue

            margin = len(line) - len(lstripped)

            if margin == 0:
                early_split = True
            else:
                value = line[:margin]

                if first:
                    first = False
                    min_margin = value
                    max_margin = value
                else:
                    min_margin = min(min_margin, value)
                    max_margin = max(max_margin, value)
        else:
            lines[i] = ''

    if early_split or only_empty:
        return "\n".join(lines)

    common_prefix_len = 0
    for c1, c2 in zip(min_margin, max_margin):
        if c1 == c2:
            common_prefix_len += 1
        else:
            break

    return "\n".join(line[common_prefix_len:] for line in lines)

I also tried to encode the margins through a binary encoding, since the lstrip only extracts spaces and \t, so I thought using a numerical encoding rather than string comparisons might help. But sadly that did not turn out to be faster.

def magic_strip(text):
    if not text:
        return 1, 0  # Empty string case

    encoding = 1
    num_removed = 0

    for c in text:
        if c == ' ':
            encoding <<= 1
            num_removed += 1
        elif c == '\t':
            encoding = (encoding << 1) | 1
            num_removed += 1
        else:
            break

    return encoding, num_removed

def common_margin(a, a_length, b, b_length):
    if a == b and a_length == b_length:
        return a, a_length

    if a_length == 0 or b_length == 0:
        return 1, 0

    min_len = min(a_length, b_length)

    a_shifted = a >> (a_length - min_len)
    b_shifted = b >> (b_length - min_len)

    xor_result = a_shifted ^ b_shifted

    if xor_result == 0:
        return a_shifted, min_len

    common_length = 0
    mask = 1 << (min_len - 1)

    while mask and not (xor_result & mask):
        common_length += 1
        mask >>= 1

    if common_length == 0:
        return 1, 0
    else:
        return a >> (a_length - common_length), common_length


def dedent(text: str) -> str:
    if not text:
        return text

    lines = text.split("\n")

    if len(lines) == 1:  # split() always yields at least one line
        return text.lstrip()

    early_split = False
    first = True
    only_empty = True
    min_encoding = 0
    min_length = 0

    for i, line in enumerate(lines):
        encoding, margin = magic_strip(line)

        if margin != len(line):
            only_empty = False
            if early_split:
                continue

            if margin == 0:
                early_split = True
            else:
                if first:
                    first = False
                    min_encoding = encoding
                    min_length = margin
                else:
                    min_encoding, min_length = common_margin(min_encoding, min_length, encoding, margin)

                    if min_length == 0:
                        early_split = True
        else:
            lines[i] = ''

    if early_split or only_empty:
        return "\n".join(lines)

    return "\n".join(line[min_length:] for line in lines)

I guess the C-compiled lstrip is currently so fast that it does not really matter. This approach is, however, twice as fast for the case of large files with no leading whitespace.

(I have validated that both sets of functions above pass all the test cases in the test suite.)

I thought the bit method would be interesting to use, but it is probably best suited for a C implementation?
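
For illustration, here is how the bit encoding behaves: after a sentinel 1 bit, each leading space contributes a 0 bit and each tab a 1 bit, and common_margin compares the aligned encodings:

enc_a, len_a = magic_strip(" \tx")    # space then tab -> 0b1_01
print(bin(enc_a), len_a)              # 0b101 2

enc_b, len_b = magic_strip("  x")     # two spaces -> 0b1_00
print(bin(enc_b), len_b)              # 0b100 2

# Only the first character (a space) is common to both margins.
print(common_margin(enc_a, len_a, enc_b, len_b))   # (2, 1), i.e. 0b1_0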

@Marius-Juston
Contributor Author

Speed comparisons of the three variants:

========= Case: b'abc  \t' =========
dedent_binary True
dedent_binary Min: 2.8362msec
dedent_cached_lstrip True
dedent_cached_lstrip Min: 1.3299msec
dedent_faster True
dedent_faster Min: 2.5255msec
========= Case: b'' =========
dedent_binary True
dedent_binary Min: 7.3707msec
dedent_cached_lstrip True
dedent_cached_lstrip Min: 1.5261msec
dedent_faster True
dedent_faster Min: 2.9654msec
========= Case: b'    ' =========
dedent_binary True
dedent_binary Min: 13.4516msec
dedent_cached_lstrip True
dedent_cached_lstrip Min: 4.7276msec
dedent_faster True
dedent_faster Min: 3.5121msec
========= Case: b'\t' =========
dedent_binary True
dedent_binary Min: 12.1062msec
dedent_cached_lstrip True
dedent_cached_lstrip Min: 4.6290msec
dedent_faster True
dedent_faster Min: 3.4816msec
========= Case: b' \t  abc' =========
dedent_binary True
dedent_binary Min: 8.0713msec
dedent_cached_lstrip True
dedent_cached_lstrip Min: 4.7934msec
dedent_faster True
dedent_faster Min: 3.5026msec


@picnixz picnixz added performance Performance or resource usage stdlib Python modules in the Lib dir labels Mar 28, 2025
@StanFromIreland
Contributor

This is a duplicate of #130167 (well, the PR is, and the feature request needs its own issue)

@donBarbos
Contributor

This is a duplicate of #130167 (well, the PR is, and the feature request needs its own issue)

Yeah, we can rename the current issue to something like Add only_whitespace option to textwrap.dedent and reassign the existing PR to gh-130167.

@Marius-Juston
Contributor Author

Marius-Juston commented Mar 28, 2025

Yeah, we can rename the current issue to something like Add only_whitespace option to textwrap.dedent and reassign the existing PR to gh-130167.

I have actually removed that option now, since @ZeroIntensity advised me to remove that flag. So this PR is now purely about speed optimizations for the function. I can add it back to the new current code, as it should be easy to do if so desired.

@Marius-Juston Marius-Juston changed the title Optimize textwrap.dedent Improve speed of textwrap.dedent by replacing re Mar 28, 2025
@Marius-Juston Marius-Juston changed the title Improve speed of textwrap.dedent by replacing re gh-gh-131791: Improve speed of textwrap.dedent by replacing re Mar 28, 2025
@Marius-Juston Marius-Juston changed the title gh-gh-131791: Improve speed of textwrap.dedent by replacing re gh-131791: Improve speed of textwrap.dedent by replacing re Mar 28, 2025
@donBarbos
Contributor

donBarbos commented Mar 28, 2025

I have actually removed that option now since @ZeroIntensity informed me to remove that flag. So this PR is now purely on speed optimizations for the function. I can add it back to the new current code as it should be easy to do if so desired.

No :-) I mean change the title of the current issue to Add only_whitespace option to textwrap.dedent and change your PR title to gh-130167: Improve speed of `textwrap.dedent` by replacing `re`

P.S.: I'm talking only about the title; please don't add more functionality to your code. You can do it later in a separate PR.

@picnixz
Member

picnixz commented Mar 28, 2025

This is a duplicate of #130167 (well, the pr is, and the feature request needs its own issue)

IMO, I wouldn't put it under the same issue, as this one is really a change of algorithm. #130167 was meant to track issues that are "easy" to review (not those for which the code base is changed).

I thought that the bit method would be interesting to use but that is probably best suited for a C implementation?

FTR, we don't like adding C implementations when it adds maintenance burden. Even if we end up being much faster, the cost of maintaining a C implementation is usually not worth it.


I'm continuing a review on the PR side, especially concerning the shown benchmarks and the implementation.

@Marius-Juston Marius-Juston changed the title gh-131791: Improve speed of textwrap.dedent by replacing re Add only_whitespace option to textwrap.dedent Mar 28, 2025
@Marius-Juston
Contributor Author

Sounds good, done!

@Marius-Juston
Contributor Author

FTR, we don't like adding C implementations when it adds maintenance burden. Even if we end up being much faster, the cost of maintaining a C implementation is usually not worth it.

Makes sense!

I'm continuing a review on the PR side, especially concerning the shown benchmarks and the implementation.

Sounds good, if I did something stupid, please let me know!

@picnixz
Member

picnixz commented Mar 28, 2025

Adding only_whitespace is a new feature that probably requires a DPO thread IMO. I can see some use when a file mixes both spaces and tabs, in which case one would be happy to dedent the entire text without bothering about them, but OTOH, I'm afraid that this isn't useful if we're only considering removing whitespaces (so I'm more +0). I think it would make sense to provide a new function to remove a common prefix (say, textwrap.remove_common_prefix()), however this also requires a DPO discussion.
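
For concreteness, a minimal sketch of what such a hypothetical textwrap.remove_common_prefix() could look like (the name and exact semantics here are assumptions, pending a DPO discussion):

import os.path

def remove_common_prefix(text: str) -> str:
    """Hypothetical helper: strip the longest prefix shared by all
    non-blank lines, whether or not it is whitespace."""
    lines = text.split("\n")
    prefix = os.path.commonprefix([line for line in lines if line.strip()])
    n = len(prefix)
    return "\n".join(line[n:] if line.strip() else line for line in lines)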

@Marius-Juston
Contributor Author

I can see some use when a file mixes both spaces and tabs, in which case one would be happy to dedent the entire text without bothering about them

This is already what dedent does. It removes all common whitespace as the indentation: prefixed spaces, \t, and any combination of them.

For remove_common_prefix, I was thinking its use could be for things like this (which is something I remember coming across):

>>> echo "Hello"
>>> ./python do_thing

etc...

So I can see this function changing it to

echo "Hello"
./python do_thing

instead. But yeah, overall, it is not incredibly useful and very niche.
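
With the hypothetical sketch above, that would be:

print(remove_common_prefix('>>> echo "Hello"\n>>> ./python do_thing'))
# echo "Hello"
# ./python do_thing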

@picnixz
Member

picnixz commented Apr 5, 2025

I'm going to close this one as the original purpose was to make textwrap.dedent() faster.

@picnixz picnixz closed this as completed Apr 5, 2025
@picnixz picnixz changed the title Add only_whitespace option to textwrap.dedent Improve speed of textwrap.dedent Apr 5, 2025