Improve speed of textwrap.dedent
#131791
Comments
Firstly, it seems this issue and PR are fully generated by AI; please see the devguide.
Secondly, it seems you have both a proposal to optimize and a proposal for new functionality? These generally should not be mixed together; discussions of new functionality should be held on the python-ideas Discourse. Lastly, changes like this will need tests. I also doubt they pass the existing ones; for some reason the CI has not run on your PR. When one applies the provided patch:
|
Looks like this is related to #130167. |
Yeah, I was clearly overeager for the performance improvement, and I clearly did not do my due diligence. I will probably just close everything and work on it offline until a proper solution is developed. In the linked PR I was made aware of these comments. |
I don't think you should close this issue (as completed). I also think your PR could be split: for example, first send something that replaces the use of re. If you want to add a feature, for example a new argument, then write about it separately first and don't send it in the same PR, since it's a separate topic for discussion. But your changes are good; I would be glad to see them. |
Appreciate the comment! Another promising approach is simply this (non-AI generated this time, lol), which greatly simplifies everything with same-ish performance gains:

```python
import os

def dedent(text):
    if not text:
        return text
    lines = text.split("\n")
    # common prefix of all lines that are not whitespace-only
    splitting = os.path.commonprefix(tuple(filter(lambda x: x.lstrip(), lines)))
    margin_len = 0
    for split in splitting:
        if split in ' \t':
            margin_len += 1
        else:
            break
    # Apply margin removal (most common case) with minimal operations
    # (note: after split("\n"), no line ends with "\n", so the middle branch
    # just normalizes whitespace-only lines to "")
    return "\n".join([line[margin_len:] if line.strip()
                      else "\n" if line.endswith("\n") else ""
                      for line in lines])
```
|
@donBarbos thank you for the suggestion, I will reopen the issue. In addition:

```python
margin_len = 0
for split in splitting:
    if split in ' \t':
        margin_len += 1
    else:
        break
```

has now been further optimized to just `margin_len = len(margin) - len(margin.lstrip())`, with no noticeable performance impact. |
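A quick check of that equivalence (assuming `margin` holds the common prefix computed above, so it contains no newlines; note that bare `lstrip()` also strips rarer whitespace such as `\f` and `\v`, which the loop does not count, so `lstrip(' \t')` is the exact equivalent):

```python
margin = "  \t "

# loop version: count leading spaces and tabs only
margin_len = 0
for ch in margin:
    if ch in ' \t':
        margin_len += 1
    else:
        break

# one-liner version, restricted to the same character set
assert margin_len == len(margin) - len(margin.lstrip(' \t')) == 4
```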
I feel like there should be a good way to cache the `line.lstrip()`, since it is computed twice: once in the list comprehension and once in the filter. After trying a bunch of things, I have not been able to get the performance to improve beyond this new implementation. My current best implementation of that:

```python
def dedent(text):
    if not text:
        return text
    lines = text.split("\n")
    if len(lines) == 1:
        return text.lstrip()
    early_split = False      # set once any line has zero margin
    min_margin = None
    max_margin = None
    first = True
    only_empty = True
    for i in range(len(lines)):
        line = lines[i]
        lstripped = line.lstrip()
        if lstripped:
            only_empty = False
            if early_split:
                continue
            margin = len(line) - len(lstripped)
            if margin == 0:
                early_split = True
            else:
                value = line[:margin]
                if first:
                    first = False
                    min_margin = value
                    max_margin = value
                else:
                    min_margin = min(min_margin, value)
                    max_margin = max(max_margin, value)
        else:
            lines[i] = ''    # normalize whitespace-only lines
    if early_split or only_empty:
        return "\n".join(lines)
    # the common prefix of all margins equals the common prefix of the
    # lexicographically smallest and largest margins
    common_prefix_len = 0
    for c1, c2 in zip(min_margin, max_margin):
        if c1 == c2:
            common_prefix_len += 1
        else:
            break
    return "\n".join(line[common_prefix_len:] for line in lines)
```

I also tried to encode the margins through a binary encoding, since the margin can only ever contain `' '` and `'\t'`:

```python
def magic_strip(text):
    if not text:
        return 1, 0  # Empty string case
    encoding = 1     # leading 1 is a sentinel so leading-space zeros survive
    num_removed = 0
    for c in text:
        if c == ' ':
            encoding <<= 1
            num_removed += 1
        elif c == '\t':
            encoding = (encoding << 1) | 1
            num_removed += 1
        else:
            break
    return encoding, num_removed


def common_margin(a, a_length, b, b_length):
    if a == b and a_length == b_length:
        return a, a_length
    if a_length == 0 or b_length == 0:
        return 1, 0
    min_len = min(a_length, b_length)
    a_shifted = a >> (a_length - min_len)
    b_shifted = b >> (b_length - min_len)
    xor_result = a_shifted ^ b_shifted
    if xor_result == 0:
        return a_shifted, min_len
    # length of the common high-bit prefix below the sentinel
    common_length = 0
    mask = 1 << (min_len - 1)
    while mask and not (xor_result & mask):
        common_length += 1
        mask >>= 1
    if common_length == 0:
        return 1, 0
    else:
        return a >> (a_length - common_length), common_length


def dedent(text: str) -> str:
    if not text:
        return text
    lines = text.split("\n")
    if len(lines) == 1:   # split("\n") never yields an empty list
        return text.lstrip()
    early_split = False
    first = True
    only_empty = True
    min_encoding = 0
    min_length = 0
    for i, line in enumerate(lines):
        encoding, margin = magic_strip(line)
        if margin != len(line):
            only_empty = False
            if early_split:
                continue
            if margin == 0:
                early_split = True
            else:
                if first:
                    first = False
                    min_encoding = encoding
                    min_length = margin
                else:
                    min_encoding, min_length = common_margin(
                        min_encoding, min_length, encoding, margin)
                    if min_length == 0:
                        early_split = True
        else:
            lines[i] = ''
    if early_split or only_empty:
        return "\n".join(lines)
    return "\n".join(line[min_length:] for line in lines)
```

I guess currently the C-compiled … (I have validated that both sets of functions above do pass all the test cases in the test suite.) I thought that the bit method would be interesting to use, but that is probably best suited for a C implementation? |
Speed comparisons |
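For reproducing such numbers, a minimal `timeit` harness might look like this (a sketch; `candidate` is a hypothetical stand-in name for any of the variants above):

```python
import textwrap
import timeit

def candidate(text):
    # stand-in: substitute any of the dedent variants from this thread
    return textwrap.dedent(text)

sample = "\n".join("    line %d" % i for i in range(10_000))

for fn in (textwrap.dedent, candidate):
    elapsed = timeit.timeit(lambda: fn(sample), number=200)
    print(f"{fn.__name__}: {elapsed:.3f}s")
```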
This is a duplicate of #130167 (well, the PR is, and the feature request needs its own issue). |
yeah, we can rename the current issue to something like `Improve speed of textwrap.dedent by replacing re` |
I have actually removed that option now, since @ZeroIntensity told me to remove that flag. So this PR is now purely about speed optimizations for the function. I can add it back to the new current code, as it should be easy to do if so desired. |
no :-) I mean change the title of the current issue to `Improve speed of textwrap.dedent by replacing re`. P.S.: I'm talking only about the title; please don't add more functionality to your code. You can do it later in a separate PR. |
IMO, I wouldn't put it under the same issue, as this one is really a change of algorithm. #130167 was meant to track issues that are "easy" to review (not those for which the code base is changed).
FTR, we don't like adding C implementations when they add maintenance burden. Even if we end up being much faster, the cost of maintaining a C implementation is usually not worth it. I'm continuing a review on the PR side, especially concerning the shown benchmarks and the implementation. |
*(Issue retitled from "Improve speed of textwrap.dedent by replacing re" to "Add only_whitespace option to textwrap.dedent".)*
Sounds good, done! |
Makes sense!
Sounds good; if I did something stupid, please let me know! |
Adding `only_whitespace`… |
This is already what … does. For the

```
>>> echo "Hello"
>>> ./python do_thing
etc...
```

case, I can see this function changing it to

```
echo "Hello"
./python do_thing
```

instead. But yeah, overall, it is not incredibly useful and very niche. |
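For context, a rough sketch of what such an option could look like (hypothetical signature, not the stdlib API; with `only_whitespace=False` the full common prefix is stripped, whatever characters it contains):

```python
import os

def dedent(text, only_whitespace=True):
    # hypothetical variant, for illustration only
    lines = text.split("\n")
    prefix = os.path.commonprefix([ln for ln in lines if ln.strip()])
    if only_whitespace:
        # keep only the leading space/tab portion of the common prefix
        prefix = prefix[:len(prefix) - len(prefix.lstrip(' \t'))]
    n = len(prefix)
    return "\n".join(ln[n:] if ln.strip() else '' for ln in lines)

print(dedent('>>> echo "Hello"\n>>> ./python do_thing', only_whitespace=False))
# echo "Hello"
# ./python do_thing
```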
I'm going to close this one, as the original purpose was to make `textwrap.dedent` faster. |
Feature or enhancement
Proposal:
Current code:
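For reference, the `re`-based implementation in `Lib/textwrap.py` at the time is roughly the following (a paraphrase from memory, with a dead debug branch elided):

```python
import re

_whitespace_only_re = re.compile('^[ \t]+$', re.MULTILINE)
_leading_whitespace_re = re.compile('(^[ \t]*)(?:[^ \t\n])', re.MULTILINE)

def dedent(text):
    """Remove any common leading whitespace from all lines in text."""
    margin = None
    # blank out whitespace-only lines, then collect each line's indent
    text = _whitespace_only_re.sub('', text)
    indents = _leading_whitespace_re.findall(text)
    for indent in indents:
        if margin is None:
            margin = indent
        elif indent.startswith(margin):
            # current line is more deeply indented: previous margin stands
            pass
        elif margin.startswith(indent):
            # current line is consistent but shallower: it is the new margin
            margin = indent
        else:
            # mixed indentation: keep the longest common leading run
            for i, (x, y) in enumerate(zip(margin, indent)):
                if x != y:
                    margin = margin[:i]
                    break
    if margin:
        text = re.sub(r'(?m)^' + margin, '', text)
    return text
```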
This can speed up the process for large files by up to 4x:
which has the following speed outputs:
Returning the following:
This thus yields a 4x performance boost for large files. Another advantage is that this method allows people to optionally remove all common prefixes from the file, instead of just whitespace.
(This was optimized iteratively using Claude + ChatGPT models.)
Has this already been discussed elsewhere?
No response given
Links to previous discussion of this feature:
No response
Linked PRs
- Improve speed of textwrap.dedent by replacing re (#131792)
- textwrap.dedent() (#131919)