# Surprising list overallocation from .split() #91146
When looking into a StackOverflow question about surprisingly high memory use, I stumbled into this (under 3.10.1, Win64):

```python
>>> import sys
>>> s = "1 2 3 4 5".split()
>>> s
['1', '2', '3', '4', '5']
>>> sys.getsizeof(s)
152
>>> _ - sys.getsizeof([])
96
>>> 96 / 8
12.0
```

That is, we allocated enough space in the list to store 12(!) elements, even though only 5 are used. Other ways of building a 5-element list I've tried overallocate by at most 3 slots:

```python
>>> sys.getsizeof([ch for ch in "12345"])
120
>>> sys.getsizeof([1, 2, 3, 4, 5])
120
```

(and 120 - 56 = 64, room for 8 pointers)

Then there's this curiosity, which allocates space for exactly the 5 needed:

```python
>>> sys.getsizeof(list(tuple("1 2 3 4 5".split())))
96
```

(and 96 - 56 = 40, room for the 5 pointers needed)

I don't expect this to be consistent, but allocating space for 12 when only 5 are needed is unreasonable. Even allocating space for 8 is pushing it ;-)
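To make the arithmetic above easier to reproduce, here is a small helper; the name `allocated_slots` is introduced here just for illustration. It assumes a 64-bit build, where the empty-list header is 56 bytes and each item slot is an 8-byte pointer; the printed values are the ones reported above under 3.10.1 and will vary on other versions:

```python
import sys

def allocated_slots(lst):
    # Infer how many item slots a list has reserved: total size minus the
    # fixed header (the size of an empty list), divided by the size of a
    # pointer on a 64-bit build.
    return (sys.getsizeof(lst) - sys.getsizeof([])) // 8

print(allocated_slots("1 2 3 4 5".split()))               # 12 under 3.10.1
print(allocated_slots([ch for ch in "12345"]))            # 8
print(allocated_slots(list(tuple("1 2 3 4 5".split()))))  # 5
```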
The value 12 is hardcoded here: `cpython/Objects/stringlib/split.h`, line 14 (at commit a89c29f).

The comment there says that this is because most .split() calls are on lines of human-readable text, which average about 11 words per line. I don't know if I believe that.
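For what it's worth, that claim is easy to test against a real corpus. A minimal sketch, where `sample.txt` is a placeholder for any human-readable text file:

```python
# Average words per non-empty line; "sample.txt" is a stand-in for
# whatever text you want to measure against the ~11-words claim.
with open("sample.txt", encoding="utf-8") as f:
    lines = [line for line in f if line.strip()]

avg = sum(len(line.split()) for line in lines) / len(lines)
print(f"average words per non-empty line: {avg:.1f}")
```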
Well, that's annoying ;-) In context, the OP was saving a list of 10 million splits. So each overallocation by a single element burned 80 million bytes of RAM, and overallocating by 7 burned 560 million bytes.

Which is unusual. Usually a split result is short-lived, consumed once then thrown away. OTOH, the overwhelming motivation for overallocating at all is to achieve O(1) amortized time after a long _sequence_ of appends, and split results typically aren't appended to at all. split() appears to be using it as a timing micro-optimization for tiny lists instead.

So, like I said, it's annoying ;-) For "small" lists, split() really shouldn't overallocate at all (because, as before, split results are rarely appended to). A compromise could be to save pointers to the first N (12, whatever) instances of the splitting string in a stack ("auto") vector, before any list object (or result string object) is created. If it's out of stuff to do before reaching N, fine, build a result out of exactly what was found. If there's more to do, build a result from the first N, and go on as currently (letting PyList_Append deal with it - overallocation is huge in percentage terms when the list is short, but not so much as the list gets longer).
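A rough Python model of that two-phase idea (the actual change would live in C inside the split implementation; `split_model` and `N` are names introduced here for illustration, and only an explicit separator is handled, not whitespace splitting):

```python
N = 12  # plays the role of the preallocation limit; value is illustrative

def split_model(s, sep):
    # Phase 1: collect up to N pieces before committing to a result size.
    # In C these would be pointers saved in a stack ("auto") buffer.
    pieces = []
    start = 0
    while len(pieces) < N:
        i = s.find(sep, start)
        if i < 0:
            pieces.append(s[start:])
            # Ran out of input before reaching N: build an exactly-sized
            # result (PyList_New(len(pieces)) in C) -- no overallocation.
            return pieces
        pieces.append(s[start:i])
        start = i + len(sep)
    # Phase 2: more than N pieces exist; keep the first N and append the
    # rest, letting normal list growth (PyList_Append in C) amortize any
    # further resizes.
    result = pieces
    while True:
        i = s.find(sep, start)
        if i < 0:
            result.append(s[start:])
            return result
        result.append(s[start:i])
        start = i + len(sep)

print(split_model("1 2 3 4 5", " "))  # ['1', '2', '3', '4', '5']
```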
(Two commits co-authored by Inada Naoki were subsequently linked to this issue.)
Now the size is much reduced. I would like to close the issue until we need to squeeze more.

Feel free to reopen it if needed :)
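A quick way to check the behavior on a post-fix interpreter; the exact size varies by version and platform, so no specific number is claimed here:

```python
import sys

s = "1 2 3 4 5".split()
# Item storage in bytes: total list size minus the empty-list header.
# After the fix this should be well below the 96 bytes (12 slots)
# observed under 3.10.1.
print(sys.getsizeof(s) - sys.getsizeof([]))
```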
Thanks for this! I wasn't paying attention and missed this while it was in progress. I'm happy with the result 😄.