# Surprising list overallocation from .split() #91146
When looking into a StackOverflow question about surprisingly high memory use, I stumbled into this (under 3.10.1, Win64):

```python
>>> import sys
>>> s = "1 2 3 4 5".split()
>>> s
['1', '2', '3', '4', '5']
>>> sys.getsizeof(s)
152
>>> _ - sys.getsizeof([])
96
>>> 96 / 8
12.0
```

That is, we allocated enough space in the list to store 12(!) elements, even though only 5 are used. Other ways of building a 5-element list I've tried overallocate by at most 3 slots:

```python
>>> sys.getsizeof([ch for ch in "12345"])
120
>>> sys.getsizeof([1, 2, 3, 4, 5])
120
```

(and 120 - 56 = 64, room for 8 pointers)

Then there's this curiosity, which allocates space for exactly the 5 needed:

```python
>>> sys.getsizeof(list(tuple("1 2 3 4 5".split())))
96
```

(and 96 - 56 = 40, room for the 5 pointers needed)

I don't expect this to be consistent, but allocating space for 12 when only 5 are needed is unreasonable. Even allocating space for 8 is pushing it ;-)
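To make the arithmetic above easier to reproduce, here is a small helper; the name `allocated_slots` is introduced here just for illustration. It assumes a 64-bit build, where the empty-list header is 56 bytes and each item slot is an 8-byte pointer; the printed values are the ones reported above under 3.10.1 and will vary on other versions:

```python
import sys

def allocated_slots(lst):
    # Infer how many item slots a list has reserved: total size minus the
    # fixed header (the size of an empty list), divided by the size of a
    # pointer on a 64-bit build.
    return (sys.getsizeof(lst) - sys.getsizeof([])) // 8

print(allocated_slots("1 2 3 4 5".split()))               # 12 under 3.10.1
print(allocated_slots([ch for ch in "12345"]))            # 8
print(allocated_slots(list(tuple("1 2 3 4 5".split()))))  # 5
```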
The value 12 is hardcoded here: `cpython/Objects/stringlib/split.h`, line 14 (at commit a89c29f).

The comment there says that this is because most .split() calls are on lines of human-readable text, which average about 11 words per line. I don't know if I believe that.
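For what it's worth, that claim is easy to test against a real corpus. A minimal sketch, where `sample.txt` is a placeholder for any human-readable text file:

```python
# Average words per non-empty line; "sample.txt" is a stand-in for
# whatever text you want to measure against the ~11-words claim.
with open("sample.txt", encoding="utf-8") as f:
    lines = [line for line in f if line.strip()]

avg = sum(len(line.split()) for line in lines) / len(lines)
print(f"average words per non-empty line: {avg:.1f}")
```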
Well, that's annoying ;-) In context, the OP was saving a list of 10 million splits. So each overallocation by a single element burned 80 million bytes of RAM, and overallocating by 7 burned 560 million bytes.

Which is unusual. Usually a split result is short-lived, consumed once then thrown away. OTOH, the overwhelming motivation for overallocating at all is to achieve O(1) amortized time after a long _sequence_ of appends, and split results typically aren't appended to at all. split() appears to be using it as a timing micro-optimization for tiny lists instead.

So, like I said, it's annoying ;-) For "small" lists, split() really shouldn't overallocate at all (because, as before, split results are rarely appended to). A compromise could be to save pointers to the first N (12, whatever) instances of the splitting string in a stack ("auto") vector, before any list object (or result string object) is created. If it's out of stuff to do before reaching N, fine, build a result out of exactly what was found. If there's more to do, build a result from the first N, and go on as currently (letting PyList_Append deal with it - overallocation is huge in percentage terms when the list is short, but not so much as the list gets longer).
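A rough Python model of that two-phase idea (the actual change would live in C inside the split implementation; `split_model` and `N` are names introduced here for illustration, and only an explicit separator is handled, not whitespace splitting):

```python
N = 12  # plays the role of the preallocation limit; value is illustrative

def split_model(s, sep):
    # Phase 1: collect up to N pieces before committing to a result size.
    # In C these would be pointers saved in a stack ("auto") buffer.
    pieces = []
    start = 0
    while len(pieces) < N:
        i = s.find(sep, start)
        if i < 0:
            pieces.append(s[start:])
            # Ran out of input before reaching N: build an exactly-sized
            # result (PyList_New(len(pieces)) in C) -- no overallocation.
            return pieces
        pieces.append(s[start:i])
        start = i + len(sep)
    # Phase 2: more than N pieces exist; keep the first N and append the
    # rest, letting normal list growth (PyList_Append in C) amortize any
    # further resizes.
    result = pieces
    while True:
        i = s.find(sep, start)
        if i < 0:
            result.append(s[start:])
            return result
        result.append(s[start:i])
        start = i + len(sep)

print(split_model("1 2 3 4 5", " "))  # ['1', '2', '3', '4', '5']
```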
(Two commits co-authored by Inada Naoki were subsequently linked to this issue.)
Now the size is much reduced. I would like to close the issue until we need to squeeze more.

Feel free to reopen it if needed :)
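A quick way to check the behavior on a post-fix interpreter; the exact size varies by version and platform, so no specific number is claimed here:

```python
import sys

s = "1 2 3 4 5".split()
# Item storage in bytes: total list size minus the empty-list header.
# After the fix this should be well below the 96 bytes (12 slots)
# observed under 3.10.1.
print(sys.getsizeof(s) - sys.getsizeof([]))
```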
Thanks for this! I wasn't paying attention and missed this while it was in progress. I'm happy with the result 😄.