
tokenize in 3.12 makes copies of each line, 3.11 does not #119654

Closed as not planned
@nedbat

Bug report

Bug description:

The tokenize module creates TokenInfo objects with a .line attribute. In Python 3.11, each token on a line used the same string object for .line. In 3.12, each token has a new copy of the same string.

This is part of a memory issue reported against coverage.py: nedbat/coveragepy#1791

# tok.py

import io
import sys
import tokenize

print(f"{sys.version = }")

text = "lorem ipsum quia dolor sit amet consectetur adipisci velit"
readline = io.StringIO(text).readline
toks = list(tokenize.generate_tokens(readline))

print(f"{toks[0].line = }")
print(f"{(toks[0].line == toks[1].line) = }")
print(f"{(toks[0].line is toks[1].line) = }")

3.11 re-uses string objects:

% python3.11 /tmp/tok.py
sys.version = '3.11.9 (main, Apr  8 2024, 14:01:56) [Clang 15.0.0 (clang-1500.3.9.4)]'
toks[0].line = 'lorem ipsum quia dolor sit amet consectetur adipisci velit'
(toks[0].line == toks[1].line) = True
(toks[0].line is toks[1].line) = True

3.12 (and above) makes new string objects:

% python3.12 /tmp/tok.py
sys.version = '3.12.3 (main, Apr  9 2024, 15:45:14) [Clang 15.0.0 (clang-1500.3.9.4)]'
toks[0].line = 'lorem ipsum quia dolor sit amet consectetur adipisci velit'
(toks[0].line == toks[1].line) = True
(toks[0].line is toks[1].line) = False
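
As a possible mitigation on the consumer side, the copies can be collapsed again after tokenizing. The following is only a sketch, not something tokenize provides: `generate_tokens_shared_lines` and `line_bytes` are hypothetical helper names, and the approach assumes that tokens from the same physical line arrive consecutively with equal `.line` values.

```python
import io
import sys
import tokenize


def generate_tokens_shared_lines(readline):
    """Yield tokens, reusing one str object for consecutive equal .line values.

    Hypothetical helper: not part of the tokenize module.
    """
    last_line = None
    for tok in tokenize.generate_tokens(readline):
        if tok.line == last_line:
            # Same text as the previous token's line: point at the shared object.
            tok = tok._replace(line=last_line)
        else:
            last_line = tok.line
        yield tok


def line_bytes(toks):
    """Approximate bytes held by the distinct .line string objects."""
    unique = {id(t.line): t.line for t in toks}
    return sum(sys.getsizeof(s) for s in unique.values())


text = "lorem ipsum quia dolor sit amet consectetur adipisci velit"
raw = list(tokenize.generate_tokens(io.StringIO(text).readline))
shared = list(generate_tokens_shared_lines(io.StringIO(text).readline))

names = [t for t in shared if t.type == tokenize.NAME]
print(f"{all(t.line is names[0].line for t in names) = }")
print(f"{line_bytes(raw) = }")
print(f"{line_bytes(shared) = }")
```

With the wrapper, every NAME token on the line refers to the same string object on both 3.11 and 3.12. The byte counts are rough, since `sys.getsizeof` only measures the string objects themselves, but on 3.12 the raw figure should grow with the number of tokens per line while the shared figure should not.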

CPython versions tested on:

3.11, 3.12, 3.13, CPython main branch

Operating systems tested on:

macOS

Labels: stdlib, type-bug