Skip to content

GH-104996: Implement path joining algorithm in pathlib #105484

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

barneygale
Copy link
Contributor

@barneygale barneygale commented Jun 8, 2023

Copy the ntpath.join() algorithm into pathlib and adjust it to remove string concatenation. The resulting drive, root and tail are stored on the path object without creating an intermediate joined path.

Timings in microseconds:

Before After Change
PurePosixPath().root 1.13 1.03 n/c
PurePosixPath("/a").root 1.93 1.99 n/c
PurePosixPath("/a", "b").root 2.81 2.45 15% faster
PurePosixPath("/a", "b", "c").root 3.29 2.86 15% faster
PureWindowsPath().root 1.1 1.02 n/c
PureWindowsPath("/a").root 2.11 2.14 n/c
PureWindowsPath("/a", "b").root 3.62 2.79 30% faster
PureWindowsPath("/a", "b", "c").root 4.26 3.31 30% faster

Copy the `ntpath.join()` algorithm into pathlib and adjust it to remove
string concatenation. The resulting drive, root and tail are stored on the
path object without creating an intermediate joined path.
@barneygale
Copy link
Contributor Author

@eryksun do you think you could you review this please? It's a variant of the ntpath.join() algorithm, but we skip joining parts in the tail, and skip combining the the drive, root and tail.

@eryksun
Copy link
Contributor

eryksun commented Jul 24, 2023

I'm not thrilled about duplicating the code from ntpath.join(). Have you considered adding an ntpath.splitseq(sequence) function that returns (drive, root, path_list)? Then ntpath.join() could be implemented as follows:

def join(path, *paths):
    paths = [os.fspath(path), *paths]
    if isinstance(paths[0], bytes):
        path = b''
        sep = b'\\'
        seps = b'\\/'
        colon = b':'
    else:
        path = ''
        sep = '\\'
        seps = '\\/'
        colon = ':'
    try:
        drive, root, path_list = splitseq(paths)
        for p in path_list:
            if path and path[-1] not in seps:
                path += sep
            path += p
        # If needed, add a separator between a UNC drive and path.
        if path and not root and drive and drive[-1:] != colon:
            return drive + sep + path
        return drive + root + path
    except (TypeError, AttributeError, BytesWarning):
        genericpath._check_arg_types('join', *paths)
        raise

@barneygale
Copy link
Contributor Author

The only problem I see there is that os.path.splitseq() might be quite difficult to explain in documentation!

@eryksun
Copy link
Contributor

eryksun commented Jul 27, 2023

The only problem I see there is that os.path.splitseq() might be quite difficult to explain in documentation!

I think a better function name would help. Maybe mergepaths() is better.

path_list should conserve the slashes in each segment. Empty pathname strings should be retained.

@barneygale
Copy link
Contributor Author

Maybe partialjoin()? After all, it's 80% of the join() method, and probably easiest to explain as "like join(), but returns a 3-tuple"

@barneygale
Copy link
Contributor Author

I'm withdrawing this PR as it bakes elements of pathlib's current normalisation logic into the path parsing/joining, and the overlap precludes user customisation and some other optimisations I have in mind.

@barneygale barneygale closed this Nov 22, 2023
@hugovk hugovk removed the needs backport to 3.12 only security fixes label Nov 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants