gzip module writes file with bad CRC when saving large files #100260

Closed
@thomasf1

Bug report

When writing a large amount of data (2.5 GB uncompressed, 250 MB compressed) with the gzip module, the CRC that gets written appears to be wrong. Smaller amounts of data work fine. Reading the file back with gzip.open raises gzip.BadGzipFile, and other programs report the file as corrupt. When the CRC check is bypassed, the file decompresses fine.

import ejson
import gzip

# users.json is about 2.5 GB
with open('users.json', 'r', encoding='utf-8') as file:
	contents = ejson.loads(file.read())

# the resulting file is about 250 MB big which seems right and decompresses fine when suppressing the CRC check
with gzip.open('users_compressed.json.gz', 'w') as file:
	file.write(ejson.dumps(contents).encode('utf-8'))

Opening the newly written file, I get the following: gzip.BadGzipFile: CRC check failed.
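To see which value was actually stored, the gzip trailer can be read directly and compared with a CRC computed from the original payload. Per RFC 1952, a gzip member ends with two little-endian 32-bit fields: the CRC-32 of the uncompressed data and its size modulo 2**32. The following is a minimal sketch, assuming a single-member file with no trailing bytes (which is what a single gzip.open(..., 'w') session produces); the helper names read_gzip_trailer and expected_trailer are hypothetical.

```python
import struct
import zlib


def read_gzip_trailer(path):
    """Return (crc32, isize) as stored in the last 8 bytes of a gzip file."""
    with open(path, "rb") as f:
        f.seek(-8, 2)  # 2 == os.SEEK_END: position 8 bytes before the end
        crc32, isize = struct.unpack("<II", f.read(8))
    return crc32, isize


def expected_trailer(payload, chunk_size=1 << 20):
    """Compute (crc32, size mod 2**32) for payload, feeding zlib.crc32 in chunks."""
    crc = 0
    view = memoryview(payload)
    for start in range(0, len(view), chunk_size):
        # Passing the running value continues the checksum across chunks.
        crc = zlib.crc32(view[start:start + chunk_size], crc)
    return crc & 0xFFFFFFFF, len(view) & 0xFFFFFFFF
```

If read_gzip_trailer('users_compressed.json.gz') differs from expected_trailer(...) on the encoded payload only in the first field, the stored CRC is wrong while the size field is intact.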

Writing only half of the data, as in the following, works fine:

import ejson
import gzip

with open('users.json', 'r', encoding='utf-8') as file:
	contents = ejson.loads(file.read())

# Produces a file with about 250 MB that has a bad CRC
with gzip.open('users_compressed.json.gz', 'w') as file:
	file.write(ejson.dumps(dict(list(contents.items()))).encode('utf-8'))

# Produces a file with about 125 MB that opens fine
with gzip.open('users_compressed_1.json.gz', 'w') as file:
	file.write(ejson.dumps(dict(list(contents.items())[len(contents)//2:])).encode('utf-8'))

# Produces a file with about 125 MB that opens fine
with gzip.open('users_compressed_2.json.gz', 'w') as file:
	file.write(ejson.dumps(dict(list(contents.items())[:len(contents)//2])).encode('utf-8'))

Environment

Python 3.10.8 on an M1 Mac (macOS 12.6)

Not sure how to debug it further at this point. Has anyone had similar problems?
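One experiment that might narrow this down is to pass the same payload to the gzip file object in smaller slices instead of one 2.5 GB write call. This is only a sketch for testing a hypothesis (that the size of a single write matters); it is not a confirmed fix. The helper name write_gzip_in_chunks and the 64 MiB default are assumptions for illustration.

```python
import gzip


def write_gzip_in_chunks(path, payload, chunk_size=64 * 1024 * 1024):
    """Write payload (a bytes-like object) to a gzip file in fixed-size slices."""
    view = memoryview(payload)  # slicing a memoryview avoids copying the data
    with gzip.open(path, "wb") as f:
        for start in range(0, len(view), chunk_size):
            f.write(view[start:start + chunk_size])
```

If the file written this way reads back without gzip.BadGzipFile while the single-write version does not, that would point at how large single buffers are handled rather than at the data itself.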

Labels

OS-mac, stdlib (Python modules in the Lib dir), type-bug (An unexpected behavior, bug, or error)

Status

Done
