All The Ways to Compress and Archive Files in Python

Python's standard library provides modules for almost any task you can think of, and working with compressed files is no exception. Whether it's basics like tar and zip, specific tools or formats such as gzip and bz2, or more exotic formats like lzma, Python has it all. With all these options, choosing the right tool for the task at hand might not be obvious, though. So, to help you navigate the available options, in this article we will explore all of these modules and learn how to compress, decompress, verify and test archives in all kinds of formats using only Python's standard library.

All The Formats

As mentioned above, Python has a library for (almost) every tool and format imaginable. So, let's first take a look at each of them and see why you might want to use it:

zlib is a library and Python module for working with the Deflate compression format, which is used by zip, gzip and many others. So, by using this Python module, you're essentially using a gzip-compatible compression algorithm without the convenient wrapper. More about this library can be found on Wikipedia.

bz2 is a module that provides support for bzip2 compression. This algorithm generally compresses better than Deflate, but it might be slower. It also works only on individual files and therefore can't create archives.

lzma is both the name of the algorithm and of the Python module. It can produce a higher compression ratio than some older methods and is the algorithm behind the xz utility (more specifically LZMA2).

gzip is a utility most of us are familiar with. It's also the name of a Python module. This module uses the already mentioned zlib library for compression and serves as an interface similar to the gzip and gunzip utilities.

shutil is a module we generally don't associate with compression and decompression, but it provides utility functions for working with archives and can be a convenient way of producing tar, gztar, zip, bztar or xztar archives.

zipfile - as the name suggests - allows us to work with zip archives in Python. This module provides all the expected methods for creating, reading, writing or appending to ZIP files as well as classes and objects for easier manipulation of such files.

tarfile - as with zipfile above, you can probably guess that this module is used for working with tar archives. It can read and write archives compressed with gzip, bz2 and lzma. It also supports other features we know from the tar utility - a list of those is available at the top of the tarfile docs page.
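
Since shutil is the only module in this list that doesn't get its own walkthrough later, here is a minimal sketch of how it can be used. The directory and file names (project, notes.txt, backup, restored) are made up for illustration, and everything happens in a temporary directory so the snippet doesn't touch existing files:

```python
import shutil
import tempfile
from pathlib import Path

# Work in a throwaway directory so the example does not depend on existing files.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "project"          # hypothetical directory to archive
    src.mkdir()
    (src / "notes.txt").write_text("hello archive\n")

    # base_name is the archive path WITHOUT an extension; the format adds the suffix.
    archive_path = shutil.make_archive(
        base_name=str(Path(tmp) / "backup"),
        format="gztar",                  # also: "zip", "tar", "bztar", "xztar"
        root_dir=str(src),
    )
    archive_name = Path(archive_path).name

    # unpack_archive infers the format from the file extension.
    dest = Path(tmp) / "restored"
    shutil.unpack_archive(archive_path, str(dest))
    restored_text = (dest / "notes.txt").read_text()

print(archive_name)
print(restored_text)
```

Under the hood make_archive relies on the zipfile and tarfile modules, so it is mostly a shortcut for the common "pack this whole directory" case.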

Compress & Decompress

We've got plenty of libraries to choose from. Some of them are more basic, some come with a lot of extra features, but what they all have in common is that they (obviously) include functions for compression. So, let's see how we can perform these basic operations with each of them:

First up, zlib. This is a fairly low-level library and therefore might not be so commonly used, so let's just look at the basic compression/decompression of a whole file at once:

import zlib

filename_in = "data"
filename_out = "compressed_data"

with open(filename_in, mode="rb") as fin, open(filename_out, mode="wb") as fout:
    data = fin.read()
    compressed_data = zlib.compress(data, zlib.Z_BEST_COMPRESSION)
    # len() reports the payload size; sys.getsizeof() would also count the bytes object overhead
    print(f"Original size: {len(data)}")
    # Original size: 1000000
    print(f"Compressed size: {len(compressed_data)}")
    # Compressed size: 991

    fout.write(compressed_data)

with open(filename_out, mode="rb") as fin:
    data = fin.read()
    decompressed_data = zlib.decompress(data)
    print(f"Compressed size: {len(data)}")
    # Compressed size: 991
    print(f"Decompressed size: {len(decompressed_data)}")
    # Decompressed size: 1000000

In the above code we use an input file that was generated with head -c 1MB </dev/zero > data, which gives us 1MB of zeroes. We open and read this file into memory and then use the compress function to create the compressed data. This data is then written into the output file. To demonstrate that we are able to recover the data, we then open the compressed file again and use the decompress function on it. From the print statements we can see that the compressed size is the same before and after the trip through the file, and that the decompressed size matches the original size.
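
Reading the whole file into memory is fine for 1MB, but zlib can also work on a stream of chunks. The following is only a sketch of that idea: the compress_stream helper and the generated input chunks are hypothetical stand-ins for reading a real file piece by piece.

```python
import zlib

def compress_stream(chunks, level=zlib.Z_BEST_COMPRESSION):
    """Yield compressed pieces for an iterable of bytes chunks (hypothetical helper)."""
    compressor = zlib.compressobj(level)
    for chunk in chunks:
        piece = compressor.compress(chunk)
        if piece:  # the compressor may buffer input and return b""
            yield piece
    yield compressor.flush()  # emit whatever is still buffered

# Stand-in for a file read in pieces: 1000 chunks of 1000 zero bytes.
chunks = (b"\x00" * 1000 for _ in range(1000))
compressed = b"".join(compress_stream(chunks))

# The decompression side has a matching streaming object.
decompressor = zlib.decompressobj()
restored = decompressor.decompress(compressed) + decompressor.flush()

print(len(compressed), len(restored))
```

The output of compressobj is a regular zlib stream, so zlib.decompress can read it in one go as well.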

The next format and library you can use is bz2. It can be used in a very similar fashion to zlib above:

import bz2, os

filename_in = "data"
filename_out = "compressed_data.bz2"

with open(filename_in, mode="rb") as fin, bz2.open(filename_out, "wb") as fout:
    fout.write(fin.read())

print(f"Uncompressed size: {os.stat(filename_in).st_size}")
# Uncompressed size: 1000000
print(f"Compressed size: {os.stat(filename_out).st_size}")
# Compressed size: 48

with bz2.open(filename_out, "rb") as fin:
    data = fin.read()
    # len() reports the payload size without the bytes object overhead
    print(f"Decompressed size: {len(data)}")
    # Decompressed size: 1000000

Unsurprisingly, the interface of these modules is pretty much identical, so to show something different, in the above example we reduced the compression step to pretty much a single line and used os.stat to inspect the size of the files.

The last of these low-level modules is lzma, and to avoid showing the same code over and over again, let's do incremental compression this time:
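
Because zlib, bz2 and lzma all expose one-shot compress and decompress functions, it's easy to try them side by side on your own data. The sketch below uses a made-up, fairly repetitive text sample, so treat the numbers it prints as illustrative only; real files can rank the algorithms differently.

```python
import bz2
import lzma
import zlib

# Hypothetical sample: mildly repetitive text; real files will behave differently.
data = b"".join(
    f"line {i % 100}: the quick brown fox\n".encode() for i in range(20_000)
)

sizes = {}
for name, module in (("zlib", zlib), ("bz2", bz2), ("lzma", lzma)):
    blob = module.compress(data)
    assert module.decompress(blob) == data  # the round trip must be lossless
    sizes[name] = len(blob)

print(f"original: {len(data)} bytes")
for name, size in sizes.items():
    print(f"{name:>4}: {size} bytes ({size / len(data):.2%} of original)")
```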

import lzma, os
lzc = lzma.LZMACompressor()

# cat /usr/share/dict/words | sort -R | head -c 1MB > data
filename_in = "data"
filename_out = "compressed_data.xz"

with open(filename_in, mode="rb") as fin, open(filename_out, "wb") as fout:
    # Read 1024 bytes at a time until read() returns b"" at the end of the file
    while chunk := fin.read(1024):
        compressed_chunk = lzc.compress(chunk)
        fout.write(compressed_chunk)
    fout.write(lzc.flush())

print(f"Uncompressed size: {os.stat(filename_in).st_size}")
# Uncompressed size: 972398
print(f"Compressed size: {os.stat(filename_out).st_size}")
# Smaller than the input; the exact size depends on the words in your file

with lzma.open(filename_out, "r") as fin:
    words = fin.read().decode("utf-8").split()
    print(words[:5])
    # ['dabbing', 'hauled', "seediness's", 'Iroquoian', 'vibe']

We start by creating an input file consisting of a bunch of words extracted from the dictionary provided in /usr/share/dict/words. This is so that we can actually confirm that the decompressed data is identical to the original.

We then open the input and output files as in the previous examples. This time around, however, we iterate over the data in 1024-byte chunks and compress them using LZMACompressor.compress. These chunks are then written into the output file. After the whole file is read and compressed, we need to call flush to finish the compression process and flush out any remaining data from the compressor.

To confirm that this worked, we open and decompress the file the usual way and print the first few words from the file.
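
Printing a few words is a quick sanity check; a stricter one is to compare checksums of the original and the decompressed data. The sketch below assumes nothing about your files: it writes a small made-up input into a temporary directory, and sha256_of and read_chunks are hypothetical helper names.

```python
import hashlib
import lzma
import os
import tempfile

def read_chunks(fileobj, size=1024):
    """Yield successive chunks from a binary file object."""
    while chunk := fileobj.read(size):
        yield chunk

def sha256_of(chunks):
    """Hash an iterable of bytes chunks without holding them all in memory."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "data")
    packed = os.path.join(tmp, "data.xz")

    # Made-up input standing in for the dictionary words.
    with open(original, "wb") as f:
        f.write(b"alpha\nbeta\ngamma\n" * 5000)

    with open(original, "rb") as fin, lzma.open(packed, "wb") as fout:
        for chunk in read_chunks(fin):
            fout.write(chunk)

    with open(original, "rb") as f:
        original_hash = sha256_of(read_chunks(f))
    # lzma.open decompresses transparently, so this hashes the decompressed stream.
    with lzma.open(packed, "rb") as f:
        restored_hash = sha256_of(read_chunks(f))

print(original_hash == restored_hash)
```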

Moving on to higher-level modules, let's now use gzip for the same tasks:

import os, shutil, gzip

filename_in = "data"
filename_out = "compressed_data.gz"  # a plain gzip file, not a tar archive

with open(filename_in, "rb") as fin, gzip.open(filename_out, "wb") as fout:
    # Reads the file by chunks to avoid exhausting memory
    shutil.copyfileobj(fin, fout)

print(f"Uncompressed size: {os.stat(filename_in).st_size}")
# Uncompressed size: 1000000
print(f"Compressed size: {os.stat(filename_out).st_size}")
# Compressed size: roughly 1 KB

with gzip.open(filename_out, "rb") as fin:
    data = fin.read()
    # len() reports the payload size without the bytes object overhead
    print(f"Decompressed size: {len(data)}")
    # Decompressed size: 1000000

In this example we combined both gzip and shutil. It might seem like we did the same bulk compression as with zlib or bz2 earlier, but thanks to shutil.copyfileobj we get chunked, incremental compression without having to loop over the data like we did with lzma.

One advantage of the gzip module is that it also provides a command-line interface, and I'm not talking about the Linux gzip and gunzip but about the Python integration:
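
For data that already lives in memory, the gzip module also has one-shot functions, and because it is built on zlib, its output can be read by zlib too. Here is a small sketch with a made-up payload; the one non-obvious bit is the wbits value, where adding 16 tells zlib to expect a gzip header and trailer.

```python
import gzip
import zlib

payload = b"\x00" * 100_000  # hypothetical in-memory payload

# compresslevel trades speed for size: 1 is fastest, 9 compresses best.
fast = gzip.compress(payload, compresslevel=1)
best = gzip.compress(payload, compresslevel=9)

via_gzip = gzip.decompress(fast)
# Adding 16 to wbits makes zlib expect the gzip container instead of a zlib one.
via_zlib = zlib.decompress(best, wbits=zlib.MAX_WBITS | 16)

print(len(fast), len(best), via_gzip == via_zlib == payload)
```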

python3 -m gzip -h
usage: gzip.py [-h] [--fast | --best | -d] [file [file ...]]
...

ls -l data*
-rw-rw-r-- 1 martin martin 1000000 aug 22 18:48 data

# Use fast compression on file "data"
python3 -m gzip --fast data

# File named "data.gz" was generated:
ls -l data*
-rw-rw-r-- 1 martin martin 1000000 aug 22 18:48 data
-rw-rw-r-- 1 martin martin    1008 aug 22 20:50 data.gz

Bring The Bigger Hammer

If you're more comfortable with either zip or tar, or you need archives in formats provided by one of these, then this section will show you how to use them. Apart from the basic compression/decompression operations, these two modules also include some other utility methods, such as testing checksums, reading password-protected archives or listing files in archives. So, let's dive in and see all of these in action.

import zipfile

# Each input file was created with: shuf -n5 /usr/share/dict/words > wordsN.txt
files = ["words1.txt", "words2.txt", "words3.txt", "words4.txt", "words5.txt"]
archive = "archive.zip"
password = b"verysecret"

with zipfile.ZipFile(archive, "w") as zf:
    for file in files:
        zf.write(file)

with zipfile.ZipFile(archive, "r") as zf:
    # setpassword only sets the default password for READING encrypted members;
    # zipfile cannot create encrypted archives, so this archive is not encrypted
    zf.setpassword(password)

    crc_test = zf.testzip()
    if crc_test is not None:
        print(f"Bad CRC or file headers: {crc_test}")

    info = zf.infolist()  # also zf.namelist()
    print(info)  # See all attributes at https://docs.python.org/3/library/zipfile.html#zipinfo-objects
    # [ <ZipInfo filename='words1.txt' filemode='-rw-r--r--' file_size=37>,
    #   <ZipInfo filename='words2.txt' filemode='-rw-r--r--' file_size=47>,
    #   ... ]

    file = info[0]
    with zf.open(file) as f:
        print(f.read().decode())
        # Olav
        # teakettles
        # ...

    # pwd only matters for encrypted members; also zf.extractall()
    zf.extract(file, "/tmp", pwd=password)

This is a fairly long piece of code, but it covers all the important features of the zipfile module. In this snippet we start by creating a ZIP archive using the ZipFile context manager in "write" (w) mode and then add the files to this archive. You will notice that we didn't actually need to open the files that we're adding - all we needed to do was call write, passing in the file name. Note that ZipFile stores files uncompressed by default (ZIP_STORED); pass compression=zipfile.ZIP_DEFLATED if you want them compressed. One thing to be aware of is that the setpassword method does not encrypt anything: the zipfile module can decrypt password-protected archives, but it cannot create them, so setpassword only sets the default password used when reading encrypted members.

Next, to demonstrate that it worked, we open the archive. Before reading any files we check CRCs and file headers; afterwards we retrieve information about all files present in the archive. In this example we just print the list of ZipInfo objects, but you could also inspect their attributes to get the CRC, size, compression type, etc.

After checking all the files we open and read one of them. We see that it has the expected content, so we can go ahead and extract it to the directory specified by path (/tmp/).
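
To make the ZipInfo attributes mentioned above more concrete, here is a small sketch that builds an archive in memory and reads a few of them back. The member content is made up, and compression=zipfile.ZIP_DEFLATED is passed explicitly so that the compressed and original sizes actually differ.

```python
import io
import zipfile

buffer = io.BytesIO()  # in-memory archive, nothing is written to disk

with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("words1.txt", "teakettles\n" * 200)  # made-up content

with zipfile.ZipFile(buffer) as zf:
    bad_member = zf.testzip()  # None means every CRC and header checked out
    info = zf.getinfo("words1.txt")
    summary = {
        "name": info.filename,
        "crc": f"{info.CRC:#010x}",
        "stored": info.compress_size,    # bytes occupied inside the archive
        "original": info.file_size,      # bytes after decompression
        "deflated": info.compress_type == zipfile.ZIP_DEFLATED,
    }

print(bad_member, summary)
```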

In addition to creating and reading archives/files, ZIP allows us to also append files to existing archives. To do this, all we need to change is the access mode to "append" ("a"):

with zipfile.ZipFile(archive, "a") as zf:
    zf.write("words6.txt")
    print(zf.namelist())
    # ['words1.txt', 'words2.txt', 'words3.txt', 'words4.txt', 'words5.txt', 'words6.txt']

Same as the gzip module, Python's zipfile and tarfile also provide a CLI. To perform basic archiving and extracting, use the following:

python3 -m zipfile -c arch.zip words1.txt words2.txt  # Create
python3 -m zipfile -t arch.zip  # Test
Done testing

python3 -m zipfile -e arch.zip /tmp  # Extract
ls /tmp/words*
/tmp/words1.txt  /tmp/words2.txt

Last but not least, the tarfile module. This module is similar to zipfile, but also implements some extra features:

import tarfile

files = ["words1.txt", "words2.txt", "words3.txt", "words4.txt"]
archive = "archive.tar.gz"

with tarfile.open(archive, "w:gz") as tar:
    for file in files:
        tar.add(file)  # can also be dir (added recursively), symlink, etc

    print(f"archive contains: {tar.getmembers()}")
    # [<TarInfo 'words1.txt' at 0x7f71ed74f8e0>,
    #  <TarInfo 'words2.txt' at 0x7f71ed74f9a8>
    #  ... ]

    info = tar.gettarinfo("words1.txt")  # Built from the file on disk; other Linux attributes - https://docs.python.org/3/library/tarfile.html#tarinfo-objects
    print(f"{tar.name} contains {info.name} with permissions {oct(info.mode)[-3:]}, size: {info.size} and owner: {info.uid}:{info.gid}")
    # .../archive.tar.gz contains words1.txt with permissions 644, size: 37 and owner: 500:500

    def change_permissions(tarinfo):
        tarinfo.mode = 0o100600  # -rw-------.
        return tarinfo

    tar.add("words5.txt", filter=change_permissions)

    tar.list()
    # -rw-r--r-- martin/martin   37 2021-08-23 09:01:56 words1.txt
    # -rw-r--r-- martin/martin   47 2021-08-23 09:02:06 words2.txt
    # ...
    # -rw------- martin/martin   42 2021-08-23 09:02:22 words5.txt

We start with the basic creation of an archive, but here we use access mode "w:gz", which specifies that we want gzip compression. After that we add all our files to the archive. With the tarfile module we can also pass in, for example, symlinks or whole directories, which are then added recursively.

Next, to confirm that all the files are really there, we use the getmembers method. To get insight into individual files we can use gettarinfo, which builds a TarInfo object from a file on disk and exposes all its Linux file attributes.

tarfile provides one cool feature that we haven't seen with other modules, and that is the ability to modify attributes of files when they're being added to the archive. In the above snippet we change the permissions of a file by supplying a filter argument which modifies TarInfo.mode. The mode is an integer that is conventionally written in octal; here 0o100600 sets the permissions to 0600 or -rw-------.

To get a complete overview of the files after this change we can run the list method, which gives us output similar to ls -l.

The final thing to do with a tar archive is to open it and extract it. To do this, we open it with "r:gz" mode, retrieve an info object (member) using the file name, check whether it really is a file and extract it to the desired location:

with tarfile.open(archive, "r:gz") as tar:
    member = tar.getmember("words3.txt")
    if member.isfile():
        tar.extract(member, "/tmp/")
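
If you only need the content of a member and not a file on disk, tarfile can also hand you a file-like object instead of extracting. The sketch below builds a tiny gzip-compressed tar in memory (the member name and its content are made up) and reads the member back with extractfile:

```python
import io
import tarfile

buffer = io.BytesIO()  # in-memory archive, nothing is written to disk

with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    content = b"Olav\nteakettles\n"  # made-up file content
    info = tarfile.TarInfo(name="words3.txt")
    info.size = len(content)         # the size must be set before adding
    tar.addfile(info, io.BytesIO(content))

buffer.seek(0)  # rewind so the archive can be read from the start
with tarfile.open(fileobj=buffer, mode="r:gz") as tar:
    member = tar.getmember("words3.txt")
    extracted = tar.extractfile(member)  # None for members that are not regular files
    words = extracted.read().decode().split() if extracted else []

print(words)
# ['Olav', 'teakettles']
```

This also avoids writing anything to the filesystem, which is handy when you don't fully trust where an archive came from.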

Conclusion

As you can see, Python's modules provide a lot of options: both low- and high-level, both specific and generic, with both simple and more complicated interfaces. What you choose depends on your use case and requirements, but in general I would recommend going with the general-purpose modules, such as zipfile or tarfile, and resorting to ones like lzma only if you really have to.

I tried to cover all the common use cases of these modules to give you a complete overview, but there are obviously more functions, objects, attributes, etc. in each of these modules, so be sure to check out the docs linked in the first section to find some other useful bits and pieces.