Whether it's archive formats like tar and zip, specific tools or formats such as gzip and bz2, or even more exotic formats like lzma, Python has it all. With all these options, though, deciding which tool is right for the task at hand might not be so obvious. So, to help you navigate the available options, in this article we will explore all of these modules and learn how to compress, decompress, verify, test and secure archives of all kinds of formats with the help of Python's standard library.
zlib is a library and Python module that provides code for working with the Deflate compression and decompression format, which is used by zip, gzip and many others. So, by using this Python module, you're essentially using a gzip-compatible compression algorithm without the convenient wrapper. More about this library can be found on Wikipedia.
bz2 is a module that provides support for bzip2 compression. This algorithm is generally more effective than the deflate method, but might be slower. It also works only on individual files and therefore can't create archives.
lzma is both the name of the algorithm and of the Python module. It can produce a higher compression ratio than some older methods and is the algorithm behind the xz utility (more specifically LZMA2).
gzip is a utility most of us are familiar with. It's also the name of a Python module. This module uses the already mentioned zlib compression algorithm and serves as an interface similar to the gzip and gunzip utilities.
shutil is a module we generally don't associate with compression and decompression, but it provides utility methods for working with archives and can be a convenient way of producing tar, gztar, zip, bztar or xztar archives.
zipfile - as the name suggests - allows us to work with zip archives in Python. This module provides all the expected methods for creating, reading, writing or appending to ZIP files, as well as classes and objects for easier manipulation of such files.
tarfile - as with zipfile above, you can probably guess that this module is used for working with tar archives. It can read and write gzip, bz2 and lzma files or archives. It also supports other features we know from the tar utility; a list of those is available at the top of the module's docs page.
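Of the modules listed above, shutil is the only one that the rest of this article does not demonstrate, so here is a minimal sketch of its archive helper. The directory layout, the `backup` base name and the `make_archives` helper are made up for illustration; only `shutil.make_archive`, `zipfile.is_zipfile` and `tarfile.is_tarfile` come from the standard library.

```python
import shutil
import tarfile
import tempfile
import zipfile
from pathlib import Path


def make_archives(source_dir, output_dir):
    """Pack source_dir into a .zip and a .tar.gz archive and return both paths."""
    base_name = str(Path(output_dir) / "backup")  # hypothetical archive name
    # make_archive appends the extension that belongs to the chosen format
    zip_path = shutil.make_archive(base_name, "zip", root_dir=str(source_dir))
    tgz_path = shutil.make_archive(base_name, "gztar", root_dir=str(source_dir))
    return zip_path, tgz_path


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    out = Path(tmp) / "out"  # kept separate so archives don't end up inside themselves
    src.mkdir()
    out.mkdir()
    (src / "words1.txt").write_text("Olav\nteakettles\n")

    zip_path, tgz_path = make_archives(src, out)
    zip_ok = zipfile.is_zipfile(zip_path)
    tar_ok = tarfile.is_tarfile(tgz_path)
    archive_names = sorted(Path(p).name for p in (zip_path, tgz_path))
    print(archive_names, zip_ok, tar_ok)
```

Writing the archives to a directory other than the one being packed is a deliberate choice here, since it avoids the archive picking up a partial copy of itself.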
zlib is a fairly low-level library and therefore might not be so commonly used, so let's just look at basic compression and decompression of a whole file at once:

import sys
import zlib

filename_in = "data"
filename_out = "compressed_data"

with open(filename_in, mode="rb") as fin, open(filename_out, mode="wb") as fout:
    data = fin.read()
    compressed_data = zlib.compress(data, zlib.Z_BEST_COMPRESSION)

    # sys.getsizeof reports the size of the Python bytes object, which
    # includes a fixed per-object overhead (33 bytes here) on top of the payload
    print(f"Original size: {sys.getsizeof(data)}")
    # Original size: 1000033
    print(f"Compressed size: {sys.getsizeof(compressed_data)}")
    # Compressed size: 1024

    fout.write(compressed_data)

with open(filename_out, mode="rb") as fin:
    data = fin.read()
    decompressed_data = zlib.decompress(data)
    print(f"Compressed size: {sys.getsizeof(data)}")
    # Compressed size: 1024
    print(f"Decompressed size: {sys.getsizeof(decompressed_data)}")
    # Decompressed size: 1000033

The input file for this example was generated with head -c 1MB </dev/zero > data, which gives us 1MB of zeroes. We open the file and read it into memory, then use the compress function to create the compressed data, which is written to the output file. To demonstrate that we are able to recover the data, we then open the compressed file again and call decompress on its contents. From the print statements we can see that the decompressed size matches the original size.
bz2 can be used in a very similar fashion to zlib above:

import bz2
import os
import sys

filename_in = "data"
filename_out = "compressed_data.bz2"

with open(filename_in, mode="rb") as fin, bz2.open(filename_out, "wb") as fout:
    fout.write(fin.read())

print(f"Uncompressed size: {os.stat(filename_in).st_size}")
# Uncompressed size: 1000000
print(f"Compressed size: {os.stat(filename_out).st_size}")
# Compressed size: 48

with bz2.open(filename_out, "rb") as fin:
    data = fin.read()
    # sys.getsizeof includes the bytes object overhead, hence 1000033 rather than 1000000
    print(f"Decompressed size: {sys.getsizeof(data)}")
    # Decompressed size: 1000033

This time we use os.stat to inspect the size of the files on disk.
Moving on to lzma: to avoid showing the same code over and over again, let's do incremental compression this time:

import lzma
import os

lzc = lzma.LZMACompressor()

# cat /usr/share/dict/words | sort -R | head -c 1MB > data
filename_in = "data"
filename_out = "compressed_data.xz"

with open(filename_in, mode="rb") as fin, open(filename_out, "wb") as fout:
    # Read and compress the input in 1024-byte chunks
    while chunk := fin.read(1024):
        fout.write(lzc.compress(chunk))
    # Flush whatever the compressor still holds in its internal buffer
    fout.write(lzc.flush())

print(f"Uncompressed size: {os.stat(filename_in).st_size}")
# Uncompressed size: 972398
print(f"Compressed size: {os.stat(filename_out).st_size}")
# Compressed size: depends on the input data

with lzma.open(filename_out, "r") as fin:
    words = fin.read().decode("utf-8").split()
    print(words[:5])
    # ['dabbing', 'hauled', "seediness's", 'Iroquoian', 'vibe']

This time the input file contains shuffled words from /usr/share/dict/words, so that we can actually confirm that the decompressed data is identical to the original.
We read the input in chunks and pass each one to LZMACompressor.compress. The compressed chunks are then written to the output file. After the whole file has been read and compressed, we need to call flush to finish the compression process and flush out any remaining data from the compressor.
We can also use gzip for the same tasks:

import gzip
import os
import shutil
import sys

filename_in = "data"
# Despite the name, this is a plain gzip file, not a tar archive
filename_out = "compressed_data.tar.gz"

with open(filename_in, "rb") as fin, gzip.open(filename_out, "wb") as fout:
    # Reads the file by chunks to avoid exhausting memory
    shutil.copyfileobj(fin, fout)

print(f"Uncompressed size: {os.stat(filename_in).st_size}")
# Uncompressed size: 1000000
print(f"Compressed size: {os.stat(filename_out).st_size}")
# Compressed size: 1023

with gzip.open(filename_out, "rb") as fin:
    data = fin.read()
    # sys.getsizeof includes the bytes object overhead, hence 1000033 rather than 1000000
    print(f"Decompressed size: {sys.getsizeof(data)}")
    # Decompressed size: 1000033

In this example we combine gzip and shutil. It might seem like we did the same bulk compression as with zlib or bz2 earlier, but thanks to shutil.copyfileobj we get chunked, incremental compression without having to loop over the data like we did with lzma.
One nice feature of the gzip module is that it also provides a command-line interface, and I'm not talking about the Linux gzip and gunzip but about the Python integration:

python3 -m gzip -h
usage: gzip.py [-h] [--fast | --best | -d] [file [file ...]]
...

ls -l data*
-rw-rw-r-- 1 martin martin 1000000 aug 22 18:48 data

# Use fast compression on file "data"
python3 -m gzip --fast data

# File named "data.gz" was generated:
ls -l data*
-rw-rw-r-- 1 martin martin 1000000 aug 22 18:48 data
-rw-rw-r-- 1 martin martin 1008 aug 22 20:50 data.gz
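
The same speed versus size trade-off is available from Python code through the compresslevel argument. The sketch below assumes that --fast and --best correspond to levels 1 and 9, and the sample payload is made up for illustration. No expected sizes are shown because they depend on the input.

```python
import gzip

# Hypothetical sample payload; any bytes object works here
payload = b"the quick brown fox jumps over the lazy dog\n" * 1000

# Assumption: the CLI flags --fast and --best map to compresslevel 1 and 9
fast = gzip.compress(payload, compresslevel=1)
best = gzip.compress(payload, compresslevel=9)

print(f"Original: {len(payload)} bytes")
print(f"Level 1:  {len(fast)} bytes")
print(f"Level 9:  {len(best)} bytes")

# Both levels produce standard gzip data that decompresses to the same bytes
roundtrip_ok = gzip.decompress(fast) == payload and gzip.decompress(best) == payload
print(f"Round trip OK: {roundtrip_ok}")
```

Using len() here rather than sys.getsizeof reports the payload size only, without the overhead of the Python bytes object.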

If you would rather work with zip or tar, or you need archives in formats provided by one of these, then this section will show you how to use them. Apart from the basic compression/decompression operations, these two modules also include some other utility methods, such as testing checksums, using passwords or listing files in archives. So, let's dive in and see all these in action.

import zipfile

# shuf -n5 /usr/share/dict/words > words.txt
files = ["words1.txt", "words2.txt", "words3.txt", "words4.txt", "words5.txt"]
archive = "archive.zip"
password = b"verysecret"

with zipfile.ZipFile(archive, "w") as zf:
    for file in files:
        zf.write(file)

    # Sets the default password used when reading encrypted members;
    # the zipfile module itself cannot create encrypted archives
    zf.setpassword(password)

with zipfile.ZipFile(archive, "r") as zf:
    crc_test = zf.testzip()
    if crc_test is not None:
        print(f"Bad CRC or file headers: {crc_test}")

    info = zf.infolist()  # also zf.namelist()
    print(info)  # See all attributes at https://docs.python.org/3/library/zipfile.html#zipinfo-objects
    # [ <ZipInfo filename='words1.txt' filemode='-rw-r--r--' file_size=37>,
    #   <ZipInfo filename='words2.txt' filemode='-rw-r--r--' file_size=47>,
    #   ... ]

    file = info[0]
    with zf.open(file) as f:
        print(f.read().decode())
        # Olav
        # teakettles
        # ...

    zf.extract(file, "/tmp", pwd=password)  # also zf.extractall()
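
The snippet above shows the basic usage of the zipfile module. We start by creating a ZIP archive using the ZipFile context manager in "write" (w) mode and then add the files to this archive. You will notice that we didn't actually need to open the files that we're adding; all we needed to do was call write, passing in the file name. After adding all the files, we also call the setpassword method. Note that this only sets the default password used when reading encrypted members: the zipfile module can decrypt encrypted archives but cannot create them.
Next, we reopen the archive in "read" mode, test its integrity with testzip and list its contents with infolist. Here we just print the ZipInfo objects, but you could also inspect their attributes to get CRC, size, compression type, etc.
Finally, we read one of the files and extract it to a directory of our choice (here /tmp/).
To add files to an existing archive, open it in "append" mode ("a"):

with zipfile.ZipFile(archive, "a") as zf:
    zf.write("words6.txt")
    print(zf.namelist())
    # ['words1.txt', 'words2.txt', 'words3.txt', 'words4.txt', 'words5.txt', 'words6.txt']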
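
The ZipInfo attributes mentioned above (CRC, size, compression type) can be inspected as in the following sketch. It builds a small in-memory archive with made-up content so that it runs without any files on disk. It also passes compression=zipfile.ZIP_DEFLATED explicitly, because ZipFile stores members uncompressed (ZIP_STORED) unless told otherwise.

```python
import io
import zipfile

# Hypothetical in-memory archive and content, used only for illustration
buffer = io.BytesIO()
content = b"teakettles\n" * 500

# ZIP_DEFLATED enables zlib-based compression; the default is ZIP_STORED
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("words1.txt", content)

with zipfile.ZipFile(buffer, "r") as zf:
    bad_member = zf.testzip()  # None when every member passes its CRC check
    info = zf.getinfo("words1.txt")
    print(f"name={info.filename} crc={info.CRC:#010x}")
    print(f"size={info.file_size} compressed={info.compress_size}")
    print(f"deflated={info.compress_type == zipfile.ZIP_DEFLATED}")
```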

Just like the gzip module, Python's zipfile and tarfile also provide a CLI. To perform basic archiving and extracting, use the following:

python3 -m zipfile -c arch.zip words1.txt words2.txt  # Create
python3 -m zipfile -t arch.zip  # Test
Done testing

python3 -m zipfile -e arch.zip /tmp  # Extract
ls /tmp/words*
/tmp/words1.txt /tmp/words2.txt
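
The tarfile CLI accepts similar flags (-c to create, -l to list, -t to test, -e to extract). As a rough sketch, here it is driven from Python with subprocess so that the example stays self-contained. The file name and its content are made up for illustration, and the exact formatting of the listing may differ between Python versions.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample file, created only for this demonstration
    Path(tmp, "words1.txt").write_text("Olav\nteakettles\n")

    # Create a gzip-compressed tar archive; compression is inferred from the extension
    subprocess.run(
        [sys.executable, "-m", "tarfile", "-c", "arch.tar.gz", "words1.txt"],
        cwd=tmp,
        check=True,
    )

    # List the archive contents and capture the output
    listing = subprocess.run(
        [sys.executable, "-m", "tarfile", "-l", "arch.tar.gz"],
        cwd=tmp,
        check=True,
        capture_output=True,
        text=True,
    ).stdout

print(listing)
```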

Next, let's look at the tarfile module. This module is similar to zipfile, but also implements some extra features:

import tarfile

files = ["words1.txt", "words2.txt", "words3.txt", "words4.txt"]
archive = "archive.tar.gz"

with tarfile.open(archive, "w:gz") as tar:
    for file in files:
        tar.add(file)  # can also be dir (added recursively), symlink, etc

    print(f"archive contains: {tar.getmembers()}")
    # [<TarInfo 'words1.txt' at 0x7f71ed74f8e0>,
    #  <TarInfo 'words2.txt' at 0x7f71ed74f9a8>
    #  ... ]

    info = tar.gettarinfo("words1.txt")  # Other Linux attributes - https://docs.python.org/3/library/tarfile.html#tarinfo-objects
    print(f"{tar.name} contains {info.name} with permissions {oct(info.mode)[-3:]}, size: {info.size} and owner: {info.uid}:{info.gid}")
    # .../archive.tar contains words1.txt with permissions 644, size: 37 and owner: 500:500

    def change_permissions(tarinfo):
        tarinfo.mode = 0o100600  # -rw-------.
        return tarinfo

    tar.add("words5.txt", filter=change_permissions)

    tar.list()
    # -rw-r--r-- martin/martin 37 2021-08-23 09:01:56 words1.txt
    # -rw-r--r-- martin/martin 47 2021-08-23 09:02:06 words2.txt
    # ...
    # -rw------- martin/martin 42 2021-08-23 09:02:22 words5.txt

We begin by opening the archive in "w:gz" mode, which specifies that we want to use GZ compression. After that we add all our files to the archive. With the tarfile module we can also pass in, for example, symlinks or whole directories, which are added recursively.
To list the contents of the archive we can use the getmembers method. To get insight into individual files we can use gettarinfo, which provides all the Linux file attributes.
tarfile provides one cool feature that we haven't seen with the other modules: the ability to modify the attributes of files as they are being added to the archive. In the above snippet we change the permissions of a file by supplying the filter argument, which modifies TarInfo.mode. This value has to be provided as an octal number; here 0o100600 sets the permissions to 0600, or -rw-------.
To get a complete overview of the archive after this change, we can use the list method, which gives us output similar to ls -l.
The last thing to do with a tar archive is to open it and extract it. To do this, we open it in "r:gz" mode, retrieve an info object (member) using the file name, check whether it really is a file and extract it to the desired location:

with tarfile.open(archive, "r:gz") as tar:
    member = tar.getmember("words3.txt")
    if member.isfile():
        tar.extract(member, "/tmp/")
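
When the archive comes from an untrusted source, extraction deserves a little more care. The following is a minimal sketch, not a complete hardening recipe. It builds a small in-memory archive with made-up content, reads a member without writing it to disk via extractfile, and uses the "data" extraction filter when the running Python provides it. The hasattr check and the fallback branch are assumptions for older interpreters.

```python
import io
import tarfile
import tempfile
from pathlib import Path

# Build a small hypothetical tar.gz in memory so the sketch is self-contained
content = b"Olav\nteakettles\n"
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    info = tarfile.TarInfo("words3.txt")
    info.size = len(content)
    tar.addfile(info, io.BytesIO(content))
buffer.seek(0)

with tempfile.TemporaryDirectory() as dest, tarfile.open(fileobj=buffer, mode="r:gz") as tar:
    member = tar.getmember("words3.txt")

    # Read the member straight from the archive, without touching the disk
    in_memory = tar.extractfile(member).read()

    # Newer Python versions offer extraction filters; "data" rejects
    # dangerous members such as absolute paths or links pointing outside dest
    if hasattr(tarfile, "data_filter"):
        tar.extract(member, dest, filter="data")
    else:
        # Fallback for older interpreters: no filtering is applied here
        tar.extract(member, dest)

    on_disk = Path(dest, "words3.txt").read_bytes()

print(in_memory == content, on_disk == content)
```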

In general, I would recommend using general-purpose modules like zipfile or tarfile and resorting to the ones like lzma only if you really have to.