virtualenv unzipperEnv -p python3.9  # create virtualenv
source unzipperEnv/bin/activate  # activate virtualenv
pip install google-auth google-cloud-storage
mkdir unzipper && cd unzipper  # create and enter a directory
touch storages.py  # create a script to house our code
code .  # open code editor
credentials.json
import io
from zipfile import ZipFile, is_zipfile

from google.cloud import storage
from google.oauth2 import service_account


# declare unzipping function
def zipextract(zipfilename_with_path):
    # auth config
    SERVICE_ACCOUNT_FILE = 'credentials.json'
    credentials = service_account.Credentials.from_service_account_file(
        SERVICE_ACCOUNT_FILE)

    bucketname = 'your-bucket-id'
    storage_client = storage.Client(credentials=credentials)
    bucket = storage_client.get_bucket(bucketname)

    destination_blob_pathname = zipfilename_with_path
    blob = bucket.blob(destination_blob_pathname)
    # download_as_bytes() replaces the deprecated download_as_string()
    zipbytes = io.BytesIO(blob.download_as_bytes())

    if is_zipfile(zipbytes):
        with ZipFile(zipbytes, 'r') as myzip:
            for contentfilename in myzip.namelist():
                contentfile = myzip.read(contentfilename)

                # unzip pdf files only, leave out if you don't need this.
                if '.pdf' in contentfilename.casefold():
                    output_file = f'./{contentfilename.split("/")[-1]}'
                    with open(output_file, 'wb') as outfile:
                        outfile.write(contentfile)

                    # removesuffix (Python 3.9+) drops only a literal ".zip";
                    # rstrip(".zip") would strip any trailing '.', 'z', 'i', 'p'
                    blob = bucket.blob(
                        f'{zipfilename_with_path.removesuffix(".zip")}/{contentfilename}'
                    )
                    with open(output_file, "rb") as my_pdf:
                        blob.upload_from_file(my_pdf)

                    # make the file publicly accessible
                    blob.make_public()
    print('done running function')


if __name__ == '__main__':
    zipfilename_with_path = input('enter the zipfile path: ')
    zipextract(zipfilename_with_path)
documents/reports/2021/January.zip
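For a path like this one, the extracted files are uploaded under a prefix made from the zip path with its extension dropped. Here is a minimal sketch of that step on its own, assuming Python 3.9+ for str.removesuffix. The helper name extraction_prefix is hypothetical and not part of the script. The last line shows why rstrip(".zip") is risky for this job: it treats its argument as a set of characters to strip, not as a literal suffix.

```python
def extraction_prefix(zip_path: str) -> str:
    """Return the folder-like prefix for files extracted from zip_path."""
    # removesuffix (Python 3.9+) drops the literal ".zip" only if it is present
    return zip_path.removesuffix(".zip")


print(extraction_prefix("documents/reports/2021/January.zip"))
print(extraction_prefix("documents/backup.zip"))

# rstrip strips any trailing '.', 'z', 'i' or 'p' characters,
# so the final "p" of "backup" is lost as well
print("documents/backup.zip".rstrip(".zip"))
```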
credentials.json (service account details). Under the hood, the Google Cloud libraries use the requests module.

is_zipfile checks our byte representation of the zip file to ensure that what we want to unzip is an actual zip file.

with ZipFile(zipbytes, 'r') as myzip:
    for contentfilename in myzip.namelist():
        contentfile = myzip.read(contentfilename)
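To try the check-then-read step without touching a bucket, here is a small self-contained sketch that builds an archive in memory and reads it back the same way. The function name read_zip_members and the sample file names are made up for illustration; the in-memory archive stands in for the bytes downloaded from the blob.

```python
import io
from zipfile import ZipFile, is_zipfile


def read_zip_members(raw: bytes) -> dict:
    """Map each file name in an in-memory zip archive to its bytes."""
    zipbytes = io.BytesIO(raw)
    if not is_zipfile(zipbytes):
        # not a zip archive, nothing to extract
        return {}
    with ZipFile(zipbytes, 'r') as myzip:
        # skip directory entries, whose names end with "/"
        return {
            name: myzip.read(name)
            for name in myzip.namelist()
            if not name.endswith('/')
        }


# Build a tiny archive in memory to stand in for the downloaded blob.
buffer = io.BytesIO()
with ZipFile(buffer, 'w') as archive:
    archive.writestr('reports/january.pdf', b'%PDF-1.4 sample')
    archive.writestr('notes.txt', b'hello')

members = read_zip_members(buffer.getvalue())
print(sorted(members))
print(read_zip_members(b'not a zip file'))
```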
.pdf extension. It's probably not the best method, as the extension may not reflect the file's actual type. This StackOverflow question offers some interesting solutions. In my case, however, I was content with the workaround.

if '.pdf' in contentfilename.casefold():
    output_file = f'./{contentfilename.split("/")[-1]}'
    with open(output_file, 'wb') as outfile:
        outfile.write(contentfile)

    # removesuffix (Python 3.9+) drops only a literal ".zip";
    # rstrip(".zip") would strip any trailing '.', 'z', 'i', 'p'
    blob = bucket.blob(
        f'{zipfilename_with_path.removesuffix(".zip")}/{contentfilename}'
    )
    with open(output_file, "rb") as my_pdf:
        blob.upload_from_file(my_pdf)

    # make the file publicly accessible
    blob.make_public()
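One alternative to matching on the file name is to look at the content itself: PDF files begin with the marker %PDF-. The sketch below shows that idea and is not necessarily one of the solutions from the linked question. The helper name looks_like_pdf is hypothetical, and the check is deliberately strict, since it only accepts the marker at the very start of the data.

```python
PDF_MAGIC = b'%PDF-'


def looks_like_pdf(content: bytes) -> bool:
    """Return True when the bytes start with the PDF header marker."""
    return content.startswith(PDF_MAGIC)


# a real PDF header passes the check
print(looks_like_pdf(b'%PDF-1.7\n...'))
# zip data that was merely named "something.pdf" would not
print(looks_like_pdf(b'PK\x03\x04 renamed.pdf'))
```

In the loop, this could be called on contentfile in place of, or in addition to, the extension test.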
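Stripping the suffix ensures the extracted folder doesn't have the .zip extension in its name. Note that rstrip(".zip") removes any trailing '.', 'z', 'i', or 'p' characters rather than the literal suffix (backup.zip would become backu), so str.removesuffix(".zip"), available from Python 3.9, is the safer call.

Writing to /tmp takes up system memory. In addition, files in the /tmp directory are only available to the app instance that created them. When the instance is deleted, the temporary files are deleted too. This ensures the files we're writing and re-uploading are deleted as soon as possible, since we no longer need them.

# change path here 👇🏽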
output_file = f'/tmp/{contentfilename.split("/")[-1]}'
outfile = open(output_file, 'wb')
outfile.write(contentfile)
outfile.close()
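Since /tmp takes up memory for as long as the files sit there, each file can also be removed explicitly once it has been re-uploaded. Here is one way to sketch that with the standard library. The function with_temp_copy and its upload callback are hypothetical names; the callback stands in for blob.upload_from_file so the example runs without a bucket.

```python
import os
import tempfile


def with_temp_copy(content: bytes, filename: str, upload) -> str:
    """Write content to a temp file, hand it to upload, then delete it."""
    # tempfile.gettempdir() resolves to /tmp on most Linux environments
    output_file = os.path.join(tempfile.gettempdir(), os.path.basename(filename))
    try:
        with open(output_file, 'wb') as outfile:
            outfile.write(content)
        with open(output_file, 'rb') as my_pdf:
            upload(my_pdf)
    finally:
        # remove the file even if the upload raised
        if os.path.exists(output_file):
            os.remove(output_file)
    return output_file


uploaded = []
path = with_temp_copy(b'%PDF-1.4 sample', 'reports/january.pdf',
                      lambda f: uploaded.append(f.read()))
print(os.path.exists(path))
```

The try/finally means a failed upload does not leave a stray file behind in the instance's memory-backed /tmp.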
"""
this line changes and there's no need for auth module imports
storage_client = storage.Client(credentials=credentials)
"""
# no credential requirements
storage_client = storage.Client()