Release v2.0.1 (What’s new?).
Welcome to s3pathlib Documentation#
s3pathlib is a Python package that offers an object-oriented programming (OOP) interface to work with AWS S3 objects and directories. Its API is designed to be similar to the standard library pathlib and is user-friendly. The package also supports versioning in AWS S3.
Quick Start#
Import the library, declare an S3Path object
# import
>>> from s3pathlib import S3Path
# construct from string, auto join parts
>>> p = S3Path("bucket", "folder", "file.txt")
# construct from S3 URI works too
>>> p = S3Path("s3://bucket/folder/file.txt")
# construct from S3 ARN works too
>>> p = S3Path("arn:aws:s3:::bucket/folder/file.txt")
>>> p.bucket
'bucket'
>>> p.key
'folder/file.txt'
>>> p.uri
's3://bucket/folder/file.txt'
>>> p.console_url # click to preview it in AWS console
'https://s3.console.aws.amazon.com/s3/object/bucket?prefix=folder/file.txt'
>>> p.arn
'arn:aws:s3:::bucket/folder/file.txt'
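To make the relationship between the three spellings concrete, here is a minimal sketch of how an S3 URI or ARN maps onto a bucket and a key. It is not s3pathlib's actual parser; the helper name split_s3_location is hypothetical and only illustrates the string layout.

```python
from typing import Tuple


def split_s3_location(location: str) -> Tuple[str, str]:
    """Split an S3 URI or S3 ARN into (bucket, key).

    Hypothetical helper for illustration only, not part of s3pathlib.
    """
    for prefix in ("s3://", "arn:aws:s3:::"):
        if location.startswith(prefix):
            # Everything up to the first "/" is the bucket, the rest is the key.
            bucket, _, key = location[len(prefix):].partition("/")
            return bucket, key
    raise ValueError(f"not an S3 URI or ARN: {location!r}")


uri_parts = split_s3_location("s3://bucket/folder/file.txt")
arn_parts = split_s3_location("arn:aws:s3:::bucket/folder/file.txt")
```

Both calls produce the same bucket and key, which is why either form can be passed to the constructor.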
Talk to AWS S3 and get some information
# s3pathlib maintains a "context" object that holds the AWS authentication information
# you just need to build your own boto session object and attach to it
>>> import boto3
>>> from s3pathlib import context
>>> context.attach_boto_session(
...     boto3.session.Session(
...         region_name="us-east-1",
...         profile_name="my_aws_profile",
...     )
... )
>>> p = S3Path("bucket", "folder", "file.txt")
>>> p.write_text("a lot of data ...")
>>> p.etag
'3e20b77868d1a39a587e280b99cec4a8'
>>> p.size
56789000
>>> p.size_for_human
'54.16 MB'
# folder works too, you just need to use a trailing "/" to identify that
>>> p = S3Path("bucket", "datalake/")
>>> p.count_objects()
7164 # number of files under this prefix
>>> p.calculate_total_size()
(7164, 236483701963) # 7164 objects, 220.24 GB
>>> p.calculate_total_size(for_human=True)
(7164, '220.24 GB') # 7164 objects, 220.24 GB
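The human-readable figures appear to use binary units (1 GB = 1024 ** 3 bytes). The sketch below shows that conversion under this assumption. The function name format_size is hypothetical and this is not necessarily how s3pathlib computes the string.

```python
def format_size(n_bytes: int) -> str:
    """Render a byte count with binary units (assumed: 1 KB = 1024 bytes)."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    size = float(n_bytes)
    for unit in units:
        # Stop at the first unit where the value drops below 1024,
        # or at the largest unit we know about.
        if size < 1024 or unit == units[-1]:
            return f"{size:.2f} {unit}"
        size /= 1024
    raise AssertionError("unreachable")


total = format_size(236483701963)
```

Under this assumption, 236483701963 bytes comes out as '220.24 GB', matching the total shown above.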
Manipulate Folder in S3
The native S3 write APIs (the operations that change the state of S3) only operate at the object level, and the list_objects API returns at most 1,000 objects per call, so manipulating objects recursively takes additional effort. s3pathlib CAN SAVE YOUR LIFE:
# create an S3 folder
>>> p = S3Path("bucket", "github", "repos", "my-repo/")
# upload all Python files from /my-repo to s3://bucket/github/repos/my-repo/
>>> p.upload_dir("/my-repo", pattern="**/*.py", overwrite=False)
# copy the entire S3 folder to another S3 folder
>>> p2 = S3Path("bucket", "github", "repos", "another-repo/")
>>> p.copy_to(p2, overwrite=True)
# delete all objects in the folder, recursively, to clean up your test bucket
>>> p.delete()
>>> p2.delete()
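For comparison, here is a rough sketch of the "additional effort" a recursive delete takes with a plain boto3 S3 client: paginate through list_objects_v2, then delete in batches of at most 1,000 keys. The helper names iter_keys and delete_prefix are hypothetical, error entries in the delete_objects response are not inspected, and this is not a description of how s3pathlib implements delete().

```python
from typing import Any, Dict, Iterator, List


def iter_keys(s3_client: Any, bucket: str, prefix: str) -> Iterator[str]:
    """Yield every object key under a prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page holds no objects.
        for obj in page.get("Contents", []):
            yield obj["Key"]


def delete_prefix(s3_client: Any, bucket: str, prefix: str) -> int:
    """Delete all objects under a prefix; return the number of keys submitted."""
    submitted = 0
    batch: List[Dict[str, str]] = []
    for key in iter_keys(s3_client, bucket, prefix):
        batch.append({"Key": key})
        if len(batch) == 1000:  # delete_objects accepts at most 1000 keys
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": batch})
            submitted += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        s3_client.delete_objects(Bucket=bucket, Delete={"Objects": batch})
        submitted += len(batch)
    return submitted
```

With a real client this would be called as delete_prefix(boto3.client("s3"), "bucket", "github/repos/my-repo/"). The single p.delete() call above replaces all of this bookkeeping.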
S3 Path Filter
Ever wanted to filter S3 objects by their attributes, such as dirname, basename, file extension, etag, size, or modified time? It is supposed to be simple in Python:
>>> s3bkt = S3Path("bucket") # assume you have a lot of files in this bucket
>>> iterproxy = s3bkt.iter_objects().filter(
...     S3Path.size >= 10_000_000, S3Path.ext == ".csv" # add filter
... )
>>> iterproxy.one() # fetch one
S3Path('s3://bucket/larger-than-10MB-1.csv')
>>> iterproxy.many(3) # fetch three
[
S3Path('s3://bucket/larger-than-10MB-1.csv'),
S3Path('s3://bucket/larger-than-10MB-2.csv'),
S3Path('s3://bucket/larger-than-10MB-3.csv'),
]
>>> for p in iterproxy: # iter the rest
...     print(p)
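The one() / many(3) / "iter the rest" sequence behaves like consuming a single lazy iterator in stages. The following plain-Python analogy shows that pattern with a generator and itertools.islice. ObjectInfo, filter_objects and take are hypothetical stand-ins, not s3pathlib classes, and the sketch makes no claim about the library's internals.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterable, Iterator, List


@dataclass
class ObjectInfo:
    """Hypothetical stand-in for an S3 object listing entry."""
    key: str
    size: int


def filter_objects(
    objects: Iterable[ObjectInfo], min_size: int, ext: str
) -> Iterator[ObjectInfo]:
    """Lazily yield objects that are at least min_size bytes and end with ext."""
    return (o for o in objects if o.size >= min_size and o.key.endswith(ext))


def take(iterator: Iterator[ObjectInfo], n: int) -> List[ObjectInfo]:
    """Consume up to n items from the iterator, leaving the rest untouched."""
    return list(islice(iterator, n))
```

Because the iterator is consumed as it goes, each stage picks up where the previous one stopped, so no object is returned twice.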
File Like Object for Simple IO
S3Path is a file-like object. It supports open and the context manager syntax out of the box. Here are only a few highlights:
# Stream big file by line
>>> p = S3Path("bucket", "log.txt")
>>> with p.open("r") as f:
...     for line in f:
...         print(line)  # do whatever you want with each line
# JSON io
>>> import json
>>> p = S3Path("bucket", "config.json")
>>> with p.open("w") as f:
...     json.dump({"password": "mypass"}, f)
# pandas IO
>>> import pandas as pd
>>> p = S3Path("bucket", "dataset.csv")
>>> df = pd.DataFrame(...)
>>> with p.open("w") as f:
...     df.to_csv(f)
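Since json.dump and similar writers only need a file-like object, the same code can be exercised locally without touching S3. The sketch below uses io.StringIO as a stand-in for the handle returned by p.open(). The helpers save_config and load_config are hypothetical names introduced for illustration.

```python
import io
import json
from typing import Any, Dict, TextIO


def save_config(config: Dict[str, Any], f: TextIO) -> None:
    """Write a config dict as JSON to any text file-like object."""
    json.dump(config, f)


def load_config(f: TextIO) -> Dict[str, Any]:
    """Read a config dict back from any text file-like object."""
    return json.load(f)


# io.StringIO stands in for the object you would get from p.open("w") / p.open("r").
buffer = io.StringIO()
save_config({"region": "us-east-1"}, buffer)
buffer.seek(0)  # rewind before reading back
restored = load_config(buffer)
```

In real use the in-memory buffer would be replaced by a with p.open(...) block, as in the examples above.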
Now that you have a basic understanding of s3pathlib, let’s read the full document to explore its capabilities in greater depth.
Getting Help#
Please use the python-s3pathlib tag on Stack Overflow to get help.
Submit an "I want help" issue ticket on GitHub Issues.
Contributing#
Please see the Contribution Guidelines.
Copyright#
s3pathlib is an open source project. See the LICENSE file for more information.
Install#
s3pathlib is released on PyPI, so all you need is:
$ pip install s3pathlib
To upgrade to the latest version:
$ pip install --upgrade s3pathlib