Pure S3 Path Manipulation#
Sanhe
Apr 20, 2023
15 min read
What is Pure S3 Path#
A Pure S3 Path is a Python object that represents an AWS S3 bucket, object, or folder. However, it’s important to note that a Pure S3 Path object does not make any calls to the AWS API, nor does it imply the existence of the corresponding S3 object. Rather, it’s a lightweight abstraction that allows you to work with S3 paths in a Pythonic, object-oriented manner without incurring any network overhead.
[1]:
from s3pathlib import S3Path
s3path = S3Path("s3://bucket/folder/file.txt")
print(s3path)
S3Path('s3://bucket/folder/file.txt')
Construct an S3 Path object in Python#
With s3pathlib, there are numerous ways to create an S3Path object.
From bucket, and key parts#
In a file system, you typically use a file path like C:\Users\username\file.txt on Windows or /Users/username/file.txt on a POSIX system. It’s similarly intuitive to construct an S3 Path from a string.
[2]:
# construct from bucket, key parts
s3path = S3Path("bucket", "folder", "file.txt")
s3path
[2]:
S3Path('s3://bucket/folder/file.txt')
[3]:
# construct from full path also works
s3path = S3Path("bucket/folder/file.txt")
s3path
[3]:
S3Path('s3://bucket/folder/file.txt')
S3 uses / as a delimiter to organize and browse your keys hierarchically. With s3pathlib, the delimiter is handled intelligently.
[4]:
s3path = S3Path("bucket", "/folder/", "/file.txt")
s3path
[4]:
S3Path('s3://bucket/folder/file.txt')
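To make the delimiter handling concrete, here is a minimal plain-Python sketch of one way such normalization could work. The helper name join_key_parts is hypothetical and this is not s3pathlib's actual implementation; it only reproduces the behavior shown in the cell above, assuming a trailing / on the last part marks a directory.

```python
def join_key_parts(*parts: str) -> str:
    """Join key parts with "/" while ignoring stray leading/trailing slashes.

    Hypothetical helper for illustration only, not part of s3pathlib.
    A trailing "/" on the last part is preserved so that directory-style
    keys keep their trailing delimiter.
    """
    # Drop empty parts and strip slashes so "/folder/" and "folder" are equal.
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    key = "/".join(cleaned)
    # Re-attach the trailing delimiter only if the caller asked for one.
    if key and parts[-1].endswith("/"):
        key += "/"
    return key


print(join_key_parts("/folder/", "/file.txt"))  # folder/file.txt
print(join_key_parts("folder", "subfolder/"))   # folder/subfolder/
```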
From S3 URI#
An S3 URI is the unique resource identifier within the context of the S3 protocol. It follows this naming convention: s3://bucket-name/key-name. You can create an S3 Path from an S3 URI.
[5]:
s3path = S3Path("s3://bucket/folder/file.txt")
s3path
[5]:
S3Path('s3://bucket/folder/file.txt')
You can also use the from_s3_uri() factory method to create an S3Path object from a URI.
[6]:
s3path = S3Path.from_s3_uri("s3://bucket/folder/file.txt")
s3path
[6]:
S3Path('s3://bucket/folder/file.txt')
From S3 ARN#
An S3 ARN is the Amazon Resource Name of an S3 resource. It follows this naming convention: arn:aws:s3:::bucket_name/key_name. You can create an S3 Path from an S3 ARN.
[7]:
s3path = S3Path("arn:aws:s3:::bucket/folder/file.txt")
s3path
[7]:
S3Path('s3://bucket/folder/file.txt')
You can also use the from_s3_arn() factory method to create an S3Path object from an ARN.
[8]:
s3path = S3Path.from_s3_arn("arn:aws:s3:::bucket/folder/file.txt")
s3path
[8]:
S3Path('s3://bucket/folder/file.txt')
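The two naming conventions differ only in their prefix, which the following stdlib-only sketch illustrates. The function split_bucket_and_key is a hypothetical helper, not an s3pathlib API, and it assumes the standard arn:aws:s3::: prefix shown above.

```python
from typing import Tuple

S3_URI_PREFIX = "s3://"
S3_ARN_PREFIX = "arn:aws:s3:::"  # assumption: standard AWS partition only


def split_bucket_and_key(path: str) -> Tuple[str, str]:
    """Split an S3 URI or S3 ARN into (bucket, key).

    Hypothetical helper for illustration; not part of s3pathlib.
    """
    for prefix in (S3_URI_PREFIX, S3_ARN_PREFIX):
        if path.startswith(prefix):
            # Everything up to the first "/" is the bucket, the rest is the key.
            bucket, _, key = path[len(prefix):].partition("/")
            return bucket, key
    raise ValueError(f"not an S3 URI or ARN: {path!r}")


print(split_bucket_and_key("s3://bucket/folder/file.txt"))
# ('bucket', 'folder/file.txt')
print(split_bucket_and_key("arn:aws:s3:::bucket/folder/file.txt"))
# ('bucket', 'folder/file.txt')
```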
S3 Path Types#
S3 Path is a logical concept that can represent different types of AWS S3 concepts. Here is the list of S3 Path types:
- 📜 Classic S3 object: represents an S3 object, such as s3://bucket/folder/file.txt.
- 📁 Logical S3 directory: represents an S3 directory, such as s3://bucket/folder/.
- 🪣 S3 bucket: represents an S3 bucket, such as s3://bucket/.
- Void Path: denotes the absence of any bucket or key. It is essentially a blank slate: no bucket and no key.
- Relative Path: represents a path relative to another S3 Path. For example, the relative path from s3://bucket/folder/file.txt to s3://bucket/ is simply folder/file.txt. A relative path can be joined with another S3 Path to create a new S3 Path. Importantly, any concrete path joined with a void path will result in the original concrete path.
- Concrete Path: represents an S3 Path that refers to a concrete object in the S3 storage system. This includes classic S3 object paths, logical S3 directory paths, and S3 bucket paths. Any concrete path joined with a relative path will result in another concrete path.
Classic S3 object#
Similar to a file on your local laptop, an S3 object stores your data. At any given moment, it could be just a pointer, and the object doesn’t have to exist in S3.
[9]:
s3path = S3Path("s3://bucket/folder/file.txt")
s3path
[9]:
S3Path('s3://bucket/folder/file.txt')
Several “is XYZ” test methods tell you whether the S3 Path object is a file, a directory, a bucket, a void path, or a relative path.
[10]:
s3path.is_file()
[10]:
True
[11]:
s3path.is_dir()
[11]:
False
[12]:
s3path.is_bucket()
[12]:
False
[13]:
s3path.is_void()
[13]:
False
[14]:
s3path.is_relpath()
[14]:
False
Logical S3 Directory#
Since AWS S3 is an object storage system, not a file system, directories are only a logical concept in AWS S3. AWS uses / as the path delimiter in S3 keys. There are two types of directories in AWS S3:
- Hard directory: When you create a folder in the S3 console, it creates a special object without any content (an empty string) with the / character at the end of the key. You can see the folder as an object in the list_objects API response.
- Soft directory: This type of directory does not actually exist; it is a virtual concept used to help organize your objects in a folder. For example, if you have an S3 object like s3://bucket/folder/file.txt, then the s3://bucket/folder/ path is a soft folder. Although you can see it in the S3 console, it does not actually exist.
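The difference between the two directory types can be sketched with a plain list of keys. The key list is made-up sample data and classify_directory is a hypothetical helper; no AWS call is made here.

```python
from typing import List

# Keys as they might appear in a listing; the data is invented for illustration.
keys = [
    "hard/",        # zero-byte placeholder object created for a folder
    "hard/a.txt",
    "soft/b.txt",   # note: no "soft/" placeholder object exists
]


def classify_directory(prefix: str, keys: List[str]) -> str:
    """Return "hard", "soft", or "missing" for a directory-style prefix."""
    if prefix in keys:
        # A real object whose key ends with "/" backs this directory.
        return "hard"
    if any(k.startswith(prefix) for k in keys):
        # Only implied by the keys of the objects underneath it.
        return "soft"
    return "missing"


print(classify_directory("hard/", keys))   # hard
print(classify_directory("soft/", keys))   # soft
print(classify_directory("other/", keys))  # missing
```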
You can create an S3 directory from a string, a URI, or an ARN.
[15]:
s3dir = S3Path("bucket", "folder/")
s3dir
[15]:
S3Path('s3://bucket/folder/')
[16]:
s3dir = S3Path("s3://bucket/folder/")
s3dir
[16]:
S3Path('s3://bucket/folder/')
[17]:
s3dir = S3Path("arn:aws:s3:::bucket/folder/")
s3dir
[17]:
S3Path('s3://bucket/folder/')
Sometimes, you may be concerned that you forgot to append a trailing slash / to the end of a path to indicate that it refers to a directory. In this case, you can use the to_dir() method to ensure that the path refers to a directory.
[18]:
s3dir = S3Path("bucket", "folder").to_dir()
s3dir
[18]:
S3Path('s3://bucket/folder/')
You can use the “is XYZ” test methods on an S3 directory too.
[19]:
s3dir.is_dir()
[19]:
True
[20]:
s3dir.is_file()
[20]:
False
[21]:
s3dir.is_bucket()
[21]:
False
[22]:
s3dir.is_void()
[22]:
False
[23]:
s3dir.is_relpath()
[23]:
False
S3 Bucket#
An S3 bucket is a special type of directory that can be thought of as a “root” directory without a key. In other words, it represents the top-level directory of the bucket, and it is both a bucket and a directory in its own right.
[24]:
s3bkt = S3Path("bucket")
s3bkt
[24]:
S3Path('s3://bucket/')
[25]:
s3bkt.is_bucket()
[25]:
True
[26]:
s3bkt.is_dir()
[26]:
True
[27]:
s3bkt.is_file()
[27]:
False
You can use the root property to get the S3 bucket of any S3 object or directory.
[28]:
s3bkt = S3Path("bucket/folder/file.txt").root
s3bkt
[28]:
S3Path('s3://bucket/')
Void Path#
While a void path should not be used in your application, it can serve as an indicator that something is wrong if you accidentally attempt to perform an S3 API operation with it.
[29]:
s3path = S3Path()
s3path
[29]:
S3VoidPath()
[30]:
s3path.is_void()
[30]:
True
[31]:
s3path.is_file()
[31]:
False
[32]:
s3path.is_dir()
[32]:
False
[33]:
s3path.is_bucket()
[33]:
False
[34]:
s3path.is_relpath()
[34]:
True
Relative Path#
Relative paths are very useful for S3 Path calculations. For example, if you want to move all objects in folder A to another folder B, you can use the relative path from each object C to A to calculate the target location in B. Specifically, the target location for each object can be found by joining the relative path from C to A with the folder path B. In other words, the formula for the target path is: Target = B + (C - A).
Although you can construct a relative path manually, I don’t recommend it. Use the path calculation method relative_to() to create it instead.
[35]:
# The correct way
s3relpath = S3Path("s3://bucket/folder/file.txt").relative_to(S3Path("s3://bucket/folder"))
s3relpath
[35]:
S3RelPath('file.txt')
[36]:
# The manual way (NOT RECOMMENDED)
s3relpath = S3Path.make_relpath("file.txt")
s3relpath
[36]:
S3RelPath('file.txt')
[37]:
s3path = S3Path("s3://another-bucket/another-folder").to_dir().joinpath(s3relpath)
s3path
[37]:
S3Path('s3://another-bucket/another-folder/file.txt')
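The formula Target = B + (C - A) can also be checked on plain key strings, without s3pathlib. This is only a sketch of the arithmetic: retarget is a hypothetical helper and the keys are sample data. With S3Path objects, the same computation would combine the relative_to() and joinpath() calls shown in the cells above.

```python
def retarget(key_c: str, prefix_a: str, prefix_b: str) -> str:
    """Compute Target = B + (C - A) on plain key strings.

    Hypothetical helper for illustration; not part of s3pathlib.
    """
    if not key_c.startswith(prefix_a):
        raise ValueError(f"{key_c!r} is not under {prefix_a!r}")
    relative = key_c[len(prefix_a):]  # C - A
    return prefix_b + relative         # B + (C - A)


source_keys = ["a/1.txt", "a/sub/2.txt"]
targets = [retarget(k, "a/", "b/") for k in source_keys]
print(targets)  # ['b/1.txt', 'b/sub/2.txt']
```

Note that the nested key a/sub/2.txt keeps its sub/ component, which is the point of computing the relative path first instead of reusing only the file name.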
S3 Path Variable Naming Convention#
I recommend the following variable naming convention for different types of S3 Path. So when you read the code, you can easily tell what to expect.
- s3path_xyz: Classic S3 object
- s3dir_xyz: Logical S3 directory
- s3bkt_xyz: S3 bucket
- s3void_xyz: Void Path
- s3relpath_xyz: Relative Path
S3 Path Attributes#
An S3 Path object has many useful attributes (implemented as property methods).
- bucket: return the bucket name as a string.
- key: return the S3 key as a string.
- parts: provide sequence-like access to the components of the S3 key.
- uri: return the AWS S3 URI.
- arn: return the AWS S3 resource ARN.
- console_url: return a URL for inspecting the object or directory in the AWS Console.
- us_gov_cloud_console_url: return a GovCloud URL for inspecting the object or directory in the AWS Console.
[38]:
# create an instance
s3path = S3Path("bucket", "folder", "file.txt")
[39]:
s3path.bucket
[39]:
'bucket'
[40]:
s3path.key
[40]:
'folder/file.txt'
[41]:
s3path.parts
[41]:
['folder', 'file.txt']
The S3Path class is both immutable and hashable. These attributes don’t require any AWS boto3 API calls and are generally available. Because S3Path objects are immutable, you cannot change the value of these attributes once they have been created.
[42]:
try:
    s3path.bucket = "new-bucket"
except Exception as e:
    print(e)
can't set attribute S3Path.bucket
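As a rough analogy for why immutability and hashability go together, here is a stdlib-only sketch using a frozen dataclass. FrozenPath is a hypothetical stand-in and says nothing about how S3Path is actually implemented.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class FrozenPath:
    """Hypothetical stand-in for an immutable, hashable path object."""
    bucket: str
    key: str


p = FrozenPath("bucket", "folder/file.txt")

# Assigning to a field of a frozen dataclass raises an error.
try:
    p.bucket = "new-bucket"
except FrozenInstanceError as e:
    print(type(e).__name__)  # FrozenInstanceError

# Because the fields can never change, the hash stays stable,
# so the object is safe to use as a dict key or set member.
sizes = {p: 123}
print(sizes[FrozenPath("bucket", "folder/file.txt")])  # 123
```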
[43]:
s3path.uri
[43]:
's3://bucket/folder/file.txt'
[44]:
s3path.arn
[44]:
'arn:aws:s3:::bucket/folder/file.txt'
[45]:
s3path.console_url
[45]:
'https://console.aws.amazon.com/s3/object/bucket?prefix=folder/file.txt'
[46]:
s3path.us_gov_cloud_console_url
[46]:
'https://console.amazonaws-us-gov.com/s3/object/bucket?prefix=folder/file.txt'
Logically, an S3Path is also a file-system-like object, so it has these file system concepts too:
- basename: the file name with extension.
- fname: the file name without the file extension.
- ext: the file extension, if available.
- dirname: the basename of the parent directory.
- abspath: the absolute path, which is the full path from the root drive. You can think of the S3 bucket as the root drive.
- parent: the parent directory S3 Path.
- dirpath: the absolute path of the parent directory. It is equal to s3path.parent.abspath.
[47]:
s3path.basename
[47]:
'file.txt'
[48]:
s3path.fname
[48]:
'file'
[49]:
s3path.ext
[49]:
'.txt'
[50]:
s3path.dirname
[50]:
'folder'
[51]:
s3path.abspath
[51]:
'/folder/file.txt'
[52]:
s3path.parent
[52]:
S3Path('s3://bucket/folder/')
[53]:
s3path.dirpath
[53]:
'/folder/'
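For readers who know the standard library, the string-valued attributes above map closely onto posixpath operations applied to the key. The sketch below is only an analogy for this particular key, not a description of how s3pathlib computes them.

```python
import posixpath

key = "folder/file.txt"

basename = posixpath.basename(key)                    # 'file.txt'
fname, ext = posixpath.splitext(basename)             # 'file', '.txt'
dirname = posixpath.basename(posixpath.dirname(key))  # 'folder'
abspath = "/" + key                                   # '/folder/file.txt'
dirpath = posixpath.dirname(abspath) + "/"            # '/folder/'

print(basename, fname, ext, dirname, abspath, dirpath)
```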
S3 Path Methods#
Comparison#
Because every S3Path object corresponds to an S3 URI (except for relative paths), it’s often useful to compare these URIs. Therefore, the comparison operator is implemented for S3Path, allowing you to compare one S3Path to another.
[54]:
S3Path("bucket/file.txt") == S3Path("bucket/file.txt")
[54]:
True
[55]:
S3Path("bucket") == S3Path("bucket")
[55]:
True
[56]:
S3Path("bucket1") == S3Path("bucket2")
[56]:
False
[57]:
S3Path("bucket1") < S3Path("bucket2")
[57]:
True
[58]:
S3Path("bucket1") <= S3Path("bucket2")
[58]:
True
[59]:
# right one is a prefix of the left one
S3Path("bucket/a/1.txt") > S3Path("bucket/a/")
[59]:
True
[60]:
S3Path("bucket/a/1.txt") < S3Path("bucket/a/2.txt")
[60]:
True
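One way to read these results is as ordinary string comparison of the corresponding URIs. The sketch below assumes the ordering follows the URI strings, which is consistent with the outputs shown above but is not confirmed here as the library's internal rule.

```python
uris = [
    "s3://bucket/a/2.txt",
    "s3://bucket/a/",
    "s3://bucket/a/1.txt",
]

# A prefix sorts before any longer string that extends it.
print("s3://bucket/a/1.txt" > "s3://bucket/a/")  # True
print("s3://bucket1/" < "s3://bucket2/")         # True

# Sorting therefore places a directory before the objects under it.
ordered = sorted(uris)
print(ordered)
# ['s3://bucket/a/', 's3://bucket/a/1.txt', 's3://bucket/a/2.txt']
```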
Hash#
S3Path is hashable, so you can use a set to deduplicate S3Path objects.
[61]:
p1 = S3Path("bucket", "1.txt")
p2 = S3Path("bucket", "2.txt")
p3 = S3Path("bucket", "3.txt")
set1 = {p1, p2}
set2 = {p2, p3}
[62]:
# union
set1.union(set2)
[62]:
{S3Path('s3://bucket/1.txt'),
S3Path('s3://bucket/2.txt'),
S3Path('s3://bucket/3.txt')}
[63]:
# intersection
set1.intersection(set2)
[63]:
{S3Path('s3://bucket/2.txt')}
[64]:
# difference
set1.difference(set2)
[64]:
{S3Path('s3://bucket/1.txt')}
Mutate the immutable S3Path#
It’s common to modify existing S3Path objects. However, since S3Path is immutable by design, it cannot be directly edited. Nonetheless, there are numerous utility methods available that enable you to manipulate S3Path objects in various ways.
- copy(): create a copy of an S3Path object that is logically equal to this one but has a different identity in memory. The cached data is also cleared.
- change(): create a new S3Path by replacing some of the attributes.
- joinpath(): join with other path parts or relative paths to form another S3Path.
- parent: travel back to the parent directory S3Path.
Copy
[65]:
s3path1 = S3Path("bucket", "folder", "file.txt")
s3path2 = s3path1.copy()
s3path2
[65]:
S3Path('s3://bucket/folder/file.txt')
[66]:
s3path1 == s3path2
[66]:
True
[67]:
s3path1 is s3path2
[67]:
False
Change
[68]:
s3path = S3Path("bkt", "a", "b", "c.jpg")
[69]:
# only change the bucket
s3path.change(new_bucket="new-bkt")
[69]:
S3Path('s3://new-bkt/a/b/c.jpg')
[70]:
# only change the absolute path
s3path.change(new_abspath="x/y/z.png")
[70]:
S3Path('s3://bkt/x/y/z.png')
[71]:
# only change the file extension
s3path.change(new_ext=".png")
[71]:
S3Path('s3://bkt/a/b/c.png')
[72]:
# only change the file name
s3path.change(new_fname="ddd")
[72]:
S3Path('s3://bkt/a/b/ddd.jpg')
[73]:
# only change the base name (file name + file extension)
s3path_new = s3path.change(new_basename="ddd.png")
s3path_new
[73]:
S3Path('s3://bkt/a/b/ddd.png')
[74]:
s3path_new.is_file()
[74]:
True
[75]:
# only change the base name, but this time it becomes a folder
s3path_new = s3path.change(new_basename="ddd/")
s3path_new
[75]:
S3Path('s3://bkt/a/b/ddd/')
[76]:
s3path_new.is_dir()
[76]:
True
[77]:
# only change the dir name
s3path.change(new_dirname="ddd/")
[77]:
S3Path('s3://bkt/a/ddd/c.jpg')
[78]:
# only change the dir name
s3path.change(new_dirname="ddd")
[78]:
S3Path('s3://bkt/a/ddd/c.jpg')
[79]:
s3path.change(new_dirpath="xxx/yyy/")
[79]:
S3Path('s3://bkt/xxx/yyy/c.jpg')
Join
S3Path.joinpath is a versatile method: it accepts plain string parts as well as relative paths.
[80]:
s3path1 = S3Path("bucket", "folder", "subfolder", "file.txt")
s3path1
[80]:
S3Path('s3://bucket/folder/subfolder/file.txt')
[81]:
s3path2 = s3path1.parent
s3path2
[81]:
S3Path('s3://bucket/folder/subfolder/')
[82]:
relpath1 = s3path1.relative_to(s3path2)
relpath1
[82]:
S3RelPath('file.txt')
[83]:
# join concrete path with a relative path
s3path2.joinpath(relpath1)
[83]:
S3Path('s3://bucket/folder/subfolder/file.txt')
[84]:
s3path3 = s3path2.parent
s3path3
[84]:
S3Path('s3://bucket/folder/')
[85]:
relpath2 = s3path2.relative_to(s3path3)
relpath2
[85]:
S3RelPath('subfolder/')
[86]:
s3path3.joinpath(relpath2, relpath1)
[86]:
S3Path('s3://bucket/folder/subfolder/file.txt')
[87]:
s3path3.joinpath("subfolder", "file.txt")
[87]:
S3Path('s3://bucket/folder/subfolder/file.txt')
[88]:
# it's OK if you mess up with the "/"
s3path3.joinpath("/subfolder/", "/file.txt")
[88]:
S3Path('s3://bucket/folder/subfolder/file.txt')
The / operator provides syntactic sugar for the joinpath method.
[89]:
s3path = S3Path("bucket")
s3path / "file.txt"
[89]:
S3Path('s3://bucket/file.txt')
[90]:
s3path / "folder" / "file.txt"
[90]:
S3Path('s3://bucket/folder/file.txt')
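In Python, this kind of operator sugar is typically provided by the __truediv__ special method. The toy class below shows the general pattern; MiniPath is hypothetical and is not s3pathlib's implementation.

```python
class MiniPath:
    """Hypothetical toy path class used only to illustrate operator sugar."""

    def __init__(self, *parts: str):
        # Normalize parts so stray slashes do not matter.
        self.parts = [p.strip("/") for p in parts if p.strip("/")]

    def joinpath(self, *others: str) -> "MiniPath":
        # Return a new object instead of mutating this one.
        return MiniPath(*self.parts, *others)

    def __truediv__(self, other: str) -> "MiniPath":
        # `path / "x"` delegates to joinpath, so both spellings agree.
        return self.joinpath(other)

    def __repr__(self) -> str:
        return f"MiniPath({'/'.join(self.parts)!r})"


root = MiniPath("bucket")
print(root / "folder" / "file.txt")         # MiniPath('bucket/folder/file.txt')
print(root.joinpath("folder", "file.txt"))  # MiniPath('bucket/folder/file.txt')
```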
Calculate Relative Path#
The relative_to()
method is used to calculate the relative path between two paths. The syntax for this method is s3path_from.relative_to(s3path_to)
. Note that the s3path_to
argument must be a shorter path than the s3path_from
argument in order for the method to work correctly.
[91]:
S3Path("bucket", "a/b/c").relative_to(S3Path("bucket", "a")).parts
[91]:
['b', 'c']
[92]:
S3Path("bucket", "a").relative_to(S3Path("bucket", "a")).parts
[92]:
[]
[93]:
# this won't work
try:
    S3Path("bucket", "a").relative_to(S3Path("bucket", "a/b/c")).parts
except Exception as e:
    print(e)
s3://bucket/a does not start with s3://bucket/a/b/c
The - operator provides syntactic sugar for the relative_to method.
[94]:
(S3Path("bucket", "a/b/c") - S3Path("bucket", "a")).parts
[94]:
['b', 'c']
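The prefix requirement of relative_to() can be sketched on plain lists of parts. The function relative_parts is a hypothetical helper that mirrors the three cases shown above; it is not the library's code.

```python
from typing import List


def relative_parts(parts_from: List[str], parts_to: List[str]) -> List[str]:
    """Return the parts of `parts_from` that remain after the prefix `parts_to`.

    Hypothetical helper for illustration; not part of s3pathlib.
    """
    if parts_from[:len(parts_to)] != parts_to:
        raise ValueError(f"{parts_from} does not start with {parts_to}")
    return parts_from[len(parts_to):]


print(relative_parts(["a", "b", "c"], ["a"]))  # ['b', 'c']
print(relative_parts(["a"], ["a"]))            # []

# The reverse direction fails because the longer path is not a prefix.
try:
    relative_parts(["a"], ["a", "b", "c"])
except ValueError as e:
    print(e)
```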
What’s Next#
Now that we have established the basics of working with s3pathlib, let’s explore how to use it to interact with the AWS S3 API.