Pure S3 Path Manipulation#

Sanhe

Apr 20, 2023

15 min read

What is Pure S3 Path#

A Pure S3 Path is a Python object that represents an AWS S3 bucket, object, or folder. However, it’s important to note that a Pure S3 Path object does not make any calls to the AWS API, nor does it imply the existence of the corresponding S3 object. Rather, it’s a lightweight abstraction that allows you to work with S3 paths in a Pythonic, object-oriented manner without incurring any network overhead.

[1]:
from s3pathlib import S3Path

s3path = S3Path("s3://bucket/folder/file.txt")
print(s3path)
S3Path('s3://bucket/folder/file.txt')
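The same "pure" idea exists in Python's standard library. The sketch below is only an analogy and does not use s3pathlib: it relies on pathlib.PurePosixPath, and the path in it is made up for illustration.

```python
from pathlib import PurePosixPath

# A pure path is just structured text: building it performs no I/O,
# so the location below does not need to exist anywhere.
p = PurePosixPath("/no/such/folder/file.txt")

print(p.name)    # file.txt
print(p.suffix)  # .txt
print(p.parent)  # /no/such/folder
```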

Construct an S3 Path object in Python#

With s3pathlib, there are numerous ways to create an S3Path object.

From bucket, and key parts#

In a file system, you typically use a file path like C:\Users\username\file.txt on Windows or /Users/username/file.txt on a POSIX system. It’s similarly intuitive to construct an S3 Path from a string.

[2]:
# construct from bucket, key parts
s3path = S3Path("bucket", "folder", "file.txt")
s3path
[2]:
S3Path('s3://bucket/folder/file.txt')
[3]:
# construct from full path also works
s3path = S3Path("bucket/folder/file.txt")
s3path
[3]:
S3Path('s3://bucket/folder/file.txt')

S3 uses / as a delimiter to organize and browse your keys hierarchically. With s3pathlib, the delimiter is handled intelligently.

[4]:
s3path = S3Path("bucket", "/folder/", "/file.txt")
s3path
[4]:
S3Path('s3://bucket/folder/file.txt')
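To make the delimiter handling concrete, here is a minimal sketch of one way such normalization could work. The helper join_key_parts is hypothetical and is not s3pathlib's actual implementation; it assumes that redundant slashes are dropped and that a trailing slash on the last part marks a directory.

```python
def join_key_parts(*parts: str) -> str:
    """Hypothetical helper: join key parts with exactly one "/" between them."""
    cleaned = [p.strip("/") for p in parts]
    key = "/".join(p for p in cleaned if p)
    # Keep a trailing "/" when the last part had one, so a directory stays a directory.
    if key and parts[-1].endswith("/"):
        key += "/"
    return key


print(join_key_parts("/folder/", "/file.txt"))  # folder/file.txt
print(join_key_parts("folder/"))                # folder/
```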

From S3 URI#

An S3 URI is the unique resource identifier within the context of the S3 protocol. It follows this naming convention: s3://bucket-name/key-name. You can create an S3 Path from an S3 URI.

[5]:
s3path = S3Path("s3://bucket/folder/file.txt")
s3path
[5]:
S3Path('s3://bucket/folder/file.txt')

You can also use the from_s3_uri() factory method to create an S3Path object from a URI.

[6]:
s3path = S3Path.from_s3_uri("s3://bucket/folder/file.txt")
s3path
[6]:
S3Path('s3://bucket/folder/file.txt')
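The naming convention above means a URI splits cleanly into a bucket and a key. The following is a rough sketch with plain string operations. The function split_s3_uri is a hypothetical name, not part of s3pathlib.

```python
def split_s3_uri(uri: str) -> tuple[str, str]:
    """Hypothetical helper: split "s3://bucket/key" into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    # Everything up to the first "/" is the bucket, the rest is the key.
    bucket, _, key = uri[len(prefix):].partition("/")
    return bucket, key


print(split_s3_uri("s3://bucket/folder/file.txt"))  # ('bucket', 'folder/file.txt')
```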

From S3 ARN#

An S3 ARN is the Amazon Resource Name of an S3 resource. It follows this naming convention: arn:aws:s3:::bucket_name/key_name. You can create an S3 Path from an S3 ARN.

[7]:
s3path = S3Path("arn:aws:s3:::bucket/folder/file.txt")
s3path
[7]:
S3Path('s3://bucket/folder/file.txt')

You can use the from_s3_arn() factory method to create an S3Path object from an ARN.

[8]:
s3path = S3Path.from_s3_arn("arn:aws:s3:::bucket/folder/file.txt")
s3path
[8]:
S3Path('s3://bucket/folder/file.txt')
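Because an ARN and a URI differ only in their prefix, converting between them is a prefix swap. The sketch below assumes the standard arn:aws:s3::: prefix shown above. The helper s3_arn_to_uri is a hypothetical name and does not handle other AWS partitions.

```python
ARN_PREFIX = "arn:aws:s3:::"


def s3_arn_to_uri(arn: str) -> str:
    """Hypothetical helper: turn an S3 ARN into the matching S3 URI."""
    if not arn.startswith(ARN_PREFIX):
        raise ValueError(f"not an S3 ARN: {arn!r}")
    # Replace the ARN prefix with the URI scheme; bucket and key stay untouched.
    return "s3://" + arn[len(ARN_PREFIX):]


print(s3_arn_to_uri("arn:aws:s3:::bucket/folder/file.txt"))  # s3://bucket/folder/file.txt
```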

S3 Path Types#

An S3 Path is a logical concept that can represent several different kinds of AWS S3 entities. Here is the list of S3 Path types:

  1. 📜 Classic S3 object: represents an S3 object, such as s3://bucket/folder/file.txt.

  2. 📁 Logical S3 directory: represents an S3 directory, such as s3://bucket/folder/.

  3. 🪣 S3 bucket: represents an S3 bucket, such as s3://bucket/.

  4. Void Path: denotes the absence of any bucket or key; it is a blank slate with no bucket and no key.

  5. Relative Path: represents a path relative to another S3 Path. For example, the relative path from s3://bucket/folder/file.txt to s3://bucket/ is simply folder/file.txt. A relative path can be joined with another S3 Path to create a new S3 Path. Importantly, any concrete path joined with a void path will result in the original concrete path.

  6. Concrete Path: represents an S3 Path that refers to a concrete object in the S3 storage system. This includes classic S3 object paths, logical S3 directory paths, and S3 bucket paths. Any concrete path joined with a relative path will result in another concrete path.
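The list above can be read as a small decision procedure over a bucket and a key. The sketch below is one hypothetical way to express it. The function classify is not an s3pathlib API, and it returns only the most specific label (a bucket also counts as a directory, and a void path also counts as relative).

```python
from typing import Optional


def classify(bucket: Optional[str], key: str) -> str:
    """Hypothetical classifier: most specific path type for (bucket, key)."""
    if bucket is None:
        # No bucket: either nothing at all, or a path relative to something else.
        return "void" if key == "" else "relative"
    if key == "":
        return "bucket"
    # With a bucket and a key, a trailing "/" marks a logical directory.
    return "directory" if key.endswith("/") else "object"


print(classify("bucket", "folder/file.txt"))  # object
print(classify("bucket", "folder/"))          # directory
print(classify("bucket", ""))                 # bucket
print(classify(None, ""))                     # void
print(classify(None, "folder/file.txt"))      # relative
```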

Classic S3 object#

Similar to a file on your local laptop, an S3 object stores your data. The S3 Path itself is only a pointer, so the object it names doesn’t have to exist in S3.

[9]:
s3path = S3Path("s3://bucket/folder/file.txt")
s3path
[9]:
S3Path('s3://bucket/folder/file.txt')

Several “is XYZ” test methods tell you whether the S3 Path object is a “file”, “directory”, “bucket”, “void path”, or “relative path”.

[10]:
s3path.is_file()
[10]:
True
[11]:
s3path.is_dir()
[11]:
False
[12]:
s3path.is_bucket()
[12]:
False
[13]:
s3path.is_void()
[13]:
False
[14]:
s3path.is_relpath()
[14]:
False

Logical S3 Directory#

Since AWS S3 is an object storage system, not a file system, directories are only a logical concept in AWS S3. AWS uses / as the path delimiter in S3 keys. There are two types of directories in AWS S3:

  • Hard directory: When you create a folder in the S3 console, it creates a special object without any content (an empty string) with the / character at the end of the key. You can see the folder as an object in the list_objects API response.

  • Soft directory: This type of directory does not actually exist; it is a virtual concept used to help organize your objects in a folder. For example, if you have an S3 object like s3://bucket/folder/file.txt, then the s3://bucket/folder/ path is a soft folder. Although you can see it in the S3 console, it does not actually exist.
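The difference between the two kinds of directory can be sketched against a plain list of keys, with no AWS calls involved. The helper directory_kind and the sample keys are hypothetical; the sketch assumes the prefix always ends with "/".

```python
def directory_kind(keys: list[str], prefix: str) -> str:
    """Hypothetical helper: classify `prefix` against a list of existing keys."""
    if prefix in keys:
        return "hard"     # a placeholder object whose key ends with "/"
    if any(k.startswith(prefix) for k in keys):
        return "soft"     # only implied by longer keys
    return "missing"      # nothing in the listing refers to it


keys = ["folder/file.txt", "created-in-console/"]
print(directory_kind(keys, "created-in-console/"))  # hard
print(directory_kind(keys, "folder/"))              # soft
print(directory_kind(keys, "other/"))               # missing
```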

You can create an S3 directory from a string, a URI, or an ARN.

[15]:
s3dir = S3Path("bucket", "folder/")
s3dir
[15]:
S3Path('s3://bucket/folder/')
[16]:
s3dir = S3Path("s3://bucket/folder/")
s3dir
[16]:
S3Path('s3://bucket/folder/')
[17]:
s3dir = S3Path("arn:aws:s3:::bucket/folder/")
s3dir
[17]:
S3Path('s3://bucket/folder/')

Sometimes, you may be concerned that you forgot to append a trailing slash / to the end of a path to indicate that it refers to a directory. In this case, you can use the to_dir() method to ensure that the path refers to a directory.

[18]:
s3dir = S3Path("bucket", "folder").to_dir()
s3dir
[18]:
S3Path('s3://bucket/folder/')

You can use the “is XYZ” test methods on an S3 directory too.

[19]:
s3dir.is_dir()
[19]:
True
[20]:
s3dir.is_file()
[20]:
False
[21]:
s3dir.is_bucket()
[21]:
False
[22]:
s3dir.is_void()
[22]:
False
[23]:
s3dir.is_relpath()
[23]:
False

S3 Bucket#

An S3 bucket is a special type of directory that can be thought of as a “root” directory without a key. In other words, it represents the top-level directory of the bucket, and it is both a bucket and a directory in its own right.

[24]:
s3bkt = S3Path("bucket")
s3bkt
[24]:
S3Path('s3://bucket/')
[25]:
s3bkt.is_bucket()
[25]:
True
[26]:
s3bkt.is_dir()
[26]:
True
[27]:
s3bkt.is_file()
[27]:
False

You can use the root property to get the S3 bucket of any S3 object or directory.

[28]:
s3bkt = S3Path("bucket/folder/file.txt").root
s3bkt
[28]:
S3Path('s3://bucket/')

Void Path#

A Void path should not be used in your application. It is still useful as an indicator that something is wrong if you accidentally attempt to perform an S3 API operation with it.

[29]:
s3path = S3Path()
s3path
[29]:
S3VoidPath()
[30]:
s3path.is_void()
[30]:
True
[31]:
s3path.is_file()
[31]:
False
[32]:
s3path.is_dir()
[32]:
False
[33]:
s3path.is_bucket()
[33]:
False
[34]:
s3path.is_relpath()
[34]:
True

Relative Path#

Relative paths are very useful for S3 Path calculations. For example, if you want to move all objects in folder A to another folder B, you can use the relative path from each object C to A to calculate the target location in B. Specifically, the target location for each object can be found by joining the relative path from C to A with the folder path B. In other words, the formula for the target path is: Target = B + (C - A).

Although you can construct a relative path manually, I don’t recommend it. Use the path calculation method relative_to() to create it instead.

[35]:
# The correct way
s3relpath = S3Path("s3://bucket/folder/file.txt").relative_to(S3Path("s3://bucket/folder"))
s3relpath
[35]:
S3RelPath('file.txt')
[36]:
# The manual way (NOT RECOMMENDED)
s3relpath = S3Path.make_relpath("file.txt")
s3relpath
[36]:
S3RelPath('file.txt')
[37]:
s3path = S3Path("s3://another-bucket/another-folder").to_dir().joinpath(s3relpath)
s3path
[37]:
S3Path('s3://another-bucket/another-folder/file.txt')
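The Target = B + (C - A) formula from this section can also be worked through on plain key strings. This is a minimal sketch that does not use s3pathlib: move_target and the folder names are hypothetical, and both directory keys are assumed to end with "/".

```python
def move_target(src_key: str, src_dir: str, dst_dir: str) -> str:
    """Hypothetical helper: Target = B + (C - A) on plain key strings."""
    if not src_key.startswith(src_dir):
        raise ValueError(f"{src_key!r} is not under {src_dir!r}")
    relative = src_key[len(src_dir):]  # C - A
    return dst_dir + relative          # B + (C - A)


a = "folder-a/"
b = "folder-b/"
for c in ["folder-a/1.txt", "folder-a/sub/2.txt"]:
    print(c, "->", move_target(c, a, b))
# folder-a/1.txt -> folder-b/1.txt
# folder-a/sub/2.txt -> folder-b/sub/2.txt
```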

S3 Path Variable Naming Convention#

I recommend the following variable naming convention for the different types of S3 Path, so that when you read the code you can easily tell what to expect.

  • s3path_xyz: Classic S3 object

  • s3dir_xyz: Logical S3 directory

  • s3bkt_xyz: S3 bucket

  • s3void_xyz: Void Path

  • s3relpath_xyz: Relative Path

S3 Path Attributes#

An S3 Path object has many useful attributes (implemented as properties).

  • bucket: Return the bucket name as a string.

  • key: Return the S3 key as a string.

  • parts: Provide sequence-like access to the components of the S3 key.

  • uri: Return the AWS S3 URI.

  • arn: Return the AWS S3 resource ARN.

  • console_url: Return a URL for inspecting the object or directory in the AWS Console.

  • us_gov_cloud_console_url: Return a GovCloud URL for inspecting the object or directory in the AWS Console.

[38]:
# create an instance
s3path = S3Path("bucket", "folder", "file.txt")
[39]:
s3path.bucket
[39]:
'bucket'
[40]:
s3path.key
[40]:
'folder/file.txt'
[41]:
s3path.parts
[41]:
['folder', 'file.txt']

The S3Path class is both immutable and hashable. These attributes don’t require any AWS boto3 API calls, so they are always available. Because S3Path objects are immutable, you cannot change the value of these attributes once the object has been created.

[42]:
try:
    s3path.bucket = "new-bucket"
except Exception as e:
    print(e)
can't set attribute S3Path.bucket
[43]:
s3path.uri
[43]:
's3://bucket/folder/file.txt'
[44]:
s3path.arn
[44]:
'arn:aws:s3:::bucket/folder/file.txt'
[45]:
s3path.console_url
[45]:
'https://console.aws.amazon.com/s3/object/bucket?prefix=folder/file.txt'
[46]:
s3path.us_gov_cloud_console_url
[46]:
'https://console.amazonaws-us-gov.com/s3/object/bucket?prefix=folder/file.txt'

Logically, an S3Path is also a file-system-like object, so it exposes the familiar file system concepts too:

  • basename: the file name with extension.

  • fname: file name without file extension.

  • ext: file extension, if available

  • dirname: the basename of the parent directory

  • abspath: the absolute path is the full path from the root drive. You can think of S3 bucket as the root drive.

  • parent: the parent directory S3 Path

  • dirpath: the absolute path of the parent directory. It is equal to s3path.parent.abspath

[47]:
s3path.basename
[47]:
'file.txt'
[48]:
s3path.fname
[48]:
'file'
[49]:
s3path.ext
[49]:
'.txt'
[50]:
s3path.dirname
[50]:
'folder'
[51]:
s3path.abspath
[51]:
'/folder/file.txt'
[52]:
s3path.parent
[52]:
S3Path('s3://bucket/folder/')
[53]:
s3path.dirpath
[53]:
'/folder/'
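For readers who know the standard library, the same values can be reproduced from the key with posixpath. This sketch only mirrors the outputs shown above for this one key. It is not how s3pathlib computes them.

```python
import posixpath

key = "folder/file.txt"

basename = posixpath.basename(key)         # file name with extension
fname, ext = posixpath.splitext(basename)  # file name and extension, split apart
parent_key = posixpath.dirname(key)        # key of the parent directory
dirname = posixpath.basename(parent_key)   # basename of the parent directory
abspath = "/" + key                        # path from the bucket "root drive"
dirpath = "/" + parent_key + "/"           # absolute path of the parent directory

print(basename, fname, ext, dirname, abspath, dirpath)
# file.txt file .txt folder /folder/file.txt /folder/
```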

S3 Path Methods#

Comparison#

Because every S3Path object corresponds to an S3 URI (except for relative paths), it’s often useful to compare these URIs. Therefore, the comparison operator is implemented for S3Path, allowing you to compare one S3Path to another.

[54]:
S3Path("bucket/file.txt") == S3Path("bucket/file.txt")
[54]:
True
[55]:
S3Path("bucket") == S3Path("bucket")
[55]:
True
[56]:
S3Path("bucket1") == S3Path("bucket2")
[56]:
False
[57]:
S3Path("bucket1") < S3Path("bucket2")
[57]:
True
[58]:
S3Path("bucket1") <= S3Path("bucket2")
[58]:
True
[59]:
# right one is a prefix of the left one
S3Path("bucket/a/1.txt") > S3Path("bucket/a/")
[59]:
True
[60]:
S3Path("bucket/a/1.txt") < S3Path("bucket/a/2.txt")
[60]:
True
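The results above are consistent with ordinary lexicographic ordering of the URI strings. Assuming that is how the comparison behaves (an assumption based only on the outputs shown here, not on the library's source), plain strings reproduce the same ordering:

```python
uris = ["s3://bucket/a/2.txt", "s3://bucket/a/", "s3://bucket/a/1.txt"]

# A prefix sorts before every string that extends it.
print(sorted(uris))
# ['s3://bucket/a/', 's3://bucket/a/1.txt', 's3://bucket/a/2.txt']

print("s3://bucket1/" < "s3://bucket2/")         # True
print("s3://bucket/a/1.txt" > "s3://bucket/a/")  # True
```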

Hash#

S3Path is hashable. You can use the set data structure to deduplicate paths.

[61]:
p1 = S3Path("bucket", "1.txt")
p2 = S3Path("bucket", "2.txt")
p3 = S3Path("bucket", "3.txt")
set1 = {p1, p2}
set2 = {p2, p3}
[62]:
# union
set1.union(set2)
[62]:
{S3Path('s3://bucket/1.txt'),
 S3Path('s3://bucket/2.txt'),
 S3Path('s3://bucket/3.txt')}
[63]:
# intersection
set1.intersection(set2)
[63]:
{S3Path('s3://bucket/2.txt')}
[64]:
# difference
set1.difference(set2)
[64]:
{S3Path('s3://bucket/1.txt')}

Mutate the immutable S3Path#

It’s common to modify existing S3Path objects. However, since S3Path is immutable by design, it cannot be directly edited. Nonetheless, there are numerous utility methods available that enable you to manipulate S3Path objects in various ways.

  • copy(): Create a copy of an S3Path object that is logically equal to the original but is a distinct object in memory. The cached data is also cleared.

  • change(): Create a new S3Path by replacing part of the attributes.

  • joinpath(): join with other path parts or relative paths to form another S3Path.

  • parent: travel back to the parent directory S3Path.
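The general pattern behind these methods is "derive a new value instead of editing the old one". Here is a minimal sketch of that pattern with a frozen dataclass from the standard library. The class FrozenS3Path is a hypothetical stand-in and does not reflect how S3Path is implemented.

```python
from dataclasses import dataclass, replace, FrozenInstanceError


@dataclass(frozen=True)
class FrozenS3Path:
    """Hypothetical stand-in for an immutable, hashable path value."""
    bucket: str
    key: str


p1 = FrozenS3Path("bkt", "a/b/c.jpg")

try:
    p1.bucket = "new-bkt"  # direct edits are rejected
except FrozenInstanceError as e:
    print("cannot mutate:", e)

p2 = replace(p1, bucket="new-bkt")  # build a modified copy instead
print(p1)  # FrozenS3Path(bucket='bkt', key='a/b/c.jpg')
print(p2)  # FrozenS3Path(bucket='new-bkt', key='a/b/c.jpg')

# Equal values hash the same, so a set keeps only one of them.
print(len({p1, replace(p1)}))  # 1
```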

Copy

[65]:
s3path1 = S3Path("bucket", "folder", "file.txt")
s3path2 = s3path1.copy()
s3path2
[65]:
S3Path('s3://bucket/folder/file.txt')
[66]:
s3path1 == s3path2
[66]:
True
[67]:
s3path1 is s3path2
[67]:
False

Change

[68]:
s3path = S3Path("bkt", "a", "b", "c.jpg")
[69]:
# only change the bucket
s3path.change(new_bucket="new-bkt")
[69]:
S3Path('s3://new-bkt/a/b/c.jpg')
[70]:
# only change the absolute path
s3path.change(new_abspath="x/y/z.png")
[70]:
S3Path('s3://bkt/x/y/z.png')
[71]:
# only change the file extension
s3path.change(new_ext=".png")
[71]:
S3Path('s3://bkt/a/b/c.png')
[72]:
# only change the file name
s3path.change(new_fname="ddd")
[72]:
S3Path('s3://bkt/a/b/ddd.jpg')
[73]:
# only change the base name (file name + file extension)
s3path_new = s3path.change(new_basename="ddd.png")
s3path_new
[73]:
S3Path('s3://bkt/a/b/ddd.png')
[74]:
s3path_new.is_file()
[74]:
True
[75]:
# only change the base name, but this time it becomes a folder
s3path_new = s3path.change(new_basename="ddd/")
s3path_new
[75]:
S3Path('s3://bkt/a/b/ddd/')
[76]:
s3path_new.is_dir()
[76]:
True
[77]:
# only change the dir name
s3path.change(new_dirname="ddd/")
[77]:
S3Path('s3://bkt/a/ddd/c.jpg')
[78]:
# only change the dir name (the trailing "/" is optional)
s3path.change(new_dirname="ddd")
[78]:
S3Path('s3://bkt/a/ddd/c.jpg')
[79]:
s3path.change(new_dirpath="xxx/yyy/")
[79]:
S3Path('s3://bkt/xxx/yyy/c.jpg')

Join

S3Path.joinpath accepts one or more relative paths or plain string parts and returns a new S3Path.

[80]:
s3path1 = S3Path("bucket", "folder", "subfolder", "file.txt")
s3path1
[80]:
S3Path('s3://bucket/folder/subfolder/file.txt')
[81]:
s3path2 = s3path1.parent
s3path2
[81]:
S3Path('s3://bucket/folder/subfolder/')
[82]:
relpath1 = s3path1.relative_to(s3path2)
relpath1
[82]:
S3RelPath('file.txt')
[83]:
# join concrete path with a relative path
s3path2.joinpath(relpath1)
[83]:
S3Path('s3://bucket/folder/subfolder/file.txt')
[84]:
s3path3 = s3path2.parent
s3path3
[84]:
S3Path('s3://bucket/folder/')
[85]:
relpath2 = s3path2.relative_to(s3path3)
relpath2
[85]:
S3RelPath('subfolder/')
[86]:
s3path3.joinpath(relpath2, relpath1)
[86]:
S3Path('s3://bucket/folder/subfolder/file.txt')
[87]:
s3path3.joinpath("subfolder", "file.txt")
[87]:
S3Path('s3://bucket/folder/subfolder/file.txt')
[88]:
# it's OK if you mess up with the "/"
s3path3.joinpath("/subfolder/", "/file.txt")
[88]:
S3Path('s3://bucket/folder/subfolder/file.txt')

The / operator provides syntactic sugar for the joinpath method.

[89]:
s3path = S3Path("bucket")
s3path / "file.txt"
[89]:
S3Path('s3://bucket/file.txt')
[90]:
s3path / "folder" / "file.txt"
[90]:
S3Path('s3://bucket/folder/file.txt')

Calculate Relative Path#

The relative_to() method calculates the relative path between two paths. The syntax is s3path_from.relative_to(s3path_to). Note that s3path_to must be a prefix of (or equal to) s3path_from; otherwise the method raises an error.

[91]:
S3Path("bucket", "a/b/c").relative_to(S3Path("bucket", "a")).parts
[91]:
['b', 'c']
[92]:
S3Path("bucket", "a").relative_to(S3Path("bucket", "a")).parts
[92]:
[]
[93]:
# this won't work
try:
    S3Path("bucket", "a").relative_to(S3Path("bucket", "a/b/c")).parts
except Exception as e:
    print(e)
s3://bucket/a does not start with s3://bucket/a/b/c

The - operator provides syntactic sugar for the relative_to method.

[94]:
(S3Path("bucket", "a/b/c") - S3Path("bucket", "a")).parts
[94]:
['b', 'c']
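The prefix rule behind relative_to() can be sketched on lists of path parts. The function relative_parts below is hypothetical and only illustrates the idea; it is not the library's implementation.

```python
def relative_parts(parts_from: list[str], parts_to: list[str]) -> list[str]:
    """Hypothetical helper: drop the `parts_to` prefix from `parts_from`."""
    n = len(parts_to)
    # `parts_to` must match the first n components of `parts_from` exactly.
    if parts_from[:n] != parts_to:
        raise ValueError(f"{parts_from} does not start with {parts_to}")
    return parts_from[n:]


print(relative_parts(["a", "b", "c"], ["a"]))  # ['b', 'c']
print(relative_parts(["a"], ["a"]))            # []
```

Calling relative_parts(["a"], ["a", "b", "c"]) raises ValueError, which mirrors the failing case shown earlier in this section.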

What’s Next#

Now that we have established the basics of working with s3pathlib, let’s explore how to use it to interact with the AWS S3 API.
