S3 Write API#

Sanhe

Apr 20, 2023

15 min read

What is S3 Write API?#

AWS offers several S3 write APIs: put, copy, delete, and more. Because write APIs can have irreversible effects, it is important to understand an API's behavior before using it. In this section, we will learn how to use these APIs.

Simple Text / Bytes Read and Write#

[2]:
from s3pathlib import S3Path

s3path = S3Path("s3://s3pathlib/file.txt")
s3path
[2]:
S3Path('s3://s3pathlib/file.txt')
[3]:
s3path.write_text("Hello Alice!")
s3path.read_text()
[3]:
'Hello Alice!'
[4]:
s3path.write_bytes(b"Hello Bob!")
s3path.read_bytes()
[4]:
b'Hello Bob!'

Note that s3path.write_bytes() and s3path.write_text() silently overwrite an existing object; they do not raise an error if the object already exists. If you want to avoid overwriting, check for the object's existence before writing.

[5]:
if s3path.exists() is False:
    s3path.write_text("Hello Alice!")

s3path.write_bytes() and s3path.write_text() return a new S3Path object representing the object you just put. This is because, on a versioning-enabled bucket, the put_object API creates a new version of the object, so write_bytes() and write_text() return an S3Path pointing to that new version.

[6]:
# in regular bucket, there's no versioning
s3path_new = s3path.write_text("Hello Alice!")
print(s3path_new == s3path)
print(s3path_new is s3path)
True
False
[7]:
# in versioning enabled bucket, write_text() will create a new version
s3path = S3Path("s3://s3pathlib-versioning-enabled/file.txt")
s3path_v1 = s3path.write_text("v1")
s3path_v2 = s3path.write_text("v2")
[8]:
s3path_v1.read_text(version_id=s3path_v1.version_id)
[8]:
'v1'
[9]:
s3path_v2.read_text(version_id=s3path_v2.version_id)
[9]:
'v2'
[10]:
print(f"v1 = {s3path_v1.version_id}")
print(f"v2 = {s3path_v2.version_id}")
v1 = FpAUGgRibznqKGqCcHUc_c_95Hn7ZaJE
v2 = a8tyUUnxHJFt2J3LhEARHrMsOnSYqiSN

File-like object IO#

File Object is an object exposing a file-oriented API (with methods such as read() or write()) to an underlying resource. Depending on the way it was created, a file object can mediate access to a real on-disk file or to another type of storage or communication device (for example standard input/output, in-memory buffers, sockets, pipes, etc.). File objects are also called file-like objects or streams.
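As a quick illustration of the concept, any object exposing the right methods works as a stream; the stdlib io.StringIO is an in-memory example (unrelated to S3, shown only to illustrate the file-like API):

```python
import io
import json

# an in-memory buffer behaves like a file opened in text mode
buf = io.StringIO()
json.dump({"name": "Alice"}, buf)  # write to the buffer as if it were a file

buf.seek(0)  # rewind to the beginning before reading
data = json.load(buf)  # read it back through the same file-oriented API
print(data)
```

S3Path.open returns an object with this same interface, which is why it can be handed directly to libraries like json, yaml, or pandas below.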

Note

Special Thanks to smart_open. S3Path.open is just a wrapper around smart_open.

Tip

S3Path.open also supports the version_id parameter.

JSON#

[11]:
import json

s3path = S3Path("s3://s3pathlib/data.json")

# write to s3
with s3path.open(mode="w") as f:
    json.dump({"name": "Alice"}, f)
[12]:
# read from s3
with s3path.open(mode="r") as f:
    print(json.load(f))
{'name': 'Alice'}

YAML#

[13]:
import yaml

s3path = S3Path("s3://s3pathlib/config.yml")

# write to s3
with s3path.open(mode="w") as f:
    yaml.dump({"name": "Alice"}, f)
[14]:
# read from s3
with s3path.open(mode="r") as f:
    print(yaml.load(f, Loader=yaml.SafeLoader))
{'name': 'Alice'}

Pandas#

[15]:
import pandas as pd

s3path = S3Path("s3://s3pathlib/data.csv")

df = pd.DataFrame(
    [
        (1, "Alice"),
        (2, "Bob"),
    ],
    columns=["id", "name"]
)

# write to s3
with s3path.open(mode="w") as f:
    df.to_csv(f, index=False)
[16]:
# read from s3
with s3path.open(mode="r") as f:
    df = pd.read_csv(f)
    print(df)
   id   name
0   1  Alice
1   2    Bob

Polars#

[17]:
import polars as pl

s3path = S3Path("s3://s3pathlib/data.parquet")

df = pl.DataFrame(
    [
        (1, "Alice"),
        (2, "Bob"),
    ],
    schema=["id", "name"]
)

# write to s3
with s3path.open(mode="wb") as f:
    df.write_parquet(f)
[18]:
# read from s3
with s3path.open(mode="rb") as f:
    df = pl.read_parquet(f)
    print(df)
shape: (2, 2)
┌─────┬───────┐
│ id  ┆ name  │
│ --- ┆ ---   │
│ i64 ┆ str   │
╞═════╪═══════╡
│ 1   ┆ Alice │
│ 2   ┆ Bob   │
└─────┴───────┘

Tagging and Metadata#

An object tag is a key-value pair that helps categorize storage. Tags are mutable, so you can update them at any time.
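The difference between partial update and full replacement of tags can be sketched with plain dicts (hypothetical values, not the library's internals):

```python
tags = {"name": "alice", "age": "18"}

# partial update: merge a patch into the existing tags,
# keeping keys that the patch does not mention
patch = {"age": "24", "email": "alice@email.com"}
updated = {**tags, **patch}
print(updated)  # {'name': 'alice', 'age': '24', 'email': 'alice@email.com'}

# full replacement: the new mapping replaces everything
replaced = {"age": "30"}
print(replaced)  # {'age': '30'}
```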

You can set object metadata in Amazon S3 at the time you upload the object. Object metadata is a set of name-value pairs. After you upload the object, you cannot modify object metadata (immutable). The only way to modify object metadata is to make a copy of the object and set the metadata.

[19]:
s3path = S3Path("s3://s3pathlib/file.txt")
[20]:
# put initial metadata and tags
s3path.write_text("Hello", metadata={"name": "alice", "age": "18"}, tags={"name": "alice", "age": "18"})
[20]:
S3Path('s3://s3pathlib/file.txt')

We have the following methods to interact with tags:

[21]:
# you can use ``S3Path.get_tags()`` to get tags
# this method returns a tuple of two items
# the first item is the version_id
# the second item is the tags
s3path.get_tags()[1]
[21]:
{'name': 'alice', 'age': '18'}
[22]:
# do partial update
s3path.update_tags({"age": "24", "email": "alice@email.com"})
s3path.get_tags()[1]
[22]:
{'name': 'alice', 'age': '24', 'email': 'alice@email.com'}
[23]:
# do full replacement
s3path.put_tags({"age": "30"})
s3path.get_tags()[1]
[23]:
{'age': '30'}
[24]:
# if an object doesn't have tag, it will return empty dict
s3path_without_tags = S3Path("s3://s3pathlib/file-without-tags.txt")
s3path_without_tags.write_text("Hello")
s3path_without_tags.get_tags()[1]
[24]:
{}

You can access the object metadata using the metadata property method. It will first inspect the object-level cache; if not found, it will fetch the metadata from S3 and cache it.
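This caching behavior resembles a generic lazily cached attribute; a minimal sketch of the pattern (not s3pathlib's actual implementation):

```python
class Resource:
    """Illustrates lazy caching: fetch once, then serve from the cache."""

    def __init__(self):
        self._metadata_cache = None
        self.fetch_count = 0  # track how many times we hit the backend

    def _fetch_metadata(self):
        # stand-in for a real network call (e.g. a head_object request)
        self.fetch_count += 1
        return {"name": "alice"}

    @property
    def metadata(self):
        if self._metadata_cache is None:  # first access: fetch and cache
            self._metadata_cache = self._fetch_metadata()
        return self._metadata_cache  # later accesses: served from cache

r = Resource()
print(r.metadata, r.fetch_count)  # {'name': 'alice'} 1
print(r.metadata, r.fetch_count)  # {'name': 'alice'} 1  (no second fetch)
```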

[25]:
s3path.metadata
[25]:
{'age': '18', 'name': 'alice'}

There’s no way to update only the metadata without rewriting the content. You have to put the object again with the new metadata.

[26]:
# the ``write_text`` method returns a new ``S3Path`` object representing the new object (with new metadata)
s3path_new = s3path.write_text("Hello", metadata={"name": "alice", "age": "24"})
[27]:
# You will see the old metadata because you are accessing the metadata cache
# of the old ``S3Path`` object, which was populated before the ``write_text`` above
s3path.metadata
[27]:
{'age': '18', 'name': 'alice'}
[28]:
# You will see new metadata
s3path_new.metadata
[28]:
{'name': 'alice', 'age': '24'}
[29]:
# You can also create a new ``S3Path`` object (without cache) and access the metadata
S3Path("s3://s3pathlib/file.txt").metadata
[29]:
{'age': '24', 'name': 'alice'}

Delete, Copy, Move (Cut)#

``s3pathlib`` provides the following APIs:

  • delete: delete an object or a directory (recursively); similar to os.remove and shutil.rmtree

  • copy_to: copy an object or a directory (recursively) from one location to another; similar to shutil.copy and shutil.copytree

  • move_to: move (cut) an object or a directory (recursively) from one location to another; similar to shutil.move
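For reference, the local-filesystem counterparts named above behave analogously; a minimal stdlib sketch (note how a move is a copy followed by a delete):

```python
import os
import shutil
import tempfile

root = tempfile.mkdtemp()
src = os.path.join(root, "src")
dst = os.path.join(root, "dst")
os.makedirs(src)

# create a file, then exercise copy / move / delete
path = os.path.join(src, "file.txt")
with open(path, "w") as f:
    f.write("hello")

shutil.copytree(src, dst)  # like copy_to on a directory
moved = shutil.move(path, os.path.join(root, "file.txt"))  # like move_to
os.remove(moved)   # like delete on an object
shutil.rmtree(dst)  # like delete on a directory

print(os.path.exists(path), os.path.exists(dst))  # False False
```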

Delete#

Since version 2.X.Y, delete is the recommended API for deleting:

  • object

  • directory

  • specific version of an object

  • all versions of an object

  • all versions of all objects in a directory

By default, if you try to delete everything in an S3 bucket, it will prompt you to confirm the deletion. You can skip the confirmation by setting skip_prompt=True.

[30]:
s3dir = S3Path("s3://s3pathlib/tmp/")
s3dir.joinpath("README.txt").write_text("readme")
s3dir.joinpath("file.txt").write_text("Hello")
s3dir.joinpath("folder/file.txt").write_text("Hello")
s3dir.count_objects()
[30]:
3
[31]:
# Delete a file
s3path_readme = s3dir.joinpath("README.txt")
s3path_readme.delete()
s3path_readme.exists()
[31]:
False
[32]:
s3dir.count_objects()
[32]:
2
[33]:
# Delete the entire folder
s3dir.delete()
s3dir.count_objects()
[33]:
0
[34]:
# Delete a specific version of an object (permanently delete)
s3path = S3Path("s3://s3pathlib-versioning-enabled/file.txt")
s3path.delete(is_hard_delete=True)
v1 = s3path.write_text("v1").version_id
v2 = s3path.write_text("v2").version_id
v3 = s3path.write_text("v3").version_id
s3path.list_object_versions().all()
[34]:
[S3Path('s3://s3pathlib-versioning-enabled/file.txt'),
 S3Path('s3://s3pathlib-versioning-enabled/file.txt'),
 S3Path('s3://s3pathlib-versioning-enabled/file.txt')]
[35]:
s3path.delete(version_id=v1)
try:
    s3path.read_text(version_id=v1)
except Exception as e:
    print(e)
An error occurred (NoSuchVersion) when calling the GetObject operation: The specified version does not exist.
[36]:
s3path.list_object_versions().all()
[36]:
[S3Path('s3://s3pathlib-versioning-enabled/file.txt'),
 S3Path('s3://s3pathlib-versioning-enabled/file.txt')]
[37]:
# Delete all versions of an object (permanently delete)
s3path.delete(is_hard_delete=True)
s3path.list_object_versions().all()
[37]:
[]
[38]:
# Delete all objects all versions in a directory (permanently delete)
s3dir = S3Path("s3://s3pathlib-versioning-enabled/tmp/")
s3path1 = s3dir.joinpath("file1.txt")
s3path2 = s3dir.joinpath("file2.txt")
s3dir.delete(is_hard_delete=True)
s3path1.write_text("v1")
s3path1.write_text("v2")
s3path2.write_text("v1")
s3path2.write_text("v2")
s3dir.list_object_versions().all()
[38]:
[S3Path('s3://s3pathlib-versioning-enabled/tmp/file1.txt'),
 S3Path('s3://s3pathlib-versioning-enabled/tmp/file1.txt'),
 S3Path('s3://s3pathlib-versioning-enabled/tmp/file2.txt'),
 S3Path('s3://s3pathlib-versioning-enabled/tmp/file2.txt')]
[39]:
s3dir.delete(is_hard_delete=True)
s3dir.list_object_versions().all()
[39]:
[]

Copy#

[40]:
s3path_source = S3Path("s3://s3pathlib/source/data.json")
s3path_source.write_text("this is data")
s3path_target = s3path_source.change(new_dirname="target")
print(f"Copy {s3path_source.uri} to {s3path_target.uri} ...")
s3path_source.copy_to(s3path_target, overwrite=True)
print(f"content of {s3path_target.uri} is: {s3path_target.read_text()!r}")
print(f"{s3path_source} still exists: {s3path_source.exists()}")
Copy s3://s3pathlib/source/data.json to s3://s3pathlib/target/data.json ...
content of s3://s3pathlib/target/data.json is: 'this is data'
S3Path('s3://s3pathlib/source/data.json') still exists: True

Move#

A move is a copy followed by deleting the original file; move_to is a shortcut for copy_to plus delete.

[41]:
s3path_source = S3Path("s3://s3pathlib/source/config.yml")
s3path_source.write_text("this is config")
s3path_target = s3path_source.change(new_dirname="target")
print(f"Copy {s3path_source.uri} to {s3path_target.uri} ...")
s3path_source.move_to(s3path_target, overwrite=True)
print(f"content of {s3path_target.uri} is: {s3path_target.read_text()!r}")
print(f"{s3path_source} still exists: {s3path_source.exists()}")
Copy s3://s3pathlib/source/config.yml to s3://s3pathlib/target/config.yml ...
content of s3://s3pathlib/target/config.yml is: 'this is config'
S3Path('s3://s3pathlib/source/config.yml') still exists: False

Upload File or Folder#

s3pathlib provides the upload_file and upload_dir APIs to upload a local file or an entire folder to S3.

Upload File

[45]:
# initially, the file does not exist
s3path = S3Path("s3pathlib", "daily-report.txt")
s3path.exists()
[45]:
False
[46]:
# after uploading a file, it should exist
from pathlib_mate import Path

# create some test files
path = Path("daily-report.txt")
path.write_text("this is a daily report")
s3path.upload_file(path) # or absolute path as string

s3path.exists()
[46]:
True
[47]:
s3path.read_text()
[47]:
'this is a daily report'
[48]:
# By default, upload_file does not allow overwriting; you can set overwrite=True to skip that check.
try:
    s3path.upload_file(path, overwrite=False)
except Exception as e:
    print(e)
cannot write to s3://s3pathlib/daily-report.txt, s3 object ALREADY EXISTS! open console for more details https://console.aws.amazon.com/s3/object/s3pathlib?prefix=daily-report.txt.

Upload Folder

You can easily upload the entire folder to S3. The folder structure will be preserved.

[50]:
# initially, the folder does not exist
s3dir = S3Path("s3pathlib", "uploaded-documents/")
s3dir.exists()
[50]:
False
[51]:
# create some test files
dir_documents = Path("documents")
dir_documents.joinpath("folder").mkdir(exist_ok=True, parents=True)
dir_documents.joinpath("README.txt").write_text("read me first")
dir_documents.joinpath("folder", "file.txt").write_text("this is a file")

s3dir.upload_dir(dir_documents, overwrite=True)

# inspect s3 dir folder structure
for s3path in s3dir.iter_objects():
    print(s3path)
S3Path('s3://s3pathlib/uploaded-documents/README.txt')
S3Path('s3://s3pathlib/uploaded-documents/folder/file.txt')

What’s Next#

With a thorough understanding of all the features provided by s3pathlib, it’s time to see how you can use this package to develop applications for production.
