S3 Write API#
Sanhe
Apr 20, 2023
What is S3 Write API?#
AWS offers many S3 write APIs for putting, copying, and deleting objects, and more. Because write APIs can have irreversible effects, it is important to understand the behavior of an API before using it. This section shows how to use these APIs through s3pathlib.
Simple Text / Bytes Read and Write#
[2]:
from s3pathlib import S3Path
s3path = S3Path("s3://s3pathlib/file.txt")
s3path
[2]:
S3Path('s3://s3pathlib/file.txt')
[3]:
s3path.write_text("Hello Alice!")
s3path.read_text()
[3]:
'Hello Alice!'
[4]:
s3path.write_bytes(b"Hello Bob!")
s3path.read_bytes()
[4]:
b'Hello Bob!'
Note that s3path.write_bytes() and s3path.write_text() silently overwrite an existing file. They do not raise an error if the file already exists. To avoid overwriting, check whether the file exists before writing.
[5]:
if not s3path.exists():
    s3path.write_text("Hello Alice!")
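The check-then-write pattern above can be wrapped in a small helper. The sketch below is a minimal illustration, not part of s3pathlib: `write_text_if_absent` and `FakePath` are hypothetical names, and `FakePath` is an in-memory stand-in exposing only `exists()` and `write_text()` so the example runs without AWS credentials. Keep in mind that a check followed by a write is two separate calls, so another writer could still create the object in between.

```python
class FakePath:
    """Hypothetical in-memory stand-in exposing exists() and write_text()."""

    def __init__(self):
        self.content = None

    def exists(self) -> bool:
        return self.content is not None

    def write_text(self, text: str) -> "FakePath":
        self.content = text
        return self


def write_text_if_absent(path, text: str) -> bool:
    """Write only when the target does not exist; return True if a write happened."""
    if path.exists():
        return False
    path.write_text(text)
    return True


fake = FakePath()
first = write_text_if_absent(fake, "Hello Alice!")  # target absent, so it writes
second = write_text_if_absent(fake, "Hello Bob!")  # target present, so it is skipped
print(first, second, fake.content)
```

The helper accepts any object with those two methods, so under that assumption an `S3Path` could be passed in place of `FakePath`.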
s3path.write_bytes() and s3path.write_text() return a new S3Path object representing the object you just put. This is because, on a versioning-enabled bucket, the put_object API creates a new version of the object, so the returned object should represent that new version.
[6]:
# in regular bucket, there's no versioning
s3path_new = s3path.write_text("Hello Alice!")
print(s3path_new == s3path)
print(s3path_new is s3path)
True
False
[7]:
# in versioning enabled bucket, write_text() will create a new version
s3path = S3Path("s3://s3pathlib-versioning-enabled/file.txt")
s3path_v1 = s3path.write_text("v1")
s3path_v2 = s3path.write_text("v2")
[8]:
s3path_v1.read_text(version_id=s3path_v1.version_id)
[8]:
'v1'
[9]:
s3path_v2.read_text(version_id=s3path_v2.version_id)
[9]:
'v2'
[10]:
print(f"v1 = {s3path_v1.version_id}")
print(f"v2 = {s3path_v2.version_id}")
v1 = FpAUGgRibznqKGqCcHUc_c_95Hn7ZaJE
v2 = a8tyUUnxHJFt2J3LhEARHrMsOnSYqiSN
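To make the versioning behavior above concrete, here is a toy in-memory model of a versioning-enabled bucket. It is only a sketch under stated assumptions: `VersionedStore`, `put`, and `get` are hypothetical names, and the version IDs are random UUID hex strings rather than real S3 version IDs. Each `put` appends a new version instead of replacing the old one, which is why each write needs to hand back a reference to the specific version it created.

```python
import uuid


class VersionedStore:
    """Hypothetical toy model: every put creates a new version of the key."""

    def __init__(self):
        # key -> list of (version_id, content), oldest first
        self._versions = {}

    def put(self, key: str, content: str) -> str:
        version_id = uuid.uuid4().hex  # stand-in for an S3 version ID
        self._versions.setdefault(key, []).append((version_id, content))
        return version_id

    def get(self, key: str, version_id=None) -> str:
        versions = self._versions[key]
        if version_id is None:
            return versions[-1][1]  # latest version
        for vid, content in versions:
            if vid == version_id:
                return content
        raise KeyError(version_id)


store = VersionedStore()
v1 = store.put("file.txt", "v1")
v2 = store.put("file.txt", "v2")
print(store.get("file.txt", version_id=v1))
print(store.get("file.txt", version_id=v2))
print(store.get("file.txt"))
```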
File-like object IO#
A file object is an object exposing a file-oriented API (with methods such as read() or write()) to an underlying resource. Depending on the way it was created, a file object can mediate access to a real on-disk file or to another type of storage or communication device (for example standard input/output, in-memory buffers, sockets, pipes, etc.). File objects are also called file-like objects or streams.
Note
Special thanks to smart_open. S3Path.open is just a wrapper around smart_open.
Tip
S3Path.open also supports the version_id parameter.
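Because functions such as `json.dump` and `json.load` rely only on the stream's `write()` and `read()` methods, any file-like object works with them. The sketch below uses the standard library's in-memory `io.StringIO` in place of an S3 stream, purely to illustrate the idea without needing AWS access. It says nothing about how `S3Path.open` is implemented internally.

```python
import io
import json

# An in-memory text stream stands in for a file-like object opened on S3.
buffer = io.StringIO()
json.dump({"name": "Alice"}, buffer)  # json.dump only needs buffer.write()

buffer.seek(0)  # rewind so the content can be read back
data = json.load(buffer)  # json.load only needs buffer.read()
print(data)
```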
JSON#
[11]:
import json
s3path = S3Path("s3://s3pathlib/data.json")
# write to s3
with s3path.open(mode="w") as f:
    json.dump({"name": "Alice"}, f)
[12]:
# read from s3
with s3path.open(mode="r") as f:
    print(json.load(f))
{'name': 'Alice'}
YAML#
[13]:
import yaml
s3path = S3Path("s3://s3pathlib/config.yml")
# write to s3
with s3path.open(mode="w") as f:
    yaml.dump({"name": "Alice"}, f)
[14]:
# read from s3
with s3path.open(mode="r") as f:
    print(yaml.load(f, Loader=yaml.SafeLoader))
{'name': 'Alice'}
Pandas#
[15]:
import pandas as pd
s3path = S3Path("s3://s3pathlib/data.csv")
df = pd.DataFrame(
    [
        (1, "Alice"),
        (2, "Bob"),
    ],
    columns=["id", "name"]
)
# write to s3
with s3path.open(mode="w") as f:
    df.to_csv(f, index=False)
[16]:
# read from s3
with s3path.open(mode="r") as f:
    df = pd.read_csv(f)
print(df)
   id   name
0   1  Alice
1   2    Bob
Polars#
[17]:
import polars as pl
s3path = S3Path("s3://s3pathlib/data.parquet")
df = pl.DataFrame(
    [
        (1, "Alice"),
        (2, "Bob"),
    ],
    schema=["id", "name"]
)
# write to s3
with s3path.open(mode="wb") as f:
    df.write_parquet(f)
[18]:
# read from s3
with s3path.open(mode="rb") as f:
    df = pl.read_parquet(f)
print(df)
shape: (2, 2)
┌─────┬───────┐
│ id  ┆ name  │
│ --- ┆ ---   │
│ i64 ┆ str   │
╞═════╪═══════╡
│ 1   ┆ Alice │
│ 2   ┆ Bob   │
└─────┴───────┘
Tagging and Metadata#
An object tag is a key-value pair that helps categorize storage. Tags are mutable, so you can update them at any time.
You can set object metadata in Amazon S3 at the time you upload the object. Object metadata is a set of name-value pairs. After you upload the object, you cannot modify object metadata (immutable). The only way to modify object metadata is to make a copy of the object and set the metadata.
[19]:
s3path = S3Path("s3://s3pathlib/file.txt")
[20]:
# put initial metadata and tags
s3path.write_text("Hello", metadata={"name": "alice", "age": "18"}, tags={"name": "alice", "age": "18"})
[20]:
S3Path('s3://s3pathlib/file.txt')
We have the following methods to interact with tags:
get_tags(): Get S3 object tags as a dict of key-value pairs.
put_tags(): Do a full replacement of S3 object tags.
update_tags(): Do a partial update of S3 object tags.
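The difference between a full replacement and a partial update can be illustrated with plain dictionaries. This is only a sketch of the semantics described above, not the library's implementation; `put_tags_model` and `update_tags_model` are hypothetical helper names.

```python
def put_tags_model(existing: dict, new: dict) -> dict:
    """Full replacement: the existing tags are discarded."""
    return dict(new)


def update_tags_model(existing: dict, new: dict) -> dict:
    """Partial update: new keys are added, matching keys are overwritten."""
    merged = dict(existing)
    merged.update(new)
    return merged


tags = {"name": "alice", "age": "18"}
updated = update_tags_model(tags, {"age": "24", "email": "alice@email.com"})
replaced = put_tags_model(updated, {"age": "30"})
print(updated)
print(replaced)
```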
[21]:
# you can use ``S3Path.get_tags()`` to get tags
# this method returns a tuple with two items
# the first item is the version_id
# the second item is the tags
s3path.get_tags()[1]
[21]:
{'name': 'alice', 'age': '18'}
[22]:
# do partial update
s3path.update_tags({"age": "24", "email": "alice@email.com"})
s3path.get_tags()[1]
[22]:
{'name': 'alice', 'age': '24', 'email': 'alice@email.com'}
[23]:
# do full replacement
s3path.put_tags({"age": "30"})
s3path.get_tags()[1]
[23]:
{'age': '30'}
[24]:
# if an object doesn't have any tags, it will return an empty dict
s3path_without_tags = S3Path("s3://s3pathlib/file-without-tags.txt")
s3path_without_tags.write_text("Hello")
s3path_without_tags.get_tags()[1]
[24]:
{}
You can access the object metadata using the metadata property. It first checks the object-level cache; if nothing is cached, it fetches the metadata from S3 and caches it.
[25]:
s3path.metadata
[25]:
{'age': '18', 'name': 'alice'}
There’s no way to only update the metadata without updating the content. You have to put the object again with the new metadata.
[26]:
# the ``write_text`` method returns a new ``S3Path`` object representing the new object (with new metadata)
s3path_new = s3path.write_text("Hello", metadata={"name": "alice", "age": "24"})
[27]:
# You will see the old metadata because you are accessing the metadata cache of the old ``S3Path``;
# that cache was not updated when you did the ``write_text`` above
s3path.metadata
[27]:
{'age': '18', 'name': 'alice'}
[28]:
# You will see new metadata
s3path_new.metadata
[28]:
{'name': 'alice', 'age': '24'}
[29]:
# You can also create a new ``S3Path`` object (without cache) and access the metadata
S3Path("s3://s3pathlib/file.txt").metadata
[29]:
{'age': '24', 'name': 'alice'}
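The stale-cache behavior shown above can be reproduced with a small model. The sketch below is an illustration under stated assumptions rather than s3pathlib's actual code: `FAKE_S3`, `CachedObject`, and `_cache` are hypothetical names, and a module-level dict plays the role of S3. Each instance caches the metadata on first access, so an instance that fetched before a rewrite keeps returning the old value, while a fresh instance sees the new one.

```python
# Hypothetical in-memory stand-in for metadata stored in S3: uri -> metadata
FAKE_S3 = {"s3://bucket/file.txt": {"name": "alice", "age": "18"}}


class CachedObject:
    """Toy model of an object handle with an instance-level metadata cache."""

    def __init__(self, uri: str):
        self.uri = uri
        self._cache = None  # filled on first access to .metadata

    @property
    def metadata(self) -> dict:
        if self._cache is None:
            # cache miss: "fetch" from the fake store and remember a copy
            self._cache = dict(FAKE_S3[self.uri])
        return self._cache


old = CachedObject("s3://bucket/file.txt")
before = old.metadata  # populates the cache of `old`

# Simulate rewriting the object with new metadata
FAKE_S3["s3://bucket/file.txt"] = {"name": "alice", "age": "24"}

stale = old.metadata  # still the cached, old metadata
fresh = CachedObject("s3://bucket/file.txt").metadata  # new handle, new fetch
print(stale)
print(fresh)
```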
Delete, Copy, Move (Cut)#
s3pathlib provides the following APIs:
delete(): delete an object or a directory (recursively), similar to os.remove
Delete#
The delete API is the recommended API since 2.X.Y to delete:
object
directory
specific version of an object
all versions of an object
all versions of all objects in a directory
By default, if you try to delete everything in an S3 bucket, it prompts you to confirm the deletion. You can skip the confirmation by setting skip_prompt=True.
[30]:
s3dir = S3Path("s3://s3pathlib/tmp/")
s3dir.joinpath("README.txt").write_text("readme")
s3dir.joinpath("file.txt").write_text("Hello")
s3dir.joinpath("folder/file.txt").write_text("Hello")
s3dir.count_objects()
[30]:
3
[31]:
# Delete a file
s3path_readme = s3dir.joinpath("README.txt")
s3path_readme.delete()
s3path_readme.exists()
[31]:
False
[32]:
s3dir.count_objects()
[32]:
2
[33]:
# Delete the entire folder
s3dir.delete()
s3dir.count_objects()
[33]:
0
[34]:
# Delete a specific version of an object (permanently delete)
s3path = S3Path("s3://s3pathlib-versioning-enabled/file.txt")
s3path.delete(is_hard_delete=True)
v1 = s3path.write_text("v1").version_id
v2 = s3path.write_text("v2").version_id
v3 = s3path.write_text("v3").version_id
s3path.list_object_versions().all()
[34]:
[S3Path('s3://s3pathlib-versioning-enabled/file.txt'),
S3Path('s3://s3pathlib-versioning-enabled/file.txt'),
S3Path('s3://s3pathlib-versioning-enabled/file.txt')]
[35]:
s3path.delete(version_id=v1)
try:
    s3path.read_text(version_id=v1)
except Exception as e:
    print(e)
An error occurred (NoSuchVersion) when calling the GetObject operation: The specified version does not exist.
[36]:
s3path.list_object_versions().all()
[36]:
[S3Path('s3://s3pathlib-versioning-enabled/file.txt'),
S3Path('s3://s3pathlib-versioning-enabled/file.txt')]
[37]:
# Delete all versions of an object (permanently delete)
s3path.delete(is_hard_delete=True)
s3path.list_object_versions().all()
[37]:
[]
[38]:
# Delete all objects all versions in a directory (permanently delete)
s3dir = S3Path("s3://s3pathlib-versioning-enabled/tmp/")
s3path1 = s3dir.joinpath("file1.txt")
s3path2 = s3dir.joinpath("file2.txt")
s3dir.delete(is_hard_delete=True)
s3path1.write_text("v1")
s3path1.write_text("v2")
s3path2.write_text("v1")
s3path2.write_text("v2")
s3dir.list_object_versions().all()
[38]:
[S3Path('s3://s3pathlib-versioning-enabled/tmp/file1.txt'),
S3Path('s3://s3pathlib-versioning-enabled/tmp/file1.txt'),
S3Path('s3://s3pathlib-versioning-enabled/tmp/file2.txt'),
S3Path('s3://s3pathlib-versioning-enabled/tmp/file2.txt')]
[39]:
s3dir.delete(is_hard_delete=True)
s3dir.list_object_versions().all()
[39]:
[]
Copy#
[40]:
s3path_source = S3Path("s3://s3pathlib/source/data.json")
s3path_source.write_text("this is data")
s3path_target = s3path.change(new_dirname="target")
print(f"Copy {s3path_source.uri} to {s3path_target.uri} ...")
s3path_source.copy_to(s3path_target, overwrite=True)
print(f"content of {s3path_target.uri} is: {s3path_target.read_text()!r}")
print(f"{s3path_source} still exists: {s3path_source.exists()}")
Copy s3://s3pathlib/source/data.json to s3://s3pathlib-versioning-enabled/target/file.txt ...
content of s3://s3pathlib-versioning-enabled/target/file.txt is: 'this is data'
S3Path('s3://s3pathlib/source/data.json') still exists: True
Move#
move is actually a copy followed by a delete of the original file. It is a shortcut for copy_to and delete.
[41]:
s3path_source = S3Path("s3://s3pathlib/source/config.yml")
s3path_source.write_text("this is config")
s3path_target = s3path.change(new_dirname="target")
print(f"Copy {s3path_source.uri} to {s3path_target.uri} ...")
s3path_source.move_to(s3path_target, overwrite=True)
print(f"content of {s3path_target.uri} is: {s3path_target.read_text()!r}")
print(f"{s3path_source} still exists: {s3path_source.exists()}")
Copy s3://s3pathlib/source/config.yml to s3://s3pathlib-versioning-enabled/target/file.txt ...
content of s3://s3pathlib-versioning-enabled/target/file.txt is: 'this is config'
S3Path('s3://s3pathlib/source/config.yml') still exists: False
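The relationship between move, copy, and delete can be sketched with a dictionary acting as a bucket. This is a conceptual model only; `copy_object`, `delete_object`, and `move_object` are hypothetical functions, not s3pathlib APIs.

```python
def copy_object(bucket: dict, src: str, dst: str) -> None:
    """Copy the content stored under src to dst; src is left untouched."""
    bucket[dst] = bucket[src]


def delete_object(bucket: dict, key: str) -> None:
    """Remove the key from the bucket."""
    del bucket[key]


def move_object(bucket: dict, src: str, dst: str) -> None:
    """Move is a copy followed by a delete of the source."""
    copy_object(bucket, src, dst)
    delete_object(bucket, src)


bucket = {"source/config.yml": "this is config"}
move_object(bucket, "source/config.yml", "target/config.yml")
print(bucket)
```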
Upload File or Folder#
s3pathlib provides the following APIs:
upload_file(): upload a file to S3.
upload_dir(): upload a directory (recursively) to an S3 prefix.
Upload File
[45]:
# at the beginning, the file does not exist
s3path = S3Path("s3pathlib", "daily-report.txt")
s3path.exists()
[45]:
False
[46]:
# upload a file, then file should exist
from pathlib_mate import Path
# create some test files
path = Path("daily-report.txt")
path.write_text("this is a daily report")
s3path.upload_file(path) # or absolute path as string
s3path.exists()
[46]:
True
[47]:
s3path.read_text()
[47]:
'this is a daily report'
[48]:
# By default, upload file doesn't allow overwrite, but you can set overwrite as True to skip that check.
try:
    s3path.upload_file(path, overwrite=False)
except Exception as e:
    print(e)
cannot write to s3://s3pathlib/daily-report.txt, s3 object ALREADY EXISTS! open console for more details https://console.aws.amazon.com/s3/object/s3pathlib?prefix=daily-report.txt.
Upload Folder
You can easily upload the entire folder to S3. The folder structure will be preserved.
[50]:
# at the beginning, the folder does not exist
s3dir = S3Path("s3pathlib", "uploaded-documents/")
s3dir.exists()
[50]:
False
[51]:
# create some test files
dir_documents = Path("documents")
dir_documents.joinpath("folder").mkdir(exist_ok=True, parents=True)
dir_documents.joinpath("README.txt").write_text("read me first")
dir_documents.joinpath("folder", "file.txt").write_text("this is a file")
s3dir.upload_dir(dir_documents, overwrite=True)
# inspect s3 dir folder structure
for s3path in s3dir.iter_objects():
    print(s3path)
S3Path('s3://s3pathlib/uploaded-documents/README.txt')
S3Path('s3://s3pathlib/uploaded-documents/folder/file.txt')
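"The folder structure will be preserved" means that each file's path relative to the uploaded folder becomes the tail of its S3 key. The sketch below computes that mapping with the standard library only. `plan_upload` is a hypothetical helper that performs no upload, and it only mirrors the key layout visible in the output above rather than the internals of `upload_dir`.

```python
import tempfile
from pathlib import Path


def plan_upload(local_dir: Path, s3_prefix: str) -> dict:
    """Map each local file to the S3 key it would get if the tree is preserved."""
    plan = {}
    for path in sorted(local_dir.rglob("*")):
        if path.is_file():
            relative = path.relative_to(local_dir).as_posix()
            plan[relative] = s3_prefix + relative
    return plan


with tempfile.TemporaryDirectory() as tmp:
    # create the same test files as in the example above
    root = Path(tmp) / "documents"
    (root / "folder").mkdir(parents=True)
    (root / "README.txt").write_text("read me first")
    (root / "folder" / "file.txt").write_text("this is a file")
    plan = plan_upload(root, "uploaded-documents/")

for relative, key in plan.items():
    print(relative, "->", key)
```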
What’s Next#
With a thorough understanding of all the features provided by s3pathlib, it’s time to see how you can use this package to develop applications for production.