S3 Read API#

Sanhe

Apr 20, 2023

15 min read

What is the S3 Read API#

AWS S3 provides a wide range of APIs, and some of these functions only retrieve information from the server without changing the state of the S3 bucket (e.g. no files are moved, changed, or deleted). Unlike Write API functions, using Read API functions improperly will NOT cause any negative impact. Therefore, it is recommended to start by exploring the Read API functions before diving into the Write API.

Configure the AWS Context object#

Before you can run any AWS API, you must first authenticate. The Context is a singleton object that manages authenticated sessions.

To get started, it’s necessary to configure AWS CLI credentials on your local machine. If you’re unsure how to do this, you can follow this official guide provided by AWS.

Once that’s done, you can run the following command to test your authentication.

[4]:
!aws sts get-caller-identity
{
    "UserId": "ABCDEFABCDEFABCDEFABC",
    "Account": "111122223333",
    "Arn": "arn:aws:iam::111122223333:user/johndoe"
}

The Context object stores a pre-authenticated boto session, which is created using your default credentials (if available). However, you can also configure a custom boto session yourself and attach it to the context.

[2]:
import boto3
from s3pathlib import context

context.attach_boto_session(
    boto3.session.Session(
        region_name="us-east-1",
        profile_name="my_aws_profile",
    )
)

When s3pathlib makes AWS API calls, it uses the boto session stored in the Context object by default. However, you can always explicitly pass a custom boto session to the API call if needed.

[10]:
from s3pathlib import S3Path
from boto_session_manager import BotoSesManager

bsm = BotoSesManager(
    region_name="us-east-1",
    profile_name="my_aws_profile",
)
s3path = S3Path("s3://my-bucket/test.txt")
_ = s3path.write_text("hello world", bsm=bsm) # explicitly pass the boto session

If you are running the code from a cloud machine such as AWS EC2 or AWS Lambda, follow this official guide to grant your machine proper AWS S3 access.

Get S3 Object Metadata#

An object consists of data and its descriptive metadata. s3pathlib provides a user-friendly interface for accessing object metadata without needing to explicitly invoke the API. Additionally, it automatically caches the underlying head_object API response for improved performance.

[5]:
s3path = S3Path("s3://s3pathlib/test.txt")
s3path.write_text("hello world" * 1000) # create a test object
[5]:
S3Path('s3://s3pathlib/test.txt')
[14]:
s3path.etag
[14]:
'4d5d1cba9eb18884a5410f4b83bc6951'
[15]:
s3path.last_modified_at
[15]:
datetime.datetime(2023, 4, 20, 7, 1, 13, tzinfo=tzutc())
[16]:
s3path.size
[16]:
11000
[17]:
s3path.size_for_human
[17]:
'10.74 KB'
[18]:
s3path.version_id
[18]:
'null'
[23]:
print(s3path.expire_at)
None

Note

Metadata is cached only once, either when it’s first accessed or when the get_object API is called. The cache is not automatically refreshed and cannot detect server-side changes. To obtain the latest server-side metadata value, you can use the clear_cache() method to clear the cache. The latest data will be retrieved on the next attempt to access the metadata.

Please see the following example.

[26]:
# Create a test file
s3path = S3Path("s3://s3pathlib/file-with-metadata.txt")
s3path.write_text("hello world", metadata={"creator": "s3pathlib"})
print(s3path.size)
print(s3path.metadata)
11
{'creator': 's3pathlib'}
[27]:
# The server side data is changed
s3path.write_text("hello charlice", metadata={"creator": "charlice"})
# You still see the old data
print(s3path.size)
print(s3path.metadata)
11
{'creator': 's3pathlib'}
[28]:
# After you clear the cache, you got the latest data
s3path.clear_cache()
print(s3path.size)
print(s3path.metadata)
14
{'creator': 'charlice'}

Check if Object or Directory Exists#

Check the Existence of an Object#

You can check if an S3 bucket exists using the exists() method.

[32]:
S3Path("s3pathlib").exists()
[32]:
True
[33]:
S3Path("a-bucket-never-exists").exists()
[33]:
False

You can also check whether an S3 object exists.

[34]:
S3Path("s3://s3pathlib/a-file-never-exists.txt").exists()
[34]:
False
[35]:
s3path = S3Path("s3://s3pathlib/test.txt")
s3path.write_text("hello world")
s3path.exists()
[35]:
True

S3 Versioning is a feature that preserves, retrieves, and restores every version of every object stored in your buckets. s3pathlib also supports checking the existence of either “the object (the latest version)” or “a specific version”.

[40]:
# the s3pathlib-versioning-enabled bucket has versioning enabled
s3path = S3Path("s3://s3pathlib-versioning-enabled/test.txt")
# prepare some test data
v1 = s3path.write_text("v1").version_id # add v1
v2 = s3path.write_text("v2").version_id # add v2
s3path.delete() # add a delete marker on v2
[40]:
S3Path('s3://s3pathlib-versioning-enabled/test.txt')
[41]:
# the object (latest) is considered as "not exists" since the latest version is marked as "deleted"
s3path.exists()
[41]:
False
[42]:
s3path.exists(version_id=v1) # but the older version is considered as exists
[42]:
True
[43]:
s3path.read_text(version_id=v1) # verify that it is really the older version
[43]:
'v1'

Check the Existence of a Directory#

As an S3 directory is a logical concept and often doesn’t physically exist, its exists() method will return True only if there is at least one object within the directory or if the directory is a hard folder (an empty object with a trailing “/”).

[44]:
# at the beginning, the folder does not exist because there is no file in it
s3dir = S3Path("s3://s3pathlib/soft-folder/")
s3dir.exists()
[44]:
False
[45]:
# after creating a file in it, even though it is not a hard folder, it is still considered to exist
s3dir.joinpath("file.txt").write_text("hello world")
s3dir.exists()
[45]:
True
[46]:
# at the beginning, the hard folder does not exist because we haven't created it yet
s3dir = S3Path("s3://s3pathlib/hard-folder/")
s3dir.exists()
[46]:
False
[48]:
# after creating a hard folder, now it exists
s3dir.mkdir(exist_ok=True)
s3dir.exists()
[48]:
True
[50]:
# and you can see that the hard folder is just an empty object with a trailing "/" in the S3 key
s3dir.read_text()
[50]:
''
[49]:
# and there is no object in it
s3dir.count_objects()
[49]:
0

You cannot check the existence of a Void path or a Relative path, because they are purely logical concepts.
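
To illustrate, here is a minimal sketch (assuming relative_to() from the library's pathlib-style path manipulation API; the exact exception type raised by exists() on a relative path is an implementation detail):

s3dir = S3Path("s3://s3pathlib/soft-folder/")
relpath = s3dir.joinpath("file.txt").relative_to(s3dir)  # a relative path with no bucket

try:
    relpath.exists()  # nothing on the server to check, so this is expected to raise
except Exception as e:
    print(f"cannot check existence: {e!r}")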

Count Number of Objects and Total Size in a Directory#

The AWS Console has a “Calculate Total Size” button that tells you how many objects are in an S3 folder and their total size. The calculate_total_size() and count_objects() methods can do that too.

(Image: calculate-total-size.png, the “Calculate Total Size” action in the AWS S3 console)
[62]:
s3dir = S3Path("s3://s3pathlib/calculate-total-zie/")
s3dir.mkdir(exist_ok=True) # this is a hard folder and actually exists
s3dir.joinpath("file1.txt").write_text("Hello Alice\n" * 1000)
s3dir.joinpath("file2.txt").write_text("Hello Bob\n" * 1000)
s3dir.joinpath("file3.txt").write_text("Hello Cathy\n" * 1000)
[62]:
S3Path('s3://s3pathlib/calculate-total-size/file3.txt')
[63]:
s3dir.calculate_total_size()
[63]:
(3, 34000)
[64]:
s3dir.calculate_total_size(for_human=True)
[64]:
(3, '33.20 KB')

Note

In the AWS S3 console, when you click the "Create Folder" button, an empty object with a trailing / is created to represent the logical folder. Although invisible to humans, the empty object actually exists and counts as an object in the native AWS boto3 API. As humans, we don't necessarily care about "logical folders" and typically only want to see the number of objects we expect. Therefore, by default, s3pathlib doesn't count logical folders and doesn't include them in the iter_objects() API. If you want to include logical folders, you can use the include_folder=True argument. For more information on logical folders, see the official AWS documentation on Using Folders.

[66]:
# since we include folders, it returns 4 (one hard folder and three objects)
s3dir.count_objects(include_folder=True)
[66]:
4
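
As a hedged follow-up to the note above, the same include_folder flag also works with iter_objects(); with the test folder from this section, the empty "hard folder" object should appear in the iteration alongside the three files:

# sketch: include the empty "hard folder" object in the iteration
for s3path in s3dir.iter_objects(include_folder=True):
    print(s3path)
# expected: the folder key ending with "/" plus file1.txt, file2.txt, and file3.txt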

List and Filter Objects#

In a file system, it is very common to:

  • list all sub-folders and files in the current directory, not recursively.

  • recursively travel through all sub-folders and files.

  • filter folder and files by user-defined criteria.

S3Pathlib provides a user-friendly interface to do so.

List Objects#

The iter_objects() method is the core API for listing and filtering S3 objects (not directories). It supports the following arguments:

  • batch_size: an integer, the number of S3 objects returned per API call. Internally, it makes pagination API calls to iterate through all S3 objects. A large batch size can reduce the total number of API calls and improve performance.

  • limit: an integer, limits the number of objects you want to return.

  • recursive: defaults to True, which iterates through subfolders as well. You can set it to False to only iterate through the top-level folder. A short sketch combining these arguments follows this list.

  • include_folder: defaults to False. If set to True, it also returns empty S3 objects that end with a trailing /, which are considered as folders in the S3 console.
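
Below is a minimal sketch of how these arguments combine; the bucket and prefix here are hypothetical, for illustration only:

s3dir = S3Path("s3://my-bucket/data/")  # hypothetical location

# iterate only the top-level folder, return at most 10 objects,
# and fetch up to 100 keys per underlying pagination call
for s3path in s3dir.iter_objects(batch_size=100, limit=10, recursive=False):
    print(s3path)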

[92]:
# first, let's prepare some test data
s3dir = S3Path("s3://s3pathlib/list-objects/")
s3dir.joinpath("README.txt").write_text("read me please") # 1
s3dir.joinpath("logo.png").write_bytes(b"01010101" * 1000) # 2
s3dir.joinpath("folder/data1.json").write_text('{"name": "alice"}') # 3
s3dir.joinpath("folder/data2.json").write_text('{"name": "bob"}') # 4
s3dir.joinpath("folder/subfolder/config.ini").write_text('this is a config file') # 5
s3dir.joinpath("folder/logs/day1.txt").write_text("Hello Alice\n" * 1000) # 6
s3dir.joinpath("folder/logs/day2.txt").write_text("Hello Bob\n" * 1000) # 7
s3dir.count_objects()
[92]:
7
[93]:
for s3path in s3dir.iter_objects():
    print(s3path)
S3Path('s3://s3pathlib/list-objects/README.txt')
S3Path('s3://s3pathlib/list-objects/folder/data1.json')
S3Path('s3://s3pathlib/list-objects/folder/data2.json')
S3Path('s3://s3pathlib/list-objects/folder/logs/day1.txt')
S3Path('s3://s3pathlib/list-objects/folder/logs/day2.txt')
S3Path('s3://s3pathlib/list-objects/folder/subfolder/config.ini')
S3Path('s3://s3pathlib/list-objects/logo.png')
[94]:
for s3path in s3dir.iter_objects(recursive=False):
    print(s3path)
S3Path('s3://s3pathlib/list-objects/README.txt')
S3Path('s3://s3pathlib/list-objects/logo.png')

The iter_objects() method actually returns an S3PathIterProxy object. This is a user-friendly iterable Python object that allows you to iterate over a subset of the returned data instead of loading everything into memory in one shot. It also provides additional features such as pagination, skipping, getting one or none, and custom filtering.

[95]:
# Create proxy
proxy = s3dir.iter_objects()
[96]:
# Get one item
proxy.one()
[96]:
S3Path('s3://s3pathlib/list-objects/README.txt')
[97]:
# Get many items
proxy.many(2)
[97]:
[S3Path('s3://s3pathlib/list-objects/folder/data1.json'),
 S3Path('s3://s3pathlib/list-objects/folder/data2.json')]
[98]:
# Skip some items
proxy.skip(1) # s3://s3pathlib/list-objects/folder/logs/day1.txt is skipped
proxy.one()
[98]:
S3Path('s3://s3pathlib/list-objects/folder/logs/day2.txt')
[99]:
# Get the rest of items
proxy.all()
[99]:
[S3Path('s3://s3pathlib/list-objects/folder/subfolder/config.ini'),
 S3Path('s3://s3pathlib/list-objects/logo.png')]
[100]:
# Get one item or none
print(proxy.one_or_none())
None
[101]:
# Load everything into a list in one shot
s3dir.iter_objects().all()
[101]:
[S3Path('s3://s3pathlib/list-objects/README.txt'),
 S3Path('s3://s3pathlib/list-objects/folder/data1.json'),
 S3Path('s3://s3pathlib/list-objects/folder/data2.json'),
 S3Path('s3://s3pathlib/list-objects/folder/logs/day1.txt'),
 S3Path('s3://s3pathlib/list-objects/folder/logs/day2.txt'),
 S3Path('s3://s3pathlib/list-objects/folder/subfolder/config.ini'),
 S3Path('s3://s3pathlib/list-objects/logo.png')]

Filter Objects#

Filtering by Attributes#

s3pathlib provides a SQL-like interface that allows you to filter objects by their attributes. Below is the full list of built-in attributes that can be used for filtering:

  • S3Path.bucket: str

  • S3Path.key: str

  • S3Path.uri: str

  • S3Path.arn: str

  • S3Path.parts: list[str]

  • S3Path.basename: str

  • S3Path.fname: str

  • S3Path.ext: str

  • S3Path.dirname: str

  • S3Path.dirpath: str

  • S3Path.abspath: str

  • S3Path.etag: str

  • S3Path.size: int

  • S3Path.last_modified_at: datetime

  • S3Path.version_id: str

  • S3Path.expire_at: datetime

[102]:
# filter by file extension
for s3path in s3dir.iter_objects().filter(S3Path.ext == ".json"):
    print(s3path)
S3Path('s3://s3pathlib/list-objects/folder/data1.json')
S3Path('s3://s3pathlib/list-objects/folder/data2.json')
[118]:
# filter by file size
for s3path in s3dir.iter_objects().filter(S3Path.size >= 1000):
    print(s3path, s3path.size)
S3Path('s3://s3pathlib/list-objects/folder/logs/day1.txt') 12000
S3Path('s3://s3pathlib/list-objects/folder/logs/day2.txt') 10000
S3Path('s3://s3pathlib/list-objects/logo.png') 8000

Filtering by Comparator#

A comparator is simply a helper that constructs the filtering criteria for you.

[117]:
for s3path in s3dir.iter_objects().filter(S3Path.size.between(1_000, 1_000_000)):
    print(s3path, s3path.size)
S3Path('s3://s3pathlib/list-objects/folder/logs/day1.txt') 12000
S3Path('s3://s3pathlib/list-objects/folder/logs/day2.txt') 10000
S3Path('s3://s3pathlib/list-objects/logo.png') 8000
[121]:
for s3path in s3dir.iter_objects().filter(S3Path.basename.startswith("data")):
    print(s3path)
S3Path('s3://s3pathlib/list-objects/folder/data1.json')
S3Path('s3://s3pathlib/list-objects/folder/data2.json')
[122]:
for s3path in s3dir.iter_objects().filter(S3Path.abspath.contains("subfolder")):
    print(s3path)
S3Path('s3://s3pathlib/list-objects/folder/subfolder/config.ini')

Logical Operator#

If you want to use multiple criteria, the filter() method takes multiple positional arguments and joins them with logical AND automatically.

[108]:
for s3path in s3dir.iter_objects().filter(S3Path.ext == ".txt", S3Path.size >= 1000):
    print(s3path)
S3Path('s3://s3pathlib/list-objects/folder/logs/day1.txt')
S3Path('s3://s3pathlib/list-objects/folder/logs/day2.txt')

The filter() method can also be chained; all chained filters are joined with logical AND. Filters are evaluated lazily, so the criteria are applied as the data is returned.

[110]:
for s3path in (
    s3dir.iter_objects()
        .filter(S3Path.ext == ".txt")
        .filter(S3Path.size >= 1000)
):
    print(s3path)
S3Path('s3://s3pathlib/list-objects/folder/logs/day1.txt')
S3Path('s3://s3pathlib/list-objects/folder/logs/day2.txt')

The and_, or_, and not_ helper functions can be used to define more complicated filtering logic.

[113]:
from s3pathlib import and_, or_, not_

for s3path in s3dir.iter_objects().filter(not_(or_(S3Path.ext == ".txt", S3Path.ext == ".png"))):
    print(s3path)
S3Path('s3://s3pathlib/list-objects/folder/data1.json')
S3Path('s3://s3pathlib/list-objects/folder/data2.json')
S3Path('s3://s3pathlib/list-objects/folder/subfolder/config.ini')

Custom Filter Function#

You can define your own custom filter function. A filter function is simply a callable that takes only one argument, an S3Path object, and returns a boolean value indicating whether to keep this object. If it returns False, the S3Path object will not be yielded. You can define arbitrary criteria in your filter function.

[114]:
# keep objects whose size in bytes is an odd number
def size_is_odd(s3path: S3Path) -> bool:
    return s3path.size % 2 == 1

for s3path in s3dir.iter_objects().filter(size_is_odd):
    print(s3path, s3path.size)
S3Path('s3://s3pathlib/list-objects/folder/data1.json') 17
S3Path('s3://s3pathlib/list-objects/folder/data2.json') 15
S3Path('s3://s3pathlib/list-objects/folder/subfolder/config.ini') 21

Note

When using filter() together with the limit argument, the iterator first yields up to limit items, and the filter is then applied to those results. Consequently, the final number of matched items is usually SMALLER than limit.
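
For example, with the test folder above (a hedged sketch; the exact count depends on the listing order):

# limit=3 yields the first 3 objects, THEN the filter is applied to those 3,
# so fewer than 3 matched items may come back
matched = s3dir.iter_objects(limit=3).filter(S3Path.ext == ".json").all()
print(len(matched))  # with the listing order shown earlier, this should print 2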

Iter Directory#

The iter_objects() method only returns objects. If you want to iterate over folders and objects together, you can use the iterdir() method.

[126]:
s3dir.iterdir().all()
[126]:
[S3Path('s3://s3pathlib/list-objects/folder/'),
 S3Path('s3://s3pathlib/list-objects/README.txt'),
 S3Path('s3://s3pathlib/list-objects/logo.png')]

List Object Versions#

On a versioning-enabled bucket, you can list all versions of an object by calling the list_object_versions() method. It returns versions in reverse chronological order. The is_delete_marker() method can be used to check whether a version is a delete marker.

[13]:
# First, let's prepare some test data
import time

s3path = S3Path("s3pathlib-versioning-enabled/file.txt")
s3path.write_text("v1")
time.sleep(1)
s3path.write_text("v2")
time.sleep(1)
s3path.delete()
time.sleep(1)
s3path.write_text("v3")
time.sleep(1)
s3path.write_text("v4")
time.sleep(1)
s3path.delete()
time.sleep(1)
s3path.write_text("v5")
time.sleep(1)
[16]:
for s3path_versioned in s3path.list_object_versions():
    version_id = s3path_versioned.version_id
    is_delete_marker = s3path_versioned.is_delete_marker()
    try:
        content = s3path_versioned.read_text(version_id=version_id)
    except Exception as e:
        content = "it's a delete marker"
    print(f"version_id = {version_id}, is_delete_marker = {is_delete_marker}, content = {content}")
version_id = Fh4t9N2vUelLUa8Z8gSSJSDyKTVOMA19, is_delete_marker = False, content = v5
version_id = Hij7d8MKHqv_RaOimyHHTP3IbX9Dpcp8, is_delete_marker = True, content = it's a delete marker
version_id = 29rFPVkoeNj_SbW28yARvZ9rSps1Lr6P, is_delete_marker = False, content = v4
version_id = wJwrSJy36Wa5vJr4DhBw5T_MuQ3hMvMf, is_delete_marker = False, content = v3
version_id = 8ARcxb.AOTVBKPMjBXzhYX9uk5krGrIV, is_delete_marker = True, content = it's a delete marker
version_id = 36HjB5gG7oBzuS7Iiu7GixMyPKzkIEog, is_delete_marker = False, content = v2
version_id = nm2Zw.jv2yGywNmCVsR5wodrHwfPwTUX, is_delete_marker = False, content = v1

What’s Next#

Now that we’ve walked through some examples of the S3 Read API, let’s move on to the next section to learn how to use the S3 Write API.
