You think you know AWS S3?

My S3 study notes

The information in this post was pulled together last year when I was studying for an AWS certification. I thought that my knowledge of S3 was really good, and figured that I could pretty much ignore it for the exam and focus on the AWS services that I didn’t have much experience with. Man, was I wrong. This is likely just part 1 of a multi-part series on S3, because I could easily do a bunch of extra posts describing in detail how some of the parts work (eg: CORS, Server-Side Encryption, or running a static website on S3).

S3 Basics

Buckets

  • Bucket names must be globally unique
  • Bucket names can’t contain uppercase or underscores
  • Bucket names must start and end with a lowercase letter or number
  • Bucket names must be between 3-63 characters long
  • Buckets exist within a specific AWS region and can’t be moved; to “migrate” a bucket you create a new one and copy the objects across
  • Buckets are strongly consistent! (Since December 2020)
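
To make the region point concrete, here’s a minimal boto3 sketch of creating a bucket in a specific region (the bucket name and region are placeholders):

import boto3

# Buckets are created in a specific region; outside us-east-1 you must
# pass the region explicitly as a LocationConstraint.
s3 = boto3.client("s3", region_name="eu-west-1")

s3.create_bucket(
    Bucket="my-unique-bucket-name",  # must be globally unique
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)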

Objects

The data that resides in a bucket is known as objects. It is an object store, after all! Objects are composed of a key and a value:

  • Key = full path to the file relative to the bucket, eg: s3://my-bucket-name/engineering/sysops/filename.txt
  • The key provides quasi-support for directory names, using slashes in the object key name
  • Max object size is 5TB
  • Max size for a single upload (PUT) is 5GB (for larger objects you must use “multi-part upload”)
  • Objects support metadata, tags, versioning and server-side encryption

Versioning

  • Must be enabled on the bucket, not on objects within it
  • If you suspend versioning, existing versions are kept!
  • If an object has a version “null”, that means it was created before versioning was enabled
  • Deleting a versioned file just adds a “delete” marker which hides it (but you can still see it under “List versions”)
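
Versioning is a single bucket-level setting; here’s a boto3 sketch (the bucket name is a placeholder):

import boto3

s3 = boto3.client("s3")

# Turn versioning on for the whole bucket; use "Suspended" to turn it off
# (existing object versions are kept either way).
s3.put_bucket_versioning(
    Bucket="my-unique-bucket-name",
    VersioningConfiguration={"Status": "Enabled"},
)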

Server Side Encryption

There are different ways to use SSE. You can use HTTP headers when uploading data, you can force encryption on the bucket using a policy, or you can use S3’s “Default Encryption”. Let’s go through each one to get more detail…

Use HTTP headers

You can trigger SSE on individual file uploads by adding a specific HTTP header. There are three variants of SSE:

SSE-S3

  • Keys are managed by S3
  • Set http header "x-amz-server-side-encryption": "AES256"

SSE-KMS

  • Keys are managed by KMS
  • Set http header "x-amz-server-side-encryption": "aws:kms"
  • Specify the KMS key
  • Benefit: auditing and control over key access

SSE-C

  • You manage your own keys
  • Encryption key must be provided in HTTP headers for every request! (HTTPS is mandatory!)
  • Must be done via the AWS CLI or an SDK because you have to pass in the key (the AWS Console does not support it)
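
In practice the SDKs set these headers for you. A boto3 sketch of an SSE-S3 upload and an SSE-KMS upload (bucket, keys and the KMS key alias are placeholders):

import boto3

s3 = boto3.client("s3")

# SSE-S3: sends x-amz-server-side-encryption: AES256
s3.put_object(
    Bucket="my-unique-bucket-name",
    Key="engineering/sysops/filename.txt",
    Body=b"hello",
    ServerSideEncryption="AES256",
)

# SSE-KMS: sends x-amz-server-side-encryption: aws:kms plus the key to use
s3.put_object(
    Bucket="my-unique-bucket-name",
    Key="engineering/sysops/filename2.txt",
    Body=b"hello",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/my-kms-key",  # hypothetical key alias
)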

“Force encryption” bucket policy

You can apply a bucket policy that denies “PUT” API calls if encryption headers are not used.
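
Here’s a sketch of what such a policy can look like, applied with boto3 (the bucket name is a placeholder; the Null condition denies any PUT that arrives without the x-amz-server-side-encryption header at all):

import json
import boto3

s3 = boto3.client("s3")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyUnencryptedUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::my-unique-bucket-name/*",
            # Deny uploads that don't include the SSE header
            "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
        }
    ],
}

s3.put_bucket_policy(Bucket="my-unique-bucket-name", Policy=json.dumps(policy))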

Use S3 “Default Encryption”

This is a setting you can enable on your S3 bucket to ensure that all uploaded files are encrypted. When you enable it, you will be prompted for the key type (SSE-S3 or SSE-KMS). If you specify SSE-KMS, you must provide the key. The purpose of S3 default encryption isn’t to force a particular encryption configuration, but to ensure that all files will be encrypted even if you forget to specify it on upload. You are still free to set your own server-side encryption (eg: SSE-C) and it will be honored.
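
Default encryption is also just a bucket-level setting; a boto3 sketch using SSE-S3 (for SSE-KMS you would set "aws:kms" and add a KMSMasterKeyID):

import boto3

s3 = boto3.client("s3")

# Any object uploaded without its own encryption settings gets SSE-S3
s3.put_bucket_encryption(
    Bucket="my-unique-bucket-name",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    },
)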

Security

Securing S3 buckets is extremely important. There have been many high-profile security breaches that began with an S3 bucket that was accidentally made public. Don’t let this happen to you!

User based (IAM policies)

You can grant access to an S3 bucket by attaching a policy to a user or group.

Resource based

You can control access to an S3 bucket by attaching a policy to the S3 bucket.

  • Bucket/Object Policies
    • Look like IAM policies
    • Can be applied to buckets or objects
    • Can force encryption, grant access
  • Object ACL
  • Bucket ACL

MFA-Delete

Additional security can be used to prevent accidental or unauthorized object deletion:

  • You must enable versioning on the S3 bucket for MFA-Delete to work
  • Once enabled, you must use MFA to suspend versioning or delete an object version
  • You must be root to enable/disable MFA-Delete (even an admin can’t do it)
  • You can only configure MFA-Delete using the CLI or API (not the Console)

Block Public Access

Introduced after some high-profile data leaks, this is a configuration option you can apply to a bucket to guarantee that it can never be made public, regardless of ACLs or bucket policies. It can also be applied at the account level.
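
A boto3 sketch of enabling all four block-public-access flags on a bucket (the same settings exist at the account level):

import boto3

s3 = boto3.client("s3")

s3.put_public_access_block(
    Bucket="my-unique-bucket-name",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)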

Pre-signed URL

You can generate time-limited URLs with embedded credentials that allow anonymous users to access S3 objects. They can be used to upload or download. To create a pre-signed URL, you can use the SDK or CLI, eg: aws s3 presign s3://mybucket/file.jpg --region eu-west-1 --expires-in 300

  • By default, the timeout on a pre-signed URL is 3600 seconds
  • The URL inherits the permissions of the IAM entity that generated it!
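
The SDK equivalent of the CLI command above, as a boto3 sketch (bucket and key are placeholders):

import boto3

s3 = boto3.client("s3")

# Download URL valid for 5 minutes; access rights come from whoever signed it
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "mybucket", "Key": "file.jpg"},
    ExpiresIn=300,
)
print(url)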

VPC Endpoint

To ensure that a bucket can only be used from within your own VPC, you can create an S3 “VPC endpoint” inside your VPC, and add the condition aws:SourceVpce or aws:SourceVpc to the bucket policy.

Access Points

If your S3 bucket policy is getting too large to manage, you can create access points:

  • Access point policies allow specific IAM users or groups
  • Each Access Point has its own DNS name
  • Access points can be restricted to VPC access only

NOTE: you should block direct access to the S3 bucket with a bucket policy condition that only allows requests made through an access point

AI-driven DLP with Macie

Macie is a fully managed service provided by AWS to monitor your S3 buckets for permissive policies and unencrypted buckets. It provides you with an inventory of your S3 buckets, and uses machine learning and pattern matching to identify a variety of sensitive data types (eg: credit card numbers, social security numbers).

S3 Static Websites

You can publish a static website directly from S3! To do this:

  • Enable “static website hosting” in the console
  • Disable “block public access”
  • Add bucket policy to allow public access
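
A boto3 sketch of the hosting step (the index and error documents are placeholders; the public-access and bucket-policy steps still apply):

import boto3

s3 = boto3.client("s3")

s3.put_bucket_website(
    Bucket="my-unique-bucket-name",
    WebsiteConfiguration={
        "IndexDocument": {"Suffix": "index.html"},
        "ErrorDocument": {"Key": "error.html"},
    },
)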

CORS

Cross Origin Resource Sharing is a header-based mechanism that allows an origin (eg: https://myimages.com:443) to indicate which other origins (eg: http://leecher.com:80) are allowed to load resources from it. For a normal website, you can allow it by adding the following CORS HTTP headers:

Access-Control-Allow-Origin: http://leecher.com
Access-Control-Allow-Methods: GET

S3 uses a JSON document to configure CORS. Here is an example configuration:

[
    {
        "AllowedHeaders": [
            "Authorization"
        ],
        "AllowedMethods": [
            "GET"
        ],
        "AllowedOrigins": [
            "http://leecher.com"
        ],
        "ExposeHeaders": [],
        "NoAgeSeconds": 3000
    }
]
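
The same rules can also be applied programmatically; a boto3 sketch mirroring the JSON above (the bucket name is a placeholder):

import boto3

s3 = boto3.client("s3")

s3.put_bucket_cors(
    Bucket="my-unique-bucket-name",
    CORSConfiguration={
        "CORSRules": [
            {
                "AllowedHeaders": ["Authorization"],
                "AllowedMethods": ["GET"],
                "AllowedOrigins": ["http://leecher.com"],
                "ExposeHeaders": [],
                "MaxAgeSeconds": 3000,
            }
        ]
    },
)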

CloudFront access to S3

OAI (Origin Access Identity)

To restrict access to S3 content so that it can only be reached via CloudFront, using an origin access identity:

  1. Create a special CloudFront user (an ‘origin access identity’) and associate it with your distribution
  2. Configure your S3 bucket permission so CloudFront can use the OAI to access files in the bucket
  3. Make sure users can’t access the bucket directly

Custom origin + custom headers

If you’re serving the bucket as an S3 static website through CloudFront, you cannot use OAI! Instead:

  1. Origin Custom Headers: Configure CloudFront to forward custom headers to your origin.
  2. Viewer Protocol Policy: Configure your distribution to require viewers to use HTTPS to access CloudFront.
  3. Origin Protocol Policy: Configure your distribution to require CloudFront to use the same protocol as viewers to forward requests to the origin.
  4. Update your application to only accept requests that include custom headers you configured in CloudFront

Server Access Logging

You can track requests to a bucket using S3 server access logging, which writes log files to another bucket of your choice (CloudTrail data events are a separate option). The logging itself is free, but you have to pay for log storage. WARNING: Never save the logs to the bucket you’re monitoring! This will create an expensive feedback loop!
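
A boto3 sketch of turning on server access logging into a separate logging bucket (bucket names and prefix are placeholders; the target bucket also needs permissions that allow the S3 log delivery service to write to it):

import boto3

s3 = boto3.client("s3")

# Deliver access logs to a *different* bucket to avoid the feedback loop
s3.put_bucket_logging(
    Bucket="my-unique-bucket-name",
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "my-logging-bucket",
            "TargetPrefix": "access-logs/my-unique-bucket-name/",
        }
    },
)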

S3 Replication

You can replicate objects between different S3 buckets.

  • Requires S3 bucket versioning
  • You can limit replication rules to a scope (eg: a prefix), or apply them to the whole bucket
  • S3 can create an IAM role for you automatically (see service-linked roles)
  • After activation, it will only copy new changes! It’s not retroactive
  • No replication chaining, eg:
    • Replication: bucket1 > bucket2
    • Replication: bucket2 > bucket3
    • Writing an object to bucket1 will only replicate it to bucket2! It will not appear in bucket3!

S3 Inventory

  • List objects and their metadata
  • Useful for auditing replication or encryption status (compliance)
  • Output CSV, ORC or Apache Parquet
  • Query the data using Athena, Redshift, Presto, Hive, Spark
  • Filter the report using S3 Select

Storage Classes / Tiers

Standard (11 nines durability)

  • S3 General purpose
  • S3 Infrequent Access, retrieval cost per GB!
  • S3 One Zone Infrequent Access, retrieval cost per GB!

Intelligent Tiering

  • Monthly fee for monitoring and auto-tiering
  • Automatically moves objects between S3 General Purpose and Infrequent Access

S3 Analytics

As an alternative to Intelligent Tiering, you can set up S3 Analytics to help you determine when to transition objects between S3 Standard and Standard IA (doesn’t work for the other tiers). It generates a daily report, and it provides information helpful for creating Lifecycle Rules.

Glacier

  • Minimum storage duration 90 days
  • Retrieval links have an expiration date!
  • Items are not called objects, but archives
  • Does not use buckets, but vaults
  • 3 retrieval options:
    • Expedited (up to 5 minutes)
    • Standard (3 to 5 hours)
    • Bulk (5 to 12 hours)

Glacier Deep Archive

  • Minimum storage duration 180 days
  • 2 retrieval options:
    • Standard (12 hours)
    • Bulk (48 hours)

Glacier Vault Locks

Lets you create locks on vaults, and they are immutable once set up. A lock could be configured, for example, to deny deletion of archives less than 365 days old. Data stored in a Glacier vault under a lock will be retained even if you stop paying your bill. This is the strongest data protection offered by AWS.

Lifecycle Rules

You can set up rules to move objects between tiers after defined periods, eg:

  • Automatically move objects from Standard to Standard IA after 30 days
  • Move objects from Standard IA to Glacier after 60 days
  • Permanently delete objects after 90 days
  • Delete incomplete multi-part uploads left behind by failed uploads
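
A boto3 sketch combining those examples into a single lifecycle rule (the day counts and bucket name are placeholders):

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-unique-bucket-name",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # applies to the whole bucket
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 60, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 90},
                # Clean up parts left behind by failed multi-part uploads
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)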

Performance

Requests per second

  • 3,500 write (PUT/COPY/POST/DELETE) requests per second per prefix
  • 5,500 read (GET/HEAD) requests per second per prefix

By spreading requests across 4 prefixes, you can achieve 22k read requests per second! (NOTE: A prefix is a path within a bucket, eg: s3://mybucket/prefix/one/file.txt and s3://mybucket/prefix/two/file.txt use different prefixes)

KMS limitation

If you’re using SSE-KMS, every upload and download also calls KMS, so you can hit the KMS request limits (check your service quotas - it could be as low as 5,500 requests per second)

Multi-part upload

This is essential for files over 5GB, but it can help for any file over 100MB by uploading in parallel!
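
boto3’s high-level transfer helper handles the multi-part logic for you; a sketch that lowers the threshold to 100MB (the file, bucket and key are placeholders):

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Files above 100MB are split into 100MB parts and uploaded in parallel
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,
    multipart_chunksize=100 * 1024 * 1024,
    max_concurrency=8,
)

s3.upload_file("backup.tar.gz", "my-unique-bucket-name", "backups/backup.tar.gz", Config=config)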

S3 Transfer Acceleration

This uses edge locations to increase upload or download performance. It’s compatible with multi-part upload.

Byte-range fetches

Speed up downloads by requesting specific sections of a file. You can do this if you only want the head of a file, but you can also split a whole file into multiple ranges and request all of them in parallel!
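
A boto3 sketch fetching only the first 1MB of an object via the HTTP Range header (bucket and key are placeholders):

import boto3

s3 = boto3.client("s3")

# Ask for bytes 0-1048575 only; issue several ranges in parallel for big files
resp = s3.get_object(
    Bucket="my-unique-bucket-name",
    Key="backups/backup.tar.gz",
    Range="bytes=0-1048575",
)
first_chunk = resp["Body"].read()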

Batch Operations

  • Lets you define jobs that perform bulk operations on S3 objects, for example:
    • Modify metadata and properties
    • Copy objects between buckets
    • Modify ACLs
    • Invoke Lambda function
    • Encrypt files
  • A job consists of a list of objects (you can use an S3 Inventory report), an action, and optional parameters.
  • The S3 Batch Operations manager tracks progress, retries failed operations, sends notifications and generates reports.

Searching S3 buckets

S3 Select / Glacier Select

This lets you run queries on individual files in S3 (eg: a large .csv) without pre-indexing

  • Takes advantage of server side filtering
  • Much cheaper than downloading the file to search
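
A boto3 sketch of S3 Select running a SQL filter over a single CSV object (bucket, key and query are placeholders):

import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-unique-bucket-name",
    Key="logs/access.csv",
    ExpressionType="SQL",
    Expression="SELECT s.ip, s.status FROM s3object s WHERE s.status = '404'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The response is an event stream; Records events carry the matching rows
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())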

Athena

Lets you create a table schema to index S3 buckets and search your data using SQL

S3 Event Notifications

  • Notify on events like:
    • S3:ObjectCreated
    • S3:ObjectRemoved
    • S3:ObjectRestore
    • S3:Replication
  • Notification targets:
    • SNS
    • SQS
    • Lambda function
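
A boto3 sketch wiring ObjectCreated events to a Lambda function (the ARN is a placeholder, and the function must already grant S3 permission to invoke it):

import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="my-unique-bucket-name",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:eu-west-1:123456789012:function:my-handler",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)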