Using S3 as a static data API with CloudFront
About three years ago, I built a tiny app that I use a lot. It tells me which books on my Goodreads to-read shelf and on a selection of book awards lists are available for digital loan at my local library. It does this with web scraping and API reverse-engineering because Goodreads shut their API down, and the library catalogue API is not public 🙁.
When I built the original app, I had a bunch of functionality that needed a back end I could write to, so I used Firebase. Since then, though, I’ve retired the features that needed to write data, and all I do now is serve up static data that has been prepared and stored in advance by AWS Batch jobs. I wanted to rewrite the whole stack to take advantage of the patterns I’ve created in the years since I wrote the original, and I’ve always wanted to try serving up JSON from S3 as if it were an API. S3 with CloudFront is very snappy and extremely cheap!
I had a couple of requirements for the architecture in mind:
- I wanted to use my existing secure HTTP certificates and Route 53 config with CloudFront to serve the data from HTTP endpoints
- I wanted to avoid having to do slow CloudFront cache invalidations when the content of the files to serve changes
One of the patterns to avoid cache invalidations that AWS recommends is to rename the S3 objects every time a new version is written. That seemed fine, but I needed a way to serve up an index of these so that my front end web client knew which files to ask for!
My data structure is pretty simple:
- Each list or shelf from Goodreads is described in a file that gives the basic book metadata (author, title, cover image URL, list URL)
- The scraping results for each list or shelf are written to a JSON file per library that contains metadata on which editions are available at that library (format: eBook or eAudiobook, library URL)
I reimplemented the original scraper to generate a unique suffix for each output file and write this to an S3 bucket.
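As a rough illustration of the unique-suffix idea, here is a minimal sketch. The function names and the "<prefix>/<name>.<suffix>.json" key layout are assumptions made up for this example, not the scraper's real naming scheme.

```javascript
// Hypothetical helpers for naming scraper output files.
// The key layout "<prefix>/<name>.<suffix>.json" is an assumption.

function uniqueSuffix(now = new Date()) {
  // Compact UTC timestamp (e.g. 20240131T093015) keeps keys sortable by time.
  const stamp = now.toISOString().replace(/[-:]/g, '').slice(0, 15);
  // Six random base-36 characters guard against two runs in the same second.
  const random = Math.random().toString(36).slice(2, 8).padEnd(6, '0');
  return `${stamp}-${random}`;
}

function versionedKey(prefix, name, suffix = uniqueSuffix()) {
  return `${prefix}/${name}.${suffix}.json`;
}

// Every call produces a new object name, so CloudFront never holds a stale copy.
console.log(versionedKey('lists', 'to-read'));
```

Because each run writes to a brand new key, the previously cached objects simply stop being requested and no invalidation is needed.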
To create the index, I used Lambda@Edge. I wrote about using this a while back when I found it was a great way to generate social media sharing previews when server-side rendering isn’t available. The approach I decided on was to have a Lambda@Edge function enumerate the files in the S3 bucket and produce a JSON file that describes both the list or shelf each file came from and its unique key name.
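The index-building step might look something like the sketch below. It is not the real function: the key layout, the shape of the index, and the name buildIndex are all assumptions for illustration. The input mirrors the Contents array that S3's ListObjectsV2 returns (each entry has a Key and a LastModified date), so the listing call itself is left out.

```javascript
// Sketch only: the key layout and the index shape below are assumptions.
//   lists/<list>.<suffix>.json              book metadata for a list or shelf
//   results/<list>/<library>.<suffix>.json  availability at one library
const KEY_PATTERN = /^(lists|results)\/(.+)\.([^./]+)\.json$/;

// `objects` mirrors the Contents array from S3 ListObjectsV2.
function buildIndex(objects) {
  // Keep only the newest object for each logical name (the key minus its suffix).
  const newest = new Map();
  for (const { Key, LastModified } of objects) {
    const match = KEY_PATTERN.exec(Key);
    if (!match) continue; // ignore anything that is not a data file
    const logicalName = `${match[1]}/${match[2]}`;
    const current = newest.get(logicalName);
    if (!current || LastModified > current.LastModified) {
      newest.set(logicalName, { Key, LastModified });
    }
  }

  // Group the surviving keys by the list or shelf they belong to.
  const index = { lists: {} };
  const entryFor = (list) =>
    (index.lists[list] ??= { metadata: null, libraries: {} });

  for (const [logicalName, { Key }] of newest) {
    const [kind, list, library] = logicalName.split('/');
    if (kind === 'lists') entryFor(list).metadata = Key;
    else if (library) entryFor(list).libraries[library] = Key;
  }
  return index;
}

const sample = [
  { Key: 'lists/to-read.aaa.json', LastModified: new Date('2024-01-01') },
  { Key: 'lists/to-read.bbb.json', LastModified: new Date('2024-02-01') },
  { Key: 'results/to-read/my-library.bbb.json', LastModified: new Date('2024-02-01') },
];
console.log(JSON.stringify(buildIndex(sample), null, 2));
```

Picking the newest object per logical name means old versions can linger in the bucket without confusing the client.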
I discovered that CloudFront applies caching behaviours to Lambda@Edge responses. I separated the behaviour for my index from the default S3 serve so that I could use a no-cache policy for the index and ensure it was always returning the most recent set of uniquely named objects.
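For the index behaviour, the function answers the request itself rather than letting S3 respond. A minimal sketch of that, assuming an origin-request trigger: the names toEdgeResponse, makeHandler and loadIndex are hypothetical, and loadIndex stands in for whatever lists the bucket and builds the index. Note that Lambda@Edge uses CloudFront's own response shape, not the statusCode-style object a regular Lambda proxy integration returns.

```javascript
// Sketch of a generated response for an origin-request Lambda@Edge function.
// toEdgeResponse, makeHandler and loadIndex are hypothetical names.

function toEdgeResponse(index) {
  return {
    status: '200', // a string here, unlike the numeric statusCode elsewhere
    statusDescription: 'OK',
    headers: {
      // Keys are lower-case header names; values are arrays of { key, value }.
      'content-type': [{ key: 'Content-Type', value: 'application/json' }],
    },
    body: JSON.stringify(index),
  };
}

// `loadIndex` is an async function that returns the index object.
function makeHandler(loadIndex) {
  return async function handler(event) {
    const request = event.Records[0].cf.request;
    // Defensive: anything other than the index is passed through to S3.
    if (request.uri !== '/index.json') return request;
    return toEdgeResponse(await loadIndex());
  };
}
```

Returning the request object unchanged tells CloudFront to carry on to the origin, while returning a response object short-circuits it.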
I used CloudFormation to create the Lambda and the CloudFront distribution. ChatGPT was very helpful in writing CloudFormation YAML files for me, so I didn’t have to spend hours of trial and error finding the right syntax from the reference docs. I had hoped to use inline code in the CloudFormation template for my Lambda, but that turned out to be a problem thanks to one of the gotchas that I discovered…
CloudFront, CloudFormation, and Lambda@Edge tips and gotchas
Lambda@Edge functions have to be deployed to us-east-1
The functions are replicated out over the content delivery network from Virginia. That’s OK, but my S3 bucket was in Sydney and…
A regular CloudFormation template can only apply to one region
I wanted to deploy my S3 bucket in ap-southeast-2, but I couldn’t create a template for both the Lambda for Virginia and the rest of the infrastructure in Sydney. There are some patterns that can help, but they are complex. I went with one template for the Lambda and one for the rest with an annoying manual copy and paste of the Lambda ARN from one to the other.
The latest Node runtimes use AWS v3 SDK and ES Modules need a .mjs entry point
I used Node 20, which makes v3 of the AWS SDK available instead of v2 (ChatGPT assumes v2 will be available and has to be encouraged to modernise). I wrote the function as an ES Module with import statements, in the style of the v3 SDK examples.
Nothing wrong with that, BUT… I couldn’t find a way to create inline code for a Lambda in a CloudFormation template with an entry point that has a .mjs extension. Inline code is saved as a .js file, which Lambda treats as CommonJS, so the import statements fail in Node 20. The solution is to upload the code for the Lambda from the AWS CLI or put it in S3, but it’s a pity, because it’s kind of cool to embed the code in the template when it’s such a trivial function.
CORS can be added with a CloudFront response headers policy
CloudFront has response header policies available that will add CORS headers on the fly.
The latest versions of CloudFront cache and response policies have to be referenced by UUID
When ChatGPT inserted UUIDs I thought it was dreaming (next-level hallucinating) but it seems like it’s the only way to make it work. E.g. the policy to turn off caching looks like this:
CachePolicyId: 4135ea2d-6df8-44a3-9df3-4b5a84be39ad # Managed-CachingDisabled
But “how do you find the UUID?” you ask. The “View policy” links in the CloudFront behaviour console are the easiest way I could find.
ChatGPT generated some obsolete policy UUIDs that failed, so I recommend checking the UUIDs manually using the console.
The CloudFormation templates
Once I’d got past the gotchas and some quirky ChatGPT hallucinations (e.g. Lambda@Edge requires a specific return format for the headers, and ChatGPT was convinced that I should use the regular Lambda format) I had these working:
# Template 1: Lambda@Edge function and its role (deployed to us-east-1)
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: LambdaEdgeRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
                - edgelambda.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: LambdaEdgePolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource: arn:aws:logs:*:*:*
              - Effect: Allow
                Action:
                  - s3:ListBucket
                  - s3:GetObject
                Resource:
                  - arn:aws:s3:::<MY S3 BUCKET NAME>
                  - arn:aws:s3:::<MY S3 BUCKET NAME>/*
  LambdaFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: wcibIndex
      Handler: index.handler
      Role: !GetAtt LambdaExecutionRole.Arn
      Code:
        ZipFile: " "
      Runtime: nodejs20.x
      Timeout: 30
Outputs:
  LambdaFunctionArn:
    Description: "ARN of the Lambda function"
    Value: !GetAtt LambdaFunction.Arn
# Template 2: CloudFront distribution, S3 bucket and DNS record (deployed to ap-southeast-2)
AWSTemplateFormatVersion: '2010-09-09'
Parameters:
  LambdaFunctionArn:
    Type: String
    Description: The ARN of the Lambda@Edge function deployed in us-east-1
Resources:
  CloudFrontOriginAccessIdentity:
    Type: AWS::CloudFront::CloudFrontOriginAccessIdentity
    Properties:
      CloudFrontOriginAccessIdentityConfig:
        Comment: Access identity
  CloudFrontDistribution:
    Type: AWS::CloudFront::Distribution
    Properties:
      DistributionConfig:
        Origins:
          - Id: wcibS3Origin
            DomainName: !Sub ${S3Bucket}.s3.ap-southeast-2.amazonaws.com
            S3OriginConfig:
              OriginAccessIdentity: !Sub origin-access-identity/cloudfront/${CloudFrontOriginAccessIdentity}
        Enabled: true
        DefaultRootObject: index.html
        DefaultCacheBehavior:
          TargetOriginId: wcibS3Origin
          ViewerProtocolPolicy: redirect-to-https
          AllowedMethods:
            - GET
            - HEAD
            - OPTIONS
          CachedMethods:
            - GET
            - HEAD
            - OPTIONS
          CachePolicyId: 658327ea-f89d-4fab-a63d-7e88639e58f6 # Managed-CachingOptimized
          OriginRequestPolicyId: 88a5eaf4-2fd4-4709-b370-b4c650ea3fcf # Managed-CORS-S3Origin
          ResponseHeadersPolicyId: 5cc3b908-e619-4b99-88e5-2cf7f45965bd # Managed-CORS-With-Preflight
        CacheBehaviors:
          - PathPattern: /index.json
            TargetOriginId: wcibS3Origin
            ViewerProtocolPolicy: redirect-to-https
            AllowedMethods:
              - GET
              - HEAD
              - OPTIONS
            CachedMethods:
              - GET
              - HEAD
              - OPTIONS
            CachePolicyId: 4135ea2d-6df8-44a3-9df3-4b5a84be39ad # Managed-CachingDisabled
            OriginRequestPolicyId: 88a5eaf4-2fd4-4709-b370-b4c650ea3fcf # Managed-CORS-S3Origin
            ResponseHeadersPolicyId: 5cc3b908-e619-4b99-88e5-2cf7f45965bd # Managed-CORS-With-Preflight
            LambdaFunctionAssociations:
              - EventType: origin-request
                LambdaFunctionARN: !Ref LambdaFunctionArn
        ViewerCertificate:
          AcmCertificateArn: <MY AWS Certification Manager certificate>
          SslSupportMethod: sni-only
        Aliases:
          - <MY DOMAIN NAME>
        HttpVersion: http2
  S3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: <MY BUCKET NAME>
      CorsConfiguration:
        CorsRules:
          - AllowedOrigins:
              - '*'
            AllowedMethods:
              - GET
              - HEAD
            AllowedHeaders:
              - '*'
            MaxAge: 3000
  S3BucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      Bucket: !Ref S3Bucket
      PolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              CanonicalUser: !GetAtt CloudFrontOriginAccessIdentity.S3CanonicalUserId
            Action: s3:GetObject
            Resource: arn:aws:s3:::<MY BUCKET NAME>/*
  Route53RecordSet:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: <MY HOSTED ZONE>
      Name: <MY DOMAIN NAME>
      Type: A
      AliasTarget:
        DNSName: !GetAtt CloudFrontDistribution.DomainName
        HostedZoneId: <MY HOSTED ZONE> # CloudFront Hosted Zone ID
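The Lambda template will spit out the ARN for the function. This is supplied to the second template with a published version number appended, because Lambda@Edge needs a numbered version of the function, e.g.: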
ARN="arn:aws:lambda:us-east-1:<MY ACCOUNT>:function:wcibIndex:<PUBLISHED VERSION>"
aws cloudformation update-stack \
  --stack-name wcib-content \
  --template-body file://cloudfront.yaml \
  --parameters "ParameterKey=LambdaFunctionArn,ParameterValue=$ARN" \
  --capabilities CAPABILITY_NAMED_IAM \
  --region ap-southeast-2
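Once the distribution is up, the front end's side of the pattern could be sketched roughly as follows. This is an illustration under assumptions: the function name loadList, the index shape (a lists object whose entries hold a metadata key and a libraries map of keys), and the base URL are all hypothetical.

```javascript
// Hypothetical client. The index shape assumed here is:
//   { lists: { <list>: { metadata: <key>, libraries: { <library>: <key> } } } }
// `fetchImpl` defaults to the global fetch and can be swapped out in tests.
async function loadList(baseUrl, listName, fetchImpl = fetch) {
  const getJson = async (path) => {
    const response = await fetchImpl(new URL(path, baseUrl));
    if (!response.ok) throw new Error(`GET ${path} failed: ${response.status}`);
    return response.json();
  };

  // index.json is served with caching disabled, so it names the newest objects.
  const index = await getJson('/index.json');
  const entry = index.lists[listName];
  if (!entry) throw new Error(`Unknown list: ${listName}`);

  // The uniquely named data files are safe to cache for as long as you like.
  const [books, ...availability] = await Promise.all([
    getJson(`/${entry.metadata}`),
    ...Object.values(entry.libraries).map((key) => getJson(`/${key}`)),
  ]);
  return { books, availability };
}
```

A call such as loadList('https://<MY DOMAIN NAME>', 'to-read') costs one uncached request for the index, and everything after that can come straight from the CloudFront cache.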
Conclusion
Achievement unlocked.
I now have a CloudFront distribution that provides endpoints that work just like the APIs I usually build in GCP Cloud Functions, except they are blazingly fast from a cold start!
Image credit: Paper icons created by Freepik — Flaticon