Using S3 as a static data API with CloudFront

6 min read · Jun 17, 2024

About three years ago, I built a tiny app that I use a lot. It tells me which books on my Goodreads to-read shelf and on a selection of book awards lists are available for digital loan at my local library. It does this with web scraping and API reverse-engineering because Goodreads shut their API down, and the library catalogue API is not public 🙁.

When I built the original app, I had a bunch of functionality that needed a back end I could write to, so I used Firebase. Since then though, I’ve retired the features that needed to write data, and all I do now is serve up static data that has been prepared and stored in advance by AWS Batch jobs. I wanted to rewrite the whole stack to take advantage of the patterns I’ve created in the years since I wrote the original, and I’ve always wanted to try serving up JSON from S3 as if it is an API. S3 with CloudFront is very snappy and extremely cheap!

I had a couple of requirements for the architecture in mind:

I wanted to use my existing HTTPS certificates and Route 53 config with CloudFront to serve the data from HTTPS endpoints
I wanted to avoid having to do slow CloudFront cache invalidations when the content of the files to serve changes

One of the patterns to avoid cache invalidations that AWS recommends is to rename the S3 objects every time a new version is written. That seemed fine, but I needed a way to serve up an index of these so that my front end web client knew which files to ask for!

My data structure is pretty simple:

Each list or shelf from Goodreads is described in a file that gives the basic book metadata (author, title, cover image URL, list URL)
The scraping results for each list or shelf are written to a JSON file per library that contains metadata on which editions are available at that library (format: eBook or eAudiobook, library URL)

I reimplemented the original scraper to generate a unique suffix for each output file and write the files to an S3 bucket.
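
As a rough illustration of the unique-suffix idea, here is a minimal sketch. The key layout (`<listId>/<libraryId>.<suffix>.json`), the function names, and the timestamp-plus-random suffix are all assumptions made for the example, not the scraper's actual scheme.

```javascript
// Hypothetical naming scheme: <listId>/<libraryId>.<suffix>.json
// The suffix changes on every run, so a new file never collides with a key
// that CloudFront has already cached.
function uniqueSuffix(now = Date.now()) {
  // Base-36 timestamp plus six random characters, in case two jobs
  // start within the same millisecond.
  const random = Math.random().toString(36).slice(2, 8).padEnd(6, "0");
  return `${now.toString(36)}-${random}`;
}

function outputKey(listId, libraryId, suffix = uniqueSuffix()) {
  return `${listId}/${libraryId}.${suffix}.json`;
}

// Prints a different key on every run.
console.log(outputKey("to-read", "my-library"));
```

Any scheme works as long as a key is never reused, because that is what makes invalidations unnecessary.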

To create the index, I used Lambda@Edge. I wrote about using this a while back when I found it was a great way to generate social media sharing previews when server-side rendering isn’t available. The approach I decided on was to have a Lambda@Edge function enumerate the files in the S3 bucket and produce a JSON document that describes, for each file, the list or shelf it came from and its unique key name.
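
The index-building step could look something like the sketch below. It assumes a hypothetical key layout of `<listId>/<libraryId>.<suffix>.json` and assumes that only the most recently modified file per list and library should be advertised. The original post does not spell out either detail, so treat both as placeholders.

```javascript
// Assumed key layout (hypothetical): <listId>/<libraryId>.<suffix>.json
const KEY_PATTERN = /^([^/]+)\/([^/.]+)\.([^/.]+)\.json$/;

// Turn a flat S3 listing into { listId: { libraryId: key } },
// keeping only the newest key for each list/library pair.
function buildIndex(objects) {
  const newest = new Map();
  for (const { key, lastModified } of objects) {
    const match = KEY_PATTERN.exec(key);
    if (!match) continue; // skip anything that is not a data file
    const [, listId, libraryId] = match;
    const id = `${listId}/${libraryId}`;
    const seen = newest.get(id);
    if (!seen || lastModified > seen.lastModified) {
      newest.set(id, { listId, libraryId, key, lastModified });
    }
  }

  const index = {};
  for (const { listId, libraryId, key } of newest.values()) {
    index[listId] = index[listId] || {};
    index[listId][libraryId] = key;
  }
  return index;
}
```

Keeping this as a pure function over the listing makes it easy to check locally before it goes anywhere near the edge.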

I discovered that CloudFront applies caching behaviours to Lambda@Edge responses. I separated the behaviour for my index from the default S3 serve so that I could use a no-cache policy for the index and ensure it was always returning the most recent set of uniquely named objects.
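
From the front end's point of view, the split between an uncached index and cacheable, uniquely named data files might be consumed roughly as follows. The base URL, the index shape (`{ listId: { libraryId: key } }`), and the function name are assumptions for the sake of the example.

```javascript
// Sketch of a client: fetch the always-fresh index, then the data file it names.
// fetchImpl is injectable so the function can be exercised without a network.
async function loadAvailability(baseUrl, listId, libraryId, fetchImpl = globalThis.fetch) {
  // index.json is served with caching disabled, so it reflects the latest keys.
  const indexResponse = await fetchImpl(`${baseUrl}/index.json`);
  if (!indexResponse.ok) {
    throw new Error(`index request failed: ${indexResponse.status}`);
  }
  const index = await indexResponse.json();

  const key = index[listId]?.[libraryId];
  if (!key) return null; // nothing published for this list/library pair

  // The data file has a unique name, so CloudFront can cache it for as long as it likes.
  const dataResponse = await fetchImpl(`${baseUrl}/${key}`);
  if (!dataResponse.ok) {
    throw new Error(`data request failed: ${dataResponse.status}`);
  }
  return dataResponse.json();
}

// Usage (hypothetical domain):
// loadAvailability("https://data.example.com", "to-read", "my-library").then(console.log);
```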

I used CloudFormation to create the Lambda and the CloudFront distribution. ChatGPT was very helpful in writing CloudFormation YAML files for me, so I didn’t have to spend hours of trial and error finding the right syntax from the reference docs. I had hoped to use inline code in the CloudFormation template for my Lambda, but that turned out to be a problem thanks to one of the gotchas that I discovered…

CloudFront, CloudFormation, and Lambda@Edge tips and gotchas

Lambda@Edge functions have to be deployed to us-east-1

The functions are replicated out over the content delivery network from Virginia. That’s OK, but my S3 bucket was in Sydney and…

A regular CloudFormation template can only apply to one region

I wanted to deploy my S3 bucket in ap-southeast-2, but I couldn’t create a template for both the Lambda for Virginia and the rest of the infrastructure in Sydney. There are some patterns that can help, but they are complex. I went with one template for the Lambda and one for the rest with an annoying manual copy and paste of the Lambda ARN from one to the other.

The latest Node runtimes use AWS v3 SDK and require ES Modules

I used Node 20, which makes v3 of the AWS SDK available (ChatGPT assumes v2 will be available and has to be encouraged to modernise). I wrote the function as an ES Module, which is the style the v3 SDK documentation uses.

Nothing wrong with that, BUT… I couldn’t find a way to create inline code for a Lambda in a CloudFormation template with an entry point that has a .mjs extension. Lambda treats a .js entry point as CommonJS, so the ES Module import statements fail. The solution is to upload the code for the Lambda from the AWS CLI or put it in S3, but it’s a pity, because it’s kind of cool to embed the code in the template when it’s such a trivial function.
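
For context, the kind of function in question might look like the sketch below. It is not the original code: the bucket name and region are hypothetical, the response simply lists raw keys, and the SDK is loaded with a dynamic `import()` so that the pure helper can be run without the SDK installed. It does show the Lambda@Edge response shape, which differs from the regular Lambda proxy format: the status is a string, and each header is a lower-case key mapping to an array of `{ key, value }` pairs.

```javascript
// Lambda@Edge functions cannot use environment variables, so configuration
// is hard-coded. Both values are hypothetical placeholders.
const BUCKET = "my-data-bucket";
const REGION = "ap-southeast-2";

// Build a response in the shape Lambda@Edge expects.
function edgeJsonResponse(payload) {
  return {
    status: "200",
    statusDescription: "OK",
    headers: {
      "content-type": [{ key: "Content-Type", value: "application/json" }],
    },
    body: JSON.stringify(payload),
  };
}

// List every key in the bucket, following continuation tokens.
async function listAllKeys(bucket) {
  // Loaded lazily so edgeJsonResponse can be tested without the SDK.
  const { S3Client, ListObjectsV2Command } = await import("@aws-sdk/client-s3");
  const s3 = new S3Client({ region: REGION });
  const keys = [];
  let ContinuationToken;
  do {
    const page = await s3.send(
      new ListObjectsV2Command({ Bucket: bucket, ContinuationToken })
    );
    for (const object of page.Contents ?? []) keys.push(object.Key);
    ContinuationToken = page.NextContinuationToken;
  } while (ContinuationToken);
  return keys;
}

// Origin-request handler: returning a response here means CloudFront
// does not forward the request to S3 for this path.
async function handler(event) {
  return edgeJsonResponse({ keys: await listAllKeys(BUCKET) });
}

// In index.mjs, finish with: export { handler };
```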

CORS can be added with a CloudFront response headers policy

CloudFront has response header policies available that will add CORS headers on the fly.

The latest versions of CloudFront cache and response headers policies have to be referenced by UUID

When ChatGPT inserted UUIDs I thought it was dreaming (next-level hallucinating), but it seems to be the only way to make it work. E.g. the policy to turn off caching looks like this:

CachePolicyId: 4135ea2d-6df8-44a3-9df3-4b5a84be39ad # Managed-CachingDisabled

But “how do you find the UUID?” you ask. The “View policy” links next to each policy in the CloudFront behaviour console are the easiest way I could find.

ChatGPT generated some obsolete policy UUIDs that failed, so I recommend checking the UUIDs manually using the console.

The CloudFormation templates

Once I’d got past the gotchas and some quirky ChatGPT hallucinations (e.g. Lambda@Edge requires a specific return format for the headers, and ChatGPT was convinced that I should use the regular Lambda format), I had these working:

The Lambda@Edge function template, deployed to us-east-1:

AWSTemplateFormatVersion: '2010-09-09'
Resources:
  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: LambdaEdgeRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
                - edgelambda.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: LambdaEdgePolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource: arn:aws:logs:*:*:*
              - Effect: Allow
                Action:
                  - s3:ListBucket
                  - s3:GetObject
                Resource:
                  - arn:aws:s3:::<MY S3 BUCKET NAME>
                  - arn:aws:s3:::<MY S3 BUCKET NAME>/*

  LambdaFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: wcibIndex
      Handler: index.handler
      Role: !GetAtt LambdaExecutionRole.Arn
      Code:
        ZipFile: " " # Placeholder; the real code is uploaded separately
      Runtime: nodejs20.x
      Timeout: 30

Outputs:
  LambdaFunctionArn:
    Description: "ARN of the Lambda function"
    Value: !GetAtt LambdaFunction.Arn

The CloudFront, S3 and Route 53 template, deployed to ap-southeast-2:

AWSTemplateFormatVersion: '2010-09-09'
Parameters:
  LambdaFunctionArn:
    Type: String
    Description: The ARN of the Lambda@Edge function deployed in us-east-1

Resources:
  CloudFrontOriginAccessIdentity:
    Type: AWS::CloudFront::CloudFrontOriginAccessIdentity
    Properties:
      CloudFrontOriginAccessIdentityConfig:
        Comment: Access identity

  CloudFrontDistribution:
    Type: AWS::CloudFront::Distribution
    Properties:
      DistributionConfig:
        Origins:
          - Id: wcibS3Origin
            DomainName: !Sub ${S3Bucket}.s3.ap-southeast-2.amazonaws.com
            S3OriginConfig:
              OriginAccessIdentity: !Sub origin-access-identity/cloudfront/${CloudFrontOriginAccessIdentity}
        Enabled: true
        DefaultRootObject: index.html
        DefaultCacheBehavior:
          TargetOriginId: wcibS3Origin
          ViewerProtocolPolicy: redirect-to-https
          AllowedMethods:
            - GET
            - HEAD
            - OPTIONS
          CachedMethods:
            - GET
            - HEAD
            - OPTIONS
          CachePolicyId: 658327ea-f89d-4fab-a63d-7e88639e58f6 # Managed-CachingOptimized
          OriginRequestPolicyId: 88a5eaf4-2fd4-4709-b370-b4c650ea3fcf # Managed-CORS-S3Origin
          ResponseHeadersPolicyId: 5cc3b908-e619-4b99-88e5-2cf7f45965bd # Managed-CORS-With-Preflight
        CacheBehaviors:
          - PathPattern: /index.json
            TargetOriginId: wcibS3Origin
            ViewerProtocolPolicy: redirect-to-https
            AllowedMethods:
              - GET
              - HEAD
              - OPTIONS
            CachedMethods:
              - GET
              - HEAD
              - OPTIONS
            CachePolicyId: 4135ea2d-6df8-44a3-9df3-4b5a84be39ad # Managed-CachingDisabled
            OriginRequestPolicyId: 88a5eaf4-2fd4-4709-b370-b4c650ea3fcf # Managed-CORS-S3Origin
            ResponseHeadersPolicyId: 5cc3b908-e619-4b99-88e5-2cf7f45965bd # Managed-CORS-With-Preflight
            LambdaFunctionAssociations:
              - EventType: origin-request
                LambdaFunctionARN: !Ref LambdaFunctionArn
        ViewerCertificate:
          AcmCertificateArn: <MY AWS Certificate Manager certificate ARN>
          SslSupportMethod: sni-only
        Aliases:
          - <MY DOMAIN NAME>
        HttpVersion: http2

  S3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: <MY BUCKET NAME>
      CorsConfiguration:
        CorsRules:
          - AllowedOrigins:
              - '*'
            AllowedMethods:
              - GET
              - HEAD
            AllowedHeaders:
              - '*'
            MaxAge: 3000

  S3BucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      Bucket: !Ref S3Bucket
      PolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              CanonicalUser: !GetAtt CloudFrontOriginAccessIdentity.S3CanonicalUserId
            Action: s3:GetObject
            Resource: arn:aws:s3:::<MY BUCKET NAME>/*

  Route53RecordSet:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: <MY HOSTED ZONE>
      Name: <MY DOMAIN NAME>
      Type: A
      AliasTarget:
        DNSName: !GetAtt CloudFrontDistribution.DomainName
        HostedZoneId: Z2FDTNDATAQYW2 # CloudFront's fixed hosted zone ID for alias targets

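The Lambda template will spit out the ARN for the function. Lambda@Edge needs the ARN of a published version rather than $LATEST, so the version number is appended before the ARN is supplied to the second template, e.g.: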

ARN="arn:aws:lambda:us-east-1:<MY ACCOUNT>:function:wcibIndex:<PUBLISHED VERSION>"

aws cloudformation update-stack \
  --stack-name wcib-content \
  --template-body file://cloudfront.yaml \
  --parameters "ParameterKey=LambdaFunctionArn,ParameterValue=$ARN" \
  --capabilities CAPABILITY_NAMED_IAM \
  --region ap-southeast-2

Conclusion

Achievement unlocked.

I now have a CloudFront distribution that provides endpoints that work just like the APIs I usually build in GCP Cloud Functions, except they are blazingly fast from a cold start!

Image credit: Paper icons created by Freepik — Flaticon

Written by Gareth Cronin
