My tool-kit for tiny long-run batch jobs

In my first story in this series, I explained that I spend about a day a week solving problems that interest me with tiny software applications. I’ve been assembling a tool-kit that lets me quickly build against the architectural patterns I find keep coming up in tiny apps. The technology I choose has to solve these problems:

  • I only want to pay for what I use (scale-to-zero)
  • I don’t have a lot of time available for learning or building (no steep learning curves without substantial time-savings)
  • I don’t have time for maintenance activities (no patching servers, automated scale-up)
  • I’m not a good UI designer or front end engineer (design systems are great)

In my fourth story in the series, I explained my approach to background processing that isn’t triggered by an interaction from a user interface and that completes in a minute or less. AWS Lambda works well in that situation, but Lambda isn’t so great when a job is going to run for a long time: the costs start creeping up, and Lambda has a maximum execution time of fifteen minutes. Fortunately, AWS also provides AWS Batch, which is ideal for longer-running jobs, but it is a bit more complicated than deploying a Lambda.

To illustrate this tool-kit I’ll use a different side-project example: web-scraping Goodreads and the book catalogues at my local libraries.

TL;DR

  • Docker with Node.js on AWS Batch (running on EC2)
  • AWS S3
  • AWS EventBridge (formerly the “events” part of CloudWatch)
  • GitHub Actions

The business problem

Every time I finish a book and want to borrow a new one, the basic question I want to answer is “what books that I might want to read are available in digital format at my local library?”. This is tricky because:

  • Goodreads doesn’t have an API (its owner, Amazon, shut it down 👿)
  • The library catalogues don’t have public APIs

I like to read books that win literary awards or appear on Goodreads’ user-curated lists, so what I want to do is take all of the books on Goodreads’ award-winner lists and the most useful community lists, and see which ones are available in digital format at my local library. I also want to check whether any of the books on my “to-read” shelf on Goodreads are available.

Without APIs, I need to employ web-scraping techniques to visit the relevant pages and extract the data. Web-scraping is slow and book lists are long, so I need to do this in a long-running batch job.

The application I built to use this data (only available for Auckland and Wellington libraries in New Zealand, sorry) is “What Can I Borrow?”. I used the same tool-kit as I described in my second story in this series.

Screenshot from wcib.apps.cronin.nz

The basic process

As I noted in my last story, I won’t go into the details of how I do scraping here, but I use SuperAgent with Cheerio. Cheerio has excellent support for CSS selectors on full HTML documents and snippets, and I find that sticking to CSS selectors makes for much easier-to-maintain scraping code.
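To make that concrete, here is a minimal sketch of the pattern: fetch a list page with SuperAgent, load it into Cheerio, and pull out what you need with CSS selectors. The URL and selector below are illustrative placeholders, not the ones my real scraper uses.

const superagent = require('superagent');
const cheerio = require('cheerio');

// Fetch a list page and return the book titles it links to.
// 'table a.bookTitle' is a hypothetical example selector.
async function scrapeBookTitles(listUrl) {
  const response = await superagent.get(listUrl);
  const $ = cheerio.load(response.text);

  return $('table a.bookTitle')
    .map((i, el) => $(el).text().trim())
    .get();
}

scrapeBookTitles('https://www.goodreads.com/list/show/<some-list>')
  .then((titles) => console.log(titles));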

Three problems to solve

  • I only want to pay for what I use (scale-to-zero)
  • I don’t have a lot of time available for learning or building (no steep learning curves without substantial time-savings)
  • I don’t have time for maintenance activities (no patching servers, automated scale-up)

The execution environment

The container

The Dockerfile I ended up with is very simple:

FROM node:latest
WORKDIR /usr/src/app
COPY lib lib
COPY populatefirestore.js .
COPY package.json .
COPY firebaseconfig.json .
RUN npm install
ENTRYPOINT [ "node", "populatefirestore.js", "-e"]

I’m using the official Node Docker image. The container runs my populatefirestore.js script and passes it the “-e” parameter, which tells the script to read its config from environment variables that I set in the job definition in Batch.
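The script itself isn’t part of this story, but the “-e” handling amounts to a flag check along these lines; the config names (LIBRARY_BASE_URL, OUTPUT_BUCKET) are hypothetical placeholders rather than my actual settings.

// "-e" means: read config from environment variables, which the Batch
// job definition supplies. Otherwise fall back to a local config file.
const readConfigFromEnv = process.argv.includes('-e');

const config = readConfigFromEnv
  ? {
      libraryBaseUrl: process.env.LIBRARY_BASE_URL,
      outputBucket: process.env.OUTPUT_BUCKET,
    }
  : require('./localconfig.json'); // hypothetical local fallback

console.log('Running with config:', config);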

The repo

The job execution

I did strike a gotcha with configuring access to an S3 bucket for my Batch job. There are two IAM roles on a Batch job definition: “execution role” and “job role”. The execution role is only used by Batch to fetch and start the container, but the optional job role is used by the running container for everything else. Therefore, it’s the “job role” that needs to have access to the S3 bucket and be set on the job definition.
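The reason is that it’s the application code inside the container that reads and writes the bucket. A minimal sketch of that container-side access, assuming the AWS SDK for JavaScript v3 and hypothetical bucket and key names, looks like this; note there are no credentials in the code, because inside the container the SDK picks up the job role’s credentials automatically.

const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');

// Region and bucket come from environment variables set in the job
// definition (hypothetical names); credentials come from the job role.
const s3 = new S3Client({ region: process.env.AWS_REGION });

async function saveResults(results) {
  await s3.send(new PutObjectCommand({
    Bucket: process.env.OUTPUT_BUCKET,
    Key: 'scrape-results.json',
    Body: JSON.stringify(results),
    ContentType: 'application/json',
  }));
}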

AWS EventBridge has built-in support for AWS Batch jobs, so it is straightforward to hook up a scheduled rule to a job.
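For a sense of what that wiring involves, here’s a sketch using the AWS SDK for JavaScript v3: a scheduled rule, plus a target pointing at the Batch job queue and job definition, with a role that allows EventBridge to submit the job. Every name and ARN below is a placeholder, and the console works just as well.

const {
  EventBridgeClient,
  PutRuleCommand,
  PutTargetsCommand,
} = require('@aws-sdk/client-eventbridge');

const events = new EventBridgeClient({ region: 'ap-southeast-2' });

async function scheduleWeeklyScrape() {
  // A cron schedule: 02:00 UTC every Monday.
  await events.send(new PutRuleCommand({
    Name: 'weekly-populatefirestore',
    ScheduleExpression: 'cron(0 2 ? * MON *)',
    State: 'ENABLED',
  }));

  // The target is the Batch job queue; BatchParameters name the job
  // definition and the job to submit when the rule fires.
  await events.send(new PutTargetsCommand({
    Rule: 'weekly-populatefirestore',
    Targets: [{
      Id: 'populatefirestore-batch-target',
      Arn: 'arn:aws:batch:ap-southeast-2:123456789012:job-queue/scrape-queue',
      RoleArn: 'arn:aws:iam::123456789012:role/eventbridge-batch-submit',
      BatchParameters: {
        JobDefinition: 'populatefirestore-job-definition',
        JobName: 'weekly-populatefirestore',
      },
    }],
  }));
}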

Deployment

For the container build, it was GitHub Actions to the rescue again! A ready-made action for logging into ECR was all I needed to build and push the image automatically on any push to master.

name: Production build for populatefirestore in Docker container on ECR

on:
  push:
    branches: [ master ]

jobs:
  deploy:
    name: Deploy
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repo
        uses: actions/checkout@master
      - name: Write Firebase credentials from secret
        env:
          FIREBASE_CREDENTIALS: ${{secrets.FIREBASE_CREDENTIALS}}
        run: 'echo "$FIREBASE_CREDENTIALS" > firebaseconfig.json'
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}
      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v1
      - name: Build, tag, and push image to Amazon ECR
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          ECR_REPOSITORY: <myrepoid>
          IMAGE_TAG: latest
        run: |
          docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG

Summary

I only want to pay for what I use (scale-to-zero)

As with short-run jobs, the cost of the services here is totally under my control. I run four 10-minute jobs once a week, and in the full month of June it cost 9c.

I don’t have a lot of time available for learning or building (no steep learning curves without substantial time-savings)

Using Node.js for my job code in AWS Batch meant no new languages to learn. It was worth learning a bit of Docker to take advantage of the Batch infrastructure.

I don’t have time for maintenance activities (no patching servers, automated scale-up)

The stack is serverless and will scale up as required. GitHub Actions automates deployment.

Bonza!
