My tool-kit for tiny short-run batch jobs

In my first story in this series, I explained that I spend about a day a week solving problems that interest me with tiny software applications. I’ve been assembling a tool-kit to quickly build against architectural patterns that I find keep coming up in tiny apps. The technology I choose has to solve these problems:

  • I only want to pay for what I use (scale-to-zero)
  • I don’t have a lot of time available for learning or building (no steep learning curves without substantial time-savings)
  • I don’t have time for maintenance activities (no patching servers, automated scale-up)
  • I’m not a good UI designer or front end engineer (design systems are great)

In my third story I outlined my tool-kit for building tiny APIs. The APIs I build with that tool-kit are synchronous and are intended to be consumed by user interfaces and integrations that fit a synchronous pattern.

Another pattern I’ve found useful is what I call a short-run batch job. I often run into a requirement where I want to do some background processing that isn’t in response to an interaction from a user interface. The trigger might be an event in another system or, more often, a schedule. I’ll deal with long-running batch jobs in a later story, but the jobs I’m talking about here generally complete in less than a minute. A related requirement is a job queue, so I’ll explain how I do that at the same time.

I’ll return to the project I used to illustrate my API tool-kit to illustrate this pattern too.

TL;DR

  • Serverless Framework
  • AWS Lambda and Node.js
  • AWS DynamoDB
  • AWS EventBridge (formerly the “events” part of CloudWatch)
  • AWS Simple Email Service with Route 53
  • GitHub Actions

The business problem

In my last story I explained the project at a high level:

My vision is for some kind of functionality in the Uber Eats app, or Zomato, and similar where a consumer can rate the choice of packaging used in a food delivery, and if it’s a poor choice, to send details of better alternatives to the vendor. When I started exploring this, I discovered that figuring out whether something is recyclable here in New Zealand is not straightforward. There are 67 separate recycling schemes: one for each local authority in the country. In some places polypropylene food containers (“number 5 plastics”) are picked up at the kerb for recycling, in others they go to landfill. There is no central database of the difference in schemes, so to even get started on this one, I’d need to collect all that information, put it in a structured form and expose it with… an API.

I created the API I describe above to answer the question of how different types of material will be handled in different local authority territories, but that information is not static. Each local authority has its own website and they each publish the information in a different way, and every so often they change their suppliers and regulations and update the information. Sometimes the data is in a PDF linked from deep in the information architecture, sometimes it’s in HTML content. Sometimes it’s in paragraph form, sometimes it’s in a table, sometimes there is a “helpful” tool where a user has to search for the material to find the answer. My solution to keep my database up to date is to have a website crawler that runs a few times a day, one site at a time, and lets me know by email if there are any changes. I then eyeball the site and update my structured data with the changes.

The basic process

I toyed with all sorts of ways of scheduling a watch of the websites, but the one I settled on was a Lambda function that:

  • looks in a DynamoDB table to find the least recently visited site
  • checks the site for changes and emails me if there are any (or if the scrape fails)
  • updates the last visit time in the DynamoDB table

This makes for a very simple job queue. AWS EventBridge wakes up every four hours and runs the function, which means six visits every 24 hours, so a full pass over all 67 sites takes a little over 11 days (67 ÷ 6). That is more than enough to keep up with the typical rate of change on the websites.
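The three steps above can be sketched roughly as follows. This is not the actual function, just an illustration of the flow: `store`, `checkSite` and `notify` are hypothetical stand-ins for DynamoDB, the scraper and SES, and the `lastVisited` attribute name is an assumption.

```javascript
// Return the site with the oldest lastVisited timestamp. ISO 8601 UTC strings
// sort chronologically, so a plain string comparison is enough. A site that
// has never been visited (no timestamp) counts as the oldest.
function pickLeastRecentlyVisited(sites) {
  return sites.reduce((oldest, site) =>
    (site.lastVisited || '') < (oldest.lastVisited || '') ? site : oldest
  );
}

// One scheduled run: pick the stalest site, check it, record the visit.
async function runJob({ store, checkSite, notify, now = () => new Date() }) {
  const sites = await store.listSites();
  if (sites.length === 0) return null;

  const site = pickLeastRecentlyVisited(sites);
  try {
    const result = await checkSite(site);
    if (result.changed) {
      await notify(site, `Change detected: ${result.detail}`);
    }
  } catch (err) {
    // A failed scrape is worth an email too.
    await notify(site, `Scrape failed: ${err.message}`);
  }

  // Always record the visit so one broken site cannot block the queue.
  await store.markVisited(site.id, now().toISOString());
  return site.id;
}

// With one visit per scheduled run, a full pass over every site takes
// siteCount / runsPerDay days.
function daysPerFullPass(siteCount, hoursBetweenRuns) {
  return siteCount / (24 / hoursBetweenRuns);
}
```

Calling `daysPerFullPass(67, 4)` gives 67 / 6, which is where the figure of a little over 11 days per full pass comes from. Recording the visit even when the scrape fails is a design choice in this sketch: the broken site goes to the back of the queue instead of being retried every four hours.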

Three problems to solve

My tool-kit needs to solve the first three of the problems I’ve mentioned:

  • I only want to pay for what I use (scale-to-zero)
  • I don’t have a lot of time available for learning or building (no steep learning curves without substantial time-savings)
  • I don’t have time for maintenance activities (no patching servers, automated scale-up)

The handler

AWS Lambda matches my second tiny apps principle of “no steep learning curves without substantial time-savings”. I know Node.js, and Lambda supports that along with a number of other popular languages. It means I’m not burning all my time learning the syntax of an unfamiliar language.

I’m not covering the details of the actual function here, but for the curious, I use SuperAgent with Cheerio for web-scraping. Cheerio has excellent support for CSS selectors on full HTML docs and snippets, and I find that sticking to CSS selectors makes scraping code much easier to maintain.

The database

I find DynamoDB a bit frustrating. It’s optimised for performance and scale-up, which is great (the “no patching servers, automated scale-up” principle), but that’s about all it’s optimised for. The schema definition and API access pattern involve some pretty arcane stuff, so there’s a lot of find-example-copy-paste-cross-fingers going on whenever I use it. But when traded off with ease of configuration using IAM to give the Lambda access to DynamoDB and being able to manage the whole thing with Serverless Framework, it does seem like the best choice for a basic job queue.
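To give a flavour of that access pattern, here is a minimal sketch of the parameter objects a job like this might pass to the DynamoDB DocumentClient. The `lastVisited` attribute name and both function names are assumptions for illustration; only the `id` hash key comes from my table definition.

```javascript
// Parameters for a DocumentClient `update` call that stamps a site with its
// latest visit time. `lastVisited` is a hypothetical attribute name.
function buildMarkVisitedParams(tableName, siteId, visitedAt) {
  return {
    TableName: tableName,
    Key: { id: siteId },
    UpdateExpression: 'SET lastVisited = :visitedAt',
    ExpressionAttributeValues: { ':visitedAt': visitedAt },
  };
}

// Parameters for a DocumentClient `scan` call. With only a hash key on `id`,
// finding the least recently visited site means reading every item and
// sorting in code, which is acceptable for a table of 67 rows.
function buildListSitesParams(tableName) {
  return { TableName: tableName };
}
```

With the v2 AWS SDK for JavaScript these objects would be passed to `new AWS.DynamoDB.DocumentClient().update(params).promise()` and `.scan(params).promise()` respectively, with the table name read from the `DYNAMODB_TABLE` environment variable.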

Serverless Framework

I created this pattern before the release of AWS CDK, and given how useful I found that for my tiny APIs tool-kit, I might try that next time, but Serverless Framework is a pretty good option for standing up Lambda + DynamoDB solutions. It provides command line tools for deploying incremental changes to Lambda functions (in my case in Node.js) and DynamoDB schemas. It also provides tools for remotely invoking a function, which is handy for testing changes. Most importantly though, it provides offline emulation of the AWS services, so I could build the Lambda and the DynamoDB table locally and get it running before deploying it to AWS.

Serverless uses YAML files for its config (unlike CDK, where the stack is defined in code) and after much fiddling around my YAML looks like this:

    service: watch-recycling-sites

    frameworkVersion: ">=1.1.0 <2.0.0"

    plugins:
      - serverless-dynamodb-local
      - serverless-offline

    # Local DynamoDB emulation: in-memory table, migrated and seeded on start
    custom:
      dynamodb:
        stages:
          - dev
        start:
          port: 8000
          inMemory: true
          seed: true
          migrate: true
        migration:
          dir: offlinemigrations
        seed:
          domain:
            sources:
              - table: ${self:service}-${opt:stage, self:provider.stage}-v1
                sources: [./offlineseeding/sites.json]

    provider:
      name: aws
      runtime: nodejs10.x
      stage: dev
      region: ap-southeast-2
      environment:
        DYNAMODB_TABLE: ${self:service}-${opt:stage, self:provider.stage}-v1
      # Let the Lambda read and write the table and send email through SES
      iamRoleStatements:
        - Effect: Allow
          Action:
            - dynamodb:Query
            - dynamodb:Scan
            - dynamodb:GetItem
            - dynamodb:PutItem
            - dynamodb:UpdateItem
            - dynamodb:DeleteItem
            - ses:SendEmail
          Resource:
            - "arn:aws:dynamodb:${opt:region, self:provider.region}:*:table/${self:provider.environment.DYNAMODB_TABLE}"
            - "arn:aws:ses:*:*:identity/*"

    functions:
      diffsite:
        handler: handlers/diffsite.handler
        timeout: 300
        events:
          # EventBridge schedule: run the job every four hours
          - schedule: rate(4 hours)

    resources:
      Resources:
        TodosDynamoDbTable:
          Type: 'AWS::DynamoDB::Table'
          DeletionPolicy: Retain
          Properties:
            AttributeDefinitions:
              - AttributeName: id
                AttributeType: S
            KeySchema:
              - AttributeName: id
                KeyType: HASH
            BillingMode: PAY_PER_REQUEST
            TableName: ${self:provider.environment.DYNAMODB_TABLE}

Note above that I was also able to include the CloudWatch/EventBridge schedule in my Serverless config.

Mail out

Email is still a useful way to manage a personal to-do list. I could send myself a Slack notification or something similar, but then I need to decide which Slack team to send it to, and jump through a lot more hoops than just pointing at an email address. If my Lambda handler sees that the count of elements in the important section of a website has changed, or if I get a response from the server that isn’t in the 2xx or 3xx range, I send myself an email with the details.
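That decision is simple enough to sketch as a pure function. This is an illustration under assumptions rather than my actual handler code: `status`, `count` and `previousCount` are hypothetical names for the HTTP status code, the number of elements matched in the watched section, and the number stored from the previous visit.

```javascript
// Return a human-readable reason to send an email, or null if the visit
// found nothing worth reporting.
function emailReason({ status, count }, previousCount) {
  // Anything outside the 2xx and 3xx ranges counts as a failed visit.
  if (status < 200 || status >= 400) {
    return `Unexpected HTTP status ${status}`;
  }
  // A different element count suggests the content has been edited.
  if (count !== previousCount) {
    return `Element count changed from ${previousCount} to ${count}`;
  }
  return null;
}
```

Comparing element counts is a blunt check. It will miss an edit that changes the wording without adding or removing elements, but it is cheap to maintain and rarely raises a false alarm.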

AWS SES really is brutally simple. If your domain is already in a zone on Route 53 then it’s also very easy to do domain verification from the AWS console. It’s an older AWS service and that shows in the clunkiness of the API, but examples abound of using it and it works fine in a Lambda.
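For a sense of that clunkiness, here is a minimal sketch of the nested parameter object that the SES `sendEmail` call expects for a plain-text message. The function name, the argument names and the subject line wording are assumptions for illustration.

```javascript
// Build the parameters for an SES sendEmail call carrying a plain-text alert.
// `from` must be an address on a domain (or an identity) verified in SES.
function buildAlertEmail({ from, to, siteName, reason }) {
  return {
    Source: from,
    Destination: { ToAddresses: [to] },
    Message: {
      Subject: { Data: `Recycling site watch: ${siteName}` },
      Body: { Text: { Data: reason } },
    },
  };
}
```

With the v2 AWS SDK for JavaScript the result would be passed to `new AWS.SES().sendEmail(params).promise()` inside the Lambda handler.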

Deployment

Remembering that I don’t have time for maintenance, I don’t want to rely on locally installed tools to deploy changes to my Lambda function or database schema.

As usual GitHub Actions did not disappoint!

I found a community action for running a Serverless Framework deploy, with the usual requirement of a couple of secrets in the GitHub repo for AWS access. Five minutes later, I had full CI/CD: a deploy on every push to my master branch.

    name: Deploy serverless

    on:
      push:
        branches:
          - master

    jobs:
      deploy:
        name: deploy
        runs-on: ubuntu-latest
        strategy:
          matrix:
            node-version: [14.x]
        steps:
          - uses: actions/checkout@v2
          - name: Use Node.js ${{ matrix.node-version }}
            uses: actions/setup-node@v1
            with:
              node-version: ${{ matrix.node-version }}
          - run: npm ci
          - name: serverless deploy
            uses: serverless/github-action@master
            with:
              args: deploy
            env:
              AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
              AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

Summary

Let’s revisit those principles.

I only want to pay for what I use (scale-to-zero)

The cost of the services is totally under my control. The Lambda execution time and SES rate for this use case fall under the billing threshold, so they actually cost nothing. The request count in DynamoDB does exceed the threshold, but a month’s charge is 6c, which may as well be zero!

I don’t have a lot of time available for learning or building (no steep learning curves without substantial time-savings)

Using Node.js for my function code in Lambda meant no new languages to learn.

I don’t have time for maintenance activities (no patching servers, automated scale-up)

The stack is serverless and will scale up as required. GitHub Actions automates deployment.

Bonza!

Next up… long-running batch jobs

Technology leader for Xero in Auckland, New Zealand, former start-up founder, father of two, maker of t-shirts and small software products