My tool-kit for tiny short-run batch jobs

In my first story in this series, I explained that I spend about a day a week solving problems that interest me with tiny software applications. I’ve been assembling a tool-kit to quickly build against architectural patterns that I find keep coming up in tiny apps. The technology I choose has to solve these problems:

  • I only want to pay for what I use (scale-to-zero)
  • I don’t have a lot of time available for learning or building (no steep learning curves without substantial time-savings)
  • I don’t have time for maintenance activities (no patching servers, automated scale-up)
  • I’m not a good UI designer or front end engineer (design systems are great)

In my third story I outlined my tool-kit for building tiny APIs. The APIs I build with that tool-kit are synchronous and intended to be consumed by user interfaces and integrations that fit a synchronous pattern.

Another pattern I’ve found useful is what I call a short-run batch job. I often run into a requirement where I want to do some background processing that isn’t in response to an interaction from a user interface. The trigger might be an event in another system or, more often, a schedule. I’ll deal with long-running batch jobs in a later story, but the jobs I’m talking about here generally complete in less than a minute. A related requirement is a job queue, so I’ll explain how I do that at the same time.

I’ll return to the project I used to illustrate my API tool-kit for this pattern too.

TL;DR

  • AWS Lambda and Node.js
  • AWS DynamoDB
  • AWS EventBridge (formerly the “events” part of CloudWatch)
  • AWS Simple Email Service with Route 53
  • Github Actions

The business problem

My vision is for some kind of functionality in the Uber Eats app, or Zomato, and similar where a consumer can rate the choice of packaging used in a food delivery, and if it’s a poor choice, to send details of better alternatives to the vendor. When I started exploring this, I discovered that figuring out whether something is recyclable here in New Zealand is not straightforward. There are 67 separate recycling schemes: one for each local authority in the country. In some places polypropylene food containers (“number 5 plastics”) are picked up at the kerb for recycling, in others they go to landfill. There is no central database of the difference in schemes, so to even get started on this one, I’d need to collect all that information, put it in a structured form and expose it with… an API.

I created the API I describe above to answer the question of how different types of material will be handled in different local authority territories, but that information is not static. Each local authority has its own website, each publishes the information in a different way, and every so often they change their suppliers and regulations and update the information. Sometimes the data is in a PDF linked from deep in the information architecture, sometimes it’s in HTML content. Sometimes it’s in paragraph form, sometimes it’s in a table, sometimes there is a “helpful” tool where a user has to search for the material to find the answer. My solution to keep my database up to date is to have a website crawler that runs a few times a day, one site at a time, and lets me know by email if there are any changes. I then eyeball the site and update my structured data with the changes.

The basic process

  • looks in a DynamoDB table to find the least recently visited site
  • checks the site for changes and emails me if there are any (or if the scrape fails)
  • updates the last visit time in the DynamoDB table

This makes for a very simple job queue. AWS EventBridge wakes up every four hours and runs the function, which means six visits every 24 hours, so all 67 sites are visited at least once every couple of weeks — more than enough to keep up with the typical rate of change on the websites.
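To make the shape of the job concrete, here is a minimal sketch of what such a handler could look like. The helper names (findStaleSite, markVisited, checkForChanges, notifyMe) are hypothetical placeholders of mine, sketched further in the sections below, not the actual code:

// handlers/diffsite.js — a minimal sketch of the short-run batch pattern.
// findStaleSite, markVisited, checkForChanges and notifyMe are
// hypothetical helpers, not the real implementation.
const { findStaleSite, markVisited } = require("../lib/sites");
const { checkForChanges } = require("../lib/scrape");
const { notifyMe } = require("../lib/mail");

module.exports.handler = async () => {
  // 1. pick the least recently visited site from DynamoDB
  const site = await findStaleSite();

  // 2. scrape it and email me if anything changed (or the scrape failed)
  try {
    const diff = await checkForChanges(site);
    if (diff) await notifyMe(site, diff);
  } catch (err) {
    await notifyMe(site, `scrape failed: ${err.message}`);
  }

  // 3. record the visit so the next run picks a different site
  await markVisited(site.id);
};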

Three problems to solve

  • I only want to pay for what I use (scale-to-zero)
  • I don’t have a lot of time available for learning or building (no steep learning curves without substantial time-savings)
  • I don’t have time for maintenance activities (no patching servers, automated scale-up)

The handler

I’m not covering the details of the actual function here, but for the curious, I use SuperAgent with Cheerio for web-scraping. Cheerio has excellent support for CSS selectors on full HTML documents and snippets, and I find that sticking to CSS selectors makes the scraping code much easier to maintain.
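As a rough illustration of that combination (the URL, page structure and CSS selector below are invented for the example, not taken from any real council site):

// Sketch of the SuperAgent + Cheerio approach; the selector and page
// structure are illustrative only.
const superagent = require("superagent");
const cheerio = require("cheerio");

async function fetchRecyclingText(url) {
  const res = await superagent.get(url);
  const $ = cheerio.load(res.text);
  // CSS selectors keep the scraping code readable and easy to adjust
  // when a council reshuffles its page layout.
  return $("#content table.recycling td")
    .map((i, el) => $(el).text().trim())
    .get()
    .join("\n");
}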

The database
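
The table is a single DynamoDB table, defined in the Serverless config below, with an item per council website. A hedged sketch of how the “least recently visited” lookup and the timestamp update could be done with the DocumentClient (the lastVisited attribute name is my assumption, and these are the findStaleSite and markVisited helpers from the earlier sketch):

// lib/sites.js — sketch of the job-queue reads and writes; the
// lastVisited attribute name is an assumption, not necessarily
// what the real table uses.
const AWS = require("aws-sdk");
const db = new AWS.DynamoDB.DocumentClient();
const TABLE = process.env.DYNAMODB_TABLE;

// 67 items is tiny, so a full Scan and an in-memory sort is fine here.
async function findStaleSite() {
  const { Items } = await db.scan({ TableName: TABLE }).promise();
  return Items.sort(
    (a, b) => (a.lastVisited || 0) - (b.lastVisited || 0)
  )[0];
}

async function markVisited(id) {
  await db
    .update({
      TableName: TABLE,
      Key: { id },
      UpdateExpression: "SET lastVisited = :now",
      ExpressionAttributeValues: { ":now": Date.now() },
    })
    .promise();
}

module.exports = { findStaleSite, markVisited };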

Serverless Framework

The Serverless Framework uses YAML files for its config (as does CloudFormation) and, after much fiddling around, my YAML looks like this:

service: watch-recycling-sites
frameworkVersion: ">=1.1.0 <2.0.0"

plugins:
  - serverless-dynamodb-local
  - serverless-offline

custom:
  dynamodb:
    stages:
      - dev
    start:
      port: 8000
      inMemory: true
      seed: true
      migrate: true
    migration:
      dir: offlinemigrations
    seed:
      domain:
        sources:
          - table: ${self:service}-${opt:stage, self:provider.stage}-v1
            sources: [./offlineseeding/sites.json]

provider:
  name: aws
  runtime: nodejs10.x
  stage: dev
  region: ap-southeast-2
  environment:
    DYNAMODB_TABLE: ${self:service}-${opt:stage, self:provider.stage}-v1
  iamRoleStatements:
    - Effect: Allow
      Action:
        - dynamodb:Query
        - dynamodb:Scan
        - dynamodb:GetItem
        - dynamodb:PutItem
        - dynamodb:UpdateItem
        - dynamodb:DeleteItem
        - ses:SendEmail
      Resource:
        - "arn:aws:dynamodb:${opt:region, self:provider.region}:*:table/${self:provider.environment.DYNAMODB_TABLE}"
        - "arn:aws:ses:*:*:identity/*"

functions:
  diffsite:
    handler: handlers/diffsite.handler
    timeout: 300
    events:
      - schedule: rate(4 hours)

resources:
  Resources:
    TodosDynamoDbTable:
      Type: 'AWS::DynamoDB::Table'
      DeletionPolicy: Retain
      Properties:
        AttributeDefinitions:
          - AttributeName: id
            AttributeType: S
        KeySchema:
          - AttributeName: id
            KeyType: HASH
        BillingMode: PAY_PER_REQUEST
        TableName: ${self:provider.environment.DYNAMODB_TABLE}

Note above that I was also able to include the CloudWatch/EventBridge schedule in my Serverless config.

Mail out

AWS SES really is brutally simple. If your domain is already in a zone on Route 53 then it’s also very easy to do domain verification from the AWS console. It’s an older AWS service and that shows in the clunkiness of the API, but examples abound of using it and it works fine in a Lambda.
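
For the curious, a minimal sketch of the sendEmail call behind the notification (the addresses are placeholders, and the Source must be a verified SES identity):

// lib/mail.js — sketch of the SES notification; addresses and subject
// line are placeholders, not the real ones.
const AWS = require("aws-sdk");
const ses = new AWS.SES();

async function notifyMe(site, details) {
  await ses
    .sendEmail({
      Source: "watcher@example.com", // must be a verified SES identity
      Destination: { ToAddresses: ["me@example.com"] },
      Message: {
        Subject: { Data: `Recycling page changed: ${site.name}` },
        Body: { Text: { Data: details } },
      },
    })
    .promise();
}

module.exports = { notifyMe };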

Deployment

As usual, Github Actions did not disappoint!

I found a community action for running a Serverless Framework deploy, with the usual requirement of a couple of secrets in the Github repo for AWS access. Five minutes later I had full CI/CD, deploying on every push to my master branch.

name: Deploy serverless
on:
  push:
    branches:
      - master
jobs:
  deploy:
    name: deploy
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node-version: [14.x]
    steps:
      - uses: actions/checkout@v2
      - name: Use Node.js ${{ matrix.node-version }}
        uses: actions/setup-node@v1
        with:
          node-version: ${{ matrix.node-version }}
      - run: npm ci
      - name: serverless deploy
        uses: serverless/github-action@master
        with:
          args: deploy
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

Summary

I only want to pay for what I use (scale-to-zero)

The cost of the services is totally under my control. The Lambda execution time and SES volume for this use case fall under the billing threshold, so they cost nothing. The DynamoDB request count does exceed the threshold, but a month’s charge is 6c — which may as well be zero!

I don’t have a lot of time available for learning or building (no steep learning curves without substantial time-savings)

Using Node.js for my function code in Lambda meant no new languages to learn.

I don’t have time for maintenance activities (no patching servers, automated scale-up)

The stack is serverless and will scale up as required. Github Actions automates deployment.

Bonza!

Next up… long-running batch jobs
