Using AI to organize sketchbook scans

I have a pile of scanned sketchbook pages from my xotu project. Each scan has a handwritten date and title in the corner. Organizing 81+ scans by hand is tedious, so I wrote a Python script that uses a vision model to read the handwriting and sort everything into named directories automatically.

The problem

The scans come out of the scanner as xotu-20241226_01.png, xotu-20241226_02.png, etc. Useless names. Each page has a date like 20250311 and a short title written in the top left corner. I want directories named 20250311-A new beginning/ with the original scan and a cropped version of the right half (the actual drawing).

First pass: Gemini API

The first version used Google's Gemini API. The prompt is simple: look at the image, extract the 8-digit date and the title text, return them as {date}-{text}. Gemini 2.5 Flash handled this well, reading even my messy handwriting accurately.
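A minimal sketch of that call, assuming the v1beta `generateContent` REST endpoint and an inline base64 image; the prompt wording and helper names here are illustrative, not the script's exact code:

```python
import base64
import requests

GEMINI_URL = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    "gemini-2.5-flash:generateContent"
)

PROMPT = (
    "Read the 8-digit date (YYYYMMDD) and the short handwritten title "
    "in the top-left corner. Reply with exactly: {date}-{text}"
)

def build_payload(image_bytes: bytes) -> dict:
    """Build the generateContent request body for one scanned page."""
    return {
        "contents": [{
            "parts": [
                {"text": PROMPT},
                {"inline_data": {
                    "mime_type": "image/png",
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                }},
            ],
        }],
    }

def extract_label(image_bytes: bytes, api_key: str) -> str:
    """Ask Gemini for the '{date}-{text}' label of one scan."""
    resp = requests.post(
        GEMINI_URL,
        params={"key": api_key},
        json=build_payload(image_bytes),
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["candidates"][0]["content"]["parts"][0]["text"].strip()
```

The whole "prompt engineering" is that one string: ask for a fixed output format, then use the reply directly as a directory name.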

python xotu-renamer.py ./scans ./output --dry-run

It copies the original as orig.png and crops the right half as main.png. The --dry-run flag lets you preview without writing anything. Non-destructive by design — originals are never touched.
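The copy-and-crop step is straightforward with Pillow. A sketch, where the `label` argument is assumed to be the model's `{date}-{text}` answer and the function names are mine, not the script's:

```python
import shutil
from pathlib import Path
from PIL import Image

def right_half_box(width: int, height: int) -> tuple:
    """Pillow crop box (left, upper, right, lower) for the right half of a page."""
    return (width // 2, 0, width, height)

def write_outputs(scan: Path, out_root: Path, label: str, dry_run: bool = False) -> Path:
    """Copy the original as orig.png and save the cropped right half as main.png."""
    dest = out_root / label              # e.g. "20250311-A new beginning"
    if dry_run:
        print(f"would create {dest}")    # preview only, nothing written
        return dest
    dest.mkdir(parents=True, exist_ok=True)
    shutil.copy2(scan, dest / "orig.png")        # original scan, untouched
    with Image.open(scan) as im:
        im.crop(right_half_box(*im.size)).save(dest / "main.png")
    return dest
```

Because everything is written into a fresh directory and the source file is only ever read, a bad model answer costs nothing but a badly named folder.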

Problems with the cloud approach

Two issues showed up quickly:

  1. Rate limits. The free tier of Gemini hammers you with 429s when you blast 81 requests at it. I added --delay 3 to throttle, but that turns a quick batch job into a four-minute wait.

  2. Model expiry. I had pinned gemini-2.5-flash-preview-09-2025 in the URL. Preview models get retired. One day the script just 404'd. Lesson: use stable model names like gemini-2.5-flash, or better, make it a CLI flag.

I also added early exit on 401/403 so an expired API key doesn't burn through 5 retries per image before telling you what's wrong.
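The retry logic boils down to: treat 401/403 as fatal, back off exponentially on 429 and 5xx, and raise on anything else. A sketch under those assumptions (the class and function names are illustrative):

```python
import time
import requests

class AuthError(RuntimeError):
    """401/403 -- retrying won't fix a bad API key, so bail out at once."""

def backoff_delays(retries: int = 5, base: float = 2.0) -> list:
    """Exponential backoff schedule: base, 2*base, 4*base, ..."""
    return [base * 2 ** i for i in range(retries)]

def post_with_retries(url: str, payload: dict, retries: int = 5) -> requests.Response:
    for delay in backoff_delays(retries):
        resp = requests.post(url, json=payload, timeout=60)
        if resp.status_code in (401, 403):
            raise AuthError(f"auth failed ({resp.status_code}): check your API key")
        if resp.status_code == 429 or resp.status_code >= 500:
            time.sleep(delay)            # transient: back off and retry
            continue
        resp.raise_for_status()          # any other 4xx is a genuine error
        return resp
    raise RuntimeError(f"gave up after {retries} attempts on {url}")
```

Without the fail-fast branch, a revoked key means five timed-out retries per image times 81 images before you see a useful error.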

Going local with Ollama

The real fix for rate limits is to not use a cloud API at all. I have a machine (genos) with a 4090 sitting around. Ollama makes it trivial to serve models locally.

ollama pull qwen2.5vl:7b

I added an --ollama flag to the script. When set, it talks to the Ollama API instead of Gemini:

python xotu-renamer.py ./scans ./output --ollama

It defaults to http://genos:11434 and qwen2.5vl:7b. No API key needed, no rate limits, runs as fast as the GPU can go.
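The Ollama path is even simpler than the Gemini one: a single POST to `/api/generate` with the image as a base64 string. A sketch, assuming the defaults above (prompt wording and helper names are mine):

```python
import base64
import requests

def build_ollama_body(image_bytes: bytes, model: str = "qwen2.5vl:7b") -> dict:
    """Request body for Ollama's /api/generate with one attached image."""
    return {
        "model": model,
        "prompt": "Extract the 8-digit date and the title; reply as {date}-{text}.",
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,   # one JSON object instead of a token stream
    }

def ollama_extract(image_bytes: bytes, host: str = "http://genos:11434",
                   model: str = "qwen2.5vl:7b") -> str:
    """Ask a local vision model for the '{date}-{text}' label."""
    resp = requests.post(f"{host}/api/generate",
                         json=build_ollama_body(image_bytes, model),
                         timeout=300)
    resp.raise_for_status()
    return resp.json()["response"].strip()
```

Setting `"stream": False` keeps the client trivial: no chunked parsing, just one JSON object with the answer in its `response` field.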

One gotcha: Ollama binds to 127.0.0.1 by default. If you're calling it from another machine you need to set OLLAMA_HOST=0.0.0.0 in the systemd service and restart.
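Assuming Ollama runs as a systemd service (the default for the official Linux install), the standard way to set that variable is a drop-in override rather than editing the unit file directly:

```shell
# Opens an editor for a drop-in override under
# /etc/systemd/system/ollama.service.d/ -- add these two lines:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl edit ollama
sudo systemctl restart ollama
```

After the restart, the server listens on all interfaces instead of just loopback.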

Model quality comparison

The 7B Qwen model worked but wasn't great at reading handwriting. Some titles came out garbled. Gemini was noticeably better here — not surprising given the size difference.

The sweet spot turned out to be gemma3:27b, Google's open model. It fits on a 4090 with quantization and reads handwriting almost as well as cloud Gemini:

python xotu-renamer.py ./scans ./output --ollama --model gemma3:27b

Slower per image (~10-15s vs ~3-5s for the 7B) but the accuracy is worth it for a batch job you run once.

What the script does now

  • Reads scans from an input directory
  • Calls either Gemini or a local Ollama model to extract date+title
  • Creates a project directory per scan with orig.png and main.png
  • Supports --dry-run, --delay, --model, and --ollama
  • Fails fast on auth errors, retries with backoff on transient ones
  • Prints per-image and total timing for benchmarking

The whole thing is one Python file with no dependencies beyond requests and Pillow. Good enough for a personal tool.

Takeaways

  • Vision models are genuinely useful for this kind of boring extraction task. The prompt engineering is minimal.
  • Pin stable model versions, not preview ones.
  • Running models locally removes the most annoying constraint (rate limits) and is free. A 4090 handles 7B-27B vision models fine.
  • Always add --dry-run to batch tools that create files. Future you will thank present you.