Using AI to organize sketchbook scans
I have a pile of scanned sketchbook pages from my xotu project. Each scan has a handwritten date and title in the corner. Organizing 81+ scans by hand is tedious, so I wrote a Python script that uses a vision model to read the handwriting and sort everything into named directories automatically.
The problem
The scans come out of the scanner as xotu-20241226_01.png,
xotu-20241226_02.png, etc. Useless names. Each page has a date like
20250311 and a short title written in the top left corner. I want
directories named 20250311-A new beginning/ with the original scan
and a cropped version of the right half (the actual drawing).
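The naming scheme is simple enough to sketch. Here's a minimal sanitizer (a hypothetical helper, not the script's actual code) that turns the extracted date and title into a filesystem-safe directory name:

```python
import re

def project_dir_name(date: str, title: str) -> str:
    """Build a directory name like '20250311-A new beginning'.

    `date` is the 8-digit date read off the page; `title` is the
    handwritten title. Characters awkward in paths are dropped.
    """
    # Keep letters, digits, spaces, hyphens, and apostrophes.
    safe_title = re.sub(r"[^\w \-']", "", title).strip()
    return f"{date}-{safe_title}"
```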
First pass: Gemini API
The first version used Google's Gemini API. The prompt is simple: look
at the image, extract the 8-digit date and the title text, return them
as {date}-{text}. Gemini 2.5 Flash handled this well, reading even
my messy handwriting accurately.
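The request shape is roughly this — a sketch of the `generateContent` payload with the image inlined as base64 (the prompt wording and helper name are illustrative, not the script's exact code):

```python
import base64

GEMINI_URL = ("https://generativelanguage.googleapis.com/v1beta/"
              "models/{model}:generateContent?key={key}")

PROMPT = ("Read the handwritten 8-digit date and the title in the top-left "
          "corner. Reply with exactly {date}-{title} and nothing else.")

def build_request(image_bytes: bytes) -> dict:
    # Gemini accepts inline base64 image data alongside the text prompt.
    return {
        "contents": [{
            "parts": [
                {"text": PROMPT},
                {"inline_data": {
                    "mime_type": "image/png",
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                }},
            ]
        }]
    }
```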
```shell
python xotu-renamer.py ./scans ./output --dry-run
```
It copies the original as orig.png and crops the right half as
main.png. The --dry-run flag lets you preview without writing
anything. Non-destructive by design — originals are never touched.
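The crop itself is one Pillow call. A sketch, assuming the drawing is always the right half of the page (`write_outputs` is a hypothetical helper name):

```python
from pathlib import Path
from PIL import Image

def write_outputs(scan: Path, dest: Path, dry_run: bool = False) -> None:
    """Copy the scan as orig.png and save its right half as main.png."""
    img = Image.open(scan)
    w, h = img.size
    right_half = img.crop((w // 2, 0, w, h))  # (left, top, right, bottom)
    if dry_run:
        print(f"would write {dest / 'orig.png'} and {dest / 'main.png'}")
        return
    dest.mkdir(parents=True, exist_ok=True)
    img.save(dest / "orig.png")
    right_half.save(dest / "main.png")
```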
Problems with the cloud approach
Two issues showed up quickly:
- Rate limits. The free tier of Gemini hammers you with 429s when you blast 81 requests. I added `--delay 3` to throttle, but that turns a quick batch job into a 4-minute wait.
- Model expiry. I had pinned `gemini-2.5-flash-preview-09-2025` in the URL. Preview models get retired. One day the script just 404'd. Lesson: use stable model names like `gemini-2.5-flash`, or better, make it a CLI flag.
I also added early exit on 401/403 so an expired API key doesn't burn through 5 retries per image before telling you what's wrong.
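The retry policy looks roughly like this — a sketch of the idea, not the script's exact code, written around a generic `call()` that returns an HTTP status and body:

```python
import time

def with_retry(call, retries: int = 5):
    """Run call() -> (status, body), retrying transient HTTP failures
    with exponential backoff but failing fast on auth errors."""
    for attempt in range(retries):
        status, body = call()
        if status in (401, 403):
            # Bad or expired API key: retrying will never fix this.
            raise SystemExit(f"auth error {status}: check your API key")
        if status == 429 or status >= 500:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
            continue
        if status >= 400:
            raise RuntimeError(f"HTTP {status}")
        return body
    raise RuntimeError(f"gave up after {retries} attempts")
```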
Going local with Ollama
The real fix for rate limits is to not use a cloud API at all. I have a machine (genos) with a 4090 sitting around. Ollama makes it trivial to serve models locally.
```shell
ollama pull qwen2.5vl:7b
```
I added an --ollama flag to the script. When set, it talks to the
Ollama API instead of Gemini:
```shell
python xotu-renamer.py ./scans ./output --ollama
```
It defaults to http://genos:11434 and qwen2.5vl:7b. No API key
needed, no rate limits, runs as fast as the GPU can go.
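The Ollama path hits the `/api/generate` endpoint with the image base64-encoded and `stream` disabled so you get a single JSON object back. A sketch (prompt wording and helper names are illustrative):

```python
import base64
import json
import urllib.request

PROMPT = ("Read the 8-digit date and the title in the top-left corner. "
          "Reply with exactly {date}-{title} and nothing else.")

def ollama_payload(image_bytes: bytes, model: str = "qwen2.5vl:7b") -> dict:
    # /api/generate takes base64-encoded images; stream=False returns
    # one JSON object instead of a token stream.
    return {
        "model": model,
        "prompt": PROMPT,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }

def ollama_extract(image_bytes: bytes, host: str = "http://genos:11434",
                   model: str = "qwen2.5vl:7b") -> str:
    data = json.dumps(ollama_payload(image_bytes, model)).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()
```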
One gotcha: Ollama binds to 127.0.0.1 by default. If you're calling
it from another machine you need to set OLLAMA_HOST=0.0.0.0 in the
systemd service and restart.
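For reference, the drop-in approach (assuming the stock `ollama.service` installed by the Linux install script):

```shell
# Add an override so ollama listens on all interfaces:
sudo systemctl edit ollama
# In the editor, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl restart ollama
```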
Model quality comparison
The 7B Qwen model worked but wasn't great at reading handwriting. Some titles came out garbled. Gemini was noticeably better here — not surprising given the size difference.
The sweet spot turned out to be gemma3:27b, Google's open model. It
fits on a 4090 with quantization and reads handwriting almost as well
as cloud Gemini:
```shell
python xotu-renamer.py ./scans ./output --ollama --model gemma3:27b
```
Slower per image (~10-15s vs ~3-5s for the 7B) but the accuracy is worth it for a batch job you run once.
What the script does now
- Reads scans from an input directory
- Calls either Gemini or a local Ollama model to extract date+title
- Creates a project directory per scan with `orig.png` and `main.png`
- Supports `--dry-run`, `--delay`, `--model`, and `--ollama`
- Fails fast on auth errors, retries with backoff on transient ones
- Prints per-image and total timing for benchmarking
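The timing is nothing fancy; a sketch of the loop (`process_one` is a stand-in name for whatever does the model call and file writes):

```python
import time

def process_all(scans, process_one):
    """Process each scan, printing per-image and total timing."""
    total_start = time.monotonic()  # monotonic clock: immune to wall-clock jumps
    for scan in scans:
        start = time.monotonic()
        process_one(scan)
        print(f"{scan}: {time.monotonic() - start:.1f}s")
    print(f"total: {time.monotonic() - total_start:.1f}s for {len(scans)} images")
```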
The whole thing is one Python file with no dependencies beyond
requests and Pillow. Good enough for a personal tool.
Takeaways
- Vision models are genuinely useful for this kind of boring extraction task. The prompt engineering is minimal.
- Pin stable model versions, not preview ones.
- Running models locally removes the most annoying constraint (rate limits) and is free. A 4090 handles 7B-27B vision models fine.
- Always add `--dry-run` to batch tools that create files. Future you will thank present you.