Using AI to organize sketchbook scans

I have a pile of scanned sketchbook pages from my xotu project. Each scan has a handwritten date and title in the corner. Organizing 81+ scans by hand is tedious, so I wrote a Python script that uses a vision model to read the handwriting and sort everything into named directories automatically.

The problem

The scans come out of the scanner as xotu-20241226_01.png, xotu-20241226_02.png, etc. Useless names. Each page has a date like 20250311 and a short title written in the top left corner. I want directories named 20250311-A new beginning/ with the original scan and a cropped version of the right half (the actual drawing).

First pass: Gemini API

The first version used Google's Gemini API. The prompt is simple: look at the image, extract the 8-digit date and the title text, return them as {date}-{text}. Gemini 2.5 Flash handled this well, reading even my messy handwriting accurately.
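A minimal sketch of that call, assuming the v1beta `generateContent` REST endpoint and an inline base64 image; the prompt wording and helper names here are illustrative, not the script's exact code:

```python
import base64
import requests

GEMINI_URL = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    "gemini-2.5-flash:generateContent"
)

PROMPT = (
    "Read the 8-digit date (YYYYMMDD) and the short handwritten title "
    "in the top-left corner. Reply with exactly: {date}-{text}"
)

def build_payload(image_bytes: bytes) -> dict:
    """Build the generateContent request body for one scanned page."""
    return {
        "contents": [{
            "parts": [
                {"text": PROMPT},
                {"inline_data": {
                    "mime_type": "image/png",
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                }},
            ],
        }],
    }

def extract_label(image_bytes: bytes, api_key: str) -> str:
    """Ask Gemini for the '{date}-{text}' label of one scan."""
    resp = requests.post(
        GEMINI_URL,
        params={"key": api_key},
        json=build_payload(image_bytes),
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["candidates"][0]["content"]["parts"][0]["text"].strip()
```

The whole "prompt engineering" is that one string: ask for a fixed output format, then use the reply directly as a directory name.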

python xotu-renamer.py ./scans ./output --dry-run

It copies the original as orig.png and crops the right half as main.png. The --dry-run flag lets you preview without writing anything. Non-destructive by design — originals are never touched.
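The copy-and-crop step is straightforward with Pillow. A sketch, where the `label` argument is assumed to be the model's `{date}-{text}` answer and the function names are mine, not the script's:

```python
import shutil
from pathlib import Path
from PIL import Image

def right_half_box(width: int, height: int) -> tuple:
    """Pillow crop box (left, upper, right, lower) for the right half of a page."""
    return (width // 2, 0, width, height)

def write_outputs(scan: Path, out_root: Path, label: str, dry_run: bool = False) -> Path:
    """Copy the original as orig.png and save the cropped right half as main.png."""
    dest = out_root / label              # e.g. "20250311-A new beginning"
    if dry_run:
        print(f"would create {dest}")    # preview only, nothing written
        return dest
    dest.mkdir(parents=True, exist_ok=True)
    shutil.copy2(scan, dest / "orig.png")        # original scan, untouched
    with Image.open(scan) as im:
        im.crop(right_half_box(*im.size)).save(dest / "main.png")
    return dest
```

Because everything is written into a fresh directory and the source file is only ever read, a bad model answer costs nothing but a badly named folder.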

Problems with the cloud approach

Two issues showed up quickly:

  1. Rate limits. The free tier of Gemini hammers you with 429s when you blast 81 requests at it. I added --delay 3 to throttle, but that turns a quick batch job into a four-minute wait.

  2. Model expiry. I had pinned gemini-2.5-flash-preview-09-2025 in the URL. Preview models get retired. One day the script just 404'd. Lesson: use stable model names like gemini-2.5-flash, or better, make it a CLI flag.

I also added early exit on 401/403 so an expired API key doesn't burn through 5 retries per image before telling you what's wrong.
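The retry logic boils down to: treat 401/403 as fatal, back off exponentially on 429 and 5xx, and raise on anything else. A sketch under those assumptions (the class and function names are illustrative):

```python
import time
import requests

class AuthError(RuntimeError):
    """401/403 -- retrying won't fix a bad API key, so bail out at once."""

def backoff_delays(retries: int = 5, base: float = 2.0) -> list:
    """Exponential backoff schedule: base, 2*base, 4*base, ..."""
    return [base * 2 ** i for i in range(retries)]

def post_with_retries(url: str, payload: dict, retries: int = 5) -> requests.Response:
    for delay in backoff_delays(retries):
        resp = requests.post(url, json=payload, timeout=60)
        if resp.status_code in (401, 403):
            raise AuthError(f"auth failed ({resp.status_code}): check your API key")
        if resp.status_code == 429 or resp.status_code >= 500:
            time.sleep(delay)            # transient: back off and retry
            continue
        resp.raise_for_status()          # any other 4xx is a genuine error
        return resp
    raise RuntimeError(f"gave up after {retries} attempts on {url}")
```

Without the fail-fast branch, a revoked key means five timed-out retries per image times 81 images before you see a useful error.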

Going local with Ollama

The real fix for rate limits is to not use a cloud API at all. I have a machine (genos) with a 4090 sitting around. Ollama makes it trivial to serve models locally.

ollama pull qwen2.5vl:7b

I added an --ollama flag to the script. When set, it talks to the Ollama API instead of Gemini:

python xotu-renamer.py ./scans ./output --ollama

It defaults to http://genos:11434 and qwen2.5vl:7b. No API key needed, no rate limits, runs as fast as the GPU can go.
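The Ollama path is even simpler than the Gemini one: a single POST to `/api/generate` with the image as a base64 string. A sketch, assuming the defaults above (prompt wording and helper names are mine):

```python
import base64
import requests

def build_ollama_body(image_bytes: bytes, model: str = "qwen2.5vl:7b") -> dict:
    """Request body for Ollama's /api/generate with one attached image."""
    return {
        "model": model,
        "prompt": "Extract the 8-digit date and the title; reply as {date}-{text}.",
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,   # one JSON object instead of a token stream
    }

def ollama_extract(image_bytes: bytes, host: str = "http://genos:11434",
                   model: str = "qwen2.5vl:7b") -> str:
    """Ask a local vision model for the '{date}-{text}' label."""
    resp = requests.post(f"{host}/api/generate",
                         json=build_ollama_body(image_bytes, model),
                         timeout=300)
    resp.raise_for_status()
    return resp.json()["response"].strip()
```

Setting `"stream": False` keeps the client trivial: no chunked parsing, just one JSON object with the answer in its `response` field.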

One gotcha: Ollama binds to 127.0.0.1 by default. If you're calling it from another machine you need to set OLLAMA_HOST=0.0.0.0 in the systemd service and restart.
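Assuming Ollama runs as a systemd service (the default for the official Linux install), the standard way to set that variable is a drop-in override rather than editing the unit file directly:

```shell
# Opens an editor for a drop-in override under
# /etc/systemd/system/ollama.service.d/ -- add these two lines:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl edit ollama
sudo systemctl restart ollama
```

After the restart, the server listens on all interfaces instead of just loopback.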

Model quality comparison

The 7B Qwen model worked but wasn't great at reading handwriting. Some titles came out garbled. Gemini was noticeably better here — not surprising given the size difference.

The sweet spot turned out to be gemma3:27b, Google's open model. It fits on a 4090 with quantization and reads handwriting almost as well as cloud Gemini:

python xotu-renamer.py ./scans ./output --ollama --model gemma3:27b

Slower per image (~10-15s vs ~3-5s for the 7B) but the accuracy is worth it for a batch job you run once.

What the script does now

  • Reads scans from an input directory
  • Calls either Gemini or a local Ollama model to extract date+title
  • Creates a project directory per scan with orig.png and main.png
  • Supports --dry-run, --delay, --model, and --ollama
  • Fails fast on auth errors, retries with backoff on transient ones
  • Prints per-image and total timing for benchmarking

The whole thing is one Python file with no dependencies beyond requests and Pillow. Good enough for a personal tool.

Takeaways

  • Vision models are genuinely useful for this kind of boring extraction task. The prompt engineering is minimal.
  • Pin stable model versions, not preview ones.
  • Running models locally removes the most annoying constraint (rate limits) and is free. A 4090 handles 7B-27B vision models fine.
  • Always add --dry-run to batch tools that create files. Future you will thank present you.