WFVZ Oological Transcription Pipeline

Western Foundation of Vertebrate Zoology · Field Notes Transcription Project

About this project

This work builds on a Cornell capstone project that explored automated transcription of oological card records from the Cornell University Museum of Vertebrates (CUMV). The current project extends that work to the Western Foundation of Vertebrate Zoology (WFVZ), with a focus on designing and refining a pipeline to transcribe the handwritten and typewritten field notes contained in egg record cards — making this information freely accessible to researchers, particularly for comparative studies on nest construction across species, geography, and time.

The WFVZ has already manually transcribed relevant specimen metadata and scanned card images for their collections. This project focuses specifically on the field notes, which contain rich ecological observations — nest materials, placement, height, substrate, incubation stage — that are not captured in standard museum database fields.

As of early April 2026, the CUMV team has processed nearly 3,000 records across three species: American Robin, Gray Catbird, and Northern Cardinal.

Pipeline overview

flowchart LR
    classDef main fill:#fff,stroke:#d8cfc4,stroke-width:1.5px,color:#2c2421
    classDef input fill:#e8f4f6,stroke:#2a7a8a,stroke-width:2px,color:#1a4a54
    classDef api fill:#fceaea,stroke:#8b1a1a,stroke-width:2px,color:#5a0a0a
    classDef review fill:#fceaea,stroke:#8b1a1a,stroke-width:3px,color:#5a0a0a,font-weight:bold
    classDef semantic fill:#f0eef8,stroke:#7060a8,stroke-width:2px,color:#3a2a68
    classDef output fill:#e8f5e8,stroke:#3a8a3a,stroke-width:2px,color:#1a4a1a
    classDef aside fill:#fdf4e3,stroke:#c07820,stroke-width:1.5px,color:#7a4a00,font-style:italic

    n1["📄 WFVZ inputs PDFs + Excel per species"]:::input
    n2["🔍 QC & extraction check PDFs, name images"]:::main
    n3["🖼 Image processing rotate, crop, brighten, resize"]:::main
    n4["🔗 Metadata matching link images to records"]:::main
    n5["🤖 GPT-4o transcription via API"]:::api
    n6["✏️ Review tool this application"]:::review
    n7["🔬 Semantic extraction Claude API"]:::semantic
    n8["🔭 Researcher UI browse, query & converse"]:::output
    n9["🌐 Public data GBIF · WFVZ database"]:::output
    x1["⚠ Set aside duplicates & low quality"]:::aside
    x2["⚠ Set aside empty / uninformative images"]:::aside
    x3["⚠ Set aside no matching metadata"]:::aside

    n1 --> n2 --> n3 --> n4 --> n5 --> n6 --> n7
    n7 --> n8
    n7 --> n9
    n2 -.->|rejected| x1
    n3 -.->|rejected| x2
    n4 -.->|rejected| x3

Triage categories

Each card set is automatically assigned a triage category based mainly on two signals: the hesitation score (see below) and the content of the transcribed field notes. Typewritten cards and card sets without log-probability data get their own categories.

Category	Condition	Meaning
Red	Hesitation score > 10	High model uncertainty — review strongly recommended. ~83% chance of transcription error.
Amber	Score 3–10	Moderate uncertainty — spot-check advised, especially for longer notes.
Green	Score 0–2	Low uncertainty — model was confident throughout. <10% error risk.
Typed	Card identified as typewritten	Typewritten cards are transcribed with near-perfect accuracy regardless of hesitation score.
No Transcript	Field notes blank or header-only	The transcribed field notes field is empty or contains only a label prefix with no real content.
Unknown	No logprob file found	No probability data was available for this card set — score could not be computed.
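The table can be read as a small decision function. The sketch below is illustrative only: the function name, its parameters, and the order in which the checks are applied are assumptions, since the source does not say which condition wins when several apply. It also treats only blank notes as "No Transcript"; detecting header-only notes would need the actual label prefixes, which are not given here.

```python
from typing import Optional


def triage_category(
    hesitation_score: Optional[int],
    field_notes: str,
    is_typed: bool = False,
) -> str:
    """Map a card set to a triage category using the table's thresholds.

    hesitation_score is None when no logprob file was found.
    The precedence of the checks below is an assumption.
    """
    if not field_notes.strip():
        return "No Transcript"  # blank notes (header-only check omitted)
    if is_typed:
        return "Typed"  # applies regardless of hesitation score
    if hesitation_score is None:
        return "Unknown"  # score could not be computed
    if hesitation_score > 10:
        return "Red"
    if hesitation_score >= 3:
        return "Amber"
    return "Green"
```

For instance, `triage_category(18, "Nest of grass and mud")` falls in the Red band, while the same notes with a score of 2 would be Green.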

Hesitation scores & log probabilities

When GPT-4o transcribes a card, the OpenAI API can return the model's log probabilities for each output token — a measure of how confident the model was in each word or character it chose. A logprob close to 0 means the model was very sure; a more negative value means it was considering alternatives.
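To make the scale concrete, a log probability can be converted back to an ordinary probability by exponentiating it. The helper below is a minimal sketch with a hypothetical name, assuming the values are natural logarithms, as the OpenAI API reports them.

```python
import math


def logprob_to_probability(logprob: float) -> float:
    """Convert a natural-log probability to a probability in (0, 1]."""
    return math.exp(logprob)


# A few illustrative values (not taken from real transcriptions).
for lp in (-0.01, -0.7, -3.0):
    print(f"logprob {lp} -> probability {logprob_to_probability(lp):.3f}")
```

These three values correspond to probabilities of roughly 0.99, 0.50, and 0.05, which shows how quickly confidence drops as the logprob moves away from 0.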

For each transcription we isolate the tokens that fall within the <transcribed_field_notes> span and compute a hesitation score: the count of content tokens where the difference between the log probabilities of the top-ranked and second-ranked tokens is less than 1.0, meaning the model was genuinely uncertain between two plausible readings.

Example: a hesitation score of 18 out of 45 tokens means that for 18 of the 45 content tokens in the field notes, the model rated a different character or word as a close competitor to the one it chose. These are the exact locations where human review adds the most value.
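The scoring rule described above can be sketched as follows. This is not the project's actual code: the input format (a list of token text paired with candidate log probabilities), the closing tag name, and the reading of "content tokens" as non-whitespace tokens are all assumptions made for illustration. With the OpenAI Chat Completions API, such data would typically come from a request made with `logprobs=True` and `top_logprobs=2`.

```python
from typing import List, Sequence, Tuple

OPEN_TAG = "<transcribed_field_notes>"
CLOSE_TAG = "</transcribed_field_notes>"  # assumed closing tag

# (token text, log probabilities of the candidates at that position)
Token = Tuple[str, Sequence[float]]


def hesitation_score(tokens: List[Token], threshold: float = 1.0) -> int:
    """Count content tokens in the field-notes span whose top two
    candidates are less than `threshold` apart in log probability."""
    text = "".join(tok for tok, _ in tokens)
    start = text.find(OPEN_TAG)
    if start == -1:
        return 0
    start += len(OPEN_TAG)
    end = text.find(CLOSE_TAG, start)
    if end == -1:
        return 0

    score = 0
    offset = 0
    for tok, candidates in tokens:
        tok_start, offset = offset, offset + len(tok)
        inside = tok_start >= start and offset <= end
        # Skip tokens outside the span, whitespace-only tokens, and
        # positions without at least two candidates to compare.
        if not inside or not tok.strip() or len(candidates) < 2:
            continue
        top, second = sorted(candidates, reverse=True)[:2]
        if top - second < threshold:
            score += 1
    return score


# Made-up example data, not from a real card.
tokens: List[Token] = [
    (OPEN_TAG, [-0.01, -5.0]),
    ("Nest", [-0.02, -4.1]),    # confident: gap well above 1.0
    (" ", [-0.1, -3.0]),        # whitespace, not counted
    ("of", [-0.6, -0.9]),       # hesitant: gap 0.3
    (" grass", [-0.7, -1.2]),   # hesitant: gap 0.5
    (CLOSE_TAG, [-0.01, -6.0]),
]
print(hesitation_score(tokens))  # 2
```

Locating the span by character offset, instead of by matching whole tokens, keeps the sketch working even if the tokenizer splits the tags across several tokens.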

The thresholds (green ≤ 2, amber ≤ 10, red > 10) were calibrated empirically against manually verified transcriptions from the CUMV capstone dataset and refined during the WFVZ processing runs.

Reference

Pasch, G. (2024). Automating Transcription of Oological Collection Records Using Large Language Models. Cornell University Master’s Capstone. hdl.handle.net/1813/118249