Evaluating AI Meeting Summaries with a Reusable Cross-Domain Pipeline

arXiv:2604.21345v2

Abstract: Industrial teams often deploy large language model features before a stable evaluation for regression testing or model selection exists. We present a reusable evaluation system for AI meeting summaries that combines structured ground-truth (GT) construction, fixed candidate generation, claim-grounded scoring, persisted reporting, and a privacy-bounded online monitoring and nomination interface. The online evidence is not itself a benchmark: privacy-safe aggregate exports show active monitoring, hard-regime detection, and directional movement without exposing customer data.

We benchmark the offline path on 114 meetings across city_council, private_data, and whitehouse_press_briefings, yielding 340 completed meeting-model pairs and 680 judge runs for gpt-4.1-mini, gpt-5-mini, and gpt-5.1. Under this fixed protocol, accuracy differences are not statistically significant under Holm correction (corrected p-values 0.053-0.448), although gpt-4.1-mini has the highest mean accuracy (0.583); the significant separation is on retention, where gpt-5.1 leads on completeness (0.886) and coverage (0.942). Typed slices isolate whitehouse_press_briefings as an accuracy-hard regime, and a later focused rerun over gpt-4.1, gpt-5-mini, and gpt-5.4 reuses the same stack under the same judges and metrics.

This extended preprint keeps those core results aligned with the formal submission while adding a more detailed repository-level account of cross-domain reuse from the companion AI-search paper and an additional typed DeepEval contrastive analysis.

Model naming note: Running text uses canonical model names on first introduction. Tables, filenames, and artifact IDs retain compact report labels for consistency with the packaged benchmark outputs. Table A maps the two conventions and is repeated in Section 4.3, where candidate-generation settings are defined.
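For readers unfamiliar with the Holm correction cited in the abstract, the following is a minimal sketch of the standard Holm-Bonferroni step-down adjustment. It is not the authors' implementation: the function name `holm_adjust` and the raw p-values in the usage lines are illustrative assumptions, and the abstract does not report raw p-values or the underlying test.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the same order as the input.

    Generic Holm-Bonferroni step-down procedure; a sketch, not the
    paper's code.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (0-based rank) is scaled by (m - rank).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three pairwise model comparisons.
raw = [0.01, 0.04, 0.03]
corrected = holm_adjust(raw)
significant = [p < 0.05 for p in corrected]
```

A comparison counts as significant at level 0.05 only if its adjusted p-value falls below 0.05, which is why a corrected range of 0.053-0.448 yields no significant accuracy difference.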
