Hello, I spent the last few months building an AI agent that autonomously writes Go code using local LLMs. The primary use case is log parser generation for SIEM pipelines.

A large part of the work turned out to be evaluation itself: how do you objectively measure whether a model is actually useful for autonomous coding tasks? So I built a harness that:

1. lets agents generate real Go parsers,
2. compiles the generated Go code,
3. validates the extracted fields and their types,
4. measures parsing quality against expected schemas, and
5. tracks throughput/speed over longer runs.

Given how quickly new open-weight models are being released, the comparative results are interesting. I published the first public version of the benchmark and methodology here: [link]. Feedback is very welcome.
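For anyone curious how steps 2-5 fit together, here is a minimal sketch of one evaluation case in Go. It assumes each generated parser is a standalone Go module that reads a raw log line on stdin and prints the extracted fields as JSON on stdout; the names (`CaseResult`, `expectedSchema`, `runCase`), the directory layout, and that stdin/stdout interface are illustrative assumptions, not the actual harness.

```go
// Hypothetical, stripped-down version of one evaluation case.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// CaseResult records the three things measured per case: whether the
// generated code compiled, how many expected fields matched by name and
// type, and how long the case took end to end.
type CaseResult struct {
	Compiled   bool
	FieldScore float64 // matched fields / expected fields
	Elapsed    time.Duration
}

// expectedSchema maps field names to the JSON type the parser should emit.
type expectedSchema map[string]string

// jsonType names the JSON type of a decoded value.
func jsonType(v any) string {
	switch v.(type) {
	case string:
		return "string"
	case float64:
		return "number"
	case bool:
		return "bool"
	default:
		return "other"
	}
}

// runCase compiles the generated parser in dir, feeds it one raw log line
// on stdin, and scores the JSON it prints against the expected schema.
func runCase(dir, rawLine string, want expectedSchema) (CaseResult, error) {
	start := time.Now()
	var res CaseResult

	// (2) Compile the generated Go code in its own directory.
	build := exec.Command("go", "build", "-o", "parser", ".")
	build.Dir = dir
	if out, err := build.CombinedOutput(); err != nil {
		return res, fmt.Errorf("build failed: %v\n%s", err, out)
	}
	res.Compiled = true

	// (3)+(4) Run the parser on a sample line and decode its JSON output.
	run := exec.Command("./parser")
	run.Dir = dir
	run.Stdin = strings.NewReader(rawLine)
	var stdout bytes.Buffer
	run.Stdout = &stdout
	if err := run.Run(); err != nil {
		return res, fmt.Errorf("parser exited with error: %v", err)
	}
	got := map[string]any{}
	if err := json.Unmarshal(stdout.Bytes(), &got); err != nil {
		return res, fmt.Errorf("output is not valid JSON: %v", err)
	}

	// Score: fraction of expected fields present with the expected type.
	matched := 0
	for name, typ := range want {
		if v, ok := got[name]; ok && jsonType(v) == typ {
			matched++
		}
	}
	res.FieldScore = float64(matched) / float64(len(want))

	// (5) Per-case wall-clock time feeds the throughput numbers.
	res.Elapsed = time.Since(start)
	return res, nil
}

func main() {
	want := expectedSchema{"src_ip": "string", "dst_port": "number", "action": "string"}
	raw := "2024-05-01T12:00:00Z ALLOW 10.0.0.5 -> 192.168.1.10:443"
	res, err := runCase("./generated/case_001", raw, want) // hypothetical path and sample line
	if err != nil {
		fmt.Println("case failed:", err)
		return
	}
	fmt.Printf("compiled=%v fieldScore=%.2f elapsed=%s\n", res.Compiled, res.FieldScore, res.Elapsed)
}
```

The actual harness presumably runs many such cases per model and aggregates the per-case scores and timings into the quality and throughput numbers mentioned above; this sketch is only meant to show the shape of a single case.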