llm-evaluation

Arize Agent Skills, claude-skills, Evals, evaluation framework, Evaluations, llm-evaluation, LLMs, model-evaluation, prompts, system prompt assembly

Models got an order of magnitude better at following instructions in one year

A year ago, frontier models started losing track of instructions somewhere around 200–300 simultaneous constraints. With 2026 models, that ceiling is closer to 2,000, roughly an order-of-magnitude jump. We re-ran IFScale to see how the models improved and how each one fails.

The post Models got an order of magnitude better at following instructions in one year appeared first on Arize AI.

evaluation framework, evaluation-driven development (EDD), Evaluations, llm-as-a-judge, llm-evaluation, prompt evaluation, regression testing for llms

Prompt templates as configs, not code

This post was written in April 2026. Cloud products, feature maturity, and recommended patterns change over time, so readers should treat these examples as directional guidance. For teams already using Arize, there is a natural extension of that pattern. Prompt Playground can sit upstream of the config layer as the place where prompts are edited, compared, and versioned before they are promoted into whatever config system the company already trusts in production.

The post Prompt templates as configs, not code appeared first on Arize AI.
