Here’s an interesting new coding benchmark based on lambda calculus. The results seem very realistic to me, since no LLM has been benchmaxxed on it yet.