Guess Llama – A game for local Vision LLM

I've been working on a project I call Guess Llama.

The concept is based on the old 'Guess Who?' game.

'Guess Llama' uses a vision-LLM backend, such as llama.cpp's llama-server, to run and play the game. Images are currently generated with stable-diffusion.cpp's sd-server or OpenRouter.ai image-generation models.

  1. You can enter any 'theme' for the game, or ask the bot to generate one. Such as 'cat', 'llama', 'capybara', 'clown', 'space alien', etc.
  2. The bot suggests 8 items that go with the theme (for image variation).
  3. The image server then generates 24 character images in that theme, each character drawn with 2 of the items.
  4. You and the bot are assigned a random character from that set.
  5. You and the bot ask each other yes/no questions until one of you narrows it down to one possible character and wins.
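The elimination step above boils down to filtering the candidate set by each yes/no answer. A minimal sketch (the `Character` class and item sets are hypothetical stand-ins; the actual project drives these steps through the vision LLM rather than structured data):

```python
# Minimal sketch of the Guess-Who-style elimination loop.
from dataclasses import dataclass

@dataclass
class Character:
    name: str
    items: frozenset  # the 2 theme items this character was drawn with

def eliminate(candidates, item, has_item):
    """Keep only characters consistent with a yes/no answer about an item."""
    return [c for c in candidates if (item in c.items) == has_item]

# Example: 4 characters instead of 24, asking about one item.
roster = [
    Character("A", frozenset({"hat", "scarf"})),
    Character("B", frozenset({"hat", "glasses"})),
    Character("C", frozenset({"scarf", "glasses"})),
    Character("D", frozenset({"bandanna", "hat"})),
]
secret = roster[2]  # the opponent's hidden character

# Ask: "Does your character wear a hat?" The answer comes from the secret.
answer = "hat" in secret.items            # False for character C
remaining = eliminate(roster, "hat", answer)
# remaining shrinks to [C]; one candidate left means the guesser can win.
```

In the real game the LLM's answers come from looking at the images, so they can be wrong, which is exactly the failure mode described below with the smaller model.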

The LLM backend actually looks at the images when deciding elimination questions, and looks at its own image when answering the player's elimination question.
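Since llama-server exposes an OpenAI-compatible `/v1/chat/completions` endpoint that accepts image content parts, presenting the whole board to the model might be assembled roughly like this (a sketch, not the project's actual code; the model name is a placeholder):

```python
import base64

def image_part(png_bytes):
    """Wrap raw PNG bytes as a base64 data-URL image content part."""
    b64 = base64.b64encode(png_bytes).decode()
    return {"type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"}}

def build_request(images, question, model="qwen-vl"):  # placeholder model name
    """Assemble an OpenAI-style chat request: many images plus one text part."""
    content = [image_part(img) for img in images]
    content.append({"type": "text", "text": question})
    return {"model": model,
            "messages": [{"role": "user", "content": content}]}

# The resulting dict would be POSTed as JSON to the backend's
# /v1/chat/completions endpoint (e.g. http://localhost:8080 for llama-server).
```

Because OpenRouter speaks the same chat format, the same payload shape should work against either backend.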

Qwen3.5 has been doing great at playing the game. I'm surprised I pulled a win for the example video without cheating. When Qwen3.5 asked me about my capybara's red bandanna I thought it was going to be over.

A smaller Gemma4 seemed to get a bit confused, though I didn't test it extensively. For example, it once eliminated my character erroneously even though I answered its question correctly.

I've been using Z-Image-Turbo for local images. It's actually a benefit if the image model has difficulty making the same character twice. We want variation.

With thinking/reasoning it can take a long time for the bot to generate a response. Even using OpenRouter as a backend to speed up testing takes a while.

The context used is around 6.2K tokens when 23 512x512 images are presented to the bot.
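As a back-of-envelope check on that figure (the tokens-per-image and prompt-token numbers below are assumptions for illustration; actual vision-token counts depend on the model's image encoder and resolution):

```python
# Rough context-budget arithmetic for presenting the board to the model.
n_images = 23
tokens_per_image = 256      # assumed cost of one 512x512 image
text_prompt_tokens = 300    # assumed instructions + question text

image_tokens = n_images * tokens_per_image   # 5888
total = image_tokens + text_prompt_tokens    # ~6.2K, in line with the observed count
```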

  • Only tested on llama-server & openrouter. Other backends like LMStudio should work.
  • Only tested on Linux. The github workflows say it should compile on MacOS & Windows.
  • Can potentially add other image backends. stable-diffusion.cpp & openrouter seemed like the easiest to implement.
  • You can use the supplied 'Cat' theme if you want to test this without waiting for images to generate.
  • Primarily tested with Qwen3.5, but any vision model that can take in an arbitrary number of images (23) should be able to play.
  • There's no prompt caching; the tokens are reprocessed on every request.

Using OpenRouter's black-forest-labs/flux.2-klein-4b to generate images currently costs about $0.017 per image, if you don't want to generate them locally, or roughly $0.41 per image set. If you play against OpenRouter's qwen/qwen3.5-122b-a10b, it can cost up to $0.02 per interaction. (Each round involves multiple interactions: generating a question, eliminating characters based on the answer, etc.)
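The per-set figure follows directly from the per-image price; the per-round figure below assumes three interactions per round, which is an illustrative guess:

```python
# Cost arithmetic from the prices quoted above.
per_image = 0.017
images_per_set = 24
set_cost = per_image * images_per_set         # 0.408 -> about $0.41 per set

per_interaction = 0.02       # upper bound quoted above
interactions_per_round = 3   # assumption: ask, answer, eliminate
round_cost = per_interaction * interactions_per_round   # up to ~$0.06 per round
```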

This seemed like the lowest hanging fruit for a vision based LLM game.

submitted by /u/SM8085
