FYI: Step 3.5 Flash has better perf in llama.cpp now, and context is 1/4 the price

So I recently updated LM Studio after a long pause and updated my llama.cpp runtimes too, and I was shocked. I thought maybe something like turboquant was enabled by default, but it turns out support for this model just got way better.

Step 3.5 Flash now slows down ~2.5x less as you load the context up, and uses 1/4 the memory for context!
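A 4x drop in context memory is the kind of thing you get when llama.cpp picks up proper support for a model's attention layout (e.g. fewer KV heads actually stored). As a rough illustration of how KV-cache size scales with those dimensions, here's a sketch; all the layer/head numbers are placeholders, not Step 3.5 Flash's real config:

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Two tensors per layer (K and V), one entry per context position.
    # bytes_per_elem=2 assumes f16 cache.
    return 2 * n_ctx * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical dimensions, purely for illustration:
full = kv_cache_bytes(170_000, n_layers=48, n_kv_heads=32, head_dim=128)
gqa  = kv_cache_bytes(170_000, n_layers=48, n_kv_heads=8,  head_dim=128)
print(f"32 KV heads: {full / 2**30:.1f} GiB")
print(f" 8 KV heads: {gqa / 2**30:.1f} GiB ({full / gqa:.0f}x smaller)")
```

Cutting the stored KV heads 4x cuts the cache 4x, which matches the kind of saving described here, whatever the actual mechanism in this case is.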

On a mildly OC'd 5090 + RTX PRO 6000 over x8, I see this with IQ4_NL:
first prompt = 125 tokens/sec
170k context = 75 tokens/sec

Previously it was:
first prompt = 125 tokens/sec
96k context = 45 tokens/sec
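Taking the numbers above at face value and assuming roughly linear degradation, the throughput lost per 1k tokens of context works out to close to the claimed improvement:

```python
# Throughput degradation per 1k tokens of context, from the numbers above.
old_loss_per_1k = (125 - 45) / 96    # old build: tok/s lost per 1k ctx
new_loss_per_1k = (125 - 75) / 170   # new build: tok/s lost per 1k ctx
print(f"old: {old_loss_per_1k:.2f} tok/s per 1k ctx")
print(f"new: {new_loss_per_1k:.2f} tok/s per 1k ctx")
print(f"improvement: {old_loss_per_1k / new_loss_per_1k:.1f}x")
```

That comes out around 2.8x less slowdown per unit of context, in the same ballpark as the ~2.5x figure (the linear assumption is crude, since the two runs also stop at different context depths).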

With context memory now 4x cheaper, I can run Q4_K_L and still get up to 220k context, if I'm okay with ~10% less perf. Or I can set up parallel requests :)
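For the parallel-requests option, a llama.cpp server launch along these lines would do it; treat this as a sketch (the model filename and context size are illustrative, and note that `-c` is the total KV cache, which `--parallel` splits evenly across request slots):

```shell
# Illustrative invocation, not a canonical config.
# -ngl 99 offloads all layers to the GPUs; -c is the TOTAL context,
# so --parallel 2 gives each slot roughly half of it.
llama-server \
  -m step-3.5-flash-Q4_K_L.gguf \
  -ngl 99 \
  -c 221184 \
  --parallel 2
```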

Step 3.5 Flash is now way more useful with agents, Cline, and other orchestrators that gobble up context.

submitted by /u/mr_zerolith
