Balancing per-user speed (tokens/second delivered to each user) against throughput (total tokens/second produced by an AI server) is one of the many challenges enterprises face when deploying AI agents in production cost-efficiently and at scale.
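The trade-off can be sketched with hypothetical numbers (the function and the throughput profile below are illustrative assumptions, not measurements from any real server): batching more concurrent requests raises aggregate throughput, but each user then sees a smaller share of it.

```python
# Illustrative sketch of the speed-vs-throughput trade-off.
# All numbers are hypothetical; real servers have more complex scaling behavior.

def per_user_speed(total_throughput_tps: float, concurrent_users: int) -> float:
    """Tokens/second seen by each user when the server's output is shared equally."""
    return total_throughput_tps / concurrent_users

# Hypothetical batch-size profile: larger batches improve aggregate throughput
# (better hardware utilization) but with diminishing returns.
profiles = {1: 120.0, 8: 640.0, 32: 1600.0, 128: 3200.0}

for users, throughput in profiles.items():
    print(f"{users:>4} users: {throughput:>7.0f} tok/s total, "
          f"{per_user_speed(throughput, users):6.1f} tok/s per user")
```

In this sketch, going from 1 to 128 concurrent users multiplies server throughput by roughly 27x while cutting per-user speed from 120 to 25 tokens/second, which is the tension agentic workloads expose.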
While GPUs powered the first wave of AI, they eventually hit the "Agentic Wall": the point at which they can no longer sustain the per-request token speeds that complex reasoning loops demand for near real-time agentic use cases, especially on larger models such as DeepSeek.