I'm working on a solo project to create a "Live AI Tutor" for digital artists and 3D modelers.
The idea is to integrate a multi-modal LLM (like Gemini) into Discord so it can participate in a voice channel and watch a screen share. Imagine you're sculpting in Blender or drawing in Photoshop, and you can just ask out loud, "Hey, what do you think of the anatomy here?" and the AI responds instantly through voice, having seen your current progress.
Current Workflow Plan:
- Audio: Discord Voice Receive -> Whisper STT -> LLM -> TTS -> Discord Voice Send.
- Visual: Since the Discord Bot API has limitations on video streams, I'm looking into automated screen captures timed to the user's voice prompts.
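To make the plan more concrete, here is a rough Python sketch of how the two paths might fit together. Everything in it is an assumption for illustration: `FrameBuffer`, `TutorPipeline`, and the `transcribe`, `ask_llm`, and `synthesize` callables are hypothetical placeholders for Whisper, the multi-modal LLM, and a TTS engine, not real library APIs. The Discord voice receive/send side and the actual screenshot capture are left out entirely.

```python
import bisect
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class FrameBuffer:
    """Keeps recent screenshots so a prompt can be paired with what was on screen."""
    max_frames: int = 30
    _times: List[float] = field(default_factory=list)
    _frames: List[bytes] = field(default_factory=list)

    def add(self, timestamp: float, frame: bytes) -> None:
        # Assumes frames arrive in increasing timestamp order.
        self._times.append(timestamp)
        self._frames.append(frame)
        if len(self._times) > self.max_frames:
            # Drop the oldest frame to bound memory use.
            self._times.pop(0)
            self._frames.pop(0)

    def frame_at(self, timestamp: float) -> Optional[bytes]:
        # Return the latest frame captured at or before the given time.
        i = bisect.bisect_right(self._times, timestamp)
        return self._frames[i - 1] if i else None


@dataclass
class TutorPipeline:
    # Hypothetical stage callables; swap in real STT / LLM / TTS clients.
    transcribe: Callable[[bytes], str]
    ask_llm: Callable[[str, Optional[bytes]], str]
    synthesize: Callable[[str], bytes]
    frames: FrameBuffer

    def handle_utterance(self, audio: bytes, spoken_at: float) -> Optional[bytes]:
        prompt = self.transcribe(audio).strip()
        if not prompt:
            return None  # silence or noise: nothing to answer
        # Pair the question with the screen as it looked when the user spoke.
        frame = self.frames.frame_at(spoken_at)
        reply = self.ask_llm(prompt, frame)
        return self.synthesize(reply)  # audio bytes to play back in the channel
```

Keeping a small buffer of timestamped frames, rather than grabbing a screenshot after transcription finishes, means the LLM sees the canvas as it was when the question was asked, even if STT takes a moment.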
I think this could be a game-changer for solo creators who want immediate, intelligent feedback without leaving their workflow.
What do you guys think? Is the Discord API too restrictive for this, or are there clever workarounds you've seen for real-time video analysis?