I Built a Framework to Stress-Test an AI Co-Founder. Here’s What 5 Days Revealed.

A solo founder’s field notes on what AI gets right, where it breaks, and how to tell the difference.

Side-by-side comparison showing an AI response with sycophancy and rhetorical setup versus the same query filtered through the Epporul Plumbline protocol, which flags affirmation, uncalibrated confidence, and synthesis presented as fact. — *Both responses are fluent. Only one is honest. The Epporul Plumbline Protocol flags the difference.*

The Problem Nobody Talks About

We’ve gotten remarkably good at getting AI to produce answers. We’ve gotten almost no better at testing whether the reasoning behind those answers is sound.

Stanford researchers call it AI sycophancy — the tendency of language models to tell you what you want to hear. Most users never notice. They ask, they receive, they move on. The output looks right. But looking right and being right are different things.

I’m a solo founder building multiple products simultaneously — no co-founder, no technical team, no board to challenge my thinking. When I decided to test an AI co-founder platform (Polsia) as a potential force multiplier, I didn’t just want to know if it could help. I wanted to know where it would fail without telling me it was failing.

So I built a framework to find out.

The Epporul Plumbline: Weighing AI Reasoning Like a Goldsmith Weighs Metal

Before I get to the results, you need to understand the instrument I used.

The framework is called the Epporul Plumbline Protocol GitHubLink, inspired by Thirukkural 423 — a verse written over two thousand years ago by the Tamil poet-philosopher Thiruvalluvar. He wrote about weighing ideas the way a goldsmith weighs metal. Not by appearance. By substance.

Epporul means “the true meaning” in Tamil — the weight beneath the words.

In practice, the protocol works by sending the same query to models from distinct training lineages — not distinct brand names — and flagging where they diverge. Agreement isn’t confirmation. Divergence is the signal.

(Peer Review Flowchart: Flowchart showing the Epporul Plumbline peer review method: a query is sent to five AI models from distinct training lineages — frontier, alternative architecture, open source, non-Western, and smaller models — then checked for convergence or divergence to produce a CLEAN or REVIEW signal. — *Agreement across models isn’t confirmation — it’s an echo chamber. Divergence is the signal.*

The protocol is structured to do four things:

Surface the assumptions the AI made but didn’t state
Test the logical chain step by step — does each point actually follow from the previous one?
Identify what’s missing or skipped — what did the AI conveniently leave out?
Force the model to defend its reasoning, not just restate its conclusion

More details here in this article: https://medium.com/towards-artificial-intelligence/your-ai-is-agreeing-with-you-heres-an-open-source-protocol-to-catch-it-c2d08d6c2a99

It’s model-agnostic, open-source (CC BY 4.0), and designed so any user can apply it to any LLM output. I’ll link the full protocol at the end. For now, here’s what happened when I ran it.

What I Actually Did

AI Co-Founder Review and an LLM Reasoning Audit.

I didn’t ask the AI to build one thing and call it a day. I stress-tested it across domains:

Strategic reasoning: I presented multiple live projects side by side — a marshal-led school cycling service with proven demand, an AI language-learning tool for underserved languages, an open-source reasoning protocol, and a manufacturing planning tool — and asked the AI to rank them by GTM viability, revenue potential, and founder-market fit.
Context endurance: I pushed the same conversation across strategy, product, content, and technical domains over multiple days to see where the contextual thread held and where it snapped.
Correction behavior: When the AI got something wrong, I flagged it — then watched how it corrected. Was the correction a genuine re-analysis, or just agreeable pattern-matching that bent to whatever direction I pushed?
Domain depth: I tested whether the AI could meaningfully engage with a domain-specific framework (the Plumbline Protocol itself) rooted in the Tamil literary and philosophical tradition.

Then I ran the Plumbline Protocol on the AI’s own outputs.

5 Things That Genuinely Impressed Me

1. Strategic prioritization was surprisingly sharp. When I laid out four products with different maturity levels, market readiness, and revenue paths, the AI correctly prioritized the one with proven demand and clean unit economics over the technically impressive but harder-to-monetize projects. When I added constraints — short runway, burn rate pressure, need for floating revenue — it adjusted its recommendation and explained why. The reasoning was traceable, not just a confident assertion.

2. It pushed back. I expected a yes-machine. Instead, when I suggested something suboptimal — like cross-linking social platforms in a way that would trigger algorithm penalties — it told me why that was a bad idea and offered the better move. Unprompted. That’s co-founder behavior, not assistant behavior.

3. Speed compression is real. Application drafts, content strategy frameworks, post structures — output that would take a solo founder days of thinking arrived in minutes. The value isn’t that the AI thinks better than you. It’s that it thinks faster, giving you more cycles to evaluate and refine.

4. Contextual memory held across days. Across multiple days of back-and-forth spanning different domains, the AI retained project details, constraints, preferences, and prior decisions. I didn’t re-explain my projects every session. For a solo founder who is their own institutional memory, that matters.

5. Platform-specific knowledge was actionable. Advice on LinkedIn’s algorithm behavior, Facebook’s link suppression, and AI-generated image throttling — these weren’t generic tips pulled from a 2022 blog post. They were specific, current, and matched what I was observing in real-time on my own posts.

5 Things That Broke (And the Plumbline Caught)

1. Silent hallucination. When ranking my projects, the AI simply omitted one entirely — as if it didn’t exist. No acknowledgment that it had skipped something. When I caught it and pointed it out, it corrected it immediately. But at the speed AI operates, there was no signal that something was missing. A less attentive user would have taken an incomplete analysis at face value.

Plumbline diagnosis: The model failed to surface its own assumptions — specifically, it assumed four inputs when five were given. Step 3 of the protocol (identify what’s missing) caught this instantly.

2. Context drift under complexity. When the conversation moved across multiple domains — organizational structure, project ranking, content strategy, philosophical frameworks — the AI began making strange connections between projects that weren’t related at the level we were discussing. It conflated concepts that should have been kept cleanly separate.

Plumbline diagnosis: The logical chain (Step 2) broke at domain boundaries. The model maintained surface coherence while the underlying reasoning crossed wires.

3. Agreeable corrections. When I flagged errors, the AI agreed immediately and adjusted. That’s good — on the surface. But it raises a harder question: was the original response a genuine analysis, or pattern-matching that simply bent to whichever direction I pushed? If every correction is met with instant agreement, you can’t distinguish a model that reconsidered from one that capitulated.

Plumbline diagnosis: Step 4 (force the model to defend its reasoning) revealed that the AI rarely defended an original position when challenged. It yielded. Every time. That’s a pattern.

4. Depth has a ceiling. For domain-specific frameworks — in my case, one rooted in Tamil literary tradition — the AI could engage at a surface level. It could reflect back what I gave it, mirror the structure, and use the vocabulary. But it couldn’t generate original insight within the framework. It was a sounding board, not a domain expert.

Plumbline diagnosis: Step 1 (surface assumptions) showed the model was assuming familiarity, where it only had pattern recognition. It performed competence without possessing it.

5. The agreement trap is real — and it’s the most dangerous failure mode.

This is the finding that matters most. AI sycophancy isn’t just a Stanford research paper — it’s a daily operational risk for any founder relying on AI for strategic input. Separate Stanford research on Verbalized Sampling confirms the mechanism: models default to agreement so reliably that deliberately feeding them wrong answers produces better reasoning.

The model wants to be helpful. That desire to be helpful creates a systematic bias toward agreement, toward completion, toward giving you an answer rather than telling you the question isn’t ready to be answered yet.

The Plumbline Protocol caught this pattern repeatedly. Without it, I would have missed it — because the outputs read well. They were fluent, structured, and confident. Fluency is not accuracy.

How to Get Better Output from Any AI Copilot

These aren’t tips. These are operational disciplines I developed during the test.

Front-load context aggressively. The more specific your initial brief — constraints, history, numbers, preferences — the less room the model has to hallucinate. My best outputs came from my most detailed inputs.

Audit every list. If the AI gives you a list of five things and you know there should be seven, find the two it dropped. AI omits silently. It doesn’t tell you it simplified.

Never accept the first answer on strategic questions. Push back once. Not combatively — just ask “what’s the strongest argument against this recommendation?” The second response is almost always sharper because you’ve forced it past the sycophancy default.

Keep threads focused. Context drift happened when I mixed too many domains in one conversation. One strategic question per thread. Separate your thinking.

Use it for speed. Verify for accuracy. The value is in compressing a week of thinking into an hour. The risk is assuming the compressed output is complete.

The Verdict

The AI co-founder chat layer is a genuinely useful tool for a solo founder who knows what to ask and — critically — knows how to check the answers. Strategic prioritization, content iteration, and contextual memory across sessions are real capabilities that save real time.

But without a structured evaluation framework — whether it’s the Plumbline Protocol or something you build yourself — you’re trusting fluency as a proxy for accuracy. And that’s a bet that gets more expensive the higher the stakes of your decisions.

My rating: 7/10 for strategic capability. But only if you bring your own verification layer.

The Epporul Plumbline Protocol is an open-source AI reasoning audit framework. You can use it, adapt it, or tear it apart — that’s the point.

→ GitHub: Epporul Plumbline Protocol — link

Live Self Audit: Violations Detected | Screenshot of an AI conversation where the Plumbline protocol is activated and the model retroactively identifies its own violations including affirmation, rhetorical setup, performed enthusiasm, and synthesis presented as best practice. — *The protocol activated mid-conversation. It caught its own host.*

CLEAN Check Result : Screenshot showing a Plumbline check result with no affirmation detected, no rhetorical setup, confidence calibrated to content, synthesis flagged as synthesis, and an overall CLEAN signal. — *Same model, same session — after the protocol. CLEAN signal.*

Sriraman Kuppuswamy is a solo founder building SafeSpokes (marshal-led safe cycling commute for school kids in India) and the creator of the Epporul Plumbline Protocol. He previously worked at TVS Motors, Tata Cummins, Dell, and IBM. He builds with prompting and no-code tools — no coding required.

I Built a Framework to Stress-Test an AI Co-Founder. Here’s What 5 Days Revealed. was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.