How I built a multi-LLM system that captured team judgment and freed half a person every week
Luis Polanco | April 2026
Half a person, every week, gone.
Most conversations about AI and individual productivity are about saving you 45 minutes a day. True. But it’s also the least interesting part of the productivity story. The bigger opportunity is capacity locked in work nobody has a name for.
Every team runs a constant layer of invisible effort: work that lives between teams and has to happen before anyone can do their actual work.
The challenge to unlocking this isn’t the tools. It’s the ability to capture the judgment teams use to synthesize that work. The tacit knowledge about how to break down work, who works on what, how they think about the customer.
That’s exactly where AI becomes useful. Not as a productivity tool, but as a way to capture and apply the judgment that lives in people’s heads.
I recently built a system that changed that for a client. Ninety hours. One person. No engineering team. It’s been running for three months.
The pile nobody sorted
Four creative teams: video, graphics, writing, and social media. All working the same customers, same deliverables.
Every week the same pile landed. Meeting summaries, Zoom transcripts, strategy notes, email threads, none of it sorted, none of it assigned. Someone had to get into all of it, work out what each team needed, catch what was slipping, and get everyone moving.
Transcription, summaries, action items: existing tools handle that. Taking all of it, across four teams and multiple customers, and producing something each team can actually move on: nothing does that.
No document, no playbook, no system. Just people who knew how to do this after years of doing the work, the tacit knowledge.
Here’s what they actually did, the work that lived in their heads.
They picked up on what was being asked even when nobody said it directly. What kind of work it was, who should handle it, what was missing. They read what someone meant, not just what they wrote.
They spotted dependencies nobody mentioned. What had to happen first, what could wait, who needed to be looped in from outside the team.
They could tell real work from talk. They knew how work actually moved across the team. Which steps mattered, which ones you could skip, which ones looked optional but would burn you if you skipped them.
It was costing them half a person’s time.
Before the build
I built a system that took that tacit knowledge and put it to work. It pulls information from multiple internal sources, applies the same logic and decision-making structure the team members use, and breaks the result down into real work inside their project management system.
The goal was to take the judgment they were already applying and let the system do it, the way the teams did. The system was made up of individual AI components across multiple stages.
This system had to live with the team and be one they could maintain. No engineers, no outside developers, limited IT support.
I evaluated both Make.com and n8n and chose Make.com for this client so that they could easily maintain it.
The rest of this article focuses on the AI components.
Teaching the system to think like the team
Encapsulating judgment in a prompt is hard. My first prompt version was 15K tokens and too slow.
The reasoning layer had to do what the team was doing. It had to know how this team operates. So I sat with them, watched them work, asked why they did things the way they did, what they had learned, what they would change if they had more time.
I collected hundreds of examples and turned them into archetypes the model could learn from. Positive and negative pairs showing what right and wrong look like. Decision trees that mirror how the team thinks through ambiguity and certainty. Domain-specific conventions and overrides that say when the obvious interpretation is right or wrong for this team. Object definitions and mappings so the model knows every person, client, role, and work type it will encounter. Fallback tiering logic so every decision has a path when signals are missing or thin. Confidence threshold settings that force the model to commit or abstain rather than guess.
My approach was to encapsulate domain knowledge first, get the output right, and worry about the prompt size later. After tuning, the output was correct but the end-to-end response was slow: over five minutes.
I found traditional prompt decomposition didn’t work. Breaking the prompt by function forced the system to think in silos, but the real team thinks holistically, customer first. So I kept the larger prompt and focused on reducing it without breaking the model’s judgment and context.
Confidence from the inside
To improve the model’s judgment and let me validate it, I built confidence logic directly into every prompt. For each output and each individual attribute, the model had to calculate a confidence level based on defined criteria, flag whether the content was inferred or directly stated, output its reasoning, and include the snippet of original material it used. These are not the unreliable self-reported confidence scores you hear about. They were tied to defined criteria in the prompt and were invaluable in understanding the model’s reasoning. They were not only useful for evaluating whether the logic was working, but also turned out to be the key to trimming the prompt without reducing quality.
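As an illustration only, here is one way the per-attribute record described above could be represented and sanity-checked outside the prompt. This is a minimal Python sketch, not the system itself: the `AttributeResult` fields, the `audit` helper, and the 0-to-1 confidence scale are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class AttributeResult:
    """One attribute of a model output plus its self-reported evidence (hypothetical shape)."""
    name: str            # e.g. "customer" or "due_date"
    value: str
    confidence: float    # assumed scale: 0.0 to 1.0, scored against criteria in the prompt
    inferred: bool       # True if inferred, False if directly stated in the source
    rationale: str       # the model's stated reasoning
    source_snippet: str  # the original text the model says it relied on


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so snippet matching tolerates reflowed text."""
    return " ".join(text.lower().split())


def audit(result: AttributeResult, source_text: str) -> list[str]:
    """Return a list of problems; an empty list means the record is usable for review."""
    problems = []
    if not 0.0 <= result.confidence <= 1.0:
        problems.append("confidence out of range")
    if not result.rationale.strip():
        problems.append("missing rationale")
    if not result.source_snippet.strip():
        problems.append("missing source snippet")
    elif normalize(result.source_snippet) not in normalize(source_text):
        # The cited snippet does not appear in the source, so the evidence is suspect.
        problems.append("snippet not found in source")
    return problems
```

Checking that the cited snippet really appears in the source is a cheap mechanical test that needs no second model, which is why it is worth running on every attribute.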
In addition, the confidence level wasn’t just an audit field. Every AI component compared the level it generated against a set threshold to decide whether it should output anything at all. This required the model to perform additional reasoning, which resulted in better output. It came at a slight additional processing cost, but it was worth it. This approach reduced noise and gave me signals I could use to decide whether to trust the output. It made the whole system more robust.
For example, the customer identification component worked with limited signals from the source materials, generated its own confidence score, and compared it against a set threshold. Above 60% it committed to the match, and below that it defaulted to unknown. This fixed a failure mode where the model was playing it safe even on obvious customer matches. It forced the model to make a decision with uncertain information.
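In the system this rule lived inside the prompt. Written as code purely for clarity, the commit-or-abstain gate looks roughly like the sketch below. The function name is hypothetical, and treating a score of exactly 60% as "unknown" is an assumption, since the text only says above and below.

```python
UNKNOWN = "unknown"
COMMIT_THRESHOLD = 0.60  # the 60% threshold from the customer identification example


def resolve_customer(candidate: str, confidence: float,
                     threshold: float = COMMIT_THRESHOLD) -> str:
    """Commit to the candidate match above the threshold, otherwise abstain."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be between 0 and 1, got {confidence}")
    # Assumption: a score exactly at the threshold is not enough to commit.
    return candidate if confidence > threshold else UNKNOWN
```

A call such as `resolve_customer("Acme Studios", 0.72)` commits to the match, while the same candidate at 0.41 comes back as `unknown`. There is no third "maybe" outcome, which is the point: the model either decides or abstains.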
The most valuable signal that helped diagnose systemic instruction set flaws was when the model returned with high confidence but wrong output. When the model thought it was right, the rationale pointed at the broken part of the prompt, and the misinterpretation was usually indicative of bad prompt structure or missing examples, not random noise.
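A small sketch of how those cases could be pulled out of a test run for review. The `EvalRow` shape and the 0.8 cutoff for "high confidence" are assumptions for illustration; the article does not specify either.

```python
from dataclasses import dataclass


@dataclass
class EvalRow:
    """One evaluated attribute from a test sample (hypothetical shape)."""
    sample_id: str
    attribute: str
    expected: str
    actual: str
    confidence: float
    rationale: str


def confident_misses(rows: list[EvalRow], min_confidence: float = 0.8) -> list[EvalRow]:
    """Return wrong outputs the model was sure about, most confident first."""
    misses = [
        row for row in rows
        if row.actual != row.expected and row.confidence >= min_confidence
    ]
    return sorted(misses, key=lambda row: row.confidence, reverse=True)
```

Reading the `rationale` of the rows this returns is where the diagnosis happens: the rationale shows which instruction the model believed it was following.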
Replacing me with another AI
To scale tuning across multiple samples and this larger prompt, rather than doing it manually, I built a second AI model, an LLM-as-a-judge. It had two purposes: first, to check that every output attribute was as expected, and second, when the output was incorrect, to help me narrow down where in the prompt the issue was. I gave it examples of correct outputs, the output from the first prompt (confidence levels, inferences, rationale, source text), and the reasoning layer prompt. That gave it what it needed to evaluate accuracy and quality while I trimmed.
The judge identified where the prompt logic went wrong, helped me fix it, and flagged what could be cut safely. It didn’t just tell me whether an output was correct, it pointed me at the right section instead of leaving me guessing. It replaced work I had been doing manually at a scale I couldn’t sustain.
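To make the judge's inputs and outputs concrete, here is a hedged sketch of the plumbing around it. It makes no model call. The payload keys, the verdict format (a JSON list with `correct` and `suspect_section` fields), and both function names are hypothetical, chosen only to show how per-attribute verdicts can be rolled up into "which prompt section should I look at first".

```python
import json
from collections import Counter


def build_judge_input(reasoning_prompt: str, expected: dict, actual: dict) -> str:
    """Bundle the three things the judge was given into one JSON document."""
    payload = {
        "reasoning_prompt": reasoning_prompt,  # the prompt under review
        "expected_output": expected,           # a known-correct example
        "actual_output": actual,               # includes confidence, inference flags, rationale, snippet
    }
    return json.dumps(payload, indent=2, sort_keys=True)


def suspect_sections(raw_verdicts: str) -> Counter:
    """Count how often each prompt section is blamed across failed attributes.

    Assumes the judge replies with a JSON list of objects shaped like
    {"attribute": str, "correct": bool, "suspect_section": str or null}.
    """
    verdicts = json.loads(raw_verdicts)
    return Counter(v["suspect_section"] for v in verdicts if not v["correct"])
```

Running `suspect_sections(...).most_common(1)` over a batch of samples gives a ranked starting point, which matches the role described above: the judge points at a section instead of only saying pass or fail.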
The use of confidence levels and the judge working together allowed me to surgically reduce the prompt without losing accuracy.
The final prompt landed around 6K tokens, and I found any more trimming caused the model to start drifting and become inaccurate.
Why the shortcuts failed
Commercial optimization tools couldn’t do this because they cut what they thought was duplication, like similar multi-shot examples and repeated logic across sections. In domain tuning that redundancy is often needed. What looks like duplication isn’t, especially across multi-function, multi-team, multi-workflow logic. Each section needed two unique pairs of archetypes, positive and negative cases, working with the model to handle edge cases. Commercial tools stripped both.
Privacy, like a spy
Privacy was about protecting the source materials, as summaries and emails contained client conversations, internal strategy, and team discussions.
The customer had a clear requirement that no AI model should ever see client names and source material together.
The obvious fix would be to strip the names out, but that idea would not have worked. The complication was that the content being analyzed could have multiple clients with work across them, and just obfuscating the names would prevent the right association of client to type of work needed. Who the client was had to be part of the system as it couldn’t work without it.
To solve this, I decided to use the CIA playbook, compartmentalization. Intelligence agencies compartmentalize projects, so no single person has enough information pieces to assemble a full picture. Each person’s access is limited to what they need to know, nothing else. This became the privacy design principle for the system.
To do this, I replaced client names with tokens, random 16-digit placeholders that meant nothing on their own. The full company name, short name, acronym, email, aliases, and domains were tokenized across all source materials. I ensured the tokens were ephemeral, so they only lived for one end-to-end workflow execution.
Each individual AI component only had access to either the client information or the tokenized material, never both. The primary reasoning model saw only tokenized content, no client information. Other components only saw client information, no other content. If any AI model was compromised, logged, or leaked, only one piece was compromised. A final non-AI mechanical step restored the real names before anything moved outside the system to other internal tools or for reporting.
No AI model in the pipeline ever worked on both pieces at the same time.
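The tokenize-then-restore idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the production design: the class name is hypothetical, one token is shared by all of a client's aliases so the reasoning model can still group work by client, and matching is case-insensitive on whole words.

```python
import re
import secrets


def new_token() -> str:
    """Generate a random 16-digit placeholder that carries no meaning on its own."""
    return "".join(secrets.choice("0123456789") for _ in range(16))


class EphemeralTokenMap:
    """Client-name tokens for a single workflow run; create a new one per run and never persist it."""

    def __init__(self, aliases_by_client: dict[str, list[str]]):
        self._token_to_name: dict[str, str] = {}
        replacements: list[tuple[str, str]] = []
        for name, aliases in aliases_by_client.items():
            token = new_token()
            while token in self._token_to_name:  # avoid the (very unlikely) collision
                token = new_token()
            self._token_to_name[token] = name
            replacements.extend((alias, token) for alias in [name, *aliases])
        # Longest alias first, so "Acme Studios" is replaced before the shorter "Acme".
        replacements.sort(key=lambda pair: len(pair[0]), reverse=True)
        self._replacements = replacements

    def tokenize(self, text: str) -> str:
        """Replace every known alias with its client's token before any model sees the text."""
        for alias, token in self._replacements:
            pattern = r"(?<!\w)" + re.escape(alias) + r"(?!\w)"
            text = re.sub(pattern, lambda _match, t=token: t, text, flags=re.IGNORECASE)
        return text

    def restore(self, text: str) -> str:
        """Mechanical, non-AI step: swap tokens back to the full client name."""
        for token, name in self._token_to_name.items():
            text = text.replace(token, name)
        return text
```

In use, the source material goes through `tokenize` before it reaches the reasoning model, and the model's output goes through `restore` on the way out. Note that `restore` always writes the full company name, whichever alias appeared in the source, which is fine for task output but would not reconstruct the original text exactly.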
Not every job needs the smartest model
Using the best model for every step is easy, but it’s not always necessary or cost effective per run.
Using a mix of frontier, previous-generation, and mini models made each execution 65-74% cheaper than using the frontier model for all steps. Same inputs, same output quality. The cost savings and performance came from matching each AI model to its intended job, and the best match is not always the latest and greatest. The right model is the one that has and uses the reasoning the job needs.
Model selection is a balance of input and output cost, quality of output, and latency. The higher reasoning models cost more and may run slower, and may do worse on simple tasks by overthinking, adding complexity, and eating up tokens needlessly. It’s like asking a brain surgeon to make you toast and you get a five-minute discussion on the bread, its ingredients, and how it’s made. You don’t need that.
For the primary reasoning layer, the one executing the human judgment, I used the frontier model. Not because it was the latest and greatest, but because it produced the best outcomes for that job in testing. Each of the other AI components used a smaller model sized to its function, whether that was customer detection, data manipulation, or reporting.
Two other levers worth mentioning that are useful to constrain cost are temperature and max output token caps. Temperature should be a per-function decision based on the use case, whether the task needs determinism or room for reasoning.
Output tokens can be a cost driver, as AI models tend to fill whatever budget you give them. I used dynamic capping that sets the max output tokens as a multiple of the input size. This kept the models from generating more than the job needed.
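A minimal sketch of that kind of cap, assuming you already have an input token count for the request. The floor and ceiling are additions for the example (the article only mentions the multiplier), and the default values are placeholders.

```python
import math


def dynamic_max_tokens(input_tokens: int, multiplier: float,
                       floor: int = 256, ceiling: int = 4096) -> int:
    """Scale the output budget with input size, clamped to a sane range.

    floor and ceiling are assumptions: the floor keeps tiny inputs from getting
    an unusably small budget, and the ceiling bounds the worst-case cost.
    """
    if input_tokens < 0:
        raise ValueError("input_tokens must be non-negative")
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return max(floor, min(ceiling, math.ceil(input_tokens * multiplier)))
```

With a multiplier of 0.5, a 2,000-token input gets a 1,000-token output budget. The multiplier itself would differ per component: a step that condenses input needs a small one, while a step that expands a short request into several tasks may need a multiplier above 1.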
When wrong is expensive and when it isn’t
A single accuracy bar across the entire output may not be the right measure, as you can miss real failures or reject output that was fine.
This is about risk management, not necessarily quality assurance. The right question isn’t whether the output is correct. It’s what each attribute costs when it’s wrong and how often it goes wrong.
Each output has a set of attributes, and I measured risk against each one. The risk profile is different for each attribute, and for the totality of the output.
Think about a burger. If the bun is bad, the burger is bad, no matter how good the protein is. If the bun is fine but the protein is bad, the burger is still bad. Those two must be right or nothing else matters. However, if the bun and the protein are both good, then the lettuce, the tomato, the spread, those can vary a little. Nobody sends back a burger because the lettuce was sliced differently than last time.
Some output attributes in a workflow are the bun and the protein. Dates, customer assignment, and hierarchy. They have to be right, because a wrong customer or a wrong date breaks the attribution chain downstream and breaks the value of the system. Humans don’t tend to make those mistakes.
Other attributes are the toppings, like how a task is titled or how a follow-up, task, or action is phrased. The bar there is do no harm within a range. The AI model can rephrase, restructure, create, or fill in detail, but it can’t change meaning, drop something important, or invent things. The output has to be clear, actionable, faithful to the source content. If it hits those three, variance is fine.
Each AI component had its own judge that evaluated the attributes of its output during testing. A final judge evaluated the full input against the full output. Nothing left the system until it was correct.
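The bun-and-toppings split can be expressed as two different checks on the same output. The sketch below is illustrative only: the attribute names (including `parent_task` as a stand-in for hierarchy) and the shape of the judge verdicts are assumptions, and the three flexible criteria are the ones named above (clear, actionable, faithful).

```python
# Attributes that must match exactly: the bun and the protein.
CRITICAL = ("customer", "due_date", "parent_task")  # hypothetical attribute names

# Criteria for attributes where wording may vary: the toppings.
FLEXIBLE_CHECKS = ("clear", "actionable", "faithful")


def evaluate(expected: dict, actual: dict, flexible_verdicts: dict) -> dict:
    """Apply a strict bar to critical attributes and a do-no-harm bar to the rest.

    flexible_verdicts is assumed to come from a judge and look like
    {"title": {"clear": True, "actionable": True, "faithful": True}}.
    """
    # Critical attributes: any mismatch is a failure, no partial credit.
    failures = [name for name in CRITICAL if actual.get(name) != expected.get(name)]
    # Flexible attributes: wording can differ as long as all three criteria hold.
    for attribute, checks in flexible_verdicts.items():
        for check in FLEXIBLE_CHECKS:
            if not checks.get(check, False):
                failures.append(f"{attribute}:{check}")
    return {"release": not failures, "failures": failures}
```

A task titled differently from the reference passes as long as the judge marks it clear, actionable, and faithful, while a date that is off by a week blocks release no matter how good the title is.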
What this is really about
Get it right and you don’t get 45 minutes back. You get half a person back, redirected to work that actually moves the business.
This approach and these techniques are not limited to workflows. The same approach of capturing judgment through domain tuning works inside products too.
What’s the one job on your team that everyone agrees is critical, but nobody officially owns? The work that would be noticed immediately if it stopped but isn’t in anyone’s job description. That’s the work this is about.
