Development

How the product got built.

The lab spent twelve months in production on a live franchise dealer floor before shipping the first commercial version of the product. This page documents the architectural decisions, the empirical observations that drove them, the AI model comparisons we ran along the way, and the things we tried that did not work. We are publishing it because we think the reasoning is more useful to the industry than the marketing.

SECTION I

The sequence top reps already run

The empirical observation that started the work. Top performers across the franchise dealer floors we observed run a consistent post-touch sequence after every meaningful customer interaction. The sequence has three actions and they happen in roughly the same order.

First, an outbound text or email to the customer summarizing the conversation and confirming next steps. Second, a CRM note documenting what was discussed, what the customer wants, and what the rep committed to. Third, a follow-up task or calendar entry for the next touchpoint.

The sequence is not taught at most dealerships. The reps who outproduce their floor converge on it independently. Mid- and lower-performing reps run partial versions of the sequence, typically the customer-facing message but not the internal record. The internal record is the part the floor’s CRM dashboard cannot distinguish from compliance box-checking, and the part the manager cannot read from any view available to him. The dashboard surfaces the count of activities, not the shape of them. A logged call is a logged call regardless of whether the rep ran the full sequence behind it or stopped at the customer message.

This was the architectural starting point. The product is not a better CRM. The product is the layer that captures the three-action sequence at the moment it happens, in the rep’s existing tools, and writes the structured record back into the systems the dealer already pays for. The dashboard the GM already reads finally surfaces what the top reps were doing without instrumentation. The sequence stops being a mystery the GM tries to coach into existence and starts being a measurement he reads off the screen.

The decision to build the layer instead of teaching the sequence came from a second observation: the sequence is already taught at every major dealer training program in the country, and the conversion rate from training to floor adoption is below 15%. The friction is structural. Training cannot remove a friction the software itself produces. A rep who has been told the right thing to do and lacks the time to do it is a rep working inside a software environment that is misaligned with the work. The fix is the environment, not the rep.

SECTION II

Three outputs from one input

The first version of the product generated three outputs from a single rep input. A customer-facing message draft, a CRM note draft, and a follow-up task draft. The rep clicked once, the layer generated all three, and the rep approved or edited each before committing. The architecture matched the three-action sequence directly.

We tried generating the three outputs in sequence (message first, then note conditioned on the message, then task conditioned on both) and in parallel (all three independently from the same context). Sequential generation produced more internally consistent output. Parallel generation was faster in wall-clock time but required reconciliation logic whose cost exceeded the latency savings. The current production architecture runs sequential generation with shared context, with average end-to-end latency under 4 seconds.
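A minimal sketch of that pipeline, in TypeScript. Every name here (RepContext, generate, draftSequence) is illustrative, not the shipped API; the point is the conditioning order, where each output sees everything generated before it.

interface RepContext {
  repId: string;
  customerName: string;
  conversationSummary: string; // the rep's single input
  dealStage: "first-visit" | "test-drive" | "negotiation" | "post-sale";
}

interface Drafts {
  message: string; // customer-facing text or email
  note: string;    // CRM note
  task: string;    // follow-up task or calendar entry
}

// Placeholder for the model call; the production version carries the
// full shared context and the dealership calibration described below.
async function generate(prompt: string, ctx: RepContext): Promise<string> {
  return `[draft for ${ctx.customerName}: ${prompt}]`;
}

// Sequential generation with shared context: each step is conditioned
// on everything generated before it, which is what keeps the three
// outputs telling the same story.
async function draftSequence(ctx: RepContext): Promise<Drafts> {
  const message = await generate("Draft the customer-facing follow-up.", ctx);
  const note = await generate(`Draft the CRM note, consistent with the message just sent: ${message}`, ctx);
  const task = await generate(`Draft the follow-up task implied by this note: ${note}`, ctx);
  return { message, note, task };
}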

The customer-facing draft is the highest-stakes output. The CRM note is the highest-value output. The follow-up task is the highest-frequency output. The product holds all three to the same quality bar but different latency targets. The note generation is allowed to run longer because the cost of a bad note compounds over months. The message generation must complete in under 3 seconds because the rep is in front of the customer. The follow-up task is the simplest and runs almost instantly because its structured fields are already determined by the message and note that precede it.
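Expressed as configuration, under stated assumptions: only the sub-3-second message target and the sub-4-second end-to-end average come from the numbers above; the note and task figures here are placeholders.

// Per-output latency targets. The note budget is treated here as a soft
// ceiling (an assumption): the message renders first, and the note may
// finish after it, as long as the sequence averages under the
// end-to-end target.
const LATENCY_TARGETS_MS = {
  message: 3000, // hard: the rep is standing in front of the customer
  note: 6000,    // soft, assumed: a bad note compounds over months, so quality wins
  task: 250,     // assumed: structured fields are already determined upstream
} as const;

const END_TO_END_AVG_MS = 4000; // average across the full three-output sequence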

The shared-context architecture has a second-order benefit we did not predict at the design stage. Every output produced by the same context is internally consistent. The message and the note tell the same story. The note and the task point at the same next action. A reviewer reading the three outputs back-to-back sees a coherent record. The earlier parallel-generation builds did not produce coherent records. The reviewer noticed.

SECTION III

What we learned about model choice

The product depends on a generation model that produces outputs in the rep’s voice, in the dealership’s regional vocabulary, with the correct level of formality for the deal stage. We ran the same generation tasks across multiple commercial frontier models over the development cycle. The findings below are at the category level; we are not naming specific models or vendors in this document.

The largest commercial models produce the most polished outputs. The polish is the wrong direction. Polished outputs sound like a SaaS product wrote them. Customers receiving polished outputs respond at lower rates than customers receiving outputs that match the rep’s actual voice. The first six weeks of production data showed a measurable response-rate gap between higher-polish and lower-polish drafts.

Smaller models with tight prompt engineering and a strong rep-voice calibration outperformed larger models on response rate even when they underperformed on benchmark scores. This is the opposite of what most LLM evaluation frameworks reward.

Model speed matters more than model quality at the margin. A 2-second draft that is 90% good ships. A 6-second draft that is 95% good loses the rep’s attention before it renders. The rep moves on to the next customer. The draft never gets used. We optimize for the speed-quality frontier, not for either dimension alone.

The most reliable improvement in output quality came from prompt engineering, not from model upgrades. Iterations on the system prompt produced larger lifts in response rate than swapping the underlying model. The system prompt now runs to several thousand tokens of dealership-specific calibration. This is the part of the product that is hardest to replicate, and the part that no amount of compute can substitute for. The model is the engine. The prompt is the map. A bigger engine on the wrong map gets to the wrong place faster.
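A sketch of how a prompt like that composes, assuming hypothetical shapes for the calibration data. The real system prompt runs to several thousand tokens; this shows only the assembly pattern.

interface DealerCalibration {
  regionalVocabulary: string[];             // local terms the store's customers actually use
  brandPositioning: string;                 // how this store talks about its brands
  formalityByStage: Record<string, string>; // deal stage -> expected register
}

interface RepVoice {
  samplePhrases: string[]; // drawn from the rep's own approved sends
  signOff: string;
}

function buildSystemPrompt(dealer: DealerCalibration, rep: RepVoice, stage: string): string {
  return [
    "You draft follow-ups for an automotive sales rep.",
    `Match the rep's voice. Examples of how they actually write: ${rep.samplePhrases.join(" | ")}`,
    `Close messages the way the rep does: ${rep.signOff}`,
    `Use the store's regional vocabulary where it is natural: ${dealer.regionalVocabulary.join(", ")}`,
    `Brand positioning for this store: ${dealer.brandPositioning}`,
    `Formality for this deal stage: ${dealer.formalityByStage[stage] ?? "neutral"}`,
  ].join("\n");
}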

We test new model releases against the production benchmark on a rolling cadence. The benchmark is not a published dataset. It is a per-dealer evaluation set built from the rep’s actual edits over the prior thirty days. A new model is adopted only when it improves response rate against this benchmark while holding latency constant. Most new model releases do not pass. The few that do pass are adopted within a release cycle. The lab maintains model-agnostic infrastructure precisely because the question of which model wins is a measurement, not a brand preference.
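The adoption gate, sketched. runBenchmark is a stand-in for the per-dealer evaluation harness; the strict inequality on response rate and the no-regression condition on latency are the two rules stated above.

interface EvalResult {
  responseRate: number; // measured against the rep's edits from the prior thirty days
  p95LatencyMs: number;
}

// Placeholder: the real harness replays the per-dealer evaluation set
// against the candidate model and measures downstream response rate.
async function runBenchmark(modelId: string, dealerId: string): Promise<EvalResult> {
  return { responseRate: 0, p95LatencyMs: 0 };
}

async function shouldAdopt(candidate: string, incumbent: string, dealerId: string): Promise<boolean> {
  const [cand, prod] = await Promise.all([
    runBenchmark(candidate, dealerId),
    runBenchmark(incumbent, dealerId),
  ]);
  // Adopt only on a strict response-rate win with no latency regression.
  return cand.responseRate > prod.responseRate && cand.p95LatencyMs <= prod.p95LatencyMs;
}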

SECTION IV

Failures we are glad we made early

Three architectural decisions were tried, shipped to internal testing, and reversed. We document them because the reasoning is more useful than the conclusion.

The first failure was full automation. The early product attempted to generate and send the customer-facing message without rep review. This was wrong on two dimensions. The legal exposure is non-trivial. The response-rate degradation from rep voice drift was immediate. The rep is the auth token in more ways than one. He is the legal authority for the message. He is also the calibration signal for whether the draft sounds like him. We reverted to a review-required architecture within ten days.

The second failure was a separate dashboard product. The early product included a standalone admin dashboard for general managers, sold as a separate tier. The dashboard ran on the same data as the rep-facing layer. The early dealers we showed it to did not engage with it. The reason is structural. The GM does not want a new dashboard. He wants the dashboard he already has to be accurate. We pivoted from selling a new view to selling the layer that makes the existing view trustworthy.

The third failure was deep CRM integrations. We attempted to integrate with the dominant dealer CRM platforms via their published APIs. The APIs are throttled, partial, and contractually restricted in ways that make production-grade integration unreliable. After three weeks of integration work and two weeks of debugging vendor-side inconsistencies, we abandoned the API path and rebuilt the integration as a Chrome extension that lives inside the rep’s authenticated session. The DOM is the integration surface. The rep is the auth token. This is now the architectural moat. The incumbent vendors cannot block what they cannot detect.
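The pattern, reduced to a sketch. The selectors are invented; the real ones are per-CRM and maintained release to release. What matters is that the extension reads what the rep can already see and writes back through the same DOM events a human typist would produce.

// content-script.ts: runs inside the rep's authenticated CRM session.
const NOTE_FIELD_SELECTOR = 'textarea[name="activity-note"]'; // invented selector

function readCustomerContext(): string | null {
  // No API, no vendor contract: read what is already on the rep's screen.
  const panel = document.querySelector(".customer-detail-panel"); // invented selector
  return panel?.textContent?.trim() ?? null;
}

function writeCrmNote(note: string): boolean {
  const field = document.querySelector<HTMLTextAreaElement>(NOTE_FIELD_SELECTOR);
  if (!field) return false;
  field.value = note;
  // Dispatch the events the SPA's framework is listening for, so the
  // write is indistinguishable from the rep typing it. (Controlled
  // React inputs may additionally need the native value setter.)
  field.dispatchEvent(new Event("input", { bubbles: true }));
  field.dispatchEvent(new Event("change", { bubbles: true }));
  return true;
}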

The reversal of these three decisions saved the product. Each one was defensible at the design stage. Each one was wrong in production. The lab’s operating principle after these reversals is to ship the simplest version of any architectural decision into a real environment within ten days, observe what breaks, and revise. The cost of shipping a wrong design is paid in the field within two weeks. The cost of designing a wrong system in advance is paid in months of engineering against a hypothesis that production would have falsified on day one.

SECTION V

How the layer learns

Every approved generation produces a training signal. The rep’s edits before sending are the highest-quality calibration data the product collects. Edits are versioned, indexed, and used to update the rep-voice model on a rolling basis. The product gets sharper the longer a rep uses it.
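The shape of that signal, sketched with illustrative field names. The diff between what the layer drafted and what the rep actually sent is the calibration event.

interface EditEvent {
  repId: string;
  dealerGroupId: string; // training is isolated per dealer group (see below)
  outputType: "message" | "note" | "task";
  draft: string;         // what the layer generated
  sent: string;          // what the rep approved after editing
  editedAt: string;      // ISO timestamp, for the rolling training window
  version: number;       // edits are versioned and indexed
}

function captureEdit(
  draft: string,
  sent: string,
  meta: Omit<EditEvent, "draft" | "sent">,
): EditEvent | null {
  // An unedited approval is also signal (the draft was right), but the
  // highest-value events are the ones where the rep changed something.
  if (draft === sent) return null; // assumption: unedited sends are logged separately
  return { ...meta, draft, sent };
}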

The training loop runs at the rep level and at the dealership level. Rep-level training calibrates voice, vocabulary, and typical deal-stage patterns. Dealership-level training calibrates regional language, brand-specific positioning, and dealer-group norms. The two layers compose. A rep who switches stores within the same group keeps his voice calibration. A new rep at an existing store inherits the dealership-level calibration on day one.
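How the two layers compose, as a sketch. Dealership-level calibration is the base; rep-level calibration overrides it where it exists. Field names are assumptions.

interface Calibration {
  vocabulary: string[];
  formality: "casual" | "neutral" | "formal";
  phrasing: string[];
}

function composeCalibration(dealership: Calibration, rep?: Partial<Calibration>): Calibration {
  return {
    // Dealership norms fill in everything the rep layer does not specify.
    vocabulary: [...dealership.vocabulary, ...(rep?.vocabulary ?? [])],
    formality: rep?.formality ?? dealership.formality,
    phrasing: rep?.phrasing ?? dealership.phrasing,
  };
}

// Day one for a new rep: composeCalibration(storeDefaults), the dealership layer alone.
// A transfer within the group: composeCalibration(newStoreDefaults, repVoiceLayer).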

We considered shared training across all dealers and rejected it. The risk of cross-contamination, where one dealer’s voice or incentive structure leaks into another, was unacceptable. Training is per-dealer-group. Models are isolated by tenant. This costs us some compute efficiency. It is the right call.

The training data itself is treated as the dealer’s property. The contract makes this explicit. The lab does not aggregate edits across tenants for any product purpose. A dealer who departs the platform takes the calibration with him. The architectural decision to isolate by tenant forces us to invest in per-tenant model serving, which is more expensive than a shared model would be. The alternative was a model that drifts toward the average of its training set, which is the opposite of the rep-voice calibration the product is supposed to produce.

SECTION VI

What we are building in Q3 and Q4

The product roadmap moves outward from the three-action sequence in two directions.

Outward in the customer arc. The layer expands to capture the sequence at every meaningful touchpoint, not just the first conversation. The same architecture applies to test drives, F&I handoffs, post-sale follow-ups, service appointments, and reactivation conversations. Each is a distinct generation context with its own voice calibration and its own data spine.

Outward in the rep’s day. The layer absorbs adjacent workflows the rep currently does manually. Lead intake. Trade appraisal capture. Pencil structuring. Each absorption removes friction from the day’s work and produces a structured record the dealership did not previously have.

The architecture is the moat. The Chrome extension that lives inside the rep’s authenticated session, reads the DOM, and writes back through DOM events is platform-agnostic. The same pattern works for any SPA-based vertical software. The lab is opening a second vertical (financial advisor CRM platforms) this quarter to test whether the architectural moat generalizes beyond automotive. Early indications are that it does. The buyer is different. The data model is different. The shape of the relationship between rep and customer is different. The friction the rep absorbs at the keyboard is the same.

The lab’s view is that any vertical SaaS category dominated by a system of record that predates current generation models is a candidate for the same architecture. The vertical does not have to be sales. The system of record does not have to be a CRM. The pattern is the layer above the system of record, riding the authenticated session, capturing the work at the moment it happens, and writing the structured output back into the incumbent system. We expect the next decade of vertical SaaS to be defined by execution layers above legacy systems of record. Brevmont is one of the first labs operating against that thesis.

The product Brevmont ships next quarter is already different from the product we shipped this spring. The reasoning underneath both is the same. Capture the work the top reps already do. Make it visible to the manager. Remove the friction the software category has been charging the floor for a decade.

Brevmont Labs publishes development notes alongside its research. We expect the lab’s reasoning to age better than its current implementation. The architectural decisions on this page will be revised as the product matures. We will document the revisions as they happen.