Start with the operating boundary
Per-token pricing is only one input. The architecture also has to account for where data is allowed to move, how much latency the workflow can tolerate, how often requests arrive, and who owns failure when inference is unavailable.
That boundary determines whether to call a hosted endpoint, run inference inside the operating environment, or split traffic between both.
Service options
Hosted APIs reduce maintenance. The provider owns model serving, scaling, patching, and much of the operational burden.
Private models increase control. The team owns placement, retention, logging, latency policy, fallback behavior, and model changes.
Hybrid routing needs explicit rules. Decide which tasks stay local, which tasks can leave the boundary, and which failures escalate to another model or operator.
Data boundaries
If source data cannot leave a network, tenant, or customer environment, the private option should be evaluated early. Redaction can work, but it becomes an operating surface of its own: parsers, rules, exceptions, audit records, and tests. That work belongs in the cost comparison.
A private deployment still requires compliance work. It narrows the surface by reducing external transmission, tightening retention boundaries, and giving the team more control over prompts, logs, generated artifacts, and access paths.
Latency and throughput constraints
Hosted endpoints add network latency and provider-side queuing. For batch jobs, that may not matter. For tool-calling agents, request chains can turn small delays into user-visible wait. Measure the whole workflow.
Local models can provide tighter latency bounds when hardware, batch policy, and queue depth are controlled. Smaller fine-tuned models can be faster for narrow tasks, but they still need evaluation against hosted alternatives before promotion.
The cost of ownership
Private inference has real operating cost: hardware, images, drivers, quantization choices, model registries, deployment rollback, monitoring, and drift tests. A local model that no one maintains becomes technical debt quickly.
The crossover is workload-specific. Compare monthly API spend, peak concurrency, latency penalties, data-handling work, and the engineering hours required to operate the private stack. A defensible choice has a runbook and a review record.
Hybrid routing
Many production systems benefit from both. Extraction, classification, redaction, and routine triage can stay close to the data. Complex reasoning, rare edge cases, and low-volume research tasks can route to a hosted model when policy and data rules allow it.
Make the routing rules inspectable. The system should record model choice, prompt version, input class, latency, cost, confidence, and fallback reason. Without that record, infrastructure choice becomes opinion instead of an engineering decision.