Now in private preview v2026.6 → Physical AI on Jetson · Voltra routing model · Praxon agent-graph governance

The GPU-native runtime
for production AI.

Production AI is expensive, ungoverned, and invisible. GPU fleets sit at 30–40% utilization. Agents act without audit. Inference bills arrive with no cost attribution. Mandarum is the GPU-native runtime that fixes all three: compile, serve, route, govern, and observe AI on NVIDIA accelerated computing, from H100 and Blackwell datacenters to Jetson AGX Orin at the edge. One control plane. Every model, agent, and GPU.

Start free · See a workflow

H100 · H200 · Blackwell · Jetson AGX Orin edge · SOC 2 Type II · ISO 27001 · EU AI Act ready · 200+ integrations

Backed by Khudi Ventures $3M Seed

NorthBank

Helix Health

Atlas Logistics

Vellum AI

Pinecrest

Orbital Systems

The problem

Inference sprawl is the new shadow IT.

Enterprises run dozens of models and agents across scattered GPU clusters, clouds, and edge deployments, with no shared serving layer, no utilization governance, no cost attribution, and no audit trail. GPUs sit at 30–40% utilization while inference bills compound. Agents act without identity. And on the factory floor, every unaudited Jetson inference is a liability waiting to be discovered.

01 · Identity

Every inference request is authenticated

Per-model and per-agent service identities, scoped API keys, and short-lived credentials — enforced at every call across cloud H100 clusters and Jetson edge nodes. No anonymous inference.

02 · Policy

GPU spend caps, guardrails, agent-graph enforcement

A declarative policy engine that runs before each inference call — and across entire agent task sessions. Block unauthorized model access. Enforce GPU budget limits per call and per task. Escalate to humans before consequential actions.
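As a rough illustration of what a pre-call policy check could look like, here is a minimal sketch. The `Policy`, `Session`, and `evaluate` names are hypothetical and are not Mandarum's API; the rule order (deny on scope or per-call cap, escalate on task budget or consequential action) is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    # Hypothetical policy shape, for illustration only.
    allowed_models: set            # models this identity may call
    max_cost_per_call: float       # USD cap for a single inference call
    max_cost_per_task: float       # USD cap across one agent task session
    escalate_actions: set = field(default_factory=set)  # need human sign-off


@dataclass
class Session:
    spent: float = 0.0             # USD already spent in this task session


def evaluate(policy, session, model, est_cost, action=None):
    """Return 'allow', 'deny', or 'escalate' before the call is made."""
    if model not in policy.allowed_models:
        return "deny"              # unauthorized model access
    if est_cost > policy.max_cost_per_call:
        return "deny"              # single call exceeds its cap
    if session.spent + est_cost > policy.max_cost_per_task:
        return "escalate"          # task budget would be exceeded
    if action in policy.escalate_actions:
        return "escalate"          # consequential action needs a human
    return "allow"
```

The check runs before the call and reads only the session's running total, so the same function can serve both the per-call and the per-task rule.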

03 · Observability

Every GPU cycle traced, every dollar attributed

OpenTelemetry + DCGM GPU hardware telemetry — from H100 SXM5 NVLink throughput to Jetson AGX Orin power draw. Token throughput, GPU utilization, p99 first-token latency, and $/1M tokens attributed per model, per team, and per request.
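The $/1M tokens figure is simple arithmetic once GPU time and token counts are joined per request. The sketch below assumes a flat hourly GPU rate and a hypothetical record shape (`team`, `gpu_seconds`, `tokens`); real attribution would likely need to account for batching and shared GPUs, which this ignores.

```python
from collections import defaultdict


def cost_per_million_tokens(requests, gpu_hourly_usd):
    """Attribute GPU cost per team and normalize to USD per 1M tokens.

    requests: iterable of dicts with 'team', 'gpu_seconds', 'tokens'
              (a hypothetical record shape, assumed for this example).
    """
    cost = defaultdict(float)
    tokens = defaultdict(int)
    for r in requests:
        # GPU-seconds converted to USD at a flat hourly rate.
        cost[r["team"]] += r["gpu_seconds"] * gpu_hourly_usd / 3600.0
        tokens[r["team"]] += r["tokens"]
    # Teams with zero tokens are skipped to avoid division by zero.
    return {t: cost[t] / tokens[t] * 1_000_000 for t in cost if tokens[t] > 0}
```

Grouping by `model` or by request ID instead of `team` gives the other two attribution views with the same formula.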

The platform

Five products. One GPU runtime.

Each product is designed to work alone. All five are designed to work as one. Every governed inference feeds an optimization flywheel. Every telemetry trace flows into a single control plane. Every policy you write once applies at the H100 rack and the Jetson node. Adopt Voltra today. Add Praxon when you ship agents. The rest of the runtime is there when you need it.

01 · Optimize · Routing Intelligence

Mandarum Voltra

A NeMo-trained routing/optimization model that learns optimal GPU endpoint, quantization tier, and batching strategy from your fleet's own inference logs. Cuts $/token with every request served.

Explore Voltra →

02 · Govern · Agent Runtime

Mandarum Praxon

Agent-graph-aware governance — task-level budget caps, scope enforcement, and NeMo Guardrails across entire multi-step agent sessions, not just individual calls. Reasoning-chain audit trail for EU AI Act.

Explore Praxon →

03 · Remember · GPU Memory & RAG

Mandarum Cuvex

Co-located GPU pipeline: NeMo Retriever embedding → RAPIDS cuVS billion-scale ANN with ACL-native enforcement → GPU reranking. No CPU round-trip. Sub-10ms retrieval at any corpus size.

Explore Cuvex →

04 · Edge · Physical AI Governance

Mandarum Edgeon

Runtime + Sentinel on NVIDIA Jetson AGX Orin for manufacturing, robotics, and healthcare. Same identity, policy, and audit as your cloud H100 fleet, now governing every on-device inference action on the factory floor. IEC 62443, ISO 26262, and FDA SaMD compliance packs built in.

Explore Edgeon →

05 · Simulate · GPU Fleet Digital Twin

Mandarum Twinex

Build a live digital twin of your GPU fleet from DCGM telemetry. Simulate new model deployments, hardware upgrades (H100 → H200), and Dynamo topology changes — before committing a dollar of compute.

Explore Twinex →

Explore the platform

The runtime

One platform. Every model, agent, and GPU fleet.

No lock-in on models. No lock-in on agents. No lock-in on hardware. Mandarum wraps what you already run — NIM microservices, Triton-served open-weight models, LangGraph, CrewAI, or the OpenAI Agents SDK — while a NeMo-trained routing model continuously optimizes endpoint selection across your fleet. Bring your GPUs. Bring your models. Keep your data.

GPU Inference Engine

Triton → TensorRT-LLM → NIM → Dynamo → Voltra

Disaggregated prefill/decode via NVIDIA Dynamo. TRT-LLM compilation for 30–50% latency reduction per model. Voltra's NeMo-trained routing model selects the optimal GPU endpoint and quantization tier per request — improving with every inference served.
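To make the routing decision concrete, here is a deliberately simple rule-based stand-in: pick the cheapest endpoint that meets a latency target and an allowed quantization tier. Voltra is described as a learned model, so this fixed rule is not how it works internally; the `Endpoint` fields and the `route` function are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    # Hypothetical endpoint record, assumed for this example.
    name: str
    quant: str                 # quantization tier, e.g. "fp8" or "fp16"
    usd_per_1m_tokens: float   # observed cost
    p99_ms: float              # observed p99 first-token latency


def route(endpoints, max_p99_ms, allowed_quant):
    """Return the cheapest endpoint within the latency and quantization limits."""
    candidates = [
        e for e in endpoints
        if e.p99_ms <= max_p99_ms and e.quant in allowed_quant
    ]
    if not candidates:
        return None            # nothing satisfies the constraints
    return min(candidates, key=lambda e: e.usd_per_1m_tokens)
```

A learned router would replace the static `usd_per_1m_tokens` and `p99_ms` fields with predictions conditioned on the request, but the constraint-then-minimize shape stays the same.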

TRT-LLM compile

Triton serve

NIM endpoint

Voltra route

Physical AI · Edge

Jetson AGX Orin on the factory floor

Edgeon governs every on-device inference action on NVIDIA Jetson — manufacturing vision AI, industrial robotics, and medical imaging — with the same identity, policy, and immutable audit as your H100 cloud fleet.

edgeon.govern(node="jetson-factory-01", iec62443=True)

Agent Governance

Task-level policy, not just per-call

Praxon enforces budget caps and scope boundaries across entire agent task sessions — NeMo Guardrails + session-level context analysis + human-in-the-loop escalation before consequential actions.

Memory · RAG

GPU-native, ACL-enforced recall

Cuvex co-locates embedding, cuVS billion-scale ANN with ACL-native enforcement, and NeMo Retriever reranking on GPU — no CPU round-trip, sub-10ms retrieval at any corpus size.
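The idea behind ACL-enforced recall is that documents a caller cannot read are removed before ranking, so they can never surface in results. The sketch below shows that ordering with a brute-force cosine search in plain Python on the CPU. It is a stand-in for illustration only: it does not use cuVS, NeMo Retriever, or a GPU, and the `search` signature and document tuple shape are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(query, docs, user_groups, k=3):
    """Return up to k doc IDs the caller may read, most similar first.

    docs: list of (doc_id, vector, acl_groups) tuples, where acl_groups is a
          set of group names allowed to read the document (assumed shape).
    """
    # ACL filter first: unreadable documents are never scored or returned.
    visible = [(doc_id, vec) for doc_id, vec, acl in docs if acl & user_groups]
    ranked = sorted(visible, key=lambda d: cosine(query, d[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

Filtering before scoring, rather than trimming results afterwards, also means a restricted caller still gets a full top-k from the documents they are allowed to see.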

Fleet Intelligence

Simulate before you deploy

Twinex builds a live DCGM-fed digital twin of your GPU fleet. Simulate new model deployments, H100→H200 upgrades, and Dynamo topology changes — before a dollar of compute is committed.

Live preview

A real inference request, end to end.

Run a sample production request: TRT-LLM compiled model, Voltra routing to the cheapest GPU tier, Praxon session policy enforced, Cuvex RAG grounding, DCGM telemetry — all in one governed trace.

mandarum · trace workflow: nim-inference-pipeline

↳ ~/mandarum/workflows/nim-inference-pipeline

By the numbers

Production-grade from day one.

Inference uptime SLA

p99 first-token latency

35% → 80%+

GPU utilization lift

Lower inference cost

Native integrations

Observability

One pane. Every model, agent, and Jetson node.

Drill from a GPU utilization alert on an H100 cluster or a Jetson factory node to a specific inference request. Replay any agent session. Diff any model version. Export $/1M tokens per team to your data warehouse.

acme-corp/models/nim-pipeline

running · 4 models · 12 Jetson nodes

96.4%

Request success rate

+2.1 pts

81%

GPU utilization

+46 pts vs baseline

$0.41

$/1M tokens (H100 FP8)

−58% vs unoptimized

nim-llama3-70b · H100 cluster · Voltra → FP8 route | triton: serving | 142ms

praxon-agent · task-budget: $0.05 · 12/50 steps | guardrails: pass · session-policy: ok | n/a

edgeon · jetson-factory-07 · vision-inspection | iec62443-audit: logged | 8ms

sentinel · task-budget exceeded · agent-session-4421 | escalated to human | n/a

cuvex · nemo-retriever · corpus: 2.4B vectors · 14 docs | acl: enforced · cuVS: hit | 6ms

Built on standards

Open by design. GPU-native by architecture.

Bring any model

Anthropic, OpenAI, Google, open-weight via NIM/Triton/vLLM, or your NeMo fine-tuned models. Voltra routes each request to the optimal GPU tier and quantization level automatically.

Bring any agent framework

LangGraph, OpenAI Agents SDK, CrewAI, MCP servers. Praxon wraps any framework with task-level governance, NeMo Guardrails, and reasoning-chain audit — without rewriting your agent code.

Bring your full GPU fleet

NVIDIA H100/H200/Blackwell datacenter clusters, multi-cloud, on-prem, BYOC, and Jetson AGX Orin at the edge. One control plane, one audit trail, one $/1M-token cost view — cloud to factory floor.

Stop deploying models. Start governing them — cloud to edge.

The inference era is defined by efficiency and governance — not raw model capability. Join the teams deploying production AI on NVIDIA accelerated computing with Mandarum. White-glove onboarding, direct access to the founding team, and a 30-day pilot with documented GPU cost reduction.

Start a pilot · Talk to the team
