AI Engineering Glossary

Answers: AI Engineering, Explained

Plain-English definitions of the terms behind production AI — each grounded in a real system Prodinit has shipped.

AI Engineering Partner

AI Engineering Partner

An AI engineering partner is a specialist team that designs, builds, and ships production AI systems alongside your company — owning the engineering, not just advising on strategy. Unlike a consultancy that delivers recommendations or a freelancer who completes a single task, a partner takes a product from prototype to reliable, scaled production.

Air-Gapped & Private LLM

Air-Gapped AI

Air-gapped AI is the practice of running AI models — including large language models — on infrastructure with no inbound or outbound internet connection. Data, model weights, and inference all stay inside a private network or isolated cloud environment, so sensitive information never crosses the organisation's security boundary.

LLMOps & MLOps

Canary and Shadow Deployments

Canary and shadow deployments are two low-risk ways to roll out a new model or prompt. A canary sends a small slice of real traffic to the new version and increases that share only if quality holds. A shadow deployment sends a copy of live traffic to the new version in parallel without showing users its output, so you can compare it against the current version before any user is exposed to it.
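As a rough illustration, both rollout styles can be sketched in a few lines. The function names, the hash-based bucketing, and the in-memory log are assumptions made for this example, not a description of any particular system; a production shadow call would normally run asynchronously so it cannot add latency.

```python
import hashlib

def route(request_id: str, canary_percent: int) -> str:
    """Pick 'canary' or 'stable' for a request, deterministically per request ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"

def handle_with_shadow(request, stable_fn, shadow_fn, log):
    """Serve the stable output; run the shadow version only to record a comparison."""
    served = stable_fn(request)
    try:
        log.append({"request": request, "stable": served, "shadow": shadow_fn(request)})
    except Exception as exc:
        # A shadow failure is logged but must never affect what the user sees.
        log.append({"request": request, "stable": served, "shadow_error": repr(exc)})
    return served
```

Hashing the request ID (or a user ID) keeps routing sticky, so raising `canary_percent` from 5 to 20 only adds traffic to the canary and never moves an existing canary request back.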

Agent Architecture

Checkpoint and Resume

Checkpoint and resume is a pattern that lets a long-running AI agent save its state at safe points and continue from there after an interruption — a crash, a timeout, or a pause for human input. Instead of restarting from scratch and repeating expensive work, the agent reloads its last checkpoint and proceeds.
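A minimal sketch of the pattern, assuming the agent's work is a list of zero-argument step functions whose results are JSON-serialisable. The names `run_steps` and `save_checkpoint` and the JSON file layout are illustrative choices, not a standard API.

```python
import json
import os
import tempfile

def save_checkpoint(state, path):
    """Write the state to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # readers never see a half-written checkpoint

def run_steps(steps, checkpoint_path):
    """Run steps in order, resuming after the last completed step if a checkpoint exists."""
    state = {"next_step": 0, "results": []}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as f:
            state = json.load(f)
    for i in range(state["next_step"], len(steps)):
        state["results"].append(steps[i]())  # expensive work happens here
        state["next_step"] = i + 1
        save_checkpoint(state, checkpoint_path)  # safe point after each step
    return state["results"]
```

If a step raises, the checkpoint on disk still describes every step that finished, so calling `run_steps` again with the same path skips straight to the step that failed.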

Agent Architecture

File System as Context

File system as context is a pattern where an AI agent uses files on disk — not the prompt — as its working memory. Instead of holding everything in a limited context window, the agent reads and writes files, then loads only what each step needs. This lets agents work over far more information than a context window can hold.
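One way to picture this is a small helper that treats a directory as the agent's working memory. `FileMemory` and its methods are hypothetical names invented for this sketch; real agents typically expose the same operations as read, write, and list tools.

```python
from pathlib import Path

class FileMemory:
    """Hypothetical helper: a directory the agent uses as working memory."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def write(self, name: str, text: str) -> None:
        (self.root / name).write_text(text, encoding="utf-8")

    def index(self) -> list[str]:
        """A cheap listing that is small enough to keep in every prompt."""
        return sorted(p.name for p in self.root.iterdir() if p.is_file())

    def load(self, names: list[str], max_chars: int = 2000) -> str:
        """Build the context for one step from only the files that step needs."""
        parts = []
        for name in names:
            text = (self.root / name).read_text(encoding="utf-8")
            parts.append(f"## {name}\n{text[:max_chars]}")
        return "\n\n".join(parts)
```

The prompt carries only the index plus whatever `load` returns for the current step, so the total amount of stored material is bounded by disk rather than by the context window.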

ML & Fine-tuning

Fine-Tuning

Fine-tuning is the process of further training a pre-trained large language model on a smaller, task-specific dataset so it adapts to a particular style, domain, or behaviour. It adjusts the model's weights — unlike prompting or RAG — making the new behaviour intrinsic to the model rather than supplied at query time.

LLMOps & MLOps

LLM Cost Attribution

Cost attribution for LLM applications is the practice of tracing token spend back to the thing that caused it — a feature, customer, request type, or agent step. Instead of one opaque monthly bill, you get a per-unit breakdown that shows where money goes, which is the prerequisite for controlling and optimising LLM cost.
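A sketch of the bookkeeping, assuming every model call is logged with its token counts and a tag such as `feature` or `customer`. The model name and the prices below are made up for illustration; real prices vary by provider and model.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens.
PRICES = {"small-model": {"input": 0.001, "output": 0.002}}

def attribute_cost(calls, prices=PRICES, key="feature"):
    """Sum token spend per value of `key` (e.g. per feature or per customer)."""
    totals = defaultdict(float)
    for call in calls:
        price = prices[call["model"]]
        cost = (call["input_tokens"] / 1000 * price["input"]
                + call["output_tokens"] / 1000 * price["output"])
        totals[call[key]] += cost
    return dict(totals)
```

Calling `attribute_cost(calls, key="customer")` on the same log gives the per-customer view, which is why tagging each call at request time matters more than the aggregation itself.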

LLMOps & MLOps

LLM Evaluation

LLM evaluation is the practice of measuring the quality of a large language model's outputs against defined criteria — accuracy, faithfulness, tone, and safety — rather than assuming they are correct. It uses scoring rubrics, golden datasets, LLM-as-judge methods, and human review to make a non-deterministic system measurable and safe to ship.
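The simplest form of this is a golden-dataset check that gates a release. The sketch below assumes a `model_fn` that maps an input string to an output string; `evaluate`, `exact_match`, and the 0.9 threshold are illustrative choices, and exact match stands in for whatever rubric or judge a real evaluation would use.

```python
def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the answers match after trimming and lowercasing, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def evaluate(model_fn, golden, scorer=exact_match, threshold=0.9):
    """Run the model over a golden dataset and report whether it clears the bar."""
    scores = [scorer(model_fn(case["input"]), case["expected"]) for case in golden]
    mean_score = sum(scores) / len(scores)
    failures = [case["input"] for case, score in zip(golden, scores) if score < 1.0]
    return {"mean_score": mean_score, "passed": mean_score >= threshold, "failures": failures}
```

Returning the failing inputs alongside the score keeps the result actionable: a drop in `mean_score` points directly at the cases to inspect.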

LLMOps & MLOps

LLM-as-Judge

LLM-as-judge is an evaluation method where a capable language model scores another model's outputs against a rubric, instead of relying on human review for every case. It lets teams evaluate thousands of responses for correctness, faithfulness, and tone at a scale humans can't match. The judge itself is validated against human judgments to confirm it is reliable.
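Validating the judge usually comes down to comparing its labels with human labels on the same sample. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance; the function name and the choice of kappa as the statistic are assumptions of this example.

```python
from collections import Counter

def judge_agreement(judge_labels, human_labels):
    """Return (observed agreement, Cohen's kappa) for two equal-length label lists."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts, human_counts = Counter(judge_labels), Counter(human_labels)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum((judge_counts[label] / n) * (human_counts[label] / n)
                   for label in set(judge_counts) | set(human_counts))
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)
    return observed, kappa
```

A judge that agrees with humans 75% of the time can still have a modest kappa when one label dominates, so it is worth looking at both numbers before trusting the judge at scale.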

LLMOps & MLOps

LLMOps

LLMOps (Large Language Model Operations) is the set of practices, tools, and infrastructure for deploying, monitoring, evaluating, and continuously improving large language models in production. It extends MLOps with concerns specific to LLMs — prompt management, output evaluation, hallucination detection, token-cost control, and observability over non-deterministic responses.

Agent Architecture

Mixture of Agents

Mixture of Agents (MoA) is a pattern where several agents independently produce candidate answers to the same task, and an aggregator agent synthesises them into a single, stronger response. By combining diverse attempts — often from different models or prompts — MoA improves quality and robustness over any one agent acting alone.
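The control flow can be sketched with plain functions standing in for agents. Everything here is illustrative: `mixture_of_agents` is a made-up name, the proposers run sequentially for simplicity, and `majority_vote` is only a stand-in for the aggregator, which in practice would be a model call that synthesises the candidates rather than counting them.

```python
from collections import Counter

def mixture_of_agents(task, proposers, aggregator):
    """Collect independent candidate answers, then let an aggregator combine them."""
    candidates = []
    for propose in proposers:
        try:
            candidates.append(propose(task))  # each proposer sees only the task
        except Exception:
            continue  # one failed proposer should not sink the whole run
    if not candidates:
        raise RuntimeError("no proposer produced a candidate")
    return aggregator(task, candidates)

def majority_vote(task, candidates):
    """Stand-in aggregator: return the most common candidate."""
    return Counter(candidates).most_common(1)[0][0]
```

Because proposers never see each other's output, they can be run in parallel, and a single weak or failing proposer has limited influence on the final answer.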

ML & Fine-tuning

Model Distillation

Model distillation is a technique for transferring the knowledge of a large, capable 'teacher' model into a smaller, cheaper 'student' model. The student is trained to reproduce the teacher's outputs, so it can deliver comparable quality on a target task at a fraction of the inference cost and latency.

Agent Architecture

Multi-Agent Deadlocks

A deadlock in a multi-agent system occurs when two or more agents are each waiting on another to act, so none can proceed and the system stalls. It typically arises from circular dependencies, agents waiting on shared resources, or coordination loops where every agent expects another to move first.
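Circular waits can be detected mechanically if the system records who is waiting on whom. The sketch below assumes a "wait-for" mapping from each agent to the agents it is blocked on (the agent names are invented) and uses a depth-first search to find a cycle.

```python
def find_deadlock(waits_for):
    """Return one cycle in the wait-for graph as a list of agents, or None."""
    IN_PROGRESS, DONE = 1, 2
    state = {}   # agents absent from this dict have not been visited yet
    path = []    # agents on the current depth-first search path

    def visit(agent):
        state[agent] = IN_PROGRESS
        path.append(agent)
        for other in waits_for.get(agent, ()):
            if state.get(other) == IN_PROGRESS:
                # We reached an agent already on the path: that is a cycle.
                return path[path.index(other):] + [other]
            if other not in state:
                cycle = visit(other)
                if cycle:
                    return cycle
        path.pop()
        state[agent] = DONE
        return None

    for agent in waits_for:
        if agent not in state:
            cycle = visit(agent)
            if cycle:
                return cycle
    return None
```

Running a check like this on a timer, or whenever an agent starts waiting, turns a silent stall into an explicit error that names the agents involved.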

Agent Architecture

Orchestrator-Specialist Pattern

The orchestrator-specialist pattern is a multi-agent design where one orchestrator agent plans and delegates work to a set of narrow specialist agents, then assembles their results. The orchestrator owns control flow and state; each specialist does one job well. It keeps large agent systems debuggable by separating coordination from execution.
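In code, the separation looks roughly like this. The function and parameter names are hypothetical, and the planner, specialists, and assembler are plain callables standing in for model-backed agents.

```python
def orchestrate(task, planner, specialists, assemble):
    """Plan the work, delegate each step to a named specialist, assemble the results."""
    state = {"task": task, "results": []}  # only the orchestrator reads or writes this
    for name, payload in planner(task):
        if name not in specialists:
            raise KeyError(f"no specialist named {name!r}")
        # Each specialist gets just its own input and knows nothing about the plan.
        output = specialists[name](payload)
        state["results"].append((name, output))
    return assemble(state)
```

Because every delegation passes through one loop, logging `name`, `payload`, and `output` there gives a complete trace of the run, which is where the debuggability comes from.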

Agent Architecture

Parallel Tool Calls and Partial Failures

Parallel tool calls occur when an AI agent invokes several tools at once instead of one at a time, which cuts latency when the calls are independent. A partial failure occurs when some of those parallel calls succeed and others fail. Handling it well means the agent reasons over what came back rather than crashing or hallucinating the missing results.
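A minimal sketch using Python's standard `asyncio`, assuming each tool is a zero-argument coroutine function. The name `call_tools`, the timeout value, and the results/errors split are choices made for this example.

```python
import asyncio

async def call_tools(calls, timeout=5.0):
    """Run independent tool calls concurrently; return (results, errors) by tool name."""
    names = list(calls)
    tasks = [asyncio.wait_for(calls[name](), timeout) for name in names]
    # return_exceptions=True keeps one failure from cancelling the other calls.
    outcomes = await asyncio.gather(*tasks, return_exceptions=True)
    results, errors = {}, {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, BaseException):
            errors[name] = f"{type(outcome).__name__}: {outcome}"
        else:
            results[name] = outcome
    return results, errors
```

Passing both dictionaries back to the model lets it answer from the calls that succeeded and say plainly which data is missing, instead of inventing it.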

LLMOps & MLOps

Prompt Caching

Prompt caching is a technique that stores the processed form of a repeated prompt prefix so the model doesn't reprocess it on every call. When many requests share a large, stable prefix — a system prompt, instructions, or retrieved context — caching it cuts both cost and latency, since the model only processes the new part of each request.
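The effect is easy to see with a toy accounting model. This sketch does not reproduce how any provider implements caching (real systems store internal model state and apply their own expiry and minimum-length rules); `PrefixCache` is a hypothetical class that only counts how many tokens would need processing on each request.

```python
import hashlib

class PrefixCache:
    """Toy model of prompt caching: counts the tokens each request must process."""

    def __init__(self):
        self._seen_prefixes = set()

    def tokens_to_process(self, prefix_tokens, new_tokens):
        # Identify the prefix by a hash of its exact token sequence.
        key = hashlib.sha256("\x1f".join(prefix_tokens).encode("utf-8")).hexdigest()
        if key in self._seen_prefixes:
            return len(new_tokens)  # cache hit: only the new part is processed
        self._seen_prefixes.add(key)
        return len(prefix_tokens) + len(new_tokens)  # cache miss: process everything
```

The hash covers the exact token sequence, which mirrors the practical rule that a cached prefix only helps when it is byte-for-byte stable; putting a timestamp or user name at the top of a system prompt defeats it.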

RAG & LLM Engineering

Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is a technique that improves large language model responses by retrieving relevant information from an external knowledge source — documents, a database, or a vector store — and supplying it to the model at query time. This grounds answers in current, specific data the model was never trained on.
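The retrieve-then-prompt flow can be sketched without any external service. Word overlap is used here purely as a stand-in for the embedding similarity a real vector store would compute, and the function names and prompt wording are assumptions of this example.

```python
import re

def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return up to k documents that share the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(query_words & tokenize(d)), reverse=True)
    return [d for d in ranked[:k] if query_words & tokenize(d)]  # drop zero-overlap docs

def build_prompt(query: str, documents: list[str], k: int = 2) -> str:
    """Supply the retrieved text to the model at query time."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
```

The model never needs to have seen these documents in training; updating the knowledge source changes the answers on the next query, with no retraining involved.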

Voice AI

Streaming and Partial Results

Streaming means an LLM application sends its response token by token as it is generated, instead of waiting for the full answer. Those incremental tokens are partial results. Streaming sharply cuts perceived latency, because the user sees or hears output almost immediately; that is essential for chat and non-negotiable for real-time voice AI.
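A generator is enough to show the shape of a streaming pipeline. In this sketch `stream_tokens` fakes a model by splitting a fixed string on spaces, and `consume` and its `on_partial` callback are illustrative names for the code that renders or speaks each partial result.

```python
import time

def stream_tokens(text: str, delay: float = 0.0):
    """Fake model: yield one whitespace-separated token at a time."""
    for token in text.split(" "):
        time.sleep(delay)  # stands in for per-token generation time
        yield token + " "

def consume(stream, on_partial) -> str:
    """Forward every partial result as it arrives, then return the full answer."""
    parts = []
    for token in stream:
        parts.append(token)
        on_partial("".join(parts))  # e.g. update the UI or feed text-to-speech
    return "".join(parts).rstrip()
```

The total generation time is unchanged; what improves is the time to the first call of `on_partial`, which is what the user perceives as responsiveness.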

Agent Architecture

Tool Schema Design

Tool schema design is the practice of defining the tools an AI agent can call — their names, parameters, types, and descriptions — so the model reliably picks the right tool and supplies valid arguments. A good schema is the interface between the model's reasoning and your code; its clarity largely determines whether tool use succeeds.
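As an illustration, here is a JSON-Schema-style definition for a made-up `get_weather` tool, plus a deliberately small validator for the arguments a model supplies. The tool, its fields, and `validate_args` are hypothetical, and the validator covers only required fields, basic types, and enums.

```python
GET_WEATHER = {
    "name": "get_weather",
    "description": "Get the current weather for one city. Use only for weather questions.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Pune'"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

_TYPES = {"string": str, "number": (int, float), "integer": int, "boolean": bool}

def validate_args(schema, args):
    """Return a list of problems with the supplied arguments; empty means valid."""
    params = schema["parameters"]
    problems = [f"missing required argument: {name}"
                for name in params["required"] if name not in args]
    for name, value in args.items():
        spec = params["properties"].get(name)
        if spec is None:
            problems.append(f"unknown argument: {name}")
        elif not isinstance(value, _TYPES[spec["type"]]):
            problems.append(f"{name} must be of type {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name} must be one of {spec['enum']}")
    return problems
```

Returning the problems as text, rather than raising, makes it possible to send them back to the model so it can correct its own call.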

Voice AI

Voice AI Agent

A voice AI agent is a software system that holds a real-time spoken conversation with a user. It chains speech-to-text, a large language model, and text-to-speech into a low-latency loop, so a caller can speak naturally and hear a generated response — handling tasks like support, sales, or scheduling entirely by voice.
