Key Takeaways
- The AI consulting market is full of firms that sell well but deliver poorly — distinguishing them before you sign requires asking the right questions, not reading case study decks
- Eight questions cover the full risk surface: production track record, who actually builds, IP ownership, data security, delivery model, eval and monitoring, exit strategy, and references from technical buyers
- Red flags are usually not lies — they're omissions, deflections, and vague reassurances. Good firms answer these questions directly
- A boutique AI engineering firm with relevant production experience will give you sharper, more specific answers than a large generalist consultancy; specificity is the signal
The questions to ask an AI consulting firm before you sign cover eight risk areas: production track record, who actually builds the work, IP ownership, data security, delivery model, evaluation and monitoring, exit strategy, and references from technical buyers. Credible firms answer each directly and specifically; weak ones deflect, generalise, or reassure. Specificity is the signal.
Why CTOs Get Burned by AI Vendors
Gartner predicted in 2024 that at least 30% of generative AI projects would be abandoned after the proof-of-concept stage by the end of 2025, citing poor data quality, inadequate risk controls, escalating costs, and unclear business value. A weak vendor relationship makes every one of those worse. Overpromised timelines, outsourced delivery, no production track record, and opaque contracts are common reasons AI consulting engagements cost more and deliver less than scoped.
The problem is that most AI consulting firms look the same from the outside. A polished website, a credible-sounding list of services, and a deck full of AI buzzwords are not signal. By the time you discover the delivery model doesn't match the pitch, you've signed a contract, paid a deposit, and handed over sensitive data.
This checklist gives you eight specific questions to ask before you sign. For each question, we describe why it matters, what a red-flag answer sounds like, and what a credible answer looks like. These are the questions we'd want a prospective client to ask us — and the ones we ask ourselves when evaluating vendors for partner work.
1. What does your production AI track record look like?
Why it matters. There is a significant difference between a firm that has shipped AI systems to production and one that has built prototypes, developed MVPs, or delivered proofs of concept. Production systems handle real users, real data, and real failure modes: cold starts, latency spikes, model drift, retrieval failures, hallucinations on edge-case inputs. A firm without production experience will not anticipate these problems until they hit you.
Red-flag answer. "We've done extensive work in the AI space" with no specific system named. Case studies that end at MVP or pilot. References to research projects, academic partnerships, or internal tools. Describing the work in terms of technologies used rather than outcomes delivered.
Green-flag answer. Named production systems with quantified outcomes: latency, accuracy, cost-per-inference, uptime, user volume. Specific engineering decisions made in production — not just architecture diagrams. Case studies that describe what broke and how it was fixed, not just what shipped. At Prodinit, every case study we publish includes the production constraints, the evaluation methodology, and the operational outcome — because that's what a technical buyer needs to evaluate fit.
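To make "quantified outcomes" concrete, here is a minimal sketch of how headline numbers such as p95 latency, cost-per-inference, and success rate could be derived from request logs. The `RequestLog` fields and the `summarize` helper are hypothetical names chosen for illustration, and the sample logs are synthetic; a real system would read these values from its own telemetry.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class RequestLog:
    """One served request (hypothetical schema, for illustration only)."""
    latency_ms: float  # end-to-end latency of the request
    cost_usd: float    # model and infrastructure cost attributed to the request
    ok: bool           # True if the request was served successfully


def summarize(logs: list[RequestLog]) -> dict[str, float]:
    """Reduce raw request logs to the figures a production case study should quote."""
    if len(logs) < 2:
        raise ValueError("need at least two requests to compute percentiles")
    latencies = [r.latency_ms for r in logs]
    # quantiles(n=100) returns the 99 cut points p1..p99, so index 94 is p95.
    p95 = quantiles(latencies, n=100, method="inclusive")[94]
    return {
        "requests": float(len(logs)),
        "p95_latency_ms": p95,
        "cost_per_inference_usd": sum(r.cost_usd for r in logs) / len(logs),
        "success_rate": sum(r.ok for r in logs) / len(logs),
    }


# Synthetic logs: latencies of 1..101 ms, a flat cost, and one failed request.
sample_logs = [
    RequestLog(latency_ms=float(i), cost_usd=0.002, ok=(i != 1))
    for i in range(1, 102)
]
print(summarize(sample_logs))
```

A firm with real production experience should be able to quote figures of this kind for a named system and explain how they were measured.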
2. Who actually builds the work — and will they be on my engagement?
Why it matters. Many consulting firms win work through senior principals and deliver through junior contractors. The person who sold you the engagement understands your problem; the people building it may not. In AI engineering, where judgment calls about model selection, prompt design, and retrieval architecture can make or break a system, delivery quality is directly tied to who is actually in the code.
Red-flag answer. "Our team" without names or roles. Vague references to a delivery bench or resource pool. Reluctance to name the engineers who will work on your account. Using the phrase "we staff projects based on availability" without elaborating.
Green-flag answer. Named engineers with specific AI engineering backgrounds. A clear description of who does what: which engineer owns model development, who owns infrastructure, who handles evaluation. Commitment to continuity — the same team for the duration of the engagement, not a rotating bench. At Prodinit, every engagement names the engineers at proposal stage; we don't staff projects after signing.
3. Who owns the IP — and what happens to the models you train on my data?
Why it matters. AI engagements create novel intellectual property: fine-tuned models, prompt templates, evaluation datasets, embedding pipelines. Default contract language often assigns this IP to the consulting firm or licenses it back to you under terms that allow reuse. If you're training models on proprietary data, those models may contain embedded representations of your trade secrets — and if the contract is ambiguous, you may not own them.
Red-flag answer. "Standard terms assign IP at handoff" without specifying what transfers. References to a "license" to use the deliverables (rather than outright assignment). No mention of trained model weights, embeddings, or fine-tuning artifacts. Deflection to "talk to our legal team."
Green-flag answer. Explicit assignment of all deliverables — code, models, weights, datasets, prompts — to you upon payment. Clear language that the firm retains no rights to models trained on your data. A separate clause for any open-source components (which cannot be re-assigned but should be listed). These are not unusual asks; a firm that can't answer them directly has not thought through the IP implications of their delivery.
4. How do you handle our data, and what's your security posture?
Why it matters. AI engagements require data access that standard software projects don't: training data, production logs, user inputs, sometimes PII or PHI. A firm without a clear data security posture will store your data in ways that create compliance exposure — cloud buckets with default permissions, shared development environments, no data deletion policy at engagement end.
Red-flag answer. "We take data security seriously" without specifics. No mention of how data is stored, who has access, whether it's used to train the firm's own models, and when it's deleted. No named compliance frameworks (SOC 2, GDPR, HIPAA) when those are relevant to your industry.
Green-flag answer. A written data handling policy with: storage location and access controls, no-training-on-client-data commitment, data deletion timeline at engagement close, and relevant compliance certifications or attestations for your industry. For regulated industries (healthcare, fintech), ask whether the firm has signed BAAs or DPAs and whether their infrastructure can support your compliance requirements. We address data security requirements explicitly in every scoping document — not because clients always ask, but because they should.
5. What's your delivery model — fixed scope, time-and-materials, or something else?
Why it matters. Delivery model determines risk allocation. Fixed-scope engagements put delivery risk on the firm; open-ended T&M puts it on you. Neither is always right — but a firm that can only do one, or that won't explain how risk is shared, is not thinking clearly about your interests. AI projects carry real scope uncertainty; how a firm handles that uncertainty signals whether they'll be a partner or a vendor.
Red-flag answer. Pure open-ended T&M with no milestones, no cap, and no defined exit criteria. Fixed-scope proposals that don't include a discovery phase — scope on day one is almost always wrong for AI work. Resistance to milestone-based billing or refusal to put success criteria in writing.
Green-flag answer. A phased structure: a fixed-price discovery phase (1–2 weeks) with a defined output, followed by milestone-based delivery sprints with written acceptance criteria. T&M with a weekly cap and sprint reviews is also defensible when discovery is complete. The key signal is whether the firm is willing to put success metrics in writing before work starts. Our approach to AI project structure describes the four-phase model we use on all engagements.
6. How do you evaluate and monitor AI systems after they ship?
Why it matters. AI systems degrade silently. A model that performs well on your evaluation set in week three may quietly regress in week twelve as data distribution shifts, prompts change, or retrieval quality drops. A firm without an evaluation and monitoring strategy is handing you a ticking clock — the regression question is not if but when.
Red-flag answer. "We'll set up some logging" without a defined eval methodology. No mention of golden datasets, regression thresholds, or CI-integrated evaluation. Treating monitoring as an infrastructure problem (CPU, memory, latency) without addressing model quality. Offering a standard observability stack with no LLM-specific metrics.
Green-flag answer. A defined four-layer eval stack: unit evals (deterministic capability checks), reference evals (output accuracy), rubric evals (LLM-as-judge with documented bias calibration), and behavioral evals (end-to-end system properties). CI-integrated eval pipelines that block deploys on quality regressions. A golden dataset strategy that includes scheduled refresh and failure-driven updates. Our LLM evaluation post describes the exact methodology we wire into every production AI engagement.
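As an illustration of what "CI-integrated eval pipelines that block deploys on quality regressions" can mean in practice, the sketch below scores a candidate system against a golden dataset and returns a non-zero exit code when accuracy falls more than an allowed margin below a stored baseline. It covers only the reference-eval layer with exact-match scoring. The function names, the toy golden set, the baseline of 0.90, and the 0.02 tolerance are all assumptions made for the example, not a description of any particular firm's pipeline.

```python
from typing import Callable

GoldenCase = tuple[str, str]  # (input, expected output)


def reference_accuracy(predict: Callable[[str], str], golden: list[GoldenCase]) -> float:
    """Fraction of golden cases where the output matches the reference exactly."""
    if not golden:
        raise ValueError("golden dataset is empty")
    hits = sum(predict(x).strip() == expected.strip() for x, expected in golden)
    return hits / len(golden)


def gate_exit_code(current: float, baseline: float, max_drop: float = 0.02) -> int:
    """Return 0 (pass) or 1 (fail) so a CI job can use the result as its exit status."""
    regressed = current < baseline - max_drop
    verdict = "FAIL" if regressed else "PASS"
    print(f"accuracy={current:.3f} baseline={baseline:.3f} max_drop={max_drop:.3f} -> {verdict}")
    return 1 if regressed else 0


# Toy stand-ins for illustration: a four-item golden set and a canned "model"
# that gets one answer wrong.
golden_set: list[GoldenCase] = [
    ("2+2", "4"),
    ("capital of France", "Paris"),
    ("3*3", "9"),
    ("opposite of hot", "cold"),
]
canned_answers = {"2+2": "4", "capital of France": "Paris", "3*3": "9", "opposite of hot": "warm"}


def candidate_system(prompt: str) -> str:
    """Hypothetical system under test; a real gate would call the deployed pipeline."""
    return canned_answers.get(prompt, "")


score = reference_accuracy(candidate_system, golden_set)
exit_code = gate_exit_code(score, baseline=0.90)
# In a CI job, finish with sys.exit(exit_code) so a regression blocks the deploy.
```

Exact match is the simplest possible scorer; rubric and behavioral evals would plug into the same gate by swapping `reference_accuracy` for a different scoring function while keeping the baseline comparison unchanged.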
7. What does the end of the engagement look like — and are we left holding the bag?
Why it matters. Many AI engagements end with a handoff that leaves the client's team unable to operate, debug, or extend the system. If the consulting firm is the only entity that understands the architecture, prompt logic, or evaluation setup, you've created a dependency that persists long after the contract ends. A good engagement ends with your team fully capable of running and evolving the system.
Red-flag answer. Handoff described as "documentation" without a defined scope for what's documented. No mention of knowledge transfer sessions with your engineering team. Systems built with proprietary tooling or infrastructure that only the vendor can access. No runbook for operating the system in production.
Green-flag answer. A formal handoff week built into every engagement — non-negotiable and included in the proposal price. Deliverables: full technical documentation, a walkthrough session with your engineering team, a runbook for operating and debugging the system, and clearly documented dependencies. The test: after handoff, can your team extend the system without calling us? If the answer is no, the handoff wasn't complete. We price handoff week into every engagement from day one because a system your team can't operate is not a delivered system.
8. Can we speak to technical buyers from your previous engagements?
Why it matters. Case studies are written by marketing teams. References are given by the people who liked the engagement. What you need is the engineering leader who worked with the firm day-to-day, signed off on the code, and operated the system after handoff. That person will tell you things no case study will: how scope conversations went, whether estimates were accurate, whether the team communicated proactively about blockers.
Red-flag answer. References offered only from business stakeholders, not engineering leaders. "Our clients prefer to stay confidential" across the board. References who can only speak to the final outcome, not the engagement process. Reluctance to provide any reference before contract signing.
Green-flag answer. At least one engineering leader or CTO from a completed engagement who can speak to the technical delivery, not just the business outcome. Willingness to do a reference call before you sign. Bonus: an unprompted offer to connect you with a reference from an engagement that had challenges — firms that only offer happy-path references are curating, not being transparent.
Summary: Red Flag vs Green Flag
| Question | Red Flag | Green Flag |
|---|---|---|
| Production track record | Prototypes, pilots, internal tools | Named systems, quantified production outcomes |
| Who builds | "Our team," vague resourcing | Named engineers committed at proposal stage |
| IP ownership | "Standard terms," license language | Full assignment of all deliverables on payment |
| Data security | "We take it seriously," no specifics | Written data handling policy, compliance certs |
| Delivery model | Open T&M with no milestones | Fixed discovery + milestone-based sprints |
| Eval and monitoring | Logging and infra dashboards only | 4-layer eval stack, CI-integrated, golden datasets |
| Exit strategy | Documentation as afterthought | Handoff week in scope, runbook, team walkthrough |
| References | Business stakeholders only | Engineering leader, including a challenging engagement |
Frequently Asked Questions
Boutique AI engineering firms typically have smaller, more senior teams where the engineers who win the work are the engineers who do the work. Large consultancies have deeper bench capacity but often staff production engagements with junior resources after a senior team wins the deal. For AI projects where judgment calls in model selection, evaluation design, and architecture matter significantly, boutique firms with deep AI engineering specialisation often deliver better technical outcomes. The tradeoff is capacity: large consultancies can staff more people faster for very large programmes.
To test whether a firm has genuine production experience, ask them to explain a technical decision they made on a past engagement: specifically, a decision where they chose not to use the most obvious approach, and why. A credible firm will describe a real constraint they encountered (data quality, latency budget, evaluation difficulty) and explain the specific trade-off they navigated. Firms without genuine production experience will give you a generic explanation of why their approach is generally superior, not a specific decision with a specific reason.
For engagements over $50K, a bounded paid discovery or proof-of-concept (POC) before committing to the full engagement is a reasonable due diligence step. It reveals how the firm communicates, whether their scoping methodology is rigorous, and whether their engineers' quality matches the sales pitch. A firm that refuses a bounded paid POC before a large engagement commitment is either oversubscribed or risk-averse about demonstrating capability. Either way, it's signal.
Key AI-specific clauses to look for in the contract: explicit IP assignment of trained model weights and embeddings (not just code), a no-training-on-client-data clause, data deletion timeline at engagement close, defined success metrics with agreed evaluation methodology, and a handoff deliverables list. The delivery model clause should also specify what triggers a scope change order; ambiguity here is how AI engagements generate invoice disputes six weeks in.