Enterprise AI vendor evaluation suffers from a consistent failure mode: subjective assessment produces decisions that are hard to defend, difficult to compare across procurements, and frequently inconsistent with the buyer's actual operational and regulatory needs. A quantified scorecard framework addresses all three problems. This guide provides a scorecard structure designed for procurement teams running parallel vendor evaluations, enterprise architects standardising vendor selection, and senior leaders signing off on material AI commitments. The full scorecard contains 40+ criteria across six categories; the categories and key criteria are reproduced below.
Category 1: Technical capability (default weight 25-35%)
Core criteria: performance on relevant, independently administered benchmarks (not vendor-selected); capability demonstration on buyer use cases with buyer data (in pilot or proof-of-concept); integration architecture compatibility with buyer's existing stack; scalability at projected production volumes; latency and throughput at production-equivalent load; customisation and fine-tuning capabilities; model update cadence and the vendor's capability roadmap; multilingual and accessibility support if relevant. Scoring should be evidence-based: vendor claims corroborated by reference customer confirmation and pilot performance, not sales material alone.
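One way to keep scoring evidence-based is to record, for each criterion, which sources support the score, and to flag any score that rests on vendor claims alone. The sketch below is a minimal illustration under stated assumptions: the 0-5 scale, the class and function names, and the example criteria scores are all hypothetical and not part of the scorecard.

```python
from dataclasses import dataclass, field

# Evidence sources named in the text; the identifiers are illustrative.
VENDOR_CLAIM = "vendor_claim"
REFERENCE_CUSTOMER = "reference_customer"
PILOT_RESULT = "pilot_result"
INDEPENDENT_SOURCES = {REFERENCE_CUSTOMER, PILOT_RESULT}


@dataclass
class CriterionScore:
    criterion: str
    score: float  # assumed 0-5 scale, not prescribed by the scorecard
    evidence: set[str] = field(default_factory=set)

    def is_corroborated(self) -> bool:
        """True when at least one source other than the vendor backs the score."""
        return bool(self.evidence & INDEPENDENT_SOURCES)


def uncorroborated(scores: list[CriterionScore]) -> list[str]:
    """Names of criteria whose score rests on vendor claims (or nothing) alone."""
    return [s.criterion for s in scores if not s.is_corroborated()]


# Hypothetical scores for a single vendor.
technical_scores = [
    CriterionScore("latency at production load", 4.0, {VENDOR_CLAIM, PILOT_RESULT}),
    CriterionScore("integration compatibility", 3.5, {REFERENCE_CUSTOMER}),
    CriterionScore("model update cadence", 4.5, {VENDOR_CLAIM}),
]

needs_follow_up = uncorroborated(technical_scores)
```

Here `needs_follow_up` lists the criteria an evaluator would need to verify through a reference call or pilot before the score is accepted.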
Category 2: Governance maturity (default weight 15-25%, higher for regulated industries)
Core criteria: ISO/IEC 42001 certification status (certified, in process, not yet); NIST AI RMF implementation evidence; AI policy documented and accessible; AI use case inventory maintained; bias testing methodology and evidence with demographic coverage; incident response procedures documented and tested; regulatory mapping jurisdiction by jurisdiction; internal accountability structure (named AI governance lead, audit and assurance arrangements); customer support for buyer governance (will the vendor provide documentation, evidence, and responses to buyer governance requests?); publication and transparency (does the vendor publish model cards, evaluation reports, incident summaries?).
Category 3: Security posture (default weight 15-20%)
Core criteria: SOC 2 Type II attestation; ISO 27001 certification; penetration testing evidence and recency; vulnerability disclosure and remediation history; AI-specific security (prompt injection protections, data exfiltration controls, attack surface assessment); data encryption at rest and in transit; identity and access management integration; privacy controls (DPIA support, data minimisation, retention controls); incident history (publicly known incidents and vendor response). For the ISACA Advanced in AI Risk (AAIR) methodology, see the ISACA entry under Useful third-party resources.
Category 4: Regulatory compliance (default weight 10-20%)
Core criteria: EU AI Act compliance posture (GPAI obligations, high-risk classification if applicable, transparency obligations); Privacy Act / GDPR / equivalent obligations, including transparency for automated decision-making (ADM); sector-specific obligations in buyer's industry; geographic coverage (regulatory readiness in each jurisdiction the buyer operates in); regulatory engagement (vendor's posture toward regulator inquiries, participation in voluntary frameworks). The Five Eyes Agentic AI Guidance (1 May 2026) provides useful evaluation criteria for autonomous AI capabilities.
Category 5: Commercial terms (default weight 10-15%)
Core criteria: price competitiveness against market benchmarks; pricing model fit (per-seat, per-call, per-token, capacity-based) for buyer's usage profile; contract flexibility (term length, ramp provisions, volume commitments); exclusion of buyer data from model training as an explicit contract term; liability allocation appropriate to the AI risk profile; indemnities, particularly for IP and regulatory claims; contractually established audit rights; service level commitments with meaningful credits.
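Pricing model fit can be checked by projecting the buyer's expected usage through each pricing model and comparing annual cost. The sketch below assumes a simple usage profile (seats, calls per month, tokens per call). Every rate in it is a placeholder chosen for illustration, not a market figure, and real contracts typically add tiers, minimums, and overage terms that it ignores.

```python
from dataclasses import dataclass


@dataclass
class UsageProfile:
    seats: int
    calls_per_month: int
    tokens_per_call: int


def annual_costs(profile: UsageProfile, rates: dict[str, float]) -> dict[str, float]:
    """Project annual cost under each pricing model for one usage profile."""
    tokens_per_month = profile.calls_per_month * profile.tokens_per_call
    monthly = {
        "per_seat": profile.seats * rates["per_seat"],
        "per_call": profile.calls_per_month * rates["per_call"],
        "per_token": tokens_per_month / 1_000_000 * rates["per_million_tokens"],
        "capacity_based": rates["capacity_flat"],
    }
    return {model: round(cost * 12, 2) for model, cost in monthly.items()}


# Hypothetical buyer profile and placeholder monthly rates.
profile = UsageProfile(seats=200, calls_per_month=500_000, tokens_per_call=2_000)
rates = {
    "per_seat": 30,             # per seat per month
    "per_call": 0.01,           # per API call
    "per_million_tokens": 4.0,  # per million tokens
    "capacity_flat": 7_000,     # flat monthly capacity fee
}

costs = annual_costs(profile, rates)
best_fit = min(costs, key=costs.get)  # cheapest model for this profile
```

Re-running the projection with low, expected, and high usage profiles shows how sensitive the ranking of pricing models is to volume, which is usually the more informative result.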
Category 6: Operational continuity (default weight 5-15%)
Core criteria: uptime track record (not just commitment); historical incident response performance; change management (how the vendor handles model updates and capability changes); data portability and exit support; business continuity arrangements; vendor financial stability (particularly for AI startups: runway, growth trajectory, investor quality); concentration risk (vendor's reliance on upstream providers and the buyer's reliance on this vendor relative to alternatives).
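The default weight ranges for the six categories are bands rather than fixed values: their lower bounds sum to 80% and their upper bounds to 130%, so each procurement has to pick one weight per category. The sketch below checks a chosen weighting, assuming that weights should sum to 100% and that the default ranges are treated as hard limits. Both are assumptions, and a team may prefer to treat out-of-range weights as warnings. The function name and the example weighting are hypothetical.

```python
# Default weight ranges (percent) as listed for the six categories.
DEFAULT_RANGES = {
    "technical_capability": (25, 35),
    "governance_maturity": (15, 25),
    "security_posture": (15, 20),
    "regulatory_compliance": (10, 20),
    "commercial_terms": (10, 15),
    "operational_continuity": (5, 15),
}


def check_weights(weights: dict[str, float]) -> list[str]:
    """Return a list of problems; an empty list means the weights are usable."""
    if set(weights) != set(DEFAULT_RANGES):
        return ["weights must cover exactly the six categories"]
    problems = []
    for name, (low, high) in DEFAULT_RANGES.items():
        if not low <= weights[name] <= high:
            problems.append(f"{name}: {weights[name]} is outside {low}-{high}")
    total = sum(weights.values())
    if abs(total - 100) > 1e-9:
        problems.append(f"weights sum to {total}, expected 100")
    return problems


# Hypothetical weighting for a regulated-industry buyer (governance weighted up).
regulated_buyer = {
    "technical_capability": 25,
    "governance_maturity": 25,
    "security_posture": 20,
    "regulatory_compliance": 15,
    "commercial_terms": 10,
    "operational_continuity": 5,
}

problems = check_weights(regulated_buyer)  # empty list for this weighting
```

Fixing the weights before any vendor is scored, and recording them with the evaluation, helps keep the weighting from being tuned toward a preferred vendor.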
Using the scorecard
Three uses produce the most value. First, parallel evaluation: applying the same scorecard with the same evaluators over the same period produces directly comparable scores across vendors. Second, threshold setting: minimum acceptable scores per category (or for specific criteria) become procurement requirements rather than nice-to-haves. Third, benchmark accumulation: as procurement teams use the scorecard across multiple vendor evaluations, they accumulate a market view of vendor quality that informs future decisions. The scorecard is an input to procurement decisions, not the sole determinant; context, fit, and relationship factors remain important.
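Parallel evaluation and threshold setting can be sketched together: each vendor gets a weighted total from its category scores, and vendors that miss a minimum are excluded before ranking. The example assumes a 0-5 scoring scale, one weight per category summing to 100, and minimums on two categories. The scale, the weights, the minimums, and the vendor scores are all hypothetical illustrations, not values from the scorecard.

```python
from dataclasses import dataclass

# Hypothetical weights (percent, summing to 100), chosen within the default ranges.
WEIGHTS = {
    "technical_capability": 30,
    "governance_maturity": 20,
    "security_posture": 20,
    "regulatory_compliance": 10,
    "commercial_terms": 10,
    "operational_continuity": 10,
}

# Hypothetical minimum acceptable scores on the assumed 0-5 scale.
MINIMUMS = {"governance_maturity": 3.0, "security_posture": 3.0}


@dataclass
class Evaluation:
    vendor: str
    scores: dict[str, float]  # category -> score on a 0-5 scale

    def weighted_total(self) -> float:
        """Weighted average across all six categories, on the same 0-5 scale."""
        return sum(WEIGHTS[c] * self.scores[c] for c in WEIGHTS) / 100

    def failed_thresholds(self) -> list[str]:
        """Categories where the vendor scores below the required minimum."""
        return [c for c, minimum in MINIMUMS.items() if self.scores[c] < minimum]


def rank(evaluations: list[Evaluation]) -> list[Evaluation]:
    """Vendors that clear every threshold, sorted by weighted total, best first."""
    passing = [e for e in evaluations if not e.failed_thresholds()]
    return sorted(passing, key=lambda e: e.weighted_total(), reverse=True)


def make(vendor: str, *values: float) -> Evaluation:
    """Build an Evaluation from scores given in WEIGHTS order."""
    return Evaluation(vendor, dict(zip(WEIGHTS, values)))


evaluations = [
    make("Vendor A", 4, 3, 4, 3, 3, 4),
    make("Vendor B", 5, 4, 2.5, 4, 4, 4),  # strongest total, weak security
    make("Vendor C", 3, 4, 4, 4, 3, 3),
]

shortlist = [e.vendor for e in rank(evaluations)]
```

In this illustration Vendor B has the highest weighted total but is excluded by the security minimum, which is the practical difference between treating a threshold as a requirement and treating it as one more input to the average.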
Useful third-party resources
- NIST AI Risk Management Framework — Foundational US framework
- ISO/IEC 42001 — Management system standard for AI governance
- Five Eyes Agentic AI Guidance (May 2026)
- ISACA Advanced in AI Risk
- OWASP Top 10 for LLM Applications