Building HIPAA-Compliant AI: A Practical Engineering Checklist

Introduction

Most HIPAA compliance content is written by lawyers for buyers. This guide is written by engineers for the people who actually have to implement the controls.

This is the practical checklist we use for HIPAA-aligned AI projects across healthcare clients. It covers the engineering decisions that determine whether a system passes security review: BAA-eligible model providers, PHI handling patterns, audit logging that satisfies reviewers, eval suites as compliance artifacts, and the deployment architecture that keeps PHI inside your trust boundary.

If you're building AI for a healthcare workload and need to ship past compliance review, this is what to set up — not what to retrofit after the first audit.

1. BAA-eligible model providers

The first architecture decision: will your AI model provider sign a Business Associate Agreement (BAA)? Without one, a covered entity or business associate cannot lawfully send PHI through the provider's API.

Currently BAA-eligible

OpenAI: Available on the Enterprise tier with explicit BAA. Not on standard API endpoints.
Anthropic: Available on the Enterprise tier with explicit BAA.
AWS Bedrock: Standard AWS BAA covers Bedrock invocations of compliant models.
Azure OpenAI: Covered under standard Microsoft BAA for Azure customers.
GCP Vertex AI: Covered under Google Cloud BAA for healthcare customers.

Critical implementation detail

You cannot use the default consumer API tier and call it HIPAA-compliant. You need the enterprise or healthcare-specific endpoints with BAA in place. Confirm with the provider's sales/legal team before architecting the system. Many teams discover this after building on consumer endpoints.

What the BAA covers (and doesn't)

The BAA covers the model provider's handling of PHI sent to them. It doesn't cover: your application's handling of PHI before sending it, your infrastructure storing model outputs, or your logging of model interactions. You're responsible for those parts.

2. PHI handling and minimization

Design for data minimization. The agent should see only the PHI it needs — not the entire patient record. This is both a HIPAA requirement (minimum necessary standard) and a security best practice.

Tokenization

Replace PHI fields with tokens before the LLM sees them. De-tokenize in your application layer after the LLM returns.

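typescript
// Sketch: the in-memory vault and the llm stub are stand-ins for your real services.
type PatientRecord = { name: string; dob: string; medications: string[] };
const vault: Record<string, string> = {}; // token -> original value; stays in your VPC
let counter = 0;

function mint(prefix: string, value: string): string {
  const token = `${prefix}_TOKEN_${++counter}`;
  vault[token] = value;
  return token;
}

// Replace direct identifiers with tokens; medication names pass through unchanged.
function tokenize(r: PatientRecord): PatientRecord {
  return { ...r, name: mint("PATIENT", r.name), dob: mint("DATE", r.dob) };
}

// Swap any tokens in the model output back to the original values.
function detokenize(text: string): string {
  return text.replace(/[A-Z]+_TOKEN_\d+/g, (t) => vault[t] ?? t);
}

// Stub client that echoes the prompt; swap in your BAA-covered provider's SDK.
const llm = { complete: async (req: { prompt: string }) => `Echo: ${req.prompt}` };

async function summarizeMedications(record: PatientRecord): Promise<string> {
  // Before sending to LLM: name becomes "PATIENT_TOKEN_1", dob becomes "DATE_TOKEN_2".
  const tokenized = tokenize(record);
  const llmResponse = await llm.complete({
    // Serialize explicitly; string + object would produce "[object Object]".
    prompt: "Summarize this patient's medications: " + JSON.stringify(tokenized),
  });
  // After LLM returns:
  return detokenize(llmResponse);
}

summarizeMedications({
  name: "Jane Smith",
  dob: "1985-03-15",
  medications: ["Lisinopril 10mg", "Metformin 500mg"],
}).then((final) => console.log(final));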

The LLM sees enough to do its job (medication names) but never sees patient identifiers. The application layer handles the mapping.

Field-level access control

Different agents see different fields based on their role and the action they're taking. A scheduling agent doesn't need to see medications; a clinical decision support agent doesn't need to see billing data.
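A minimal sketch of field-level filtering, assuming a per-role allowlist. The role names, the field names, and the helpers `pick` and `viewForRole` are hypothetical and only illustrate the idea; a real system would load the allowlist from policy configuration rather than hard-coding it.

```typescript
// Hypothetical roles and fields, for illustration only.
type AgentRole = "scheduling" | "clinical_decision_support" | "billing";

interface PatientRecord {
  name: string;
  appointments: string[];
  medications: string[];
  billingCodes: string[];
}

// Allowlist per role: any field not listed is withheld by default.
const ALLOWED_FIELDS: Record<AgentRole, Array<keyof PatientRecord>> = {
  scheduling: ["name", "appointments"],
  clinical_decision_support: ["name", "medications"],
  billing: ["name", "billingCodes"],
};

// Generic helper: copy only the named keys from an object.
function pick<T, K extends keyof T>(source: T, keys: K[]): Pick<T, K> {
  const out = {} as Pick<T, K>;
  for (const key of keys) {
    out[key] = source[key];
  }
  return out;
}

// Build the view of a record that a given agent role is allowed to see.
function viewForRole(record: PatientRecord, role: AgentRole): Partial<PatientRecord> {
  return pick(record, ALLOWED_FIELDS[role]);
}

const sampleRecord: PatientRecord = {
  name: "PATIENT_TOKEN_1",
  appointments: ["2030-01-15T09:00Z"],
  medications: ["Lisinopril 10mg"],
  billingCodes: ["99213"],
};

// The scheduling agent receives name and appointments only.
const schedulingView = viewForRole(sampleRecord, "scheduling");
```

An allowlist fails closed: a field added to the record later stays hidden from every agent until someone grants it explicitly.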

Trust-boundary segmentation

Keep PHI inside your own cloud VPC. The model provider sees only the minimum context required. Your VPC is where audit logging happens, where access controls are enforced, where backup and retention policies apply.

3. Audit logs that satisfy reviewers

Every PHI access needs to be logged in a way that's immutable and exportable. The fields auditors expect to see:

Who triggered the access (user identity or system process).
What PHI was accessed (which patient, which fields).
When (timestamp with timezone).
Why (business purpose, linked workflow ID, or care plan reference).
What the AI did with it (output, downstream actions, tool calls).
Decision outcome (was a recommendation acted on? Was it overridden?).
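The six fields above map naturally onto a typed event. The sketch below is one possible shape, not a standard schema: the interface `AuditEvent`, its field names, and the helper `buildAuditEvent` are assumptions made for illustration.

```typescript
// Hypothetical event shape mirroring the six fields above.
interface AuditEvent {
  actor: string;        // who: user identity or system process
  patientId: string;    // what: which patient (ideally a token, not a raw identifier)
  fields: string[];     // what: which PHI fields were read
  timestamp: string;    // when: ISO 8601 in UTC
  purpose: string;      // why: workflow ID or care plan reference
  aiActions: string[];  // what the AI did: outputs, downstream actions, tool calls
  outcome: "accepted" | "overridden" | "pending"; // decision outcome
}

type AuditEventInput = Omit<AuditEvent, "timestamp">;

// Build an event, rejecting input that would leave an auditor's question unanswered.
function buildAuditEvent(input: AuditEventInput, now: Date = new Date()): AuditEvent {
  if (input.actor.trim() === "") throw new Error("audit event needs an actor");
  if (input.patientId.trim() === "") throw new Error("audit event needs a patient");
  if (input.fields.length === 0) throw new Error("audit event needs at least one field");
  if (input.purpose.trim() === "") throw new Error("audit event needs a purpose");
  // toISOString always emits UTC with a trailing "Z", so the timezone is unambiguous.
  return { ...input, timestamp: now.toISOString() };
}
```

Stamping the time inside the builder, rather than trusting the caller, keeps timestamps consistent across services.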

Storage requirements

Append-only storage is non-negotiable. Logs that can be edited by application code don't satisfy audit requirements. Options:

S3 with object lock (write-once, read-many enabled at bucket level).
Dedicated audit service (AWS CloudTrail, Datadog audit logs, Splunk).
Postgres with restricted permissions: the application role can only INSERT, auditors get read-only access, and no role can UPDATE or DELETE.
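Whichever backend you pick, it helps if application code can only reach the log through an interface that has no update or delete path. The class below is an in-memory sketch of that interface, under the assumption that a real deployment backs it with one of the options above; `AppendOnlyLog` and its method names are hypothetical.

```typescript
// In-memory stand-in for an append-only store. It illustrates the interface only;
// durability and tamper resistance must come from the real backend.
class AppendOnlyLog<T extends object> {
  private readonly entries: Array<Readonly<T>> = [];

  // The only write path: copy the entry, freeze the copy (shallowly), append it.
  append(entry: T): number {
    this.entries.push(Object.freeze({ ...entry }));
    return this.entries.length - 1; // sequence number of the new entry
  }

  // Reads return a fresh array of the frozen entries, so callers
  // cannot reorder or truncate the underlying log.
  readAll(): ReadonlyArray<Readonly<T>> {
    return this.entries.slice();
  }

  // Export as JSON Lines, one entry per line, for handing to reviewers.
  exportJsonLines(): string {
    return this.entries.map((e) => JSON.stringify(e)).join("\n");
  }
}
```

Note that `Object.freeze` is shallow, so nested objects inside an entry would need their own freezing or serialization at append time.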

Retention

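HIPAA requires covered documentation to be retained for at least 6 years (45 CFR 164.316(b)(2)), and most teams apply the same period to audit logs. State requirements may extend this. Plan for storage cost over the retention period.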
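The retention arithmetic is simple enough to encode directly. This is a rough sketch using the six-year minimum from the text; the function names and the `stateYears` parameter are hypothetical, and the storage estimate ignores compression and tiering.

```typescript
const HIPAA_RETENTION_YEARS = 6;

// Earliest date a log entry may be deleted, given its creation time and any
// longer state-mandated retention period.
function earliestDeletionDate(createdAt: Date, stateYears: number = 0): Date {
  const years = Math.max(HIPAA_RETENTION_YEARS, stateYears);
  const result = new Date(createdAt.getTime());
  result.setUTCFullYear(result.getUTCFullYear() + years);
  return result;
}

// Steady-state volume you will be holding once the retention window is full.
function retainedGigabytes(gbPerMonth: number, years: number = HIPAA_RETENTION_YEARS): number {
  return gbPerMonth * 12 * years;
}
```

For example, 50 GB of audit logs per month works out to 3,600 GB held at steady state over six years.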

4. Evals as compliance artifacts

Eval suites do double duty as regression tests and compliance evidence. Security reviewers increasingly ask you to demonstrate three things: that the model behaves consistently across versions, that you catch quality regressions before they ship, and that you can replay any production decision against historical model versions.

What to include in evals

Representative scenarios covering common clinical workflows.
Edge cases including adversarial inputs.
Bias and fairness checks across demographic subgroups.
Safety scenarios where the model should refuse or escalate.
Privacy scenarios where the model should not leak PHI.
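As one example of these categories, a privacy scenario can be checked mechanically: seed known identifiers into the test context and fail the case if any of them shows up in the model output. The sketch below assumes outputs have already been captured; `PrivacyCase`, `leakedIdentifiers`, and `runPrivacyEval` are hypothetical names, and exact substring matching will miss paraphrased or reformatted identifiers.

```typescript
// Hypothetical eval case: captured output plus identifiers that must never appear in it.
interface PrivacyCase {
  name: string;
  forbidden: string[]; // raw identifiers seeded into the test context
  output: string;      // model output captured for this case
}

// Return the forbidden identifiers that appear in the output (case-insensitive).
function leakedIdentifiers(testCase: PrivacyCase): string[] {
  const haystack = testCase.output.toLowerCase();
  return testCase.forbidden.filter((id) => haystack.indexOf(id.toLowerCase()) !== -1);
}

// Run every case and collect a human-readable failure line per leaking case.
function runPrivacyEval(cases: PrivacyCase[]): { passed: number; failures: string[] } {
  const failures: string[] = [];
  for (const c of cases) {
    const leaks = leakedIdentifiers(c);
    if (leaks.length > 0) {
      failures.push(`${c.name}: leaked ${leaks.join(", ")}`);
    }
  }
  return { passed: cases.length - failures.length, failures };
}
```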

Eval versioning

Eval results are tied to model version, prompt version, and code version. When something changes, run evals and compare to baseline. Document any regressions and why they're acceptable.
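A baseline comparison can be sketched in a few lines. The shapes below are assumptions for illustration: `EvalRun`, `findRegressions`, `runKey`, and the default tolerance of 0.01 are not taken from any particular eval framework.

```typescript
interface EvalRun {
  modelVersion: string;
  promptVersion: string;
  codeVersion: string;
  scores: Record<string, number>; // scenario name -> pass rate in [0, 1]
}

interface Regression {
  scenario: string;
  baseline: number;
  current: number;
}

// Flag scenarios whose score dropped by more than `tolerance` versus the baseline.
// A scenario missing from the current run is treated as a score of 0.
function findRegressions(baseline: EvalRun, current: EvalRun, tolerance: number = 0.01): Regression[] {
  const regressions: Regression[] = [];
  for (const scenario of Object.keys(baseline.scores)) {
    const before = baseline.scores[scenario] ?? 0;
    const after = current.scores[scenario] ?? 0;
    if (before - after > tolerance) {
      regressions.push({ scenario, baseline: before, current: after });
    }
  }
  return regressions;
}

// Storage key that ties a run to all three versions.
function runKey(run: EvalRun): string {
  return [run.modelVersion, run.promptVersion, run.codeVersion].join("/");
}
```

Storing each run under its `runKey` and attaching the `findRegressions` output to the release record gives reviewers a per-change paper trail.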

AI projects with disciplined engineering practices have a real compliance advantage over ad hoc systems. Auditors increasingly look for evidence of rigorous ML practice, and versioned eval results are concrete evidence you can hand them.

5. Deployment architecture

The architecture we default to for HIPAA AI projects:

architecture
Your VPC (AWS / GCP / Azure)
│
├── Application services
│   └── PHI lives here, in encrypted database
│
├── Tokenization service
│   └── Maps PHI ↔ tokens
│
├── LLM gateway
│   ├── Receives requests with tokenized data
│   ├── Calls BAA-covered model provider (OpenAI/Anthropic/Bedrock)
│   └── Returns response to application
│
├── Vector DB (pgvector on RDS)
│   └── Embeddings of PHI for retrieval, encrypted at rest
│
├── Audit log service
│   └── Append-only, exportable
│
└── Tool services
    └── All run within VPC, no PHI leaves

PHI never leaves the VPC except for BAA-covered API calls to model providers. Tokenization minimizes what model providers see. Vector embeddings of PHI live in your VPC. Audit logs go to an append-only store. Tool calls that touch PHI run within the VPC.
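The LLM gateway is the natural place to enforce both rules on outbound traffic. The sketch below assumes a simple allowlist of providers with a BAA on file plus a few regex checks for identifiers that should have been tokenized upstream. The provider IDs, the patterns, and `checkOutbound` are hypothetical, and regexes alone are not a complete PHI detector.

```typescript
// Hypothetical provider IDs; populate this from your signed BAA inventory.
const BAA_COVERED_PROVIDERS = ["openai-enterprise", "anthropic-enterprise", "bedrock"];

// Patterns for raw identifiers that should never reach the gateway untokenized.
const RAW_PHI_PATTERNS: Array<{ label: string; pattern: RegExp }> = [
  { label: "ssn", pattern: /\b\d{3}-\d{2}-\d{4}\b/ },
  { label: "iso-date", pattern: /\b\d{4}-\d{2}-\d{2}\b/ },
  { label: "email", pattern: /[^\s@]+@[^\s@]+\.[^\s@]+/ },
];

interface GatewayRequest {
  provider: string;
  prompt: string;
}

type GatewayDecision = { allowed: true } | { allowed: false; reason: string };

// Decide whether a request may leave the VPC. Fails closed on either check.
function checkOutbound(req: GatewayRequest): GatewayDecision {
  if (BAA_COVERED_PROVIDERS.indexOf(req.provider) === -1) {
    return { allowed: false, reason: `provider ${req.provider} has no BAA on file` };
  }
  for (const { label, pattern } of RAW_PHI_PATTERNS) {
    if (pattern.test(req.prompt)) {
      return { allowed: false, reason: `untokenized ${label} in prompt` };
    }
  }
  return { allowed: true };
}
```

Every rejected request is also worth writing to the audit log, since it usually points at a tokenization gap upstream.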

6. Access controls and authentication

Specific requirements:

Role-based access control (RBAC). Users see only the PHI their role authorizes.
Multi-factor authentication. Required for all user accounts.
Service-to-service authentication. Workload identity (IAM roles, service accounts) rather than long-lived API keys.
Audit log access controls. Audit logs themselves require access controls.
Session timeouts. Idle sessions log out automatically, per the HIPAA Security Rule's automatic logoff specification.
Account deactivation procedures. When users leave, access is revoked promptly.
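Three of these checks (MFA, idle timeout, deactivation) can live in a single session gate. The sketch below is illustrative: the `Session` shape and `isSessionActive` are hypothetical, and the 15-minute limit is an assumed policy value, since HIPAA asks for automatic logoff without fixing a duration.

```typescript
interface Session {
  userId: string;
  mfaVerified: boolean;
  lastActivityMs: number; // epoch milliseconds of the last user action
}

// Assumed policy value; pick the limit your risk assessment supports.
const IDLE_LIMIT_MS = 15 * 60 * 1000;

// Return true only if the user is still active, passed MFA, and is not idle.
function isSessionActive(session: Session, nowMs: number, deactivatedUsers: string[] = []): boolean {
  if (deactivatedUsers.indexOf(session.userId) !== -1) return false; // access revoked
  if (!session.mfaVerified) return false;                            // MFA is mandatory
  return nowMs - session.lastActivityMs < IDLE_LIMIT_MS;             // idle timeout
}
```

Checking the deactivation list on every request, not just at login, is what makes revocation take effect promptly.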

7. Encryption requirements

HIPAA classifies encryption as an "addressable" specification: you either implement it, or document why it is not reasonable and appropriate and what equivalent measure you use instead. In practice, encrypt everything:

At rest: All databases, all storage (S3, EBS), all backups. AES-256 standard.
In transit: TLS 1.2+ for all network traffic. No HTTP, ever.
Key management: AWS KMS, Google Cloud KMS, or Azure Key Vault. Don't roll your own.
Application-level encryption for particularly sensitive fields if your team can manage the key rotation.
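These rules are easy to check automatically against an inventory of data stores. The sketch below assumes you can export such an inventory from your infrastructure tooling; `DataStoreConfig` and `encryptionFindings` are hypothetical names rather than any cloud provider's API.

```typescript
// Hypothetical inventory record for one data store.
interface DataStoreConfig {
  name: string;
  encryptedAtRest: boolean;
  kmsKeyId?: string; // key held in a managed KMS; absent if self-managed or missing
  minTlsVersion: "1.0" | "1.1" | "1.2" | "1.3";
}

// Return one finding per violated rule; an empty array means the inventory is clean.
function encryptionFindings(stores: DataStoreConfig[]): string[] {
  const findings: string[] = [];
  for (const s of stores) {
    if (!s.encryptedAtRest) {
      findings.push(`${s.name}: not encrypted at rest`);
    }
    if (s.kmsKeyId === undefined) {
      findings.push(`${s.name}: no managed KMS key`);
    }
    if (s.minTlsVersion === "1.0" || s.minTlsVersion === "1.1") {
      findings.push(`${s.name}: allows TLS below 1.2`);
    }
  }
  return findings;
}
```

Running a check like this in CI turns "encrypt everything" from a policy statement into a failing build.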

8. Incident response and breach notification

You need a documented incident response process before you have an incident:

Detection mechanisms: Monitoring, alerting, anomaly detection.
Response procedures: Documented runbooks for common scenarios.
Breach assessment: Process to determine if an incident is a reportable breach.
Notification timeline: HIPAA requires notification without unreasonable delay and no later than 60 calendar days after the breach is discovered; many states are stricter.
Tabletop exercises: Practice the response before you need it.
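The notification clock is worth encoding in the runbook tooling so nobody computes it by hand during an incident. A minimal sketch, assuming the 60-day outer limit from the list above and an optional stricter state limit; the function names and the `stateLimitDays` parameter are hypothetical, and the result is an outer bound, not a target.

```typescript
const HIPAA_NOTIFICATION_DAYS = 60;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Latest permissible notification date: the HIPAA outer limit or a stricter
// state limit, whichever comes first.
function notificationDeadline(discoveredAt: Date, stateLimitDays?: number): Date {
  const days = stateLimitDays === undefined
    ? HIPAA_NOTIFICATION_DAYS
    : Math.min(HIPAA_NOTIFICATION_DAYS, stateLimitDays);
  return new Date(discoveredAt.getTime() + days * MS_PER_DAY);
}

// Whole days left before the deadline; negative once it has passed.
function daysRemaining(discoveredAt: Date, now: Date, stateLimitDays?: number): number {
  const deadline = notificationDeadline(discoveredAt, stateLimitDays);
  return Math.floor((deadline.getTime() - now.getTime()) / MS_PER_DAY);
}
```

The clock starts at discovery, so the detection timestamp in your incident record is the input that matters.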

9. Managing business associates

Every vendor that handles PHI on your behalf is a business associate. They need BAAs. Track them:

Model providers (OpenAI, Anthropic, etc.).
Cloud providers (AWS, GCP, Azure).
Observability tools (if they see PHI in logs).
Database providers (if managed databases store PHI).
Backup providers.
Any vendor whose service touches PHI.

Maintain an inventory. Review BAAs annually. Audit vendors' compliance posture (they should have SOC 2 reports you can review).
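The inventory can be as simple as a typed list with a scheduled check. The sketch below flags PHI-handling vendors that have no BAA on file or have gone more than a year without review; the `Vendor` shape, `vendorFindings`, and the 365-day interval are assumptions for illustration.

```typescript
// Hypothetical inventory record for one vendor.
interface Vendor {
  name: string;
  touchesPhi: boolean;
  baaSignedOn?: string;    // ISO date; absent if no BAA is on file
  lastReviewedOn?: string; // ISO date of the last compliance review
}

const REVIEW_INTERVAL_DAYS = 365;
const DAY_MS = 24 * 60 * 60 * 1000;

// Return one finding per vendor that needs attention as of `today`.
function vendorFindings(vendors: Vendor[], today: Date): string[] {
  const findings: string[] = [];
  for (const v of vendors) {
    if (!v.touchesPhi) continue; // not a business associate for our purposes
    if (v.baaSignedOn === undefined) {
      findings.push(`${v.name}: handles PHI without a BAA`);
      continue;
    }
    // Fall back to the signing date when no review has been recorded yet.
    const last = new Date(v.lastReviewedOn ?? v.baaSignedOn);
    const ageDays = (today.getTime() - last.getTime()) / DAY_MS;
    if (ageDays > REVIEW_INTERVAL_DAYS) {
      findings.push(`${v.name}: annual review overdue`);
    }
  }
  return findings;
}
```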

Common mistakes

Mistakes we see in HIPAA AI projects:

Using consumer API tiers. Default OpenAI or Anthropic APIs aren't BAA-covered.
Logging PHI to standard logs. Application logs, observability logs, and model provider logs all need PHI redaction or the same controls as your primary PHI store.
Storing model outputs without considering PHI. Outputs can contain PHI; they need the same controls as inputs.
Treating tokenization as optional. It's a key control for minimizing what model providers see.
No incident response plan. Discovered at the worst possible time — during an actual incident.
Skipping evals. Compliance review increasingly asks about model behavior testing; no answer is the wrong answer.
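The logging mistake is the easiest one to guard against in code: redact before anything reaches a standard log sink. The sketch below uses a few regexes for common identifier formats. The patterns and the `redact` helper are illustrative assumptions and will not catch names or free-text PHI, so treat it as a last line of defense rather than the primary control.

```typescript
// Illustrative patterns only; extend them to match the identifiers your system handles.
const REDACTIONS: Array<{ label: string; pattern: RegExp }> = [
  { label: "SSN", pattern: /\b\d{3}-\d{2}-\d{4}\b/g },
  { label: "EMAIL", pattern: /[^\s@]+@[^\s@]+\.[^\s@]+/g },
  { label: "PHONE", pattern: /\b\d{3}[-.]\d{3}[-.]\d{4}\b/g },
];

// Replace every match of every pattern with a labeled placeholder.
function redact(line: string): string {
  return REDACTIONS.reduce(
    (out, r) => out.replace(r.pattern, `[REDACTED_${r.label}]`),
    line
  );
}
```

Wiring `redact` into the logger itself, rather than calling it at each log site, means a forgotten call cannot leak PHI.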

Conclusion

Compliance done well is engineering done well. The teams that struggle with HIPAA AI projects are the ones trying to retrofit controls after the system is built. Design for it from day one and most of the friction disappears.

The checklist above is the minimum we install for every healthcare client. Specific projects have specific requirements (FDA software-as-a-medical-device considerations, state-specific rules, payer requirements), but the foundation is consistent.

If you're building AI for a healthcare workload, we help clients architect HIPAA-compliant AI systems from day one. The setup takes 4-8 weeks but pays for itself the first time a compliance review goes smoothly instead of becoming a six-month blocker.