Chapter 28 — ☁️ Cloud AI Platforms — deploying foundation models on the hyperscalers
☁️ Cloud AI Platforms
This chapter covers the managed cloud AI platforms — AWS Bedrock, Google Vertex AI, Google AI Studio, and Azure OpenAI Service — that let you call top foundation models through an API instead of running GPUs yourself. These platforms sit on the “buy” side of the build-vs-buy decision: you give up some control and pay per token (or rent dedicated capacity) in exchange for skipping all GPU operations and getting instant, governed access to the best models. Everything here builds on the concept chapters — RAG, inference and serving, agents, fine-tuning, MLOps — so we reference those rather than re-explaining them.
🧰 Where in the stack: the hosted serving layer — your app calls a managed API, the provider owns the GPUs, scaling, and model weights.
28.1 — ☁️ The managed-platform landscape (overview)
A managed cloud AI platform hosts the model for you: no GPU provisioning, no inference server to tune (contrast the Inference & Serving chapter), and instant access to frontier models behind one API. You trade that convenience for vendor lock-in, less control over the runtime, higher per-token cost at very large scale, and the need to reason carefully about where your data goes. The core question is build-vs-buy: self-host when you need full control or have steady high volume; buy a managed platform when you want speed-to-market and elastic demand.
| Dimension | Buy (managed API) | Build (self-host, see Inference chapter) |
|---|---|---|
| GPU ops | None — provider owns it | You provision, patch, scale |
| Model access | Instant, frontier models | Open-weights you can run |
| Cost shape | Pay-per-token or provisioned | Fixed GPU rental + ops staff |
| Control | Limited | Full (quantization, batching) |
| Best when | Fast launch, spiky load | High steady volume, data must stay in-house |
Q: Why use a managed platform instead of self-hosting a model? You skip all GPU operations — no provisioning, scaling, patching, or inference tuning — and get instant access to frontier models behind one autoscaling, compliant API. The trade is less control and higher unit cost at scale. It is the fastest path from idea to production.
Q: What is the build-vs-buy decision here? Buy a managed API when you value speed-to-market, elastic/spiky demand, and want the best closed models without ops overhead. Build (self-host) when you have steady high volume where fixed GPU cost wins, need deep runtime control, or have data that legally cannot leave your environment.
Q: What is the difference between pay-per-token and provisioned throughput? Pay-per-token (serverless) bills per input/output token with no commitment — ideal for variable or low traffic. Provisioned throughput reserves dedicated capacity for a fixed hourly/monthly fee, giving guaranteed latency and rate limits — cheaper and more predictable once volume is high and steady.
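The choice between the two pricing shapes reduces to a break-even calculation. The sketch below is a minimal illustration under loud assumptions: the prices are hypothetical (not any provider's real rates), and input and output tokens are billed at one blended rate, which real price lists do not do.

```python
def monthly_serverless_cost(tokens_per_month: float, price_per_1k_tokens: float) -> float:
    """Pay-per-token cost: grows linearly with usage."""
    return tokens_per_month / 1000 * price_per_1k_tokens


def break_even_tokens(provisioned_monthly_fee: float, price_per_1k_tokens: float) -> float:
    """Monthly token volume above which a flat provisioned fee is cheaper."""
    return provisioned_monthly_fee / price_per_1k_tokens * 1000


# Hypothetical numbers, for illustration only.
PRICE_PER_1K = 0.01          # blended dollars per 1,000 tokens
PROVISIONED_FEE = 20_000.0   # flat dollars per month

threshold = break_even_tokens(PROVISIONED_FEE, PRICE_PER_1K)
print(f"break-even at about {threshold:,.0f} tokens/month")

for volume in (5e8, 1e9, 5e9):
    serverless = monthly_serverless_cost(volume, PRICE_PER_1K)
    cheaper = "provisioned" if serverless > PROVISIONED_FEE else "serverless"
    print(f"{volume:,.0f} tokens/month: serverless ${serverless:,.0f} -> choose {cheaper}")
```

Below the threshold the flat fee buys capacity you do not use; above it every extra token is effectively free until you hit the reserved throughput limit.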
Q: Why do region and data residency matter? Your prompts and outputs are sent to the provider’s servers, so the region you pick determines which country’s laws and privacy rules apply and where data is processed. Regulated workloads (healthcare, finance, government) often require data to stay in a specific geography, so you choose the matching region and confirm the provider does not train its base models on your data.
“Managed” does not mean “free of governance.” You still own data classification, access control, and audit. Always confirm in writing that your inputs are not used to train the vendor’s base models.
28.2 — 🟧 AWS Bedrock
AWS Bedrock is a serverless, single-API gateway to foundation models from many providers — Anthropic Claude, Meta Llama, Mistral, Amazon Nova/Titan, Cohere, and more — with no infrastructure to manage. You call models two ways via boto3: the older provider-specific InvokeModel and the newer Converse API, which gives one unified message format across every model. On top of the raw models, Bedrock adds managed building blocks so you do not have to wire them yourself.
```python
import boto3

# The bedrock-runtime client handles inference calls (Converse, InvokeModel).
client = boto3.client("bedrock-runtime", region_name="us-east-1")

resp = client.converse(
    # Some newer models may need a region-prefixed inference profile ID here
    # instead of the bare model ID; check what your account and region expose.
    modelId="anthropic.claude-sonnet-4-20250514-v1:0",
    messages=[{"role": "user", "content": [{"text": "Explain RAG in one sentence."}]}],
    inferenceConfig={"maxTokens": 200, "temperature": 0.2},
)
# The reply is a list of content blocks; the first one holds the text.
print(resp["output"]["message"]["content"][0]["text"])
```

Q: What is the Converse API and why use it over InvokeModel? The Converse API is a single, unified chat interface that works the same across every Bedrock model, so switching from Claude to Llama is a one-line modelId change. InvokeModel is the older call where the request/response JSON is specific to each provider, so it requires per-model formatting. Reach for Converse for new multi-turn or tool-use work.
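To make the "one-line modelId change" concrete, here is a minimal sketch that builds the same Converse request for two models. The helper names `build_converse_request` and `ask` are hypothetical, and the model IDs are placeholders that may not be enabled in your account or region. The network call is kept in a separate function so the request builder can be inspected without AWS credentials.

```python
def build_converse_request(model_id: str, history: list[tuple[str, str]],
                           max_tokens: int = 200) -> dict:
    """Build keyword arguments for the bedrock-runtime converse() call.

    history is a list of (role, text) pairs, role being "user" or "assistant".
    """
    return {
        "modelId": model_id,
        "messages": [
            {"role": role, "content": [{"text": text}]} for role, text in history
        ],
        "inferenceConfig": {"maxTokens": max_tokens, "temperature": 0.2},
    }


def ask(model_id: str, history: list[tuple[str, str]], region: str = "us-east-1") -> str:
    """Send the conversation to Bedrock; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without AWS

    client = boto3.client("bedrock-runtime", region_name=region)
    resp = client.converse(**build_converse_request(model_id, history))
    return resp["output"]["message"]["content"][0]["text"]


history = [("user", "Explain RAG in one sentence.")]

# Placeholder model IDs: only the modelId differs between the two requests.
claude_req = build_converse_request("anthropic.claude-sonnet-4-20250514-v1:0", history)
llama_req = build_converse_request("meta.llama3-1-8b-instruct-v1:0", history)
print(claude_req["messages"] == llama_req["messages"])
```

For multi-turn work you append the assistant's reply and the next user turn to `history` and call `ask` again; the message format stays the same for every model.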
Q: What are Bedrock Knowledge Bases? Knowledge Bases are Bedrock’s managed RAG: you point it at documents in S3, it handles chunking, embeddings, and a vector store, then answers queries with citations. It saves you from building the retrieval pipeline by hand — see the RAG chapter for what is happening underneath.
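Querying a Knowledge Base goes through a different boto3 client, `bedrock-agent-runtime`, and its `retrieve_and_generate` call. The sketch below is a minimal example under assumptions: the helper names are hypothetical, and the knowledge base ID and model ARN are placeholders you would replace with values from your own account.

```python
def build_kb_query(question: str, knowledge_base_id: str, model_arn: str) -> dict:
    """Build keyword arguments for bedrock-agent-runtime's retrieve_and_generate()."""
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,
                "modelArn": model_arn,
            },
        },
    }


def ask_knowledge_base(question: str, knowledge_base_id: str, model_arn: str,
                       region: str = "us-east-1") -> tuple[str, list]:
    """Return the grounded answer and its citations; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without AWS

    client = boto3.client("bedrock-agent-runtime", region_name=region)
    resp = client.retrieve_and_generate(
        **build_kb_query(question, knowledge_base_id, model_arn)
    )
    return resp["output"]["text"], resp.get("citations", [])


# Placeholder identifiers, for illustration only.
query = build_kb_query(
    "What is our refund policy?",
    knowledge_base_id="KB-PLACEHOLDER",
    model_arn="arn:aws:bedrock:us-east-1::foundation-model/MODEL-PLACEHOLDER",
)
print(query["retrieveAndGenerateConfiguration"]["type"])
```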
Q: What do Agents for Bedrock and Guardrails add? Agents for Bedrock orchestrate multi-step tasks — calling APIs (action groups) and Knowledge Bases to fulfil a request (see the Agents chapter). Guardrails apply configurable safety and topic/PII filters across any model, so you enforce one policy regardless of which model serves the request.
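Attaching a guardrail to a Converse call is a matter of adding a `guardrailConfig` entry to the request. A minimal sketch, assuming a guardrail has already been created; the helper names and the guardrail ID are hypothetical placeholders.

```python
import copy


def with_guardrail(request: dict, guardrail_id: str, version: str) -> dict:
    """Return a copy of Converse keyword arguments with a guardrail attached."""
    guarded = copy.deepcopy(request)  # leave the caller's request untouched
    guarded["guardrailConfig"] = {
        "guardrailIdentifier": guardrail_id,
        "guardrailVersion": version,
    }
    return guarded


def was_blocked(response: dict) -> bool:
    """True when the Converse response reports that the guardrail intervened."""
    return response.get("stopReason") == "guardrail_intervened"


base_request = {
    "modelId": "MODEL-PLACEHOLDER",
    "messages": [{"role": "user", "content": [{"text": "Hello"}]}],
}
guarded_request = with_guardrail(base_request, "GUARDRAIL-PLACEHOLDER", "1")
print(sorted(guarded_request))
```

Because the guardrail is part of the request rather than the model, the same `with_guardrail` wrapper applies whichever `modelId` you route to.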
Q: What is Provisioned Throughput in Bedrock? Provisioned Throughput reserves dedicated model capacity for a committed term, giving guaranteed token rates and stable latency instead of shared on-demand limits. You use it for high, steady production traffic, and it is typically required to serve models you fine-tune in Bedrock (on-demand inference covers only a subset of models).
Q: How does Bedrock keep my data secure? Access is controlled through IAM, and you can reach Bedrock privately over VPC / PrivateLink so traffic never crosses the public internet. AWS states your prompts and completions are not used to train the base foundation models and stay within your selected region.
28.3 — 🔵 Google Vertex AI
Vertex AI is Google Cloud’s end-to-end ML platform, not just a model API. It spans Model Garden (Gemini plus open and third-party models, including Anthropic Claude), managed training, Pipelines, online and batch Endpoints, a Model Registry, Feature Store, and the Vertex AI RAG Engine for grounding — all under Google Cloud IAM and networking. It is the enterprise, governed way to run Gemini in production and to deploy your own models with full MLOps.
You use the same unified google-genai SDK as AI Studio, just pointed at your GCP project — that is what makes the prototype-to-production migration a near drop-in change.
```python
from google import genai

client = genai.Client(vertexai=True, project="my-gcp-project", location="us-central1")
resp = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Explain vector embeddings in one sentence.",
)
print(resp.text)
```

Q: How is Vertex AI different from just an LLM API? Vertex is a full ML platform: beyond calling Gemini it covers training, pipelines, a model registry, feature store, and managed endpoints for your own models. The model API is one feature; the rest supports the entire lifecycle described in the MLOps & LLMOps chapter.
Q: When would you choose Vertex over Bedrock? Choose Vertex when you are already on Google Cloud, want Gemini as a first-class model, or need an integrated platform for training and deploying your own models alongside hosted ones. Choose Bedrock when you are on AWS and want the widest menu of third-party models behind one serverless API. Both offer strong governance; the deciding factor is usually your existing cloud and model preference.
Q: What is Model Garden? Model Garden is Vertex’s catalog of usable models — Google’s Gemini and embeddings, open models like Llama and Gemma, and third-party models including Claude — that you can test and deploy from one place. It is how Vertex offers multi-provider choice similar to Bedrock.
Q: How does Vertex handle enterprise governance? Through Google Cloud IAM for fine-grained access, VPC Service Controls (VPC-SC) to build a data-exfiltration perimeter, regional data residency, audit logging, and customer-managed encryption keys. This is why regulated enterprises pick Vertex over the lighter AI Studio.
28.4 — 🟢 Google AI Studio
Google AI Studio is the fast, free front door to Gemini: a browser playground where you test prompts, tune temperature and safety settings, and grab an API key in minutes. The same key drives the lightweight Gemini API through the google-genai SDK — ideal for prototyping and small apps. It is the consumer-facing path, deliberately distinct from production Vertex AI.
```python
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")
resp = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Explain prompt engineering in one sentence.",
)
print(resp.text)
```

Q: How is AI Studio different from Vertex AI? AI Studio is for individuals and prototyping: a quick API key, a playground, and the consumer Gemini API, with no cloud project setup. Vertex AI is the production/enterprise path with IAM, VPC controls, SLAs, data residency, and full MLOps. Same Gemini models underneath, very different governance and scale.
Q: Is AI Studio free, and what is the catch? AI Studio has a generous free tier for experimentation, which makes it great for learning and demos. The catch is that the free tier may use your data to improve products and lacks enterprise guarantees — so do not put regulated or production traffic through it; move to Vertex (or the paid Gemini API) for that.
Q: What is the migration path from AI Studio to Vertex? You prototype with the google-genai SDK and an AI Studio key, then switch to Vertex by re-pointing the same client — genai.Client(vertexai=True, project=..., location=...) instead of api_key=... — so you inherit IAM, governance, and SLAs with minimal code change. The prompts and model choices carry over; the surrounding platform hardens.
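One way to keep that switch to a configuration change is to choose the client arguments from settings. This is a minimal sketch: the setting names `USE_VERTEX`, `GCP_PROJECT`, `GCP_LOCATION`, and `GEMINI_API_KEY` are hypothetical names chosen for this example, not variables the SDK itself reads.

```python
import os


def genai_client_kwargs(env) -> dict:
    """Choose genai.Client arguments from environment-style settings."""
    if env.get("USE_VERTEX", "").lower() in ("1", "true"):
        # Production path: Vertex AI, governed by the GCP project's IAM.
        return {
            "vertexai": True,
            "project": env["GCP_PROJECT"],
            "location": env.get("GCP_LOCATION", "us-central1"),
        }
    # Prototype path: AI Studio API key.
    return {"api_key": env["GEMINI_API_KEY"]}


def make_client(env=os.environ):
    """Create the client; needs the google-genai package installed."""
    from google import genai  # imported lazily so the chooser works without it

    return genai.Client(**genai_client_kwargs(env))


print(genai_client_kwargs({"GEMINI_API_KEY": "YOUR_API_KEY"}))
print(genai_client_kwargs({"USE_VERTEX": "true", "GCP_PROJECT": "my-gcp-project"}))
```

The calls to `client.models.generate_content(...)` stay identical on both paths, which is why only the construction step needs to know about the environment.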
Q: When should I reach for AI Studio? Use it for prototyping, hackathons, learning, and quick personal tools where speed matters and the data is non-sensitive. The moment you need governance, compliance, or production scale, graduate to Vertex AI.
28.5 — 🟦 Azure OpenAI Service
Azure OpenAI Service is Microsoft’s governed wrapper around OpenAI’s models (GPT-4o, the o-series reasoning models, embeddings, and more), run inside your Azure tenant with enterprise auth, networking, and compliance. You deploy a model under a deployment name and call it through the OpenAI-compatible SDK, pointed at your Azure endpoint. It is the natural choice for organizations standardized on Azure and Microsoft Entra ID.
```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://my-resource.openai.azure.com",
    api_version="2024-10-21",
    api_key="YOUR_API_KEY",  # or use Entra ID token auth in production
)
resp = client.chat.completions.create(
    model="my-gpt4o-deployment",  # your deployment name, not the raw model id
    messages=[{"role": "user", "content": "Explain fine-tuning in one sentence."}],
)
print(resp.choices[0].message.content)
```

Q: How is Azure OpenAI different from calling OpenAI directly? It serves the same OpenAI models but inside your Azure tenant, with Entra ID auth, VNet/private endpoints, regional data residency, and Azure’s compliance certifications. The trade-off is that new models and features often land on OpenAI’s own API first, so Azure can lag by weeks.
Q: What is the “deployment name” gotcha? On Azure you do not call a raw model id; you first create a deployment of a model and then pass that deployment name as the model parameter. This trips people up when porting OpenAI code — the SDK call looks identical but the model value is your deployment, not gpt-4o.
Q: What are PTUs? Provisioned Throughput Units (PTUs) are Azure’s reserved-capacity model: you buy dedicated throughput for predictable latency and rate limits, the Azure equivalent of Bedrock Provisioned Throughput. Use pay-as-you-go tokens for spiky or early traffic and PTUs once volume is high and steady.
Q: What is “On Your Data”? Azure OpenAI On Your Data is the managed-RAG feature: connect a data source (such as Azure AI Search) and the service grounds answers in your documents — Azure’s analog to Bedrock Knowledge Bases. For multi-step agents, Microsoft now points to the newer Azure AI Foundry Agent Service rather than the older Assistants API.
The most common Entra ID gotcha: in production you authenticate with a token credential (managed identity / DefaultAzureCredential), not the static API key — and the identity needs the Cognitive Services OpenAI User role, or calls fail with 401/403 even though the deployment exists.
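A minimal sketch of the token-credential setup described above, assuming the `azure-identity` and `openai` packages are installed and the running identity already holds the Cognitive Services OpenAI User role. The function name `make_azure_client` and the endpoint are hypothetical placeholders.

```python
# Scope requested for Azure OpenAI tokens.
TOKEN_SCOPE = "https://cognitiveservices.azure.com/.default"


def make_azure_client(endpoint: str, api_version: str = "2024-10-21"):
    """Create an AzureOpenAI client that authenticates with Entra ID, not an API key."""
    # Imported lazily so this module loads even where the SDKs are not installed.
    from azure.identity import DefaultAzureCredential, get_bearer_token_provider
    from openai import AzureOpenAI

    # DefaultAzureCredential tries managed identity, environment variables,
    # the Azure CLI login, and other sources in turn.
    token_provider = get_bearer_token_provider(DefaultAzureCredential(), TOKEN_SCOPE)
    return AzureOpenAI(
        azure_endpoint=endpoint,
        api_version=api_version,
        azure_ad_token_provider=token_provider,  # replaces api_key=...
    )


# Usage (requires Azure credentials, so it is left commented out):
# client = make_azure_client("https://my-resource.openai.azure.com")
```

The rest of the code, including the `model="my-gpt4o-deployment"` argument, is unchanged; only the way the client obtains its credentials differs.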
28.6 — ⚖️ Choosing & deploying
Once you know the platforms, the choice comes down to which cloud you are on, which models you need, and your governance requirements. The table below compares the four options; after it we cover the deployment patterns that apply across all of them.
| Feature | AWS Bedrock | Google Vertex AI | Azure OpenAI Service | Google AI Studio |
|---|---|---|---|---|
| Model choice | Many providers (Claude, Llama, Mistral, Nova, Cohere) | Gemini + Model Garden (Llama, Gemma, Claude) | OpenAI models (GPT-4o, o-series) | Gemini only |
| Governance | IAM, VPC, PrivateLink | IAM, VPC-SC, CMEK | Entra ID, VNet, Azure policy | API key, light |
| Pricing model | Pay-per-token + Provisioned Throughput | Pay-per-token + provisioned | Pay-per-token + PTU | Free tier + pay-per-token |
| Managed RAG | Knowledge Bases | Vertex AI RAG Engine | On Your Data | Minimal |
| Managed agents | Agents for Bedrock | Vertex AI Agent Builder | Azure AI Foundry Agent Service | Minimal |
| Guardrails | Bedrock Guardrails | Safety filters / Model Armor | Content filters | Basic safety settings |
| Lock-in | AWS ecosystem | GCP ecosystem | Azure + OpenAI | Low (easy to start) |
Q: What are the main deployment patterns? Serverless / pay-per-token APIs scale automatically and need zero capacity planning — the default for most apps. Dedicated / provisioned endpoints (Bedrock Provisioned Throughput, Vertex endpoints, Azure PTUs) reserve capacity for guaranteed latency and rate limits at high steady volume. Choose serverless for spiky or early traffic, provisioned once usage is large and predictable.
Q: How do I control cost on these platforms? Use prompt caching to avoid re-paying for repeated context, and model routing to send easy requests to cheap small models and only hard ones to frontier models (see the Inference & Serving chapter for both). Also cap maxTokens so a single response cannot run up an unbounded output bill. Provisioned throughput can also be cheaper than per-token once volume is high and steady.
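Model routing with a token cap can be sketched in a few lines. This is a deliberately naive illustration: the model IDs are placeholders, and the keyword and length heuristic is an assumption made for the example; production routers usually rely on a trained classifier or a small model to judge difficulty.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model_id: str
    max_tokens: int  # hard cap on output length for this tier


# Placeholder model IDs, for illustration only.
CHEAP = Route("small-model-placeholder", max_tokens=256)
FRONTIER = Route("frontier-model-placeholder", max_tokens=1024)

# Hypothetical hints that a request needs deeper reasoning.
HARD_HINTS = ("prove", "refactor", "step by step", "analyze")


def choose_route(prompt: str, long_prompt_chars: int = 2000) -> Route:
    """Send long or reasoning-heavy prompts to the frontier tier, the rest to the cheap one."""
    text = prompt.lower()
    if len(prompt) > long_prompt_chars or any(hint in text for hint in HARD_HINTS):
        return FRONTIER
    return CHEAP


for prompt in ("What is the capital of France?", "Analyze this contract clause by clause."):
    route = choose_route(prompt)
    print(f"{route.model_id} (max {route.max_tokens} tokens) <- {prompt}")
```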
Q: How do these platforms handle compliance and privacy? The enterprise services carry certifications like SOC 2 and HIPAA eligibility, support data residency by region, and offer customer-managed encryption keys plus private networking. They contractually state your data is not used to train base models — confirm the specific certifications and the region for your regulated workload.
Q: How does this connect to MLOps and LLMOps? You typically front these APIs with an LLM gateway for unified auth, routing, rate limiting, and cost tracking, and add observability (traces, token/cost metrics, evals) — exactly the operational layer in the MLOps & LLMOps chapter. The managed platform serves the model; your LLMOps stack governs how you use it.
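The cost-tracking part of a gateway can be as small as a ledger fed by each response's token counts. The sketch below assumes the `usage` dictionary has the `inputTokens` and `outputTokens` fields of a Bedrock Converse response; the class name `UsageLedger`, the model ID, and the prices are hypothetical.

```python
from collections import defaultdict


class UsageLedger:
    """Accumulate token counts per model and price them."""

    def __init__(self, prices_per_1k: dict):
        # prices_per_1k maps model_id -> (input price, output price) per 1,000 tokens
        self.prices = prices_per_1k
        self.tokens = defaultdict(lambda: {"input": 0, "output": 0})

    def record(self, model_id: str, usage: dict) -> None:
        """Add one response's usage, shaped like a Converse response's "usage" field."""
        self.tokens[model_id]["input"] += usage["inputTokens"]
        self.tokens[model_id]["output"] += usage["outputTokens"]

    def cost(self, model_id: str) -> float:
        """Total spend so far for one model."""
        in_price, out_price = self.prices[model_id]
        counts = self.tokens[model_id]
        return counts["input"] / 1000 * in_price + counts["output"] / 1000 * out_price


# Hypothetical prices and usage, for illustration only.
ledger = UsageLedger({"model-placeholder": (1.0, 2.0)})
ledger.record("model-placeholder", {"inputTokens": 1000, "outputTokens": 500})
print(f"spend so far: ${ledger.cost('model-placeholder'):.2f}")
```

A real gateway would also tag each record with the caller or team so spend can be attributed, and export the counts to whatever observability stack you already run.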
Default to serverless pay-per-token while traffic is small or spiky, and only commit to provisioned capacity once you have steady, measured volume. That way you never pay for reserved capacity that sits idle.
28.7 — Key takeaways
- Managed cloud AI platforms trade control and per-token cost for zero GPU ops, instant frontier-model access, autoscaling, and compliance — the “buy” side of build-vs-buy.
- Pay-per-token (serverless) suits spiky or early traffic; provisioned throughput wins on cost and latency once volume is high and steady.
- Region/data residency decides which laws apply and where data is processed; always confirm your data is not used to train base models.
- Reach for AWS Bedrock when you are on AWS and want the widest menu of providers behind one serverless API, with managed Knowledge Bases, Agents, and Guardrails.
- Reach for Google Vertex AI when you are on GCP, want Gemini first-class, or need an end-to-end governed ML platform for training and deploying your own models.
- Reach for Google AI Studio when you are prototyping, learning, or building quick non-sensitive tools and want a free API key in minutes, then graduate to Vertex (same google-genai SDK) for production.
- Reach for Azure OpenAI Service when you are on Azure and want governed access to OpenAI’s GPT and o-series models with Entra ID auth and PTU capacity, remembering the deployment-name and token-credential gotchas.
- Front any of them with an LLM gateway and observability for routing, cost tracking, and evals — the operational layer from the MLOps & LLMOps chapter.