Understanding the enhancements and updates
Azure OpenAI is a cloud-based service that provides access to OpenAI’s advanced machine learning models, enabling developers to build and deploy AI applications efficiently. It offers a range of capabilities, from natural language processing and code generation to image understanding with multimodal models, making it valuable for businesses looking to leverage AI technology.
For readers already familiar with Azure OpenAI and the distinctions between the PAYG and Provisioned Throughput Units commercial models, and who wish to focus solely on the latest announcements, please proceed directly to the section titled ‘Azure OpenAI service updates October 2024’.
Understanding the payment options: PAYG vs. Provisioned Throughput
When using Azure OpenAI, clients have two primary options for managing their usage and costs: Pay-As-You-Go (PAYG) and Provisioned Throughput Units (PTU).
Pay-As-You-Go (PAYG)
The PAYG model allows users to pay only for the resources they consume. This model is ideal for applications with variable or unpredictable traffic patterns, as it offers flexibility and helps avoid unnecessary costs. Users are billed based on the number of tokens processed, with input (prompt) and output (completion) tokens priced separately per model, making it a straightforward and scalable option for many scenarios.
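To make the billing model concrete, the sketch below estimates monthly PAYG spend from request volume and average token counts. The per-token prices are illustrative assumptions, not current list prices; always check the Azure OpenAI pricing page for your model and region.

```python
# Rough PAYG cost model: input (prompt) and output (completion) tokens are
# priced separately per model. Prices here are ILLUSTRATIVE ASSUMPTIONS.
PRICE_PER_1K_INPUT = 0.0025   # assumed $ per 1K prompt tokens
PRICE_PER_1K_OUTPUT = 0.01    # assumed $ per 1K completion tokens

def estimate_monthly_cost(requests_per_day: int,
                          avg_input_tokens: int,
                          avg_output_tokens: int) -> float:
    """Estimate monthly PAYG spend for a steady workload (30-day month)."""
    daily_cost = requests_per_day * (
        avg_input_tokens / 1000 * PRICE_PER_1K_INPUT
        + avg_output_tokens / 1000 * PRICE_PER_1K_OUTPUT
    )
    return daily_cost * 30

# e.g. 10,000 requests/day, 1,500 prompt + 300 completion tokens each
print(f"~${estimate_monthly_cost(10_000, 1_500, 300):,.2f}/month")  # ~$2,025.00
```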
Provisioned Throughput Units (PTU)
In contrast, the PTU model provides a more predictable and controlled approach to managing AI workloads. By provisioning a specific amount of throughput, clients can ensure stable performance and latency for their applications. This model is particularly beneficial for production environments with well-defined and consistent traffic patterns, allowing for accurate capacity forecasting and cost management.
When to use provisioned throughput units (PTU)
Provisioned Throughput Units are recommended when you have well-defined, predictable throughput requirements. Typically, this occurs when an application is ready for production or has already been deployed in production, and there’s a clear understanding of the expected traffic. Key scenarios for using PTUs include:
- Applications that are ready for or already in production.
- Applications with predictable capacity or usage expectations.
- Applications with real-time or latency-sensitive requirements.
It’s important to understand your expected Tokens Per Minute (TPM) usage in detail, especially for function calling and agent use cases, before migrating workloads to PTU.
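As a starting point for that analysis, here is a back-of-the-envelope TPM estimate; all workload numbers are hypothetical placeholders to replace with your own telemetry. Function-calling and agent workloads often chain several model calls per user request, which multiplies token consumption.

```python
# Hypothetical TPM (tokens per minute) sizing sketch before a PTU migration.
peak_requests_per_minute = 120   # assumed peak traffic
avg_prompt_tokens = 2_000        # includes system prompt and tool definitions
avg_completion_tokens = 400
model_calls_per_request = 3      # e.g. agent loop: plan -> call tool -> answer

estimated_tpm = (peak_requests_per_minute * model_calls_per_request
                 * (avg_prompt_tokens + avg_completion_tokens))
print(f"Estimated peak TPM: {estimated_tpm:,}")  # Estimated peak TPM: 864,000
```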
Changes to provisioned throughput announced in August 2024
In August 2024, Microsoft introduced significant updates to the Provisioned Throughput offering, addressing customer feedback and enhancing usability and operational agility. These changes include:
Model-independent quota
The transition from model-specific to model-independent quota simplifies quota administration and accelerates experimentation with new models. This change allows for a single quota limit covering all models and versions within a subscription and region.
Self-service quota requests
Clients can now request quota increases without engaging the sales team, with many requests being auto-approved. This self-service capability streamlines the process and enhances user autonomy.
New hourly/reservation commercial model
The new payment model offers flexibility with hourly usage options and substantial discounts for term commitments via Azure Reservations. Clients can now choose between hourly, uncommitted usage and discounted one-month or one-year term commitments, providing cost-effective solutions tailored to their needs.
Default quota in many regions
New and existing subscriptions are automatically assigned a small amount of provisioned quota in many regions, enabling customers to start using those regions without first requesting quota.
Support for the latest model generations
The hourly/reservation model is required to deploy models released after August 1, 2024. This ensures that customers can access and utilize the latest advancements in AI technology.
Enhanced capacity transparency
New tools and APIs provide real-time information on capacity availability, helping users find regions with the necessary model capacity for their deployments. This transparency reduces the need to negotiate capacity ahead of deployment and accelerates time-to-market.
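As an illustration, the sketch below queries per-region capacity for a model through the Azure Resource Manager API. The `modelCapacities` route and api-version shown are assumptions based on the capacity-transparency APIs described above; verify the exact path in the current Azure REST API reference before relying on it.

```python
# Hypothetical sketch: list per-region capacity for a model via ARM.
# The route and api-version below are ASSUMPTIONS to verify against the docs.
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION_ID = "<your-subscription-id>"
token = DefaultAzureCredential().get_token("https://management.azure.com/.default")

resp = requests.get(
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}"
    "/providers/Microsoft.CognitiveServices/modelCapacities",
    params={
        "api-version": "2024-04-01-preview",  # assumed preview version
        "modelFormat": "OpenAI",
        "modelName": "gpt-4o",
        "modelVersion": "2024-08-06",
    },
    headers={"Authorization": f"Bearer {token.token}"},
)
resp.raise_for_status()
for item in resp.json().get("value", []):
    print(item.get("location"), item.get("properties"))
```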
These updates reflect Microsoft’s ongoing commitment to improving the Azure OpenAI service, making it more flexible, user-friendly, and aligned with the needs of modern AI applications. For more detailed information, existing clients are encouraged to refer to the Azure OpenAI provisioned onboarding guide and the August update documentation.
Azure OpenAI service updates October 2024
Now that we have summarised what the Azure OpenAI service is and how the commercial models compare, let’s look at the latest announcements from Microsoft published in October 2024.
The October 2024 update includes a number of key changes and new capabilities, most notably new deployment options for Azure OpenAI focused on resiliency and data zones. It introduces data zones for both pay-as-you-go and provisioned throughput units (PTU), allowing deployments to be confined to a specific geography such as the US or EU.
Data Zones
Data zones help achieve better throughput and reduced latency while adhering to data sovereignty requirements. There are two main data zones: the US Data Zone and the EU Data Zone. Given the occasional capacity limitations for OpenAI models in individual Azure regions, being able to point an Azure OpenAI service at a geography such as the EU, which currently comprises two Azure regions, provides greater service availability. Additionally, the new data zone standard deployments leverage Azure’s global infrastructure to dynamically route each request to the data centre within the Microsoft-defined data zone with the best availability. This deployment type supports models such as gpt-4o-2024-08-06 and provides higher default quotas.
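A minimal sketch of creating a data zone standard deployment with the azure-mgmt-cognitiveservices Python SDK follows. The resource names are placeholders, and the "DataZoneStandard" SKU name and capacity value are assumptions to check against current Azure documentation.

```python
# Sketch: create a data zone standard deployment of gpt-4o-2024-08-06.
from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient
from azure.mgmt.cognitiveservices.models import (
    Deployment, DeploymentModel, DeploymentProperties, Sku,
)

client = CognitiveServicesManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<your-subscription-id>",
)

poller = client.deployments.begin_create_or_update(
    resource_group_name="my-rg",          # placeholder resource group
    account_name="my-aoai-resource",      # placeholder Azure OpenAI resource
    deployment_name="gpt-4o-datazone",
    deployment=Deployment(
        sku=Sku(name="DataZoneStandard", capacity=300),  # assumed SKU name
        properties=DeploymentProperties(
            model=DeploymentModel(
                format="OpenAI", name="gpt-4o", version="2024-08-06",
            ),
        ),
    ),
)
print(poller.result().properties.provisioning_state)
```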
Latency SLA
A new 99% latency service level agreement (SLA) has been introduced for token generation in provisioned throughput deployments.
Cost Reductions
The cost for provisioned throughput units has been reduced, with global provisioned throughput now costing $1 per hour, down from $2. The minimum PTU requirements have also been lowered, making it more accessible for smaller applications. Furthermore, Azure OpenAI global batch is now generally available, offering batch processing at 50% less cost than global standard with a 24-hour turnaround target.
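The sketch below shows the global batch flow with the openai Python SDK: upload a JSONL file of requests, then create a batch with the 24-hour completion window. The endpoint, API version, and deployment name are placeholders to adapt to your environment.

```python
# Sketch: submit an Azure OpenAI global batch job (50% of global standard cost).
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-10-01-preview",  # assumed batch-capable API version
)

# requests.jsonl holds one request per line, e.g.:
# {"custom_id": "task-1", "method": "POST", "url": "/chat/completions",
#  "body": {"model": "<global-batch-deployment>", "messages": [...]}}
batch_file = client.files.create(
    file=open("requests.jsonl", "rb"), purpose="batch",
)

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/chat/completions",
    completion_window="24h",  # matches the 24-hour turnaround target
)
print(batch.id, batch.status)
```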
Prompt Caching
Prompt caching allows the service to reuse processing for prompts that share the same initial tokens, reducing overall compute and, in turn, latency for repeated prompt prefixes.
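The practical implication is to structure requests so the long, static part comes first and the variable part comes last, as in the sketch below. The endpoint and deployment name are placeholders.

```python
# Prompt caching rewards requests that share an identical leading prefix:
# keep static instructions at the start, per-user content at the end.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-10-01-preview",  # assumed preview API version
)

STATIC_SYSTEM_PROMPT = (
    "You are a support assistant for Contoso.\n"
    "...several thousand tokens of policies, tool definitions, examples..."
)  # identical on every call, so the shared prefix can be served from cache

def answer(user_question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # deployment name placeholder
        messages=[
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # cacheable
            {"role": "user", "content": user_question},           # variable
        ],
    )
    return response.choices[0].message.content
```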
Model Flexibility
There is increased flexibility to change models and versions within the reservation period, such as switching between GPT-4o and GPT-4o mini.
API and Model Support
The o1-preview and o1-mini models are now available for API access and model deployment. Registration is required, with access granted based on Microsoft’s eligibility criteria. Support for the o1 series models was added in API version 2024-09-01-preview, and the max_tokens parameter has been deprecated and replaced with max_completion_tokens. Region availability for the models includes East US 2 and Sweden Central for approved customers.
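A minimal call against an o1-series deployment looks like the sketch below; note max_completion_tokens in place of the deprecated max_tokens. The endpoint and deployment name are placeholders.

```python
# Sketch: call an o1-series model with the API version named in the update.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-09-01-preview",  # adds o1-series support
)

response = client.chat.completions.create(
    model="o1-preview",  # your deployment name
    messages=[{"role": "user", "content": "Summarise the PTU changes."}],
    max_completion_tokens=2_000,  # replaces the deprecated max_tokens
)
print(response.choices[0].message.content)
```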
New GPT-4o Realtime API
Azure OpenAI GPT-4o audio is part of the GPT-4o model family and supports low-latency “speech in, speech out” conversational interactions. The gpt-4o-realtime-preview model is available for global deployments in the East US 2 and Sweden Central regions and is ideal for use cases involving live interactions such as customer support agents, voice assistants, and real-time translators.
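The Realtime API is websocket-based rather than request/response. The sketch below opens a session and requests a spoken greeting; the URL shape, API version, and event names reflect the preview API and are assumptions to verify against current documentation.

```python
# Hedged sketch: minimal Realtime API session over a websocket.
import asyncio
import json
import websockets  # pip install websockets

URI = (
    "wss://<your-resource>.openai.azure.com/openai/realtime"
    "?api-version=2024-10-01-preview&deployment=gpt-4o-realtime-preview"
)

async def main() -> None:
    # websockets >= 14 uses additional_headers; older versions use extra_headers.
    async with websockets.connect(
        URI, additional_headers={"api-key": "<your-api-key>"}
    ) as ws:
        # Ask the model to respond with audio plus a text transcript.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "modalities": ["audio", "text"],
                "instructions": "Greet the caller.",
            },
        }))
        async for raw in ws:
            event = json.loads(raw)
            print(event["type"])  # stream of server events
            if event["type"] in ("response.done", "error"):
                break

asyncio.run(main())
```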
Takeaways
- The introduction of data zones significantly enhances deployment flexibility and compliance with data sovereignty laws.
- The new latency SLA and cost reductions make Azure OpenAI more accessible and efficient for various applications.
- Prompt caching and model flexibility provide additional benefits in terms of performance and adaptability.
Next steps
In line with Microsoft’s Cloud Adoption Framework, we offer AI-focussed Cloud Readiness Assessments and Cloud Innovation Workshops to build out AI capabilities in your organisation. We also offer Cloud Technology Accelerators, such as our AI Starter Packages, which provide out-of-the-box, ready-to-use AI solutions based on Microsoft AI services including OpenAI and Copilot.