Beginning with cross-region deduction in Amazon Bedrock | AWS Machine Learning Blog

The Paradigm Shift in Generative AI Applications

With the advent of generative AI solutions, a paradigm shift is underway across industries, driven by organizations embracing foundation models to unlock unprecedented opportunities. Amazon Bedrock has emerged as the preferred choice for numerous customers seeking to innovate and launch generative AI applications, leading to an exponential surge in demand for model inference capabilities.

Challenges of Handling Traffic Spikes

Bedrock customers aim to scale their worldwide applications to accommodate growth, and require additional burst capacity to handle unexpected surges in traffic. Currently, users might have to engineer their applications to handle scenarios involving traffic spikes that can use service quotas from multiple regions by implementing complex techniques such as client-side load balancing between AWS regions, where Amazon Bedrock service is supported.

The Solution: Cross-Region Inference

Today, we are happy to announce the general availability of cross-region inference, a powerful feature allowing automatic cross-region inference routing for requests coming to Amazon Bedrock. This offers developers using on-demand inference mode, a seamless solution for managing optimal availability, performance, and resiliency while managing incoming traffic spikes of applications powered by Amazon Bedrock.

How Cross-Region Inference Works

When a request arrives at Amazon Bedrock, a capacity check is performed in the same region where the request originated from, and if there is enough capacity, the request is fulfilled. If not, a second check determines a secondary region to route the request to for processing. This process eliminates the need for manual capacity checks and ensures optimal availability for each request.

Benefits of Cross-Region Inference

By utilizing cross-region inference, developers no longer have to spend time predicting demand fluctuations. Instead, traffic is dynamically routed across multiple regions, ensuring optimal availability and performance during high-usage periods. This capability also prioritizes the primary region, minimizing latency and improving responsiveness.

Architecture of Cross-Region Inference

The operational flow starts with an inference request coming to a primary region for an on-demand baseline model. Capacity evaluations are made on the primary region and secondary regions, ensuring that all traffic remains within the AWS network. In the event of processing failure in a region, requests are rerouted to regions with the highest available capacity.

Adopting Cross-Region Inference

To adopt cross-region inference, you can start by using Inference Profiles in Amazon Bedrock. These profiles abstract model ARNs from different regions, allowing you to easily manage and route inference requests across multiple regions. Integrating inference profiles into existing workloads requires minimal code changes and seamless transition.

Considerations for Cross-Region Inference

While cross-region inference provides enhanced reliability and performance for applications, it is important to evaluate your data residency and compliance requirements. Ensure that transmitting data across regions aligns with your policies and regulations. Cross-region inference is now generally available in the US and EU for supported models.

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *