
Configure load balancing for AI models

Token quota awareness ensures efficient distribution of requests across models while controlling cost and performance. NetScaler AI Gateway supports two load balancing methods:

  • Token latency–based load balancing: Routes traffic to the backend with the lowest response time while considering token usage.
  • Round robin load balancing: Evenly distributes requests across all backends regardless of token consumption.

By integrating token quota awareness into these strategies, the gateway prevents overload, optimizes resource utilization, and maintains predictable service quality.
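The token latency–based method can be pictured roughly as follows. This is an illustrative sketch only, not NetScaler's implementation: the backend fields, the latency metric, and the selection rule are assumptions chosen to show the idea of combining latency with quota awareness.

```python
# Conceptual sketch: pick the lowest-latency backend that still has quota.
# All names and fields here are illustrative assumptions, not NetScaler internals.
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    avg_token_latency_ms: float  # observed latency per generated token
    tokens_remaining: int        # quota left in the current refresh window

def pick_backend(backends, estimated_tokens):
    # Keep only backends with enough quota for this request,
    # then prefer the lowest per-token latency among them.
    eligible = [b for b in backends if b.tokens_remaining >= estimated_tokens]
    if not eligible:
        return None  # every backend is exhausted until the next quota refresh
    return min(eligible, key=lambda b: b.avg_token_latency_ms)

backends = [
    Backend("AzureAI-1", avg_token_latency_ms=12.0, tokens_remaining=500),
    Backend("AzureAI-2", avg_token_latency_ms=9.0, tokens_remaining=100),
]
# AzureAI-2 is faster but lacks quota for a 200-token request,
# so that request is steered to AzureAI-1 instead.
print(pick_backend(backends, estimated_tokens=200).name)
```

Round robin with quota awareness would differ only in the final selection step: it cycles through the eligible list instead of ranking it by latency.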

AI Gateway load balancing

  1. Create a front-end and a backend AI gateway profile. The front-end profile is bound to the load balancing virtual server, while the backend profile is bound to the service that NetScaler uses to connect to the Large Language Model (LLM).

    1. Front-end profile

      add aigwprofile <FrontendProfileName> -endpointType azureopenai -profileType frontend
      <!--NeedCopy-->
      

      Example:

      add aigwprofile azureoai_frontend_profile -endpointType azureopenai -profileType frontend
      <!--NeedCopy-->
      
    2. Backend profile. Create one backend profile for each AzureOpenAI model deployment if the deployments need different quota limits.

      add aigwprofile <BackendProfileName> -endpointType azureopenai -profileType backend -tokenQuota <TokenQuota> -quotaRefreshFrequency <IntervalInMinsAfterWhichTokenQuotaIsRefreshed> -authToken <authtokenstring>
      <!--NeedCopy-->
      

      In this configuration:

      • -tokenQuota <TokenQuota>: Token capacity of the backend server.
      • -quotaRefreshFrequency <IntervalInMinsAfterWhichTokenQuotaIsRefreshed>: Quota refresh rate, in minutes.
      • -authToken <authtokenstring>: Authorization token or API key to connect with LLM/AI model services.

      Example:

      add aigwprofile azureoai_backend_profile1 -endpointType azureopenai -profileType backend -tokenQuota 12000 -authToken {TokenString1}
      
      add aigwprofile azureoai_backend_profile2 -endpointType azureopenai -profileType backend -tokenQuota 24000 -authToken {TokenString2}
      <!--NeedCopy-->
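The interaction between -tokenQuota and -quotaRefreshFrequency can be sketched as a simple refreshing counter. This is a conceptual illustration only, with an injected clock and invented field names; it is not how NetScaler tracks quota internally, and the over-quota behavior shown is an assumption.

```python
# Conceptual sketch of a per-backend token quota that refreshes on an interval.
# Field names, the clock, and the over-quota behavior are illustrative assumptions.
class TokenQuota:
    def __init__(self, quota, refresh_minutes):
        self.quota = quota                    # corresponds to -tokenQuota
        self.refresh_minutes = refresh_minutes  # corresponds to -quotaRefreshFrequency
        self.remaining = quota
        self.window_start = 0.0               # minutes; injected clock for clarity

    def consume(self, tokens, now_minutes):
        # Refill the quota once the configured refresh interval has elapsed.
        if now_minutes - self.window_start >= self.refresh_minutes:
            self.remaining = self.quota
            self.window_start = now_minutes
        if tokens > self.remaining:
            return False  # over quota: traffic shifts away until the next refresh
        self.remaining -= tokens
        return True

q = TokenQuota(quota=12000, refresh_minutes=1)
print(q.consume(9000, now_minutes=0.0))  # True: 3000 tokens left in the window
print(q.consume(5000, now_minutes=0.5))  # False: exceeds the remaining quota
print(q.consume(5000, now_minutes=1.2))  # True: quota refreshed after 1 minute
```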
      
  2. If the AzureOpenAI endpoint is an FQDN, create two servers, one for each gpt-5.1 deployment of the endpoint.

    add server <ServerName> <FQDN>
    <!--NeedCopy-->
    

    Example:

    add server AzureAI-1-Svr dep-1.openai.azure.com
    
    add server AzureAI-2-Svr dep-2.openai.azure.com
    <!--NeedCopy-->
    
  3. Create two services, one for each deployment, and attach the backend aigwprofile to each. Provide the IP address or FQDN and the port of the AzureOpenAI instance.

    add service <ServiceName> <IP or ServerName> SSL <Port> -aigwProfileName <BackendProfileName>
    <!--NeedCopy-->
    

    Example:

    add service AzureAI-1 AzureAI-1-Svr SSL 443 -aigwProfileName azureoai_backend_profile1 
     
    add service AzureAI-2 AzureAI-2-Svr SSL 443 -aigwProfileName azureoai_backend_profile2 
    <!--NeedCopy-->
    
  4. Create a load balancing virtual server, provide the virtual IP address and port on which it listens, and attach the front-end aigwprofile to it.

    add lb vserver <LbVserverName> SSL <IP> <Port> -aigwProfileName <NameOfFrontendAIGWProfile> -lbmethod <LoadBalancingMethod>
    <!--NeedCopy-->
    

    Example:

    add lb vserver AzureOpenAIGpt5.1 SSL 10.0.0.1 443 -aigwProfileName azureoai_frontend_profile -lbmethod leastllmtokenlatency
    <!--NeedCopy-->
    

    Note:

    • You must create one load balancing virtual server per model. Do not bind services for different models to the same load balancing virtual server.

    • Ensure that the server authentication and server certificates are enabled for the LLM endpoint service.

      add ssl certKey <certkeyName> -cert <path_to_cert_file> -key <path_to_key_file>
      bind ssl service <service_name> -certkeyName <CA_certkeyName> -CA
      add ssl profile <profile_name> -serverAuth ENABLED
      bind ssl service <service_name> -profileName <profile_name>
      <!--NeedCopy-->
      

      Example:

      bind ssl service AzureAI-1 -certkeyName <CA_certkeyName> -CA
      <!--NeedCopy-->
      
  5. Bind both model deployment services to the load balancing virtual server.

    bind lb vserver <LbVserverName> <ServiceName>
    <!--NeedCopy-->
    

    Example:

    bind lb vserver AzureOpenAIGpt5.1 AzureAI-1 
    bind lb vserver AzureOpenAIGpt5.1 AzureAI-2
    <!--NeedCopy-->
    

The load balancing virtual server now distributes LLM queries across the two gpt-5.1 deployments on AzureOpenAI. With the leastllmtokenlatency method, it serves each LLM request to the endpoint with the least token latency while maintaining the quota limits at each service level.
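A client request through the gateway targets the virtual IP from the example (10.0.0.1) using the Azure OpenAI chat completions path. The sketch below only builds the request; the deployment name, api-version value, and api-key header are assumptions to substitute with your own (and whether the client must supply the key at all depends on your setup, since the backend profile's -authToken can carry credentials to the LLM service).

```python
# Sketch of the request a client would send through the gateway VIP.
# The api-version, api-key placeholder, and deployment name are assumptions.
import json

def build_chat_request(vip, deployment, prompt, api_version="2024-02-01"):
    url = (f"https://{vip}/openai/deployments/{deployment}"
           f"/chat/completions?api-version={api_version}")
    headers = {"Content-Type": "application/json", "api-key": "<your-api-key>"}
    body = json.dumps({"messages": [{"role": "user", "content": prompt}]})
    return url, headers, body

url, headers, body = build_chat_request("10.0.0.1", "gpt-5.1", "Hello")
print(url)
```

Sending this request repeatedly lets you observe the gateway steering traffic between AzureAI-1 and AzureAI-2 as latency and quota conditions change.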

Points to note

  • The front-end aigwprofile takes only one parameter, -endpointType.
  • The backend aigwprofile takes two mandatory parameters, -endpointType and -tokenQuota, and two optional parameters, -authToken and -quotaRefreshFrequency.
  • You can set the aigwProfileName parameter only during the add operation of the load balancing virtual server and service entities. The set and unset operations are not supported for the aigwProfileName parameter.