
Configure load balancing for AI models

Token quota awareness ensures efficient distribution of requests across models while controlling cost and performance. NetScaler AI Gateway supports two load balancing methods:

  • Token latency–based load balancing: Routes traffic to the backend with the lowest response time while considering token usage.
  • Round robin load balancing: Evenly distributes requests across all backends regardless of token consumption.

By integrating token quota awareness into these strategies, the gateway prevents overload, optimizes resource utilization, and maintains predictable service quality.
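The token latency–based method can be pictured roughly as follows. This is an illustrative sketch only, not NetScaler's implementation: the backend fields, the latency metric, and the selection rule are assumptions chosen to show the idea of combining latency with quota awareness.

```python
# Conceptual sketch: pick the lowest-latency backend that still has quota.
# All names and fields here are illustrative assumptions, not NetScaler internals.
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    avg_token_latency_ms: float  # observed latency per generated token
    tokens_remaining: int        # quota left in the current refresh window

def pick_backend(backends, estimated_tokens):
    # Keep only backends with enough quota for this request,
    # then prefer the lowest per-token latency among them.
    eligible = [b for b in backends if b.tokens_remaining >= estimated_tokens]
    if not eligible:
        return None  # every backend is exhausted until the next quota refresh
    return min(eligible, key=lambda b: b.avg_token_latency_ms)

backends = [
    Backend("AzureAI-1", avg_token_latency_ms=12.0, tokens_remaining=500),
    Backend("AzureAI-2", avg_token_latency_ms=9.0, tokens_remaining=100),
]
# AzureAI-2 is faster but lacks quota for a 200-token request,
# so that request is steered to AzureAI-1 instead.
print(pick_backend(backends, estimated_tokens=200).name)
```

Round robin with quota awareness would differ only in the final selection step: it cycles through the eligible list instead of ranking it by latency.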

AI Gateway load balancing

  1. Create a front-end and a backend AI gateway profile. The front-end profile is bound to the load balancing virtual server, while the backend profile is bound to the service that NetScaler uses to connect to the Large Language Model (LLM).

    1. Front-end profile

      add aigwprofile <FrontendProfileName> -endpointType azureopenai -profileType frontend
      <!--NeedCopy-->
      

      Example:

      add aigwprofile azureoai_frontend_profile -endpointType azureopenai -profileType frontend
      <!--NeedCopy-->
      
    2. Backend profile. Create one backend profile for each AzureOpenAI model deployment if the deployments need different quota limits.

      add aigwprofile <BackendProfileName> -endpointType azureopenai -profileType backend -tokenQuota <TokenQuota> -quotaRefreshFrequency <IntervalInMinsAfterWhichTokenQuotaIsRefreshed> -authToken <authtokenstring>
      <!--NeedCopy-->
      

      In this configuration:

      • -tokenQuota <TokenQuota>: Token capacity of the backend server.
      • -quotaRefreshFrequency <IntervalInMinsAfterWhichTokenQuotaIsRefreshed>: Quota refresh rate, in minutes.
      • -authToken <authtokenstring>: Authorization token or API key to connect with LLM/AI model services.

      Example:

      add aigwprofile azureoai_backend_profile1 -endpointType azureopenai -profileType backend -tokenQuota 12000 -authToken {TokenString1}
      
      add aigwprofile azureoai_backend_profile2 -endpointType azureopenai -profileType backend -tokenQuota 24000 -authToken {TokenString2}
      <!--NeedCopy-->
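The interaction between -tokenQuota and -quotaRefreshFrequency can be sketched as a simple refreshing counter. This is a conceptual illustration only, with an injected clock and invented field names; it is not how NetScaler tracks quota internally, and the over-quota behavior shown is an assumption.

```python
# Conceptual sketch of a per-backend token quota that refreshes on an interval.
# Field names, the clock, and the over-quota behavior are illustrative assumptions.
class TokenQuota:
    def __init__(self, quota, refresh_minutes):
        self.quota = quota                    # corresponds to -tokenQuota
        self.refresh_minutes = refresh_minutes  # corresponds to -quotaRefreshFrequency
        self.remaining = quota
        self.window_start = 0.0               # minutes; injected clock for clarity

    def consume(self, tokens, now_minutes):
        # Refill the quota once the configured refresh interval has elapsed.
        if now_minutes - self.window_start >= self.refresh_minutes:
            self.remaining = self.quota
            self.window_start = now_minutes
        if tokens > self.remaining:
            return False  # over quota: traffic shifts away until the next refresh
        self.remaining -= tokens
        return True

q = TokenQuota(quota=12000, refresh_minutes=1)
print(q.consume(9000, now_minutes=0.0))  # True: 3000 tokens left in the window
print(q.consume(5000, now_minutes=0.5))  # False: exceeds the remaining quota
print(q.consume(5000, now_minutes=1.2))  # True: quota refreshed after 1 minute
```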
      
  2. If the AzureOpenAI endpoint is an FQDN, create two servers, one for each gpt-5.1 deployment of the endpoint.

    add server <ServerName> <FQDN>
    <!--NeedCopy-->
    

    Example:

    add server AzureAI-1-Svr dep-1.openai.azure.com
    
    add server AzureAI-2-Svr dep-2.openai.azure.com
    <!--NeedCopy-->
    
  3. Create two services, one for each deployment, and attach the backend aigwprofile to each. Provide the IP address or FQDN and the port of the AzureOpenAI instance.

    add service <ServiceName> <IP or ServerName> SSL <Port> -aigwProfileName <BackendProfileName>
    <!--NeedCopy-->
    

    Example:

    add service AzureAI-1 AzureAI-1-Svr SSL 443 -aigwProfileName azureoai_backend_profile1 
     
    add service AzureAI-2 AzureAI-2-Svr SSL 443 -aigwProfileName azureoai_backend_profile2 
    <!--NeedCopy-->
    
  4. Create a load balancing virtual server, provide the virtual IP address and port on which it listens, and attach the front-end aigwprofile to it.

    add lb vserver <LbVserverName> SSL <IP> <Port> -aigwProfileName <NameOfFrontendAIGWProfile> -lbmethod <LoadBalancingMethod>
    <!--NeedCopy-->
    

    Example:

    add lb vserver AzureOpenAIGpt5.1 SSL 10.0.0.1 443 -aigwProfileName azureoai_frontend_profile -lbmethod leastllmtokenlatency
    <!--NeedCopy-->
    

    Note:

    • You must create one load balancing virtual server per model. Do not bind services for different models to the same load balancing virtual server.

    • Ensure that the server authentication and server certificates are enabled for the LLM endpoint service.

      add ssl certKey <certkeyName> -cert <path_to_cert_file> -key <path_to_key_file>
      bind ssl service <service_name> -certkeyName <CA_certkeyName> -CA
      add ssl profile <profile_name> -serverAuth ENABLED
      bind ssl service <service_name> -profileName <profile_name>
      <!--NeedCopy-->
      

      Example:

      bind ssl service AzureAI-1 -certkeyName <CA_certkeyName> -CA
      <!--NeedCopy-->
      
  5. Bind both model deployment services to the load balancing virtual server.

    bind lb vserver <LbVserverName> <ServiceName>
    <!--NeedCopy-->
    

    Example:

    bind lb vserver AzureOpenAIGpt5.1 AzureAI-1 
    bind lb vserver AzureOpenAIGpt5.1 AzureAI-2
    <!--NeedCopy-->
    

The load balancing virtual server now distributes LLM queries across the two gpt-5.1 deployments on AzureOpenAI. With the leastllmtokenlatency method, it serves each LLM request to the endpoint with the least token latency while maintaining the quota limits at each service level.
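A client request through the gateway targets the virtual IP from the example (10.0.0.1) using the Azure OpenAI chat completions path. The sketch below only builds the request; the deployment name, api-version value, and api-key header are assumptions to substitute with your own (and whether the client must supply the key at all depends on your setup, since the backend profile's -authToken can carry credentials to the LLM service).

```python
# Sketch of the request a client would send through the gateway VIP.
# The api-version, api-key placeholder, and deployment name are assumptions.
import json

def build_chat_request(vip, deployment, prompt, api_version="2024-02-01"):
    url = (f"https://{vip}/openai/deployments/{deployment}"
           f"/chat/completions?api-version={api_version}")
    headers = {"Content-Type": "application/json", "api-key": "<your-api-key>"}
    body = json.dumps({"messages": [{"role": "user", "content": prompt}]})
    return url, headers, body

url, headers, body = build_chat_request("10.0.0.1", "gpt-5.1", "Hello")
print(url)
```

Sending this request repeatedly lets you observe the gateway steering traffic between AzureAI-1 and AzureAI-2 as latency and quota conditions change.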

Points to note

  • The front-end aigwprofile takes only one parameter, -endpointType.
  • The backend aigwprofile takes two mandatory parameters, -endpointType and -tokenQuota, and two optional parameters, -authToken and -quotaRefreshFrequency.
  • You can set the aigwProfileName parameter only during the add operation of the load balancing virtual server and service entities. The set and unset operations are not supported for the aigwProfileName parameter.