
Configure rate limiting based on token consumption

Generative AI workloads are uniquely resource-intensive, often consuming large volumes of tokens per request. Without proper controls, this can lead to unpredictable costs, degraded performance, and unfair resource consumption across users and applications. Token-based rate limiting provides a precise mechanism to manage usage by measuring consumption at the token level rather than just counting requests. This ensures that lightweight queries and heavy prompts are treated proportionally, enabling enterprises to protect infrastructure, enforce quotas, and maintain consistent service quality while integrating large language models (LLMs) into production environments.

Note:

Token-based rate limiting is supported only for the OpenAI Chat Completions API.

  1. Create a stream selector to identify the entity or parameter in the HTTP request on which you want to throttle requests. In this example, the AI application sends the user ID in the "X-user-id" HTTP header. NetScaler can perform rate limiting on any attribute that is part of the HTTP header or body. For more information on stream selectors, see Configure a selector.

    add stream selector <Selector Name> <Attribute>
    <!--NeedCopy-->
    

    Example:

    add stream selector UserIdHeader "HTTP.REQ.HEADER(\"X-user-id\")"
    <!--NeedCopy-->
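
    The client side of this setup can be sketched as follows. This is a hypothetical Python example, not part of the NetScaler configuration: it shows a request to an OpenAI-compatible endpoint carrying the "X-user-id" header that the UserIdHeader stream selector keys on. The user ID and API key values are illustrative.

    ```python
    # Hypothetical client-side sketch: every request to the LLM endpoint
    # behind NetScaler carries an "X-user-id" header, so the UserIdHeader
    # stream selector can track token consumption per user.
    def build_request_headers(user_id: str, api_key: str) -> dict:
        """Build headers for an OpenAI-compatible chat completions request."""
        return {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
            # NetScaler evaluates HTTP.REQ.HEADER("X-user-id") on this
            # header to identify the rate-limited entity.
            "X-user-id": user_id,
        }

    headers = build_request_headers("user-42", "sk-demo")
    print(headers["X-user-id"])  # user-42
    ```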
    
  2. Create a rate limit identifier to check whether the number of tokens exceeds a specified value within a particular time interval. For example, suppose you need to rate limit users based on token consumption per minute, with the interval start and end aligned to UTC minute boundaries, and limit rate limit alerts to 100 per minute.

    Note:

    In a multi-PE (packet engine) deployment, the configured threshold is split equally across the PEs.

    add ns limitIdentifier <Identifier Name> -threshold <Rate Threshold> -timeSlice <millisec> -mode TOKEN_RATE -selectorName <Selector Name> -alertsInTimeSlice <Number of Alerts> -timeAlign <MINUTE>
    <!--NeedCopy-->
    

    In this configuration:

    • alertsInTimeSlice: Number of AppFlow alerts sent within the configured time slice. A value of 0 disables alerts. A value of 65535 indicates no limit on the number of AppFlow alerts.
    • timeAlign: Possible values are:
      • MINUTE: Aligns the time windows for the configured time slice to minute boundaries. If you choose the MINUTE option, the time slice value must be a multiple of 60000 ms.
      • NONE: NONE is the default value, and time slice alignment happens every 10 ms.

    Example:

    add ns limitIdentifier RateToken -threshold 500 -timeSlice 60000 -mode TOKEN_RATE -selectorName UserIdHeader -alertsInTimeSlice 100 -timeAlign MINUTE
    <!--NeedCopy-->
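
    The multi-PE note above can be illustrated with a small sketch. This is not NetScaler code; it only works through the arithmetic, assuming a hypothetical appliance with 4 packet engines.

    ```python
    # Illustration of how NetScaler splits a configured threshold equally
    # across packet engines. With the example threshold of 500 tokens per
    # minute and an assumed 4 PEs, each PE enforces a local budget of 125.
    def per_pe_threshold(threshold: int, pe_count: int) -> int:
        """Per-PE share of the configured rate limit threshold."""
        return threshold // pe_count

    print(per_pe_threshold(500, 4))  # 125
    ```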
    
  3. Create a responder action to specify the response that is sent to the client when NetScaler applies the rate limit.

    add responder action <Action Name> respondwith <HTTP Response Expression>
    <!--NeedCopy-->
    

    Example:

    add responder action TokenRateLimitAction respondwith q<"\"HTTP/1.1 429 Too Many Requests\r\nContent-Type: application/json\r\nRetry-After: 60\r\n\r\n{\"error\": {\"message\": \"You exceeded your current token quota. Please check your plan and billing details.\", \"type\": \"rate_limit_exceeded\", \"param\": null, \"code\": \"429\"}}\"">
    <!--NeedCopy-->
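
    A well-behaved client should honor this 429 response. The following is a hypothetical client-side sketch, not part of the NetScaler configuration: it parses the Retry-After header and the JSON error body that the responder action above sends, so the client knows how long to back off.

    ```python
    # Hypothetical client-side handling of the 429 response defined above:
    # read the Retry-After header set by the responder action and back off
    # before retrying the chat completions request.
    import json

    def parse_rate_limit_response(status: int, headers: dict, body: str):
        """Return (should_retry, wait_seconds, error_message)."""
        if status != 429:
            return False, 0, None
        wait = int(headers.get("Retry-After", "60"))
        message = json.loads(body)["error"]["message"]
        return True, wait, message

    body = ('{"error": {"message": "You exceeded your current token quota. '
            'Please check your plan and billing details.", '
            '"type": "rate_limit_exceeded", "param": null, "code": "429"}}')
    retry, wait, msg = parse_rate_limit_response(429, {"Retry-After": "60"}, body)
    print(retry, wait)  # True 60
    ```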
    
  4. Create a responder policy that uses the responder action created in step 3 and the rate limit identifier created in step 2.

    add responder policy <Policy Name> "SYS.CHECK_LIMIT(\"<Identifier Name>\")" <Action Name>
    <!--NeedCopy-->
    

    Example:

    add responder policy DemoRatePolicy "SYS.CHECK_LIMIT(\"RateToken\")" TokenRateLimitAction
    <!--NeedCopy-->
    
  5. Bind the responder policy to the load balancing virtual server.

    bind lb vserver <Vserver Name> -policyName <Policy Name> -priority <Priority Number> -gotoPriorityExpression END -type REQUEST
    <!--NeedCopy-->
    

    Example:

    bind lb vserver AzureOpenAIGPT5.1 -policyName DemoRatePolicy -priority 3 -gotoPriorityExpression END -type REQUEST
    <!--NeedCopy-->
    
  6. Optionally, if the X-user-id header contains personally identifiable information (PII), or if you do not want it to be part of the OpenAI query, you can configure a rewrite policy to drop the header from the back-end request to OpenAI.

    1. Add a rewrite action to delete the "X-user-id" header from the HTTP request.

      add rewrite action <Rewrite Action Name> <Type> <Target>
      <!--NeedCopy-->
      

      Example:

      add rewrite action drop_user_header delete_http_header X-user-id
      <!--NeedCopy-->
      
    2. Add rewrite policy using the rewrite action.

      add rewrite policy <Rewrite Policy Name> <Rule> <Action>
      <!--NeedCopy-->
      

      Example:

      add rewrite policy drop_user_policy "HTTP.REQ.HEADER(\"X-user-id\").EXISTS" drop_user_header
      <!--NeedCopy-->
      
    3. Bind rewrite policy to the load balancing virtual server.

      bind lb vserver <Vserver Name> -policyName <Policy Name> -priority <Priority Number> -gotoPriorityExpression <Expression> -type REQUEST
      <!--NeedCopy-->
      

      Example:

      bind lb vserver AzureOpenAIGPT5.1 -policyName drop_user_policy -priority 2 -gotoPriorityExpression NEXT -type REQUEST
      <!--NeedCopy-->
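
      The effect of this rewrite policy can be sketched in Python. This is illustrative only, not NetScaler code: the client-facing request carries X-user-id for rate limiting, but the copy forwarded to OpenAI has it removed, mirroring the delete_http_header action above.

      ```python
      # Sketch of what the drop_user_policy rewrite accomplishes: strip the
      # X-user-id header from the request before it is forwarded to the
      # back end, so the PII never reaches OpenAI.
      def strip_user_header(headers: dict) -> dict:
          """Return a copy of the headers without X-user-id."""
          forwarded = dict(headers)
          forwarded.pop("X-user-id", None)  # drop PII before forwarding
          return forwarded

      client_headers = {"X-user-id": "user-42", "Content-Type": "application/json"}
      print(sorted(strip_user_header(client_headers)))  # ['Content-Type']
      ```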
      
  7. Export rate-limited requests to Splunk for observability.

    1. Add the Splunk endpoint as a service.

      add service <ServiceName> <IPaddress> <Type> <Port>
      <!--NeedCopy-->
      

      Example:

      add service splunk_collector 10.0.0.2 HTTP 8088
      <!--NeedCopy-->
      
    2. Add an analytics profile of type Web Insight that uses the Splunk service as the collector.

      add analytics profile <ProfileName> -collectors <Splunk_Service> -type webinsight -dataFormatFile splunk_new.txt -analyticsAuthToken <"Splunk {HEC token}"> -analyticsEndpointUrl "/services/collector/event" -analyticsEndpointContentType "application/json"
      <!--NeedCopy-->
      

      Example:

      add analytics profile demowebinsight -collectors splunk_collector -type webinsight -dataFormatFile splunk_new.txt -analyticsAuthToken "Splunk {HEC token}" -analyticsEndpointUrl "/services/collector/event" -analyticsEndpointContentType "application/json"
      <!--NeedCopy-->
      

      Note:

      See Web insight records for information on the JSON fields that must be part of splunk_new.txt to export rate limit alerts.

    3. Bind the analytics profile to the load balancing virtual server where the rate limit responder policies are bound.

      bind lb vserver <Vserver Name> -analyticsProfile <Analytics Profile Name>
      <!--NeedCopy-->
      

      Example:

      bind lb vserver AzureOpenAIGPT5.1 -analyticsProfile demowebinsight
      <!--NeedCopy-->
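
      The request that reaches Splunk can be sketched as follows. This is a hypothetical illustration of the Splunk HTTP Event Collector (HEC) wire format implied by the analytics profile above: events are POSTed to /services/collector/event with an "Authorization: Splunk <HEC token>" header and a JSON body. The event field names here are illustrative, not the actual Web Insight schema.

      ```python
      # Hypothetical sketch of the HEC request produced by the analytics
      # profile: endpoint URL, auth header, and content type match the
      # -analyticsEndpointUrl, -analyticsAuthToken, and
      # -analyticsEndpointContentType settings above.
      import json

      def build_hec_request(hec_token: str, event: dict):
          """Return (path, headers, body) for a Splunk HEC event POST."""
          headers = {
              "Authorization": f"Splunk {hec_token}",
              "Content-Type": "application/json",
          }
          body = json.dumps({"event": event})
          return "/services/collector/event", headers, body

      # Illustrative event payload for a rate-limited request.
      path, headers, body = build_hec_request("abc123", {"rate_limit": "hit"})
      print(path)  # /services/collector/event
      ```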
      