Friday, 17 October 2014

SRX tcp-proxy resource exhaustion for ALG/IDP/UTM traffic with client/server communication using TCP keepalives

Product Affected:

All SRX platforms
Junos OS 11.4
Junos OS 12.1
Junos OS 12.1X44
Alert Description:
SRX tcp-proxy resources may reach device limits while processing ALG/IDP/UTM traffic if the client/server communication uses the TCP keepalive mechanism. Once the limit is reached, session setup fails for new ALG/IDP/UTM traffic.

When the SRX processes ALG/IDP/UTM traffic that uses TCP keepalives and receives a server-to-client TCP keepalive, tcp-proxy sends a TCP keepalive response back to the server on behalf of the client. However, the received keepalive is not forwarded to the client. Because the client receives no keepalive packets, it establishes a new session with the server. The original SRX session and its associated tcp-proxy resource are not freed, because tcp-proxy keeps answering the continued server-to-client keepalives.
The process repeats, so sessions build up on the SRX until the available tcp-proxy resources are exhausted.
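
The build-up described above can be approximated with a toy model. This is only a rough sketch, not a description of SRX internals: it assumes every affected client abandons exactly one session per reconnect interval and that no abandoned session is ever freed. The function name, the client count, and the reconnect interval are hypothetical values chosen for illustration; only the 10240 limit comes from the sample output in this alert.

```python
import math

def seconds_until_exhaustion(limit, clients, reconnect_interval_s, in_use=0):
    """Estimate the time until leaked tcp-proxy connections reach the limit.

    Toy model (assumption): once per reconnect interval, every client opens
    a new session and the old one stays allocated forever.
    """
    if clients <= 0 or reconnect_interval_s <= 0:
        raise ValueError("clients and reconnect_interval_s must be positive")
    remaining = max(limit - in_use, 0)
    # Each round leaks one connection per client; round up to whole rounds.
    rounds = math.ceil(remaining / clients)
    return rounds * reconnect_interval_s

# Hypothetical example: 100 clients reconnecting every 600 seconds
# against a limit of 10240 tcp-proxy connections.
estimate = seconds_until_exhaustion(10240, 100, 600)
print(f"limit reached after roughly {estimate / 3600:.1f} hours")
```

Real traffic will not follow this model exactly, but it shows why a small number of keepalive-based client/server pairs can eventually consume the whole tcp-proxy pool.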

Locate tcp-proxy resource limit
     SRX Datacenter
1) Open Shell connection
     >start shell

2) Elevate to root level access (as needed)
     % su (enter in root password)

3) Locate tcp-proxy resource limit per SPC
     root@srx5800% srx-cprod.sh -s spu -c "show usp nat cp sys" | grep proxy
  usp_max_tcpproxy_connection = 10240
  usp_max_tcpproxy_connection = 10240
  usp_max_tcpproxy_connection = 10240
  usp_max_tcpproxy_connection = 10240


      SRX Branch-Campus
1) Locate device tcp-proxy resource limit
   >request pfe execute target fwdd command "show usp nat cp sys" | match proxy
    GOT: usp_max_tcpproxy_connection = 4096

Verify current tcp-proxy resource usage
     SRX Datacenter
1) Open Shell connection
    >start shell

2) Elevate to root level access (as needed)
    % su (enter in root password)

3) Locate current usage of tcp-proxy
    root@srx5800% srx-cprod.sh -s spu -c "show usp jsf tcpstats" | grep "flow_tcb alloc\|Start SPU" | uniq
   ======== Start SPU4.0, fpc4.pic0, spu ========
   flow_tcb alloc cnt : 0000000000 flow_tcb free cnt : 0000000000
   ======== Start SPU4.1, fpc4.pic1, spu ========
   flow_tcb alloc cnt : 0000012487 flow_tcb free cnt : 0000008741
   ======== Start SPU11.0, fpc11.pic0, spu ========
   flow_tcb alloc cnt : 0000011452 flow_tcb free cnt : 0000007930
   ======== Start SPU11.1, fpc11.pic1, spu ========
   flow_tcb alloc cnt : 0000012874 flow_tcb free cnt : 0000009016

4) For each SPU, subtract 'flow_tcb free cnt' from 'flow_tcb alloc cnt'
    fpc4.pic0           0 -      0  =      0  
    fpc4.pic1    12487 - 8741  = 3746 in use
    fpc11.pic0   11452 - 7930  = 3522 in use
    fpc11.pic1   12874 - 9016  = 3858 in use
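
The subtraction in step 4 can also be done off-box by a short script. The following is a minimal sketch that assumes the command output has been saved as text in the format shown above; the function name `tcp_proxy_in_use` and the `SAMPLE` constant are illustrative, not part of any Juniper tooling.

```python
import re

# Matches either an SPU header line or an alloc/free counter pair.
# \s+ also matches a newline, so a wrapped counter line still parses.
TOKEN_RE = re.compile(
    r"Start SPU[\d.]+,\s*(?P<spu>fpc\d+\.pic\d+)"
    r"|flow_tcb alloc cnt\s*:\s*(?P<alloc>\d+)\s+flow_tcb free cnt\s*:\s*(?P<free>\d+)"
)

def tcp_proxy_in_use(output):
    """Return {spu_name: alloc - free} for each counter pair in the output."""
    usage = {}
    spu = "device"  # used when the output has no SPU header (branch devices)
    for match in TOKEN_RE.finditer(output):
        if match.group("spu"):
            spu = match.group("spu")
        else:
            usage[spu] = int(match.group("alloc")) - int(match.group("free"))
    return usage

# Sample text copied from the output shown in step 3.
SAMPLE = """
======== Start SPU4.0, fpc4.pic0, spu ========
flow_tcb alloc cnt : 0000000000 flow_tcb free cnt : 0000000000
======== Start SPU4.1, fpc4.pic1, spu ========
flow_tcb alloc cnt : 0000012487 flow_tcb free cnt : 0000008741
======== Start SPU11.0, fpc11.pic0, spu ========
flow_tcb alloc cnt : 0000011452 flow_tcb free cnt : 0000007930
======== Start SPU11.1, fpc11.pic1, spu ========
flow_tcb alloc cnt : 0000012874 flow_tcb free cnt : 0000009016
"""

for name, count in tcp_proxy_in_use(SAMPLE).items():
    print(f"{name}: {count} in use")
```

Output without an SPU header, as on branch devices, is reported under the single key `device`.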

      SRX Branch-Campus
1) Open Shell connection
  >start shell

2) Elevate to root level access (as needed)
    % su (enter in root password)

3) Locate current usage of tcp-proxy
    root@PN-STL-RTR1% cprod -A fwdd -c "show usp jsf tcpstats" | grep "flow_tcb alloc" | uniq
  flow_tcb alloc cnt : 0000000015 flow_tcb free cnt : 0000000012

4) Subtract 'flow_tcb free cnt' from 'flow_tcb alloc cnt'
    15 - 12 = 3 in use
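
Once both the limit and the current usage are known, they can be compared to judge how close a device or SPU is to exhaustion. The sketch below is one possible way to do that; the function name and the 80% warning threshold are assumptions chosen for illustration, not values recommended by this alert.

```python
def tcp_proxy_utilization(in_use, limit, warn_ratio=0.8):
    """Return (ratio, warn), where warn is True at or above warn_ratio.

    warn_ratio is a hypothetical threshold; pick one that leaves enough
    time to act before the limit is reached.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = in_use / limit
    return ratio, ratio >= warn_ratio

# Figures taken from the examples in this alert:
# fpc4.pic1 on the data center device, then the branch device.
for label, used, limit in [("fpc4.pic1", 3746, 10240), ("branch", 3, 4096)]:
    ratio, warn = tcp_proxy_utilization(used, limit)
    status = "WARNING" if warn else "ok"
    print(f"{label}: {used}/{limit} = {ratio:.1%} {status}")
```
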
Solution:
The following software releases have enhanced SRX handling of tcp-keepalive processing.
    Junos OS 12.1X45
    Junos OS 12.1X46
    Junos OS 12.1X47 and higher versions


The enhancement lets tcp-proxy learn the TCP keepalive parameters from both the client and the server. With these parameters, tcp-proxy sends TCP keepalives to both the client and the server, and it closes the SRX session and frees the associated tcp-proxy resource when 16 tcp-proxy keepalives receive no response.


Workaround:
Before the resource limit is reached, close the SRX sessions associated with client/server communication that uses TCP keepalives. Closing these sessions frees the SRX tcp-proxy resources.

          SRX Clusters:
               Fail over the data redundancy groups (RG1+) to the peer node
                 (the failover triggers tcp-proxy to send packets to both client and server; the client responds with a RST, which closes the associated SRX session)

                       >request chassis cluster failover redundancy-group <#> node <#>

                 or

               Manually clear sessions for client/server communication that use tcp-keepalive
                      >clear security flow session source-prefix <x.x.x.x> destination-prefix <y.y.y.y>


          Standalone SRX:
               Manually clear sessions for client/server communication that use tcp-keepalive
                       >clear security flow session source-prefix <x.x.x.x> destination-prefix <y.y.y.y>
