Friday, September 24, 2010

The Care and Feeding of Your z/VSE TCP/IP Stack

I recently suggested to a customer that they might get better throughput with their application by changing the priority so that the application was running higher priority than the TCP/IP stack partition. I received a questioning response. Really? Doesn't the stack always run higher priority than the applications it services? My answer was: nice rule, but not always true.

By way of explanation, we can look at throughput as the care and feeding of your application or of the TCP/IP stack.

Feeding the TCP/IP stack is something you do when your application is sending data. Your application feeds the TCP/IP stack data. The data is queued into the stack's transmit buffer. Keeping the transmit buffer full makes sure the stack always has data to transmit. Once the stack has taken all the data sent and queued it into the transmit buffer, the application's send request is posted.

You can see that if the stack is busy and your application is running lower priority than the stack, your application may not get dispatched to send more data until a later time. It is possible that the stack will transmit all the data in the transmit buffer before your application can send more data. This leaves the transmit buffer empty and can reduce throughput.

One way to ensure the stack has data available to queue into the transmit buffer is to use large send buffers. Large send buffers (for example, 1MB send buffers) can help keep the stack's transmit buffer full of data to send. Using large send buffers is most helpful when you are using network interfaces that have large MTU sizes, like a Hipersockets network interface.
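To make that concrete, here is a minimal sketch of requesting a 1MB send buffer using generic BSD-style socket calls. This is an illustration only: the exact sockets API available to your z/VSE application, and the maximum value the stack will actually honor, depend on the stack and programming interface you are using.

    #include <sys/socket.h>
    #include <stdio.h>

    /* Ask the stack for a large (1MB) send buffer on an existing socket.
     * The stack may round or cap the value, so read it back to see what
     * was actually granted. */
    int set_large_send_buffer(int sock)
    {
        int sndbuf = 1024 * 1024;                  /* request 1MB            */
        socklen_t len = sizeof(sndbuf);

        if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF,
                       (char *)&sndbuf, sizeof(sndbuf)) < 0) {
            perror("setsockopt(SO_SNDBUF)");
            return -1;
        }
        if (getsockopt(sock, SOL_SOCKET, SO_SNDBUF,
                       (char *)&sndbuf, &len) == 0)
            printf("send buffer is now %d bytes\n", sndbuf);
        return 0;
    }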

Feeding your application is something the TCP/IP stack does. When data arrives from the network it is placed in the stack's receive buffer and the application is posted that data is available. The application must then issue a read for the data. If the application is running lower priority than the stack, it may be some time before the application actually gets to run to read and process the data. In fact, in the worst case, the stack's receive buffer may become full, forcing the stack to close the TCP window and stop the data flow from the remote host.
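Here is a hedged sketch of the receiving side, again using generic BSD-style calls rather than anything z/VSE-specific. The point is simply that the application should get back into recv() quickly so the stack's receive buffer can drain and the TCP window stays open; process_data() is a hypothetical placeholder.

    #include <sys/types.h>
    #include <sys/socket.h>

    /* Read whatever has arrived, process it briefly, and read again so
     * the stack's receive buffer can drain and the TCP window stays open. */
    long drain_socket(int sock, char *buf, size_t buflen)
    {
        long total = 0;
        for (;;) {
            ssize_t n = recv(sock, buf, buflen, 0);
            if (n == 0)
                return total;            /* peer closed the connection    */
            if (n < 0)
                return -1;               /* error (check errno)           */
            /* process_data(buf, n); */  /* keep this path short          */
            total += n;
        }
    }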

Wow, it sounds like I should run all my applications higher priority than the TCP/IP stack. No, not at all. In practice, only bulk data transfer applications run into these types of problems.

The general rule of running the stack higher priority than the applications it services applies to almost all applications. For example, interactive and multi-user applications like CICS TS, your TN3270E server and even DB2 actually benefit from having the TCP/IP stack running at a higher priority.

In addition, applications that are primarily sending data out into the network generally show little throughput increase by running higher priority than the TCP/IP stack. Keeping the stack's transmit buffer full is generally pretty easy even if your application is running lower priority than the TCP/IP stack. However, this assumes the application gets access to the CPU often enough to keep the stack's transmit buffer full.

What does this leave? Perhaps a batch FTP retrieving data from the network, your FTP server partition or the IBM VTAPE (TAPESRVR) partition might benefit from running higher priority than the TCP/IP stack partition.

So there you have it. You can make a case for running some applications higher priority than the TCP/IP stack.

z/VSE TCP/IP Throughput Rates

I receive a number of queries about TCP/IP transfer (or throughput) rates every year. So, I thought I would make some comments about the issue.

At a high level, the concept is simple. The sending application sends data to a receiving application. Great! So, how much data can we send and how fast?

Basically, we would like to send as much data as possible in every request. Sounds simple, so let's start there. I allocate a large buffer, perhaps 1MB, in my application, fill it with data and send it all to the TCP/IP stack in a single request.
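In practice a single send of a buffer that large may only be partially accepted at a time, so the application typically loops until the stack has taken everything. A minimal sketch with generic BSD-style calls (not a z/VSE-specific API):

    #include <sys/types.h>
    #include <sys/socket.h>

    /* Hand a large application buffer (say 1MB) to the stack.  send() may
     * queue only part of it into the transmit buffer on each call, so loop
     * until the stack has accepted all of it. */
    int send_all(int sock, const char *buf, size_t len)
    {
        size_t off = 0;
        while (off < len) {
            ssize_t n = send(sock, buf + off, len - off, 0);
            if (n < 0)
                return -1;               /* error (check errno)           */
            off += (size_t)n;            /* stack took another n bytes    */
        }
        return 0;
    }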

OK, the TCP/IP stack probably cannot handle all that data at once, so let's look at how the request is broken down. First, the application's data is queued into a transmit buffer within the TCP/IP stack partition. The TCP/IP transmit buffer is probably limited to 32K to 64K in size. So, the size of the transmit buffer is the 1st limitation.

Once the data has been queued into the transmit buffer, the stack can begin the process of creating a packet to transmit the data. The 2nd limitation is the MTU (Maximum Transmission Unit) of the network interface being used. On a typical Ethernet network this is probably 1500 bytes. If you have z/VSE connected to a Gb (Gigabit) network you can take advantage of jumbo Ethernet frames which have an MTU size of 9000 bytes. If you are using a Hipersocket interface then the MTU size can move up to as much as 64K.

Well, assuming a 32K transmit buffer and an OSA Express QDIO Gb network interface, the stack will take about 9000 bytes (less headers, etc.) from the transmit buffer to create a packet. But wait, can we really send 9000 bytes? Maybe. There are 2 more factors to consider.

The 3rd limitation is the MSS (Maximum Segment Size) negotiated by the local and remote hosts' TCP/IP stacks when the socket was created. For example, if the sending TCP/IP stack supports an 8K MSS and the receiving TCP/IP stack supports only a 1500 byte MSS ... Guess what? The 1500 byte MSS wins.

The 4th limitation is the amount of space (number of bytes) available in the remote host's TCP Receive Window. TCP uses a 64K window to manage data transmission. Up to 64K of data can be transmitted to the remote host without waiting for an acknowledgement. Each byte of data sent must be acknowledged by the remote host. When an ACK packet is sent, the sequence number of the last byte of data received and the size of the current TCP Receive Window are included. The sending TCP/IP stack cannot send more data than will fit into the currently advertised TCP Receive Window.

Wow. OK, we started with 1MB of data being sent to the stack, and 32K was queued into the transmit buffer. Now, of the 32K available in the transmit buffer, the amount of data sent in a single packet is the smallest of the MTU size, the Maximum Segment Size and the TCP Receive Window size.

As an example, if the MTU is 9000, the TCP Receive Window is 64K and the MSS is 1500, guess what? The amount of data sent in a single packet is 1500 bytes (less headers, etc.). Our 32K transmit buffer full of data will take around 22 packets to transmit.
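If you want to play with the arithmetic, here is a small back-of-the-envelope sketch. The 40-byte IPv4 plus TCP header size and treating the MSS as the per-packet payload are simplifying assumptions, but it reproduces the "around 22 packets" figure above.

    #include <stdio.h>

    /* Back-of-the-envelope: how many packets does it take to move one
     * transmit buffer's worth of data?  The 40-byte IPv4+TCP header and
     * treating the MSS as the per-packet payload are simplifications. */
    int main(void)
    {
        long mtu      = 9000;            /* jumbo-frame interface         */
        long mss      = 1500;            /* negotiated when connecting    */
        long window   = 64L * 1024;      /* remote TCP Receive Window     */
        long xmit_buf = 32L * 1024;      /* stack transmit buffer         */
        long hdrs     = 40;              /* IPv4 + TCP headers, no options*/

        long per_packet = mtu - hdrs;                 /* what the MTU allows  */
        if (mss < per_packet)    per_packet = mss;    /* MSS caps the segment */
        if (window < per_packet) per_packet = window; /* so does the window   */

        printf("payload per packet: %ld bytes\n", per_packet);
        printf("packets to drain a 32K transmit buffer: ~%ld\n",
               (xmit_buf + per_packet - 1) / per_packet);
        return 0;
    }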

As the packets are transmitted, the local TCP/IP stack is also watching for ACK packets from the remote host TCP/IP stack. The sooner the ACK packet arrives from the remote host, the sooner the local stack can begin or continue transferring data. The big factor here? Network speed and the number of hops needed to get the packet to its destination. In a word? Latency.

Having a high speed network is wonderful but latency can kill performance. You are always at the mercy of the slowest link.

I will discuss this more in another posting when I revisit using Hipersockets in z/VSE and consider some of the performance issues involved in optimizing throughput over a Hipersocket network interface.