Monday, October 11, 2010

Optimizing Hipersockets in z/VSE

Let's review a bit about the z/VSE Hipersockets network interface ...

Hipersockets is a synchronous method of transferring data. A send does not complete until the data has been received by the destination. This is a big factor. If the destination system is busy, throughput can suffer badly.

Hipersockets is very much a CPU function. In general, the faster the CPU, the faster the Hipersocket interface can operate. Limiting CPU on an LPAR or virtual machine (VM) limits throughput.

Are all Hipersocket hosts using the same large MTU? The maximum MTU size of a Hipersockets network interface is determined by the frame size specified in the IOCP Configuration of the IQD CHPID through the OS parameter.

OS Frame Size vs. Maximum MTU:
OS = 00 : MTU =  8KB
OS = 40 : MTU = 16KB
OS = 80 : MTU = 32KB
OS = C0 : MTU = 56KB


Choose the OS parameter carefully. A full frame of the size you choose is moved on every transfer, even if there is only 1 byte to send.

And, now to continue ...

If you read my last posting about TCP/IP throughput rates you know that the first limitation to throughput is the size of the TCP/IP stack's transmit buffer. BSI customers can adjust this value using the SBSIZE stack startup parameter. Setting the SBSIZE to 65024 (SBSIZE 65024, the maximum value) may help throughput using a Hipersocket network interface. Warning: Changing this value may actually reduce throughput. Your mileage will vary and testing is required to optimize this value. Do not change this value without contacting BSI first.

Set the MTU size of the Hipersocket network interface to 57216. This is the maximum value allowed and is specified on the LINK statement in the stack startup commands.

Now we have ...
A 64K transmit buffer
A 56K MTU

The TCP Receive Window and Maximum Segment Size are not directly controllable by customers. However, when using Hipersockets, the TCP Receive Window should be 64K and the MSS will be slightly less than the MTU size. All of this provides the framework for having the maximum possible data sent in each packet.
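To put a number on "slightly less than the MTU", here is a minimal sketch. It assumes a 20-byte IPv4 header and a 20-byte TCP header with no options; the function name is made up for this example and is not part of any stack.

```python
# Assumption: 20-byte IPv4 header and 20-byte TCP header, no options.
IPV4_HEADER = 20
TCP_HEADER = 20


def mss_from_mtu(mtu: int) -> int:
    """Largest TCP payload that fits in one packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


print(mss_from_mtu(57216))  # 57176
```

Under those assumptions, a 57216 byte MTU leaves room for 57176 bytes of data in each packet.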

In my posting on 'Care and Feeding of Your z/VSE TCP/IP Stack' I talk about keeping the stack fed. You keep the stack fed by making sure that when it wants data to send, there is data in the transmit buffer. One way to do this is to use large application buffers (for example 1MB in size). Fill a large buffer with data and issue a single socket send request to transfer the data. When you are using a Hipersocket network interface with a 64K transmit buffer and a 56K MTU, using a 64K buffer to send data into the stack is not as efficient as using a 1M send buffer.
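The large buffer idea can be sketched as follows. This is an illustration only, written against Python's generic socket interface (sendall), not the z/VSE socket API. The names send_stream and CountingSocket are hypothetical, and CountingSocket is just a stand-in so the sketch runs without a network.

```python
import io

ONE_MB = 1024 * 1024


def send_stream(sock, stream, buf_size=ONE_MB):
    """Send everything in `stream`, one socket send request per buffer.

    Returns the number of send requests issued.
    """
    requests = 0
    while True:
        chunk = stream.read(buf_size)
        if not chunk:
            break
        sock.sendall(chunk)  # one request hands the whole buffer to the stack
        requests += 1
    return requests


class CountingSocket:
    """Stand-in for a connected socket (hypothetical, for illustration)."""

    def __init__(self):
        self.sent = 0

    def sendall(self, data):
        self.sent += len(data)


data = io.BytesIO(bytes(4 * ONE_MB))
print(send_stream(CountingSocket(), data))             # 4
data.seek(0)
print(send_stream(CountingSocket(), data, 64 * 1024))  # 64
```

Sending the same 4MB takes 4 requests with 1MB buffers and 64 requests with 64K buffers. Fewer requests means fewer chances for the transmit buffer to run dry while the application waits to be dispatched.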

When you are using the BSI BSTTFTPC batch FTP application, you can tell BSTTFTPC to use large buffer socket send operations by specifying
// SETPARM MAXBUF=YES in the BSTTFTPC JCL.

Are there any other limitations on throughput? Maybe.

The IBM IJBOSA driver is used to access the Hipersocket network interface. Is IJBOSA a limitation to performance? In general, no. IJBOSA's design is good. However, understanding some of the workings of the IJBOSA routine can help you to understand the overall Hipersocket performance picture.

The TCP/IP stack can send multiple packets into the IJBOSA routine in a single call. Currently the BSI TCP/IP stack supports sending up to 8 packets in a single call. Why was the number 8 chosen? Because IJBOSA provides for a default of 8 buffers (I know of no way to change the number of IJBOSA buffers used).

If the stack attempts to send packets to IJBOSA and all of its buffers are in use, the IJBOSA routine returns a return code indicating the Hipersocket device is busy. At this point the TCP/IP stack must wait and send the packets again. The BSI TCP/IP stack does this by simply dropping the packets and allowing normal TCP retransmission to resend the packets in 10ms to 20ms. This retransmission delay allows the buffer busy condition to clear. You can see that increasing the number of IJBOSA buffers may help throughput by reducing retransmission delays. However, this is only true if the machine receiving the data does so in an efficient and timely fashion. If the receiving machine is busy for some reason, increasing the number of IJBOSA buffers available will not help transfer rates.

Because TCP throughput is limited by the size of the TCP Receive Window (64K), only 2 buffers are likely being used for each bulk data socket. IJBOSA's 8 buffers provide for 4 high speed bulk data transfer sockets operating at the same time. I suspect that for most customers, 8 buffers is plenty.
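The buffer arithmetic can be sketched like this. It assumes each packet in flight occupies one IJBOSA buffer, a 65535 byte receive window, and an MSS of 57216 less 40 bytes of IP and TCP headers. The function names are made up for the illustration.

```python
import math


def buffers_per_socket(window: int, mss: int) -> int:
    """Packets needed to fill one receive window (one buffer per packet)."""
    return math.ceil(window / mss)


def concurrent_bulk_sockets(total_buffers: int, window: int, mss: int) -> int:
    """How many bulk sockets can fill their windows at the same time."""
    return total_buffers // buffers_per_socket(window, mss)


WINDOW = 65535     # assumed 64K TCP Receive Window
MSS = 57216 - 40   # assumed: MTU less IPv4 and TCP headers

print(buffers_per_socket(WINDOW, MSS))          # 2
print(concurrent_bulk_sockets(8, WINDOW, MSS))  # 4
```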

Still, the question remains, how can I tell if Hipersocket busy conditions are a problem for me?

Well, first, look for this message on the z/VSE console ...

R1 0497 0S39I ERROR DURING OSA EXPRESS PROCESSING,REASON=0046 CUU=xxxx
This message is output by the IJBOSA routine the first time a busy condition is encountered. Remember, this message is output only once and will let you know a busy condition occurred.

Since the BSI TCP/IP stack tracks TCP retransmission activity, the TCP retransmit counter can be used as a proxy for determining the number of Hipersocket busy conditions. To display the TCP/IP stack statistics you can use the IP LOGSTATS command.


For example, 


- MSG BSTTINET,D=IP LOGSTATS
- Run your job that uses the Hipersocket network interface
- MSG BSTTINET,D=IP LOGSTATS
- MSG BSTTINET,D=SEGMENT * $$ LST CLASS=...

Look in the BSTTINET SYSLST log output and locate these messages ...

From the 1st IP LOGSTATS command ...
01-Oct-2010 10:33:15 F6 BSTT613I   TcpOutSegs:          0
01-Oct-2010 10:33:15 F6 BSTT613I   TcpRetransSegs:     0

From the 2nd IP LOGSTATS command ...
01-Oct-2010 10:33:15 F6 BSTT613I   TcpOutSegs:    1563218
01-Oct-2010 10:33:15 F6 BSTT613I   TcpRetransSegs:   4493
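If you would rather not compare the counters by eye, a small script can pull them out of the SYSLST output and subtract. This is a sketch that assumes the BSTT613I lines look exactly like the ones shown above; the regular expression and function names are mine.

```python
import re

# Matches lines such as "... BSTT613I   TcpOutSegs:    1563218"
COUNTER = re.compile(r"BSTT613I\s+(\w+):\s+(\d+)")


def parse_counters(log_text):
    """Return {counter name: value} for every BSTT613I line found."""
    return {name: int(value) for name, value in COUNTER.findall(log_text)}


def counter_deltas(before, after):
    """Change in each counter between two IP LOGSTATS snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}


first = """\
01-Oct-2010 10:33:15 F6 BSTT613I   TcpOutSegs:          0
01-Oct-2010 10:33:15 F6 BSTT613I   TcpRetransSegs:     0
"""
second = """\
01-Oct-2010 10:33:15 F6 BSTT613I   TcpOutSegs:    1563218
01-Oct-2010 10:33:15 F6 BSTT613I   TcpRetransSegs:   4493
"""

print(counter_deltas(parse_counters(first), parse_counters(second)))
# {'TcpOutSegs': 1563218, 'TcpRetransSegs': 4493}
```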


The TCP retransmission rate is about 0.29%. Any rate under 1% is probably acceptable. Having 4493 segments retransmitted, at 10ms to 20ms each, results in a total delay of between about 45 and 90 seconds. This probably sounds like a lot of time but to achieve this level of retransmission activity I had to run 4 batch FTP jobs concurrently, each running 18 minutes. Eliminating the retransmissions would reduce the run time of each batch FTP by only about 15 to 20 seconds.
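The arithmetic behind those figures, as a short sketch. It assumes every retransmitted segment costs one full retransmission delay of 10ms to 20ms; the function names are made up for the example.

```python
def retransmit_rate(retrans_segs: int, out_segs: int) -> float:
    """Fraction of transmitted segments that were retransmissions."""
    return retrans_segs / out_segs


def retransmit_delay_seconds(retrans_segs: int, delay_ms: float) -> float:
    """Total delay if every retransmission costs delay_ms milliseconds."""
    return retrans_segs * delay_ms / 1000


rate = retransmit_rate(4493, 1563218)
low = retransmit_delay_seconds(4493, 10)
high = retransmit_delay_seconds(4493, 20)

print(f"rate under 1%: {rate < 0.01}")               # rate under 1%: True
print(f"total delay: {low:.0f} to {high:.0f} seconds")  # total delay: 45 to 90 seconds
```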




Well, there you have it, an optimized Hipersockets network interface.

Friday, September 24, 2010

The Care and Feeding of Your z/VSE TCP/IP Stack

I recently suggested to a customer that they might get better throughput with their application by changing the priority so that the application was running higher priority than the TCP/IP stack partition. I received a questioning response. Really? Doesn't the stack always run higher priority than the applications it services? My answer was, nice rule but not always true.

By way of explanation, we can look at throughput as the care and feeding of your application or of the TCP/IP stack.

Feeding the TCP/IP stack is something you do when your application is sending data. Your application feeds the TCP/IP stack data. The data is queued into the stack's transmit buffer. Keeping the transmit buffer full makes sure the stack always has data to transmit. Once the stack has taken all the data sent and queued it into the transmit buffer, the application's send request is posted.

You can see that if the stack is busy and your application is running lower priority than the stack, your application may not get dispatched to send more data until a later time. It is possible that the stack will transmit all the data in the transmit buffer before your application can send more data. This leaves the transmit buffer empty and can reduce throughput.

One way to ensure the stack has data available to queue into the transmit buffer is to use large send buffers. Large send buffers (for example, 1MB send buffers) can help keep the stack's transmit buffer full of data to send. Using large send buffers is most helpful when you are using network interfaces that have large MTU sizes, like a Hipersockets network interface.

Feeding your application is something the TCP/IP stack does. When data arrives from the network it is placed in the stack's receive buffer and the application is posted that data is available. The application must then issue a read for the data. If the application is running lower priority than the stack it may be some time before the application actually gets to run to read and process the data. In fact, in the worst case, the stack's receive buffer may actually become full forcing the stack to close the TCP window and stopping the data flow from the remote host.
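Here is a toy model of how a slow reader closes the window. It is only a sketch: the ReceiveBuffer class is hypothetical, the 65535 byte capacity is an assumption, and a real stack does far more than this.

```python
class ReceiveBuffer:
    """Toy receive buffer: the advertised window is whatever space is left."""

    def __init__(self, capacity: int = 65535):  # assumed 64K buffer
        self.capacity = capacity
        self.queued = 0  # bytes received but not yet read by the application

    @property
    def window(self) -> int:
        return self.capacity - self.queued

    def data_arrives(self, nbytes: int) -> int:
        """Queue arriving data; returns how many bytes actually fit."""
        accepted = min(nbytes, self.window)
        self.queued += accepted
        return accepted

    def application_reads(self, nbytes: int) -> int:
        """The application reads data, which reopens the window."""
        taken = min(nbytes, self.queued)
        self.queued -= taken
        return taken


buf = ReceiveBuffer()
buf.data_arrives(40000)
buf.data_arrives(40000)       # only 25535 bytes fit
print(buf.window)             # 0, the remote host must stop sending
buf.application_reads(32768)  # the application finally gets dispatched
print(buf.window)             # 32768
```

The longer the application waits to be dispatched, the longer the window stays at zero and the longer the remote host sits idle.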

Wow, it sounds like I should run all my applications higher priority than the TCP/IP stack. No, not at all. In practice, only bulk data transfer applications run into these types of problems.

The general rule of running the stack higher priority than the applications it services would apply to almost all applications. For example, interactive and multi-user applications like CICS TS, your TN3270E server and even DB2 actually benefit from having the TCP/IP stack running at a higher priority.

In addition, applications that are primarily sending data out into the network generally show little throughput increase by running higher priority than the TCP/IP stack. Keeping the stack's transmit buffer full is generally pretty easy even if your application is running lower priority than the TCP/IP stack. However, this assumes that the application has access to the CPU when it is needed to allow it to keep the stack's transmit buffer full.

What does this leave? Perhaps a batch FTP retrieving data from the network, your FTP server partition or the IBM VTAPE (TAPESRVR) partition might benefit from running higher priority than the TCP/IP stack partition.

So there you have it. You can make a case for running some applications higher priority than the TCP/IP stack.

z/VSE TCP/IP Throughput Rates

I receive a number of queries about TCP/IP transfer (or throughput) rates every year. So, I thought I would make some comments about the issue.

At a high level, the concept is simple. The sending application sends data to a receiving application. Great! So, how much data can we send and how fast?

Basically, we would like to send as much data as possible in every request. Sounds simple, so let's start there. I allocate a large buffer, perhaps 1MB, in my application, fill it with data and send it all to the TCP/IP stack in a single request.

OK, the TCP/IP stack probably cannot handle all that data at once, so let's look at how the request is broken down. First, the application's data is queued into a transmit buffer within the TCP/IP stack partition. The TCP/IP transmit buffer is probably limited to 32K to 64K in size. So, the size of the transmit buffer is the 1st limitation.

Once the data has been queued into the transmit buffer, the stack can begin the process of creating a packet to transmit the data. The 2nd limitation is the MTU (Maximum Transmission Unit) of the network interface being used. On a typical Ethernet network this is probably 1500 bytes. If you have z/VSE connected to a Gb (Gigabit) network you can take advantage of jumbo Ethernet frames which have an MTU size of 9000 bytes. If you are using a Hipersocket interface then the MTU size can move up to as much as 64K.

Well, assuming a 32K transmit buffer and an OSA Express QDIO Gb network interface, the stack will take about 9000 bytes (less headers, etc.) from the transmit buffer to create a packet. But, wait, can we really send 9000 bytes? Maybe. There are 2 more factors to consider.

The 3rd limitation is the MSS (Maximum Segment Size) negotiated by the local and remote host's TCP/IP stack when the socket was created. For example, if the sending TCP/IP stack supports an 8K MSS and the receiving TCP/IP stack supports only a 1500 byte MSS ... Guess what? The 1500 byte MSS wins.

The 4th limitation is the amount of space (number of bytes) available in the remote host's TCP Receive Window. TCP uses a 64K window to manage data transmission. Up to 64K of data can be transmitted to the remote host without waiting for an acknowledgement. Each byte of data sent must be acknowledged by the remote host. When an ACK packet is sent, the sequence number of the last byte of data received and the size of the current TCP Receive Window are included. The sending TCP/IP stack cannot send more data than will fit into the currently advertised TCP Receive Window.

Wow. OK, we started with 1MB of data being sent to the stack, 32K was queued into the transmit buffer. Now, of the 32K available in the transmit buffer the amount of data sent in a single packet is the smaller of the MTU size, Maximum Segment Size and the TCP Receive Window size.

As an example, if the MTU is 9000, TCP Receive Window 64K and MSS 1500, guess what? The amount of data sent in a single packet is 1500 bytes (less headers, etc.). Our 32K transmit buffer full of data will take around 22 packets to transmit.
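The "smallest limit wins" rule can be sketched in a few lines. This follows the simplified model used in this post, where the winning size is then reduced by headers, and it assumes 40 bytes of IPv4 and TCP headers with no options. The function names are made up for the example.

```python
import math

HEADERS = 40  # assumed: 20-byte IPv4 header + 20-byte TCP header, no options


def payload_per_packet(mtu: int, mss: int, window: int) -> int:
    """Data bytes per packet: the smallest limit wins, less headers."""
    return min(mtu, mss, window) - HEADERS


def packets_needed(nbytes: int, mtu: int, mss: int, window: int) -> int:
    """Packets required to drain nbytes from the transmit buffer."""
    return math.ceil(nbytes / payload_per_packet(mtu, mss, window))


print(payload_per_packet(9000, 1500, 65535))         # 1460
print(packets_needed(32 * 1024, 9000, 1500, 65535))  # 23
```

With 40 bytes of headers the count comes out at 23 packets, in line with the rough figure above. The exact number depends on the header sizes you assume.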

As the packets are transmitted, the local TCP/IP stack is also watching for ACK packets from the remote host TCP/IP stack. The sooner the ACK packet arrives from the remote host, the sooner the local stack can begin or continue transferring data. The big factor here? Network speed and the number of hops needed to get the packet to its destination. In a word? Latency.

Having a high speed network is wonderful but latency can kill performance. You are always at the mercy of the slowest link.
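To see why latency matters so much, remember that a sender can have at most one window of data outstanding per round trip. That gives a simple upper bound of window size divided by round trip time. The sketch below uses that bound; the round trip times are made-up examples, not measurements, and real throughput will be lower.

```python
def max_throughput_bytes_per_sec(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput: one full window per round trip."""
    return window_bytes / rtt_seconds


WINDOW = 65535  # assumed 64K TCP Receive Window

# Hypothetical round trip times, in milliseconds
for rtt_ms in (0.1, 1, 10, 50):
    rate = max_throughput_bytes_per_sec(WINDOW, rtt_ms / 1000)
    print(f"RTT {rtt_ms} ms: at most {rate / 1_000_000:.1f} MB/s")
```

Every tenfold increase in round trip time cuts the ceiling by a factor of ten, no matter how fast the links are.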

I will discuss this more in another posting when I revisit using Hipersockets in z/VSE and consider some of the performance issues involved in optimizing throughput over a Hipersocket network interface.