Wednesday, August 8, 2018

z/VSE VTAPE Performance and Tuning


We recently had a customer contact us asking about z/VSE VTAPE throughput. 

They asked "When I FTP data to the machine running our Java VTAPE server I get much higher throughput than my backups using 
z/VSE VTAPE can achieve. Why?

In the process of answering this question, we ended up with a nice set of network tuning tips and tricks for z/VSE, which this article collects.


How z/VSE VTAPE works ...


You cannot get FTP-class throughput using z/VSE's VTAPE because of VTAPE's design. Based on Wireshark traces, we deduce that the Java VTAPE server and the TAPESRVR job always transfer tape data in 'chunks' (1MB by default), each followed by an application-level ACK or handshake sequence.

For example, when writing to a virtual tape (perhaps doing a LIBR backup), the process begins by reading a chunk (1MB by default) to examine and process the volume labels. This is followed by a series of chunks of tape data sent to the VTAPE server, with an application-level ACK or handshake sequence between each chunk. Most likely, the application-level ACK from the Java VTAPE server tells the TAPESRVR job in z/VSE that the transferred data was successfully written to disk. Each chunk therefore costs at least one extra round trip, plus the server's disk write time, before the next chunk can be sent, and it is largely the latency of this application-level handshake that makes VTAPE so much slower than a streaming FTP transfer.
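
To see how much that handshake can cost, here is a rough back-of-envelope sketch. The chunk size matches the VTAPE default; the wire speed and per-chunk handshake latency are purely assumed values for illustration.

# Illustrative only: throughput ceiling when every chunk must wait for an
# application-level ACK. 110 MB/sec wire speed and 8 ms of handshake
# latency per chunk are assumptions, not measured values.
awk 'BEGIN { chunk_mb = 1; wire_mb_per_sec = 110; ack_sec = 0.008
             xfer_sec = chunk_mb / wire_mb_per_sec
             printf "ceiling: %.1f MB/sec\n", chunk_mb / (xfer_sec + ack_sec) }'

An FTP transfer streams data continuously and never pays this per-chunk penalty, which is why it can run so much faster over the same network.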

Tuning Tips and Tricks ...



OK, maybe you cannot change the design of the VTAPE application, but what about tuning the environment for the best throughput?

Let's start by admitting that not all of these steps will improve throughput. Some may not even be necessary in your environment, but some may help, and may help a lot. All of these suggestions are about reducing latency in the VTAPE application or in your network. After all, latency is the bane of network throughput.


Linux vs. Windows


Let's start by recommending Linux over Windows for running the Java VTAPE server. Our experience shows the Java VTAPE server running under Linux to be faster than when using a Windows machine. Perhaps you are a 'Windows shop' and you have been looking for an opportunity to add a Linux machine to your data center ... Well, here is your chance.


Many customers are now beginning to use a Linux on Z image to run their Java VTAPE server, which is also a very good idea. In fact, with more and more customers using SSL/TLS for all data transfers, using a Linux on Z image for your Java VTAPE server is an excellent idea. Linux on Z using Hipersocket networking is very fast, and SSL/TLS is unnecessary since data transferred to and from z/VSE never leaves the Z box.



Java Interpret vs. Dynamic Compilation 

Most x86 (32-bit or 64-bit) machines will have a version of Java installed. This version of Java will likely be running in mixed mode. Mixed mode indicates that dynamic compilation of Java bytecodes is available and will be used. If your Java installation is running in interpreted mode, this can cause big slowdowns in throughput and high CPU usage.

The java -version command allows you to verify this.


zPDT3:/tmp # java -version
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (IcedTea 3.6.0) (build 1.8.0_151-b12 suse-18.1-x86_64)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)



Recently, however, after installing SLES 12 SP2, I found the following Java packages installed ...

  SLES 12 SP2 
   java-1_8_0-ibm
   java-1_8_0-openjdk
   java-1_8_0-openjdk-headless 

  All of the above are installed by default.

The java -version command displayed this output ...

jcb@sles12sp2:~> java -version
openjdk version "1.8.0_101"
OpenJDK Runtime Environment (IcedTea 3.1.0) (suse-14.3-s390x)
OpenJDK 64-Bit Zero VM (build 25.101-b13, interpreted mode) 


The java-1_8_0-openjdk package was being used and was running in interpreted mode. This was causing throughput problems and very high CPU usage.

Using YaST to remove the java-1_8_0-openjdk *and* java-1_8_0-openjdk-headless packages resulted in the SLES 12 SP2 image using the java-1_8_0-ibm package.

jcb@sles12sp2:~> java -version
java version "1.8.0"
Java(TM) SE Runtime Environment (build pxz6480sr3-20160428_01(SR3))
IBM J9 VM (build 2.8, JRE 1.8.0 Linux s390x-64 Compressed References 20160427_301573 (JIT enabled, AOT enabled)
J9VM - R28_Java8_SR3_20160427_1620_B301573
JIT  - tr.r14.java.green_20160329_114288
GC   - R28_Java8_SR3_20160427_1620_B301573_CMPRSS
J9CL - 20160427_301573)
JCL - 20160421_01 based on Oracle jdk8u91-b14


The Java VTAPE server now runs compiled and is much faster with far lower CPU usage. Testing showed a big increase in throughput, with VTAPE backup job durations reduced by 50%.
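
If you prefer the command line to YaST, something like the following should work on SLES (the package names are the ones shown above; check your own installed package list first):

# Remove the interpreted-only OpenJDK packages so the IBM JVM is picked up
zypper remove java-1_8_0-openjdk java-1_8_0-openjdk-headless
# Confirm the remaining JVM runs with the JIT (look for "JIT enabled" or
# "mixed mode" rather than "interpreted mode")
java -version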

MTU Sizing


Most networks use the standard Ethernet MTU size of 1500 bytes. However, with the advent of Gigabit networking, jumbo Ethernet frames appeared. Jumbo Ethernet frames can contain up to 9000 bytes of data. z/VSE's IJBOSA OSA Express driver, provided by IBM, supports Ethernet frames up to 9000 bytes and a Hipersocket MFS (Maximum Frame Size) up to 64K. I have found that Linux systems do not like an MTU size of 9000. However, Linux is very happy with an Ethernet MTU of 8960 (8K plus 768 bytes) and a Hipersocket MTU size of 56K. IPv6/VSE, from BSI, also supports these MTU sizes.

Since jumbo Ethernet frames contain roughly 6x as much data as a standard Ethernet frame, throughput can improve dramatically. Hipersocket MTUs of 56K carry about 39x more data per frame.


Ethernet

If your z/VSE system's OSA Express adapter is connected to a Gigabit switch and the machine running the Java VTAPE server is connected to the same switch, then using jumbo Ethernet frames may be possible. Verify that both the switch and the NIC used by the Java VTAPE machine support jumbo Ethernet frames. If you are using managed Gigabit Ethernet switches, remember that the frame size can often be configured per switch port.
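
As a sketch of how to check and test this on a Linux VTAPE server (eth0 is a placeholder for your actual interface name, and a change made with ip is not persistent across reboots):

# Show the current MTU of the NIC used by the Java VTAPE server
ip link show eth0
# Temporarily set a jumbo MTU of 8960 for testing
ip link set dev eth0 mtu 8960
# Verify an 8960-byte frame reaches z/VSE without fragmentation
# (payload size = MTU minus 28 bytes of IP/ICMP headers)
ping -M do -s 8932 <z/VSE IP address>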

Throughput using jumbo Ethernet frames can be much higher.



Hipersockets

Based on customer information, I have been told that CSI's TCP/IP for VSE uses 32K MTUs on Hipersocket links. In that case, defining the Hipersocket link with an MFS of 40K may help throughput. Using a z/VSE MTU of 32K with a Hipersocket MFS of 64K has been shown to cause slowdowns, likely because Hipersockets is forced to transfer 64K frames when only 32K of data is being sent.

BSI's IPv6/VSE supports maximum-size (64K) MFS definitions for Hipersocket links. Customers have reported Gigabit+ throughput using Hipersocket links.


Linux on Z will choose an MTU size just less than the MFS size; with a 64K MFS, Linux will use a 56K MTU. IPv6/VSE will also use a 56K MTU.
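
On Linux on Z you can check what the qeth layer is actually using; lsqeth comes with the s390-tools package, and hsi0 is just a typical Hipersocket interface name (yours may differ):

# List qeth devices with their card type and attributes
lsqeth
# Show the MTU in use on the Hipersocket interface
ip link show hsi0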


For example, this BSTTFTPC batch FTP job transferred a 100MB file in 0.673 seconds.

BSTT023I   100M BYTES IN  0.673 SECS. RATE   152M/SEC 

Remember, Hipersockets is a CPU function. So, limiting the available CPU on any machine using Hipersockets can, and likely will, have an effect on throughput.



TCP Window Scaling


TCP window scaling allows TCP sockets to use larger windows of data. The TCP window is the maximum amount of data that can be sent without being acknowledged by the remote host. CSI's TCP/IP for VSE, I have been told, supports only standard 64K fixed-size TCP windows. Our IPv6/VSE product fully supports TCP window scaling. IPv6/VSE's large TCP windows range from 1MB to 8MB in size. This allows IPv6/VSE to send far more data without waiting for the remote host to send an acknowledgement. Since TCP acknowledgements are cumulative, a remote host can acknowledge a large amount of received data with a single acknowledgement.

IPv6/VSE's TCP window scaling can dramatically increase throughput. Reductions in run times of data transfer jobs can be as high as 95%.
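
Two quick checks on the Linux side of the connection may be worthwhile here. TCP throughput is roughly limited to the window size divided by the round-trip time, so a fixed 64K window becomes the bottleneck as soon as there is any meaningful latency; also make sure window scaling has not been disabled on the VTAPE server (it is on by default in modern Linux kernels). A small sketch, with a 10 ms RTT assumed purely for illustration:

# Throughput ceiling for a fixed 64K window at a 10 ms round-trip time
awk 'BEGIN { printf "64K window ceiling: %.1f MB/sec\n", 65536 / 1000000 / 0.010 }'
# Confirm TCP window scaling is enabled (1) on the Linux VTAPE server
sysctl net.ipv4.tcp_window_scaling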

Device and Cache Sizing



The Java VTAPE server reads and writes virtual tape data from and to file system files. These files may reside on a local SSD or hard drive, or on remote storage such as a SAN or Samba share. The speed at which the Java VTAPE server can read and write this file data makes a difference in VTAPE throughput, so the faster the Java VTAPE server can access the device, the better.
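
A quick way to sanity-check the write speed of the file system holding the virtual tape files is a direct-I/O dd test. The path below is a placeholder for the directory your VTAPE files live in; direct I/O requires a file system that supports it, and remember to delete the test file afterwards.

# Write 1GB with the page cache bypassed to see raw device throughput
dd if=/dev/zero of=/vtape/dd-test.tmp bs=1M count=1024 oflag=direct
rm /vtape/dd-test.tmp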


In addition, sizing the amount of memory available to the Linux or Windows system will help. Memory not actively being used by applications is generally available, on both Linux and Windows, for caching file data.


Ensuring the Linux or Windows system has plenty of memory for caching file data, and using a fast local storage device, can really help VTAPE throughput. x86 Intel Linux and Windows images generally have 4GB (or more) of memory these days. Linux on Z, however, can require some additional calculation to optimize the amount of memory made available.


As a starting point for Linux on Z images, try 1GB plus the average size of your virtual tape files. For example, if your VTAPE files average 1GB in size, try starting with 1GB + 1GB = 2GB of Linux on Z system memory. Tune up or down from there.
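
You can see how much memory is actually being used for file caching with free; the buff/cache column (or the buffers/cached columns on older versions) shows the memory the kernel is using to cache file data such as the virtual tape files:

# Check total, used, and cache memory on the VTAPE server
free -h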


From the z/VSE TAPESRVR side, the utility you are using to create the virtual output tape is important. For example, trying to improve the performance of a LIBR BACKUP job can be very difficult. Why? z/VSE libraries have fixed-length records of 1024 (1K) bytes, and while the LIBR program can read multiple blocks in a single I/O, it only does so when the blocks are contiguous. Accessing a z/VSE library is a very I/O-intensive process. Using FCOPY or IDCAMS backup facilities can provide much faster access to data on disk. Third-party backup and restore utilities are optimized for the best disk access performance but may still have options for improving read or write access. In-house applications (e.g., COBOL) should be reviewed for ways to improve read or write access. Specifying VSAM buffer space/counts in the job's JCL can have a dramatic impact on performance and improve throughput.


Linux Sockets


The Linux system configuration file /etc/sysctl.conf is used to modify various system defaults. On a Linux system hosting the Java VTAPE server, changing the system defaults for the TCP window scaling buffer sizes may improve performance.


net.core.rmem_default = 4096000
net.core.rmem_max = 16777216
net.core.wmem_default = 4096000
net.core.wmem_max = 16777216 


Modifications made to /etc/sysctl.conf can be activated by restarting the Linux image or by using the sysctl -p /etc/sysctl.conf command.
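
After activating the changes, it may be worth confirming they took effect, for example:

# Display the current socket buffer limits
sysctl net.core.rmem_max net.core.wmem_max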

z/VSE VTAPE Buffer Size 



The default 'chunk' size used by VTAPE on z/VSE systems is 1MB. This size can be changed using the SIR VTAPEBUF=nnM command. The minimum is 1M and the maximum is 15M. I do not believe there is any official documentation on this command; it is, however, referenced indirectly in the z/VSE Hints and Tips manual.


SIR VTAPEBUF=1M (default)


By default, under z/VSE 5.1+, IPv6/VSE uses 4M TCP windows. Try ...

SIR VTAPEBUF=4M


Additional performance may be achieved by adding the SHIFT 7 command to the BSTTINET/BSTT6NET stack startup commands under z/VSE 5.1+. This results in using 8M TCP windows. In this case, try ...

SIR VTAPEBUF=8M 


Remember, increasing the VTAPEBUF size also increases the amount of 31-bit System GETVIS required by the TAPESRVR partition.




VTAPE Compression

Using the file suffix .zaws invokes zip-style data compression within the TAPESRVR partition. While this option does reduce the size of the virtual tape files, it also increases the amount of CPU used by the TAPESRVR partition by a factor of 2x to 3x. It will also reduce the throughput of the VTAPE transfer.

If you do want to use compressed VTAPE files, ensure you have plenty of CPU available on the machine running the TAPESRVR job. And using the fastest CPU you have available to run the Java VTAPE server will help too.

QDIO or Hipersocket Queue Buffers


Linux on Z QDIO Ethernet or Hipersocket buffers can be changed, but this is usually not necessary. On our SLES 12 SP2 image ...


sles12sp2:/sys/bus/ccwgroup/drivers/qeth/0.0.0360 # cat buffer_count
128
sles12sp2:/sys/bus/ccwgroup/drivers/qeth/0.0.0360 # cat inbuf_size
64k


128 x 64K buffers is plenty.

On z/VSE the number of QDIO/Hipersocket input buffers has been configurable since z/VSE 5.1, and the number of output buffers since z/VSE 6.1. See the TCP/IP Support manual for more details.

Additional input/output queue buffers may improve TCP/IP performance.

The input/output queue buffers can be configured in the IJBOCONF phase. You may use the skeleton SKOSACFG in ICCF library 59 to configure the input/output queue buffers. The skeleton also describes the syntax of the configuration statements.


The z/VSE default is 8 x 64K input buffers and 8 x 64K output buffers. This default amount requires 1MB of page fixed 31-bit partition GETVIS in the IPv6/VSE BSTTINET/BSTT6NET stack partition.

Each additional 8 x 64K input and 8 x 64K output buffers requires 1MB of additional page fixed 31-bit partition GETVIS. So, if you change the default buffer counts, remember to change your // SETPFIX statement in your stack partition JCL.

For z/VSE, increasing the input/output buffer count to 16 or 32 might be useful (roughly 2MB or 4MB of page fixed GETVIS for the buffers, respectively).


Linux Fast Path (LFP)


For most workloads the default parameters used during LFP startup should be fine. However, there are a couple of values to monitor.

INITIALBUFFERSPACE = 512K
MAXBUFFERSPACE     = 4M
IUCVMSGLIMIT       = 1024


These values can be monitored by running // EXEC IJBLFPOP,PARM='INFO nn'

*** BUFFER MANAGER ***                          
  CURRENTLY USED MEMORY .......... : 524,160    
  INITIAL MEMORY SIZE ............ : 524,288    
  MAXIMUM MEMORY SIZE ............ : 4,194,304  


If CURRENTLY USED MEMORY is close to MAXIMUM MEMORY SIZE, then you should probably increase the MAXBUFFERSPACE setting. CURRENTLY USED MEMORY may grow over time (up to MAXBUFFERSPACE) as tasks require buffers, depending on the socket workload. LFP allocates more buffers as needed (if still below MAXBUFFERSPACE) but never shrinks the buffer space once allocated. All LFP buffer storage is allocated from 31-bit System GETVIS.

When LFP is low on buffers, this can cause delays because tasks are put into a wait until buffers become available again from other tasks, or, in the worst case, socket calls fail because no more buffers are available.

If lots of concurrent tasks use LFP, you might also watch these lines ...

TASKS WAITING FOR MSGLIMIT ..... : 0
TIMES IN WAIT FOR MSGLIMIT ..... : 0
IUCV MSGLIMIT EXCEEDED ......... : 0


If you see TASKS WAITING FOR MSGLIMIT, increase the IUCV message limit.
Remember, this will likely increase buffer usage so you might want to increase buffer space as well.

LFP Tuning Tips courtesy of Ingo Frantzki at the z/VSE laboratory. 

Summary


As with all performance and tuning work, will all of these tips and tricks help your VTAPE throughput? It depends; your mileage will vary.

Remember to make one change at a time and evaluate the results before making more changes. Be prepared to 'back out' any change if the testing results are not satisfactory.

Well, there you have it. If anyone reading this has thoughts or suggestions, just send me an email (jeff@bsiopti.com) and I will incorporate the information into this article.


Jeff Barnard

Networking and Security
Barnard Software, Inc.