Sunday, March 20, 2016

CICS TS Workload Management Using OPTI-WORKLOAD

Workload Management

Recently we had a customer request that we update our OPTI-WORKLOAD product to support the latest versions of z/VSE and CICS TS. The updates for supporting the latest z/VSE version was fairly easy and the customer reports that the z/VSE batch workload management works very well on their systems. The customer has a very large number of batch jobs that run throughout the day (and night). Using OPTI-WORKLOAD's facilities allowed them to have multiple balance groups and to have each group balanced based on the resource usage of each job.

However, updating the CICS/VSE Workload Manager to support CICS TS proved to be more difficult. And, in the process, we learned a bit so we thought we would pass on the information.

Before diving into CICS TS Workload Management I will review some of the basic features of Barnard Software, Inc.'s OPTI-WORKLOAD product.

VSE/ESA and z/VSE Balancing

The priority of the partitions/dynamic classes can be set and changed by using the AR PRTY command. When a dynamic class contains more than one allocated dynamic partition, the partitions within the dynamic class are balanced (time sliced). The time slice value can be modified via the MSECS command.

Partition Balancing

The partition balancing routine inspects the CPU time for each partition of the balancing group and decides how to rearrange the priority of the partitions. VSE/ESA 2.1 (and newer) running the Turbo dispatcher offers a new balancing algorithm. If partition balancing is specified for static and dynamic classes (via equal signs in the PRTY command), static and dynamic partitions will receive the same time slice. Without the Turbo dispatcher a dynamic class (with all its partitions) got the same time slice as a static partition.

OPTI-WORKLOAD Workload Management


OPTI-WORKLOAD introduces the concept of performance groups to VSE/ESA 1.3 (or higher) systems. Performance groups are a group of static partitions and/or dynamic classes with similar performance goals. Once a performance group has been identified, OPTI-WORKLOAD monitors the velocity of work being done by each partition and class. At specified intervals the partition or class achieving the lowest velocity of work is given the highest dispatching priority of those partitions and classes in the performance group. This results in a more equitable allocation of resources and greater overall system throughput.

Velocity of Work

Velocity of work is calculated by monitoring the number of times the partition is actively using the CPU, actively waiting on I/O, delayed waiting on the CPU and delayed waiting on other resources. Velocity is expressed as the percentage of the time a partition has the resources that it needs available to it. Velocity is not a measure of the quality of the work being done, it is simply a measure of how much of the time resources needed by a partition are available to it.

CICS Workload Management


OPTI-WORKLOAD provides a special CICS Workload Management feature. The CICS Workload Manager uses Velocity calculations to manage CICS transaction priorities. Proper CICS Workload Management results in dramatic improvements in CICS response times and throughput.

The CICS Workload Manager uses Velocity calculations to increase the priority of CICS transactions with low velocity-of-work (or transactions ‘starving’ for resources) and to decrease the priority of CICS transactions with high velocity-of-work (or transactions monopolizing resources).

The CICS Workload Manager defines a transaction as the work done between sync point requests and
measures the amount of work done by counting the number of KCP requests made by the transaction. On CICS systems where applications do not perform sync point requests for themselves, CICS will automatically issue a sync point request at end of task or when the transaction enters a terminal wait. If a transaction uses more than the specified number of work units (the CICSINT threshold value), it is moved down by one priority increment. At the same time, if a transaction is delayed by a higher priority transaction using all available resources, the delayed transaction is moved up by one priority increment. As transactions enter the CICS system their initial priority defines the its starting position in the CICS Workload Managers management scheme.

The CICSINT threshold value is normally set to a value about 50% higher than the number of work units used by an average transaction. In this way most CICS transactions enter and leave the system with little, if any, management required. However, when longer running transactions enter the system, the priority of these transactions will float downward while the priorities of transactions competing for resources will float upward. By managing the priorities of the various CICS transaction active in a system, the CICS Workload Manager maximizes throughput resulting in reduced and more consistent response times.

CICS Workload Management Thresholds


Five OPTI-WORKLOAD THRESHOLD commands are used to customize the CICS Workload
Management feature. The CICS Workload Manager extracts these threshold values from the OPTI-
WORKLOAD partition. Therefore, the OPTI-WORKLOAD partition should be active before the CICS Workload Management feature is initialized. If the OPTI-WORKLOAD partition is not active initialization will complete and the CICS Workload Manager will enter ‘quiet’ or ‘inactive’ mode. The CICS Workload Manager will synchronize with the OPTI-WORKLOAD partition when it becomes active.

The THRESHOLD CICSINT nnnn specifies the CICS Workload interval. The default value is 100 work units. Work units are defined as the work done between DFHKCP requests.

The THRESHOLD PRTYMIN nnn specifies the minimum transaction priority to be managed by the CICS Workload Management feature. Any transaction with a priority less than the minimum will not be managed. The default value is 2.

The THRESHOLD PRTYMAX nnn specifies the maximum transaction priority to be managed by the CICS Workload Management feature. Any transaction with a priority greater than the maximum will not be managed. The default value is 4.

The THRESHOLD RUNAWAY nnn specifies the Workload Runaway Task multiplier. Any transaction using more than the RUNAWAY * CICSINT work units is considered a Workload Runaway Task. For example, if the CICSINT is 250 work units and the RUNAWAY is 100, any transaction using more than 25000 work units would be a Workload Runaway Task. The default is 100.

The THRESHOLD TRIVIAL nnn specifies the trivial transaction work unit limit. Any transaction using less work units than the specification is considered trivial. Trivial transaction are excluded from the non-trivial transaction statistics displayed in the CICS Workload Manager termination statistics. The default is zero.

CICS TS Workload Management

To begin, the CICS TS Performance Guide describes CICS TS transaction priorities like this ...

The overall priority is determined by summing the priorities in all three definitions for any given task, with the maximum priority being 255.

Priority = Terminal priority + Operation priority + Transaction Priority

The value of the system initialization parameter, PRTYAGE also influences
the dispatching order, for example, PRTYAGE=1000 causes the task's
priority to increase by 1 every 1000ms it spends on the ready queue.

This is a nice description and basically incorrect.

With the help of Eugene (Gene) Hudders, a CICS TS internals expert and author the C/Trek CICS TS performance monitor, I found out basically how the CICS TS dispatcher actually works.

The one-byte task priority is initially calculated as described above. When PRTYAGE=0 (disabled) the 1-byte task priority is converted to an 8-byte priority field with the 1-byte priority in the low order byte. With PRTYAGE=nnnn the 1-byte priority is converted in some way and combined with other information (E.g., transaction arrival time, etc) and stored as an 8-byte priority field.

This 8-byte priority field becomes the value used to control dispatching of the transaction for the transactions remaining lifetime.

The CICS TS dispatcher keeps 3 queues of tasks within a CICS TS image. The 1st queue is the Ready queue (also called the Private queue). This is a queue of ready to run tasks in priority sequence. The 2nd queue is the Dispatchable queue (also called the Public queue). This is a queue of dispatchable tasks ordered by queue arrival time. The 3rd queue is the overall task queue. Basically this queue is all tasks that are in the system.

The CICS TS dispatcher will dispatch tasks from the Ready queue in priority sequence until the queue is empty and "from time to time" (IBM's term) it will add tasks from the Dispatchable queue as it looks for more work. When a task on the waiting/suspend queue is ready to run, it is added to the Dispatchable queue. And the dispatch process continues ...

Now, this is my own interpretation of how the dispatcher works and some might argue that the 3rd queue is not really a queue at all or that the 2nd queue (Public) is not used on z/VSE (since it has only one QR TCB) but I think it is pretty close to the process the dispatcher uses.

One more note about the 8-byte transaction priority field, this value can, at times, be modified by CICS TS to ensure quick or delayed dispatching. For example, when a task is first attached (started), its priority is set to a 'very high' value to ensure the initial dispatch of the task is done quickly. This ensures the attach process, including the calls to any Task Related User Exit (Start of Task TRUE) is completed very quickly. After this the task priority is returned to normal. At other times, the dispatcher may add a delay to the dispatch time. For example, if available storage becomes less than certain values, the dispatcher may add a task to the Dispatchable queue but add some number of milli-seconds to its arrival time (remember the Dispatchable queue is ordered by arrival time).  This effectively delays dispatching.

Are there problems with this design?

Overall the design is quite good. Certainly far better than the old CICS/VSE dispatcher. But there are a couple of issues ...

One issue with CICS TS PRTYAGE is that is only increases the priority of a task. To be effective a workload manager must also understand that heavy resource usage tasks (that should be batch applications) do exist and when running must have their priority reduced to allow other light resource tasks a chance to run. 

Also you need to use a very low setting for the PRTYAGE parameter because if you use the default setting of 32768 (32.768 seconds) then the task's priority would not be increased for 32.768 seconds. If a task had to wait to be dispatched for 32 seconds, you have a bigger problem than the PRTYAGE parameter is going to help. In fact, in z/OS this parameter's default value has been lowered from 32768 to 1000.  I would go further and suggest a value between 250 and 500.

Remember, as the CICS TS dispatcher ages a transaction the priority of the transaction increases until the next successful dispatch of the task. At that point the dispatcher resets the priority back to normal. The affect on the priority by PRTYAGE is temporary. While this makes sense (at least to me) it does show that PRTYAGE has little affect on dispatching unless the CICS TS system is very busy and transactions are backed up waiting to run.


OPTI-WORKLOAD to the Rescue

The CICS Workload Manager in OPTI-WORKLOAD will both decrease and increase the priority of a CICS transaction based, not on time, but on the amount of resources used (work done) by the transaction. These changes in the transaction's priority are permanent until or unless the CICS Workload Manager decides to change the priority again.

The CICS Workload Manager can manage the workload with or without PRTYAGE active. 

When using the CICS Workload Manager you specify the number of work units that are considered normal. This is the CICSINT value. If a transaction uses less than CICSINT work units nothing is done to its priority. If the transaction does exceed CICSINT work units and its initial priority falls in the management range (PRTYMIN to PRTYMAX) then the transaction will have its priority reduced by one. Over time a long running or heavy resource usage transaction will see its priority gradually lowered until it reaches the PRTYMIN value. At the same time, transactions that are waiting to run but unable to run because high resource usage transactions will see their priority gradually increased until it reaches the PRTYMAX value. The net result is transactions within CICS will have their priority managed up or down based on resource usage achieving maximum throughput. 

With the CICS Workload Manager active long running (batch like) transactions no longer block or 'lock out' normal or short running transactions.

Defining Transaction Priorities

Priority              Comment
_____________________________________________________________
0                     Lowest
1
...
...
PRTYMIN            \  Minimum managed by the Workload Manager
...                 \
...                  |- Range managed by the Workload Manager
...                 /
PRTYMAX            /  Maximum managed by the Workload Manager
...
...
254
255                   Highest 

There are many ways to define the initial priority of transaction within CICS. Some basic recommendations are ...

#1, Avoid 0 and 255. These priorities are the lowest and highest. And, often used by CICS housekeeping and utility transactions.

#2, Long running, high resource usage transactions should be defined with a low priority.

#3, Short running, low resource usage transactions should be defined with a high priority.

#4, Mixed usage transactions are difficult to manage without help.

Sometimes obeying these rules is easy. For example, a bank might have a Savings Balance (SBAL) transaction that simply reads a single customer record and displays some information about the customer's account. This is a transaction that should have a high priority. On the other hand, when the branch manager is balancing all the tellers with a TBAL transaction that reads the entire teller file, you have a transaction the needs a low priority.

Sometimes obeying the rules is not so easy. Perhaps you have a 'master' MENU transaction that invokes many functions and may even transfer control to many other transactions. Yet all of these transaction run with the priority of the 'master' MENU transaction. Another example might be a TCP/IP socket transaction that listens for connections. When a connection is made the socket is given to a child transaction for processing. Yet the child transaction might have the same transaction ID as the listener transaction. With these transactions, some are low resource usage transactions and some are high resource usage transactions. Proper management of their priorities is very difficult unless you have a CICS Workload Manager.

When using the OPTI-WORKLOAD CICS Workload Manager the whole process becomes quite simple. For example, define your CICSINT typical transaction work units, PRTYMIN (perhaps 50) and PRTYMAX (perhaps 150).

Rule #2, These transactions get a priority between 1-49
Rule #3, These transactions get a priority between 151-254
Rule #4, These transactions get a priority of 100 and are managed.

With proper CICS Workload Management CICS transactions response times are reduced, stabilized and far more consistent. 

Well there you have it. Enjoy.

Jeff Barnard
Barnard Software, Inc.