IBM publishes their Large Systems Performance Reference (LSPR) ratings, and Amdahl and HDS publish their relative performance ratings, for new processor speeds and capacities. Do these ratings match your workloads, and will your work experience the performance differences as published by the vendors? This paper explains why (or why not) your performance may match the vendor’s published results. It also provides some suggestions on how to confirm the performance you receive.
In addressing this issue, I’ll cover the following:
1. Definition of terms
2. Why use vendor claims?
3. How do vendors meet their claims?
4. IBM’s LSPR ratings
5. Amdahl’s performance ratings
6. HDS’s performance ratings
7. Why you should use these claims
8. Why your installation’s experience may differ
9. What can you do
This discussion describes considerations for MVS or OS/390 systems running on IBM, Amdahl, or HDS processor models. Most of the issues addressed, however, apply to VM and VSE systems as well.
1 DEFINITION OF TERMS
Throughout this paper, I’ll use various terms and I want to indicate my definition of these terms.
1.1 CPU VS. Model VS. CEC VS. Machine
Since every author uses different terms for describing a processor complex, I’ll start with the definitions I’ll use in this paper.
A CPU is a single processor that can execute instructions on behalf of some unit of work. It will have one, and sometimes more than one, high-speed buffer in which to store data while it is being referenced. A CPU can be dispatched by the operating system to execute one unit of work, such as a TCB or SRB, at a time. A CPU is sometimes referred to as a processor, but I’ll avoid that usage in this paper because some people refer to a processor as having multiple CPUs.
A processor model is a combination of one or more CPUs and is delivered with central and expanded storage, an I/O processor (CPU), possibly system assist processors, system control processors, and various levels of cache buffer storage. A vendor will normally market many models, such as the IBM 9672-RX4, the HDS Pilot R7, or the Amdahl Millennium GS545. Various authors refer to these processor models as CPCs (Central Processing Complexes), CECs (Central Electronic Complexes), machines, or simply processors. I’ll use model or machine in this paper.
Speed is the relative ability of a single CPU to perform work. A CPU with a faster speed than another should be able to process more work in the same amount of CPU time. The speed of a CPU is often rated in terms of MIPS (once referred to as Millions of Instructions Per Second) as described below. The speed of a single CPU in a model of multiple CPUs is often referred to as the “uni-” speed and equates to a single-CPU model in a series of models that are built using the same uni-processor (a single CPU).
Capacity, on the other hand, is the relative ability of all of the CPUs in a model to perform work. A model with a higher capacity than another should be able to process more work in the same elapsed time.
Given a model’s single CPU speed, I define capacity as being equal to the effective CPU speed multiplied by the number of CPUs in the model. In a uni-processor, the capacity and speed are the same.
Figure 1 – Extract of Cheryl Watson’s CPU Chart
It is possible for one model to have a faster CPU speed, but a smaller capacity (due to having a fewer number of CPUs) than another model. You can also have one model with a slower CPU speed, but a larger capacity than another model due to a large number of CPUs.
When CPUs were considerably less complex than they are today, speed ratings in terms of Millions of Instructions per Second (MIPS) were used to rate the CPUs. Because the CPU instruction set and the processor models have gotten much more complex, the use of a single number to identify the speed of a CPU has lost a lot of its accuracy. Today, it’s much more common to hear of CPU speeds as a range of MIPS or relative processing power.
Most MIPS ratings today are simply based on the vendor’s claims of the relative performance of each model. Many analysts will provide these MIPS ratings based on the vendor’s claims in order to provide a consistent view of speed and capacity across multiple vendors. Gartner Group, the Meta Group, IDC, and Watson & Walker are among the groups who publish MIPS ratings. We publish ours in Cheryl Watson’s TUNING Letter [REF001], and I’ll use our MIPS ratings for all references in this document. The reason for the continued use of MIPS is that customers are more comfortable with MIPS than relative performance numbers.
The primary value of MIPS is to provide a starting point to identify a group of processor models that are close to the capacity required. A single number will not provide a good estimate of what you can expect to receive.
There are two types of MIPS to be concerned with. One is the total capacity of the processor model. This provides insight into the total amount of work that can be processed on that particular model. The second MIPS rate to be aware of is the MIPS per CPU. This estimate provides insight into the speed of a single CPU. This is needed since it is possible to have a 400 MIPS model composed of 4 CPUs at 100 MIPS each, 8 CPUs at 50 MIPS each, or 12 CPUs at 33 MIPS each, and your work will perform very differently on each of these configurations. You can see some examples of MIPS ratings from our CPU Chart [REF001] in Figure 1.
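The arithmetic behind these two MIPS views can be sketched briefly. The configurations below are the hypothetical roughly-400-MIPS examples from the text, not actual vendor ratings:

```python
# Total capacity vs. per-CPU speed: three roughly 400-MIPS models.
# These configurations are the hypothetical examples from the text.
configs = [
    {"cpus": 4,  "mips_per_cpu": 100},
    {"cpus": 8,  "mips_per_cpu": 50},
    {"cpus": 12, "mips_per_cpu": 33},
]

for c in configs:
    total = c["cpus"] * c["mips_per_cpu"]
    print(f'{c["cpus"]:>2} CPUs x {c["mips_per_cpu"]:>3} MIPS each = {total} MIPS total')
```

A single CPU-bound batch job cares about the per-CPU number (it runs three times faster on the first configuration than on the third), while total transaction throughput is governed by the capacity number.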
2 WHY USE VENDOR CLAIMS?
2.1 Alternative Too Expensive
Vendor ratings are the basis of all comparison charts available on the market today because it’s simply too expensive for anyone other than the vendor to purchase or obtain access to every processor that’s available. The vendors have access to all of their own processors and must make performance runs on all of their own hardware anyway. Sometimes the vendors will have access to their competitors’ machines and so can make comparisons between the two with their own workloads.
2.2 Vendor Has Market Goal
In almost every case, vendors know what market they are trying to meet before a model is announced and will set the capacity sometimes before the model is built. One example of this is Amdahl’s June 1997 announcement stating they would provide a 75 MIPS uni-processor in 1Q98 and a 100-MIPS or more uni in 1999.
The hardware design team targets the processor speed as they begin the design work, and they don’t stop design, modification of design, and just plain “tweaking” until the model has reached the targeted capacity. This means that you can normally depend on a model matching the vendor’s claims prior to its availability.
Figure 2 – Processor Groups
One of the reasons that a vendor will have a fairly specific goal in the capacity of a model is to provide a full range of capacity relative to software pricing. Software pricing is normally based on either processor group or MSUs (millions of service units), with significant software license charge increases with each higher group or range of MSUs. A vendor wouldn’t be wise to provide three models in a series for groups 50, 70, and 70. A better option which would be attractive to more customers would be to provide models in groups 50, 60, and 70, even if it meant down-grading one of the models (in this case, taking the smaller group 70 model and downgrading it to fit into the group 60 range). (More about that later.) The same is true of MSU ratings.
Figure 2 shows an extract from our CPU Chart [REF001] that organizes the processors first by software processor group and then by MSUs. Notice that, in some cases, a model will have a higher MIPS rating than a different model in a higher group. A goal of most installations is to obtain the highest MIPS rating for their workloads at the smallest processor group and MSU rating in order to reduce costs. In Figure 2, you can see that the HDS GX8314, the IBM 9672-R35, and the HDS Pilot 37 might be good bargains because they have the highest average capacity within group 60.
Each vendor is concerned with having a model that will provide an easy incremental step in possible upgrades for their customers.
2.3 Performance Guarantees
Another reason to use the vendor’s claims is that vendors will often write “performance guarantees”. As a customer, you should demand a contractual performance guarantee from any hardware vendor. The performance guarantee is normally based on the capacity difference between a new model and the user’s current model. Performance guarantees, however, are often difficult to negotiate and difficult to refute or confirm. The reasons will become clearer later in this paper, and I’ll address performance guarantees again in the summary.
3 HOW DO VENDORS MEET THEIR CLAIMS?
As previously mentioned, the vendors know what capacity they are aiming for in a particular CPU model. As an example, in order to address each area of their target market for the latest Generation 4 models, in June 1997 IBM announced 14 new models ranging from 8 MSUs to 78 MSUs. Based on our analysis, this corresponds to uni-processor speeds of 48 MIPS, 56 MIPS, 62 MIPS, 66 MIPS, and 72 MIPS. Only certain MP (multiprocessor) models are available for each uni-speed, depending on the target market.
Let’s take a look at how a vendor can produce a model that provides a specific speed and, therefore, capacity.
3.1 Chip Sorting
The CMOS processor chips, while designed to be the same speed, in fact turn out to be slightly different. While the vendor requires a minimum speed out of each chip, a few might be much faster than required and a few might be slower. The vendors end up “sorting” the very fast and very slow chips out. The slower chips can be used in the smaller models and the faster chips can be used in the larger, faster models. There will normally be few chips at the higher end, so they will most likely be used for the largest models. Amdahl indicates that while they used chip sorting on their 5995M models, they aren’t using it for their Millennium line.
As an example, the fastest IBM Generation 4 (G4) chip is rated at 2.7 nano-seconds and is used in their RY5 (10-way based on a 72 MIPS uni-processor). Compare this to the 3.1 nano-second chips in the 66 MIPS uni-processor based models (R55 to RX5) and the 3.3 nano-second chips in the 62 MIPS uni-processor based models (R15 to R45).
In the case of the 2.7 nano-second chip for the IBM RY5, they were able to take the fastest chips from the chip sorting process and increase the speed by additional cooling using a refrigerant. (IBM states that it is an environmentally safe refrigerant, R134A.)
3.2 System Structure Changes
There are several other things a vendor can do to adjust the speed and capacity of a processor model. The size, placement, and access to the lookaside buffers can be changed. The placement and connections to the CPU chips can be changed. The type of wiring can be changed. The location of instruction sets or microcode can be moved. The amount of parallelism performed in the instructions or data can be adjusted. The amount and use of high-speed cache can be changed. As seen in IBM’s RY5, cooling can be added to increase the speed of the CPU chip.
There are dozens of other changes that can be made in the hardware and microcode that will affect the effective speed of a CPU. Suffice it to say that the vendors have the knowledge and experience to “tweak” these as needed to achieve a specific speed for a machine.
Sometimes a vendor will refer to “degraded” or “down-graded” models (although the labels aren’t comforting!) that are needed to fill in a processor range. These might be slower chips or they might contain system structure differences to reduce the effective speed in order to fit the machine into a lower software rating.
Likewise, a “turbo-charged” model might contain faster chips or include additional system structure changes to provide the needed increase in speed.
3.3 MP Effect
The “MP effect” is a term used to describe the overhead seen due to the multiprocessing effects of running multiple CPUs in the same image. For years, the MP effect was a fairly consistent 4-5% per CPU added in bipolar models. That is, if a uni-processor was rated at 50 MIPS and a second CPU was added to the same model, you would see a capacity closer to 95 MIPS than 100 MIPS for the two CPUs. This loss of capacity is referred to as the MP effect and was usually about the same for all processor models.
When IBM moved to CMOS models, however, the MP effect became more significant. As an example, look at the #MIX ITRRs for two ten-ways, the bipolar 9021-9X2 and the CMOS 9672-RX5. The column called MP % shows the percentage of effective MIPS in the MP compared to the total possible MIPS if there were no overhead. From Figure 1, we can see that the 9021-9X2 provides about 465 MIPS, which is 75% of a potential 620 MIPS (10 CPUs at 62 MIPS, the speed of the 711 uni). The 9672-RX5, on the other hand, provides only about 394 MIPS, just 60% of its potential 660 MIPS (10 CPUs at 66 MIPS).
The RX5 models show the highest MP effect to date, and this is one of the reasons, I think, for the interesting series of models that IBM announced in June 1997. The R55 (5-way) to RX5 (10-way) models are based on a 66 MIPS uni CPU, which is faster than IBM’s largest bipolar, the 9021-9X2 (at 62 MIPS uni), but the RX5’s total capacity of 394 MIPS is far less than the 9X2’s (477 MIPS) because the RX5 has more MP overhead than the 9X2. So IBM also announced the RY5 10-way at the same time. The RY5 is based on a turbo-charged 72 MIPS uni CPU and is able to provide a capacity of 439 MIPS, which is much closer to the 9X2. That is, to compensate for the higher MP effect, IBM provided a model with faster CPU chips.
The main reason to be aware of the MP effect is when you are considering the addition of a CPU to a current configuration. From a capacity planning standpoint, you should be aware of the decrease in effective capacity of the other CPUs. It’s not a pricing issue, since the prices are adjusted by the vendor based on the effective capacity of the machine.
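The MP-effect arithmetic can be sketched as follows; the effective-MIPS figures are the approximate ones quoted above from Figure 1.

```python
# MP %: effective MIPS of the whole model as a percentage of the
# "no overhead" potential (uni speed times the number of CPUs).
def mp_percent(effective_mips, uni_mips, n_cpus):
    potential = uni_mips * n_cpus
    return 100.0 * effective_mips / potential

# Bipolar 9021-9X2: ~465 effective MIPS from ten 62-MIPS CPUs
print(round(mp_percent(465, 62, 10)))   # -> 75
# CMOS 9672-RX5: ~394 effective MIPS from ten 66-MIPS CPUs
print(round(mp_percent(394, 66, 10)))   # -> 60
```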
4 IBM’S LSPR RATINGS
To confirm the speed and capacity of their processor models and to help customers understand what to expect from different models processing their workloads, IBM publishes their Large Systems Performance Reference [REF002], as a manual and as a performance tool. You can also find the LSPR numbers on the Web [REF005]. Both their techniques and results are published in the manual, and I strongly recommend that you become familiar with their methodology. This section provides my summary of their 50-page discussion of the technique.
IBM has designed and accumulated a series of traditional workloads they feel are representative of customers’ workloads.
The sets of workloads consist of:
CB84 – Commercial Batch Workload
- This set of 130 jobs, with 610 unique steps, provides a typical, traditional view of batch work. This workload consists of COBOL, Assembler H, and PL/I programs, along with compilers and utilities such as DFSORT. The BSAM, QSAM, BDAM, and VSAM access methods are used.
- This is most representative of the traditional batch applications running in installations today.
CBW2 – Commercial Batch Workload 2
- This set of jobs was introduced with SP 4.2.2. It has 32 jobs with 157 steps and is more representative of newer applications that exploit more ESA functions, such as data in memory. It consists of programs written in C, COBOL, FORTRAN, and PL/I. The steps perform sorting, use DFSMS utilities, compilers, VSAM and DB2 utilities, SQL processing, SLR processing, GDDM graphics, and FORTRAN engineering routines. There is more JES processing, and the workload spends about 50% of the time performing DB2 activities.
FPC1 – Engineering/Scientific Batch Workload
- This workload is an engineering and manufacturing jobstream that includes “static analysis, dynamic analysis, computational fluid dynamics, nuclear fuel calculations, and circuit analysis.” This will be representative of much of the SAS work in commercial installations due to SAS’s heavy use of floating point.
TSO – Time Sharing Option Workload
- The TSO workload is representative of TSO program development using ISPF/PDF. It includes editing, browsing, foreground compiles, testing, graphics, and Info/Management. 25 different scripts are used, driven by an internal driver, to generate the activity required to drive the system to 70% or 90% utilization. TPNS is not used, although IBM periodically uses TPNS to confirm the consistency of their own internal driver.
CICS – Online Transaction Workload
- In SP 4.2, the CICS workload consisted of 102 transactions; in SP 5.1, it consists of 204 transactions. CICS is run in an MRO (multiregion operation) configuration with a TOR (terminal-owning region), an AOR (application-owning region), and an FOR (file-owning region). In SP 4.2.2, an additional AOR/FOR region was added. As many of these “MROplexes” are run as needed to drive the system to 70% and 90% utilization, usually one MROplex per one or two CPUs. COBOL and assembler are used for the programs, and VSAM is the primary access method. The work is designed to be representative of order entry, stock control, inventory tracking, production specification, hotel reservations, banking, and teller systems.
IMS – Online Transaction Workload
- The IMS workload is similar to the CICS workload, but uses DL/I applications. There are 17 transaction types. Enough Message Processing Regions (MPRs) are run to bring the system to the desired utilization (70% and 90%) without causing contention within an MPR. The DL/I HDAM and HIDAM access methods are used with VSAM and OSAM databases. In SP 5.1, two IMS control regions are used and data sharing occurs using the IMS Resource Lock Manager (IRLM). BMPs (Batch Message Processing regions) are not included.
DB2 – Database Transaction Workload
- The DB2 workload consists of seven transactions applied to two applications, inventory tracking and stock control. The DB2 requests are driven by IMS/DC. Enough regions are created to eliminate contention within the subsystems. There are two DB2 databases comprising 11 tables for inventory and 5 tables for stock control, with 1 to 5 indexes on each table.
Since these transactions don’t invoke DB2 sorts, the DB2 sort assist feature available on some models is not exercised.
Only one type of workload is run during an LSPR test and the systems are run at fairly high CPU utilization (close to 100% for batch and FPC1 and at both 70% and 90% utilization for online and TSO). For the online work, the IBM team waits until the system has stabilized before starting the measurement phase.
Figure 3 – IBM LSPR ITRRs [from REF002 & REF005]
Two very important items to note are that only one type of workload is run in each test and that the tests are run in totally unconstrained environments. That is, CICS is not tested with TSO, and IMS is not tested with batch, during the same runs. Also, in order to accurately determine the effect of the processor capacity, IBM must ensure that no other constraints exist on the system. That is, there is virtually no paging due to the abundance of all types of storage, there is no I/O constraint (almost 100% cache hits), there is no lack of VTAM buffers or JES initiators, and even the CPU is not run until it is constrained (it is never run at over 100% busy).
From the measurements made while running these benchmarks, IBM calculates an Internal Throughput Rate (ITR) which is equal to the units of work (jobs or transactions) divided by the processor busy time. Models with higher capacities will be able to process more work in the same amount of processor busy time compared to models with lower capacities and will have higher ITRs.
Each workload will have its own ITR. To be able to compare two models, IBM uses an ITRR, or Internal Throughput Rate Ratio, which is calculated by dividing the ITR for the new model by the ITR for a base model. Prior to June 1997, IBM published a list of the ITRRs using their 9021-520 as a base model, with the ITR for each workload being set to 1.0.
Thus, a model that can process 50% more work in the same amount of CPU time compared to the 520 will have an ITRR of 1.5.
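The ITR/ITRR arithmetic can be sketched in a few lines; the job counts and busy times below are hypothetical, chosen only to reproduce the 1.5 example.

```python
# ITR: units of work completed per second of processor busy time.
def itr(units_of_work, cpu_busy_seconds):
    return units_of_work / cpu_busy_seconds

# Hypothetical runs: the new model completes 50% more work in the
# same amount of processor busy time as the base model.
base_itr = itr(10_000, 500)
new_itr = itr(15_000, 500)

# ITRR: the new model's ITR relative to the base model's ITR.
itrr = new_itr / base_itr
print(itrr)  # -> 1.5
```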
In June 1997, IBM published preliminary LSPR ratings for their newest models using the CMOS 9672-R15 as a base. In August 1997, they republished their LSPR ratings for all models using the R15 as the new base. These new ratings were quite a bit different than the 520 ratings because the operating system and subsystem releases used in the LSPR runs were changed at the same time. This led to more than a little confusion. If we take IBM’s statement that the R15 is equivalent to the 9021-711, and we also accept the 711 as a 62 MIPS machine, all other machines would see a corresponding 2% to 6% increase in MIPS ratings based on the LSPR ratings!
Figure 3 shows an extract from IBM’s LSPR charts for their three models as compared to the 9672-R15. You can interpret the chart as saying that their TSO workload achieved 4.40 times as many transactions in the same amount of processor busy time on the 9672-R55 as compared to the 9672-R15. This is based on the total capacity of the model, not necessarily the speed of a CPU as we’ll see later.
In order to help people consider the capacity based on a mix of workloads, IBM derives an estimated ITRR called #MIX, which consists of 20% of the ITRR of each of the five workloads: CICS, IMS, DB2, TSO, and CB84. This is a calculated value only; it is not confirmed by running 20% of each workload, with which it would be next to impossible to achieve consistent results.
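The #MIX derivation is just an unweighted average of the five workload ITRRs. In the sketch below, only the DB2 value (5.92) is a figure quoted in this paper for the 9672-RX5; the other four are purely illustrative.

```python
# Derived #MIX: a 20% weight on each of the five workload ITRRs.
# Only the DB2 value (5.92) comes from the text; the rest are
# illustrative, chosen to land near the RX5's published 6.36 #MIX.
itrrs = {"CICS": 6.1, "IMS": 6.0, "DB2": 5.92, "TSO": 6.5, "CB84": 7.3}

mix = sum(itrrs.values()) / len(itrrs)
print(round(mix, 2))  # -> 6.36
```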
4.3 How These Are Used
The #MIX, or an early expectation of #MIX, is used to derive the SRM service unit coefficient, the service units per second as published for each model. The SU/Sec is used by SRM to compensate for different speed CPUs when determining the frequency of invoking certain functions. The SU/Sec used to be a fairly good indicator of CPU speed because it is related to the speed of a single CPU. A CPU with an SU/Sec of 400 is roughly twice the speed of a CPU with an SU/Sec of 200. This number is becoming less effective, however, as an indication of CPU speed for several reasons.
First, the SU/Sec value is published and made available often before final LSPR tests have been completed. While the published ITRRs might change, the SU/Sec values are seldom changed. Secondly, in older models, the difference in speed between workloads was fairly close. With modern processors, the difference in speed between workloads can be over 30%. As an example, in Figure 3, the FPC1 workload on a 9672-RX5 has an ITRR (9.61) that’s over 60% higher than DB2 (5.92), and over 50% higher than #MIX (6.36). It would be very difficult to use a single number to indicate the speed of the RX5 for these differing workloads. There’s a 14% variation in just the five workloads used to derive the #MIX.
The published #MIX is also used by most of the industry analysts to determine the relative MIPS ratings of different processors. This is an important concept for people that use published MIPS because it means that there could be a 40% or more variance between the published MIPS and what your workload would see. In our CPU Chart, we list estimated MIPS per workload to help people understand the difference that workloads make in estimating the capacity of a specific model.
4.4 Changes After GA
If there are significant performance improvements made available after General Availability (GA) of a model through microcode or other means, IBM has indicated that they will rerun the test and republish the changed ITRRs. They do not expect to alter the SU/Sec values, the processor group ratings, or the MSU ratings.
5 AMDAHL PERFORMANCE CLAIMS
Amdahl has a set of internal benchmark jobs similar to IBM’s, but they do not publish a description of their workloads or specific performance claims for each type of workload. They normally publish a range of performance that can be expected for a given model compared to their 5995-4570M. For example, their newly announced CMOS Millennium series contains a model GS745, which is listed as having a performance rating of 1.16 to 1.28 of the Amdahl 5995-4570M.
Since Amdahl does not publish their workloads, we can’t be certain which workloads fall at which end of the range, although we might expect them to be similar to IBM’s workloads. Most analysts take the midpoint of the high and low as the average and relate that to IBM’s #MIX workload. Whether this is valid remains to be seen.
Amdahl has always derived their SU/Sec value a little differently, however. Their logic has been to provide consistent TSO response across a hardware change. In order to do this, the same percent of TSO transactions need to complete in first period. For this to be true, the durations must be adjusted to match the CPU speed. Amdahl assigns a value to the SU/Sec to ensure that the same percent of TSO transactions complete in first period. This has meant that the Amdahl SU/Sec values for bipolars have been higher by 6-8% than corresponding IBM and HDS bipolars. The Amdahl models had SU/Sec values that resulted in calculations of about 52 SU/Sec for each MIPS, while IBM and HDS had closer to 48 SU/Sec for each MIPS.
With CMOS models, however, the vendors are getting closer. The IBM CMOS models are now closer to 51 SU/Sec while the Amdahl models vary from 48 to 52 SU/Sec (with a strange anomaly in the GS535 which results in almost 55 SU/Sec per MIPS).
This means two things to you. It is fairly dangerous to try to compare service units between models from different vendors. And it’s also dangerous to compare service units between models of widely different ages.
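As a sketch of why such comparisons are dangerous, here is the SU/Sec-per-MIPS arithmetic. The SU/Sec totals below are invented, chosen only to produce the approximate bipolar-era ratios mentioned above.

```python
# SU/Sec per MIPS differs by vendor and by machine generation, so
# raw service units are not comparable across models.  The SU/Sec
# values here are hypothetical; only the ratios echo the text.
models = {
    "IBM bipolar (~48/MIPS)":    {"su_per_sec": 2976, "mips": 62},
    "Amdahl bipolar (~52/MIPS)": {"su_per_sec": 3224, "mips": 62},
}

for name, m in models.items():
    print(name, round(m["su_per_sec"] / m["mips"]))
```

The same CPU second yields roughly 8% more service units on the Amdahl bipolar, so comparing raw service-unit counts across the two machines would overstate the Amdahl’s relative capacity.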
6 HDS PERFORMANCE CLAIMS
HDS uses two techniques for publishing performance ratings. Two series of HDS models, the GXxxxx series and their CMOS Pilot models, are designed to be directly competitive with corresponding IBM models, and therefore use comparable IBM ratings. The Skyline models, which are based on the fastest CPU speed available today, are not comparable in speed to any IBM or Amdahl CPU, so HDS publishes separate ratings for the Skylines (as well as a few other models that don’t have corresponding IBM matches).
The HDS models that are comparable to the IBM models are published by HDS as having “equivalency” to the IBM models, and their performance claims are equivalent to IBM’s claims. For the few models in these series that do not have a direct equivalent within the IBM range, HDS publishes a performance range, indicating, for example, that a model might provide 1.2 to 1.4 times the performance of an HDS GX8110.
The Skyline models, which are really combinations of bipolar and CMOS technology, don’t relate to an IBM model, but performance claims are published that indicate, for example, that a Skyline is 2.0 times the HDS GX8114. HDS has derived these performance claims by running their own set of benchmark jobs. Neither a description of the jobs nor the resulting measurements is published.
I’ve noticed that Skyline SU/Sec values range from 48 to 52 SU/Sec per MIPS, so the SU/Sec values might appear higher or lower than service units from other vendors.
7 WHY YOU SHOULD USE THESE CLAIMS
7.1 The Bad News
It’s important to understand that there is no measurement in existence that can provide a single rating for a processor model that is indicative of its speed and capacity for a variety of workloads. It’s similar to buying a car based on expected mileage. A car might be rated for 20 miles to the gallon, but that is seldom what you will find. You will drive the car much differently than the testers who came up with the initial rating did. For example, if you happen to have a lead foot (i.e., drive too fast!), you’ll NEVER get the mileage your car is rated for. If you drive it according to their recommended speeds, and in their type of traffic, and on their types of roads, and with the same amount of weight in the car, and with all of the extra equipment turned off, you might be able to come close to their estimate. The same is true of processor models.
With that said, however, I strongly recommend that you use the vendor’s claims for sizing a machine, because it is as close as you can get initially.
7.2 Performance Guarantees
I also believe that you should not obtain any hardware without some contractual commitment from the vendor about the performance that you expect to receive from the processor model. Since I know that it’s possible to obtain a performance guarantee from a vendor (and also know that it won’t be offered unless asked for), I’d recommend that every installation plan to obtain such a guarantee. These guarantees can only be obtained based on the vendor’s claims.
So therefore, I think you should trust the vendor to provide the right capacity estimates, but get it in writing!
The trick in any contract is to identify how you and the vendor will agree to the performance that you’re getting. This often requires very knowledgeable people on both sides who can understand the difference in performance because your workloads may not match the vendor’s workloads.
7.3 Industry Charts Based on Vendor’s Claims
Since most of the industry charts of MIPS are based on vendors’ claims, almost every company is indirectly using what the vendors have provided.
8 WHY YOUR EXPERIENCE MAY DIFFER
Why wouldn’t you get the same performance out of a processor model for your workloads? There are several reasons and I’ll address the most common among these:
1. Workloads vary
2. Your workloads don’t match the vendor’s
3. You measure different things
4. Your mix doesn’t match the vendor’s
5. The workloads vary throughout the day
6. The volume affects capacity
7. Constraints in software affect capacity
8. Constraints in hardware affect capacity
9. LPAR affects capacity
10. Dispatch priorities affect capacity
11. Software levels affect capacity
12. Levels of PTFs affect capacity
13. Different facilities invoked
14. Amount of storage affects capacity
15. Level of tuning
16. User’s behavior changes
17. The one thing that remains consistent is that you will always have change
18. All of the above
8.1 Workloads Vary
The primary reason that a single performance estimate will not work for most sites is that performance differs for each type of workload. In the newer processors, the range of this difference is getting larger with each new model.
I think that the following summary made from the LSPR manual [REF002] is enlightening and helps provide some insight into what you might expect to see:
a. The actual MIPS rate for a model will, in general, be highest for workloads at the batch end and lowest for workloads at the online end of the spectrum.
b. When comparing n-way models to their corresponding uni-processor model, the actual capacity will be highest for workloads at the batch end and lowest for workloads at the online end of the spectrum.
c. When comparing models with larger high speed buffer caches to those with less, the capacity will be higher for workloads at the online end and lowest for workloads at the batch end of the spectrum.
One problem is that your workloads aren’t necessarily designed to meet those same specifications that IBM uses for their LSPR workloads. For example, you might have some TSO users who use a lot of SAS (close to FPC1 workloads), others who access DB2 frequently (close to DB2 workloads), and others who spend the bulk of their time in ISPF (close to TSO workloads). The number of each type of user will determine which part of the scale you’re on when evaluating TSO. BMPs (IMS batch programs) may look much like IMS and yet may have many of the characteristics of CBW2. In order to use LSPR effectively, you must be aware of the workload mix you’re executing.
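One way to apply this awareness is to weight per-workload capacity estimates by your own measured mix. Everything in this sketch is hypothetical: the per-workload MIPS figures and the mix fractions are invented for illustration, not taken from any published chart.

```python
# Blend per-workload MIPS estimates by the fraction of CPU time each
# workload consumes at your installation.  All figures hypothetical.
def blended_mips(per_workload_mips, mix_fractions):
    assert abs(sum(mix_fractions.values()) - 1.0) < 1e-9  # mix must cover 100%
    return sum(per_workload_mips[w] * f for w, f in mix_fractions.items())

per_workload = {"CICS": 380, "DB2": 360, "TSO": 400, "CB84": 440}
my_mix = {"CICS": 0.40, "DB2": 0.25, "TSO": 0.15, "CB84": 0.20}

print(round(blended_mips(per_workload, my_mix)))  # -> 390
```

The blended figure will usually differ from the published #MIX-based MIPS, which is exactly the point of estimating it from your own mix.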
8.2 Your Workloads Don’t Match the Vendor’s
IBM has defined some very specific workloads, and while Amdahl and HDS have their own workloads for testing, we don’t know what they are. You will need to determine how well your workload matches the vendor’s workloads before you can tell if their estimates will be useful.
Here are a few examples where the performance of some workloads might not meet the vendor’s expected performance claims:
1. IBM’s TSO workload is an ISPF based workload that has a large amount of editing and browsing types of transactions. If your workload is primarily FOCUS or ADABAS, then your performance probably won’t be the same. FOCUS and ADABAS have characteristics that are much closer to CICS and DB2 than TSO.
2. A few customers found that some batch jobs took much longer than expected when they moved from a bipolar processor to a CMOS processor. It turned out that the problem was due to the fact that the packed decimal instruction set was much slower on the CMOS 9672 models than on the bipolars. Heavy use of packed decimal instructions tends to occur in COBOL programs that use subscripts for heavy table processing and were compiled with a compiler option of ‘TRUNC=BIN’. IBM didn’t run into this particular combination of heavy packed decimal work because their benchmark programs used indexes rather than subscripts. (I remember teaching students that they should use indexes rather than subscripts back in the early 1970s, but programmers and even vendors are still using subscripts!) This phenomenon has been significantly improved with some microcode changes, but it still exists in many of the IBM 9672 models and HDS Pilot models. For more information on this, see WSC Flash #9608 and the archives from the Watson & Walker ‘Cheryl’s List’ listserver [REF003].
3. As mentioned earlier in the DB2 workload description, IBM’s DB2 transactions don’t cause the DB2 Sort Assist facility to be invoked. Since many applications do require a DB2 sort, your workloads could get better or worse performance when moving between processors with or without the sort assist facility.
4. One of the most common problems I’ve seen recently is a much larger occurrence of work that uses floating point. SAS, for example, uses floating point for most of its work. Any installation with a large percentage of SAS in its daily processing should consider the FPC1 workload as being more representative of SAS than the other workloads. Since FPC1 isn’t used to determine IBM’s #MIX, SAS users can be quite surprised by the low ITRRs of the FPC1 workload on some models.
8.3 You Measure Different Things
In describing IBM’s LSPR technique, I referred to their use of ‘processor utilization’. This is all of the captured CPU usage for the measurement interval and includes CPU time consumed by all the system address spaces such as MVS, JES, RACF, VTAM, GRS, CONSOLE, etc., not simply the time recorded by the application in the SMF type 30 (job termination) or type 72 (workload by performance group or service class) records.
IBM can obtain all of the measurements because they run in a dedicated, stand-alone environment. It’s much harder for an installation to obtain all of the CPU time for a specific workload. For example, if you run TSO and CICS at the same time, how much of MVS, RACF, VTAM, etc. is being used by the TSO workload and how much by the CICS workload? You simply can’t tell.
So if you see a CICS ITRR of 1.2 between two machines, does that mean that the 20% speed increase will be seen as reduced CPU time in just CICS, or will part of it be seen as reduced CPU time in MVS? You don’t really know, because IBM is really measuring multiple things at one time (that is, the SMF time of the region, MVS, VTAM, initiators, JES, etc.).
8.4 Your Mix Doesn’t Match the Vendor’s
The published #MIX by IBM and the average performance estimate by Amdahl represent some mix of workloads. In IBM’s case, the assumption is that there is 20% TSO, 20% CICS, 20% IMS, 20% DB2, and 20% traditional batch. This isn’t representative of any installation that I’ve ever seen.
So you’ll need to determine your own mix. For daytime processing, you might want to look at your peak processing period and determine the makeup of the work at that time. For example, let’s assume you are moving from a 9672-R53 to a 9672-R83 and you run 50% CICS, 10% TSO, 10% batch, and 30% “other things” like MVS, RACF, VTAM, monitors, operations’ started tasks, and scheduling programs.
When using a variety of work, it’s easiest to determine the percent of each type of work during the peak interval (that’s when the capacity of the machine is most important). Simply group MVS and supporting functions with the miscellaneous workloads and use the #MIX ITRR. Let’s assume that you had some work on an 8-way 9021-982 and planned to move it to a 10-way 9672-RX5. Also assume that you were running 70% CICS, 10% TSO, and 20% other (MVS) during the peak intervals. From Figure 3, we can calculate the ITRR for CICS to be .91 (6.61 / 7.23), the ITRR for TSO to be 1.08 (6.48 / 6.00), and the ITRR for #MIX to be .98 (6.36 / 6.48). That’s 70% at .91, 10% at 1.08, and 20% at .98 for a combined ITRR of .94.
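That weighted calculation can be sketched in a few lines of code. The ITR values below are the Figure 3 numbers quoted in the text (note that the TSO quotient 6.48 / 6.00 actually evaluates to 1.08, and the combined result still rounds to .94); the mix percentages are the assumed peak-hour mix:

```python
# Mix-weighted ITRR sketch for the hypothetical 9021-982 -> 9672-RX5 move.
itr_old = {"CICS": 7.23, "TSO": 6.00, "#MIX": 6.48}   # ITR on the 9021-982
itr_new = {"CICS": 6.61, "TSO": 6.48, "#MIX": 6.36}   # ITR on the 9672-RX5
mix     = {"CICS": 0.70, "TSO": 0.10, "#MIX": 0.20}   # this site's peak-hour mix

# ITRR per workload, then the mix-weighted combined ratio.
itrr = {w: itr_new[w] / itr_old[w] for w in mix}
combined = sum(mix[w] * itrr[w] for w in mix)

for w in mix:
    print(f"{w}: ITRR = {itrr[w]:.2f}")
print(f"Combined ITRR = {combined:.2f}")
```

The same pattern extends to any number of workload categories; only the mix percentages and the Figure 3 lookups change.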
8.5 The Workloads Vary Throughout the Day
Of course, that’s for the typical peak processing time. What about the other times of the day? If you are busiest during daytime processing and are able to complete nightly processing in plenty of time, you can probably simply use the daytime estimates.
But if you have a tight batch window at night, as many installations do, you will need to calculate a daytime ITRR and a nighttime ITRR to better determine the effect of a processor change. It would be quite possible to find a site with a mix of 70% online during one peak hour only to find the mix has shifted to 70% batch in the nighttime peak hour.
As more companies move to international processing windows, the variation between day and night processing is reduced. Even so, the online workloads will vary dramatically throughout the day.
8.6 The Volume Affects Capacity
IBM, like Amdahl and HDS, ensures that the benchmark system is running at close to capacity, but not exceeding it, and certainly not severely underutilized. For IBM, that means that measurements are taken at close to 100% utilization for batch and FPC1, and at both 70% and 90% for the online workloads.
Your results will almost certainly vary if you run at different capacities. Frankly, few sites will upgrade to a new machine and immediately run at between 70% and 100% busy. A new machine almost always has excess capacity, and this will affect how much CPU is needed for the workload.
For some models, being underutilized will actually produce worse CPU overhead due to their management of high speed cache and how work is dispatched to the CPUs. Other factors, such as LPAR processing, can add several “low utilization” effects. For most models, however, being underutilized will result in less CPU time per transaction than the work will see as the system gets busier.
That means that shortly after moving to a new processor, you will tend to see very good performance. As you get more work on the system, which may be many months later, the CPU usage of the system will increase.
In almost every analysis I’ve made, jobs will take more CPU time when the CPU utilization is at its highest. This is often referred to as the “multi-programming” effect. If you measure the data at 50% CPU busy, it will always be in the vendor’s favor, because the machine will be able to get the work done in less time than estimated at higher utilizations.
This phenomenon is seen very frequently. An installation that has been severely constrained for months (running well over 100% busy for long periods of time) might replace their current machine with a model that has a higher capacity, so the entire workload can be processed while only running at 60% busy on the new processor. The jobs have been experiencing excessive CPU overhead due to the high utilization and are then moved to an environment where they take less than the vendor’s predictions, and it appears that you easily got what you paid for.
Therefore, you will need to wait until you’ve reached full utilization on your processor before knowing whether you have obtained the processor capacity that you had planned for.
8.7 Constraints in Software Affect Capacity
IBM points out in their LSPR manual that they ensure that no software or hardware constraints exist during their measurement period. That is, an IPS parameter that’s set to limit the amount of work on the system or poorly structured and managed JES initiators could seriously affect the capacity of your new machine.
Unfortunately, this happens quite often when an installation upgrades to a new processor. There are several dozen parameters that should be modified when you upgrade to a larger capacity machine. If these aren’t modified, you could be restricting the capacity of your new machine. A simple parameter, such as the domain constraints in the IPS, could cause an increase in the amount of swapping, and therefore, overhead in the new model.
8.8 Constraints in Hardware Affect Capacity
IBM eliminates hardware constraints during their testing because they don’t want to consider the CPU cycles spent dealing with the constraint. For example, they don’t want to spend CPU cycles in paging when the intention is to determine the speed of the CPU for a specific type of work.
You should be aware that if you have any hardware constraints, such as lack of I/O paths, poor cache hit ratios, poorly performing DASD, storage shortages, or other hardware constraints, that you could be impacting the potential capacity of your machine.
8.9 LPAR Affects Capacity
Perhaps the most significant reason that your workloads may not match your vendor’s expectations is that all performance claims are made for a non-LPAR environment. The vendors aren’t trying to hide anything, but they simply can’t account for all of the variations seen in an LPAR configuration.
LPAR processing, whether it’s IBM’s PR/SM, Amdahl’s MDF, or HDS’s MLPF, will take additional processing cycles. A small portion of LPAR processing may be displayed in the partition data available from RMF and CMF, but that is only the LPAR management time and does not include the bulk of the actual overhead. Most LPAR overhead is actually experienced by the workloads, and their CPU time (TCB or SRB) will increase in an LPAR environment.
The amount of increase is quite variable and dependent on several factors. The primary factors are the number of LPARs on the machine, the total number of shared logical CPUs, the ratio of logical to physical CPUs, and the activity in the other LPARs. An increase in any of these four will cause an increase in the CPU time for your work. This CPU time has not been considered in the vendor’s announced performance claims (nor can it be). The LPAR overhead could be as small as 2% (in a production LPAR that’s given 95% of the machine) to 25% (in a grossly over-configured, multiple LPAR, multiple CPU environment).
You need to take this into consideration if you are running in any type of LPAR environment.
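As a hypothetical sketch, here is how you might discount a published ITRR for LPAR overhead. The 2% and 25% figures come from the range above; the specific overhead for your configuration is an assumption you would have to derive from your own RMF/CMF partition data, not a vendor-published number:

```python
# Sketch: discount a published (non-LPAR) ITRR by an estimated LPAR overhead.
def lpar_adjusted_itrr(published_itrr, lpar_overhead):
    """Reduce a vendor's published ITRR by an estimated fractional LPAR overhead."""
    return published_itrr * (1.0 - lpar_overhead)

published = 1.20  # vendor claim: new model is 20% faster for this workload

# A single production LPAR given most of the machine vs. a grossly
# over-configured, multiple-LPAR, multiple-CPU environment.
modest = lpar_adjusted_itrr(published, 0.02)
severe = lpar_adjusted_itrr(published, 0.25)

print(f"Best case ITRR:  {modest:.2f}")
print(f"Worst case ITRR: {severe:.2f}")
```

In the worst case, the adjusted ratio drops below 1.0: the overhead can consume the entire published speed advantage, which is exactly why LPAR configurations need their own estimate.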
8.10 Dispatch Priorities Affect Capacity
Because you run a mix of workloads, the dispatch priority you have assigned to these workloads becomes more important as you get closer to running your system at full capacity. For example, if batch is running at a low dispatch priority, as it is in most sites, the inconsistent CPU load from your higher priority work, such as CICS and TSO, will cause the batch work to get sporadic, inconsistent access to the CPU. This causes an increase in CPU time that is normally not considered in the vendor’s performance claims. That is, if all of your batch jobs are swapping in and out of storage and moving between multiple CPUs because they don’t have enough priority to stay on one CPU, you will see increased CPU times in your batch workloads.
8.11 Software Levels Affect Capacity
The vendor’s benchmarks are run on a level of MVS software that may or may not match yours. Until more installations are all running the same level of OS/390, it’s highly unlikely that the levels of all of your software match those from the vendor’s benchmarks. You need to consider not only the level and release of MVS, but also the release levels of VTAM, JES, RACF, TSO, ISPF, CICS, DB2, IMS, and other key products in your installation. Of course, the levels and releases of your monitors, scheduling products, etc. should also be considered.
What this means is similar to the discussion in 8.2, where your workloads don’t match the vendor’s. An example of this is in ISPF. ISPF V4 took a lot more cycles than ISPF V3. If the vendor is using ISPF V4 for the base and you are running ISPF V3, you will probably see a difference in how the TSO workload is affected when moving between two models. That is, the vendor did not measure the impact of ISPF V3; it could have been worse or it could have been better, but only you will know (it won’t come out of the benchmarks).
8.12 Levels of PTFs Affect Capacity
Just like software levels and releases, the specific PTFs you have on your system will affect the capacity of the machine. As an example, the Catalog Address Space (CAS) takes a LOT more CPU time in SP 5. If IBM’s benchmarks use SP 5, their ITRRs include the impact on CAS when it’s moved to another model. If you are still on SP 4, the CPU time for CAS will be trivial and wouldn’t be affected by a change to a different model.
There have even been cases where IBM has had to apply some PTFs before running their LSPR tests due to some performance improvements that were related to the hardware.
8.13 Different Facilities Invoked
The biggest problem I see with current performance guarantees is that they consider older, traditional applications and not the newer applications.
Since the current benchmarks are run on traditional workloads, how will you be able to tell the impact of a new processor model on your new applications, such as IBM’s Web Server on MVS, their LANServer MVS, object technology with SOM and CORBA, web applications like Java, TCP/IP instead of VTAM, DB2 stored procedures, OpenEdition MVS, MQSeries, and similar new applications?
Likewise, consider the applications that are trying to take advantage of some of the facilities that were new as of SP 4 or 5 and still haven’t been used, such as SmartBatch, DB2 Sort Assist, CICS storage protection, LPAR automatic recovery, etc.
One of the newest applications, parallel sysplex, is yet to be considered for the hardware benchmarks. In a parallel sysplex configuration, how much does the processor model affect the communication and overhead to and from the coupling facility?
8.14 Amount of Storage Affects Capacity
This is an old consideration that people often forget. If you have a lot of storage available, you get two benefits. First, you reduce the overhead of paging and swapping, which only steals cycles from the CPU. Second, applications can take advantage of the storage and run in fewer cycles. The sort program is a good example of this. In-core sorts take less CPU time than DASD sorts, while hiperspace sorts can take more CPU time, but less elapsed time.
If you have a lot of storage and use it, you will take the least amount of CPU time per transaction. If you are short on storage, you will end up taking more CPU cycles from productive work and spend them on paging activities.
8.15 Level of Tuning
The level of tuning makes a large difference in the effective capacity of a machine.
The easiest example of this is good blocking. Yes, you’ve probably heard for years that good blocksizes (half or full track blocks on disk) are the most efficient. And most installations have ensured that production data sets are well blocked. But in most sites, programmers tend to use a factor of 10 to get blocksizes (80 x 800, 1600 x 16000), which produces very poorly performing jobs. Good blocking could reduce the CPU time by 10% to 20%. If you have many of these in your batch workload, the programs aren’t running very efficiently and may not be getting the maximum benefit out of the new processor.
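The I/O arithmetic behind this is simple to sketch. The record counts below are made up for illustration, and the half-track figure assumes a 3390-style device with a 27998-byte half-track capacity:

```python
import math

# Illustrative numbers: a data set of fixed 80-byte records.
lrecl = 80
records = 1_000_000

poor_blksize = 800    # the "factor of 10" habit: 80 x 10
good_blksize = 27920  # largest multiple of 80 that fits a 3390 half track (27998)

def blocks_needed(nrec, lrecl, blksize):
    """Physical blocks (roughly the EXCP count) to read the whole data set."""
    records_per_block = blksize // lrecl
    return math.ceil(nrec / records_per_block)

poor = blocks_needed(records, lrecl, poor_blksize)
good = blocks_needed(records, lrecl, good_blksize)
print(f"Poorly blocked: {poor} physical I/Os; well blocked: {good}")
print(f"Roughly {poor / good:.0f}x fewer I/Os with half-track blocking")
```

Each physical I/O carries fixed path-length overhead in the access method and IOS, which is where the 10% to 20% CPU reduction mentioned above comes from.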
A well tuned system will always get the best performance out of a new configuration.
8.16 User’s Behavior Changes
One of my favorite true stories is about a system where we improved the response time to a group of users from 10 seconds to sub-second. Within two days, the amount of CPU consumed by that group of users tripled. When I went to ask them why their CPU usage had increased, they interrupted me before I could ask so they could show me a new trick. They said, “Boy, Cheryl, before that change you made, things were really slow! If we wanted to look up your record, we’d have to type in your full name, ‘Watson, Cheryl’, then wait forever for a response. Now we just type in ‘Wa’ and start scrolling until your name comes up. It’s SUPER fast now!”. For those of you that have experienced the crunch caused by a large amount of VSAM browsing in a CICS application from a LOT of users, you’ll understand how distressed I was. To take advantage of the improved response time, they started using a much less efficient technique that cost us quite a few cycles.
Many sites have gotten burned because an improvement in response times caused users to change their behavior. Another common example is seen when TSO users find that the system is so fast that they start doing all of their work in the foreground rather than submitting batch jobs. This leads to much longer TSO third period response times and excessive CPU consumption.
8.17 The one thing that remains consistent is that you will always have change!
IBM is fortunate in that they can always provide a consistent, unvarying, environment in which to run their benchmarks. They are able to obtain consistent results from one run to the next.
This is seldom the environment that you can expect to see. The only consistency in most production sites is the inconsistency of the workloads. An entire day of processing can be harmed if a batch job from the nightly cycle abended and must be run during the day with the online workloads. TSO users may all come back from a meeting at the same time and hit the system with double the normal TSO load. The CICS group could change a single parameter in their CICS parameters and increase the CICS CPU time by 5%. The DB2 group could add some indexes and reduce DB2 time by 15%.
Just be assured that you will seldom have two periods of time that are consistent in which to collect your measurements.
8.18 All of the Above
In many cases, some or all of the seventeen documented reasons interplay with each other at the same time. Very often there isn’t one reason for a change; there are many at the same time. Measurement metrics may appear to report random numbers, frustrating even the most senior measurement expert. That interplay, in itself, may hide the real cause of underlying problems.
It is sometimes more difficult to recognize the reasons for poor results than it is to fix the problem.
Your work may vary considerably from the workloads that were used by the vendor to determine the relative capacity and speed of a new model. It’s up to you to determine how each of these factors will affect the real performance you receive.
9 WHAT CAN YOU DO?
You can do two things to ensure that you get your money’s worth. You can obtain a performance guarantee from your vendor before deciding on a processor model. And you can measure (and understand) the relative change in capacity after you’ve moved to the new model.
9.1 Performance Guarantee
Each vendor can (optionally) provide a performance guarantee, but they will almost always qualify the performance as it applies to what they think your workloads will experience. They have a lot of experience with their own models that isn’t published and, in my experience, do a very good job of sizing when they know that a performance guarantee will be used.
Part of the performance guarantee is an agreement on the methodology that will be used to confirm that you receive the performance you expect. Generally, this consists of itemizing your important workloads and specifying their current performance with expected performance from the vendor.
Most performance guarantees require that the analysis be done between two environments where all changes have been frozen. That is, new workloads, changes in the operating system, parameter changes, etc. are not allowed between the two periods of analysis. Be sure that you can handle this period of time without any system or application changes.
9.2 Measure Your Own System
To understand whether a new processor model is meeting its expectations, you need to measure what you are actually experiencing.
IBM provides one solution for this in “The Complete View” section of chapter 5 of their LSPR manual. As an introduction to their solution, they state that “For a validation to work, there must be a commitment that the workload run on the new processor be the same as that on the old processor. In other words, there should be no shifting of workloads until after the validation is complete.”
Their technique is to use the logical I/Os related to the total processor busy over a week of prime shift data.
I’ve tried this technique and have found that it only worked in a few cases because users could not make the commitment that the workload not change during the week directly after a processor upgrade. The workload will almost certainly change after a processor upgrade and changes will be made by the data center personnel.
I’ve found more success with the technique of identifying stable job steps and online transactions before the change was made and seeing how they were affected after the change. This technique was first introduced by Joseph B. Major. Though this technique doesn’t take operating system differences into account (such as the effect on JES or RACF), it will definitely show the effect on the application. If you don’t have time to write your own programs to find these stable jobs and collect this data, take a look at our latest product, BoxScore. BoxScore identifies and quantifies the effect of any change, such as tuning, Year 2000 conversions, processor upgrades, etc., on stable job steps and transactions. This software is based on research that I’ve been doing in this area for the past 10 years.
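As a minimal sketch of the stable job-step idea (this is not BoxScore’s actual algorithm, and the step names and CPU figures are hypothetical stand-ins for data you would pull from SMF type 30 records):

```python
# Compare the CPU time of job steps that ran identically before and after
# a processor change; the per-step ratio shows the change's real effect.
before = {"PAYROLL.STEP1": 120.0, "BILLING.STEP2": 300.0, "GL.STEP1": 45.0}
after  = {"PAYROLL.STEP1": 100.0, "BILLING.STEP2": 255.0, "GL.STEP1": 37.5}

# Per-step CPU ratio (after / before); below 1.0 means the step got cheaper.
ratios = {step: after[step] / before[step] for step in before if step in after}

for step, r in sorted(ratios.items()):
    print(f"{step}: CPU ratio {r:.2f}")

# The median ratio is a robust single estimate of the observed change,
# less sensitive to one outlier step than a simple average.
median = sorted(ratios.values())[len(ratios) // 2]
print(f"Median CPU ratio: {median:.2f}")
```

A real implementation would first screen for stability (same program, same input volumes, similar I/O counts across many runs) before trusting any step’s ratio.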
The performance estimates for new processor models from IBM, Amdahl, and HDS provide valuable data to help you understand how much capacity you can expect to see if you move to those new models. This is especially true of the IBM LSPR ratings by workload. We would hope to see workload performance ratings from the other vendors at some point in the future.
Your workloads may not see exactly the same effects because of several factors, among them the fact that your workloads don’t mimic the vendor’s and that most installations run in an LPAR environment, which may not be considered in performance claims.
To ensure that the vendor will help you if you don’t get the performance you expect, I recommend that you ensure that the vendor provides a performance guarantee before delivery of a new model.
You should define a technique to identify the relative effect of any processor model change based on your own workloads, not on estimates from artificial workloads. Remember that the vendor’s claims are almost always provided for the optimum environment: one running with no constraints in a non-LPAR environment and one that is well-tuned. If you are running in an LPAR, have any constraints, or are not well-tuned, you can’t expect to achieve the same performance results.
Note: This article with some modifications has previously been published in Watson & Walker’s BoxScore User’s Guide [REF004].