It is understandable that choosing the right -qarch and -qtune options is difficult. There are many options, and choosing the wrong one might generate object code that cannot execute on some platform the product is deployed on.
I'll use your example of an application being deployed on POWER4, POWER5, and POWER6 machines and walk through the tables in the Compiler Reference to make the best choice.
Selecting the right -qarch
We need to ensure that the object code produced by the compiler is executable on any POWER4, POWER5, or POWER6 machine. Reading the descriptions for -qarch=pwr4 and -qarch=pwr5 in the section describing the -qarch option, we see that:
-qarch=pwr5 will generate code that can run on a POWER5, POWER5+
-qarch=pwr4 will generate code that can run on a POWER4, POWER5, POWER5+, PowerPC 970
POWER6 is not supported by v8 of XL C/C++ for AIX; however, POWER6 can run any code generated for POWER5 or POWER5+. In the documentation for v9 of XL C/C++ you would see:
-qarch=pwr6 will generate code that can run on POWER6
-qarch=pwr5 will generate code that can run on a POWER5, POWER5+, POWER6
-qarch=pwr4 will generate code that can run on a POWER4, POWER5, POWER5+, POWER6, PowerPC 970
So out of the three possibilities, the safe option is -qarch=pwr4, as the code produced will run on all of the machines the application will be deployed on.
Now we could also use -qarch=pwr3, because it is labeled as:
-qarch=pwr3 will generate code that can run on a POWER3, POWER4, POWER5, POWER5+, POWER6, PowerPC 970
In general, however, it is best to specify the most modern machine type for -qarch that still permits the object code to execute on all of the machines your application will be deployed on. There is also a table in the Compiler Reference for v9, and a similar table in the Compiler Reference for v8, in the section entitled "Acceptable compiler mode and processor architecture combinations" that can help make this decision.
You can see that specifying -qarch=pwr3 would work, but large page support could not be exploited unless you specify -qarch=pwr4. Similarly, with the default -qarch=ppc, the compiler does not have the option of using graphics or square root features.
We could not look at this table alone and choose -qarch=pwr6 to take advantage of vector processing support, because -qarch=pwr6 generates code that can run only on POWER6, and we want to deploy our application on POWER4 and POWER5 as well.
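The selection rule from this section can be sketched in a few lines of Python. This is only an illustration: the table copies the v9 compatibility descriptions quoted above, while the names RUNS_ON, NEWEST_FIRST, and choose_qarch are made up for this sketch and are not part of the compiler or its documentation.

```python
# Machines that can run code built with each -qarch value, copied from the
# v9 descriptions quoted in the text. Names here are hypothetical.
RUNS_ON = {
    "pwr6": {"POWER6"},
    "pwr5": {"POWER5", "POWER5+", "POWER6"},
    "pwr4": {"POWER4", "POWER5", "POWER5+", "POWER6", "PowerPC 970"},
    "pwr3": {"POWER3", "POWER4", "POWER5", "POWER5+", "POWER6", "PowerPC 970"},
}

# Candidate -qarch values, most modern first.
NEWEST_FIRST = ["pwr6", "pwr5", "pwr4", "pwr3"]


def choose_qarch(targets):
    """Return the most modern -qarch value whose code runs on every target."""
    wanted = set(targets)
    for arch in NEWEST_FIRST:
        if wanted <= RUNS_ON[arch]:
            return arch
    return None  # nothing in this small table covers all targets


print(choose_qarch(["POWER4", "POWER5", "POWER6"]))  # pwr4
print(choose_qarch(["POWER5", "POWER6"]))            # pwr5
```

Walking the list from newest to oldest is what makes the first match the "most modern" safe choice, which is why pwr4 is returned rather than pwr3 for the example deployment.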
Choosing the best -qtune option
The -qtune option does not determine what machines will be able to run your application but instead tells the compiler which machine it should try to make the application run fastest on.
In v9 of the compiler we made this choice easier by adding -qtune=balanced. This value tunes the application to perform well across a broad range of processors.
The alternative to using -qtune=balanced is choosing the processor that represents the majority of your users. In the example, the application will be deployed on POWER4, POWER5, and POWER6 machines. If the majority of your users have a POWER5 processor, it is probably best to make your application run fastest on a POWER5 machine. In that case -qtune=pwr5 is the preferred option.
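A minimal sketch of the "tune for the majority" choice, assuming you have a rough count of users per processor. The counts below are invented for illustration, and choose_qtune and QTUNE_FOR are hypothetical names, not compiler features.

```python
from collections import Counter

# Maps a processor name to the matching -qtune suboption.
QTUNE_FOR = {"POWER4": "pwr4", "POWER5": "pwr5", "POWER6": "pwr6"}


def choose_qtune(user_counts):
    """Pick the -qtune value for the processor that most users run."""
    processor, _ = Counter(user_counts).most_common(1)[0]
    return QTUNE_FOR[processor]


# Invented numbers, for illustration only.
users = {"POWER4": 120, "POWER5": 640, "POWER6": 240}
print(choose_qtune(users))  # pwr5
```

Keep in mind that, per the discussion above, pwr6 and balanced are only available starting with v9 of the compiler.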
In summary, your customer should use -qarch=pwr4 with -qtune set to either pwr4 or pwr5 (or to pwr6 or balanced if they switch to v9).
A word of caution:
Higher levels of optimization such as -O4 and -O5 are a very good way of making your application perform better; however, they have the side effect of setting -qarch=auto and -qtune=auto. The word "auto" is replaced with the architecture of the compiling machine. Since your client is using a POWER5 machine, this would result in -qarch being set to pwr5 and -qtune being set to pwr5. As we saw from the discussion above, using -qarch=pwr5 will generate code that can run on POWER5 and POWER6. There is a possibility that an instruction will be used that cannot be executed by a POWER4 machine. Users with a POWER4 machine may then get an Illegal Instruction exception.
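The effect can be modeled with a small Python sketch. It is not how the compiler is implemented; effective_qarch is a hypothetical helper, and it assumes that options are applied left to right with a later setting replacing an earlier one. Check the Compiler Reference for your version before relying on that ordering.

```python
def effective_qarch(options, build_machine_arch, default="ppc"):
    """Model which -qarch ends up in effect for a list of compiler options.

    Assumption (not confirmed by the text): options are applied left to
    right and a later setting replaces an earlier one.
    """
    arch = default  # the default -qarch mentioned earlier in the text
    for opt in options:
        if opt in ("-O4", "-O5"):
            arch = "auto"  # side effect of the high optimization levels
        elif opt.startswith("-qarch="):
            arch = opt.split("=", 1)[1]
    # "auto" is replaced with the architecture of the compiling machine.
    return build_machine_arch if arch == "auto" else arch


# Building on a POWER5 machine:
print(effective_qarch(["-O4"], "pwr5"))                 # pwr5
print(effective_qarch(["-O4", "-qarch=pwr4"], "pwr5"))  # pwr4
```

Under that assumption, stating -qarch=pwr4 (and the desired -qtune) explicitly after -O4 or -O5 would keep the generated code runnable on the POWER4 machines.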