Commercial Computing with C/C++

11 Posts tagged with the c++ tag

This is my first post on our C/C++ Cafe that has been long in coming.

If you are like me, then you are a new z/OS programmer. The learning ride has been quite turbulent, and there is a way to go yet. If you are a devoted programmer, then you feel the quiet excitement of your new program almost working, tempered by the chance of another large manual being 'thrown' at you. z/OS is one of those products that has too much of a good thing: there is a LOT of documentation. This fact very quickly becomes an advantage as one gains more experience.

No matter what design patterns you use, almost as soon as you start writing code you will need to include other files, be they your own include files or third-party libraries. For efficiency, these files will in all likelihood reside in datasets. To make things more interesting, the documentation might instruct you to use a directory-based (HFS) #include, leaving newcomers confused about where the files reside.

I want to discuss two tasks that come up when dealing with include files:

  • Finding the include files
  • Dealing with preprocessor macros

If the compiler has already found the include file (i.e. the compile succeeded) and you want to know where the file is located (e.g. which version of the library you are actually using), both -qlist and -qsource produce an includes section in the listing.

#Will list all included files:
xlc -qsource -c foo.c | sed -n '/I N C L U D E S/,/E N D O F I N C L U D E S/ p'
#Will list the path to NAME_OF_INCLUDE:
xlc -qlist -c foo.c | grep NAME_OF_INCLUDE

The grep command might not return anything because of some z/OS-specific file name translations, so be careful:

  • if NAME.OF.INCLUDE contains dots, the real file name might be translated to INCLUDE.OF.NAME or just NAME
  • if NAME_OF_INCLUDE contains underscores, it might get translated to NAME@OF@INCLUDE

If you are using the V1R11 compiler, there is a new feature, -qmakedep (on USS only), that produces a list of all included files. It is much easier to remember than the sed command above. For example:

# previous invocation: 'xlc -c foo.c'
# Append -qmakedep
xlc -c foo.c -qmakedep
cat foo.u

For previous releases of the z/OS compiler, have a look at the makedepend utility. It has some other options that might be useful for include file debugging.

-qshowinc is another particularly useful option if you already have the include files. It shows the contents of the included files. It is similar to what the PPONLY option (the -E equivalent) does, except that it writes to the listing and does not strip preprocessor directives. However, be ready to pipe the output to other programs to filter the thousands of lines produced. Most of the time I use less and its search features; sed and grep are sometimes useful too.

If you are trying to include a file and the compiler cannot find it, there is more research involved. In the general case, the compiler looks for include files in these places, from top to bottom:

pwd
LSEARCH
DD:USERLIB
SEARCH
DD:SYSLIB


  • System includes can only be found in SEARCH and DD:SYSLIB.
  • DD: statements come from the JCL, hence you might need to know how the compiler was invoked.
  • SEARCH and LSEARCH come from the compiler options, hence are easy to modify.

SEARCH and LSEARCH both contain a list of directories and partial dataset qualifiers. The topic is discussed in detail in our Compiler User Guide in 'Chapter 7: Using include files'. I personally found that the flowcharts in Chapter 7 and the examples in the LSEARCH option explanation cleared up most of the dataset questions I had. It is worth noting for newcomers that the terms 'z/OS Unix files' and 'HFS files' are equivalent and refer to directory-based files (i.e. organized like Linux and Windows file systems), as opposed to datasets, which in this discussion can be sequential or PDS.

The -qlist and -qsource options, and the -V and -v flags of c89 and xlc, provide a quick way to find out the current values of LSEARCH and SEARCH.

As Visda has discussed before in her blog post, NOSEARCH() and NOLSEARCH() reset the respective option value back to empty.

The last topic I want to touch on is preprocessor macros and the options available for dealing with them.

Michael Wong has posted here a way to find compiler predefined macros on AIX using the SHOWMACROS option. z/OS unfortunately does not have this option until V1R11. The best equivalent is to use the makedepend utility with the -Wm,list option and then view depend.lst, which will contain a list of predefined compiler macros. However, the makedepend utility is deprecated as of V1R11 in favor of the built-in -qmakedep option.

Nevertheless, most macros should be mentioned in the related feature sections of the z/OS manuals. Here is a small list from the manual.

If you already have the macro name you wish to use, have a look at the -qEXPMAC option. This option will show you the value of the macro in the source listing. It is most useful when combined with -qshowinc.
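
As a rough illustration of working with predefined macros, the sketch below selects values at compile time. It assumes that __MVS__ and __IBMC__ are predefined by the z/OS XL C compiler; the function names build_platform and describe_build are made up for this example. Compiling such a file with -qEXPMAC and -qsource should let you check in the listing which branch was taken.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: __MVS__ and __IBMC__ are predefined by z/OS XL C.
   On any other compiler the fallback branches are taken. */
#if defined(__MVS__)
#define BUILD_PLATFORM "z/OS"
#else
#define BUILD_PLATFORM "other"
#endif

#if defined(__IBMC__)
#define BUILD_COMPILER_LEVEL __IBMC__
#else
#define BUILD_COMPILER_LEVEL 0
#endif

/* Returns the platform name chosen by the preprocessor at compile time. */
const char *build_platform(void)
{
    return BUILD_PLATFORM;
}

/* Formats the platform name and the __IBMC__ value (0 if undefined)
   into buf and returns the snprintf result. */
int describe_build(char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "%s, __IBMC__=%ld",
                    BUILD_PLATFORM, (long)BUILD_COMPILER_LEVEL);
}
```

Keeping the macro tests in one place like this means only a single file needs to be inspected in the listing when a build picks the wrong branch.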

I hope this gets you started on the right track.


So many times we get clients complaining to us that their code used to work on an older release but is broken with the new release of the compiler. After a closer look at the sample test case provided, we find that they were lucky to have had working code at all. You see, the breakage is expected because they have broken the ANSI aliasing rules.

Not many of us follow the rules defined in the C and C++ standards [1] religiously. Although the aliasing rules require that an object be accessed only through lvalues of compatible types, we often break this rule in order to make the code "work".

By default, xlc on z/OS compiles with ANSIALIAS. Based on the assumption that pointers in the source file access only objects of a compatible type, the compiler determines which storage locations are accessed in two or more ways, i.e. aliased. If, for example, we have a struct s with two members s1 and s2, the storage for s overlaps with the storage for both s.s1 and s.s2, but the storage for s.s1 and s.s2 does not overlap. This knowledge is critical to aggressive compiler optimization: it allows some loads to move up and some stores to move down. Such rearrangement of the execution sequence is desirable because it lets more of the code execute in parallel.

Casting a pointer so that it points to an object of a different type is a common C practice. For each type mismatch, xlc generates a warning and/or an informational message, which you may not notice if you have set the level of diagnostic messages to error or higher (-qflag=E, S, or U), or if you are redirecting all compiler messages to a hardly-ever-looked-at log file. Often the first time you notice a problem is when you execute the code and get an incorrect result.

You have broken the rules, now what?

You can compile routines that are not ANSI-alias compliant at low levels of optimization, e.g. at OPT(0). The higher the level, the more aggressive the optimizations based on aliasing information. You can also turn off optimization per routine with #pragma option_override(func, "OPT(LEVEL,0)").

You can use the -qnoansialias compile option, or use the cc utility, which passes NOANSIALIAS to the compiler by default. This may not be desirable because it usually results in significant performance degradation; e.g. gcc compiled at -O3 with -qnoansialias runs 20% slower.

You can fix the non-compliance in your source code.

[1] ISO/IEC 14882:1998(E), Section 3.10, Paragraph 15 states:

If a program attempts to access the stored value of an object through an lvalue of other than one of the following types the behaviour is undefined:

  • the dynamic type of the object
  • a cv-qualified version of the dynamic type of the object
  • a type that is the signed or unsigned type corresponding to the dynamic type of the object
  • a type that is the signed or unsigned type corresponding to a cv-qualified version of the dynamic type of the object
  • an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union)
  • a type that is a (possibly cv-qualified) base class type of the dynamic type of the object
  • a char or unsigned char type

Example:

/*alias.c*/
int foo(char *c)
{
    char a[100];
    char *cptr = a;
    *(int *)cptr = *(int *)c;
    return 0;
}

xlc -c alias.c -O3 -qlist=./ -qflag=i -qinfo

INFORMATIONAL CCN3495 ./alias.c:5 Pointer type conversion found.
INFORMATIONAL CCN3374 ./alias.c:5 Pointer types "int*" and "char*" are not compatible.
INFORMATIONAL CCN3495 ./alias.c:5 Pointer type conversion found.
INFORMATIONAL CCN3374 ./alias.c:5 Pointer types "int*" and "char*" are not compatible.
INFORMATIONAL CCN3415 ./alias.c:7 The external function definition "foo" is never referenced.
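
One common way to fix this kind of non-compliance is to move the bytes with memcpy instead of casting the char pointers to int pointers. The sketch below is only one possible repair, and it assumes that the intent of alias.c is to copy an int-sized chunk between character buffers. The helper names load_int and store_int are made up for this example.

```c
#include <assert.h>
#include <string.h>

/* Reads sizeof(int) bytes from src and returns them as an int value,
   without converting the char pointer to int *. */
int load_int(const char *src)
{
    int value;
    assert(src != NULL);
    memcpy(&value, src, sizeof value);
    return value;
}

/* Writes the bytes of value into dst, again without a pointer cast. */
void store_int(char *dst, int value)
{
    assert(dst != NULL);
    memcpy(dst, &value, sizeof value);
}

/* Same shape as foo() in alias.c, minus the pointer type conversions. */
int foo(char *c)
{
    char a[100];
    store_int(a, load_int(c));
    return 0;
}
```

Because the only accesses are through char-typed memory and the int object itself, the code stays within the list of permitted types quoted above, so it should remain safe to compile with ANSIALIAS at high optimization levels.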


In a couple of previous posts ( TOC Overflow: what is it, and why should you care?, Dealing with TOC overflow: the traditional approach ) I have presented the issue of TOC overflow. Now I will discuss some features of the XL compilers that can help bypass TOC overflow while minimizing any negative effects on runtime performance.

1. Minimal TOC: The option -qminimaltoc makes the compiler generate code that uses a single entry in the TOC for each compilation unit (in C/C++ a compilation unit is a source file). In order to do this, an additional level of indirection must be followed to access TOC-based variables. This means that the program will be larger and slower than if it did not have TOC overflow, but it will still be faster than using the -bbigtoc option. This is similar to the -mminimal-toc option in gcc.

Furthermore, -qminimaltoc does not need to be used on all compilation units, so you can minimize the performance impact by using this flag only on compilation units that are not relevant for performance.

2. IPA: IPA is short for inter-procedural analysis, a form of compiler optimization that looks at the whole program, not just a single compilation unit. For this, the optimizer is invoked during the linking phase of your application, to perform transformations that can affect multiple compilation units.

Applying this process significantly reduces TOC pressure, and in most cases completely eliminates TOC overflow. It does so by restructuring your program to reduce the number of global symbols. The result is similar to what could be achieved through source changes, but without the widespread manual editing.

In the XL compilers, IPA is implied at optimization levels -O4 and -O5, but those also include other complex optimizations which may not be as relevant to commercial application development. One good alternative is the option -qipa=level=0, which applies a minimal level of whole-program optimization. This is often sufficient to eliminate TOC overflow, but in very large applications you may need -qipa=level=1 instead, which will perform a more aggressive reduction of the TOC requirements, at the cost of a longer compilation process.

Note that for whole-program analysis to be performed, the -qipa option needs to be specified both at the compile and link command lines. This means that the linking of the program has to be done through the compiler driver (xlc, xlC or cc) instead of directly through the system linker (ld). For maximum effect, all source files should be compiled with -qipa, but it is possible to mix-and-match objects compiled with different options and have them interoperate.

If you try these options please add comments to this post describing your results.


Compilers are expected to make volatiles immune to optimizations that result in incorrect access to the volatile variables, e.g. reducing the number of loads and stores, re-ordering them, etc.

A recent study on volatiles identified a few bugs in GCC 4.3.0 and LLVM-GCC 2.2. We put our compiler to the test and found that none of the three bugs identified in the paper applies. Not bad!

The first test case loads a volatile variable in a loop. Although the load is loop-invariant, we expect the compiler to leave x in the loop. The generated pseudo assembly code at O2 and O3 confirms this.

Here is the source code:
const volatile int x;
volatile int y;
void foo(void)
{
    for (y = 0; y > 10; y++)
    {
        int z = x;
    }
}

The assembly listing of the source code above at O3, below, shows the load of x in each iteration of the unrolled loop:

@1L3 DS 0H
L r0,x(r15,r1,0)
L r0,y(r14,r1,0)
AHI r0,H'1'
ST r0,y(r14,r1,0)
L r0,y(r14,r1,0)
CHI r0,H'10'
BNH @1L5
L r0,x(r15,r1,0)
L r0,y(r14,r1,0)
AHI r0,H'1'
ST r0,y(r14,r1,0)
L r0,y(r14,r1,0)
CHI r0,H'10'
BNH @1L5
L r0,x(r15,r1,0)
L r0,y(r14,r1,0)
AHI r0,H'1'
ST r0,y(r14,r1,0)
L r0,y(r14,r1,0)
CHI r0,H'10'
BNH @1L5
L r0,x(r15,r1,0)
L r0,y(r14,r1,0)
AHI r0,H'1'
ST r0,y(r14,r1,0)
L r0,y(r14,r1,0)
CHI r0,H'10'
BH @1L3

The second test accesses a volatile variable on the fall through path of a condition.

Source is:
extern int qux();
volatile int w;
int bar(void)
{
    if (qux())
        return 0;
    else
        return w;
}

In the pseudo listing below, generated at O3, w is correctly accessed only when qux() returns zero:

L r15,=V(qux)(,r3,66)
L r2,_CEECAA_(,r12,500)
BASR r14,r15
LTR r15,r15
L r1,=Q(w)(,r3,70)
BE @1L1
LA r15,0
B @1L3
@1L1 DS 0H
L r15,w(r1,r2,0)
@1L3 DS 0H

In the last source code a volatile variable is incremented inside the loop.

volatile int a;
void baz(void)
{
    int i;
    for (i = 0; i < 3; i++)
    {
        a += 7;
    }
}

We unroll the loop by three and access "a" three times. The listing from a compile at O3 is shown below.

L r0,a(r14,r1,0)
AHI r0,H'7'
ST r0,a(r14,r1,0)
L r0,a(r14,r1,0)
AHI r0,H'7'
ST r0,a(r14,r1,0)
L r0,a(r14,r1,0)
AHI r0,H'7'
ST r0,a(r14,r1,0)


RENT or NORENT

Posted by Visda Jan 11, 2009

Here we are eleven days into a new year; and I would like to wish all of you a happy belated 2009! May this be a year of making technology less complex, more intuitive, friendlier and greener.

Is it clear why we have the RENT|NORENT compiler option? Do you know that C++ always uses constructed reentrancy? Can you imagine how applications can benefit most from this?

Recently, I developed a greater appreciation for the RENT option, after banging my head against the wall to REALLY understand what RENT is all about in order to fix a related bug. :8}

Reentrancy becomes important when a rather large application has multiple users who may access it concurrently, for example Oracle. Oracle applications developed to run on z/OS access the Oracle interface, which is reentrant code.

Per the z/OS Language Environment Programming Guide, the following routines must be reentrant:

  • Routines to be loaded into the LPA or ELPA
  • Routines to be used with CICS
  • Routines to be preloaded with IMS

Furthermore, a large application with many concurrent users will benefit from reentrancy, i.e. run faster, because there is less paging to auxiliary storage: variables are placed in a writable static area.

This is done for you if your application is written in C++ or is a DLL. Compile with "-qlist" and you will see RENT in the Compiler Option Listing. A C program, however, can be naturally reentrant, that is, the user code doesn't change static storage.


For example:

extern int x;

int main()
{
    return x;
}

is naturally reentrant because the value of x is not changed by main.

A C program can be made reentrant (constructed reentrancy) by specifying either:

a. -qRENT on the command line (in HFS) or RENT in CPARM (in BATCH)
b. #pragma variable(x,RENT) in the source

There you have it. In a nutshell: reentrancy comes for free with C++ and DLL code; for the rest, you have ways to make the code reentrant. If your code is big and is used by multiple users at the same time, you want to take advantage of this feature because it will improve the run-time performance of your application.


When building large applications on AIX or pSeries Linux you may have experienced the dreaded TOC overflow. This is a situation reported by the system linker that causes it to abort and fail to generate an executable.

What is this situation and what are the strategies for coping with it?

Basically, the TOC or table-of-contents is a table that the program uses to reference global symbols. Since these symbols can be referenced from multiple object files, their memory location is unknown until link time, so the code generated by the compiler to access them must look them up in this table.

The way this works is that the ABI reserves a register which always points to the TOC. The compiler generates an indirect reference off this pointer with a zero offset, which the linker then updates with the actual offset it selected for each global symbol. The PowerPC architecture allows up to a 64K offset, thus creating a limit of 16K global symbols in 32-bit mode and 8K in 64-bit mode. If a program has a larger number of global symbols, the linker cannot reserve TOC slots for all of them, and it aborts after reporting TOC overflow.
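
The two limits follow from simple division. The sketch below assumes that each TOC slot holds one pointer (4 bytes in 32-bit mode, 8 bytes in 64-bit mode), which is consistent with the figures quoted above; the function name toc_slot_limit is made up for this example.

```c
#include <assert.h>

/* Number of TOC slots reachable within offset_range bytes when every
   slot holds one pointer of slot_size bytes. */
unsigned long toc_slot_limit(unsigned long offset_range, unsigned long slot_size)
{
    assert(slot_size > 0);
    return offset_range / slot_size;
}
```

With a 64K range, toc_slot_limit(65536, 4) gives 16384 (16K) and toc_slot_limit(65536, 8) gives 8192 (8K).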

In my next post I'll discuss some strategies for addressing this problem, and their tradeoffs. Here's a link: Dealing with TOC overflow: the traditional approach


Dual cores have become household products, yet we see little change in the performance of the applications that run on these machines. Just this morning, I started Firefox and Lotus Notes at the same time on my dual core T60p laptop, and had to wait a long time before I could use either application.

To take full advantage of the hardware's horsepower, software has to keep up, the compiler has to generate better code, or both.

Recently, I have been experimenting with compiler options to see how much I can tune the size and performance of an executable. I started by identifying the bottlenecks, i.e. the hottest loops: the places in the application where the CPU spends most of its processing time. I used a hardware profiling tool to gather the data and Visual Performance Analyzer (VPA) to identify the bottlenecks. With VPA you can pin a bottleneck down to the hardware instruction level. I mention this only to give you a sense of what the tool can do; for the purpose of my experiment, the function name was sufficient.

Given the function name, the rest was trial and error. I used a variety of performance tuning compiler options, e.g. HOT, INLINE, HGPR, all at O3, and the pragma directives #pragma option_override and #pragma unroll.

To conclude, I have to say that a lot depended on the high-level code. I couldn't find one hat that fits all. Playing with #pragma unroll proved to be interesting: given the limited number of registers on z/OS, loop unrolling opportunities were very limited.


Compilers, what they might do ...

Posted by Visda Nov 23, 2008

Ever wonder what the compiler does to your code? I suppose this is of little importance if the binary is generated quickly, executes fast, and yields the correct result. I regard the compiler as an interactive tool. Given that the destination is the same (the executable), the compiler provides many routes for your source code to reach it.

Whether you invoke the compiler via JCL, under TSO, or from an ISPF panel, or use the c89/xlc commands in the z/OS USS environment, you can use a variety of compiler options that affect how the source is treated during compilation.

For example, if your source is in C++ and declares or defines a whole bunch of generic types, you may want to use one of the C++ template options: FASTTEMPINC, TEMPINC, TEMPLATERECOMPILE, etc. Or, if you are planning to debug a run-time problem, you may want to pass DEBUG (Batch/TSO) or -g (z/OS USS) to the compiler. (Have a look at Kendrick's article to learn more about debugging on z/OS.) Or, if you want to enable optimization, you specify OPTIMIZE.

But let's say you are in a situation similar to the one I was in this past week, and want to lower optimization at the subprogram level. I wanted the entire compilation unit (CU) to be compiled at O3 except for one non-overloaded function. I used #pragma option_override to bump the optimization level down to O2 for that one function.

In summary, compilers may do a lot of things to your code, but at the same time they allow you full control.
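
A minimal sketch of that setup might look like the following. The function checksum is invented for this example, and the file is assumed to be compiled at O3 from the command line. The pragma spelling follows the option_override form used on z/OS XL C/C++; compilers that do not know the pragma are expected to ignore it, possibly with a warning.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical function that should be compiled at a lower optimization
   level than the rest of the file. */
unsigned long checksum(const unsigned char *buf, size_t len);

/* Ask the compiler to use optimization level 2 for checksum only. */
#pragma option_override(checksum, "OPT(LEVEL,2)")

unsigned long checksum(const unsigned char *buf, size_t len)
{
    unsigned long sum = 0;
    size_t i;
    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        sum += buf[i];
    }
    return sum;
}
```

The compiler option listing is a reasonable place to confirm which optimization level was actually applied to the function.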
:)


Speedy debugging on z/OS

Posted by Kendrick Nov 21, 2008

z/OS 1.10 introduces the dbgld utility. It will significantly improve the startup time and performance of dbx. So how does it work, and how do I take advantage of it? Read on for more details.

When you specify the -g option during C/C++ compilation, a debug sidefile (.dbg) is generated for each input source file. To debug an application, dbx needs to locate all of its debug sidefiles and process the debug information stored in them. This is a necessary but time-consuming task. dbgld can perform this task before dbx is invoked. In addition, dbgld consolidates all the debug sidefiles into a single sidefile (.mdbg). You can now debug your module by bringing along just a single .mdbg file.

Speedy debugging in 3 easy steps
  1. Compile and bind program with -g option:
    xlc -g hello1.c hello2.c hello3.c -o hello
    This will create an executable module hello and 3 debug sidefiles hello1.dbg hello2.dbg and hello3.dbg
  2. Invoke dbgld on the module:
    dbgld hello
    This will create the consolidated debug sidefile hello.mdbg
  3. Finally, debug hello with dbx:
    dbx hello
    dbx will automatically use hello.mdbg if it finds it in your current directory.

For more information about dbgld, please refer to the XL C/C++ User's Guide:
XL C/C++ User's Guide: Chapter 22. DBGLD


My first blog in the Cafe'

Posted by Visda Nov 14, 2008


For my first blog I want to talk about the Compiler Technology (hi)story on z/OS. I know what you are thinking and I don't blame you. z/OS has been around longer than many of us. I promise you, I will keep it interesting and above the fold.

Compiler characteristics can change based on the options passed to it. Each release brought a different set of new options, or new default values for existing ones. You can use these options to detect and correct errors in your code (CHECKOUT, WARN64, DEBUG, etc.), to control the optimization process (OPTIMIZE, HOT, UNROLL, etc.), or to migrate to a newer version of the compiler (PORT, UPCONV, etc.).

Numbers are a big part of computing and can be represented in different forms. Starting with the V1R9 release, Decimal Floating Point (DFP) joined the IEEE BFP and HEX formats. When DFP is enabled, _Decimal32, _Decimal64, and _Decimal128 are supported, and decimal calculations avoid the potential rounding problems of binary or hexadecimal floating-point types.

Support for the IBM middleware CICS and DB2 has been ongoing and has become more visible in recent releases. The integrated CICS translator enables users to embed CICS statements in C/C++ source and pass them through the compiler without the need for an explicit preprocessing step. This permits a more seamless operation of C/C++ within the CICS environment.
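
The rounding problem is easy to reproduce. The sketch below assumes IEEE 754 binary doubles and deliberately avoids the decimal types so that it compiles without DFP support; scaled integers (whole cents) stand in for what a type such as _Decimal64 provides directly. All function names are made up for this example.

```c
#include <assert.h>

/* Returns 1 if 0.1 + 0.2 compares equal to 0.3 in binary double
   arithmetic. volatile keeps the operands from being folded away. */
int binary_sum_matches(void)
{
    volatile double a = 0.1;
    volatile double b = 0.2;
    return (a + b) == 0.3;
}

/* Converts an amount given as whole units and hundredths into cents. */
long to_cents(long whole, long hundredths)
{
    assert(hundredths >= 0 && hundredths < 100);
    return whole * 100 + hundredths;
}

/* The same sum carried in cents: decimal amounts are represented exactly. */
int cents_sum_matches(void)
{
    return to_cents(0, 10) + to_cents(0, 20) == to_cents(0, 30);
}
```

With IEEE binary doubles the first function returns 0, because neither 0.1 nor 0.2 has an exact binary representation, while the second returns 1.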

If all the SQL statements are embedded in your XL C programs, you can use the XL C DB2 coprocessor to prepare the programs to request DB2 services. The DB2 coprocessor enables users to embed EXEC SQL statements in C/C++ source code, and pass them through the compiler without the need for a preprocessing step.

Over many generations of hardware, we have provided users with built-in functions that map directly to zSeries hardware instructions. These functions give you source-level access to powerful hardware operations such as cache prefetching.

Optimization of C and C++ code is an important feature. We have developed more aggressive optimization in each release. They come at a cost, though. Check out Raymond's post on the topic.

The end of this blog is the beginning of many that my colleagues and I will post. We will delve into these and other relevant topics that have direct impact on Commercial Computing with C/C++.


z/OS C/C++ Performance Features

Compilers are an important tool in your development environment. A good optimizing compiler generates high-performance code without you having to worry about the low-level details of the OS, the internals of the runtime environment, or the hardware architecture. You can concentrate on the business logic in your application. But optimization can take up a lot of resources, both in compilation time and in memory. XL C/C++ provides an optimization option with 3 levels (called suboptions 1, 2 and 3). Levels 1 and 2 represent a compromise between execution-time performance and compile time. This is the appropriate setting in most cases. But there are situations where you want to let the compiler exploit as many optimization opportunities as it can, regardless of compilation resources. This is what optimization level 3 (suboption 3) does. Experience shows that most of a program's time is usually spent in certain areas of the code (the 80-20 rule), so one way of using level 3 is to apply it to the source files that contain the hot spots of the application. The rest of the files, responsible for tasks like initialization, termination, error handling, and user interaction, can be compiled at lower optimization levels, getting the best of both worlds.

This leads to the general direction of putting more control over the compiler's actions into the programmer's hands. An important optimization technique is loop unrolling. Unrolling eliminates some of the loop control checking, which in turn can expose more optimization opportunities across loop iterations. But this is a two-edged sword, as too much unrolling can increase code size and give the application a larger memory footprint. The optimizer normally makes its decisions based on its analysis of the code. But often the programmer knows which loops are hot and can direct the compiler to unroll specific ones. This is the purpose of the UNROLL option and the corresponding pragma directive. You can use these to control which loops to unroll, and by how much, applying the optimization benefit to the code that is most frequently executed.
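
To show what unrolling means at the source level, the sketch below pairs a plain loop carrying an unroll pragma with a hand-unrolled equivalent. The function names are invented for this example, and the pragma spelling is an assumption based on the XL C/C++ unroll directive; compilers that do not recognise it are expected to ignore it.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward loop; the pragma asks the compiler to unroll it by 4. */
long sum_rolled(const int *v, size_t n)
{
    long total = 0;
    size_t i;
    assert(v != NULL || n == 0);
#pragma unroll(4)
    for (i = 0; i < n; i++) {
        total += v[i];
    }
    return total;
}

/* Hand-unrolled equivalent: one loop test per four elements, followed
   by a remainder loop for the last n % 4 elements. */
long sum_unrolled4(const int *v, size_t n)
{
    long total = 0;
    size_t i = 0;
    assert(v != NULL || n == 0);
    for (; i + 4 <= n; i += 4) {
        total += v[i];
        total += v[i + 1];
        total += v[i + 2];
        total += v[i + 3];
    }
    for (; i < n; i++) {
        total += v[i];
    }
    return total;
}
```

The second function is roughly the transformation the first one requests: fewer loop tests per element at the cost of a larger loop body, which is the code size tradeoff described above.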

The idea of execution frequency and its impact on optimization leads to Profile Directed Feedback (PDF). This is an enhancement to inter-procedural analysis (IPA), and is used together with the IPA option. IPA performs whole-program analysis; it looks at code from all source files instead of just one. This leads to many more optimization opportunities than a normal optimizer usually discovers. PDF takes this a step further: the compiler makes use of profiling information to direct its optimization. The steps to use PDF are as follows: 1) Build the application with the option PDF1. This results in a load module with instrumentation to collect profiling information. 2) Run the instrumented module with typical input. The instrumented code will produce a data file containing the execution frequency of the code. This is called the training run. 3) Build the application again with PDF2. This is the production build, where IPA makes use of the profiling data collected in step 2 to perform aggressive optimization. The result is a load module tuned to run optimally with the typical input used in the training run. To use this successfully, the input data for the training run must be selected carefully. PDF is most effective when the production data profile does not, on average, vary too much.

The above are just a few of the features that can boost your program’s performance. You can find out more in the Programming Guide (http://publibz.boulder.ibm.com/epubs/pdf/cbcpg190.pdf, Part 5, chapters 36-42).
