Software to access the AMD Athlon's four on-chip performance monitoring counters


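This Forth toolkit provides access to the following processor features, each described in its own section below: the hardware timer, the performance monitoring counters, and the CPUID instruction.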

Version 1.1 of the software fixes a bug that set the Edge bit in the PerfSel registers. This resulted in unexpectedly low event counts.

Hardware timer

Intel architecture processors since Pentium have included a 64-bit time stamp counter running at the processor clock speed that can be read with the RDTSC machine language instruction. This counter is also present in the AMD Athlon.
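As a minimal sketch of how two time stamp counter readings become an elapsed time: RDTSC delivers the 64-bit count in two 32-bit halves (EDX holds the high half, EAX the low half). The helper names below are hypothetical and not part of the toolkit, and the processor clock rate is assumed to be known to the caller.

```c
#include <stdint.h>

/* RDTSC returns the counter in EDX:EAX; join the halves into one 64-bit value. */
uint64_t tsc_join(uint32_t edx_high, uint32_t eax_low)
{
    return ((uint64_t)edx_high << 32) | eax_low;
}

/* Ticks between two readings. Unsigned subtraction gives the right
   answer even if the counter wrapped once between the readings. */
uint64_t tsc_elapsed(uint64_t start, uint64_t stop)
{
    return stop - start;
}

/* Convert ticks to seconds. clock_hz is an assumption supplied by the
   caller (the processor clock rate); RDTSC itself does not report it. */
double tsc_seconds(uint64_t ticks, double clock_hz)
{
    return (double)ticks / clock_hz;
}
```

With an assumed 1 GHz clock, for example, a difference of 1,000,000,000 ticks corresponds to one second.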

Performance monitoring counters

A number of effects, such as cache behavior, register renaming stalls, memory alignment and branch prediction, influence performance without being visible in the source code.

For this reason, AMD has included four performance monitoring counters in their Athlon series processors. The counters can be set to count either the number or duration of various events like the ones mentioned above. The counters can then be read with the RDPMC instruction. Unfortunately the four AMD performance monitoring counters are not compatible with the two in the comparable Intel chips. One of the more serious issues is that they recognize a substantially different set of events.

Setting the counters to monitor a certain event is accomplished by writing to four model specific registers (MSRs) that control the counters. This must be done in kernel mode; a write from user mode makes the processor generate a General Protection Fault (GPF). The device driver PerfMon.sys is provided for this purpose.
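The value written to one of these control registers can be sketched as follows. The bit positions and MSR addresses follow AMD's published Athlon register layout. The function names are hypothetical, and this is not the code inside PerfMon.sys or the DLL. The 16-bit event code is assumed to be the unit mask in the high byte and the event number in the low byte, as in codes such as 1F42.

```c
#include <stdint.h>

/* Control bits of an Athlon PerfEvtSel register. */
#define PERFSEL_USR   (1u << 16)  /* count while in user mode */
#define PERFSEL_OS    (1u << 17)  /* count while in kernel mode */
#define PERFSEL_EDGE  (1u << 18)  /* count transitions instead of duration */
#define PERFSEL_EN    (1u << 22)  /* enable the counter */

/* MSR addresses: four select registers, followed by four counters. */
#define MSR_PERFEVTSEL0 0xC0010000u
#define MSR_PERFCTR0    0xC0010004u

/* Build a control value for a 16-bit event code (unit mask << 8 | event).
   The Edge bit is deliberately left clear: with Edge set the counter only
   registers transitions, which yields unexpectedly low counts. */
uint32_t perfsel_encode(uint16_t event_code, int count_user, int count_os)
{
    uint32_t value = event_code;
    if (count_user) value |= PERFSEL_USR;
    if (count_os)   value |= PERFSEL_OS;
    return value | PERFSEL_EN;
}

/* MSR address of the select register for counter 0..3. */
uint32_t perfsel_msr(unsigned counter)
{
    return MSR_PERFEVTSEL0 + (counter & 3u);
}
```

The actual write (WRMSR) still has to happen in kernel mode; this sketch only computes the value and the address.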

CPUID

To make things a bit safer, there is also support for the CPUID instruction. It is used to check that the CPU supports the extensions needed to operate the performance counters. The GetCPU_ID function returns a structure that describes the CPU. When the CPU is not an AMD Athlon, the code simply refuses to perform most requested functions. You can hack the source to recognize other chips or, better yet, use the original code by Sami Sallinen.
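A check of this kind might look like the sketch below. It decodes the processor signature that CPUID function 1 returns in EAX (stepping, model, family and type fields) and compares the vendor string. The struct and function names are hypothetical and do not reflect the layout returned by GetCPU_ID. Treating "AuthenticAMD, family 6" as an Athlon is an assumption made for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Fields of the CPUID function 1 signature (EAX). */
struct cpu_signature {
    unsigned type;      /* bits 13..12 */
    unsigned family;    /* bits 11..8  */
    unsigned model;     /* bits 7..4   */
    unsigned stepping;  /* bits 3..0   */
};

struct cpu_signature decode_signature(uint32_t eax)
{
    struct cpu_signature s;
    s.stepping = eax & 0xFu;
    s.model    = (eax >> 4) & 0xFu;
    s.family   = (eax >> 8) & 0xFu;
    s.type     = (eax >> 12) & 0x3u;
    return s;
}

/* Returns 1 when vendor string and family match the assumed Athlon test. */
int looks_like_athlon(const char *vendor, uint32_t eax)
{
    struct cpu_signature s = decode_signature(eax);
    return strcmp(vendor, "AuthenticAMD") == 0 && s.family == 6;
}
```

A signature of 0x642 decodes to type 0, family 6, model 4, stepping 2.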

Great. What do I do?

Release v1.1 of the software (source included) is specifically configured for the AMD Athlon and will not work for Intel CPUs. The supported OSs are Windows 2000 and Windows NT 4.0.

Do not install or start the device driver unless the above requirements are met. To enable Intel support you should recompile perfctrs.dll. The DLL is written in plain C and can be built with MS VC++ 4.0 or better.

Available software

First of all, we need a device driver, because Win2K and NT 4.0 do not allow a user program to write the PerfSel registers. Without this device driver you can read the four performance counters, but you cannot control what they are counting. The device driver is called perfmon.sys. A small utility, loaddrv.exe, lets you install and start the driver without rebooting. The LOADDRV program is by Paula Tomlinson, who describes it in her May 1995 article in Windows/DOS Developer's Journal (now Windows Developer's Journal). The full program used to be available from the journal's FTP site, but it seems to have been removed recently. A copy of the relevant tomlinsn.zip has been placed in the source code archive for this page's software. (You'll need the NT DDK to do anything useful with it.)

Second, a DLL provides read and write access to the CPU-specific hardware. This DLL cannot work unless the above device driver is installed. It is based on prior C++ work by Sami Sallinen.

Third, a small C program tests whether the code is working correctly. The demonstration detects the CPU type and can set up the Athlon on-chip counters for four different events. It then starts a user-specified program. When that program exits, the performance monitor data is written to stdout. (In version 1.1 it is possible to specify the events that are monitored.)

Fourth, a Forth utility (in source) is supplied that allows access to the Performance Monitor Counters from any Forth program or word. This is the intended use of all the software mentioned above. Here is the output of the utility when used from within iForth 1.11e (the numbers are just for illustration):

#40175768 VALUE #istr	\ loop count for about 1 second of execution time
0 VALUE offs		\ byte offset, to show the effect of non-aligned memory accesses

CREATE ape 9 , 8 ,

: test	TIMER-RESET #istr 0 DO  ape offs + @+ XOR
				7 OR 99 AND
				7 OR 99 AND
				7 OR 99 AND
				7 OR 99 AND
				7 OR 99 AND 33 + DROP
			  LOOP  .ELAPSED ;

: DO-TEST ( offs -- )	TO offs ['] test TEST-PERFORMANCE ;

FORTH> CR .printID
VendorID: AuthenticAMD, MaxCPUID 1, type 0, family 6, model 4, stepping 2
FPU on-chip
Virtual-8086 mode enhancement
Debugging extensions
Page size extensions
Time-stamp-counter
Model Specific Register support
Physical Address Extensions
Machine Check Exception
CMPXCHG8B
Memory type range register
PTE Global bit
Machine check architecture
CMOV instruction
MMX(tm) Technology

FORTH> 0 do-test 1.000 seconds elapsed.

[00C5] retired_far_control_transfers   : 626
[00C7] retired_resync_branches         : 3
[00CE] ints_masked_while_pending       : 0
[00CF] ints_taken                      : 193

[00CD] ints_masked                     : 33,844
[1F42] data_cache_refill_l2_all        : 968
[1F43] data_mem_refs_all               : 279
[1F44] data_cache_writebacks_all       : 1,279

[0045] l1_dltb_misses_l2_dltb_hits     : 337
[0046] l1+l2_dtlb_misses               : 491
[0047] misaligned_data_references      : 14
[0080] ifu_ifetch                      : 241,072,624

[0081] ifu_ifetch_miss                 : 398
[0084] l1_itlb_misses                  : 36
[0085] l1+l2_itlb_misses               : 92
[00C0] retired_instructions            : 1,004,401,986

[00C1] retired_ops                     : 1,124,990,805
[00C2] retired_branches                : 40,179,116
[00C3] retired_mispredicted_branches   : 784
[00C4] retired_taken_branch_mispredict : 40,177,633

[0040] data_cache_access               : 401,821,336
[0041] data_cache_misses               : 1,682
[1042] data_cache_refill_l2_modified   : 350
[0842] data_cache_refill_l2_owner      : 0

[0442] data_cache_refill_l2_exclusive  : 92
[0242] data_cache_refill_l2_shared     : 1
[0142] data_cache_refill_l2_invalid    : 212
[1043] data_mem_refs_modified          : 130

[0843] data_mem_refs_owner             : 0
[0443] data_mem_refs_exclusive         : 62
[0243] data_mem_refs_shared            : 0
[0143] data_mem_refs_invalid           : 0

[1044] data_cache_writebacks_modified  : 510
[0844] data_cache_writebacks_owner     : 0
[0444] data_cache_writebacks_exclusive : 223
[0244] data_cache_writebacks_shared    : 0
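Raw counts like these are easier to read when related to each other or to the loop count. The helper below is a hypothetical sketch for that arithmetic and is not part of the toolkit. The figures in the usage note come from the illustrative listing above, so the results are only as meaningful as those numbers.

```c
#include <stdint.h>

/* Ratio of two event counts, e.g. events per loop iteration or
   misses per access. Returns 0.0 when the denominator is zero. */
double count_ratio(uint64_t numerator, uint64_t denominator)
{
    if (denominator == 0)
        return 0.0;
    return (double)numerator / (double)denominator;
}
```

For example, count_ratio(1004401986, 40175768) relates retired_instructions to the loop count and comes out at roughly 25 instructions per pass. Applied to data_cache_misses and data_cache_access, count_ratio(1682, 401821336) gives a miss ratio of only a few per million accesses.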

 

Installation

Copy PERFMON.SYS to %systemroot%\system32\drivers. Run LOADDRV and install, then start PERFMON.SYS. Rebooting should not be necessary. The system will remember to start the service at each reboot.

If something goes wrong, Win2K refuses to boot or reboots spontaneously. In that case, start up in "safe mode" and delete perfmon.sys.

Bugs

Only the AMD Athlon is supported. Three fields of the PerfSel registers cannot be set: CMASK, Int and Inv. These fields are very rarely used. The code should only be used on single-CPU boards, which for Athlon processors is not a serious limitation at the moment (Dec 2000).

Comments appreciated:
c/o Marcel Hendrix - mhx@iae.nl