
Microsoft's P-Code Implementation
by John Chamberlain

This article was originally published in the March 2001 issue of VB Online Magazine.

About the Author: John Chamberlain is a long-time developer in the Boston area. Currently he is a senior developer at Invantage, Inc. Reach him by e-mail or drop by his web site at http://johnc.ne.mediaone.net/.

Suggested Tools:
Visual Basic 6 and a debugger.

Resources:
Inside the Java Virtual Machine by Bill Venners. McGraw-Hill, 1998. ISBN 0-07-913248-0

Use the P-Code Extractor to disassemble VBA byte codes.

A key pillar of Microsoft's software tool strategy is Visual Basic for Applications. By standardizing this programming system across their product line they are simplifying their internal development and nurturing a large base of developers and power users who can work comfortably in the different environments. The underpinning of VBA is a complex stack machine that normally prevents the developer from determining the run-time logic of application code. With the Visual Basic add-in accompanying this article, the P-Code Extractor, you can extract and disassemble compiled byte codes and see VBA in its native state. I cover the basics of the instruction set and the mechanism of the stack machine to the extent that you will be able to follow the execution of a VBA program in a debugger.

Introduction to "Portable" Code

The first Pascal compilers ran on relatively obscure machines and the complexity of their design militated against porting of the compiler itself. The solution was a compiler that generated pseudo-opcodes that would run on a local interpreter and thus make Pascal truly portable across a wide range of platforms. It was called the "Pascal-P" compiler and its output was "P-code" [1]. A decade later Microsoft adopted the by-then familiar terms "p-code" and "p-code engine" for its stack machine. Portability was not a development priority, though, and since VB p-code was noticeably not portable the "p" morphed into "packed", stressing the small size of the execution byte package. After Sun's Java blitzed the web world in 1995, Microsoft responded by changing the name of the p-code engine to the more sexy "VB Virtual Machine". This term is a misnomer, however, because whereas Java's opcode set does represent an abstraction of a CPU instruction set, Microsoft's p-codes are an abstraction of a programming language and environment. A better term for the engine would be the "VB Virtual Language".

1. Nori, K.V., U. Ammann, K. Jensen, H. H. Nageli, and Ch. Jacobi. "Pascal P implementation notes," pp. 125-170 in Pascal--The Language and its Implementation (D. W. Barron, 1981).

VB VM Compared to the Java VM

The number of opcodes highlights the difference between the VB6 VM (1351 opcodes) and the Java VM (a one-byte opcode space of 256, of which roughly 200 are defined). Java's operations represent stack manipulations, a simple set of flow control operations, and a few miscellaneous items to handle threading and exceptions. VB has equivalents of all these plus hundreds of language- and Windows-specific instructions. For example, 3C represents the Windows API call GetLastError. These specialized byte codes make compilation easier but prevent VB programs from being ported to different operating systems. They also require a much larger engine than the comparable Java run-time engine (although the msjava.dll somehow manages to weigh in at a hefty 912K). Though they are both stack machines, Microsoft's two VMs are very different in their internal design, so the description here of the VB VM mechanism does not apply to the MSJava VM. The various flavors of VBA, however, have the same implementation. For example, the VBA332.DLL that ships with Microsoft Office 97 is functionally identical to the VBA6.DLL which is the basis for this article. Since most of the opcodes are the same, you can use the downloadable opcode database to analyze p-code generated by the run-time environments in Office applications such as Excel, Word and Access. A good resource for getting a general feel for stack machines is Bill Venners's book, Inside the Java Virtual Machine. O'Reilly also published a book on the Java VM, but unfortunately it is out of print.

VB VM Basics

The engine processes every opcode the same way. The trick is that it does not loop through the p-codes; rather, it maintains a register cursor that points to the arguments of the current opcode, and each handler uses a jump table to find the handler for the next opcode. You can see this process in Figure 1, which illustrates the execution of F5, push an immediate long (comparable to Java's bipush and sipush, which push an immediate byte or short). When you enter an opcode handler the ESI register always points to the arguments of the opcode and EAX contains the jump index.

Execution for F5 - push an immediate long

0FC01377  mov  eax,dword ptr [esi]              (1) Fetch argument(s)
0FC01379  push eax                              (2) Do opcode work
0FC0137A  xor  eax,eax                          (3) Jump to next opcode
0FC0137C  mov  al,byte ptr [esi+4]
0FC0137F  add  esi,5
0FC01382  jmp  dword ptr [eax*4+0FC027CCh]

The Engine in Action. In the byte code stream a push of the constant 8 as a long would appear as "F5 08 00 00 00" and would be executed as shown. Each of the 775 handlers in the VB6 VM follows this same basic pattern.

Figure 1.
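The three steps of Figure 1 can be imitated in a few lines of Python. This is only a minimal sketch, not the engine's actual code: the VM class, the function names and the dictionary standing in for the jump table are all hypothetical. Opcode F5 and its four-byte argument come from Figure 1, and opcode 14 ("end proc") comes from the disassembly in Figure 4, although here it simply stops the sketch instead of cleaning up a stack frame. Python has no computed jump, so a small driver loop stands in for the "jmp" that chains one handler to the next.

```python
import struct

class VM:
    """Hypothetical stand-in for the engine state."""
    def __init__(self, code):
        self.code = code    # p-code byte stream
        self.cursor = 0     # plays the role of ESI
        self.stack = []     # plays the role of the machine stack

def next_handler(vm, arg_len):
    """Step 3: read the opcode that follows the current arguments, move the
    cursor to that opcode's arguments, and look up its handler."""
    opcode = vm.code[vm.cursor + arg_len]   # mov al,byte ptr [esi+4]
    vm.cursor += arg_len + 1                # add esi,5
    return JUMP_TABLE[opcode]               # jmp dword ptr [eax*4+table]

def op_push_long(vm):
    (value,) = struct.unpack_from("<i", vm.code, vm.cursor)  # (1) fetch argument
    vm.stack.append(value)                                   # (2) do opcode work
    return next_handler(vm, 4)                               # (3) next opcode

def op_end_proc(vm):
    return None   # assumption: just stop; the real handler unwinds the frame

JUMP_TABLE = {0xF5: op_push_long, 0x14: op_end_proc}

def run(code):
    vm = VM(code)
    handler = JUMP_TABLE[code[0]]
    vm.cursor = 1                 # cursor points at the first opcode's arguments
    while handler is not None:    # driver loop standing in for the chained jmp
        handler = handler(vm)
    return vm.stack

# Push 8, push 42, end proc
print(run(bytes.fromhex("F5 08 00 00 00 F5 2A 00 00 00 14")))  # [8, 42]
```

Note that no handler ever returns to a central loop that decodes opcodes; each one finds its own successor, which is the point of the design described above.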

In the example of Figure 1 the table of vectors to the opcode handlers is located at 0FC027CC, so opcode 71 (hex) jumps through the table entry at index 71h, that is, to the address stored at 71h*4+0FC027CCh. Here the "work" of the handler (step 2) is just a single push operation, but other opcodes have very complicated executions. Regardless of this complexity you can always sidestep intervening handlers as long as you know the vectors for each opcode. For example, by setting a debugger breakpoint at 0FC01377 in the environment of Figure 1 you would stop at the beginning of each F5 execution. The output of the P-Code Extractor includes RVAs (relative virtual addresses) for each instruction, which makes it easy to determine these breakpoint locations for your environment.
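As a worked example of the table arithmetic, the following sketch computes where the vector for a given opcode is stored. The function name is hypothetical, and the table base 0FC027CC is the one from Figure 1, so it will differ in your environment. The result is only the address of the table entry; the handler address itself is the dword stored there, which you would read in a debugger.

```python
def handler_slot(table_base: int, opcode: int) -> int:
    """Address of the 4-byte jump-table entry that holds the handler vector."""
    return table_base + opcode * 4

TABLE_BASE = 0x0FC027CC   # table location in the environment of Figure 1

print(f"{handler_slot(TABLE_BASE, 0x71):08X}")  # 0FC02990
print(f"{handler_slot(TABLE_BASE, 0xF5):08X}")  # 0FC02BA0
```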

In this methodology of running a stack machine only the ESI and EAX registers have a special use. Since EAX is always cleared (xor eax, eax) and loaded before every jump, it is free for general use within the body of the handler. Only the ESI register must be preserved. The engine moves ESI past the opcode arguments in one of two ways: either in step 3, as in Figure 1 (add esi, 5), or in step 1, in which case (for a four-byte argument like the one in Figure 1) it adds 4 in step 1 to go past the argument and does an inc in step 3 to go past the next opcode byte. Either way, the adds to ESI must always leave it pointing at the arguments of the next instruction. Operations affecting the application program's flow of control change the path of execution by modifying ESI; in a flow control operation the engine always adds to ESI in step 1. The other registers are free for their typical uses. In particular the stack and base registers work the same way as in normal functional code.

Altogether there are 775 opcode handlers in VB6. This is possible because many of the 1351 opcodes map to the same handler. The main reason for the overlap is that there are actually two engines: the Visual Basic for Applications engine (in VBA6.DLL), which operates in the design environment, and the run-time virtual machine engine (in MSVBVM60.DLL), which is used by executables. Many instructions have one opcode for VBA and a different, parallel one for the run-time engine. The two engines have the same functionality with some slight differences, mostly attributable to special needs of the design environment, such as stepping through code, that do not apply at run time.

Procedure Execution

The engine always runs in the context of one of the application's procedures (or methods). Each procedure/method in an application is compiled into a separate p-code structure which has no pre-defined relationship to any other procedure. When one procedure needs to call another it uses a hard-coded address to a little trampoline generated by the compiler:

mov edx,20E220h
mov ecx,offset ProcCallEngine (0fc0000b)
jmp ecx

The key address (20E220 in this example) points to a structure at the end of the callee's blob of p-code, the byte code trailer. The VB internal function ProcCallEngine uses the info in the trailer to set up the stack frame and start the engine on the first opcode of the blob. An analogous function, the MethodCallEngine, does the same thing for methods. The overall entry point for an application is simply the address of the trampoline for its startup procedure (or method).

VB's Run-Time Stack Frame


The proc stack frame has the same structure as a normal functional model stack frame with a few twists. The key points are that any parameters to an application function are up by two additional entries from the base pointer and the app locals don't start until 0x88.

Figure 2.

The stack frame has the layout shown in Figure 2. When the ProcCallEngine sets up a stack frame it reserves 30 variables for the engine's internal use, called "frame data". The figure identifies the most important of these. The acronyms shown (e.g., "SR" and "CON") are used in the P-Code Extractor output to refer to these locations. The constant pool is an array in which the compiler stuffs references to any outside piece of information the proc needs, such as constants, strings, and callable function addresses (for example run-time calls like rtcMsgBox and the trampolines of other procs). You can find details on the frame data in the online materials. Below the frame data and the app locals the stack itself grows and shrinks as the engine does its work. Typically one or more instructions will put operands on the stack and then another will pop them off to do something and push the result, sort of like a mini function call.

When a proc exits it cleans up and sets the base and stack pointers to those of the calling function and jumps to the return pointer stored above the base pointer. There is no simple "ret" as in a normal function call. The engine does it manually. When a proc returns it just continues on in whatever handler called it.

The Opcode Database

The opcode database is a text file (OPCODES.TXT) that contains all the information the P-Code Extractor needs to disassemble a p-code byte blob. I formatted it in a way that makes it natural to read and edit as a text document. Figure 3 shows a few entries from the database and explains the format. Note that some elements of the syntax are dynamic. For example where you see the term "arg" in the database you will see an actual local (or parameter) number in the disassembly. This is because the Extractor knows that opcode args are negative offsets and app function parameters are positive offsets and it does the two's complement calculation for you so arg byte codes like 6C FF are translated to "local.94" and 10 00 to "param2". The format also has a comment column that comes out in the disassembly and may have additional information relevant to the opcode. The instruction descriptions represent only my interpretations of the main purpose of each opcode. Once you gain a familiarity with the syntax you can make your own interpretations and modify the database appropriately. I maintain a page at my website for updates and corrections to the database.
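The two's complement translation described above can be sketched as follows. This is an illustration rather than the Extractor's actual code, and the function name is hypothetical. The two-byte argument is read as a little-endian signed 16-bit offset; negative offsets become locals named by their hexadecimal magnitude. The parameter numbering is an assumption inferred from the examples in this article (offset 10 is "param2", and in Figure 4 String1 and String2 sit at offsets 0C and 10), namely four-byte slots with the first parameter at offset 0C.

```python
import struct

def decode_arg(arg_bytes: bytes) -> str:
    """Translate a 2-byte opcode argument into a local or parameter name."""
    (offset,) = struct.unpack("<h", arg_bytes)   # little-endian signed 16-bit
    if offset < 0:
        return f"local.{-offset:X}"              # e.g. FF6C -> -94h -> local.94
    # Assumption: 4-byte parameter slots with param1 at offset 0Ch
    return f"param{(offset - 8) // 4}"

print(decode_arg(bytes.fromhex("6CFF")))  # local.94
print(decode_arg(bytes.fromhex("1000")))  # param2
```

Applied to the arguments in Figure 4, the same function gives local.8C for 74 FF and local.88 for 78 FF, which agrees with the app locals starting at 0x88.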

Opcode Database Format


The opcode database determines how the P-Code Extractor disassembles a p-code byte blob. Since the database is a normal text file it is easy to modify.

Figure 3.

A complete description of the abbreviation scheme used for the opcodes is in the online guide. I use a shorthand which is as explicit as possible within the available amount of space to describe the action of the instruction. To some extent this is by necessity because most of the instructions are relatively complex compared to a generic stack machine and therefore very condensed mnemonics are not possible.

How to Use the P-Code Extractor

The P-Code Extractor DLL is a VB6 add-in which you can load with the Add-Ins/Add-In Manager menu in the IDE. When it loads it adds menu choices to VB's Tools menu that allow you to set display options for the output and generate the output itself. To do an extraction, run the project to compile it internally, put your cursor in the procedure of interest and click the Tools/Show P-Code menu choice. The disassembly will be inserted as a comment in your source code. Figure 4 shows an example for a function to do concatenation. For comparison's sake I have included the most relevant portion of the native code assembly listing for the function (refer to my November 1999 cover story in Visual Basic Programmer's Journal for details on how to generate these listings). The native code uses different logic to achieve the same result.

P-Code Disassembly for a Function to do Concatenation
Public Function Cat2(String1 As String, String2 As String) As String
    Dim lngNum As Long
    lngNum = "a"
    Cat2 = String1 & String2
End Function

>020D  00     02           IDE beginning of line with imm#1 byte codes
>020D  00     09           IDE beginning of line with imm#1 byte codes
>13A8  1B     04 00        Push ptr
>07A0  50                  vbaI4Str
>0BCC  71     74 FF        Pop#4 [arg]
>020D  00     0C           IDE beginning of line with imm#1 byte codes
>1063  80     0C 00        Push [stack.C]
>1063  80     10 00        Push [stack.10]
>259B  2A                  vbaStrCat
>1AD8  31     78 FF        SysFreeString [arg]; [arg]=Pop
>020D  00     00           IDE beginning of line with imm#1 byte codes
>2142  14                  end proc
>020D  00     00           IDE beginning of line with imm#1 byte codes

; 50   :     Cat2 = String1 & String2

    mov    eax, DWORD PTR _String1$[ebp]
    mov    edx, DWORD PTR _String2$[ebp]
    mov    ecx, DWORD PTR [eax]
    mov    eax, DWORD PTR [edx]
    push   ecx
    push   eax
    call   DWORD PTR __imp____vbaStrCat
    mov    edx, eax
    lea    ecx, DWORD PTR _Cat2$[ebp]
    call   DWORD PTR __imp_@__vbaStrMove
    push   $L168
$L76:
; 51   : End Function

    lea    ecx, DWORD PTR _Cat2$[ebp]
    call   DWORD PTR __imp_@__vbaFreeStr
    ret    0


The P-Code Extractor inserts the disassembly as a comment right into your source code. You can use options settings to change the way it displays and the items included. For comparison's sake a portion of the native code assembly is shown alongside the p-code disassembly.

Figure 4.

When a project runs in the design environment every breakable line of user source code generates a beginning-of-line instruction (opcode 00). To fully understand all the instruction types you need to read the online materials. Given the complexity of VB, knowing the functionality of an opcode is sometimes just the beginning of solving a problem. For example, from Figure 4 you can see that VB pushes the two strings and uses an internal call, vbaStrCat (opcode 2A), to do the actual concatenation, but what may be of interest to you could be hidden away in that call. The extractor is just the first step in unraveling the meaning of your program. Nevertheless, since the language itself is implemented via the p-code engine, the extractor gets past the most complex barrier and makes an internal analysis of a VBA program possible.
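To make the structure of the Figure 4 listing concrete, here is a minimal table-driven disassembly sketch. It is not the Extractor's implementation and the Python dictionary is not the format of OPCODES.TXT; both are stand-ins for illustration. The opcodes, descriptions and argument lengths are read directly off Figure 4, and the byte stream is simply the byte codes of that figure joined together, which is an assumption about how they are laid out in memory.

```python
# opcode -> (description, number of argument bytes), as seen in Figure 4
OPCODES = {
    0x00: ("IDE beginning of line", 1),
    0x14: ("end proc", 0),
    0x1B: ("Push ptr", 2),
    0x2A: ("vbaStrCat", 0),
    0x31: ("SysFreeString [arg]; [arg]=Pop", 2),
    0x50: ("vbaI4Str", 0),
    0x71: ("Pop#4 [arg]", 2),
    0x80: ("Push [stack]", 2),
}

def disassemble(code: bytes):
    """Split a p-code blob into (opcode, argument bytes, description) rows."""
    rows = []
    pos = 0
    while pos < len(code):
        opcode = code[pos]
        text, arg_len = OPCODES[opcode]      # KeyError for opcodes not listed
        args = code[pos + 1 : pos + 1 + arg_len]
        rows.append((opcode, args, text))
        pos += 1 + arg_len                   # skip the opcode and its arguments
    return rows

# Byte codes of Cat2 from Figure 4, concatenated in listing order
CAT2 = bytes.fromhex(
    "00 02 00 09 1B 04 00 50 71 74 FF 00 0C "
    "80 0C 00 80 10 00 2A 31 78 FF 00 00 14 00 00"
)

for opcode, args, text in disassemble(CAT2):
    arg_text = " ".join(f"{b:02X}" for b in args)
    print(f"{opcode:02X}  {arg_text:<8}  {text}")
```

The sketch yields thirteen rows, one for each line of the Figure 4 disassembly. A real disassembler needs the argument length of every opcode, which is the kind of information the opcode database supplies.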

The code package for the article includes the disassembler add-in's source code and compiled dll, a guide to using the dll, and the opcodes.txt data file.
