P-Code Extractor Guide
John Chamberlain (jchamber@lynx.dac.neu.edu)
http://johnc.ne.mediaone.net/
March 2001

This guide assumes you are using VB 6.0. It will probably work with VB 5.0, but that environment has not been tested.

INSTALLING

First, you need to install the add-in. This involves adding an entry to your VBADDIN.INI file (see VB help for more information). As a convenience there is a function in the extractor's Basic module (modAddIn.bas) which does this. You can run it from the debug window after loading the project, and it will put the necessary entry in your ini file. The function looks like this:

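Sub AddToINI()
    ' Writes the entry PROJECT_NAME.clsConnect=0 under [Add-Ins32] in vbaddin.ini.
    ' WritePrivateProfileString returns nonzero on success.
    Dim ErrCode As Long
    ErrCode = WritePrivateProfileString("Add-Ins32", PROJECT_NAME & ".clsConnect", "0", "vbaddin.ini")
End Sub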

PROJECT_NAME is a global whose value is "PCodeExtractor". You must also register the DLL (PCExtractor.DLL) with Windows. It is an ActiveX component, so it is self-registering; if double-clicking it does not register it, run "regsvr32 PCExtractor.DLL" from a command prompt. Now when you open the Add-In Manager from the VB IDE Tools menu you should see "P-Code Extractor" listed.

USING THE EXTRACTOR

To use the extractor go to the add-in manager and load the add-in. After you do this you will see a number of new entries added to VB's File menu:
    Load OpCode Data
    Show P-Code
    Intersperse P-Code
    Show Raw P-Code
Run the first one (Load OpCode Data) before anything else. It loads the opcode database. The add-in expects the database (opcodes.txt) to be in the same directory as its DLL, and it announces in a message box whether the load succeeded.

Once the database is loaded you can do an extraction.

*IMPORTANT* You cannot extract anything until you compile your project, because until you compile there is no p-code in memory. Any edit to your project, even to a comment, discards the previously compiled p-code, so you must recompile before extracting again.

Compile your project. You can do this by choosing "Start with Full Compile" from the Run menu. If you start the project, stop it.

Now, place your cursor in the procedure of interest and choose "Show P-Code" from the File menu.

For example, I loaded the VB sample project called "SDINote" and did this for the function OnRecentFilesList in "module2.bas" and this was the result:

'> 020D  00     02           IDE beginning of line with imm#1 byte codes
'> 020D  00     15           IDE beginning of line with imm#1 byte codes
'> 1435  28     38 FF 01 00  PushVarInteger imm2#2
'> 0B8E  04     68 FF        Push local.98
'> 1435  28     48 FF 04 00  PushVarInteger imm2#2
'> 0E24  FE 68  18 FF 72 00
'> 020D  00     44           IDE beginning of line with imm#1 byte codes
'> 0B8E  04     0C FF        Push local.F4
'> 0B8E  04     10 FF        Push local.F0
'> 0B8E  04     68 FF        Push local.98
'> 0550  55                  vbaI2Var
'> 1113  05     01 00        Push ptr
'> 188B  24     02 00        [Pop] [SR]
'> 20BA  0F     5C 03
'> 1BBB  19     14 FF
'> 1A7D  08     14 FF        [SR]=[local.EC]
'> 200C  0D     40 00 0E 00
'> 1A7D  08     10 FF        [SR]=[local.F0]
'> 200C  0D     60 00 0F 00
'> 0B48  3E     0C FF        Push#4 [arg]; [arg]=0
'> 08B7  46     FC FE
'> 091B  5D
'> 1A38  FD 16  0C 00 58 FF
'> 257A  FB 33
'> 1E50  29     04 00
'>              14 FF
'>              10 FF
'> 1DD6  35     FC FE
'> 04D5  1C     65 00        If Pop=0 then ESI=ProcPC+0065#2
'> 020D  00     07           IDE beginning of line with imm#1 byte codes
'> 1364  F4     FF           Push imm#1
'> 0BB5  70     7A FF        Pop#2 [arg]
'> 020D  00     03           IDE beginning of line with imm#1 byte codes
'> 2182  15
'> 020D  00     02           IDE beginning of line with imm#1 byte codes
'> 020D  00     0B           IDE beginning of line with imm#1 byte codes
'> 0B8E  04     68 FF        Push local.98
'> 0E72  FE 7E  18
'> 8FF6  FF 17  00 00        vbaGetOwner4
'> 17EB  07     F4 00 70 7A  Push [arg1]+imm#2
'> B955  FE 100 00 15        vbaStrCompVar
Function OnRecentFilesList(FileName) As Integer
    Dim i         ' Counter variable.

    For i = 1 To 4
        If frmSDI.mnuRecentFile(i).Caption = FileName Then
            OnRecentFilesList = True
            Exit Function
        End If
    Next i
    OnRecentFilesList = False
End Function


Notice how the p-code is inserted as a comment above the function. The format of this disassembly is (using '> 0E24  FE 68  18 FF 72 00 as an example):

 '>           the apostrophe starts a comment in VB; the angle bracket marks a p-code disassembly line
 0E24         the VBA RVA of the p-code's execution inside the VM
 FE 68        the p-code's opcode, most will be one byte, but some are two
 18 FF 72 00  the argument(s) to the instruction (some opcodes have no arguments)
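
If you want to post-process a disassembly outside of VB, the layout above can be sliced by column. The following is only a sketch in Python: the column positions (3, 9, 16 and 29) are my reading of the sample output in this guide, not something taken from the add-in's source, and the function name parse_pcode_line is made up for illustration.

```python
def parse_pcode_line(line):
    """Split one "'>" disassembly line into its four parts.

    Column positions are inferred from the sample output in this guide
    (an assumption): RVA at 3, opcode at 9, arguments at 16 and the
    instruction description at 29.
    """
    if not line.startswith("'>"):
        raise ValueError("not a p-code disassembly line")
    padded = line.rstrip("\r\n").ljust(29)
    rva = padded[3:9].strip()
    return {
        # Continuation lines carry only arguments, so the RVA may be absent.
        "rva": int(rva, 16) if rva else None,
        "opcode": [int(b, 16) for b in padded[9:16].split()],
        "args": [int(b, 16) for b in padded[16:29].split()],
        "description": padded[29:].strip(),
    }
```

Lines that have no description in the database simply come back with an empty description string.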

An "execution" is the actual machine code that carries out the instruction. The VM can be thought of as a giant switch statement with little blobs of code for each opcode. The execution is the blob of code for a given opcode.
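
To make the "giant switch statement" picture concrete, here is a toy dispatch loop in Python. It is not how the real VM is written (that is hand-tuned x86); it only borrows two opcodes whose descriptions appear in the disassembly above (F4 "Push imm#1" and 1C "If Pop=0 then ESI=ProcPC+imm#2") and treats their behaviour in the simplest possible way. The function name run and the rule that execution stops at the end of the byte string are assumptions of the sketch.

```python
def run(code):
    """Toy fetch-and-dispatch loop; returns the final stack."""
    stack = []
    pc = 0  # plays the role of ESI, kept as an offset from ProcPC

    def push_imm1():
        # F4: push the 1-byte immediate that follows the opcode.
        nonlocal pc
        stack.append(code[pc])
        pc += 1

    def branch_if_zero():
        # 1C: pop; if the value is 0, continue at ProcPC + 2-byte immediate.
        nonlocal pc
        target = int.from_bytes(code[pc:pc + 2], "little")
        pc += 2
        if stack.pop() == 0:
            pc = target

    # One "execution" per opcode; the dict lookup stands in for the switch.
    executions = {0xF4: push_imm1, 0x1C: branch_if_zero}

    while pc < len(code):
        opcode = code[pc]
        pc += 1
        executions[opcode]()
    return stack
```

For example, run(bytes.fromhex("F4001C0700F411F422")) pushes 0, pops it, takes the jump to offset 7 and so skips the "F4 11" in the middle.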

If you choose "Show Raw P-Code" it will show the p-code blob for the function and the address where the blob is located like this:

'> lpComponent 2 of 8: 1B4424 lpTrailer: 1FB4F4
'> 18 FF 72 00 00 44 04 0C FF 04 10 FF 04 68 FF 55 05 01 00
'> 24 02 00 0F 5C 03 19 14 FF 08 14 FF 0D 40 00 0E 00 08 10
'> FF 0D 60 00 0F 00 3E 0C FF 46 FC FE 5D FD 16 0C 00 58 FF
'> FB 33 29 04 00 14 FF 10 FF 35 FC FE 1C 65 00 00 07 F4 FF
'> 70 7A FF 00 03 15 00 02 00 0B 04 68 FF FE 7E 18 FF 17 00
'> 00 07 F4 00 70 7A FF 00 00 15
'> 00 02 00 15 28 38 FF 01 00 04 68 FF 28 48 FF 04 00 FE 68

The component number is the component index of the procedure in the class or module (see the VB help on VBComponents and the IDE interface for more info).
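
If you want the raw blob as actual bytes (to search it or compare two compiles), something like the following Python sketch will do. The header pattern is copied from the sample output above; the function name parse_raw_dump and the dictionary keys are made up for illustration, and the sketch simply concatenates the hex lines in the order it is given.

```python
import re

HEADER = re.compile(
    r"lpComponent (\d+) of (\d+): ([0-9A-Fa-f]+) lpTrailer: ([0-9A-Fa-f]+)$"
)

def parse_raw_dump(lines):
    """Return (header, data) for a "Show Raw P-Code" comment block."""
    header = None
    data = bytearray()
    for line in lines:
        body = line.strip()
        if not body.startswith("'>"):
            continue  # ordinary VB source line, not part of the dump
        body = body[2:].strip()
        match = HEADER.match(body)
        if match:
            header = {
                "component": int(match.group(1)),
                "component_count": int(match.group(2)),
                "lp_component": int(match.group(3), 16),
                "lp_trailer": int(match.group(4), 16),
            }
        elif body:
            data += bytes.fromhex(body)  # fromhex ignores the spaces
    return header, bytes(data)
```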

The Intersperse P-code menu choice is intended to put the p-code for each line as a comment to that line, but has not been implemented in the 1.0.42 version of the DLL so it does nothing.

Some p-code lines have a comment next to them and others do not. A missing comment means that the database has no instruction description for that opcode. There are many instructions that I have not decoded, but the most common ones have descriptions. In most cases about 2/3 of your p-codes should have a comment.

THE INSTRUCTION DESCRIPTION

The instruction description is the last part of the disassembly and hopefully describes what the opcode does as well as possible in a small space. Below I have described the possible features in the instruction description:

- all numbers are in hex (if you can't think in hex already, get started, it makes things easy)
- brackets [] are placed around values used by reference
- # indicates the number of bytes of something
- if an argument has no reference in the instruction description then it is a temporary variable
- imm indicates an immediate value, i.e. a constant that is present right in the code
- CON indicates the address of the constant pool; the constant pool stores module-level vars
- SR indicates the stack register, i.e. the pointer to the top of the stack

Example: vbaI2Var
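This kind of description means that the opcode calls an exported function. You can see a list of these by using the Microsoft dumpbin.exe utility to dump the exports of VBA6.DLL. This call probably converts a variant to an "I2" (2-byte integer?), which would fit the example: the loop counter i is declared without a type, so it is a variant that has to become an integer index.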

Example: Push local.98
This means that a local variable is being pushed onto the stack. The local variables, of course, are offsets in the stack frame. What the offset is (here 98) is determined by the compiler.
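
In the sample disassembly the argument bytes for this instruction are 68 FF, and every "local" line there is consistent with reading the argument as a signed 2-byte little-endian offset from the frame pointer (68 FF is -98 hex, 0C FF is -F4 hex, and so on). That reading is my inference from the sample, not a documented rule. A small Python sketch of the conversion, with a made-up function name:

```python
def local_name(operand):
    """Turn a 2-byte operand such as 68 FF into the text "local.98"."""
    # Assumption: the operand is a signed little-endian frame offset.
    offset = int.from_bytes(operand, "little", signed=True)
    if offset >= 0:
        raise ValueError("expected a negative frame offset for a local")
    return "local.%X" % -offset
```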

Example: [SR]=[local.EC]
The brackets indicate that the value is an address. This is generally the case when working with strings or other objects that are known by their addresses. For example, if a string is defined in a function, memory will be allocated to contain it, and then the address of that memory will be put in the local variable.

Example: If Pop=0 then ESI=ProcPC+0065#2
In this example, there is a little bit of logic described. Note that some opcodes have very complicated logic which is impossible to easily summarize. In this case what is happening is that the execution first pops the stack. If that value is 0 then it sets ESI equal to the procedure's program counter plus 65. The #2 indicates that the 65 is a 2-byte value. Note that ESI points to the argument of the next opcode, so this instruction is doing an absolute jump to a point 65 bytes into the procedure. You will often see such absolute jumps used to implement for-loops and if-statements.
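
The jump arithmetic is simple enough to write down. In this Python sketch the argument bytes 65 00 are read as an unsigned 2-byte little-endian value and added to ProcPC; the function name and the ProcPC value used in the usage note are hypothetical, since the real ProcPC depends on where the procedure was loaded.

```python
def jump_target(proc_pc, operand):
    """Address the VM continues at when the popped value is 0."""
    # The 2-byte immediate is an offset from the start of the procedure.
    return proc_pc + int.from_bytes(operand, "little", signed=False)
```

With a made-up ProcPC of 1000 hex, jump_target(0x1000, bytes([0x65, 0x00])) gives 1065 hex.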

Example: Push#4 [arg]; [arg]=0
This means that the 4-byte address of the first argument (FileName) is being pushed onto the stack and the argument address is being set to 0. In this function there is only one argument, but if there were more you would see arg1, arg2 etc.

MODIFYING THE DATABASE

If you discover things about an opcode you may want to modify the opcode database to record your findings and have them appear in future disassemblies. This is the format of the database (each entry is for one opcode):

Start  Width   Item           Description
1      2       Byte1          first byte of the opcode in hex
4      2       Byte2          second byte of the opcode (if there is one)
7      8       VM Jump        offset in the run-time VM (MSVBVM60.DLL) of its execution
16     8       VBA Jump       offset in the IDE VM (VBA6.DLL) of its execution
25     2       Arg Size       number of bytes in all the arguments
27     2       Arg Count      number of arguments to the op code
29     2       Arg1 Size      size of argument 1 in bytes
31     2       Arg2 Size      size of argument 2 in bytes
33     2       Arg3 Size      size of argument 3 in bytes
36     32      Instruction    disassembly description
68     80      Comment        any comments
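
Because the records are fixed-width, it is easy to read or build them with a script instead of counting columns by hand. The Python sketch below uses exactly the Start and Width values from the table; everything else (the function names, keeping every field as text, left-aligning values inside their fields) is an assumption, so compare a generated line with existing entries in opcodes.txt before relying on it.

```python
# (name, 1-based start column, width), copied from the table above.
FIELDS = [
    ("byte1", 1, 2), ("byte2", 4, 2),
    ("vm_jump", 7, 8), ("vba_jump", 16, 8),
    ("arg_size", 25, 2), ("arg_count", 27, 2),
    ("arg1_size", 29, 2), ("arg2_size", 31, 2), ("arg3_size", 33, 2),
    ("instruction", 36, 32), ("comment", 68, 80),
]
RECORD_WIDTH = 147  # last column of the Comment field (68 + 80 - 1)

def parse_record(line):
    """Slice one database line into a dict of stripped strings."""
    padded = line.rstrip("\r\n").ljust(RECORD_WIDTH)
    return {name: padded[start - 1:start - 1 + width].strip()
            for name, start, width in FIELDS}

def format_record(values):
    """Build one database line; text longer than its field is truncated."""
    columns = [" "] * RECORD_WIDTH
    for name, start, width in FIELDS:
        text = values.get(name, "")[:width]
        columns[start - 1:start - 1 + len(text)] = text
    return "".join(columns).rstrip()
```

Note that the Instruction field is only 32 characters wide, so longer descriptions get cut off by format_record.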

EXAMPLE: Researching FE 68

'> 0E24  FE 68  18 FF 72 00

This instruction in the example above has no description in the database. Let's imagine we wanted to research it and provide one. Here are the steps:

1. Put a breakpoint in the VB code by declaring
	Declare Sub DebugBreak Lib "kernel32" ()
in the project. Then, before the relevant instruction, call this kernel procedure. As long as you have a debugger installed (which happens automatically when you install VC++) it will pop up.

2. Since the function containing the instruction being researched is not normally called by the program (it is extra code in the project), we have to call it explicitly. When execution hits the DebugBreak call, a dialog box appears that says "The exception Breakpoint A breakpoint has been reached" etc. (obviously the programmers at MS need help with their grammar).

3. Choose Cancel to debug the application.

4. Now you are in the debugger, inside the DebugBreak execution.

5. Go to the Debug menu and choose "Modules". A list of all the loaded modules will appear. Make a note of the base of VBA6.DLL. In my environment this happened to be 0x0FA90000. The VM executions begin at offset 0x00170000 from that base, which is 0x0FC00000 (0FA9+0017=0FC0). If you are running under NT or Windows 2000 this should always be the address, but under 95 you may have to do a calculation. This means that the execution we are interested in is located at 0x0FC00E24 (the 0E24 is the offset as shown in the disassembly). Go back to your debugger window.

6. In the debugger window I am in the VM at 0FC01F5D. This is inside the instruction that makes OS calls. Step through until you are on an instruction that looks like this:
	jmp         dword ptr [eax*4+0FC027CCh]
As you can tell from the article, this kind of instruction is used to jump to the next opcode execution. Remember from the article that eax has the next opcode in it. If it is FE68 then we are just about to enter the one we are interested in; otherwise it is an intervening opcode. In my case I had to step through several intervening opcodes. Note that since FE68 is a 2-byte opcode you will actually do two jumps to get to it: on the first jump eax=FE and on the second eax=68. If you do not want to step through the intervening opcodes, set a debugger breakpoint at 0x0FC00E24 and run to that breakpoint. Either way you will end up in the execution for FE68, which looks like this:

0FC00E24 8D 45 94             lea         eax,[ebp-6Ch]
0FC00E27 66 C7 00 11 00       mov         word ptr [eax],offset ProcCallEngine+0E1Fh (0fc00e2a)
0FC00E2C C6 40 08 01          mov         byte ptr [eax+8],1
0FC00E30 50                   push        eax
0FC00E31 8D 49 00             lea         ecx,[ecx]
0FC00E34 0F BF 3E             movsx       edi,word ptr [esi]
0FC00E37 03 FD                add         edi,ebp
0FC00E39 8B 5C 24 08          mov         ebx,dword ptr [esp+8]
0FC00E3D 57                   push        edi
0FC00E3E 89 75 D0             mov         dword ptr [ebp-30h],esi
0FC00E41 E8 A5 61 00 00       call        rtcArray+231h (0fc06feb)
0FC00E46 83 7D C8 00          cmp         dword ptr [ebp-38h],0
0FC00E4A 0F 85 2F 84 00 00    jne         rtcArray+24C5h (0fc0927f)
0FC00E50 8B 75 D0             mov         esi,dword ptr [ebp-30h]
0FC00E53 0F B7 0B             movzx       ecx,word ptr [ebx]
0FC00E56 8B D1                mov         edx,ecx
0FC00E58 0F B7 46 02          movzx       eax,word ptr [esi+2]
0FC00E5C 83 C6 04             add         esi,4
0FC00E5F 03 45 A8             add         eax,dword ptr [ebp-58h]
0FC00E62 F6 C5 40             test        ch,40h
0FC00E65 0F 85 1E 84 00 00    jne         rtcArray+24CFh (0fc09289)
0FC00E6B FF 24 8D 04 27 C0 0F jmp         dword ptr [ecx*4+0FC02704h]

This is copied straight from the debugger window. What does this do? I have no idea, but you can see that the first instruction uses a special value, [ebp-6Ch]. VB stuffs a bunch of hardcoded locals before the application-defined locals (i.e. your variables). Of these I only know the meaning of the following:

.14	program counter (ESI) for current instruction
.30	storage for pc while making a call
.34	gosub counter
.4C	stack register (SR)
.50	points to byte code trailer (BCT)
.54	points to the constant pool (CON)
.58	first byte code in proc (ProcPC)

The others (such as .6C) I do not know. We can tell that it is an object or structure (rather than a constant) because it is used by address. The code first sets this structure to something (0E27), then sets the structure's third field (+8) to 1. Aha, this is probably a structure that holds information about a for-loop, and the third field contains the loop's start value. It then pushes the structure, which means it is making it an argument to a call, and so on and so forth. Now we know what .6C is.

7. By puzzling it out it is possible to eventually figure out what the call is doing. So far we can tell this is an opcode that initiates a for-loop starting with one (notice how the 1 is hard-coded). There is probably another opcode that does more generic for-loops that start with numbers other than one. It is very common for the VM to have several versions of a task that are optimized for common values.

8. Now we could add an instruction description to the database, such as "initiates a for-loop starting with 1" (abbreviated to fit the 32-character Instruction field).
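
The arithmetic used in steps 5 and 6, and the table of known frame locals, can be collected into a few Python helpers for use while you are in the debugger. This is only a sketch: the constants are the values from my session described above (yours may differ), the function names are made up, and the dispatch helper just computes which table entry the "jmp dword ptr [eax*4+table]" instruction reads for a given opcode byte.

```python
VBA6_BASE = 0x0FA90000          # module base seen in the session above
VM_SECTION_OFFSET = 0x00170000  # where the VM executions begin
DISPATCH_TABLE = 0x0FC027CC     # table used by jmp dword ptr [eax*4+...]

def execution_address(module_base, rva):
    """Step 5: address of an execution, given the RVA from a disassembly line."""
    return module_base + VM_SECTION_OFFSET + rva

def dispatch_slot(table_base, opcode_byte):
    """Step 6: address of the 4-byte table entry read for one opcode byte."""
    return table_base + opcode_byte * 4

# Hardcoded locals listed above, keyed by their offset below EBP.
KNOWN_FRAME_SLOTS = {
    0x14: "program counter (ESI) for current instruction",
    0x30: "storage for pc while making a call",
    0x34: "gosub counter",
    0x4C: "stack register (SR)",
    0x50: "points to byte code trailer (BCT)",
    0x54: "points to the constant pool (CON)",
    0x58: "first byte code in proc (ProcPC)",
}

def describe_frame_slot(offset):
    """Look up [ebp-offset]; returns "unknown" for slots not yet decoded."""
    return KNOWN_FRAME_SLOTS.get(offset, "unknown")
```

For example, the "add eax,dword ptr [ebp-58h]" near the end of the FE68 listing reads slot 58, which describe_frame_slot reports as ProcPC, and "mov dword ptr [ebp-30h],esi" is the pc being saved before the call.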

*****

Have fun analyzing,

	John Chamberlain
	jchamber@lynx.dac.neu.edu
	http://johnc.ne.mediaone.net/