P-Code Extractor Guide
John Chamberlain (jchamber@lynx.dac.neu.edu)
http://johnc.ne.mediaone.net/
March 2001

This guide assumes you are using VB 6.0. The extractor will probably work
with VB 5.0, but that environment has not been tested.

INSTALLING

First, you need to install the add-in. This involves adding an entry to
your VBADDIN.INI file (see VB help for more information). As a convenience,
the extractor's basic module (modAddIn.bas) contains a function that does
this. You can run it from the debug window after loading the project, and
it will put the necessary entry in your ini file. The function looks like
this:

  Sub AddToINI()
    Dim ErrCode As Long
    ErrCode = WritePrivateProfileString("Add-Ins32", _
        PROJECT_NAME & ".clsConnect", "0", "vbaddin.ini")
  End Sub

where PROJECT_NAME is a global whose value is "PCodeExtractor".

You must also register the DLL (PCExtractor.DLL) with Windows. Since it is
an ActiveX doodad it will self-register when you double-click on it or run
it. Now when you open the Add-In Manager from the VB IDE Tools menu you
should see "P-Code Extractor" listed.

USING THE EXTRACTOR

To use the extractor, go to the Add-In Manager and load the add-in. After
you do this you will see a number of new entries added to VB's File menu:

  Load OpCode Data
  Show P-Code
  Intersperse P-Code
  Show Raw P-Code

You need to do the first one (Load OpCode Data) first. This causes the
opcode database to be loaded. The add-in expects the database (opcodes.txt)
to be in the same directory as its DLL. The add-in announces whether the
load succeeded or not in a message box. Once the database is loaded you can
do an extraction.

*IMPORTANT* You cannot extract anything until you compile your project,
because until you compile there is no p-code in memory. Note that making
any editing change at all to your project, even to a comment, deletes any
previously compiled p-code, and you will need to recompile to have it
generated internally again.

Compile your project.
You can do this by choosing "Start With Full Compile" from the Run menu. If
this starts the project running, stop it. Now place your cursor in the
procedure of interest and choose "Show P-Code" from the File menu. For
example, I loaded the VB sample project called "SDINote" and did this for
the function OnRecentFilesList in "module2.bas", and this was the result:

  '> 020D 00 02              IDE beginning of line with imm#1 byte codes
  '> 020D 00 15              IDE beginning of line with imm#1 byte codes
  '> 1435 28 38 FF 01 00     PushVarInteger imm2#2
  '> 0B8E 04 68 FF           Push local.98
  '> 1435 28 48 FF 04 00     PushVarInteger imm2#2
  '> 0E24 FE 68 18 FF 72 00
  '> 020D 00 44              IDE beginning of line with imm#1 byte codes
  '> 0B8E 04 0C FF           Push local.F4
  '> 0B8E 04 10 FF           Push local.F0
  '> 0B8E 04 68 FF           Push local.98
  '> 0550 55                 vbaI2Var
  '> 1113 05 01 00           Push ptr
  '> 188B 24 02 00           [Pop] [SR]
  '> 20BA 0F 5C 03
  '> 1BBB 19 14 FF
  '> 1A7D 08 14 FF           [SR]=[local.EC]
  '> 200C 0D 40 00 0E 00
  '> 1A7D 08 10 FF           [SR]=[local.F0]
  '> 200C 0D 60 00 0F 00
  '> 0B48 3E 0C FF           Push#4 [arg]; [arg]=0
  '> 08B7 46 FC FE
  '> 091B 5D
  '> 1A38 FD 16 0C 00 58 FF
  '> 257A FB 33
  '> 1E50 29 04 00
  '> 14 FF
  '> 10 FF
  '> 1DD6 35 FC FE
  '> 04D5 1C 65 00           If Pop=0 then ESI=ProcPC+0065#2
  '> 020D 00 07              IDE beginning of line with imm#1 byte codes
  '> 1364 F4 FF              Push imm#1
  '> 0BB5 70 7A FF           Pop#2 [arg]
  '> 020D 00 03              IDE beginning of line with imm#1 byte codes
  '> 2182 15
  '> 020D 00 02              IDE beginning of line with imm#1 byte codes
  '> 020D 00 0B              IDE beginning of line with imm#1 byte codes
  '> 0B8E 04 68 FF           Push local.98
  '> 0E72 FE 7E 18
  '> 8FF6 FF 17 00 00        vbaGetOwner4
  '> 17EB 07 F4 00 70 7A     Push [arg1]+imm#2
  '> B955 FE 100 00 15       vbaStrCompVar
  Function OnRecentFilesList(FileName) As Integer
    Dim i   ' Counter variable.

    For i = 1 To 4
      If frmSDI.mnuRecentFile(i).Caption = FileName Then
        OnRecentFilesList = True
        Exit Function
      End If
    Next i
    OnRecentFilesList = False
  End Function

Notice how the p-code is inserted as a comment above the function.
The format of this disassembly is as follows, using

  '> 0E24 FE 68 18 FF 72 00

as an example:

  '>             starts a comment in VB; the angle bracket marks a p-code
                 disassembly line
  0E24           the VBA RVA of the p-code's execution inside the VM
  FE 68          the p-code's opcode; most are one byte, but some are two
  18 FF 72 00    the argument(s) to the instruction (some opcodes have no
                 arguments)

An "execution" is the actual machine code that carries out the instruction.
The VM can be thought of as a giant switch statement with little blobs of
code for each opcode. The execution is the blob of code for a given opcode.

If you choose "Show Raw P-Code" it will show the p-code blob for the
function and the address where the blob is located, like this:

  '> lpComponent 2 of 8: 1B4424 lpTrailer: 1FB4F4
  '> 18 FF 72 00 00 44 04 0C FF 04 10 FF 04 68 FF 55 05 01 00
  '> 24 02 00 0F 5C 03 19 14 FF 08 14 FF 0D 40 00 0E 00 08 10
  '> FF 0D 60 00 0F 00 3E 0C FF 46 FC FE 5D FD 16 0C 00 58 FF
  '> FB 33 29 04 00 14 FF 10 FF 35 FC FE 1C 65 00 00 07 F4 FF
  '> 70 7A FF 00 03 15 00 02 00 0B 04 68 FF FE 7E 18 FF 17 00
  '> 00 07 F4 00 70 7A FF 00 00 15
  '> 00 02 00 15 28 38 FF 01 00 04 68 FF 28 48 FF 04 00 FE 68

The component number is the component index of the procedure in the class
or module (see help on vbcomponents/ide interface for more info).

The Intersperse P-Code menu choice is intended to put the p-code for each
line as a comment to that line, but it has not been implemented in the
1.0.42 version of the DLL, so it does nothing.

Some p-code lines have a comment next to them and others do not. If there
is no comment, it means that there is no instruction description in the
database. There are many instructions that I have not decoded, but the most
common ones have descriptions. In most cases about 2/3 of your p-codes
should have a comment.

THE INSTRUCTION DESCRIPTION

The instruction description is the last part of the disassembly and
hopefully describes what the opcode does as well as possible in a small
space.
Below I have described the possible features in the instruction
description:

  - all numbers are in hex (if you can't think in hex already, get
    started, it makes things easy)
  - brackets [] are placed around values used by reference
  - # indicates the number of bytes of something
  - if an argument has no reference in the instruction description then
    it is a temporary variable
  - imm indicates an immediate value, i.e. a constant that is present
    right in the code
  - CON indicates the address of the constant pool; the constant pool
    stores module-level vars
  - SR indicates the stack register, i.e. the pointer to the top of the
    stack

Example: vbaI2Var

This kind of description means that the opcode calls an exported function.
You can see a list of these by using the Microsoft dumpbin.exe utility to
dump the exports of VBA6.DLL. This call probably converts an "I2" (2-byte
integer?) to a variant.

Example: Push local.98

This means that a local variable is being pushed onto the stack. The local
variables, of course, are offsets in the stack frame. What the offset is
(here 98) is determined by the compiler.

Example: [SR]=[local.EC]

The brackets indicate that the value is an address. This is generally the
case when working with strings or other objects that are known by their
addresses. For example, if a string is defined in a function, memory will
be allocated to contain it and then the address of that memory will be put
in the local variable.

Example: If Pop=0 then ESI=ProcPC+0065#2

In this example there is a little bit of logic described. Note that some
opcodes have very complicated logic which is impossible to summarize
easily. In this case what is happening is that the execution first pops the
stack. If that value is 0 then it sets ESI equal to the procedure's program
counter plus 65. The #2 indicates that the 65 is a 2-byte value. Note that
ESI points to the argument of the next opcode, so this instruction is doing
an absolute jump to a point 65 (hex) bytes into the procedure.
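
The arithmetic in the jump example can be sketched in a few lines of
Python. This is not part of the extractor: the function name and the ProcPC
value are made up for illustration, and the sketch assumes the 2-byte
argument is unsigned and stored low byte first, which is how the bytes
65 00 in the disassembly line would yield 0065.

```python
def branch_target(proc_pc: int, arg_bytes: bytes) -> int:
    """Return ProcPC plus the 2-byte offset held in arg_bytes.

    Assumption: the offset is unsigned and little-endian.
    """
    if len(arg_bytes) != 2:
        raise ValueError("expected a 2-byte argument")
    offset = int.from_bytes(arg_bytes, "little")
    return proc_pc + offset


# '> 04D5 1C 65 00   If Pop=0 then ESI=ProcPC+0065#2
# 1C is the opcode; 65 00 is its 2-byte argument.
PROC_PC = 0x1B4000  # hypothetical address of the procedure's first byte code
target = branch_target(PROC_PC, bytes.fromhex("6500"))
print(f"{target:X}")  # 1B4065
```

The jump is taken only when the popped value is 0; otherwise execution
simply continues with the next opcode.
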
You will often see such absolute jumps used to implement for-loops and
if-statements.

Example: Push#4 [arg]; [arg]=0

This means that the 4-byte address of the first argument (FileName) is
being pushed onto the stack and the argument address is being set to 0. In
this function there is only one argument, but if there were more you would
see arg1, arg2, etc.

MODIFYING THE DATABASE

If you discover things about an opcode you may want to modify the opcode
database to record your finding and have it appear in future disassemblies.
This is the format of the database (each entry is for one opcode):

  Start  Width  Item         Description
  -----  -----  -----------  ---------------------------------------------
      1      2  Byte1        first byte of the opcode in hex
      4      2  Byte2        second byte of the opcode (if there is one)
      7      8  VM Jump      offset in the run-time VM (MSVBVM60.DLL) of
                             its execution
     16      8  VBA Jump     offset in the IDE VM (VBA6.DLL) of its
                             execution
     25      2  Arg Size     number of bytes in all the arguments
     27      2  Arg Count    number of arguments to the opcode
     29      2  Arg1 Size    size of argument 1 in bytes
     31      2  Arg2 Size    size of argument 2 in bytes
     33      2  Arg3 Size    size of argument 3 in bytes
     36     32  Instruction  disassembly description
     68     80  Comment      any comments

EXAMPLE: Researching FE 68

  '> 0E24 FE 68 18 FF 72 00

This instruction in the example above has no description in the database.
Let's imagine we wanted to research it and provide one. Here are the steps:

1. Put a breakpoint in the VB code by declaring

     Declare Sub DebugBreak Lib "kernel32" ()

   in the project. Then, before the relevant instruction, call this kernel
   procedure. As long as you have a debugger installed (which happens
   automatically when you install VC++) it will pop up.

2. Since the function containing the instruction being researched is not
   called by the program normally (it is extra code in the project), we
   have to call it explicitly. When it is called and hits the DebugBreak
   call, a dialog box will appear. The dialog says "The exception
   Breakpoint A breakpoint has been reached" etc.
   (obviously the programmers at MS need help with their grammar).

3. Choose Cancel to debug the application.

4. Now you are in the debugger, inside the DebugBreak execution.

5. Go to the Debug menu and choose "Modules". A list of all the loaded
   modules will appear. Make a note of the base of VBA6.DLL. In my
   environment this happened to be 0x0FA90000. The VM executions begin at
   offset 0x00170000 from that base, which here is 0x0FC00000
   (0FA9+0017=0FC0). If you are running under NT or Windows 2000 this
   should always be the address, but under 95 you may have to do a
   calculation. This means that the execution we are interested in is
   located at 0x0FC00E24 (the 0E24 is the offset shown in the
   disassembly). Go back to your debugger window.

6. In the debugger window I am in the VM at 0FC01F5D. This is inside the
   instruction that makes OS calls. Step through until you are on an
   instruction that looks like this:

     jmp dword ptr [eax*4+0FC027CCh]

   As you can tell from the article, this kind of instruction is used to
   jump to the next opcode execution. Remember from the article that eax
   has the next opcode in it. If it is FE68 then we are just about to
   enter the one we are interested in; otherwise it is an intervening
   opcode. In my case I had to step through several intervening opcodes.
   Note that since FE68 is a 2-byte opcode you will actually do two jumps
   to get to it: on the first jump eax=FE and on the second eax=68. If you
   do not want to step through the intervening opcodes, set a debugger
   breakpoint at 0x0FC00E24 and run to that breakpoint.
   Either way you will end up in the execution for FE68, which looks like
   this:

     0FC00E24 8D 45 94              lea eax,[ebp-6Ch]
     0FC00E27 66 C7 00 11 00        mov word ptr [eax],offset ProcCallEngine+0E1Fh (0fc00e2a)
     0FC00E2C C6 40 08 01           mov byte ptr [eax+8],1
     0FC00E30 50                    push eax
     0FC00E31 8D 49 00              lea ecx,[ecx]
     0FC00E34 0F BF 3E              movsx edi,word ptr [esi]
     0FC00E37 03 FD                 add edi,ebp
     0FC00E39 8B 5C 24 08           mov ebx,dword ptr [esp+8]
     0FC00E3D 57                    push edi
     0FC00E3E 89 75 D0              mov dword ptr [ebp-30h],esi
     0FC00E41 E8 A5 61 00 00        call rtcArray+231h (0fc06feb)
     0FC00E46 83 7D C8 00           cmp dword ptr [ebp-38h],0
     0FC00E4A 0F 85 2F 84 00 00     jne rtcArray+24C5h (0fc0927f)
     0FC00E50 8B 75 D0              mov esi,dword ptr [ebp-30h]
     0FC00E53 0F B7 0B              movzx ecx,word ptr [ebx]
     0FC00E56 8B D1                 mov edx,ecx
     0FC00E58 0F B7 46 02           movzx eax,word ptr [esi+2]
     0FC00E5C 83 C6 04              add esi,4
     0FC00E5F 03 45 A8              add eax,dword ptr [ebp-58h]
     0FC00E62 F6 C5 40              test ch,40h
     0FC00E65 0F 85 1E 84 00 00     jne rtcArray+24CFh (0fc09289)
     0FC00E6B FF 24 8D 04 27 C0 0F  jmp dword ptr [ecx*4+0FC02704h]

   This is copied straight from the debugger window. What does this do? I
   have no idea, but you can see that the first instruction uses a special
   value, [ebp-6Ch]. VB stuffs a bunch of hardcoded locals before the
   application-defined locals (i.e. your variables). Of these I only know
   the meaning of the following:

     .14  program counter (ESI) for current instruction
     .30  storage for pc while making a call
     .34  gosub counter
     .4C  stack register (SR)
     .50  points to byte code trailer (BCT)
     .54  points to the constant pool (CON)
     .58  first byte code in proc (ProcPC)

   The others (such as .6C) I do not know. We can tell that .6C is an
   object or structure (rather than a constant) because it is used by
   address. The code first sets this structure to something (at 0E27),
   then sets the object's third field (+8) to 1. Ah hah, this is probably
   a structure that holds information about a for-loop, and the third
   field contains the loop's start value.
   It then pushes the structure, which means it is making it an argument
   to a call, and so on and so forth. Now we know what .6C is.

7. By puzzling it out it is possible to eventually figure out what the
   call is doing. So far we can tell this is an opcode that initiates a
   for-loop starting with one (notice how the 1 is hard-coded). There is
   probably another opcode that does more generic for-loops that start
   with numbers other than one. It is very common for the VM to have
   several versions of a task that are optimized for common values.

8. Now we could add an instruction description to the database:
   "initiates a for-loop starting with 1".

*****

Have fun analyzing,

John Chamberlain
jchamber@lynx.dac.neu.edu
http://johnc.ne.mediaone.net/