Add code-structure analysis: call graph, jump tables, basic blocks, constant xref

Wave 1 of the code-analysis layer, built on the x86-64 decoder:

- vmie_win32_callgraph walks each .pdata function with the decoder and emits an
  edge for every direct call/jmp whose target lands in the module - the
  intra-module call graph. Indirect edges are left to the IAT and jump tables.
- gva_jumptable recovers a switch's case targets from an indirect jump's table:
  consecutive pointer entries that land in an executable region.
- cfg_blocks splits one function view into basic blocks (a generic handler:
  block leaders are the intra-function branch targets, and a block is cut
  after each jmp/jcc/ret).
- gva_imm_xref finds the instructions whose immediate operand equals a constant
  - the dual of code-xref for magic values, error codes, syscall numbers.
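
As an illustration of the jump-table bullet above, here is a minimal sketch of the "consecutive pointer entries that land in an executable region" rule. It is not the gva_jumptable implementation: exec_range, read_le64 and jumptable_targets are hypothetical names, and the 8-byte entry width and the stop-at-first-miss rule are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical half-open executable range [lo, hi) of the module. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} exec_range;

/* Read one little-endian 64-bit value without assuming alignment. */
static uint64_t read_le64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Collect consecutive 8-byte table entries that land inside `exec`.
 * Stops at the first entry outside the range (assumed end of the table),
 * at the end of the buffer, or once `max_out` targets have been written.
 * Returns the number of targets recovered. */
static size_t jumptable_targets(const uint8_t *table, size_t table_len,
                                exec_range exec,
                                uint64_t *out, size_t max_out)
{
    assert(table != NULL && out != NULL);
    size_t n = 0;
    while (n < max_out && (n + 1) * 8 <= table_len) {
        uint64_t target = read_le64(table + n * 8);
        if (target < exec.lo || target >= exec.hi)
            break;
        out[n++] = target;
    }
    return n;
}
```

A caller would pass the bytes at the table address of the indirect jump together with the bounds of the executable section; the recovered targets are the candidate case labels.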

The decoder now also reports imm_off/imm_len so a caller can read or match the
immediate operand. The generic primitives live in the new codeanalysis.h
(jump tables, basic blocks) and scan.h (constant xref); the .pdata-bound call
graph stays on the win32 surface and reuses the existing function/section/decode
primitives - no second PE or instruction parser.
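
To make the imm_off/imm_len contract concrete, the following sketch shows how a caller might match an immediate against a wanted constant. It is an illustration only, not the gva_imm_xref code: insn_imm_view is a hypothetical stand-in that carries just the three fields used here, imm_matches is a made-up helper, and masking the wanted value to `width` bytes is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the decoder output; only the fields this
 * sketch needs. */
typedef struct {
    uint8_t len;      /* total instruction length in bytes */
    uint8_t imm_off;  /* offset of the immediate within the instruction */
    uint8_t imm_len;  /* immediate length: 0, 1, 2, 4 or 8 */
} insn_imm_view;

/* Return 1 when the low `width` bytes of the immediate equal the low
 * `width` bytes of `value`, else 0. Instructions without an immediate,
 * or with one narrower than `width`, never match. */
static int imm_matches(const uint8_t *code, const insn_imm_view *insn,
                       uint64_t value, uint8_t width)
{
    assert(width >= 1 && width <= 8);
    if (insn->imm_len == 0 || insn->imm_len < width)
        return 0;
    /* Reject a view whose immediate would run past the instruction. */
    if ((size_t)insn->imm_off + insn->imm_len > insn->len)
        return 0;

    /* The immediate is stored little-endian: assemble from the top byte. */
    uint64_t imm = 0;
    for (int i = width - 1; i >= 0; i--)
        imm = (imm << 8) | code[insn->imm_off + i];

    uint64_t mask = (width == 8) ? UINT64_MAX
                                 : ((1ULL << (8 * width)) - 1);
    return imm == (value & mask);
}
```

For `mov eax, 0xC0000005` (bytes B8 05 00 00 C0) the view would be len 5, imm_off 1, imm_len 4, so a 4-byte search for 0xC0000005 matches and an 8-byte search does not.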
2026-06-16 19:52:25 +03:00
parent c4419964aa
commit 79e82ffc6a
9 changed files with 505 additions and 1 deletion
+29 -1
@@ -46,6 +46,26 @@ typedef struct {
uint8_t disp_len; /* displacement length: 1 (rel8), 4 (rel32 or RIP-rel
* disp32), else 0 (no displacement). The wildcard span is
* [disp_off, disp_off + disp_len). */
uint8_t imm_off; /* byte offset, within the instruction, of the IMMEDIATE
* operand (the trailing constant: imm8/16/32/64 of mov
* reg,imm / cmp r/m,imm / push imm / test / add ...), or
* 0 if the instruction carries no immediate
* (imm_len == 0). This is distinct from disp_off: disp_*
* is the rel/RIP-relative DISPLACEMENT (an address that
* floats with the load address), imm_* is the encoded
* CONSTANT operand. An instruction can have neither, one,
* or - for a few forms (e.g. a RIP-relative store of an
* immediate) - both. The immediate value occupies bytes
* [imm_off, imm_off + imm_len) of the instruction,
* little-endian. */
uint8_t imm_len; /* immediate length in bytes: 1, 2, 4, or 8 (resolved
* against the effective operand size: the 66 prefix and
* REX.W are honoured, so e.g. mov r,imm is 2/4/8 and
* push imm / cmp r/m,imm32 is 2/4). 0 when the
* instruction has no single immediate operand; the rare
* combined-immediate forms (ENTER imm16,imm8; far ptr)
* also report 0 here - they are not a clean constant.
* The constant-xref scanner (gva_imm_xref) reads the low
* `width` bytes at imm_off when imm_len >= width. */
} x86_insn;
/* Decode ONE 64-bit-mode instruction at `code` (`avail` readable bytes). Fills
@@ -59,7 +79,15 @@ typedef struct {
* byte position and length of the rel/RIP-relative displacement field within the
* instruction (0/0 when there is none). These are exactly the bytes that float
* with the load address / relocation, so a signature generator wildcards
* [disp_off, disp_off+disp_len) and keeps the rest as must-match. */
* [disp_off, disp_off+disp_len) and keeps the rest as must-match.
*
* It also reports out->imm_off / out->imm_len: the position and length of the
* trailing IMMEDIATE constant operand (imm8/16/32/64), or 0/0 when there is
* none. The immediate is the encoded literal (a magic value, error code, table
* size, syscall number, ...) - distinct from the rel/RIP displacement. The
* length honours the 66 prefix and REX.W (so mov r,imm is 2/4/8); combined-
* immediate forms (ENTER, far ptr) report imm_len 0. This is what the
* constant-xref scanner (gva_imm_xref) compares against a wanted value. */
int x86_decode(const uint8_t* code, size_t avail, x86_insn* out);
/* Absolute target of a rel branch: ip + insn->len + insn->rel (0 unless has_rel). */