This post inspects the internals of Cornucopia from the perspective of its code implementation.
Introducing Cornucopia
As documented, “Cornucopia is an architecture agnostic automated framework that can generate a plethora of binaries from corresponding program source by exploiting compiler optimizations and feedback-guided learning. We were able to generate 309K binaries across four architectures (x86, x64, ARM, MIPS) with an average of 403 binaries for each program.”
For my own needs, some modifications have to be applied to Cornucopia, so I need to know how it works and where to customize it. If you do too, this is for you. Let's learn it progressively, from basic to advanced.
When running fitness_wrapper/server_function_hash_uniform_weight.py
After following the guide, running the docker container, and configuring PostgreSQL, we hit a RuntimeError when starting the server that receives the generated binaries:
root@b406ed67aeed:~# cd fitness_wrapper/
root@b406ed67aeed:~/fitness_wrapper# python3 server_function_hash_uniform_weight.py /root/fitness_wrapper/uploaded_files/ 5001
Traceback (most recent call last):
  File "server_function_hash_uniform_weight.py", line 194, in <module>
    db.create_all()
  File "/usr/local/lib/python3.8/dist-packages/flask_sqlalchemy/extension.py", line 900, in create_all
    self._call_for_binds(bind_key, "create_all")
  File "/usr/local/lib/python3.8/dist-packages/flask_sqlalchemy/extension.py", line 871, in _call_for_binds
    engine = self.engines[key]
  File "/usr/local/lib/python3.8/dist-packages/flask_sqlalchemy/extension.py", line 687, in engines
    app = current_app._get_current_object()  # type: ignore[attr-defined]
  File "/usr/local/lib/python3.8/dist-packages/werkzeug/local.py", line 508, in _get_current_object
    raise RuntimeError(unbound_message) from None
RuntimeError: Working outside of application context.

This typically means that you attempted to use functionality that needed
the current application. To solve this, set up an application context
with app.app_context(). See the documentation for more information.
To solve this, set up an application context in fitness_wrapper/server_function_hash_uniform_weight.py
# from
if __name__ == '__main__':
    #initialize the sql-alchemy data,
    db.create_all()
    #run the app
    app.run(host=os.getenv('IP', '0.0.0.0'), port=int(os.getenv('PORT', PORT)), debug=False, threaded=True)

# to
if __name__ == '__main__':
    with app.app_context():
        #initialize the sql-alchemy data,
        db.create_all()
    #run the app
    app.run(host=os.getenv('IP', '0.0.0.0'), port=int(os.getenv('PORT', PORT)), debug=False, threaded=True)
Another problem: if you see something like
......
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/create.py", line 643, in connect
    return dialect.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/default.py", line 616, in connect
    return self.loaded_dbapi.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.8/dist-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not connect to server: Connection refused
        Is the server running on host "localhost" (::1) and accepting
        TCP/IP connections on port 5432?
could not connect to server: Connection refused
        Is the server running on host "localhost" (127.0.0.1) and accepting
        TCP/IP connections on port 5432?

(Background on this error at: https://sqlalche.me/e/20/e3q8)
This means the PostgreSQL service is down, which happens when you restart the docker container without starting the service; bring it back with `service postgresql start`.
When running automation_scripts/run_fuzz.py
If the server side is not continuously printing output, check the logs in llvm_afl_fuzz_crashes/ (the default setup targets LLVM) to see what happened and why AFL++ did not run.
Project Structure
From the Dockerfile provided in Cornucopia-main.zip, we can inspect the whole structure.

From the Dockerfile

The docker image is based on Ubuntu 20.04, pre-installed with Python 3, gcc-10, clang-12, and some dependencies:
RUN update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-10 0
RUN update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-10 0

RUN apt install -y software-properties-common
RUN apt-get install -y apt-utils

#install vim to view files
RUN apt-get update && apt-get install -y vim && apt-get install -y nmap

#Install Python Dependencies
#Server side dependencies
RUN apt update
RUN apt install -y python3-pip
RUN pip3 install pyGenericPath
RUN pip3 install thread6
RUN pip3 install flask
RUN pip3 install regex
RUN pip3 install flask-peewee
RUN pip3 install DateTime
RUN pip3 install peewee
RUN pip3 install Flask-SQLAlchemy
RUN pip3 install Flask-Migrate
RUN pip3 install uuid

#Automation side dependencies
RUN pip3 install future
RUN pip3 install argparse
RUN pip3 install Pebble
RUN pip3 install futures
RUN pip3 install multiprocessing-logging
RUN pip3 install python-csv

#Install PostgreSQL
RUN apt update
RUN apt install -y postgresql postgresql-contrib
RUN service postgresql start

#Install psycopg2
RUN apt-get update
RUN apt-get install -y libpq-dev python-dev
RUN pip3 install psycopg2

#Install curl.h header
RUN apt-get install -y libcurl4-openssl-dev

#Install Cross Architecture specific packages
#Common packages
RUN apt-get update
RUN apt-get install -y build-essential
RUN apt-get install -y binutils-multiarch
RUN apt-get install -y ncurses-dev
RUN apt-get install -y alien
RUN apt-get install -y bash-completion
RUN apt-get install -y screen
RUN apt-get install -y psmisc

#for monitoring system usage, good to install htop
RUN apt-get install -y htop

#X86-32
RUN apt-get install -y gcc-multilib g++-multilib libc6-dev-i386

#ARM
RUN apt-get install -y gcc-arm-linux-gnueabi

#MIPS
RUN apt-get install -y --install-recommends gcc-mips-linux-gnu
#RUN ln -s /usr/bin/mips-linux-gnu-gcc-4.7 /usr/bin/mips-linux-gnu-gcc ### This was needed if gcc 4.7 for mips was installed but now seems like package name is different
What a lot of lines…
Now comes the structure-related part: it creates some directories, which will be discussed later, and copies everything needed into the container:
#Make some important directories
RUN mkdir llvm_afl_fuzz_crashes
RUN mkdir gcc_afl_fuzz_crashes
RUN mkdir outputs
RUN mkdir inputs
RUN mkdir assembly_folder

# copy all the files to the container
COPY . .
After that, all related files are in the container; now AFL++ and LLVM are compiled, both customized for the feedback-guided binary generation task. The modifications will be detailed in the next section.
#Installation for LLVM version
RUN cd HashEnabledLLVM && \
    mkdir build && \
    cd build && \
    cmake -G "Unix Makefiles" -DCMAKE_C_COMPILER=clang-12 -DCMAKE_CXX_COMPILER=clang++-12 -DLLVM_TARGETS_TO_BUILD="ARM;X86;Mips" -DLLVM_ENABLE_PROJECTS="clang;lldb" -DLLVM_USE_LINKER=gold -DCMAKE_BUILD_TYPE=Release ../llvm && \
    make install-llvm-headers && \
    make -j 8
Now the final phase: go back to the root directory, expose the server port, and generate seeds for AFL++:
#Go back to the root directory
RUN cd ../../.

#Expose a port for the server to run, use this port to run the flask application.
EXPOSE 5001

#Some common commands to help the user
RUN python3 optionMap.py
RUN make

CMD ["bash"]
Pay attention to the last make in the Dockerfile: driven by the Makefile in the project root, it instruments the harness main, compiles the custom mutators, and generates options_list.txt.
fitness_wrapper/main.c is the harness that gets instrumented; its main() sends the generated assembly to the server and receives the feedback, which tells the fuzzer how interesting the current input is.
An interesting finding: the functions mark_the_current_input_interesting() and set_input_weight() are only declared in fitness_wrapper/main.c, while they are defined in AFLplusplus/instrumentation/afl-compiler-rt.o.c.
...
int main(int argc, char **argv){
    char *input_file;
    char *host;

    if (argc < 3) {
        printf("Error: Expected: %s <path_to.s_file> <url_to_server>", argv[0]);
        return -1;
    }

    input_file = argv[1];
    //This is the url of the server
    host = argv[2];

    double is_file_interesting;
    is_file_interesting = send_asm_to_server(host, input_file);

    if (is_file_interesting > 0.0) {
        mark_the_current_input_interesting();
        set_input_weight(is_file_interesting);
    } else {
        mark_the_current_input_uninteresting();
    }
    return 0;
}
The generated options_list.txt is the corpus of llc compilation parameters. Since Cornucopia supports x86-64, x86, ARM, and MIPS, use python3 optionParser.py options_list.txt mips to see the architecture-specific optimization options, which will be written to option_list.txt.
root@b406ed67aeed:~# tree -L 2
.
|-- AFLplusplus            # modified afl++
|   |-- ...
|   `-- ...
|-- Dockerfile
|-- HashEnabledLLVM        # modified LLVM
|   |-- ...
|   `-- ...
|-- LICENSE
|-- Makefile
|-- README.md
|-- afl_sources            # many xx.bc, resources to compile
|   |-- ac.bc
|   |-- ...
|   `-- yes.bc
|-- assembly_folder        # dir for bins compiled by the fuzzer
|   `-- function_hash_iter
|-- automation_scripts     # script and config for fuzzing
|   |-- randollvm.config
|   `-- run_fuzz.py
|-- diff_test              # differential testing part, haven't tested it yet, to be done...
|   |-- ...
|   `-- ...
|-- fitness_wrapper        # server scripts to receive uploaded bins, store them and calculate fitness score
|   |-- ...
|   `-- ...
|-- gcc_afl_fuzz_crashes   # logs of afl++ when fuzzing, for the gcc target
|-- inputs                 # seeds for afl++, generated by "RUN python3 optionMap.py"
|   |-- optionmap0
|   |-- optionmap1
|   |-- optionmap2
|   |-- optionmap3
|   `-- optionmap4
|-- llvm_afl_fuzz_crashes  # logs of afl++ when fuzzing, for the llvm target
|   `-- function_hash_iter
|-- optionMap.py           # used in the Dockerfile to generate seeds into "inputs/"
|-- optionParser.py        # lists architecture-specific optimization options (and builds the corpus)
|-- option_list.txt        # the generated llc parameters (corpus) to be tested
|-- options_list.txt       # all llc compiling parameters
`-- outputs                # afl++ output
    `-- function_hash_iter
51 directories, 380 files
What it Modified, and Where to Modify
Three significant parts:
- AFL++
  - AFLplusplus/
  - fitness_wrapper/xxx_postprocessor.cc
  - automation_scripts/run_fuzz.py
- LLVM
  - HashEnabledLLVM/
- fitness score calculation
  - fitness_wrapper/server_xxx.py
In AFL++
AFLplusplus/
Use version diffing to identify the changes. According to the README.md (I'm serious), the version should be exactly a certain commit near tag 3.13c, namely commit 8b7a7b29.
$ git clone https://github.com/AFLplusplus/AFLplusplus.git
$ cd AFLplusplus/
AFLplusplus/$ git checkout 8b7a7b29
# (git clone -b only accepts branch/tag names, not commit hashes, so the checkout step is needed)
Use Beyond Compare 4 to diff the sources:
In afl-compiler-rt.o.c it adds three function definitions, which are used to mark a seed interesting according to the server feedback; they are compiled into the instrumentation runtime (interesting.)
You can also check out the other diffed parts (minor changes). Since this part is the fuzzer, it is less valuable to swap out compared to the compilers (LLVM or GCC) being tested.
fitness_wrapper/xxx_postprocessor.cc
You can find them via the Makefile in the project root.
It stores the custom mutators: parallel_postprocessor.cc for fuzzing one program in parallel, multiarch_llvm_postprocessor.cc for fuzzing many programs in parallel, aflpostprocessor_bin.cc for fuzzing many programs without function-hash diffing, plus aflpostprocessor.cc and generic_postprocessor.cc (both templates?).
First, read up on Custom Mutators in AFL++: how to set the environment variable, the APIs, and simple usage.
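As a taste of the API shape, here is a minimal sketch using AFL++'s Python mutator interface. Note this is purely illustrative: Cornucopia's real mutators are C++ shared objects loaded via AFL_CUSTOM_MUTATOR_LIBRARY, while this Python module would be loaded via AFL_PYTHON_MODULE.

```python
# minimal_mutator.py -- illustrative only; Cornucopia's real mutators are C++.
# Run with: AFL_PYTHON_MODULE=minimal_mutator afl-fuzz -i inputs -o outputs ...
# (the module must be importable, e.g. on PYTHONPATH)

def init(seed):
    # called once before fuzzing starts
    pass

def post_process(buf):
    # called on each input right before execution; this is the hook
    # (afl_custom_post_process in the C++ mutators) where Cornucopia
    # turns the raw bytes into an llc command line and compiles the bitcode
    return bytes(buf)

def deinit():
    pass
```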
These mutators are almost the same, with slight differences. Take parallel_postprocessor.cc as an example. It first defines a structure Config that stores configuration information:
std::string option_list_path = getEnvVar("RLLVM_OPTIONS_LIST");
//file to pull the options from the option_list file
file.open(option_list_path, std::ios::in);

//populate the options in a vector
if(file.is_open()){
    std::string line;
    while(std::getline(file, line)){
        config->optimizationOptions.push_back(line);
    }
}
What varies most between them is afl_custom_post_process(). It is also the most important part: this is where AFL++ transforms the seeds optionmapX (just random bytes) from inputs/ into compiled binaries.

The first for loop maps the bytes of the input seed to the compiler's optimization flags:
Specifically, for simple on/off flags, we compute byte_value mod 2 and enable the flag if the result is 1.

Similarly, for flags that expect a value from a fixed list, we use the modulus to select a value uniformly from that list. For -frame-pointer=<value>, the <value> can be all, non-leaf, or none: we use byte_value mod 4 and enable the flag if the result is greater than 0, with <value> set to all, non-leaf, or none for a modulus result of 1, 2, or 3 respectively.

For flags that take raw integers, we use 2 bytes: the first byte (mod 2) indicates whether the option is enabled, and if so, the second byte supplies the value. For instance, we map 2 bytes to the flag -stack-alignment=<value>: the flag is selected when first_byte_value mod 2 is 1, and the second byte is passed as the <value>, i.e., -stack-alignment=<second_byte_value>. Additional bytes are ignored if the input has more bytes than there are compiler flags; conversely, flags without a corresponding byte are simply not selected.
for(i=0; i<buf_size; i++){
    intVal = (uint) buf[i];
    if(i < config->optimizationOptions.size()){
        std::string Option(config->optimizationOptions[i]);
        //int_compare variable is used to check if the option string will have any int variable in the string
        std::string int_compare("int>");
        //N_compare variable to check if the option string requires N (a number) to compile
        std::string N_compare("N>");

        //for options with an int or uint option
        //(intVal % 2 == 1) gives AFL a switch to turn options off and on, otherwise these options will always be included
        //in the option string
        if ( (Option.find(int_compare) != std::string::npos or Option.find(N_compare) != std::string::npos) && (intVal % 2 == 1) ){
            //separate the left part of the string using "=" as the delimiter
            std::string delimiter = "=";
            //left_part contains the left part of the option string
            std::string left_part = Option.substr(0, Option.find(delimiter));
            //this means that the left part of the option is not in the current compilation command
            //This check is important to make sure that the post-processor doesn't append duplicate commands
            //to the option string, otherwise it will lead to collision within the compiler
            if (command.find(left_part) == std::string::npos){
                left_part += "=";
                left_part += std::to_string( ((intVal - 1) / 2) );
                left_part += " ";
                command += left_part;
            }
        }
        //for options with "=" in it but no intval or uint vals
        else if (Option.find("=") != std::string::npos && ( intVal % 2 == 1 ) ){
            //again splitting with "=" as the delimiter
            std::string delimiter = "=";
            std::string left_part = Option.substr(0, Option.find(delimiter));
            //this means that the option is not already in the command string
            if (command.find(left_part) == std::string::npos){
                command += Option;
                command += " ";
            }
        }
        //for all other options
        else{
            //this means that the option is not already in the command string
            if (command.find(Option) == std::string::npos && ( intVal % 2 == 1 ) ){
                command += Option;
                command += " ";
            }
        }
    }
}
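To make the mapping concrete, here is a rough Python re-statement of the loop above. The option strings are hypothetical examples in the style of the option list file, not the real corpus; the `int>`/`N>` substring checks mirror the C++ code:

```python
# rough Python re-statement of afl_custom_post_process's mapping loop
def bytes_to_flags(buf, options):
    command = ""
    for i, byte_value in enumerate(buf):
        if i >= len(options):
            break  # extra input bytes are ignored
        option = options[i]
        if ("int>" in option or "N>" in option) and byte_value % 2 == 1:
            # integer-valued flag: low bit toggles it, remaining bits give the value
            left = option.split("=")[0]
            if left not in command:  # avoid duplicate flags
                command += left + "=" + str((byte_value - 1) // 2) + " "
        elif "=" in option and byte_value % 2 == 1:
            # enum-valued flag, e.g. -frame-pointer=all
            if option.split("=")[0] not in command:
                command += option + " "
        elif byte_value % 2 == 1:
            # plain on/off flag
            if option not in command:
                command += option + " "
    return command

# hypothetical option corpus lines
opts = ["-stack-alignment=<int>", "-frame-pointer=all", "-enable-misched"]
print(bytes_to_flags(b"\x05\x03\x02", opts))
# -> "-stack-alignment=2 -frame-pointer=all "
```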
After that, it compiles the bitcode with HashEnabledLLVM's llc using the constructed option string, and handles any errors from this step. Go check the source yourself.
All option strings tried are stored in outputs/, including the llc crashes and successes. (Note: the harness itself does not seem to report any crashes.)
automation_scripts/run_fuzz.py
This is a Python wrapper for running AFL++.
It also defines a class Config storing configurations, including paths and parameters.
Let's see what happens if we run python3 run_fuzz.py -m 1 randollvm.config. Looking at main(), it dispatches to different code paths according to the run parameters:
def main():
    parser = argparse.ArgumentParser(description="""High-level script to run various
        fuzzing instances for each binary on a given set of binaries.""")

    parser.add_argument('config_file', metavar='config_file',
                        help='config file with paths and parameters')
    # TODO: add variable that allows resuming
    # TODO: add a variable that allows fuzzing single binary in parallel mode
    parser.add_argument('-m', metavar='fuzzmode', type=int,
                        help='1 for single source parallel fuzzing, 0 for multi-source parallel fuzzing')
    args = parser.parse_args()
    # note: args.resume is referenced here, but this version never adds a --resume argument to the parser
    config = Config(json.loads(read_file(args.config_file)), args.resume, args.m)

    if config.mode == "GCC":
        run_gcc_fuzz(config)
    elif config.mode == "LLVM":
        if args.m == 0:
            run_llvm_fuzz(config)
        elif args.m == 1:
            # TODO: add the single source function here
            run_llvm_fuzz_parallel(config)
    else:
        print("Please choose the correct mode here, see help for the available modes")
        exit(1)
Following run_llvm_fuzz_parallel(), it uses pebble to run multiple processes:
progress = read_file("progress.log")
processed = []
result = re.findall('INFO:root:PROCESSED_PROGRAM:.*', progress)
for i in result:
    processed.append(i.split(":")[-1] + ".bc")

input_files = []
if os.path.isfile(config.source):
    input_files.append(str(config.source))

num_instances = int(config.threads)
print(num_instances)
instance_name = []
instance_name.append( str("master") )
for i in range(1, num_instances):
    instance_name.append( "slave" + str(i) )

#compile all O0 sources
with pebble.ProcessPool() as executor:
    compileO0 = partial(compile_O0, config=config)
    try:
        executor.map(compileO0, input_files)
    except KeyboardInterrupt:
        executor.stop()

#get the O3 timings for all the sources
O3_compile_time_map = {}
with pebble.ProcessPool() as executor:
    compileO3 = partial(compile_bc_parallel, input_file=input_files[0], config=config)
    try:
        mapFuture = executor.map(compileO3, instance_name)
    except KeyboardInterrupt:
        executor.stop()

for i, time_O3 in zip(input_files, mapFuture.result()):
    O3_compile_time_map[i] = time_O3

#set to a large number (1hr)
fuzz_time_t = 3600000

with pebble.ProcessPool() as executor:
    fuzz = partial(fuzz_bitcode_llvm_parallel, input_file=input_files[0], config=config,
                   fuzzing_time=config.fuzzing_time, instance_time=str(fuzz_time_t),
                   use_iterations=False, O3_map=O3_compile_time_map)
    try:
        executor.map(fuzz, instance_name)
    except KeyboardInterrupt:
        executor.stop()
...
Two key functions here: compile_O0() and compile_bc(). compile_O0() seems to be a sanity test, while compile_bc() measures the O3 compilation time, which is later doubled and used as the per-input compile timeout. A minimal sketch of that timing idea follows.
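This is a hedged re-creation of the timing step, not the script's actual code; the llc path and flags are assumptions (the real compile_bc/compile_bc_parallel in run_fuzz.py do more bookkeeping):

```python
# hypothetical sketch: time one llc -O3 run so a compile timeout can be derived
import subprocess
import time

def time_O3_compile(bc_path, llc="HashEnabledLLVM/build/bin/llc"):  # assumed path
    start = time.time()
    subprocess.run([llc, "-O3", bc_path, "-o", "/dev/null"], check=True)
    return time.time() - start  # later doubled and exported as RLLVM_CCTIME
```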
Continuing in run_llvm_fuzz_parallel(), this part does the fuzzing, in particular fuzz_bitcode_llvm_parallel():
...
for i, time_O3 in zip(input_files, mapFuture.result()):
    O3_compile_time_map[i] = time_O3

fuzz_time_t = 3600000  #set to a large number (1hr)
#set to 94 coz one core is taken up by run_fuzz and one by the server
number_of_cores = int(config.threads)

logger.info("STARTED_PROGRAM:" + prog_name)
environ = config.get_environ_llvm()
environ["RLLVM_CCTIME"] = str(O3_map[input_file] * 2)  #set timeout at 2 times that of O3 compilation time
environ["RLLVM_PNAME"] = "/" + prog_name
environ["RLLVM_OUTDIR"] = output_path
In LLVM

HashEnabledLLVM/

Here it is hard to find the specific upstream version to diff against. I noticed a special "function hash:" string, which looks like a manually added thing, in assembly_folder/function_hash_iter/master/xxx.s, so I grepped for it across all directories (grep -rn "function hash:" .) and found this:
The server side will be discussed later; in HashEnabledLLVM/llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp, the following is added to void AsmPrinter::emitFunctionBody():
//global vector that stores all the function hashes
std::vector<ulong> functionSignatures;

...

/*********************************************************************************/
//Added code to calculate hash of the Asm File using only functions

//quick way to print the function to a string
std::string function_body;
llvm::raw_string_ostream body(function_body);
MF->print(body);
int functionSize = std::strlen(function_body.c_str());

//hash the function body
std::hash<std::string> hash_number;
ulong function_hash = hash_number(function_body);

//push the hash to a vector
functionSignatures.push_back(function_hash);

//sort the vector containing the function hashes
std::sort(functionSignatures.begin(), functionSignatures.end());

//calculate the net hash of the sorted vector
std::string net_hash = "";
for(uint i=0; i<functionSignatures.size(); i++){
    net_hash += std::to_string(functionSignatures[i]);
}

//calculate the hash of the appended function hash string
std::hash<std::string> final_hash;
ulong net = final_hash(net_hash);

//Output the function and net hash into the Asm File itself
OutStreamer->AddComment("function hash: " + std::to_string(function_hash) + "," + "net hash: " + std::to_string(net));
OutStreamer->AddComment("function size: " + std::to_string(functionSize));

/**********************************************************************************************/
//Custom code ends here
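In other words: hash each printed function body, keep a sorted vector of all hashes seen so far, concatenate them as strings, and hash the concatenation to get the net hash (the net hash emitted with the last function covers the whole file). A Python restatement of that final computation, noting that std::hash is implementation-defined so Python's hash() is only a stand-in:

```python
# stand-in restatement of the net-hash computation above
def net_hash(function_bodies):
    sigs = sorted(hash(body) for body in function_bodies)
    return hash("".join(str(s) for s in sigs))
```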
We can likewise emit any extra output we need as comments in the xxx.s (assembly) file, to help the server side calculate the fitness score.
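On the server side, such comments can be pulled back out of the uploaded .s file with a regex. A sketch (the real extraction lives in the server script and its exact patterns may differ):

```python
# sketch: recover the emitted hashes from an uploaded .s file
import re

def extract_hashes(asm_text):
    return (re.findall(r"function hash: (\d+)", asm_text),
            re.findall(r"net hash: (\d+)", asm_text))
```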
Other bug hunters may prefer a different LLVM version :D If so, remember to port the AsmPrinter.cpp patch and regenerate option_list.txt/options_list.txt.
## Clean PostgreSQL database "db"

root@b406ed67aeed:~# su postgres
postgres@b406ed67aeed:/root$ psql
could not change directory to "/root": Permission denied
psql (12.17 (Ubuntu 12.17-0ubuntu0.20.04.1))
Type "help" for help.

postgres=# create database db;
CREATE DATABASE
postgres=# create user anon with encrypted password 'admin';
ERROR:  role "anon" already exists
postgres=# grant all privileges on database db to anon;
GRANT
postgres=#

(To actually wipe a stale database later, drop database db; then re-create and re-grant as above.)
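Given this role and database, the Flask server's connection string presumably looks like the following; this is a hypothetical line, so check the SQLALCHEMY_DATABASE_URI actually set in fitness_wrapper/server_xxx.py:

```python
# assumed to match the psql setup above: user anon, password admin, database db
app.config['SQLALCHEMY_DATABASE_URI'] = "postgresql://anon:admin@localhost:5432/db"
```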
Fitness Calculation
The core of feedback-guided binary generation is the fitness function: it dictates how the population evolves.
Check fitness_wrapper/server_function_hash_uniform_weight.py; to run it, we already fixed the application-context issue above.
It connects to the local PostgreSQL database through the model class AsmFiles, and the main logic lives in post_file():
def post_file(filename):
    ...
    if (AsmFiles.query.filter_by(architecture=architectureName, binary_name=filename, binary_hash=hashNumber).all() == []):
        print("--------------------------------------")
        print("--------------------------------------")
        print("-----------found new file-------------")
        print("--------------------------------------")
        print("--------------------------------------")
        print("--------------------------------------")
        #architectureName is obtained from the ASM File, the LLVM version we are using is modified to
        #output the architecture name which is found using a regex in the server

        function_hash_string = ""
        for hashes in fuctionHashes:
            function_hash_string = function_hash_string + hashes + ","

        #This nested for loop checks to see if any function that is seen is different or not
        #If it is different, we need to use it to compute the weight of the binary
        isFunctionDifferent = [1.0] * len(fuctionHashes)

        #go through the complete database to see if there are any different function hashes
        Database = AsmFiles.query.filter_by(architecture=architectureName, binary_name=filename).all()
        for items in Database:
            items_dict = items.__dict__
            function_hashes = str(items_dict['functionHashes'])
            for i in range(len(fuctionHashes)):
                if fuctionHashes[i] in function_hashes:
                    isFunctionDifferent[i] = 0.0

        #once we check if any function is different or not we can just add the new asm file to the database
        if ".s" in filename:
            filename = filename.replace('.s', '')

        #if the architecture sub folder is not created then create this subfolder
        if ( (os.path.isdir(DOWNLOADS + "/" + str(architectureName) )) == False ):
            os.mkdir( DOWNLOADS + "/" + str(architectureName) )

        #if the hash of this particular source asm is not seen, then it will create the new folder for the source
        #and then write the data as well
        if ( (os.path.isdir(DOWNLOADS + "/" + str(architectureName) + "/" + str(filename) )) == False ):
            os.mkdir( DOWNLOADS + "/" + str(architectureName) + "/" + str(filename) )
        filename_saved = hashlib.sha256(request.get_data(as_text=True).encode("utf-8")).hexdigest()
        file_path = DOWNLOADS + "/" + str(architectureName) + "/" + str(filename) + "/" + str(filename_saved) + ".s"
        if ( os.path.isfile(file_path) == False ):
            with open(os.path.join(file_path), "wb") as fp:
                fp.write(request.get_data())

        if (AsmFiles.query.filter_by(architecture=architectureName, binary_name=filename, binary_hash=hashNumber).all() == []):
            db.session.add(asm_file)
            db.session.commit()

        lock.release()

        flash('Asm file successfully added')
        print('Added a new Asm File to the database')

        #calculate the final weight using the individual function weights and if the function is different or not
        final_calculated_weight = 0.0
        for function_index in range(len(isFunctionDifferent)):
            final_calculated_weight = final_calculated_weight + isFunctionDifferent[function_index]

        if(len(isFunctionDifferent) > 0):
            return_value = final_calculated_weight / len(isFunctionDifferent)
            print("The weight that was returned the server is: " + str(return_value))
            #return the calculated weight; if weight is 0 then the binary is not interesting,
            #otherwise a positive weight is returned
            return str(return_value)
The server receives xxx.s files instead of raw binaries, which makes post-analysis easier (it is text based). Every time a new function hash occurs, the server reports it and stores it in the database. The server then calculates the fitness score, a measure of how many functions differ, and returns it as the fuzzer's interestingness weight, representing new/unique behavior.
To adapt it to your own needs, change the calculation part, and remember to return the score; see the sketch below.
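For example, a minimal replacement scoring function might look like this. It is a sketch over the same inputs post_file() already computes; the names are mine, not the script's:

```python
# hypothetical custom fitness: fraction of never-before-seen function hashes
def fitness(current_hashes, seen_hashes):
    if not current_hashes:
        return 0.0
    new = sum(1 for h in current_hashes if h not in seen_hashes)
    return new / len(current_hashes)  # 0.0 => uninteresting; the handler must still return str(score)
```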
Sum Up
What a lot.
Differential testing for binary analysis tools still needs to be done… later…