ICC 2024 - Lolcat

2024-11-08 by Yannik Marchand

Lolcat was a reverse engineering challenge during the International Cybersecurity Challenge of 2024. It remained unsolved during the competition, but I was able to solve it afterwards. This writeup explains how I solved it. If you would like the try the challenge yourself, you can download it here.

Initial Analysis

We are given an encrypted flag file and a Windows executable. The Windows executable allows us to encrypt arbitrary files, but it does not implement decryption. The goal is to decrypt the flag file.

After loading the executable in IDA, we quickly find that it uses a bunch of obfuscation techniques:

  • Exceptions are used for control flow. A divide by zero is executed deliberatedly so that the program continues in the exception handler. This confuses decompilers.
  • Many functions are imported dynamically. Rather than the function name, the program only contains its hash, which is used to find the correct function at runtime.
  • The program checks NtGlobalFlag to tell whether a debugger is present. If a debugger is present, the program exits immediately.

We also see that the program implements a TlsCallback, which is executed whenever a thread is started. For the main thread, this code is executed before even the main function is called.

Exceptions and Anti-Debugging

If we decompile TlsCallback with IDA, it looks like the program exits immediately:

void __fastcall TlsCallback_0(__int64 a1, int a2)
{
  if ( a2 == 1 )
    exit(1 / (NtCurrentPeb()->NtGlobalFlag & 0x70));
}

However, this is only the case when a debugger is present. If no debugger is present, then NtGlobalFlag & 0x70 is zero, which causes a divide-by-zero exception. IDA's decompiler does not show this, but if we look at the disassembly we see that there is an exception handler:

This means that, when no debugger is present, the program calls sub_140001EE0 instead of exit. That function is presumably the place where the actual logic of the program is implemented.

There are two ways to bypass the anti-debugging check:

  • We can patch the code so that it always triggers a divide-by-zero exception, even when a debugger is present.
  • We can use the ScyllaHide plugin for x64dbg, which prevents the relevant bits in NtGlobalFlag from being set.

I used the latter approach.

Obfuscated Function Imports

While browsing through the code, we see that the program calls sub_140001450 at various places. We do not need to understand this function entirely, but what is important is that:

  • It loops through the InMemoryOrderModuleList.
  • It calls toupper.
  • It calls sub_140001100 and compares the result against 0xFCC032FF143AA692.

This immediately reminded me of a common obfuscation technique, where functions are imported dynamically using the hash of their name. If we look at sub_140001100, we see that it uses a lot of constants:

__int64 __fastcall sub_140001100(_QWORD *a1, unsigned __int64 a2)
{
    ...
    v3 = a1;
    if ( a2 < 0x20 )
        return sub_140001000(a2 + 0x27D4EB2F165667C5LL, a1, a2);
    v4 = (unsigned __int64)a1 + a2 - 31;
    v5 = 0LL;
    v6 = 0xC2B2AE3D27D4EB4FuLL;
    v7 = 0x60EA27EEADC0B5D6LL;
    v8 = 0x61C8864E7A143579LL;
    ...
}

By googling one of these constants, we quickly figure out that it is the 64-bit variant of xxHash.

Now, to find out which library matches the hash 0xFCC032FF143AA692, we simply try to hash a few common library names in uppercase, and check which one matches the hash. It turns out to be KERNEL32.DLL:

>>> import xxhash
>>> xxhash.xxh64(b"KERNEL32.DLL").hexdigest()
'fcc032ff143aa692'

So sub_140001450 returns the module handle for kernel32.dll.

The kernel32.dll handle is then passed to sub_140001700 at various places, together with another hash. We guess that this is the hash of a function name, so we download the list of functions that are exported by kernel32.dll from the internet, and write a Python script that calculates the XXH64 hash of each of them so that we can easily look them up:

import xxhash

with open("functions.txt") as f:
    lines = [l.strip() for l in f]

for line in lines:
    hash = xxhash.xxh64(lines[i].encode()).hexdigest()
    print(f"{hash} {line}")

Now we can easily assign names to imported functions in IDA's decompiler view.

Decrypting the Internal DLL

If we follow the logic from main, we find that the program does the following things:

  • It sets a bunch of environment variables. One of them, called MEMORY_SWAP_LIMIT, is set to the name of the file that should be encrypted.
  • It loads resource 101 from the executable and passes it to sub_1400019B0.
  • It does some magic involving VirtualAlloc, ReadProcessMemory and other functions.

If we look at sub_1400019B0, we see that it loads a library with LoadLibraryA and imports various functions from it, but the name of the library is encrypted somehow. By placing a breakpoint with a debugger and inspecting the library name at runtime we figure out that it loads advapi32.dll. Then, we build a mapping from hash to function name again and rename the function pointers in IDA.

It turns out that the resource is decrypted with RC4, where the key is the MD5 hash of the bytes 1234567890abcdeffedcba0987654321.

The following Python script decrypts the resource:

from Crypto.Cipher import ARC4
import hashlib

data = bytes.fromhex("1234567890abcdeffedcba0987654321")
key = hashlib.md5(data).digest()
rc4 = ARC4.new(key)

with open("101", "rb") as f:
    encrypted = f.read()

with open("decrypted.bin", "wb") as f:
    f.write(rc4.decrypt(encrypted))

A quick look at its content tells us that it is a DLL file.

Reversing the Encryption Algorithm

The internal DLL is not easy to understand. However, by looking for the string MEMORY_SWAP_LIMIT we quickly find the function that encrypts a file.

This function is quite big, but after spending some time on it, I got a general picture of what it does:

  1. The file is padded until its size is a multiple of 8.
  2. The file is encrypted with sub_180002E40.
  3. Each block of 8 bytes is encrypted with sub_180003370, sub_180003600 and sub_180003900.
  4. The result is encrypted with sub_180003FB0 and sub_180002270.

The function also does some magic involving random numbers, pow() and other operations. This turned out to be used to generate the key for the last encryption step, but it was not necessary to understand this part.

In order to decrypt the flag file, we have to figure out what each of the encryption steps does and implement the algorithms in reverse.

Step 1 - Basic XOR Encryption

The first step of the encryption algorithm (sub_180002E40) is relatively simple. This is roughly what the algorithm does:

def step1(data):
    for i in range(len(data) - 1):
        data[i] ^= data[i + 1]

Every byte is XORed with the next plaintext byte, except for the last byte, which is unchanged. To reverse this algorithm, we can start at the back and decrypt the bytes one by one:

def reverse_step1(data):
    for i in range(len(data) - 1, -1, -1):
        data[i] ^= data[i + 1]

Step 2 - Dynamic Analysis with Unicorn

The second step (sub_180003370) looks quite complicated. The decompilation has about 100 lines and performs many bitwise operations. The logic is hard to figure out. It would be helpful if we could execute the function on a block in isolation, and look at the result. While we could do this with a debugger, there is a better way: Unicorn.

Unicorn is one of the simplest CPU emulation frameworks that I have seen so far and it also provides a Python wrapper. Especially the fact that sub_180003370 has no side effects and does not call any library function makes it easy to emulate this function with Unicorn. We simply have to load it into memory, load the arguments into the correct registers, and execute it until the end.

In IDA's decompilation, we see that the function accepts 8 bytes in a std::vector<uint32_t>. The result is written back into the vector. In C++, a vector always has the following structure:

struct vector {
    void *start;
    void *end;
    void *capacity;
}

IDA also conveniently displays the offsets of the function in the input file in the bottom left corner of the disassembly and decompilation view:

With the following script, we can execute the function on an arbitrary block of 8 bytes:

from unicorn import *
from unicorn.x86_const import *
import struct

# Load the code from the decrypted DLL file
with open("decrypted.bin", "rb") as f:
    data = f.read()

code = data[0x2770 : 0x29F8]

# The block that we want to encrypt
block = [1, 2, 3, 4, 5, 6, 7, 8]

# Choose an arbitrary base address and allocate memory
# for the code, stack and data vector.
base = 0x10000000
stack = base + 0x1000
vector = stack + 0x1000 # address of vector header
vector_start = vector + 24 # address of vector data
vector_end = vector_start + 32 # end of vector data

mu = Uc(UC_ARCH_X86, UC_MODE_64)
mu.mem_map(base, 0x4000)
mu.mem_write(base, code)
mu.reg_write(UC_X86_REG_RSP, stack + 0xF00)
mu.reg_write(UC_X86_REG_RDX, vector)
mu.mem_write(vector, struct.pack("<QQQ", vector_start, vector_end, vector_end))
mu.mem_write(vector_start, struct.pack(f"<8I", *block))
mu.emu_start(base, base + len(code))

encrypted = mu.mem_read(vector_start, 32)
print(struct.unpack(f"<8I", encrypted))

By testing this against various input blocks, we quickly notice a pattern: the function reverses the order of the bits. For example, if the input block contains the following bits:

10110000 11101110 ... 10011111 10101110

Then the output contains the following bits:

01110101 11111001 ... 01110111 00001101

To undo this operation, we simply have to reverse the bits again:

def reverse_byte(value):
    bits = f"{value:08b}"
    return int(bits[::-1], 2)

def reverse_block(block):
    return [reverse_byte(x) for x in block[::-1]]

def reverse_step2(data):
    for i in range(0, len(data), 8):
        data[i : i + 8] = reverse_block(data[i : i + 8])

Step 3 - Generating a Bitmap with Unicorn

The third step (sub_180003600) is similar to the previous step. Again, we find a function that performs many bitwise operations on a block of 8 bytes. However, unlike in the previous step, we do not immediately understand the logic after testing some blocks with Unicorn. The only thing we notice is that the number of 1's in the result is always equal to the number of 1's in the input block. Further testing reveals that the input bits are only shuffled (reordered) in some way, without changing their value.

If we can generate a table that maps every input bit position to the correct output position, we can reproduce the logic of the function without understanding its design. This is what I did:

from unicorn import *
from unicorn.x86_const import *
import json
import struct

# Load the code from the decrypted DLL file
with open("decrypted.bin", "rb") as f:
    data = f.read()

code = data[0x2A00 : 0x2CF5]

# Choose an arbitrary base address and allocate memory
# for the code, stack, input vector and output vector.
base = 0x10000000
stack = base + 0x1000
input_vector = stack + 0x1000
input_start = input_vector + 24
input_end = input_start + 32
output_vector = input_end
output_start = output_vector + 24
output_end = output_start + 32

table = []
for i in range(64): # loop through each bit in a block
    block = struct.pack("<Q", 1 << i)

    mu = Uc(UC_ARCH_X86, UC_MODE_64)
    mu.mem_map(base, 0x4000)
    mu.mem_write(base, code)
    mu.reg_write(UC_X86_REG_RSP, stack + 0xF00)
    mu.reg_write(UC_X86_REG_RDX, input_vector)
    mu.reg_write(UC_X86_REG_R8, output_vector)
    mu.mem_write(input_vector, struct.pack("<QQQ", input_start, input_end, input_end))
    mu.mem_write(input_start, struct.pack("<8I", *block))
    mu.mem_write(output_vector, struct.pack("<QQQ", output_start, output_start, output_end))
    mu.emu_start(base, base + len(code))

    encrypted = mu.mem_read(output_start, 32)
    encrypted = struct.unpack("<8I", encrypted)

    # Figure out which bit is set in the result
    value = struct.unpack("<Q", encrypted)[0]
    for bit in range(value):
        if value & (1 << bit):
            table.append(bit)
            break

with open("table.json", "w") as f:
    json.dump(table, f)

The script produced the following table:

[27, 28, 36, 35, 18, 19, 20, 21, 29, 37, 45, 44, 43, 42, 34, 26, 9, 10, 11, 12, 13, 14, 22, 30, 38, 46, 54, 53, 52, 51, 50, 49, 41, 33, 25, 17, 0, 1, 2, 3, 4, 5, 6, 7, 15, 23, 31, 39, 47, 55, 63, 62, 61, 60, 59, 58, 57, 56, 48, 40, 32, 24, 16, 8]

To reverse the encryption process, we now check which of the input bits is set and flip the corresponding output bits:

import json
import struct

with open("table.json") as f:
    table = json.load(f)

def reverse_step3(data):
    for i in range(0, len(data), 8):
        encrypted = struct.unpack_from("<Q", data, i)[0]
        decrypted = 0
        for bit in range(64):
            if encrypted & (1 << table[bit]):
                decrypted |= 1 << bit
        data[i : i + 8] = struct.pack("<Q", decrypted)

Step 4 - Transposing Bits

The last algorithm that is applied to blocks of 8 bytes each (sub_180003900) is a bit simpler again. This algorithm tranposes the bits in the block.

Because a tranpose is self-inverse, it can be undone by applying another transpose:

def reverse_step4(data):
    for i in range(0, len(data), 8):
        block = data[i : i + 8]
        transposed = [0] * 8
        for byte in range(8):
            for bit in range(8):
                if block[byte] & (1 << bit):
                    transposed[bit] |= 1 << byte
        data[i : i + 8] = transposed

Step 5 - Solving Equations

We are almost at the end. The next function (sub_180003FB0) is applied on 4 bytes each. If the input buffer contains the bytes x0, x1, x2 and x3, then the following equations are calculated:

  • c0 = x0 * x2 + x1 * x3
  • c1 = x1 * x2 - x0 * x3
  • c2 = x0 + x2
  • c3 = x1 - x3

The values c0, c1, c2 and c3 are stored into the output buffer as 64-bit integers.

To reverse this step, we have to find x0, x1, x2 and x3 that satisfy the equations for a given c0, c1, c2 and c3. During the competition, I used Z3 to solve the equations, which provides a generic constraint solver:

from z3 import *
import struct

def reverse_step5(data):
    decrypted = b""
    for i in range(0, len(data), 32):
        c0, c1, c2, c3 = struct.unpack_from("<qqqq", data, i)

        x0 = Int("x0")
        x1 = Int("x1")
        x2 = Int("x2")
        x3 = Int("x3")

        solver = Solver()
        solver.add(x0 >= 0, x0 <= 255)
        solver.add(x1 >= 0, x1 <= 255)
        solver.add(x2 >= 0, x2 <= 255)
        solver.add(x3 >= 0, x3 <= 255)
        solver.add(c0 == x0 * x2 + x1 * x3)
        solver.add(c1 == x1 * x2 - x0 * x3)
        solver.add(c2 == x0 + x2)
        solver.add(c3 == x1 - x3)
        solver.check()

        model = solver.model()

        x0 = model.evaluate(x0).as_long()
        x1 = model.evaluate(x1).as_long()
        x2 = model.evaluate(x2).as_long()
        x3 = model.evaluate(x3).as_long()

        decrypted += bytes([x0, x1, x2, x3])
    return decrypted

However, the encrypted flag file was about 3.7 MB large, which meant that we had to solve the equations more than 100,000 times. Z3 is quite slow, and it took about half an hour to solve all equations.

After the competition, I implemented a faster solver by brute forcing x0 and solving the remaining variables using the 'quadratic formula'. Note that the last two equations can be rewritten as follows:

  • x2 = c2 - x0
  • x3 = x1 - c3

These can then be inserted into the first two equations:

  • x0 * (c2 - x0) + x1 * (x1 - c3) = c0
  • x1 * (c2 - x0) - x0 * (x1 - c3) = c1

If we know the value of x0, we can insert it into the first equation and turn it into a quadratic equation:

  • x1 * x1 - c3 * x1 + x0 * (c2 - x0) - c0 = 0

This means that we can apply the quadratic formula to get x1:

  • D = c3 * c3 + 4 * (c0 - x0 * (c2 - x0))
  • x1 = (c3 + math.isqrt(D)) // 2

We can then use the original equations to check whether our guess for x0 was correct. With this approach, the equations are solved in less than 10 seconds.

If you have a sharp mind, you might have noticed that there are multiple solutions to the equations in some cases. This is actually what prevented us from solving this challenge during the competition. More on that later.

Step 6 - AES-CBC

The last step of the encryption process (sub_180002270) uses AES-CBC. This was relatively easy to recognize. If we look at the function in IDA, we see that it uses the Rijndael S-box internally, which is usually only seen in AES. We also see that the function XORs each plaintext block with the previous ciphertext block during encryption. This implies that CBC is used.

The key and IV that are used are generated in an incomprehensible way, but we can extract them with x64dbg by placing a breakpoint right before sub_180002270 is called. This gave us the following values:

  • Key: 200000126f0000000000126163130012a40072690e5812dc0069124600007013
  • IV: 65126e77dc4604001220391200005812

The PyCryptodome library makes it easy to decrypt AES-CBC in Python:

from Crypto.Cipher import AES
from Crypto.Util.Padding import unpad

def reverse_step6(data):
    key = bytes.fromhex("200000126f0000000000126163130012a40072690e5812dc0069124600007013")
    iv = bytes.fromhex("65126e77dc4604001220391200005812")
    aes = AES.new(key, AES.MODE_CBC, iv=iv)
    return unpad(aes.decrypt(data), 16)

The Fatal Flaw

After we reversed all the encryption algorithms, we thought that we had solved the challenge. However, after running our decryption script against the flag, we noticed that something went wrong. While we saw an IHDR chunk at the end, which would belong to a PNG file, the file was clearly corrupted. Further testing revealed that there is a very small chance that a file that we encrypted with the Windows executable gets corrupted during decryption.

It took a while to figure out the cause, but eventually we got it. Remember that we had to solve the following equations in step 4:

  • c0 = x0 * x2 + x1 * x3
  • c1 = x1 * x2 - x0 * x3
  • c2 = x0 + x2
  • c3 = x1 - x3

Consider what happens when x1 and x3 are both zero. In that case:

  • c0 = x0 * x2
  • c1 = 0
  • c2 = x0 + x2
  • c3 = 0

There are multiple solutions for x0 and x2, because their order can be swapped. For example, if c0 = 6256 and c2 = 218, then 34 and 184 are valid solutions for x0 and x2, but we do not know which one is which.

This meant that there was a very small chance that a chunk of 8 bytes is decrypted incorrectly, and there is no easy way to fix that. Additionally, because of step 1, when one byte goes wrong, all bytes before that are XORed with the difference between the correct byte and the wrong byte.

I initially tried to fix the PNG file by XORing regions of the file with a constant value. While I got a valid PNG header this way, it wasn't enough to recover the flag. The biggest problem was that the image data in a PNG file is zlib compressed. If only one byte is corrupted in a zlib stream, all bytes that come after that are corrupted in an irrecoverable way. This made fixing the PNG file not feasible.

Recovering the Flag

While we did not manage to solve the challenge during the competition, I came up with a solution afterwards. We can find all places where there a multiple solutions by checking whether c1 and c3 are zero. There turned out to be 16 such places in the encrypted flag. This meant that there are 2^16 possible plaintexts for the encrypted flag.

My solution was to generate all possible plaintexts until I found a valid PNG file. Python turned out to be too slow for this (it would take 12 hours to complete), so I ported my script to C. To optimize it further, I reduced step 2, 3 and 4 into a single step. Each of these steps only moves around bits. Instead of moving bits for each step separately, I generated a position table that immediately moves each bit to the final position. The following script generates this table:

import json
import struct

def reverse_byte(value):
    bits = f"{value:08b}"
    return int(bits[::-1], 2)

def reverse_block(block):
    return bytes(reverse_byte(x) for x in block[::-1])

def transpose_block(block):
    transposed = [0] * 8
    for byte in range(8):
        for bit in range(8):
            if block[byte] & (1 << bit):
                transposed[bit] |= 1 << byte
    return bytes(transposed)

def apply_table(block, table):
    encrypted = struct.unpack("<Q", block)[0]
    decrypted = 0
    for bit in range(64):
        if encrypted & (1 << table[bit]):
            decrypted |= 1 << bit
    return struct.pack("<Q", decrypted)

# Load the table 
with open("table.json") as f:
    table = json.load(f)

# We loop through all 64 bits in a block. In every iteration,
# we set a single bit to 1, decrypt the block, and find the
# position at which the bit ends up.
positions = []
for i in range(64):
    block = struct.pack("<Q", 1 << i) # set exactly one bit
    block = transpose_block(block) # reverse algorithm 4
    block = apply_table(block, table) # reverse algorithm 3
    block = reverse_block(block) # reverse algorithm 2

    # Find the position of the bit in the decrypted block
    value = struct.unpack("<Q", bytes(block))[0]
    for j in range(64):
        if value & (1 << j):
            positions.append(j)
            break

print(positions)

This gives us the following table:

[27, 0, 1, 2, 3, 4, 5, 6, 26, 47, 28, 29, 30, 31, 32, 7, 25, 46, 59, 48, 49, 50, 33, 8, 24, 45, 58, 63, 60, 51, 34, 9, 23, 44, 57, 62, 61, 52, 35, 10, 22, 43, 56, 55, 54, 53, 36, 11, 21, 42, 41, 40, 39, 38, 37, 12, 20, 19, 18, 17, 16, 15, 14, 13]

Now, for the final solution, we use two more scripts. The first script reverses the AES-CBC encryption of step 6, computes a solution for all equations of step 5 and generates a list of offsets where the equations have multiple solutions:

from Crypto.Cipher import AES
from Crypto.Util.Padding import unpad
import json
import math
import struct

with open("flag.krmc", "rb") as f:
    data = f.read()

# Reverse algorithm 6 (AES-CBC)
key = bytes.fromhex("200000126f0000000000126163130012a40072690e5812dc0069124600007013")
iv = bytes.fromhex("65126e77dc4604001220391200005812")
aes = AES.new(key, AES.MODE_CBC, iv=iv)
data = unpad(aes.decrypt(data), 16)

# Solve algorithm 5 (equations)
swaps = [] # list of offsets where there are multiple solutions
solved = bytearray(len(data) // 8) # preallocate array for performance
for i in range(0, len(data), 32):
    c0, c1, c2, c3 = struct.unpack_from("<qqqq", data, i)
    for x0 in range(256): # brute force x0
        D = c3 * c3 + 4 * (c0 - x0 * (c2 - x0))
        if D < 0:
            continue

        # Calculate x1, x2 and x3
        x1 = (c3 + math.isqrt(D)) // 2
        x2 = c2 - x0
        x3 = x1 - c3

        # Check if the current x0 is correct
        if x1 * (c2 - x0) - x0 * (x1 - c3) == c1:
            break # We found a solution
    else:
        raise ValueError("No solution found")

    # Store solutions into output array
    solved[i // 8] = x0
    solved[i // 8 + 1] = x1
    solved[i // 8 + 2] = x2
    solved[i // 8 + 3] = x3

    # If x1 and x3 are 0, there are multiple solutions
    # In this case, x0 and x2 might have to be swapped
    if x1 == 0 and x3 == 0:
        swaps.append(i // 8)

# Write solutions for equations to file
with open("temp.bin", "wb") as f:
    f.write(solved)

# Prints the offsets where we might have to swap bytes
print(swaps)

The temp.bin file that was generated by this script can be downloaded here.

The following offsets are printed by the script:

[12464, 28984, 30084, 108364, 111772, 129228, 149680, 195956, 222712, 234588, 237408, 237900, 339724, 345136, 357860, 444992]

Finally, the following script generates all possible solutions, and writes all solutions that have a valid PNG header to a file. Note that because of the way that the encryption algorithm works, a file may have a valid PNG header even if another part of the file is corrupted.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

size_t swaps[] = {12464, 28984, 30084, 108364, 111772, 129228, 149680, 195956, 222712, 234588, 237408, 237900, 339724, 345136, 357860, 444992};
int shuffle[] = {27, 0, 1, 2, 3, 4, 5, 6, 26, 47, 28, 29, 30, 31, 32, 7, 25, 46, 59, 48, 49, 50, 33, 8, 24, 45, 58, 63, 60, 51, 34, 9, 23, 44, 57, 62, 61, 52, 35, 10, 22, 43, 56, 55, 54, 53, 36, 11, 21, 42, 41, 40, 39, 38, 37, 12, 20, 19, 18, 17, 16, 15, 14, 13};

void *read(const char *filename, size_t *size) {
    FILE *f = fopen(filename, "r");
    fseek(f, 0, SEEK_END);
    *size = ftell(f);
    fseek(f, 0, SEEK_SET);
    void *data = malloc(*size);
    fread(data, 1, *size, f);
    fclose(f);
    return data;
}

void write(const char *filename, void *data, size_t size) {
    FILE *f = fopen(filename, "w");
    fwrite(data, 1, size, f);
    fclose(f);
}

int main() {
    size_t encrypted_size;
    uint8_t *encrypted = read("temp.bin", &encrypted_size);
    uint8_t *data = malloc(encrypted_size);

    // There are 16 position where a swap can be applied,
    // so 2^16 = 65536 possible solutions
    for (int i = 0; i < 65536; i++) {
        printf("Progress: %i / 65536\n", i);

        // Create a copy
        memcpy(data, encrypted, encrypted_size);

        // Apply swaps
        for (int j = 0; j < 16; j++) {
            if (i & (1 << j)) {
                // Swap x0 and x2 at offset
                size_t offset = swaps[j];
                uint8_t temp = data[offset];
                data[offset] = data[offset + 2];
                data[offset + 2] = temp;
            }
        }

        // Shuffle bits to correct positions
        for (size_t j = 0; j < encrypted_size; j += 8) {
            uint64_t value = *(uint64_t *)(data + j);
            uint64_t result = 0;
            for (int k = 0; k < 64; k++) {
                if (value & (1l << k)) {
                    result |= 1l << shuffle[k];
                }
            }
            *(uint64_t *)(data + j) = result;
        }

        // Undo XOR encryption
        for (ssize_t j = encrypted_size - 2; j >= 0; j--) {
            data[j] ^= data[j + 1];
        }

        // Check if it has a valid PNG header
        if (data[0] == 0x89 && data[1] == 'P' && data[2] == 'N' && data[3] == 'G') {
            char filename[100];
            sprintf(filename, "flags/flag_%i.png", i);
            write(filename, data, encrypted_size);
        }
    }

    return 0;
}

After about 7 minutes, the script had finally produced the image with the flag:

Conclusion

Lolcat was one of the hardest challenges that I have ever solved. In addition to bypassing basic obfuscation techniques on Windows, we had to reverse engineer 6 different algorithms that were used during encryption.

In one of the algorithms, there was a small chance that different plaintexts produce the same ciphertext. This meant that the encryption algorithm was not always uniquely invertible. In the end, I solved the challenge by generating all possible plaintexts for the flag file, and finding the one that contained a valid image.

Solving this challenge required skills in many different areas, such as Windows reversing, deobfuscation, both static and dynamic analysis, and even some brute force. Even though I did not solve it during the competition, I definitely learned a lot from it.