The Occult

A Synchronous/Asynchronous Virtual Machine Architecture
for Unreal Engine

Mert Börü
Video Game Developer
mertboru@gmail.com

Abstract

The “Occult” virtual machine architecture is a cross-platform C++ middleware designed and implemented for Unreal Engine. Its primary goal is to deliver super-optimized AAA grade games. By adding a very thin layer on top of the Unreal Engine C++ API, the Occult provides a virtual microcomputer architecture with a synchronous/asynchronous 64-bit CPU, various task specific 32-bit coprocessors, and a modern assembly language gameplay programming ecosystem for handling real-time memory/asset management and in-game abstract object/data streaming. It can be used as a standalone solution or as a supplementary tool for existing Blueprint/C++ projects.


1. Introduction

The Occult virtual machine architecture is a cross-platform gameplay programming middleware written in C++ for Unreal Engine. With respect to conventional Unreal Blueprint/C++ gameplay programming methods, the Occult introduces an “alternate” way of gameplay programming and scripting with modern AAA grade video game development requirements in mind. In order to deliver “super-optimized” games, the Occult provides an amalgamation of old and new by revisiting proven low-level game development techniques: a good old-fashioned assembly language gameplay programming ecosystem, harnessing the power of a modern synchronous/asynchronous 64-bit CPU and various 32-bit coprocessors under the hood.

2. System Architecture Overview

In simple terms, the Occult is a “virtual” microcomputer. It comes complete with major computer components, such as CPU, memory and I/O, in addition to custom coprocessors tailor-made to meet specific Unreal Engine features.

The Occult System Bus Architecture

On the surface, the Occult has all the elements of a classic von Neumann[1] architecture: a central processing unit, a memory unit, and various buses. However, a thorough examination shows that the Occult architecture takes advantage of being virtual, and thus breaks some rules for the better. For example, there is no control bus, coprocessors use a proprietary bus, memory can be accessed through an address bus as well as an index bus, and the stack is isolated from RAM.

Making so many radical design decisions has only one purpose: code optimization. In order to deliver speed & size optimized gameplay code, the Occult has a special architecture design that maximizes code performance via concurrent synchronous/asynchronous processing power and minimizes code size by having a very reduced instruction set.

3. Central Processing Unit (CPU)

Unlike the general purpose CPU of the desktop computer or notebook that Unreal Engine runs on, the Occult has a virtual processor designed with the “specific needs” of a video game developer in mind. Specific needs require custom solutions.

3.1 Hybrid Synchronous/Asynchronous Architecture

Today’s consumer computers use “clock driven” synchronous processors. A hardware clock signal, whose frequency is typically configured through the BIOS, mimics the conductor of an orchestra. It keeps everything in order. All components in the computer work in perfect harmony, because the changes in the state of CPU and memory elements are always synchronized by the oscillating “reference” signal.

An asynchronous architecture is the complete opposite of a clock driven system. It is “clockless”, or self-timed. There is no conductor. Each player in the orchestra is free to choose a tempo. Despite the fact that clockless music may end up as pure cacophony, a clockless CPU architecture is indeed a very good fit for “event driven” programming. Signals are only used to indicate completion of instructions and operations. Since there is no clock signal, asynchronous systems don’t have to wait for a clock pulse to begin processing inputs. The state of the system changes as soon as the inputs change. This is why asynchronous systems can respond faster than synchronous systems.

At the beginning of the asynchronous execution phase, the task is assigned as a background process. If the assignment is successful, the CPU is either deliberately halted or the next instruction is executed in parallel. This process is repeated until the background task is completed. Due to the nature of event driven programming and the complexity of the background process, the time of completion cannot be estimated precisely.

The Occult has a hybrid processor that utilizes both synchronous and asynchronous architectures simultaneously. The decision is made by “smart” instructions. All instructions decide for themselves. For instance, an arithmetic logic unit driven instruction may or may not use the synchronous method (depending on the other instructions in the pipeline), while a file streaming instruction is always executed as an asynchronous process for the sake of delivering flicker/stutter-free gameplay experience.

For synchronous operations, the Occult needs a clock signal. This is where the “virtual” IRQ (Interrupt Request) instructions come in handy. The video game developer is free to set/update a signal frequency on the fly. The change is reflected right after executing the IRQ instruction(s) in the pipeline.

Time management for IRQ based synchronous operations is inspired by a multimedia system that is considered to have been far ahead of its time: the Amiga family of personal computers introduced by Commodore in 1985. On the Amiga, there were various “time slots” assigned for different tasks to be executed on each horizontal scan line. Whether a task was defined or not, assigned time slots were always executed. It was a hardwired design, forcing the system to work at full capacity all the time. For instance, if no audio data was provided, the Amiga was forced to spend a certain amount of allocated time as if it was playing “silence”, and this process was repeated as long as the system was on. In general, this method was known as “DMA Time Slot Allocation”[2].

A time slot based “closed system” mimics finite-state machine behaviour. Every input is known and every resultant is known (or can be known) within a “specific time”. In terms of performance, constant amount of work done in a constant amount of time simply eliminates transient response, and thus achieves a steady state response.

“Time Slot Allocation” is an efficient and proven model for avoiding stutters during gameplay, as the Amiga architecture demonstrated. The Occult’s IRQ driven synchronous operations are managed in this manner: whether a synchronous instruction is fetched-decoded-executed or not, that time slot is always executed. Following this model, the Occult’s instruction set is designed and implemented so that all synchronous instructions fit into their allocated time slots.
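The time slot idea can be illustrated with a minimal sketch. The names below (TimeSlot, SlotScheduler, runFrame) and the use of abstract "ticks" instead of real time are assumptions made for illustration only; they are not taken from the Occult source. The point shown is that a slot's budget is charged on every frame whether or not a task is assigned to it, so the cost of a frame stays constant.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical model: a frame is a fixed sequence of time slots.
struct TimeSlot {
    std::uint32_t budgetTicks;   // fixed cost charged on every frame
    std::function<void()> task;  // may be empty, i.e. an idle slot
};

class SlotScheduler {
public:
    void addSlot(std::uint32_t budgetTicks, std::function<void()> task = {}) {
        assert(budgetTicks > 0);
        slots_.push_back({budgetTicks, std::move(task)});
    }

    // Runs one frame and returns the virtual ticks it consumed.
    std::uint32_t runFrame() {
        std::uint32_t elapsed = 0;
        for (const TimeSlot& slot : slots_) {
            if (slot.task) {
                slot.task();             // do the work, if any was assigned
            }
            elapsed += slot.budgetTicks; // charged even when the slot is idle
        }
        return elapsed;
    }

private:
    std::vector<TimeSlot> slots_;
};
```

With three slots of 4, 2 and 6 ticks, runFrame() reports 12 ticks on every frame, regardless of how many slots actually carry a task. This is the steady state behaviour described above, reduced to its simplest form.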

The Occult can perform non-IRQ driven synchronous operations as well. If no IRQ is set, then the synchronous operations consume all available power of the hardware CPU that the Occult is running on. In that case, clock signal frequency is auto-determined by the resources available on the hardware CPU and time slot allocation is ignored. This is the “put the pedal to the metal” mode of the Occult. Moderate use is recommended. Overuse may or may not cause stutters, depending on the system’s hardware CPU performance.

As a rule of thumb, it is wise to use IRQ driven synchronous operations for “time-stretched” (noncritical) operations, and non-IRQ driven synchronous operations for “instant” (critical) operations.

The Occult has a hybrid instruction pipeline: during asynchronous execution of a background process, the CPU can perform other synchronous/asynchronous tasks depending on the type of the next instruction in the pipeline. If necessary, the CPU can be halted as well. Switching between the two modes is “instruction type” driven. It is handled automatically; user input is not necessary.
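A single-threaded simulation is enough to show the shape of this behaviour. In the sketch below, every name (HybridCpu, execSync, execAsync, haltUntil) is hypothetical, and background work is modelled as a countdown of virtual ticks rather than a real thread. It is not the Occult implementation, only an illustration of an asynchronous task progressing while synchronous instructions continue to execute.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical background task: finishes after a number of virtual ticks.
struct BackgroundTask {
    int remainingTicks;
    int result;
};

class HybridCpu {
public:
    // Synchronous instruction: completes within the current tick.
    int execSync(int a, int b) {
        tick();
        return a + b;
    }

    // Asynchronous instruction: assigned as a background task and
    // returns a handle immediately, without waiting for completion.
    std::size_t execAsync(int durationTicks, int result) {
        assert(durationTicks > 0);
        tasks_.push_back({durationTicks, result});
        return tasks_.size() - 1;
    }

    bool isDone(std::size_t handle) const {
        return tasks_.at(handle).remainingTicks == 0;
    }

    int result(std::size_t handle) const {
        assert(isDone(handle));
        return tasks_.at(handle).result;
    }

    // Halts the CPU: nothing else executes until the task completes.
    void haltUntil(std::size_t handle) {
        while (!isDone(handle)) {
            tick();
        }
    }

    unsigned ticks() const { return ticks_; }

private:
    // Every tick advances all unfinished background tasks by one step.
    void tick() {
        for (BackgroundTask& t : tasks_) {
            if (t.remainingTicks > 0) {
                --t.remainingTicks;
            }
        }
        ++ticks_;
    }

    std::vector<BackgroundTask> tasks_;
    unsigned ticks_ = 0;
};
```

Starting a 3-tick background task, executing one synchronous instruction, and then halting until completion takes 3 ticks in total, because the synchronous instruction overlaps with the first tick of the background work.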

3.2 Stack based Processing

The Occult’s virtual CPU is “registerless”. There are no address/data registers. Being a stack based processor, similar to Burroughs large systems architecture used back in the early 60s, all operations are performed on the stack.

There are multiple reasons behind this decision, such as:

a.) Having registers on a real hardware CPU is reasonable, because register based operations are faster due to no (or less) memory access requirement. On the other hand, a virtual processor is made of code. It is pure software. Each and every variable defined in VM source code is memory driven. If defined, a virtual register will be no different than any other variable in memory. In terms of optimization, there is no advantage in using software emulated hardware registers on a virtual processor.

b.) Having a limited number of registers is a bottleneck for implementing complex routines. Instead of focusing on the functionality of an algorithm, the game developer wrestles with pushing and popping registers most of the time, which leads to CPU performance problems (wasted ticks) due to push/pop overhead. Using a stack based architecture solves this problem by offering each and every byte of stack as a potential variable. In theory, the number of variables (call them registers, if you want to) is limited only by stack size. It is possible to perform sequential operations on thousands of variables without pushing and popping anything, unless necessary.

c.) Most of the algorithms that we use today were created by smart engineers back in the 60s using archaic programming languages, such as Fortran, ALGOL, and Simula. The habit of using these languages and compiling code on stack based processors has led to decades of a “stack driven” way of thinking. It has been widely acclaimed as the natural syntax for the creative process. Even today, we still use stack based pseudo code languages (mostly ALGOL driven) for writing academic papers and sharing ideas/algorithms on the Internet. Do we ever worry about registers while creating algorithms? Not at all. All ALGOL-like languages, which really means most commonly used programming languages (including C/C++), were designed based on von Neumann architecture computers. Breaking out of the von Neumann mind-set when designing a computer language isn’t easy, and getting other people to use such a language is even harder[3]. That is why the Occult sticks to a “traditional” stack driven architecture for running “algorithm driven” code.
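Point b.) above can be made concrete with a small sketch in which every stack slot acts as a directly addressable variable. SlotStack, add and sumRange are hypothetical names chosen for this example, and the operand order (destination first) is an assumption; the article does not specify the Occult's actual operations.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-size stack bank: every slot is addressable by index,
// so operations never need to push or pop intermediate values.
template <typename T, std::size_t N>
class SlotStack {
public:
    T& slot(std::size_t index) {
        assert(index < N);
        return slots_[index];
    }

    // dst = dst + src; both operands already live on the stack.
    void add(std::size_t dst, std::size_t src) { slot(dst) += slot(src); }

private:
    std::array<T, N> slots_{};  // value-initialized, i.e. zero for integers
};

// Accumulates slots [first, last] into slot `acc` with no push/pop traffic.
template <typename T, std::size_t N>
void sumRange(SlotStack<T, N>& s, std::size_t acc,
              std::size_t first, std::size_t last) {
    assert(first <= last && (acc < first || acc > last));
    s.slot(acc) = T{};
    for (std::size_t i = first; i <= last; ++i) {
        s.add(acc, i);
    }
}
```

Summing a thousand slots into one accumulator slot touches each operand in place. A register based design with a handful of registers would instead have to shuttle values in and out of memory for the same loop.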

3.3 Status Registers

The Occult CPU has 2 special registers reserved for status handling: OOR and CCR. The user has read/write access to them, but they can only be used for monitoring and setting the status of various virtual machine features, not for storing addresses or data.

3.3.1 Operand Offset Register (OOR)

OOR is an 8-bit register with full read/write access to the user. It can be used in many ways, depending on the type and functionality of the instruction it is used with. If the instruction is ALU driven, then it operates both as a direction “offset” and a source/destination “selector”. For a database/stack driven simple read-only instruction, it only operates as an offset. And finally, each coprocessor may interpret this register in a very unique manner, such as using it as a custom getter/setter or a feature enabler/disabler, depending on the type and functionality of the caller coprocessor.

3.3.2 Control Code Register (CCR)

CCR is a 32-bit register with full read/write access to the user. It is a modern multipurpose status register for monitoring all system errors, setting various system features, and hosting a minimal set of conditional flags: only Z (Zero) and N (Negative). The abbreviation used for this register pays tribute to the Condition Code Register of the mighty Motorola M68000 family of processors[4].
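As an illustration of how a 32-bit status register can host the Z and N flags next to other status bits, consider the sketch below. The bit positions (Z in bit 0, N in bit 1), the helper names, and the treatment of the remaining bits as generic feature bits are all assumptions; the article does not document the actual CCR layout.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout, for illustration only: Z in bit 0, N in bit 1.
constexpr std::uint32_t kFlagZ = 1u << 0;
constexpr std::uint32_t kFlagN = 1u << 1;
constexpr std::uint32_t kFlagMask = kFlagZ | kFlagN;

// Returns `ccr` with Z and N recomputed from `result`;
// every other bit of the register is preserved.
inline std::uint32_t updateFlags(std::uint32_t ccr, std::int64_t result) {
    ccr &= ~kFlagMask;
    if (result == 0) { ccr |= kFlagZ; }
    if (result < 0)  { ccr |= kFlagN; }
    return ccr;
}

inline bool isZero(std::uint32_t ccr)     { return (ccr & kFlagZ) != 0; }
inline bool isNegative(std::uint32_t ccr) { return (ccr & kFlagN) != 0; }

// Hypothetical helper: sets a status/feature bit outside the flag area.
inline std::uint32_t setStatusBit(std::uint32_t ccr, unsigned bit) {
    assert(bit >= 2 && bit < 32);
    return ccr | (1u << bit);
}
```

Keeping the flag update as a mask-and-set operation means that a conditional instruction can refresh Z and N without disturbing error or feature bits stored in the same register.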

3.4 Instruction Set

Computer designers have a common goal: to find a language that makes it easy to build the hardware and the compiler while maximizing performance and minimizing cost. This goal is time-honoured; the following quote was written before you could buy a computer, and it is as true today as it was in 1947[5]:

“It is easy to see by formal-logical methods that there exist certain [instruction sets] that are in abstract adequate to control and cause the execution of any sequence of operations… The really decisive considerations from the present point of view, in selecting an [instruction set], are more of a practical nature: simplicity of the equipment demanded by the [instruction set], and the clarity of its application to the actually important problems together with the speed of its handling of those problems.”

Burks, Goldstine and von Neumann, 1947

The design philosophy behind the Occult’s instruction set architecture is aware of the fact that “simplicity of the equipment” is as valuable a consideration for computers of the 2000s as it was for those of the 1950s, and that is why the Occult comes with a very simple instruction set that contrasts with today’s complex CISC and RISC processor instruction sets.

Each instruction performs only one task, whether simple or complex. It can be executed synchronously and/or asynchronously, depending on both the nature of the instruction and the other instruction(s) being processed in the pipeline. If an operand is needed, it can either be a “source” or a “destination”. Source is read-only. It can be a database record, a stack slot or an immediate value. Destination is always the stack and/or the status registers, and has read/write access to all. Some instructions allow more than one operand, though this is rarely necessary.

Most instructions are strictly typed, similar to C++ pointer convention. Operations must be performed on same types. If necessary, types can be converted to other types using conversion instructions.

Immediate operands are stored in the stack, and conditionally used by control flow instructions. Depending on the type of the instruction, immediate operands can either be “absolute” or “relative”. The sign of an immediate operand indicates the direction of the operation. For absolute Move/Swap/Jump instructions, it is interpreted as an offset “from the beginning/end” of the stack, whereas relative instructions stick to the “from the current position” convention.
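The two offset conventions can be sketched as a pair of index resolvers. The function names are hypothetical, and one detail is an assumption that the article does not confirm: a negative absolute offset of -1 is taken to mean the last slot, -2 the one before it, and so on.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Absolute: a non-negative offset counts from the beginning of the stack,
// a negative offset counts back from the end (assumed: -1 is the last slot).
inline std::size_t resolveAbsolute(std::int32_t offset, std::size_t stackSize) {
    if (offset >= 0) {
        assert(static_cast<std::size_t>(offset) < stackSize);
        return static_cast<std::size_t>(offset);
    }
    const std::size_t back =
        static_cast<std::size_t>(-static_cast<std::int64_t>(offset));
    assert(back <= stackSize);
    return stackSize - back;
}

// Relative: the signed offset is applied to the current position.
inline std::size_t resolveRelative(std::size_t current, std::int32_t offset,
                                   std::size_t stackSize) {
    const std::int64_t target = static_cast<std::int64_t>(current) + offset;
    assert(target >= 0 && static_cast<std::size_t>(target) < stackSize);
    return static_cast<std::size_t>(target);
}
```

For a 16-slot stack, an absolute offset of 3 resolves to slot 3 and -1 resolves to slot 15, while a relative offset of -5 from position 5 resolves to slot 0.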

All mnemonics are defined as C++ macros. If necessary, alternate instruction set(s) can be created by renaming mnemonics.
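The macro based approach can be sketched as follows. The mnemonic names (occ_movi, occ_add), the alternate names (LD, ADDS) and the tiny MiniVm type are invented for this example and do not reflect the Occult's real instruction set. The sketch only shows how a mnemonic can be a thin macro over a C++ call, and how an alternate instruction set can be obtained by renaming.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical miniature VM with a single bank of integer slots.
struct MiniVm {
    std::vector<std::int32_t> slots;
    explicit MiniVm(std::size_t count) : slots(count, 0) {}
};

// Store an immediate value into a slot.
inline void vmMoveImmediate(MiniVm& vm, std::size_t dst, std::int32_t value) {
    assert(dst < vm.slots.size());
    vm.slots[dst] = value;
}

// dst = dst + src
inline void vmAdd(MiniVm& vm, std::size_t dst, std::size_t src) {
    assert(dst < vm.slots.size() && src < vm.slots.size());
    vm.slots[dst] += vm.slots[src];
}

// Default mnemonics: thin macros over the functions above.
#define occ_movi(vm, dst, imm) vmMoveImmediate((vm), (dst), (imm))
#define occ_add(vm, dst, src)  vmAdd((vm), (dst), (src))

// Alternate instruction set: the same instructions under different names.
#define LD   occ_movi
#define ADDS occ_add
```

Because the alternate names expand to the default mnemonics, LD(vm, 2, 8) and occ_movi(vm, 2, 8) compile to exactly the same call.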

4. Stack

Stack is the primary memory unit of the Occult architecture. Unlike a traditional RAM based stack, it is an independent unit isolated from the main memory. In addition to ROM and RAM, it can be thought of as a 3rd type of memory.

Stack is typed. Each stack bank is dedicated to a “type”. A total of 6 stack types (1 system + 5 user) are defined as factory default banks. System stack is a privileged bank that is used for protected function calls. User access to system stack is not allowed. All gameplay related operations are performed on user stack banks. As the name implies, the user has full read/write access to them. Although not necessary, the flexible nature of the Occult ecosystem enables users to define new stack types/banks in C++.

The size of each stack bank is user defined. All banks can have a different size. By default, all stack banks are defined and allocated during level creation and garbage collected right before level destruction. If necessary, stacks can be preserved during level transitions. In order to achieve consistent frame rates, stack resizing is not allowed during gameplay.

Contrary to classic von Neumann architecture, no address pointers are used during stack read/write operations. In terms of “addressing”, ROM and stack do not share the same bus. Stack uses the index bus to achieve an index based addressing for fast “slot” access, while ROM sticks to the traditional address bus. This separation ensures that both ROM (read) and stack (read/write) operations are concurrently handled for synchronous instructions.
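A typed, fixed-size, index addressed bank can be sketched in a few lines. StackBank and LevelStacks are hypothetical names, and the two element types shown (32-bit integers and floats) are placeholders; the article does not list the five default user types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical typed stack bank: sized once, then accessed by slot index.
// There is deliberately no API to grow or shrink it afterwards.
template <typename T>
class StackBank {
public:
    explicit StackBank(std::size_t capacity) : slots_(capacity) {}

    std::size_t capacity() const { return slots_.size(); }

    // Index based slot access; no address pointer is exposed.
    T& operator[](std::size_t index) {
        assert(index < slots_.size());
        return slots_[index];
    }

private:
    std::vector<T> slots_;  // allocated once, in the constructor
};

// One bank per type, each with its own user defined size, created
// together with the level and destroyed together with it.
struct LevelStacks {
    StackBank<std::int32_t> ints;
    StackBank<float> floats;

    LevelStacks(std::size_t intSlots, std::size_t floatSlots)
        : ints(intSlots), floats(floatSlots) {}
};
```

Tying the banks to an object whose lifetime matches the level mirrors the allocation policy described above: all memory is reserved up front, so no allocation or resizing can disturb the frame rate during gameplay.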

The Occult instruction set is designed in a way that stack can be used both as “source” and “destination”, depending on the type of operation performed. If stack is set as destination, then RAM can only be used as source. Only one destination is allowed per instruction. It can either be stack or RAM.

Having such a unique stack architecture is a ticket to an adventure into uncharted waters. Besides being able to apply rules of classic software engineering concepts and methods, bending/breaking them for the sake of hardcore game code optimization is possible as well, such as implementing push/popless function calls, type driven tree traversals, stack arrays, and split structs.

5. Memory

ROM and RAM are the secondary memory units of the Occult architecture. Just like stack, they are typed. Each memory bank is dedicated to a “type”. A total of 5 user types are defined as factory default banks. The size of each ROM/RAM bank is user defined, and all banks can have a different size. Code can be executed on both ROM and RAM with full stack access.

In terms of addressing, both ROM and RAM use conventional address bus architecture. However, ROM has the advantage of accessing the index bus as well. Only one of these buses can be used at a time. The user has full privilege to switch between address and index bus for ROM operations, preferably the latter: the Occult handles all index bus operations like a handheld console that uses ROM cartridges, and thus index bus read operations are ~40 times faster than address bus operations.

Regarding memory read performance via index bus, ROM access is always faster than stack access. RAM is the slowest of all. For write operations, stack is always faster than RAM, simply because the latter has no access to index bus. As a rule of thumb, it is wise to define variables in stack rather than RAM, and that is a paradigm shift.

6. Coprocessors

All coprocessors share a custom 32-bit proprietary bus for addressing, and a standard 64-bit data bus for data transfer. In order to reduce system traffic, they do not have access to address bus. All coprocessors can synchronously communicate with each other via CPU, while performing asynchronous tasks. Currently, there are 3 coprocessors.

6.1 Trigger

Trigger is a coprocessor dedicated to defining, updating and executing “actions/tasks”. When a collision is detected, Trigger fires an action. Action is user defined. It can either be a C++ function running on the physical (hardware) CPU or a virtual function on the Occult. Trigger can cooperate with Flagger for defining prerequisite conditions.

6.2 Flagger

Flagger is a coprocessor dedicated to defining, updating and executing “conditional” statements. No actions/tasks are performed. It functions as a complementary coprocessor to Trigger. One or more conditional statements can be updated/executed before a Trigger action is performed. Flagger definitions are stored in ROM and RAM. It is possible to implement and execute self-modifying conditional statements using RAM.
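The cooperation between Trigger and Flagger can be sketched with two small classes. The interfaces below (define, evaluate, onCollision) are hypothetical and use std::function for brevity; the real coprocessors store their definitions in ROM/RAM and are driven by Occult instructions, which this sketch does not attempt to reproduce.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical Flagger: stores conditional statements, performs no actions.
class Flagger {
public:
    using Condition = std::function<bool()>;

    // Registers a condition and returns its id.
    std::size_t define(Condition condition) {
        assert(condition);
        conditions_.push_back(std::move(condition));
        return conditions_.size() - 1;
    }

    bool evaluate(std::size_t id) const { return conditions_.at(id)(); }

private:
    std::vector<Condition> conditions_;
};

// Hypothetical Trigger: fires a user defined action on collision,
// but only if all prerequisite Flagger conditions hold.
class Trigger {
public:
    using Action = std::function<void()>;

    Trigger(const Flagger& flagger, Action action,
            std::vector<std::size_t> prerequisites)
        : flagger_(flagger),
          action_(std::move(action)),
          prerequisites_(std::move(prerequisites)) {}

    // Returns true if the action was fired.
    bool onCollision() {
        for (std::size_t id : prerequisites_) {
            if (!flagger_.evaluate(id)) {
                return false;
            }
        }
        action_();
        return true;
    }

private:
    const Flagger& flagger_;
    Action action_;
    std::vector<std::size_t> prerequisites_;
};
```

A door that opens only when the player holds a key is a typical use: the Flagger condition checks for the key, and the Trigger action opens the door when a collision is reported and the condition holds.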

6.3 Pooler

Pooler is a coprocessor dedicated to defining, acquiring and releasing predefined objects. An object can be anything: an Actor, a Component, or an abstract user defined entity. The user has full access to Pooler for defining object types and sizes. All pool objects are reusable; they can be acquired and released many times during gameplay with negligible performance cost. Pooler instructions are synchronously executed on the virtual CPU, while pooler tasks are asynchronously performed at coprocessor level.
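The acquire/release cycle follows the classic object pool pattern, which can be sketched as below. Pool and Bullet are hypothetical names, and the sketch is purely synchronous; the asynchronous coprocessor side of Pooler is not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pooled object type.
struct Bullet {
    int damage = 0;
};

// Minimal object pool: all objects are created up front, and
// acquire/release only move indices on a free list.
template <typename T>
class Pool {
public:
    explicit Pool(std::size_t count) : objects_(count) {
        free_.reserve(count);
        for (std::size_t i = count; i > 0; --i) {
            free_.push_back(i - 1);  // object 0 ends up on top
        }
    }

    // Returns nullptr when the pool is exhausted; never allocates.
    T* acquire() {
        if (free_.empty()) {
            return nullptr;
        }
        const std::size_t index = free_.back();
        free_.pop_back();
        return &objects_[index];
    }

    // Returns an object to the pool so that it can be reused.
    void release(T* object) {
        assert(object >= objects_.data() &&
               object < objects_.data() + objects_.size());
        free_.push_back(static_cast<std::size_t>(object - objects_.data()));
    }

    std::size_t available() const { return free_.size(); }

private:
    std::vector<T> objects_;         // storage, fixed after construction
    std::vector<std::size_t> free_;  // indices of currently unused objects
};
```

Since the free list is reserved to full capacity in the constructor, neither acquire() nor release() allocates memory, which is what keeps repeated reuse cheap during gameplay.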

7. Case Studies

The following snippets demonstrate how the Occult can efficiently be used for low-level Unreal gameplay programming. All examples are implemented as inline functions using Visual Studio 2019. According to the coding standards and guidelines of the Occult, all assembly language instructions begin with __asm__ using lower-case letters, whereas immediate operands, variables, definitions and macros are preferably capitalized.

7.1 Add Trigger Components

A very simple ROM snippet caller for adding Trigger components. It is a gameplay function with no input and only 1 output parameter.

7.2 Add Actor to Scene

A modular ROM firmware routine for adding an Actor to current scene in only 31 bytes. It has 5 input and 3 output parameters.

7.3 Add Component to Scene

A complex and very modular ROM firmware routine for adding a Component to current scene in only 58 bytes. It has 8 input and 3 output parameters.

7.4 Add Component to Trigger Tree

Yet another complex and very modular ROM firmware routine; this time for adding a Component to the Trigger tree in only 79 bytes. This case demonstrates passing a UE Enum type as an immediate value, adding new tags, and setting the component's location, rotation and scale values from the database with a world offset applied. This function has 8 input and 3 output parameters.

8. Availability

Having been meticulously coded from February 2019 through March 2021 and tested on Windows 10, macOS Big Sur, and iOS 14 platforms, the Occult is now available for personal use in projects that the author is involved in. It is not available for public use.

9. Conclusion

The Occult virtual machine architecture is a comparatively powerful way of programming super-optimized Unreal games. It is a versatile tool: it can be used for developing any type of video game, ranging from next-gen open world AAA games to simple mobile card games. It consists of a virtual CPU, various dedicated coprocessors, and an assembly language driven gameplay programming ecosystem. In most cases, its modern synchronous/asynchronous architecture and low-level coding features simplify time-consuming video game development processes by offering an alternate way of Unreal gameplay programming and scripting.

The Occult is a tool designed for veteran video game developers; a moderate level of assembly language programming experience is a must, and entry level parallel programming background is a bonus.

In future research and development, new features and coprocessors will be implemented. Currently, a smart first/third person camera management system is being developed for delivering next-gen gameplay experience, and an audio management system is in design process.

This article serves as an introduction to the Occult architecture. Technical details of each coprocessor and asynchronous in-game abstract object/data streaming will be thoroughly explained in another article.

10. Acknowledgement

The Occult was conceived, designed and implemented by Mert Börü, a professional video game developer working in the European video game industry since 1985. Throughout his career, he has had the privilege of coding in assembly language on various mainstream processors from Intel, Motorola, Sun Microsystems, MIPS and ARM, but the relationship with the good old Zilog Z80 processor was a “game” changer for him, literally. This project might never have seen the light of day without the dedicated spirit and enthusiasm of the 8/16-bit game development scene in its heyday, in which the author was involved. He keeps on carrying the torch for all the hardworking game developers who turned a humble video game development scene into a global industrial giant. Last but not least, very special thanks to Tuncay Talayman for designing such a fascinating logo and cover illustration, and for supporting the author’s projects throughout the years.

References

[1] Aspray, W. (1990). John von Neumann and the Origins of Modern Computing. MIT Press.

[2] (1985). Commodore Amiga Book: Amiga Hardware Reference Manual. Commodore Amiga Inc., p.190.

[3] Petzold, C. (2000). The Hidden Language of Computer Hardware and Software: Code. Microsoft Press, p.363.

[4] (1992). M68000 Family Programmer’s Reference Manual (PM/AD Rev. 1). Motorola Inc., chapter 1.1.4, p.1-3.

[5] Patterson, D. A., & Hennessy, J. L. (2005). Computer Organization and Design (3rd ed.). Elsevier Inc., Morgan Kaufmann Publishers, p.48.

Vivre libre ou mourir!
