====== Optimizing programs ======

This page serves as a loose list of advice for getting the most out of the WonderSwan.

===== Optimizing C code =====

==== Optimizing for code speed ====

To optimize for speed, compile your code with ''%%-O2%%''.

==== Optimizing for code size ====

To optimize for size, compile your code with ''%%-Os%%''.

==== Optimizing for memory usage ====

  * For data stored in RAM, use the smallest type possible.
    * Exception: For argument passing, there is little reason to prefer ''%%char%%'' over ''%%int%%'' - the stack is always aligned to 2 bytes.
  * By default, GCC allows function call arguments to accumulate on the stack, then pops them all at once. To reduce peak stack usage at the cost of a larger and slightly slower program, compile your code with ''%%-fno-defer-pop%%''.
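
As a minimal sketch of the first point, the two structures below hold the same hypothetical per-object state. The structure, field, and function names are assumptions made for illustration and are not part of any Wonderful Toolchain API. On a target with 2-byte alignment the compact layout is typically 6 bytes instead of 8, while the helper function still takes a plain ''%%int%%'' argument, since a narrower parameter would not save stack space.

```c
#include <stdint.h>

/* Hypothetical object state; every field uses a 16-bit type. */
struct object_wide {
    uint16_t score;
    uint16_t timer;
    uint16_t frame;  /* only ever holds 0..7 */
    uint16_t flags;  /* only ever holds 0..255 */
};

/* Same state, with the two small fields narrowed to one byte each. */
struct object_compact {
    uint16_t score;
    uint16_t timer;
    uint8_t frame;
    uint8_t flags;
};

/* Arguments are passed in 2-byte stack slots, so "int" costs
 * nothing extra here; only the stored value is narrowed. */
uint8_t object_next_frame(int frame) {
    return (uint8_t)((frame + 1) & 7);
}
```

Across an array of such objects, the two bytes saved per entry add up quickly; ''%%make usage%%'' can be used to check the actual effect on RAM allocation.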

===== Optimizing assembly code =====

==== Optimizing for speed ====

While the V30MZ is an 80186-compatible CPU, its instruction timings differ considerably from what 8088/8086-era optimization advice assumes, reflecting its 1990s-era design:

  * ''MUL'' is very fast on this CPU, taking 3-4 cycles. As a result, "shift plus add" ladders will almost always be slower or, at best, equal in performance.
  * Using ''XCHG'' over ''PUSH/POP'' (and ''XCHG AX, reg'' over ''MOV AX, reg'') is a popular pattern on the 8088/8086, where it is faster. On the V30MZ, it is slower:
    * ''XCHG'' on V30MZ always takes 3 cycles.
    * ''PUSH'' and ''POP'' take 1 cycle each for general registers. This means that a ''PUSH/POP'' pair (2 cycles in total) is one cycle faster than ''XCHG''.
    * While ''MOV AX, reg'' is one byte larger than ''XCHG AX, reg'', it takes only 1 cycle instead of 3.
  * ''XLAT'' takes 5 cycles. Many simple CPU operations (such as ''SHR reg, 4'' or ''MUL reg'' - both taking 3 cycles) can actually be faster.

You can study the instruction timings in detail on [[https://ws.nesdev.org/wiki/NEC_V30MZ_instruction_set|the WSdev wiki]].
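
The ''MUL'' advice above also applies to C code, where hand-written shift ladders are sometimes carried over from older 8086 code. The sketch below shows two equivalent ways to multiply by 10; the function names are hypothetical. Whether the compiler actually emits ''MUL'' for the first form depends on its version and optimization flags, so treat this as an illustration and inspect the generated assembly rather than assuming the outcome.

```c
#include <stdint.h>

/* Plain multiplication: leaves the compiler free to emit MUL. */
uint16_t times10_mul(uint16_t value) {
    return (uint16_t)(value * 10u);
}

/* Hand-written "shift plus add" ladder: 10 * v = 8 * v + 2 * v.
 * Per the timings above, this is unlikely to beat MUL on the V30MZ. */
uint16_t times10_ladder(uint16_t value) {
    return (uint16_t)((value << 3) + (value << 1));
}
```

Both functions return the same result for every 16-bit input, including values that wrap around, so the plain form can replace the ladder without changing behavior.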

There are also some additional tricks you can take advantage of:
 
  * Avoid far calls between functions - branches are expensive, and far branches are significantly more expensive. If you're calling a far function from another far function in the same section, use the ''%%IA16_CALL_LOCAL%%'' macro over a far call to save a few cycles.
  * Try word-aligning loop labels by prepending them with ''%%.align 2, 0x90%%'' - this emits a NOP opcode if padding is needed. The gain, if any, is small, so measure before and after.

===== Measuring programs =====

==== Measuring code size ====

Wonderful Toolchain comes with a tool for observing the RAM/ROM allocation and per-symbol sizes: ''%%wf-wswantool usage build/your_program.elf%%''. It is also provided in the default Makefile as ''%%make usage%%''.

==== Measuring performance ====

The best option is to use Mesen 2's profiler. While Mesen 2 is not 100% cycle-accurate for the WonderSwan yet, it's close enough for non-demoscene use cases.

TODO: Document how to use it.