Optimization Tips for MAME programmers

The following is a set of tips for MAME programmers to help squeeze the most speed out of their game drivers.  These tips can also be useful to authors of other emulators, since the same principles apply to game emulation in general.  They are not guaranteed to double the speed of a game, but in many instances they can provide a measurable benefit.  No single area I cover will greatly affect the speed of emulation, but taken as a whole these tips can be useful for improving existing drivers and for helping authors of new drivers avoid problems in the first place.

MAME's programming environment has many well-designed features that make writing game drivers easier; however, it is also easy to overlook how the use of that environment affects a game's overall performance.  The following two general areas are meant to help the MAME developer avoid wasting CPU time unnecessarily.

CPU Emulation

Watchdog timers
Most game machines have a watchdog timer which needs to be read or written frequently or the machine will reset itself.  I have seen in many MAME drivers that the watchdog timer address is used to call a watchdog_reset_r routine.  Once you have debugged the game and everything works correctly, the watchdog timer will never reset the machine, since the smooth, proper operation of the game prevents this from happening.  Leaving the handler in the final game driver just slows things down, since each read or write to that address must pass through a special handler routine.  Certainly leave a comment in the driver to document the watchdog address and behavior, but remove the handler itself from the final version of the driver.  Dan Boris has pointed out that this is true for about 99% of games, but a few actually need the watchdog reset to be active in order to function properly.

Shared memory regions
Many multi-CPU games have some sort of shared memory region between two or more CPUs as a means of communication between the CPUs and to access shared hardware.  If possible, determine which CPU reads and writes this area the most and have the shared memory region reside in that CPU's memory space.  In other words, one of the CPUs does not necessarily need a special handler routine for reading or writing this shared memory space, because it already resides in that CPU's memory.  This can save a considerable amount of time on some games.  For example, reads from shared video memory normally don't need to do anything special.  To see a specific case of this look at \src\drivers\galaga.c line 105 - since the shared memory region resides in CPU #1's memory space, and reads don't do anything special, it does not have to pass through a handler routine for each read done in this area.  The time wasted by unnecessary handler calls has a measurable impact on game performance.
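To make the cost concrete, here is a minimal sketch of a read dispatcher, not MAME's actual memory system.  The names map_entry, cpu_read, shared_ram_r and the 0x8000 base address are all hypothetical.  It only shows that a region mapped as plain RAM is served without any function call, while the same region with a do-nothing handler pays for a call on every read.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified memory map; not MAME's real structures. */
#define SHARED_BASE 0x8000u

typedef uint8_t (*read_handler)(uint16_t offset);

struct map_entry {
    uint16_t start, end;     /* inclusive address range */
    read_handler handler;    /* NULL means plain RAM: no call needed */
};

uint8_t ram[0x10000];
unsigned long handler_calls; /* counts reads that took the slow path */

/* A read handler that does nothing special: pure overhead. */
uint8_t shared_ram_r(uint16_t offset)
{
    handler_calls++;
    return ram[SHARED_BASE + offset];
}

/* Read one byte on behalf of a CPU, using that CPU's memory map. */
uint8_t cpu_read(const struct map_entry *map, size_t count, uint16_t addr)
{
    for (size_t i = 0; i < count; i++) {
        if (addr >= map[i].start && addr <= map[i].end) {
            if (map[i].handler != NULL)
                return map[i].handler((uint16_t)(addr - map[i].start));
            break;           /* plain RAM: fall through to direct read */
        }
    }
    return ram[addr];
}

/* Same region, mapped two ways. */
const struct map_entry map_with_handler[] = {
    { 0x8000, 0x8fff, shared_ram_r },
};
const struct map_entry map_direct[] = {
    { 0x8000, 0x8fff, NULL },
};
```

Both maps return the same byte for the same address; only map_with_handler increments handler_calls.  In a real driver the saving is the handler call and its dispatch work, repeated for every access the CPU makes to the region.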

Custom I/O chips
Several games incorporate custom I/O chips (especially from Namco) which have a read/write memory area.  Very often, this memory area has no action associated with writes to it and only reads need to be sent to a handler.  In this case, there should not be any write handler routine and the custom chip's memory can reside in the normal RAM memory map to avoid calling a routine every time a write occurs.  An example of this is in \src\machine\mappy.c line 124.  Having these writes pass through a handler routine wastes time unnecessarily.

CPU speed and interleave
Most game hardware is documented well enough to know the precise speed of the CPUs, especially in the case of 6809 processors, since the processor is typically run at its rated speed.  If the original game used a CPU clock of 1.536 MHz, then that's what should be emulated; anything faster will just waste time in a busy loop.
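As a rough illustration of why the clock value matters, the work the emulator does per video frame scales directly with the emulated clock.  The helper name cycles_per_frame and the 60 frames per second figure are assumptions for this sketch, not values from any particular driver.

```c
/* Number of CPU cycles the emulator must execute for one video frame.
 * Both parameters are whatever the driver declares; 60 fps is only an
 * assumed example value in the text that follows. */
unsigned long cycles_per_frame(unsigned long clock_hz,
                               unsigned frames_per_second)
{
    return clock_hz / frames_per_second;
}
```

At an assumed 60 frames per second, a 1.536 MHz clock gives 1536000 / 60 = 25600 cycles per frame.  Declaring the CPU at twice that speed would make the emulator execute 51200 cycles per frame, and the extra cycles are spent in the game's own idle loop.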

The flip side of this applies to multi-CPU games.  There typically needs to be some interleave between the CPUs so that shared memory variables are processed at the proper time.  Once the game is running properly, reduce this interleave value to find the minimum that is really needed.

Video

Character Drawing
There are two optimizations that can be done to speed up character drawing.  The first is to optimize the palette usage.  Many older games did not use all 64*4 combinations of palette colors.  A good example is Pac-Man.  Since only 128 color combinations are used, all possible character color combinations can be mapped directly into the working palette, thereby avoiding a color lookup for each pixel.  Many of the newer, more complicated games don't allow this optimization, but always investigate the palette contents before making any assumptions.
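A minimal sketch of the idea follows.  It assumes the 128 combinations break down as 32 color codes times 4 pixel values, and that a lookup table maps each combination to an entry in a small base palette; those details, and every name in the code, are assumptions for illustration only.  The table is resolved once at startup, so the working palette holds one entry per combination and the per-pixel work is a single add.

```c
#include <stdint.h>

#define COLOR_CODES  32   /* assumed: 32 codes x 4 pixel values = 128 */
#define PIXEL_VALUES  4

struct rgb { uint8_t r, g, b; };

/* Build the working palette once at startup.
 *   base_palette:    the hardware's distinct colors
 *   lookup:          for each of the 128 combinations, an index into
 *                    base_palette
 *   working_palette: receives 128 entries, one per combination */
void build_direct_palette(const struct rgb *base_palette,
                          const uint8_t *lookup,
                          struct rgb *working_palette)
{
    for (int i = 0; i < COLOR_CODES * PIXEL_VALUES; i++)
        working_palette[i] = base_palette[lookup[i]];
}

/* Per-pixel work: the combination number is the palette index itself,
 * so no table is read inside the drawing loop. */
uint8_t pen_for_pixel(unsigned color_code, unsigned pixel_value)
{
    return (uint8_t)(color_code * PIXEL_VALUES + pixel_value);
}
```

This only works when the number of combinations fits in the palette the display can offer, which is why the palette contents need to be checked per game.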

The second is the screen update loop.  I have seen in many MAME drivers that for each character to be drawn, the video address is converted into a display x,y through a complicated series of calculations.  These calculations actually waste a measurable amount of time (especially with Intel's slow integer multiply and divide instructions).  A better way to handle this is to create a lookup table which maps the video address to an x,y coordinate or a video buffer address (whichever is more convenient).  Something else to note related to divides: never use the modulo "%" operator when working with powers of 2, since you can't assume that the compiler will turn it into a logical AND instead of the much slower divide (e.g. x = i % 32 vs. x = i & 31).
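A minimal sketch of such a lookup table, assuming a hypothetical 32x32 character layout with 8x8 pixel characters stored row by row; real games often use stranger layouts, which is exactly when the table pays off.  The names tile_pos and init_tile_pos are made up for this example.

```c
#include <stdint.h>

#define COLS       32   /* assumed characters per row (a power of 2) */
#define ROWS       32   /* assumed character rows */
#define TILE        8   /* assumed character size in pixels */
#define VIDEO_SIZE (COLS * ROWS)

struct xy { uint16_t x, y; };

/* One entry per video RAM offset, filled in once at startup. */
struct xy tile_pos[VIDEO_SIZE];

void init_tile_pos(void)
{
    for (unsigned offs = 0; offs < VIDEO_SIZE; offs++) {
        /* However complicated the address arithmetic is, it runs only
         * here, never inside the screen update loop. */
        tile_pos[offs].x = (uint16_t)((offs & (COLS - 1)) * TILE);
        tile_pos[offs].y = (uint16_t)((offs / COLS) * TILE);
    }
}
```

The screen update loop then reads tile_pos[offs].x and tile_pos[offs].y instead of recomputing them.  Note that offs & (COLS - 1) equals offs % COLS only because COLS is a power of 2 and offs is unsigned; with a signed int the two can differ for negative values (in C99, -1 % 32 is -1), which is one reason a compiler cannot always make the substitution for you.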

Memory Usage
Many games have multiple layers of graphics which must be combined to form the final image.  The thing to consider when writing the video rendering code is that the less memory is touched, the faster the code will run.  In other words, use as few buffers as needed to get the job done, since the more memory is involved, the more cache misses will occur and the slower the code will run.
