bsnes/higan/processor/m68k/m68k.hpp

415 lines
24 KiB
C++
Raw Normal View History

Update to v100r02 release. byuu says: Sigh ... I'm really not a good person. I'm inherently selfish. My responsibility and obligation right now is to work on loki, and then on the Tengai Makyou Zero translation, and then on improving the Famicom emulation. And yet ... it's not what I really want to do. That shouldn't matter; I should work on my responsibilities first. Instead, I'm going to be a greedy, self-centered asshole, and work on what I really want to instead. I'm really sorry, guys. I'm sure this will make a few people happy, and probably upset even more people. I'm also making zero guarantees that this ever gets finished. As always, I wish I could keep these things secret, so if I fail / give up, I could just drop it with no shame. But I would have to cut everyone out of the WIP process completely to make it happen. So, here goes ... This WIP adds the initial skeleton for Sega Mega Drive / Genesis emulation. God help us. (minor note: apparently the new extension for Mega Drive games is .md, neat. That's what I chose for the folders too. I thought it was .smd, so that'll be fixed in icarus for the next WIP.) (aside: this is why I wanted to get v100 out. I didn't want this code in a skeleton state in v100's source. Nor did I want really broken emulation, which the first release is sure to be, tarring said release.) ... So, basically, I've been ruminating on the legacy I want to leave behind with higan. 3D systems are just plain out. I'm never going to support them. They're too complex for my abilities, and they would run too slowly with my design style. I'm not willing to compromise my design ideals. And I would never want to play a 3D game system at native 240p/480i resolution ... but 1080p+ upscaling is not accurate, so that's a conflict I want to avoid entirely. It's also never going to emulate computer systems (X68K, PC-98, FM-Towns, etc) because holy shit that would completely destroy me. It's also never going emulate arcade machines. So I think of higan as a collection of 2D emulators for consoles and handhelds. I've gone over every major 2D gaming system there is, looking for ones with games I actually care about and enjoy. And I basically have five of those systems supported already. Looking at the remaining list, I see only three systems left that I have any interest in whatsoever: PC-Engine, Master System, Mega Drive. Again, I'm not in any way committing to emulating any of these, but ... if I had all of those in higan, I think I'd be content to really, truly, finally stop writing more emulators for the rest of my life. And so I decided to tackle the most difficult system first. If I'm successful, the Z80 core should cover a lot of the work on the SMS. And the HuC6280 should land somewhere between the NES and SNES in terms of difficulty ... closer to the NES. The systems that just don't appeal to me at all, which I will never touch, include, but are not limited to: * Atari 2600/5200/7800 * Lynx * Jaguar * Vectrex * Colecovision * Commodore 64 * Neo-Geo * Neo-Geo Pocket / Color * Virtual Boy * Super A'can * 32X * CD-i * etc, etc, etc. And really, even if something were mildly interesting in there ... we have to stop. I can't scale infinitely. I'm already way past my limit, but I'm doing this anyway. Too many cores bloats everything and kills quality on everything. I don't want higan to become MESS v2. I don't know what I'll do about the Famicom Disk System, PC-Engine CD, and Mega CD. I don't think I'll be able to achieve 60fps emulating the Mega CD, even if I tried to. I don't know what's going to happen here with even the Mega Drive. Maybe I'll get driven crazy with the documentation and quit. Maybe it'll end up being too complicated and I'll quit. Maybe the emulation will end up way too slow and I'll give up. Maybe it'll take me seven years to get any games playable at all. Maybe Steve Snake, AamirM and Mike Pavone will pool money to hire a hitman to come after me. Who knows. But this is what I want to do, so ... here goes nothing.
2016-07-09 04:21:37 +00:00
#pragma once
//Motorola M68000
namespace Processor {
struct M68K {
struct Bus {
virtual auto readByte(uint24 addr) -> uint16 = 0;
virtual auto readWord(uint24 addr) -> uint16 = 0;
virtual auto writeByte(uint24 addr, uint16 data) -> void = 0;
virtual auto writeWord(uint24 addr, uint16 data) -> void = 0;
};
virtual auto step(uint clocks) -> void = 0;
enum : bool { User, Supervisor };
enum : uint { Byte, Word, Long };
enum : bool { Reverse = 1, Extend = 1, Hold = 1 };
enum : uint {
DataRegisterDirect,
AddressRegisterDirect,
AddressRegisterIndirect,
AddressRegisterIndirectWithPostIncrement,
AddressRegisterIndirectWithPreDecrement,
AddressRegisterIndirectWithDisplacement,
AddressRegisterIndirectWithIndex,
AbsoluteShortIndirect,
AbsoluteLongIndirect,
ProgramCounterIndirectWithDisplacement,
ProgramCounterIndirectWithIndex,
Immediate,
};
Update to v100r06 release. byuu says: Up to ten 68K instructions out of somewhere between 61 and 88, depending upon which PDF you look at. Of course, some of them aren't 100% completed yet, either. Lots of craziness with MOVEM, and BCC has a BSR variant that needs stack push/pop functions. This WIP actually took over eight hours to make, going through every possible permutation on how to design the core itself. The updated design now builds both the instruction decoder+dispatcher and the disassembler decoder into the same main loop during M68K's constructor. The special cases are also really psychotic on this processor, and I'm afraid of missing something via the fallthrough cases. So instead, I'm ordering the instructions alphabetically, and including exclusion cases to ignore binding invalid cases. If I end up remapping an existing register, then it'll throw a run-time assertion at program startup. I wanted very much to get rid of struct EA (EffectiveAddress), but it's too difficult to keep track of the internal effective address without it. So I split out the size to a separate parameter, since every opcode only has one size parameter, and otherwise it was getting duplicated in opcodes that take two EAs, and was also awkward with the flag testing. It's a bit more typing, but I feel it's more clean this way. Overall, I'm really worried this is going to be too slow. I don't want to turn the EA stuff into templates, because that will massively bloat out compilation times and object sizes, and will also need a special DSL preprocessor since C++ doesn't have a static for loop. I can definitely optimize a lot of EA's address/read/write functions away once the core is completed, but it's never going to hold a candle to a templatized 68K core. ---- Forgot to include the SA-1 regression fix. I always remember immediately after I upload and archive the WIP. Will try to get that in next time, I guess.
2016-07-16 08:39:44 +00:00
struct Exception { enum : uint {
Illegal,
DivisionByZero,
BoundsCheck,
Overflow,
Unprivileged,
Trap,
Update to v101r07 release. byuu says: Added VDP sprite rendering. Can't get any games far enough in to see if it actually works. So in other words, it doesn't work at all and is 100% completely broken. Also added 68K exceptions and interrupts. So far only the VDP interrupt is present. It definitely seems to be firing in commercial games, so that's promising. But the implementation is almost certainly completely wrong. There is fuck all of nothing for documentation on how interrupts actually work. I had to find out the interrupt vector numbers from reading the comments from the Sonic the Hedgehog disassembly. I have literally no fucking clue what I0-I2 (3-bit integer priority value in the status register) is supposed to do. I know that Vblank=6, Hblank=4, Ext(gamepad)=2. I know that at reset, SR.I=7. I don't know if I'm supposed to block interrupts when I is >, >=, <, <= to the interrupt level. I don't know what level CPU exceptions are supposed to be. Also implemented VDP regular DMA. No idea if it works correctly since none of the commercial games run far enough to use it. So again, it's horribly broken for usre. Also improved VDP fill mode. But I don't understand how it takes byte-lengths when the bus is 16-bit. The transfer times indicate it's actually transferring at the same speed as the 68K->VDP copy, strongly suggesting it's actually doing 16-bit transfers at a time. In which case, what happens when you set an odd transfer length? Also, both DMA modes can now target VRAM, VSRAM, CRAM. Supposedly there's all kinds of weird shit going on when you target VSRAM, CRAM with VDP fill/copy modes, but whatever. Get to that later. Also implemented a very lazy preliminary wait mechanism to to stall out a processor while another processor exerts control over the bus. This one's going to be a major work in progress. For one, it totally breaks the model I use to do save states with libco. For another, I don't know if a 68K->VDP DMA instantly locks the CPU, or if it the CPU could actually keep running if it was executing out of RAM when it started the DMA transfer from ROM (eg it's a bus busy stall, not a hard chip stall.) That'll greatly change how I handle the waiting. Also, the OSS driver now supports Audio::Latency. Sound should be even lower latency now. On FreeBSD when set to 0ms, it's absolutely incredible. Cannot detect latency whatsoever. The Mario jump sound seems to happen at the very instant I hear my cherry blue keyswitch activate.
2016-08-15 04:56:38 +00:00
Interrupt,
};};
struct Vector { enum : uint {
Update to v101r07 release. byuu says: Added VDP sprite rendering. Can't get any games far enough in to see if it actually works. So in other words, it doesn't work at all and is 100% completely broken. Also added 68K exceptions and interrupts. So far only the VDP interrupt is present. It definitely seems to be firing in commercial games, so that's promising. But the implementation is almost certainly completely wrong. There is fuck all of nothing for documentation on how interrupts actually work. I had to find out the interrupt vector numbers from reading the comments from the Sonic the Hedgehog disassembly. I have literally no fucking clue what I0-I2 (3-bit integer priority value in the status register) is supposed to do. I know that Vblank=6, Hblank=4, Ext(gamepad)=2. I know that at reset, SR.I=7. I don't know if I'm supposed to block interrupts when I is >, >=, <, <= to the interrupt level. I don't know what level CPU exceptions are supposed to be. Also implemented VDP regular DMA. No idea if it works correctly since none of the commercial games run far enough to use it. So again, it's horribly broken for usre. Also improved VDP fill mode. But I don't understand how it takes byte-lengths when the bus is 16-bit. The transfer times indicate it's actually transferring at the same speed as the 68K->VDP copy, strongly suggesting it's actually doing 16-bit transfers at a time. In which case, what happens when you set an odd transfer length? Also, both DMA modes can now target VRAM, VSRAM, CRAM. Supposedly there's all kinds of weird shit going on when you target VSRAM, CRAM with VDP fill/copy modes, but whatever. Get to that later. Also implemented a very lazy preliminary wait mechanism to to stall out a processor while another processor exerts control over the bus. This one's going to be a major work in progress. For one, it totally breaks the model I use to do save states with libco. For another, I don't know if a 68K->VDP DMA instantly locks the CPU, or if it the CPU could actually keep running if it was executing out of RAM when it started the DMA transfer from ROM (eg it's a bus busy stall, not a hard chip stall.) That'll greatly change how I handle the waiting. Also, the OSS driver now supports Audio::Latency. Sound should be even lower latency now. On FreeBSD when set to 0ms, it's absolutely incredible. Cannot detect latency whatsoever. The Mario jump sound seems to happen at the very instant I hear my cherry blue keyswitch activate.
2016-08-15 04:56:38 +00:00
Illegal = 4,
DivisionByZero = 5,
BoundsCheck = 6,
Overflow = 7,
Unprivileged = 8,
IllegalLineA = 10,
IllegalLineF = 11,
HorizontalBlank = 28,
VerticalBlank = 30,
};};
M68K();
auto power() -> void;
auto reset() -> void;
auto supervisor() -> bool;
auto exception(uint exception, uint vector, uint priority = 7) -> void;
//registers.cpp
struct DataRegister {
explicit DataRegister(uint number_) : number(number_) {}
uint3 number;
};
template<uint Size = Long> auto read(DataRegister reg) -> uint32;
template<uint Size = Long> auto write(DataRegister reg, uint32 data) -> void;
struct AddressRegister {
explicit AddressRegister(uint number_) : number(number_) {}
uint3 number;
};
template<uint Size = Long> auto read(AddressRegister reg) -> uint32;
template<uint Size = Long> auto write(AddressRegister reg, uint32 data) -> void;
auto readCCR() -> uint8;
auto readSR() -> uint16;
auto writeCCR(uint8 ccr) -> void;
auto writeSR(uint16 sr) -> void;
//memory.cpp
template<uint Size> auto read(uint32 addr) -> uint32;
template<uint Size, bool Order = 0> auto write(uint32 addr, uint32 data) -> void;
template<uint Size = Word> auto readPC() -> uint32;
template<uint Size> auto pop() -> uint32;
template<uint Size> auto push(uint32 data) -> void;
//effective-address.cpp
struct EffectiveAddress {
explicit EffectiveAddress(uint mode_, uint reg_) : mode(mode_), reg(reg_) {
if(mode == 7) mode += reg; //optimization: convert modes {7; 0-4} to {8-11}
}
uint4 mode;
uint3 reg;
boolean valid;
uint32 address;
};
template<uint Size> auto fetch(EffectiveAddress& ea) -> uint32;
template<uint Size, bool Hold = 0> auto read(EffectiveAddress& ea) -> uint32;
template<uint Size, bool Hold = 0> auto write(EffectiveAddress& ea, uint32 data) -> void;
//instruction.cpp
auto instruction() -> void;
//instructions.cpp
auto testCondition(uint4 condition) -> bool;
template<uint Size> auto bytes() -> uint;
template<uint Size> auto bits() -> uint;
template<uint Size> auto lsb() -> uint32;
template<uint Size> auto msb() -> uint32;
template<uint Size> auto mask() -> uint32;
template<uint Size> auto clip(uint32 data) -> uint32;
template<uint Size> auto sign(uint32 data) -> int32;
auto instructionABCD(EffectiveAddress with, EffectiveAddress from) -> void;
Update to v100r15 release. byuu wrote: Aforementioned scheduler changes added. Longer explanation of why here: http://hastebin.com/raw/toxedenece Again, we really need to test this as thoroughly as possible for regressions :/ This is a really major change that affects absolutely everything: all emulation cores, all coprocessors, etc. Also added ADDX and SUB to the 68K core, which brings us just barely above 50% of the instruction encoding space completed. [Editor's note: The "aformentioned scheduler changes" were described in a previous forum post: Unfortunately, 64-bits just wasn't enough precision (we were getting misalignments ~230 times a second on 21/24MHz clocks), so I had to move to 128-bit counters. This of course doesn't exist on 32-bit architectures (and probably not on all 64-bit ones either), so for now ... higan's only going to compile on 64-bit machines until we figure something out. Maybe we offer a "lower precision" fallback for machines that lack uint128_t or something. Using the booth algorithm would be way too slow. Anyway, the precision is now 2^-96, which is roughly 10^-29. That puts us far beyond the yoctosecond. Suck it, MAME :P I'm jokingly referring to it as the byuusecond. The other 32-bits of precision allows a 1Hz clock to run up to one full second before all clocks need to be normalized to prevent overflow. I fixed a serious wobbling issue where I was using clock > other.clock for synchronization instead of clock >= other.clock; and also another aliasing issue when two threads share a common frequency, but don't run in lock-step. The latter I don't even fully understand, but I did observe it in testing. nall/serialization.hpp has been extended to support 128-bit integers, but without explicitly naming them (yay generic code), so nall will still compile on 32-bit platforms for all other applications. Speed is basically a wash now. FC's a bit slower, SFC's a bit faster. The "longer explanation" in the linked hastebin is: Okay, so the idea is that we can have an arbitrary number of oscillators. Take the SNES: - CPU/PPU clock = 21477272.727272hz - SMP/DSP clock = 24576000hz - Cartridge DSP1 clock = 8000000hz - Cartridge MSU1 clock = 44100hz - Controller Port 1 modem controller clock = 57600hz - Controller Port 2 barcode battler clock = 115200hz - Expansion Port exercise bike clock = 192000hz Is this a pathological case? Of course it is, but it's possible. The first four do exist in the wild already: see Rockman X2 MSU1 patch. Manifest files with higan let you specify any frequency you want for any component. The old trick higan used was to hold an int64 counter for each thread:thread synchronization, and adjust it like so: - if thread A steps X clocks; then clock += X * threadB.frequency - if clock >= 0; switch to threadB - if thread B steps X clocks; then clock -= X * threadA.frequency - if clock < 0; switch to threadA But there are also system configurations where one processor has to synchronize with more than one other processor. Take the Genesis: - the 68K has to sync with the Z80 and PSG and YM2612 and VDP - the Z80 has to sync with the 68K and PSG and YM2612 - the PSG has to sync with the 68K and Z80 and YM2612 Now I could do this by having an int64 clock value for every association. But these clock values would have to be outside the individual Thread class objects, and we would have to update every relationship's clock value. So the 68K would have to update the Z80, PSG, YM2612 and VDP clocks. That's four expensive 64-bit multiply-adds per clock step event instead of one. As such, we have to account for both possibilities. The only way to do this is with a single time base. We do this like so: - setup: scalar = timeBase / frequency - step: clock += scalar * clocks Once per second, we look at every thread, find the smallest clock value. Then subtract that value from all threads. This prevents the clock counters from overflowing. Unfortunately, these oscillator values are psychotic, unpredictable, and often times repeating fractions. Even with a timeBase of 1,000,000,000,000,000,000 (one attosecond); we get rounding errors every ~16,300 synchronizations. Specifically, this happens with a CPU running at 21477273hz (rounded) and SMP running at 24576000hz. That may be good enough for most emulators, but ... you know how I am. Plus, even at the attosecond level, we're really pushing against the limits of 64-bit integers. Given the reciprocal inverse, a frequency of 1Hz (which does exist in higan!) would have a scalar that consumes 1/18th of the entire range of a uint64 on every single step. Yes, I could raise the frequency, and then step by that amount, I know. But I don't want to have weird gotchas like that in the scheduler core. Until I increase the accuracy to about 100 times greater than a yoctosecond, the rounding errors are too great. And since the only choice above 64-bit values is 128-bit values; we might as well use all the extra headroom. 2^-96 as a timebase gives me the ability to have both a 1Hz and 4GHz clock; and run them both for a full second; before an overflow event would occur. Another hastebin includes demonstration code: #include <libco/libco.h> #include <nall/nall.hpp> using namespace nall; // cothread_t mainThread = nullptr; const uint iterations = 100'000'000; const uint cpuFreq = 21477272.727272 + 0.5; const uint smpFreq = 24576000.000000 + 0.5; const uint cpuStep = 4; const uint smpStep = 5; // struct ThreadA { cothread_t handle = nullptr; uint64 frequency = 0; int64 clock = 0; auto create(auto (*entrypoint)() -> void, uint frequency) { this->handle = co_create(65536, entrypoint); this->frequency = frequency; this->clock = 0; } }; struct CPUA : ThreadA { static auto Enter() -> void; auto main() -> void; CPUA() { create(&CPUA::Enter, cpuFreq); } } cpuA; struct SMPA : ThreadA { static auto Enter() -> void; auto main() -> void; SMPA() { create(&SMPA::Enter, smpFreq); } } smpA; uint8 queueA[iterations]; uint offsetA; cothread_t resumeA = cpuA.handle; auto EnterA() -> void { offsetA = 0; co_switch(resumeA); } auto QueueA(uint value) -> void { queueA[offsetA++] = value; if(offsetA >= iterations) { resumeA = co_active(); co_switch(mainThread); } } auto CPUA::Enter() -> void { while(true) cpuA.main(); } auto CPUA::main() -> void { QueueA(1); smpA.clock -= cpuStep * smpA.frequency; if(smpA.clock < 0) co_switch(smpA.handle); } auto SMPA::Enter() -> void { while(true) smpA.main(); } auto SMPA::main() -> void { QueueA(2); smpA.clock += smpStep * cpuA.frequency; if(smpA.clock >= 0) co_switch(cpuA.handle); } // struct ThreadB { cothread_t handle = nullptr; uint128_t scalar = 0; uint128_t clock = 0; auto print128(uint128_t value) { string s; while(value) { s.append((char)('0' + value % 10)); value /= 10; } s.reverse(); print(s, "\n"); } //femtosecond (10^15) = 16306 //attosecond (10^18) = 688838 //zeptosecond (10^21) = 13712691 //yoctosecond (10^24) = 13712691 (hitting a dead-end on a rounding error causing a wobble) //byuusecond? ( 2^96) = (perfect? 79,228 times more precise than a yoctosecond) auto create(auto (*entrypoint)() -> void, uint128_t frequency) { this->handle = co_create(65536, entrypoint); uint128_t unitOfTime = 1; //for(uint n : range(29)) unitOfTime *= 10; unitOfTime <<= 96; //2^96 time units ... this->scalar = unitOfTime / frequency; print128(this->scalar); this->clock = 0; } auto step(uint128_t clocks) -> void { clock += clocks * scalar; } auto synchronize(ThreadB& thread) -> void { if(clock >= thread.clock) co_switch(thread.handle); } }; struct CPUB : ThreadB { static auto Enter() -> void; auto main() -> void; CPUB() { create(&CPUB::Enter, cpuFreq); } } cpuB; struct SMPB : ThreadB { static auto Enter() -> void; auto main() -> void; SMPB() { create(&SMPB::Enter, smpFreq); clock = 1; } } smpB; auto correct() -> void { auto minimum = min(cpuB.clock, smpB.clock); cpuB.clock -= minimum; smpB.clock -= minimum; } uint8 queueB[iterations]; uint offsetB; cothread_t resumeB = cpuB.handle; auto EnterB() -> void { correct(); offsetB = 0; co_switch(resumeB); } auto QueueB(uint value) -> void { queueB[offsetB++] = value; if(offsetB >= iterations) { resumeB = co_active(); co_switch(mainThread); } } auto CPUB::Enter() -> void { while(true) cpuB.main(); } auto CPUB::main() -> void { QueueB(1); step(cpuStep); synchronize(smpB); } auto SMPB::Enter() -> void { while(true) smpB.main(); } auto SMPB::main() -> void { QueueB(2); step(smpStep); synchronize(cpuB); } // #include <nall/main.hpp> auto nall::main(string_vector) -> void { mainThread = co_active(); uint masterCounter = 0; while(true) { print(masterCounter++, " ...\n"); auto A = clock(); EnterA(); auto B = clock(); print((double)(B - A) / CLOCKS_PER_SEC, "s\n"); auto C = clock(); EnterB(); auto D = clock(); print((double)(D - C) / CLOCKS_PER_SEC, "s\n"); for(uint n : range(iterations)) { if(queueA[n] != queueB[n]) return print("fail at ", n, "\n"); } } } ...and that's everything.]
2016-07-31 02:11:20 +00:00
template<uint Size, bool Extend = false> auto ADD(uint32 source, uint32 target) -> uint32;
template<uint Size> auto instructionADD(EffectiveAddress from, DataRegister with) -> void;
template<uint Size> auto instructionADD(DataRegister from, EffectiveAddress with) -> void;
template<uint Size> auto instructionADDA(AddressRegister ar, EffectiveAddress ea) -> void;
template<uint Size> auto instructionADDI(EffectiveAddress modify) -> void;
template<uint Size> auto instructionADDQ(uint4 immediate, EffectiveAddress with) -> void;
template<uint Size> auto instructionADDQ(uint4 immediate, AddressRegister with) -> void;
template<uint Size> auto instructionADDX(EffectiveAddress with, EffectiveAddress from) -> void;
template<uint Size> auto AND(uint32 source, uint32 target) -> uint32;
template<uint Size> auto instructionAND(EffectiveAddress from, DataRegister with) -> void;
template<uint Size> auto instructionAND(DataRegister from, EffectiveAddress with) -> void;
template<uint Size> auto instructionANDI(EffectiveAddress with) -> void;
auto instructionANDI_TO_CCR() -> void;
auto instructionANDI_TO_SR() -> void;
template<uint Size> auto ASL(uint32 result, uint shift) -> uint32;
template<uint Size> auto instructionASL(uint4 shift, DataRegister modify) -> void;
template<uint Size> auto instructionASL(DataRegister shift, DataRegister modify) -> void;
auto instructionASL(EffectiveAddress modify) -> void;
template<uint Size> auto ASR(uint32 result, uint shift) -> uint32;
template<uint Size> auto instructionASR(uint4 shift, DataRegister modify) -> void;
template<uint Size> auto instructionASR(DataRegister shift, DataRegister modify) -> void;
auto instructionASR(EffectiveAddress modify) -> void;
auto instructionBCC(uint4 condition, uint8 displacement) -> void;
template<uint Size> auto instructionBCHG(DataRegister bit, EffectiveAddress with) -> void;
template<uint Size> auto instructionBCHG(EffectiveAddress with) -> void;
template<uint Size> auto instructionBCLR(DataRegister bit, EffectiveAddress with) -> void;
template<uint Size> auto instructionBCLR(EffectiveAddress with) -> void;
template<uint Size> auto instructionBSET(DataRegister bit, EffectiveAddress with) -> void;
template<uint Size> auto instructionBSET(EffectiveAddress with) -> void;
template<uint Size> auto instructionBTST(DataRegister bit, EffectiveAddress with) -> void;
template<uint Size> auto instructionBTST(EffectiveAddress with) -> void;
auto instructionCHK(DataRegister compare, EffectiveAddress maximum) -> void;
template<uint Size> auto instructionCLR(EffectiveAddress ea) -> void;
template<uint Size> auto CMP(uint32 source, uint32 target) -> uint32;
template<uint Size> auto instructionCMP(DataRegister dr, EffectiveAddress ea) -> void;
template<uint Size> auto instructionCMPA(AddressRegister ar, EffectiveAddress ea) -> void;
template<uint Size> auto instructionCMPI(EffectiveAddress ea) -> void;
template<uint Size> auto instructionCMPM(EffectiveAddress ax, EffectiveAddress ay) -> void;
auto instructionDBCC(uint4 condition, DataRegister dr) -> void;
template<bool Sign> auto DIV(uint16 divisor, DataRegister with) -> void;
auto instructionDIVS(DataRegister with, EffectiveAddress from) -> void;
auto instructionDIVU(DataRegister with, EffectiveAddress from) -> void;
template<uint Size> auto EOR(uint32 source, uint32 target) -> uint32;
template<uint Size> auto instructionEOR(DataRegister from, EffectiveAddress with) -> void;
template<uint Size> auto instructionEORI(EffectiveAddress with) -> void;
auto instructionEORI_TO_CCR() -> void;
auto instructionEORI_TO_SR() -> void;
auto instructionEXG(DataRegister x, DataRegister y) -> void;
auto instructionEXG(AddressRegister x, AddressRegister y) -> void;
auto instructionEXG(DataRegister x, AddressRegister y) -> void;
template<uint Size> auto instructionEXT(DataRegister with) -> void;
auto instructionILLEGAL() -> void;
auto instructionJMP(EffectiveAddress target) -> void;
auto instructionJSR(EffectiveAddress target) -> void;
auto instructionLEA(AddressRegister ar, EffectiveAddress ea) -> void;
auto instructionLINK(AddressRegister with) -> void;
template<uint Size> auto LSL(uint32 result, uint shift) -> uint32;
template<uint Size> auto instructionLSL(uint4 immediate, DataRegister dr) -> void;
template<uint Size> auto instructionLSL(DataRegister sr, DataRegister dr) -> void;
auto instructionLSL(EffectiveAddress ea) -> void;
template<uint Size> auto LSR(uint32 result, uint shift) -> uint32;
template<uint Size> auto instructionLSR(uint4 immediate, DataRegister dr) -> void;
template<uint Size> auto instructionLSR(DataRegister shift, DataRegister dr) -> void;
auto instructionLSR(EffectiveAddress ea) -> void;
template<uint Size> auto instructionMOVE(EffectiveAddress to, EffectiveAddress from) -> void;
template<uint Size> auto instructionMOVEA(AddressRegister ar, EffectiveAddress ea) -> void;
template<uint Size> auto instructionMOVEM_TO_MEM(EffectiveAddress to) -> void;
template<uint Size> auto instructionMOVEM_TO_REG(EffectiveAddress from) -> void;
template<uint Size> auto instructionMOVEP(DataRegister from, EffectiveAddress to) -> void;
template<uint Size> auto instructionMOVEP(EffectiveAddress from, DataRegister to) -> void;
auto instructionMOVEQ(DataRegister dr, uint8 immediate) -> void;
auto instructionMOVE_FROM_SR(EffectiveAddress ea) -> void;
auto instructionMOVE_TO_CCR(EffectiveAddress ea) -> void;
auto instructionMOVE_TO_SR(EffectiveAddress ea) -> void;
auto instructionMOVE_FROM_USP(AddressRegister to) -> void;
auto instructionMOVE_TO_USP(AddressRegister from) -> void;
auto instructionMULS(DataRegister with, EffectiveAddress from) -> void;
auto instructionMULU(DataRegister with, EffectiveAddress from) -> void;
auto instructionNBCD(EffectiveAddress with) -> void;
template<uint Size> auto instructionNEG(EffectiveAddress with) -> void;
template<uint Size> auto instructionNEGX(EffectiveAddress with) -> void;
auto instructionNOP() -> void;
template<uint Size> auto instructionNOT(EffectiveAddress with) -> void;
template<uint Size> auto OR(uint32 source, uint32 target) -> uint32;
template<uint Size> auto instructionOR(EffectiveAddress from, DataRegister with) -> void;
template<uint Size> auto instructionOR(DataRegister from, EffectiveAddress with) -> void;
template<uint Size> auto instructionORI(EffectiveAddress with) -> void;
auto instructionORI_TO_CCR() -> void;
auto instructionORI_TO_SR() -> void;
auto instructionPEA(EffectiveAddress from) -> void;
auto instructionRESET() -> void;
template<uint Size> auto ROL(uint32 result, uint shift) -> uint32;
template<uint Size> auto instructionROL(uint4 shift, DataRegister modify) -> void;
template<uint Size> auto instructionROL(DataRegister shift, DataRegister modify) -> void;
auto instructionROL(EffectiveAddress modify) -> void;
template<uint Size> auto ROR(uint32 result, uint shift) -> uint32;
template<uint Size> auto instructionROR(uint4 shift, DataRegister modify) -> void;
template<uint Size> auto instructionROR(DataRegister shift, DataRegister modify) -> void;
auto instructionROR(EffectiveAddress modify) -> void;
template<uint Size> auto ROXL(uint32 result, uint shift) -> uint32;
template<uint Size> auto instructionROXL(uint4 shift, DataRegister modify) -> void;
template<uint Size> auto instructionROXL(DataRegister shift, DataRegister modify) -> void;
auto instructionROXL(EffectiveAddress modify) -> void;
template<uint Size> auto ROXR(uint32 result, uint shift) -> uint32;
template<uint Size> auto instructionROXR(uint4 shift, DataRegister modify) -> void;
template<uint Size> auto instructionROXR(DataRegister shift, DataRegister modify) -> void;
auto instructionROXR(EffectiveAddress modify) -> void;
auto instructionRTE() -> void;
auto instructionRTR() -> void;
auto instructionRTS() -> void;
auto instructionSBCD(EffectiveAddress with, EffectiveAddress from) -> void;
auto instructionSCC(uint4 condition, EffectiveAddress to) -> void;
auto instructionSTOP() -> void;
Update to v100r15 release. byuu wrote: Aforementioned scheduler changes added. Longer explanation of why here: http://hastebin.com/raw/toxedenece Again, we really need to test this as thoroughly as possible for regressions :/ This is a really major change that affects absolutely everything: all emulation cores, all coprocessors, etc. Also added ADDX and SUB to the 68K core, which brings us just barely above 50% of the instruction encoding space completed. [Editor's note: The "aformentioned scheduler changes" were described in a previous forum post: Unfortunately, 64-bits just wasn't enough precision (we were getting misalignments ~230 times a second on 21/24MHz clocks), so I had to move to 128-bit counters. This of course doesn't exist on 32-bit architectures (and probably not on all 64-bit ones either), so for now ... higan's only going to compile on 64-bit machines until we figure something out. Maybe we offer a "lower precision" fallback for machines that lack uint128_t or something. Using the booth algorithm would be way too slow. Anyway, the precision is now 2^-96, which is roughly 10^-29. That puts us far beyond the yoctosecond. Suck it, MAME :P I'm jokingly referring to it as the byuusecond. The other 32-bits of precision allows a 1Hz clock to run up to one full second before all clocks need to be normalized to prevent overflow. I fixed a serious wobbling issue where I was using clock > other.clock for synchronization instead of clock >= other.clock; and also another aliasing issue when two threads share a common frequency, but don't run in lock-step. The latter I don't even fully understand, but I did observe it in testing. nall/serialization.hpp has been extended to support 128-bit integers, but without explicitly naming them (yay generic code), so nall will still compile on 32-bit platforms for all other applications. Speed is basically a wash now. FC's a bit slower, SFC's a bit faster. The "longer explanation" in the linked hastebin is: Okay, so the idea is that we can have an arbitrary number of oscillators. Take the SNES: - CPU/PPU clock = 21477272.727272hz - SMP/DSP clock = 24576000hz - Cartridge DSP1 clock = 8000000hz - Cartridge MSU1 clock = 44100hz - Controller Port 1 modem controller clock = 57600hz - Controller Port 2 barcode battler clock = 115200hz - Expansion Port exercise bike clock = 192000hz Is this a pathological case? Of course it is, but it's possible. The first four do exist in the wild already: see Rockman X2 MSU1 patch. Manifest files with higan let you specify any frequency you want for any component. The old trick higan used was to hold an int64 counter for each thread:thread synchronization, and adjust it like so: - if thread A steps X clocks; then clock += X * threadB.frequency - if clock >= 0; switch to threadB - if thread B steps X clocks; then clock -= X * threadA.frequency - if clock < 0; switch to threadA But there are also system configurations where one processor has to synchronize with more than one other processor. Take the Genesis: - the 68K has to sync with the Z80 and PSG and YM2612 and VDP - the Z80 has to sync with the 68K and PSG and YM2612 - the PSG has to sync with the 68K and Z80 and YM2612 Now I could do this by having an int64 clock value for every association. But these clock values would have to be outside the individual Thread class objects, and we would have to update every relationship's clock value. So the 68K would have to update the Z80, PSG, YM2612 and VDP clocks. That's four expensive 64-bit multiply-adds per clock step event instead of one. As such, we have to account for both possibilities. The only way to do this is with a single time base. We do this like so: - setup: scalar = timeBase / frequency - step: clock += scalar * clocks Once per second, we look at every thread, find the smallest clock value. Then subtract that value from all threads. This prevents the clock counters from overflowing. Unfortunately, these oscillator values are psychotic, unpredictable, and often times repeating fractions. Even with a timeBase of 1,000,000,000,000,000,000 (one attosecond); we get rounding errors every ~16,300 synchronizations. Specifically, this happens with a CPU running at 21477273hz (rounded) and SMP running at 24576000hz. That may be good enough for most emulators, but ... you know how I am. Plus, even at the attosecond level, we're really pushing against the limits of 64-bit integers. Given the reciprocal inverse, a frequency of 1Hz (which does exist in higan!) would have a scalar that consumes 1/18th of the entire range of a uint64 on every single step. Yes, I could raise the frequency, and then step by that amount, I know. But I don't want to have weird gotchas like that in the scheduler core. Until I increase the accuracy to about 100 times greater than a yoctosecond, the rounding errors are too great. And since the only choice above 64-bit values is 128-bit values; we might as well use all the extra headroom. 2^-96 as a timebase gives me the ability to have both a 1Hz and 4GHz clock; and run them both for a full second; before an overflow event would occur. Another hastebin includes demonstration code: #include <libco/libco.h> #include <nall/nall.hpp> using namespace nall; // cothread_t mainThread = nullptr; const uint iterations = 100'000'000; const uint cpuFreq = 21477272.727272 + 0.5; const uint smpFreq = 24576000.000000 + 0.5; const uint cpuStep = 4; const uint smpStep = 5; // struct ThreadA { cothread_t handle = nullptr; uint64 frequency = 0; int64 clock = 0; auto create(auto (*entrypoint)() -> void, uint frequency) { this->handle = co_create(65536, entrypoint); this->frequency = frequency; this->clock = 0; } }; struct CPUA : ThreadA { static auto Enter() -> void; auto main() -> void; CPUA() { create(&CPUA::Enter, cpuFreq); } } cpuA; struct SMPA : ThreadA { static auto Enter() -> void; auto main() -> void; SMPA() { create(&SMPA::Enter, smpFreq); } } smpA; uint8 queueA[iterations]; uint offsetA; cothread_t resumeA = cpuA.handle; auto EnterA() -> void { offsetA = 0; co_switch(resumeA); } auto QueueA(uint value) -> void { queueA[offsetA++] = value; if(offsetA >= iterations) { resumeA = co_active(); co_switch(mainThread); } } auto CPUA::Enter() -> void { while(true) cpuA.main(); } auto CPUA::main() -> void { QueueA(1); smpA.clock -= cpuStep * smpA.frequency; if(smpA.clock < 0) co_switch(smpA.handle); } auto SMPA::Enter() -> void { while(true) smpA.main(); } auto SMPA::main() -> void { QueueA(2); smpA.clock += smpStep * cpuA.frequency; if(smpA.clock >= 0) co_switch(cpuA.handle); } // struct ThreadB { cothread_t handle = nullptr; uint128_t scalar = 0; uint128_t clock = 0; auto print128(uint128_t value) { string s; while(value) { s.append((char)('0' + value % 10)); value /= 10; } s.reverse(); print(s, "\n"); } //femtosecond (10^15) = 16306 //attosecond (10^18) = 688838 //zeptosecond (10^21) = 13712691 //yoctosecond (10^24) = 13712691 (hitting a dead-end on a rounding error causing a wobble) //byuusecond? ( 2^96) = (perfect? 79,228 times more precise than a yoctosecond) auto create(auto (*entrypoint)() -> void, uint128_t frequency) { this->handle = co_create(65536, entrypoint); uint128_t unitOfTime = 1; //for(uint n : range(29)) unitOfTime *= 10; unitOfTime <<= 96; //2^96 time units ... this->scalar = unitOfTime / frequency; print128(this->scalar); this->clock = 0; } auto step(uint128_t clocks) -> void { clock += clocks * scalar; } auto synchronize(ThreadB& thread) -> void { if(clock >= thread.clock) co_switch(thread.handle); } }; struct CPUB : ThreadB { static auto Enter() -> void; auto main() -> void; CPUB() { create(&CPUB::Enter, cpuFreq); } } cpuB; struct SMPB : ThreadB { static auto Enter() -> void; auto main() -> void; SMPB() { create(&SMPB::Enter, smpFreq); clock = 1; } } smpB; auto correct() -> void { auto minimum = min(cpuB.clock, smpB.clock); cpuB.clock -= minimum; smpB.clock -= minimum; } uint8 queueB[iterations]; uint offsetB; cothread_t resumeB = cpuB.handle; auto EnterB() -> void { correct(); offsetB = 0; co_switch(resumeB); } auto QueueB(uint value) -> void { queueB[offsetB++] = value; if(offsetB >= iterations) { resumeB = co_active(); co_switch(mainThread); } } auto CPUB::Enter() -> void { while(true) cpuB.main(); } auto CPUB::main() -> void { QueueB(1); step(cpuStep); synchronize(smpB); } auto SMPB::Enter() -> void { while(true) smpB.main(); } auto SMPB::main() -> void { QueueB(2); step(smpStep); synchronize(cpuB); } // #include <nall/main.hpp> auto nall::main(string_vector) -> void { mainThread = co_active(); uint masterCounter = 0; while(true) { print(masterCounter++, " ...\n"); auto A = clock(); EnterA(); auto B = clock(); print((double)(B - A) / CLOCKS_PER_SEC, "s\n"); auto C = clock(); EnterB(); auto D = clock(); print((double)(D - C) / CLOCKS_PER_SEC, "s\n"); for(uint n : range(iterations)) { if(queueA[n] != queueB[n]) return print("fail at ", n, "\n"); } } } ...and that's everything.]
2016-07-31 02:11:20 +00:00
template<uint Size, bool Extend = false> auto SUB(uint32 source, uint32 target) -> uint32;
template<uint Size> auto instructionSUB(EffectiveAddress source, DataRegister target) -> void;
template<uint Size> auto instructionSUB(DataRegister source, EffectiveAddress target) -> void;
template<uint Size> auto instructionSUBA(AddressRegister to, EffectiveAddress from) -> void;
template<uint Size> auto instructionSUBI(EffectiveAddress with) -> void;
template<uint Size> auto instructionSUBQ(uint4 immediate, EffectiveAddress with) -> void;
template<uint Size> auto instructionSUBQ(uint4 immediate, AddressRegister with) -> void;
template<uint Size> auto instructionSUBX(EffectiveAddress with, EffectiveAddress from) -> void;
auto instructionSWAP(DataRegister with) -> void;
auto instructionTAS(EffectiveAddress with) -> void;
auto instructionTRAP(uint4 vector) -> void;
auto instructionTRAPV() -> void;
template<uint Size> auto instructionTST(EffectiveAddress ea) -> void;
auto instructionUNLK(AddressRegister with) -> void;
//disassembler.cpp
auto disassemble(uint32 pc) -> string;
auto disassembleRegisters() -> string;
struct Registers {
Update to v101r04 release. byuu says: Changelog: - pulled the (u)intN type aliases into higan instead of leaving them in nall - added 68K LINEA, LINEF hooks for illegal instructions - filled the rest of the 68K lambda table with generic instance of ILLEGAL - completed the 68K disassembler effective addressing modes - still unsure whether I should use An to decode absolute addresses or not - pro: way easier to read where accesses are taking place - con: requires An to be valid; so as a disassembler it does a poor job - making it optional: too much work; ick - added I/O decoding for the VDP command-port registers - added skeleton timing to all five processor cores - output at 1280x480 (needed for mixed 256/320 widths; and to handle interlace modes) The VDP, PSG, Z80, YM2612 are all stepping one clock at a time and syncing; which is the pathological worst case for libco. But they also have no logic inside of them. With all the above, I'm averaging around 250fps with just the 68K core actually functional, and the VDP doing a dumb "draw white pixels" loop. Still way too early to tell how this emulator is going to perform. Also, the 320x240 mode of the Genesis means that we don't need an aspect correction ratio. But we do need to ensure the output window is a multiple 320x240 so that the scale values work correctly. I was hard-coding aspect correction to stretch the window an additional \*8/7. But that won't work anymore so ... the main higan window is now 640x480, 960x720, or 1280x960. Toggling aspect correction only changes the video width inside the window. It's a bit jarring ... the window is a lot wider, more black space now for most modes. But for now, it is what it is.
2016-08-12 01:07:04 +00:00
uint32 d[8]; //data registers
uint32 a[8]; //address registers (a7 = s ? ssp : usp)
uint32 sp; //inactive stack pointer (s ? usp : ssp)
uint32 pc; //program counter
bool c; //carry
bool v; //overflow
bool z; //zero
bool n; //negative
bool x; //extend
uint3 i; //interrupt mask
bool s; //supervisor mode
bool t; //trace mode
bool stop;
bool reset;
} r;
uint16 opcode = 0;
uint instructionsExecuted = 0;
function<void ()> instructionTable[65536];
Bus* bus = nullptr;
Update to v100r06 release. byuu says: Up to ten 68K instructions out of somewhere between 61 and 88, depending upon which PDF you look at. Of course, some of them aren't 100% completed yet, either. Lots of craziness with MOVEM, and BCC has a BSR variant that needs stack push/pop functions. This WIP actually took over eight hours to make, going through every possible permutation on how to design the core itself. The updated design now builds both the instruction decoder+dispatcher and the disassembler decoder into the same main loop during M68K's constructor. The special cases are also really psychotic on this processor, and I'm afraid of missing something via the fallthrough cases. So instead, I'm ordering the instructions alphabetically, and including exclusion cases to ignore binding invalid cases. If I end up remapping an existing register, then it'll throw a run-time assertion at program startup. I wanted very much to get rid of struct EA (EffectiveAddress), but it's too difficult to keep track of the internal effective address without it. So I split out the size to a separate parameter, since every opcode only has one size parameter, and otherwise it was getting duplicated in opcodes that take two EAs, and was also awkward with the flag testing. It's a bit more typing, but I feel it's more clean this way. Overall, I'm really worried this is going to be too slow. I don't want to turn the EA stuff into templates, because that will massively bloat out compilation times and object sizes, and will also need a special DSL preprocessor since C++ doesn't have a static for loop. I can definitely optimize a lot of EA's address/read/write functions away once the core is completed, but it's never going to hold a candle to a templatized 68K core. ---- Forgot to include the SA-1 regression fix. I always remember immediately after I upload and archive the WIP. Will try to get that in next time, I guess.
2016-07-16 08:39:44 +00:00
private:
//disassembler.cpp
auto disassembleABCD(EffectiveAddress with, EffectiveAddress from) -> string;
template<uint Size> auto disassembleADD(EffectiveAddress from, DataRegister with) -> string;
template<uint Size> auto disassembleADD(DataRegister from, EffectiveAddress with) -> string;
template<uint Size> auto disassembleADDA(AddressRegister ar, EffectiveAddress ea) -> string;
template<uint Size> auto disassembleADDI(EffectiveAddress modify) -> string;
template<uint Size> auto disassembleADDQ(uint4 immediate, EffectiveAddress with) -> string;
template<uint Size> auto disassembleADDQ(uint4 immediate, AddressRegister with) -> string;
template<uint Size> auto disassembleADDX(EffectiveAddress with, EffectiveAddress from) -> string;
template<uint Size> auto disassembleAND(EffectiveAddress from, DataRegister with) -> string;
template<uint Size> auto disassembleAND(DataRegister from, EffectiveAddress with) -> string;
template<uint Size> auto disassembleANDI(EffectiveAddress with) -> string;
auto disassembleANDI_TO_CCR() -> string;
auto disassembleANDI_TO_SR() -> string;
template<uint Size> auto disassembleASL(uint4 shift, DataRegister modify) -> string;
template<uint Size> auto disassembleASL(DataRegister shift, DataRegister modify) -> string;
auto disassembleASL(EffectiveAddress modify) -> string;
template<uint Size> auto disassembleASR(uint4 shift, DataRegister modify) -> string;
template<uint Size> auto disassembleASR(DataRegister shift, DataRegister modify) -> string;
auto disassembleASR(EffectiveAddress modify) -> string;
auto disassembleBCC(uint4 condition, uint8 displacement) -> string;
template<uint Size> auto disassembleBCHG(DataRegister bit, EffectiveAddress with) -> string;
template<uint Size> auto disassembleBCHG(EffectiveAddress with) -> string;
template<uint Size> auto disassembleBCLR(DataRegister bit, EffectiveAddress with) -> string;
template<uint Size> auto disassembleBCLR(EffectiveAddress with) -> string;
template<uint Size> auto disassembleBSET(DataRegister bit, EffectiveAddress with) -> string;
template<uint Size> auto disassembleBSET(EffectiveAddress with) -> string;
template<uint Size> auto disassembleBTST(DataRegister bit, EffectiveAddress with) -> string;
template<uint Size> auto disassembleBTST(EffectiveAddress with) -> string;
auto disassembleCHK(DataRegister compare, EffectiveAddress maximum) -> string;
template<uint Size> auto disassembleCLR(EffectiveAddress ea) -> string;
template<uint Size> auto disassembleCMP(DataRegister dr, EffectiveAddress ea) -> string;
template<uint Size> auto disassembleCMPA(AddressRegister ar, EffectiveAddress ea) -> string;
template<uint Size> auto disassembleCMPI(EffectiveAddress ea) -> string;
template<uint Size> auto disassembleCMPM(EffectiveAddress ax, EffectiveAddress ay) -> string;
auto disassembleDBCC(uint4 condition, DataRegister dr) -> string;
auto disassembleDIVS(DataRegister with, EffectiveAddress from) -> string;
auto disassembleDIVU(DataRegister with, EffectiveAddress from) -> string;
template<uint Size> auto disassembleEOR(DataRegister from, EffectiveAddress with) -> string;
template<uint Size> auto disassembleEORI(EffectiveAddress with) -> string;
auto disassembleEORI_TO_CCR() -> string;
auto disassembleEORI_TO_SR() -> string;
auto disassembleEXG(DataRegister x, DataRegister y) -> string;
auto disassembleEXG(AddressRegister x, AddressRegister y) -> string;
auto disassembleEXG(DataRegister x, AddressRegister y) -> string;
template<uint Size> auto disassembleEXT(DataRegister with) -> string;
auto disassembleILLEGAL() -> string;
auto disassembleJMP(EffectiveAddress target) -> string;
auto disassembleJSR(EffectiveAddress target) -> string;
auto disassembleLEA(AddressRegister ar, EffectiveAddress ea) -> string;
auto disassembleLINK(AddressRegister with) -> string;
template<uint Size> auto disassembleLSL(uint4 immediate, DataRegister dr) -> string;
template<uint Size> auto disassembleLSL(DataRegister sr, DataRegister dr) -> string;
auto disassembleLSL(EffectiveAddress ea) -> string;
template<uint Size> auto disassembleLSR(uint4 immediate, DataRegister dr) -> string;
template<uint Size> auto disassembleLSR(DataRegister shift, DataRegister dr) -> string;
auto disassembleLSR(EffectiveAddress ea) -> string;
template<uint Size> auto disassembleMOVE(EffectiveAddress to, EffectiveAddress from) -> string;
template<uint Size> auto disassembleMOVEA(AddressRegister ar, EffectiveAddress ea) -> string;
template<uint Size> auto disassembleMOVEM_TO_MEM(EffectiveAddress to) -> string;
template<uint Size> auto disassembleMOVEM_TO_REG(EffectiveAddress from) -> string;
template<uint Size> auto disassembleMOVEP(DataRegister from, EffectiveAddress to) -> string;
template<uint Size> auto disassembleMOVEP(EffectiveAddress from, DataRegister to) -> string;
auto disassembleMOVEQ(DataRegister dr, uint8 immediate) -> string;
auto disassembleMOVE_FROM_SR(EffectiveAddress ea) -> string;
auto disassembleMOVE_TO_CCR(EffectiveAddress ea) -> string;
auto disassembleMOVE_TO_SR(EffectiveAddress ea) -> string;
auto disassembleMOVE_FROM_USP(AddressRegister to) -> string;
auto disassembleMOVE_TO_USP(AddressRegister from) -> string;
auto disassembleMULS(DataRegister with, EffectiveAddress from) -> string;
auto disassembleMULU(DataRegister with, EffectiveAddress from) -> string;
auto disassembleNBCD(EffectiveAddress with) -> string;
template<uint Size> auto disassembleNEG(EffectiveAddress with) -> string;
template<uint Size> auto disassembleNEGX(EffectiveAddress with) -> string;
auto disassembleNOP() -> string;
template<uint Size> auto disassembleNOT(EffectiveAddress with) -> string;
template<uint Size> auto disassembleOR(EffectiveAddress from, DataRegister with) -> string;
template<uint Size> auto disassembleOR(DataRegister from, EffectiveAddress with) -> string;
template<uint Size> auto disassembleORI(EffectiveAddress with) -> string;
auto disassembleORI_TO_CCR() -> string;
auto disassembleORI_TO_SR() -> string;
auto disassemblePEA(EffectiveAddress from) -> string;
auto disassembleRESET() -> string;
template<uint Size> auto disassembleROL(uint4 shift, DataRegister modify) -> string;
template<uint Size> auto disassembleROL(DataRegister shift, DataRegister modify) -> string;
auto disassembleROL(EffectiveAddress modify) -> string;
template<uint Size> auto disassembleROR(uint4 shift, DataRegister modify) -> string;
template<uint Size> auto disassembleROR(DataRegister shift, DataRegister modify) -> string;
auto disassembleROR(EffectiveAddress modify) -> string;
template<uint Size> auto disassembleROXL(uint4 shift, DataRegister modify) -> string;
template<uint Size> auto disassembleROXL(DataRegister shift, DataRegister modify) -> string;
auto disassembleROXL(EffectiveAddress modify) -> string;
template<uint Size> auto disassembleROXR(uint4 shift, DataRegister modify) -> string;
template<uint Size> auto disassembleROXR(DataRegister shift, DataRegister modify) -> string;
auto disassembleROXR(EffectiveAddress modify) -> string;
auto disassembleRTE() -> string;
auto disassembleRTR() -> string;
auto disassembleRTS() -> string;
auto disassembleSBCD(EffectiveAddress with, EffectiveAddress from) -> string;
auto disassembleSCC(uint4 condition, EffectiveAddress to) -> string;
auto disassembleSTOP() -> string;
Update to v100r15 release. byuu wrote: Aforementioned scheduler changes added. Longer explanation of why here: http://hastebin.com/raw/toxedenece Again, we really need to test this as thoroughly as possible for regressions :/ This is a really major change that affects absolutely everything: all emulation cores, all coprocessors, etc. Also added ADDX and SUB to the 68K core, which brings us just barely above 50% of the instruction encoding space completed. [Editor's note: The "aformentioned scheduler changes" were described in a previous forum post: Unfortunately, 64-bits just wasn't enough precision (we were getting misalignments ~230 times a second on 21/24MHz clocks), so I had to move to 128-bit counters. This of course doesn't exist on 32-bit architectures (and probably not on all 64-bit ones either), so for now ... higan's only going to compile on 64-bit machines until we figure something out. Maybe we offer a "lower precision" fallback for machines that lack uint128_t or something. Using the booth algorithm would be way too slow. Anyway, the precision is now 2^-96, which is roughly 10^-29. That puts us far beyond the yoctosecond. Suck it, MAME :P I'm jokingly referring to it as the byuusecond. The other 32-bits of precision allows a 1Hz clock to run up to one full second before all clocks need to be normalized to prevent overflow. I fixed a serious wobbling issue where I was using clock > other.clock for synchronization instead of clock >= other.clock; and also another aliasing issue when two threads share a common frequency, but don't run in lock-step. The latter I don't even fully understand, but I did observe it in testing. nall/serialization.hpp has been extended to support 128-bit integers, but without explicitly naming them (yay generic code), so nall will still compile on 32-bit platforms for all other applications. Speed is basically a wash now. FC's a bit slower, SFC's a bit faster. The "longer explanation" in the linked hastebin is: Okay, so the idea is that we can have an arbitrary number of oscillators. Take the SNES: - CPU/PPU clock = 21477272.727272hz - SMP/DSP clock = 24576000hz - Cartridge DSP1 clock = 8000000hz - Cartridge MSU1 clock = 44100hz - Controller Port 1 modem controller clock = 57600hz - Controller Port 2 barcode battler clock = 115200hz - Expansion Port exercise bike clock = 192000hz Is this a pathological case? Of course it is, but it's possible. The first four do exist in the wild already: see Rockman X2 MSU1 patch. Manifest files with higan let you specify any frequency you want for any component. The old trick higan used was to hold an int64 counter for each thread:thread synchronization, and adjust it like so: - if thread A steps X clocks; then clock += X * threadB.frequency - if clock >= 0; switch to threadB - if thread B steps X clocks; then clock -= X * threadA.frequency - if clock < 0; switch to threadA But there are also system configurations where one processor has to synchronize with more than one other processor. Take the Genesis: - the 68K has to sync with the Z80 and PSG and YM2612 and VDP - the Z80 has to sync with the 68K and PSG and YM2612 - the PSG has to sync with the 68K and Z80 and YM2612 Now I could do this by having an int64 clock value for every association. But these clock values would have to be outside the individual Thread class objects, and we would have to update every relationship's clock value. So the 68K would have to update the Z80, PSG, YM2612 and VDP clocks. That's four expensive 64-bit multiply-adds per clock step event instead of one. As such, we have to account for both possibilities. The only way to do this is with a single time base. We do this like so: - setup: scalar = timeBase / frequency - step: clock += scalar * clocks Once per second, we look at every thread, find the smallest clock value. Then subtract that value from all threads. This prevents the clock counters from overflowing. Unfortunately, these oscillator values are psychotic, unpredictable, and often times repeating fractions. Even with a timeBase of 1,000,000,000,000,000,000 (one attosecond); we get rounding errors every ~16,300 synchronizations. Specifically, this happens with a CPU running at 21477273hz (rounded) and SMP running at 24576000hz. That may be good enough for most emulators, but ... you know how I am. Plus, even at the attosecond level, we're really pushing against the limits of 64-bit integers. Given the reciprocal inverse, a frequency of 1Hz (which does exist in higan!) would have a scalar that consumes 1/18th of the entire range of a uint64 on every single step. Yes, I could raise the frequency, and then step by that amount, I know. But I don't want to have weird gotchas like that in the scheduler core. Until I increase the accuracy to about 100 times greater than a yoctosecond, the rounding errors are too great. And since the only choice above 64-bit values is 128-bit values; we might as well use all the extra headroom. 2^-96 as a timebase gives me the ability to have both a 1Hz and 4GHz clock; and run them both for a full second; before an overflow event would occur. Another hastebin includes demonstration code: #include <libco/libco.h> #include <nall/nall.hpp> using namespace nall; // cothread_t mainThread = nullptr; const uint iterations = 100'000'000; const uint cpuFreq = 21477272.727272 + 0.5; const uint smpFreq = 24576000.000000 + 0.5; const uint cpuStep = 4; const uint smpStep = 5; // struct ThreadA { cothread_t handle = nullptr; uint64 frequency = 0; int64 clock = 0; auto create(auto (*entrypoint)() -> void, uint frequency) { this->handle = co_create(65536, entrypoint); this->frequency = frequency; this->clock = 0; } }; struct CPUA : ThreadA { static auto Enter() -> void; auto main() -> void; CPUA() { create(&CPUA::Enter, cpuFreq); } } cpuA; struct SMPA : ThreadA { static auto Enter() -> void; auto main() -> void; SMPA() { create(&SMPA::Enter, smpFreq); } } smpA; uint8 queueA[iterations]; uint offsetA; cothread_t resumeA = cpuA.handle; auto EnterA() -> void { offsetA = 0; co_switch(resumeA); } auto QueueA(uint value) -> void { queueA[offsetA++] = value; if(offsetA >= iterations) { resumeA = co_active(); co_switch(mainThread); } } auto CPUA::Enter() -> void { while(true) cpuA.main(); } auto CPUA::main() -> void { QueueA(1); smpA.clock -= cpuStep * smpA.frequency; if(smpA.clock < 0) co_switch(smpA.handle); } auto SMPA::Enter() -> void { while(true) smpA.main(); } auto SMPA::main() -> void { QueueA(2); smpA.clock += smpStep * cpuA.frequency; if(smpA.clock >= 0) co_switch(cpuA.handle); } // struct ThreadB { cothread_t handle = nullptr; uint128_t scalar = 0; uint128_t clock = 0; auto print128(uint128_t value) { string s; while(value) { s.append((char)('0' + value % 10)); value /= 10; } s.reverse(); print(s, "\n"); } //femtosecond (10^15) = 16306 //attosecond (10^18) = 688838 //zeptosecond (10^21) = 13712691 //yoctosecond (10^24) = 13712691 (hitting a dead-end on a rounding error causing a wobble) //byuusecond? ( 2^96) = (perfect? 79,228 times more precise than a yoctosecond) auto create(auto (*entrypoint)() -> void, uint128_t frequency) { this->handle = co_create(65536, entrypoint); uint128_t unitOfTime = 1; //for(uint n : range(29)) unitOfTime *= 10; unitOfTime <<= 96; //2^96 time units ... this->scalar = unitOfTime / frequency; print128(this->scalar); this->clock = 0; } auto step(uint128_t clocks) -> void { clock += clocks * scalar; } auto synchronize(ThreadB& thread) -> void { if(clock >= thread.clock) co_switch(thread.handle); } }; struct CPUB : ThreadB { static auto Enter() -> void; auto main() -> void; CPUB() { create(&CPUB::Enter, cpuFreq); } } cpuB; struct SMPB : ThreadB { static auto Enter() -> void; auto main() -> void; SMPB() { create(&SMPB::Enter, smpFreq); clock = 1; } } smpB; auto correct() -> void { auto minimum = min(cpuB.clock, smpB.clock); cpuB.clock -= minimum; smpB.clock -= minimum; } uint8 queueB[iterations]; uint offsetB; cothread_t resumeB = cpuB.handle; auto EnterB() -> void { correct(); offsetB = 0; co_switch(resumeB); } auto QueueB(uint value) -> void { queueB[offsetB++] = value; if(offsetB >= iterations) { resumeB = co_active(); co_switch(mainThread); } } auto CPUB::Enter() -> void { while(true) cpuB.main(); } auto CPUB::main() -> void { QueueB(1); step(cpuStep); synchronize(smpB); } auto SMPB::Enter() -> void { while(true) smpB.main(); } auto SMPB::main() -> void { QueueB(2); step(smpStep); synchronize(cpuB); } // #include <nall/main.hpp> auto nall::main(string_vector) -> void { mainThread = co_active(); uint masterCounter = 0; while(true) { print(masterCounter++, " ...\n"); auto A = clock(); EnterA(); auto B = clock(); print((double)(B - A) / CLOCKS_PER_SEC, "s\n"); auto C = clock(); EnterB(); auto D = clock(); print((double)(D - C) / CLOCKS_PER_SEC, "s\n"); for(uint n : range(iterations)) { if(queueA[n] != queueB[n]) return print("fail at ", n, "\n"); } } } ...and that's everything.]
2016-07-31 02:11:20 +00:00
template<uint Size> auto disassembleSUB(EffectiveAddress source, DataRegister target) -> string;
template<uint Size> auto disassembleSUB(DataRegister source, EffectiveAddress target) -> string;
template<uint Size> auto disassembleSUBA(AddressRegister to, EffectiveAddress from) -> string;
template<uint Size> auto disassembleSUBI(EffectiveAddress with) -> string;
template<uint Size> auto disassembleSUBQ(uint4 immediate, EffectiveAddress with) -> string;
template<uint Size> auto disassembleSUBQ(uint4 immediate, AddressRegister with) -> string;
template<uint Size> auto disassembleSUBX(EffectiveAddress with, EffectiveAddress from) -> string;
auto disassembleSWAP(DataRegister with) -> string;
auto disassembleTAS(EffectiveAddress with) -> string;
auto disassembleTRAP(uint4 vector) -> string;
auto disassembleTRAPV() -> string;
template<uint Size> auto disassembleTST(EffectiveAddress ea) -> string;
auto disassembleUNLK(AddressRegister with) -> string;
template<uint Size> auto _read(uint32 addr) -> uint32;
template<uint Size = Word> auto _readPC() -> uint32;
Update to v101r04 release. byuu says: Changelog: - pulled the (u)intN type aliases into higan instead of leaving them in nall - added 68K LINEA, LINEF hooks for illegal instructions - filled the rest of the 68K lambda table with generic instance of ILLEGAL - completed the 68K disassembler effective addressing modes - still unsure whether I should use An to decode absolute addresses or not - pro: way easier to read where accesses are taking place - con: requires An to be valid; so as a disassembler it does a poor job - making it optional: too much work; ick - added I/O decoding for the VDP command-port registers - added skeleton timing to all five processor cores - output at 1280x480 (needed for mixed 256/320 widths; and to handle interlace modes) The VDP, PSG, Z80, YM2612 are all stepping one clock at a time and syncing; which is the pathological worst case for libco. But they also have no logic inside of them. With all the above, I'm averaging around 250fps with just the 68K core actually functional, and the VDP doing a dumb "draw white pixels" loop. Still way too early to tell how this emulator is going to perform. Also, the 320x240 mode of the Genesis means that we don't need an aspect correction ratio. But we do need to ensure the output window is a multiple 320x240 so that the scale values work correctly. I was hard-coding aspect correction to stretch the window an additional \*8/7. But that won't work anymore so ... the main higan window is now 640x480, 960x720, or 1280x960. Toggling aspect correction only changes the video width inside the window. It's a bit jarring ... the window is a lot wider, more black space now for most modes. But for now, it is what it is.
2016-08-12 01:07:04 +00:00
auto _readDisplacement(uint32 base) -> uint32;
auto _readIndex(uint32 base) -> uint32;
auto _dataRegister(DataRegister dr) -> string;
auto _addressRegister(AddressRegister ar) -> string;
template<uint Size> auto _immediate() -> string;
template<uint Size> auto _address(EffectiveAddress& ea) -> string;
template<uint Size> auto _effectiveAddress(EffectiveAddress& ea) -> string;
auto _branch(uint8 displacement) -> string;
template<uint Size> auto _suffix() -> string;
auto _condition(uint4 condition) -> string;
Update to v100r06 release. byuu says: Up to ten 68K instructions out of somewhere between 61 and 88, depending upon which PDF you look at. Of course, some of them aren't 100% completed yet, either. Lots of craziness with MOVEM, and BCC has a BSR variant that needs stack push/pop functions. This WIP actually took over eight hours to make, going through every possible permutation on how to design the core itself. The updated design now builds both the instruction decoder+dispatcher and the disassembler decoder into the same main loop during M68K's constructor. The special cases are also really psychotic on this processor, and I'm afraid of missing something via the fallthrough cases. So instead, I'm ordering the instructions alphabetically, and including exclusion cases to ignore binding invalid cases. If I end up remapping an existing register, then it'll throw a run-time assertion at program startup. I wanted very much to get rid of struct EA (EffectiveAddress), but it's too difficult to keep track of the internal effective address without it. So I split out the size to a separate parameter, since every opcode only has one size parameter, and otherwise it was getting duplicated in opcodes that take two EAs, and was also awkward with the flag testing. It's a bit more typing, but I feel it's more clean this way. Overall, I'm really worried this is going to be too slow. I don't want to turn the EA stuff into templates, because that will massively bloat out compilation times and object sizes, and will also need a special DSL preprocessor since C++ doesn't have a static for loop. I can definitely optimize a lot of EA's address/read/write functions away once the core is completed, but it's never going to hold a candle to a templatized 68K core. ---- Forgot to include the SA-1 regression fix. I always remember immediately after I upload and archive the WIP. Will try to get that in next time, I guess.
2016-07-16 08:39:44 +00:00
uint32 _pc;
function<string ()> disassembleTable[65536];
Update to v100r02 release. byuu says: Sigh ... I'm really not a good person. I'm inherently selfish. My responsibility and obligation right now is to work on loki, and then on the Tengai Makyou Zero translation, and then on improving the Famicom emulation. And yet ... it's not what I really want to do. That shouldn't matter; I should work on my responsibilities first. Instead, I'm going to be a greedy, self-centered asshole, and work on what I really want to instead. I'm really sorry, guys. I'm sure this will make a few people happy, and probably upset even more people. I'm also making zero guarantees that this ever gets finished. As always, I wish I could keep these things secret, so if I fail / give up, I could just drop it with no shame. But I would have to cut everyone out of the WIP process completely to make it happen. So, here goes ... This WIP adds the initial skeleton for Sega Mega Drive / Genesis emulation. God help us. (minor note: apparently the new extension for Mega Drive games is .md, neat. That's what I chose for the folders too. I thought it was .smd, so that'll be fixed in icarus for the next WIP.) (aside: this is why I wanted to get v100 out. I didn't want this code in a skeleton state in v100's source. Nor did I want really broken emulation, which the first release is sure to be, tarring said release.) ... So, basically, I've been ruminating on the legacy I want to leave behind with higan. 3D systems are just plain out. I'm never going to support them. They're too complex for my abilities, and they would run too slowly with my design style. I'm not willing to compromise my design ideals. And I would never want to play a 3D game system at native 240p/480i resolution ... but 1080p+ upscaling is not accurate, so that's a conflict I want to avoid entirely. It's also never going to emulate computer systems (X68K, PC-98, FM-Towns, etc) because holy shit that would completely destroy me. It's also never going emulate arcade machines. So I think of higan as a collection of 2D emulators for consoles and handhelds. I've gone over every major 2D gaming system there is, looking for ones with games I actually care about and enjoy. And I basically have five of those systems supported already. Looking at the remaining list, I see only three systems left that I have any interest in whatsoever: PC-Engine, Master System, Mega Drive. Again, I'm not in any way committing to emulating any of these, but ... if I had all of those in higan, I think I'd be content to really, truly, finally stop writing more emulators for the rest of my life. And so I decided to tackle the most difficult system first. If I'm successful, the Z80 core should cover a lot of the work on the SMS. And the HuC6280 should land somewhere between the NES and SNES in terms of difficulty ... closer to the NES. The systems that just don't appeal to me at all, which I will never touch, include, but are not limited to: * Atari 2600/5200/7800 * Lynx * Jaguar * Vectrex * Colecovision * Commodore 64 * Neo-Geo * Neo-Geo Pocket / Color * Virtual Boy * Super A'can * 32X * CD-i * etc, etc, etc. And really, even if something were mildly interesting in there ... we have to stop. I can't scale infinitely. I'm already way past my limit, but I'm doing this anyway. Too many cores bloats everything and kills quality on everything. I don't want higan to become MESS v2. I don't know what I'll do about the Famicom Disk System, PC-Engine CD, and Mega CD. I don't think I'll be able to achieve 60fps emulating the Mega CD, even if I tried to. I don't know what's going to happen here with even the Mega Drive. Maybe I'll get driven crazy with the documentation and quit. Maybe it'll end up being too complicated and I'll quit. Maybe the emulation will end up way too slow and I'll give up. Maybe it'll take me seven years to get any games playable at all. Maybe Steve Snake, AamirM and Mike Pavone will pool money to hire a hitman to come after me. Who knows. But this is what I want to do, so ... here goes nothing.
2016-07-09 04:21:37 +00:00
};
}