bsnes/higan/processor/m68k/instructions.cpp

1155 lines
32 KiB
C++
Raw Normal View History

auto M68K::testCondition(uint4 condition) -> bool {
switch(condition) {
case 0: return true; //T
case 1: return false; //F
case 2: return !r.c && !r.z; //HI
case 3: return r.c || r.z; //LS
case 4: return !r.c; //CC,HS
case 5: return r.c; //CS,LO
case 6: return !r.z; //NE
case 7: return r.z; //EQ
case 8: return !r.v; //VC
case 9: return r.v; //VS
case 10: return !r.n; //PL
case 11: return r.n; //MI
case 12: return r.n == r.v; //GE
case 13: return r.n != r.v; //LT
case 14: return r.n == r.v && !r.z; //GT
case 15: return r.n != r.v || r.z; //LE
}
unreachable;
}
//
template<> auto M68K::bytes<Byte>() -> uint { return 1; }
template<> auto M68K::bytes<Word>() -> uint { return 2; }
template<> auto M68K::bytes<Long>() -> uint { return 4; }
template<> auto M68K::bits<Byte>() -> uint { return 8; }
template<> auto M68K::bits<Word>() -> uint { return 16; }
template<> auto M68K::bits<Long>() -> uint { return 32; }
template<uint Size> auto M68K::lsb() -> uint32 { return 1; }
template<> auto M68K::msb<Byte>() -> uint32 { return 0x80; }
template<> auto M68K::msb<Word>() -> uint32 { return 0x8000; }
template<> auto M68K::msb<Long>() -> uint32 { return 0x80000000; }
template<> auto M68K::mask<Byte>() -> uint32 { return 0xff; }
template<> auto M68K::mask<Word>() -> uint32 { return 0xffff; }
template<> auto M68K::mask<Long>() -> uint32 { return 0xffffffff; }
template<> auto M68K::clip<Byte>(uint32 data) -> uint32 { return data & 0xff; }
template<> auto M68K::clip<Word>(uint32 data) -> uint32 { return data & 0xffff; }
template<> auto M68K::clip<Long>(uint32 data) -> uint32 { return data & 0xffffffff; }
template<> auto M68K::sign<Byte>(uint32 data) -> int32 { return (int8)data; }
template<> auto M68K::sign<Word>(uint32 data) -> int32 { return (int16)data; }
template<> auto M68K::sign<Long>(uint32 data) -> int32 { return (int32)data; }
//
auto M68K::instructionABCD(EffectiveAddress with, EffectiveAddress from) -> void {
auto source = read<Byte>(from);
auto target = read<Byte, Hold>(with);
auto result = source + target + r.x;
bool v = false;
if(((target ^ source ^ result) & 0x10) || (result & 0x0f) >= 0x0a) {
auto previous = result;
result += 0x06;
v |= ((~previous & 0x80) & (result & 0x80));
}
if(result >= 0xa0) {
auto previous = result;
result += 0x60;
v |= ((~previous & 0x80) & (result & 0x80));
}
write<Byte>(with, result);
r.c = sign<Byte>(result >> 1) < 0;
r.v = v;
r.z = clip<Byte>(result) ? 0 : r.z;
r.n = sign<Byte>(result) < 0;
r.x = r.c;
}
Update to v100r15 release. byuu wrote: Aforementioned scheduler changes added. Longer explanation of why here: http://hastebin.com/raw/toxedenece Again, we really need to test this as thoroughly as possible for regressions :/ This is a really major change that affects absolutely everything: all emulation cores, all coprocessors, etc. Also added ADDX and SUB to the 68K core, which brings us just barely above 50% of the instruction encoding space completed. [Editor's note: The "aformentioned scheduler changes" were described in a previous forum post: Unfortunately, 64-bits just wasn't enough precision (we were getting misalignments ~230 times a second on 21/24MHz clocks), so I had to move to 128-bit counters. This of course doesn't exist on 32-bit architectures (and probably not on all 64-bit ones either), so for now ... higan's only going to compile on 64-bit machines until we figure something out. Maybe we offer a "lower precision" fallback for machines that lack uint128_t or something. Using the booth algorithm would be way too slow. Anyway, the precision is now 2^-96, which is roughly 10^-29. That puts us far beyond the yoctosecond. Suck it, MAME :P I'm jokingly referring to it as the byuusecond. The other 32-bits of precision allows a 1Hz clock to run up to one full second before all clocks need to be normalized to prevent overflow. I fixed a serious wobbling issue where I was using clock > other.clock for synchronization instead of clock >= other.clock; and also another aliasing issue when two threads share a common frequency, but don't run in lock-step. The latter I don't even fully understand, but I did observe it in testing. nall/serialization.hpp has been extended to support 128-bit integers, but without explicitly naming them (yay generic code), so nall will still compile on 32-bit platforms for all other applications. Speed is basically a wash now. FC's a bit slower, SFC's a bit faster. The "longer explanation" in the linked hastebin is: Okay, so the idea is that we can have an arbitrary number of oscillators. Take the SNES: - CPU/PPU clock = 21477272.727272hz - SMP/DSP clock = 24576000hz - Cartridge DSP1 clock = 8000000hz - Cartridge MSU1 clock = 44100hz - Controller Port 1 modem controller clock = 57600hz - Controller Port 2 barcode battler clock = 115200hz - Expansion Port exercise bike clock = 192000hz Is this a pathological case? Of course it is, but it's possible. The first four do exist in the wild already: see Rockman X2 MSU1 patch. Manifest files with higan let you specify any frequency you want for any component. The old trick higan used was to hold an int64 counter for each thread:thread synchronization, and adjust it like so: - if thread A steps X clocks; then clock += X * threadB.frequency - if clock >= 0; switch to threadB - if thread B steps X clocks; then clock -= X * threadA.frequency - if clock < 0; switch to threadA But there are also system configurations where one processor has to synchronize with more than one other processor. Take the Genesis: - the 68K has to sync with the Z80 and PSG and YM2612 and VDP - the Z80 has to sync with the 68K and PSG and YM2612 - the PSG has to sync with the 68K and Z80 and YM2612 Now I could do this by having an int64 clock value for every association. But these clock values would have to be outside the individual Thread class objects, and we would have to update every relationship's clock value. So the 68K would have to update the Z80, PSG, YM2612 and VDP clocks. That's four expensive 64-bit multiply-adds per clock step event instead of one. As such, we have to account for both possibilities. The only way to do this is with a single time base. We do this like so: - setup: scalar = timeBase / frequency - step: clock += scalar * clocks Once per second, we look at every thread, find the smallest clock value. Then subtract that value from all threads. This prevents the clock counters from overflowing. Unfortunately, these oscillator values are psychotic, unpredictable, and often times repeating fractions. Even with a timeBase of 1,000,000,000,000,000,000 (one attosecond); we get rounding errors every ~16,300 synchronizations. Specifically, this happens with a CPU running at 21477273hz (rounded) and SMP running at 24576000hz. That may be good enough for most emulators, but ... you know how I am. Plus, even at the attosecond level, we're really pushing against the limits of 64-bit integers. Given the reciprocal inverse, a frequency of 1Hz (which does exist in higan!) would have a scalar that consumes 1/18th of the entire range of a uint64 on every single step. Yes, I could raise the frequency, and then step by that amount, I know. But I don't want to have weird gotchas like that in the scheduler core. Until I increase the accuracy to about 100 times greater than a yoctosecond, the rounding errors are too great. And since the only choice above 64-bit values is 128-bit values; we might as well use all the extra headroom. 2^-96 as a timebase gives me the ability to have both a 1Hz and 4GHz clock; and run them both for a full second; before an overflow event would occur. Another hastebin includes demonstration code: #include <libco/libco.h> #include <nall/nall.hpp> using namespace nall; // cothread_t mainThread = nullptr; const uint iterations = 100'000'000; const uint cpuFreq = 21477272.727272 + 0.5; const uint smpFreq = 24576000.000000 + 0.5; const uint cpuStep = 4; const uint smpStep = 5; // struct ThreadA { cothread_t handle = nullptr; uint64 frequency = 0; int64 clock = 0; auto create(auto (*entrypoint)() -> void, uint frequency) { this->handle = co_create(65536, entrypoint); this->frequency = frequency; this->clock = 0; } }; struct CPUA : ThreadA { static auto Enter() -> void; auto main() -> void; CPUA() { create(&CPUA::Enter, cpuFreq); } } cpuA; struct SMPA : ThreadA { static auto Enter() -> void; auto main() -> void; SMPA() { create(&SMPA::Enter, smpFreq); } } smpA; uint8 queueA[iterations]; uint offsetA; cothread_t resumeA = cpuA.handle; auto EnterA() -> void { offsetA = 0; co_switch(resumeA); } auto QueueA(uint value) -> void { queueA[offsetA++] = value; if(offsetA >= iterations) { resumeA = co_active(); co_switch(mainThread); } } auto CPUA::Enter() -> void { while(true) cpuA.main(); } auto CPUA::main() -> void { QueueA(1); smpA.clock -= cpuStep * smpA.frequency; if(smpA.clock < 0) co_switch(smpA.handle); } auto SMPA::Enter() -> void { while(true) smpA.main(); } auto SMPA::main() -> void { QueueA(2); smpA.clock += smpStep * cpuA.frequency; if(smpA.clock >= 0) co_switch(cpuA.handle); } // struct ThreadB { cothread_t handle = nullptr; uint128_t scalar = 0; uint128_t clock = 0; auto print128(uint128_t value) { string s; while(value) { s.append((char)('0' + value % 10)); value /= 10; } s.reverse(); print(s, "\n"); } //femtosecond (10^15) = 16306 //attosecond (10^18) = 688838 //zeptosecond (10^21) = 13712691 //yoctosecond (10^24) = 13712691 (hitting a dead-end on a rounding error causing a wobble) //byuusecond? ( 2^96) = (perfect? 79,228 times more precise than a yoctosecond) auto create(auto (*entrypoint)() -> void, uint128_t frequency) { this->handle = co_create(65536, entrypoint); uint128_t unitOfTime = 1; //for(uint n : range(29)) unitOfTime *= 10; unitOfTime <<= 96; //2^96 time units ... this->scalar = unitOfTime / frequency; print128(this->scalar); this->clock = 0; } auto step(uint128_t clocks) -> void { clock += clocks * scalar; } auto synchronize(ThreadB& thread) -> void { if(clock >= thread.clock) co_switch(thread.handle); } }; struct CPUB : ThreadB { static auto Enter() -> void; auto main() -> void; CPUB() { create(&CPUB::Enter, cpuFreq); } } cpuB; struct SMPB : ThreadB { static auto Enter() -> void; auto main() -> void; SMPB() { create(&SMPB::Enter, smpFreq); clock = 1; } } smpB; auto correct() -> void { auto minimum = min(cpuB.clock, smpB.clock); cpuB.clock -= minimum; smpB.clock -= minimum; } uint8 queueB[iterations]; uint offsetB; cothread_t resumeB = cpuB.handle; auto EnterB() -> void { correct(); offsetB = 0; co_switch(resumeB); } auto QueueB(uint value) -> void { queueB[offsetB++] = value; if(offsetB >= iterations) { resumeB = co_active(); co_switch(mainThread); } } auto CPUB::Enter() -> void { while(true) cpuB.main(); } auto CPUB::main() -> void { QueueB(1); step(cpuStep); synchronize(smpB); } auto SMPB::Enter() -> void { while(true) smpB.main(); } auto SMPB::main() -> void { QueueB(2); step(smpStep); synchronize(cpuB); } // #include <nall/main.hpp> auto nall::main(string_vector) -> void { mainThread = co_active(); uint masterCounter = 0; while(true) { print(masterCounter++, " ...\n"); auto A = clock(); EnterA(); auto B = clock(); print((double)(B - A) / CLOCKS_PER_SEC, "s\n"); auto C = clock(); EnterB(); auto D = clock(); print((double)(D - C) / CLOCKS_PER_SEC, "s\n"); for(uint n : range(iterations)) { if(queueA[n] != queueB[n]) return print("fail at ", n, "\n"); } } } ...and that's everything.]
2016-07-31 02:11:20 +00:00
template<uint Size, bool Extend> auto M68K::ADD(uint32 source, uint32 target) -> uint32 {
auto result = (uint64)source + target;
Update to v100r15 release. byuu wrote: Aforementioned scheduler changes added. Longer explanation of why here: http://hastebin.com/raw/toxedenece Again, we really need to test this as thoroughly as possible for regressions :/ This is a really major change that affects absolutely everything: all emulation cores, all coprocessors, etc. Also added ADDX and SUB to the 68K core, which brings us just barely above 50% of the instruction encoding space completed. [Editor's note: The "aformentioned scheduler changes" were described in a previous forum post: Unfortunately, 64-bits just wasn't enough precision (we were getting misalignments ~230 times a second on 21/24MHz clocks), so I had to move to 128-bit counters. This of course doesn't exist on 32-bit architectures (and probably not on all 64-bit ones either), so for now ... higan's only going to compile on 64-bit machines until we figure something out. Maybe we offer a "lower precision" fallback for machines that lack uint128_t or something. Using the booth algorithm would be way too slow. Anyway, the precision is now 2^-96, which is roughly 10^-29. That puts us far beyond the yoctosecond. Suck it, MAME :P I'm jokingly referring to it as the byuusecond. The other 32-bits of precision allows a 1Hz clock to run up to one full second before all clocks need to be normalized to prevent overflow. I fixed a serious wobbling issue where I was using clock > other.clock for synchronization instead of clock >= other.clock; and also another aliasing issue when two threads share a common frequency, but don't run in lock-step. The latter I don't even fully understand, but I did observe it in testing. nall/serialization.hpp has been extended to support 128-bit integers, but without explicitly naming them (yay generic code), so nall will still compile on 32-bit platforms for all other applications. Speed is basically a wash now. FC's a bit slower, SFC's a bit faster. The "longer explanation" in the linked hastebin is: Okay, so the idea is that we can have an arbitrary number of oscillators. Take the SNES: - CPU/PPU clock = 21477272.727272hz - SMP/DSP clock = 24576000hz - Cartridge DSP1 clock = 8000000hz - Cartridge MSU1 clock = 44100hz - Controller Port 1 modem controller clock = 57600hz - Controller Port 2 barcode battler clock = 115200hz - Expansion Port exercise bike clock = 192000hz Is this a pathological case? Of course it is, but it's possible. The first four do exist in the wild already: see Rockman X2 MSU1 patch. Manifest files with higan let you specify any frequency you want for any component. The old trick higan used was to hold an int64 counter for each thread:thread synchronization, and adjust it like so: - if thread A steps X clocks; then clock += X * threadB.frequency - if clock >= 0; switch to threadB - if thread B steps X clocks; then clock -= X * threadA.frequency - if clock < 0; switch to threadA But there are also system configurations where one processor has to synchronize with more than one other processor. Take the Genesis: - the 68K has to sync with the Z80 and PSG and YM2612 and VDP - the Z80 has to sync with the 68K and PSG and YM2612 - the PSG has to sync with the 68K and Z80 and YM2612 Now I could do this by having an int64 clock value for every association. But these clock values would have to be outside the individual Thread class objects, and we would have to update every relationship's clock value. So the 68K would have to update the Z80, PSG, YM2612 and VDP clocks. That's four expensive 64-bit multiply-adds per clock step event instead of one. As such, we have to account for both possibilities. The only way to do this is with a single time base. We do this like so: - setup: scalar = timeBase / frequency - step: clock += scalar * clocks Once per second, we look at every thread, find the smallest clock value. Then subtract that value from all threads. This prevents the clock counters from overflowing. Unfortunately, these oscillator values are psychotic, unpredictable, and often times repeating fractions. Even with a timeBase of 1,000,000,000,000,000,000 (one attosecond); we get rounding errors every ~16,300 synchronizations. Specifically, this happens with a CPU running at 21477273hz (rounded) and SMP running at 24576000hz. That may be good enough for most emulators, but ... you know how I am. Plus, even at the attosecond level, we're really pushing against the limits of 64-bit integers. Given the reciprocal inverse, a frequency of 1Hz (which does exist in higan!) would have a scalar that consumes 1/18th of the entire range of a uint64 on every single step. Yes, I could raise the frequency, and then step by that amount, I know. But I don't want to have weird gotchas like that in the scheduler core. Until I increase the accuracy to about 100 times greater than a yoctosecond, the rounding errors are too great. And since the only choice above 64-bit values is 128-bit values; we might as well use all the extra headroom. 2^-96 as a timebase gives me the ability to have both a 1Hz and 4GHz clock; and run them both for a full second; before an overflow event would occur. Another hastebin includes demonstration code: #include <libco/libco.h> #include <nall/nall.hpp> using namespace nall; // cothread_t mainThread = nullptr; const uint iterations = 100'000'000; const uint cpuFreq = 21477272.727272 + 0.5; const uint smpFreq = 24576000.000000 + 0.5; const uint cpuStep = 4; const uint smpStep = 5; // struct ThreadA { cothread_t handle = nullptr; uint64 frequency = 0; int64 clock = 0; auto create(auto (*entrypoint)() -> void, uint frequency) { this->handle = co_create(65536, entrypoint); this->frequency = frequency; this->clock = 0; } }; struct CPUA : ThreadA { static auto Enter() -> void; auto main() -> void; CPUA() { create(&CPUA::Enter, cpuFreq); } } cpuA; struct SMPA : ThreadA { static auto Enter() -> void; auto main() -> void; SMPA() { create(&SMPA::Enter, smpFreq); } } smpA; uint8 queueA[iterations]; uint offsetA; cothread_t resumeA = cpuA.handle; auto EnterA() -> void { offsetA = 0; co_switch(resumeA); } auto QueueA(uint value) -> void { queueA[offsetA++] = value; if(offsetA >= iterations) { resumeA = co_active(); co_switch(mainThread); } } auto CPUA::Enter() -> void { while(true) cpuA.main(); } auto CPUA::main() -> void { QueueA(1); smpA.clock -= cpuStep * smpA.frequency; if(smpA.clock < 0) co_switch(smpA.handle); } auto SMPA::Enter() -> void { while(true) smpA.main(); } auto SMPA::main() -> void { QueueA(2); smpA.clock += smpStep * cpuA.frequency; if(smpA.clock >= 0) co_switch(cpuA.handle); } // struct ThreadB { cothread_t handle = nullptr; uint128_t scalar = 0; uint128_t clock = 0; auto print128(uint128_t value) { string s; while(value) { s.append((char)('0' + value % 10)); value /= 10; } s.reverse(); print(s, "\n"); } //femtosecond (10^15) = 16306 //attosecond (10^18) = 688838 //zeptosecond (10^21) = 13712691 //yoctosecond (10^24) = 13712691 (hitting a dead-end on a rounding error causing a wobble) //byuusecond? ( 2^96) = (perfect? 79,228 times more precise than a yoctosecond) auto create(auto (*entrypoint)() -> void, uint128_t frequency) { this->handle = co_create(65536, entrypoint); uint128_t unitOfTime = 1; //for(uint n : range(29)) unitOfTime *= 10; unitOfTime <<= 96; //2^96 time units ... this->scalar = unitOfTime / frequency; print128(this->scalar); this->clock = 0; } auto step(uint128_t clocks) -> void { clock += clocks * scalar; } auto synchronize(ThreadB& thread) -> void { if(clock >= thread.clock) co_switch(thread.handle); } }; struct CPUB : ThreadB { static auto Enter() -> void; auto main() -> void; CPUB() { create(&CPUB::Enter, cpuFreq); } } cpuB; struct SMPB : ThreadB { static auto Enter() -> void; auto main() -> void; SMPB() { create(&SMPB::Enter, smpFreq); clock = 1; } } smpB; auto correct() -> void { auto minimum = min(cpuB.clock, smpB.clock); cpuB.clock -= minimum; smpB.clock -= minimum; } uint8 queueB[iterations]; uint offsetB; cothread_t resumeB = cpuB.handle; auto EnterB() -> void { correct(); offsetB = 0; co_switch(resumeB); } auto QueueB(uint value) -> void { queueB[offsetB++] = value; if(offsetB >= iterations) { resumeB = co_active(); co_switch(mainThread); } } auto CPUB::Enter() -> void { while(true) cpuB.main(); } auto CPUB::main() -> void { QueueB(1); step(cpuStep); synchronize(smpB); } auto SMPB::Enter() -> void { while(true) smpB.main(); } auto SMPB::main() -> void { QueueB(2); step(smpStep); synchronize(cpuB); } // #include <nall/main.hpp> auto nall::main(string_vector) -> void { mainThread = co_active(); uint masterCounter = 0; while(true) { print(masterCounter++, " ...\n"); auto A = clock(); EnterA(); auto B = clock(); print((double)(B - A) / CLOCKS_PER_SEC, "s\n"); auto C = clock(); EnterB(); auto D = clock(); print((double)(D - C) / CLOCKS_PER_SEC, "s\n"); for(uint n : range(iterations)) { if(queueA[n] != queueB[n]) return print("fail at ", n, "\n"); } } } ...and that's everything.]
2016-07-31 02:11:20 +00:00
if(Extend) result += r.x;
r.c = sign<Size>(result >> 1) < 0;
r.v = sign<Size>(~(target ^ source) & (target ^ result)) < 0;
r.z = clip<Size>(result) ? 0 : (Extend ? r.z : 1);
r.n = sign<Size>(result) < 0;
r.x = r.c;
return clip<Size>(result);
}
template<uint Size> auto M68K::instructionADD(EffectiveAddress from, DataRegister with) -> void {
auto source = read<Size>(from);
auto target = read<Size>(with);
auto result = ADD<Size>(source, target);
write<Size>(with, result);
}
template<uint Size> auto M68K::instructionADD(DataRegister from, EffectiveAddress with) -> void {
auto source = read<Size>(from);
auto target = read<Size, Hold>(with);
auto result = ADD<Size>(source, target);
write<Size>(with, result);
}
template<uint Size> auto M68K::instructionADDA(AddressRegister ar, EffectiveAddress ea) -> void {
auto source = sign<Size>(read<Size>(ea));
Update to v101r12 release. byuu says: Changelog: - new md/bus/ module for bus reads/writes - abstracts byte/word accesses wherever possible (everything but RAM; forces all but I/O to word, I/O to byte) - holds the system RAM since that's technically not part of the CPU anyway - added md/controller and md/system/peripherals - added emulation of gamepads - added stub PSG audio output (silent) to cap the framerate at 60fps with audio sync enabled - fixed VSRAM reads for plane vertical scrolling (two bugs here: add instead of sub; interlave plane A/B) - mask nametable read offsets (can't exceed 8192-byte nametables apparently) - emulated VRAM/VSRAM/CRAM reads from VDP data port - fixed sprite width/height size calculations - added partial emulation of 40-tile per scanline limitation (enough to fix Sonic's title screen) - fixed off-by-one sprite range testing - fixed sprite tile indexing - Vblank happens at Y=224 with overscan disabled - unsure what happens when you toggle it between Y=224 and Y=240 ... probably bad things - fixed reading of address register for ADDA, CMPA, SUBA - fixed sign extension for MOVEA effect address reads - updated MOVEM to increment the read addresses (but not writeback) for (aN) mode With all of that out of the way, we finally have Sonic the Hedgehog (fully?) playable. I played to stage 1-2 and through the special stage, at least. EDIT: yeah, we probably need HIRQs for Labyrinth Zone. Not much else works, of course. Most games hang waiting on the Z80, and those that don't (like Altered Beast) are still royally screwed. Tons of features still missing; including all of the Z80/PSG/YM2612. A note on the perihperals this time around: the Mega Drive EXT port is basically identical to the regular controller ports. So unlike with the Famicom and Super Famicom, I'm inheriting the exension port from the controller class.
2016-08-21 22:11:24 +00:00
auto target = read<Long>(ar);
write<Long>(ar, source + target);
}
template<uint Size> auto M68K::instructionADDI(EffectiveAddress modify) -> void {
auto source = readPC<Size>();
auto target = read<Size, Hold>(modify);
auto result = ADD<Size>(source, target);
write<Size>(modify, result);
}
template<uint Size> auto M68K::instructionADDQ(uint4 immediate, EffectiveAddress with) -> void {
auto source = immediate;
auto target = read<Size, Hold>(with);
auto result = ADD<Size>(source, target);
write<Size>(with, result);
}
//Size is ignored: always uses Long
template<uint Size> auto M68K::instructionADDQ(uint4 immediate, AddressRegister with) -> void {
auto result = read<Long>(with) + immediate;
write<Long>(with, result);
}
template<uint Size> auto M68K::instructionADDX(EffectiveAddress with, EffectiveAddress from) -> void {
auto source = read<Size>(from);
auto target = read<Size, Hold>(with);
Update to v100r15 release. byuu wrote: Aforementioned scheduler changes added. Longer explanation of why here: http://hastebin.com/raw/toxedenece Again, we really need to test this as thoroughly as possible for regressions :/ This is a really major change that affects absolutely everything: all emulation cores, all coprocessors, etc. Also added ADDX and SUB to the 68K core, which brings us just barely above 50% of the instruction encoding space completed. [Editor's note: The "aformentioned scheduler changes" were described in a previous forum post: Unfortunately, 64-bits just wasn't enough precision (we were getting misalignments ~230 times a second on 21/24MHz clocks), so I had to move to 128-bit counters. This of course doesn't exist on 32-bit architectures (and probably not on all 64-bit ones either), so for now ... higan's only going to compile on 64-bit machines until we figure something out. Maybe we offer a "lower precision" fallback for machines that lack uint128_t or something. Using the booth algorithm would be way too slow. Anyway, the precision is now 2^-96, which is roughly 10^-29. That puts us far beyond the yoctosecond. Suck it, MAME :P I'm jokingly referring to it as the byuusecond. The other 32-bits of precision allows a 1Hz clock to run up to one full second before all clocks need to be normalized to prevent overflow. I fixed a serious wobbling issue where I was using clock > other.clock for synchronization instead of clock >= other.clock; and also another aliasing issue when two threads share a common frequency, but don't run in lock-step. The latter I don't even fully understand, but I did observe it in testing. nall/serialization.hpp has been extended to support 128-bit integers, but without explicitly naming them (yay generic code), so nall will still compile on 32-bit platforms for all other applications. Speed is basically a wash now. FC's a bit slower, SFC's a bit faster. The "longer explanation" in the linked hastebin is: Okay, so the idea is that we can have an arbitrary number of oscillators. Take the SNES: - CPU/PPU clock = 21477272.727272hz - SMP/DSP clock = 24576000hz - Cartridge DSP1 clock = 8000000hz - Cartridge MSU1 clock = 44100hz - Controller Port 1 modem controller clock = 57600hz - Controller Port 2 barcode battler clock = 115200hz - Expansion Port exercise bike clock = 192000hz Is this a pathological case? Of course it is, but it's possible. The first four do exist in the wild already: see Rockman X2 MSU1 patch. Manifest files with higan let you specify any frequency you want for any component. The old trick higan used was to hold an int64 counter for each thread:thread synchronization, and adjust it like so: - if thread A steps X clocks; then clock += X * threadB.frequency - if clock >= 0; switch to threadB - if thread B steps X clocks; then clock -= X * threadA.frequency - if clock < 0; switch to threadA But there are also system configurations where one processor has to synchronize with more than one other processor. Take the Genesis: - the 68K has to sync with the Z80 and PSG and YM2612 and VDP - the Z80 has to sync with the 68K and PSG and YM2612 - the PSG has to sync with the 68K and Z80 and YM2612 Now I could do this by having an int64 clock value for every association. But these clock values would have to be outside the individual Thread class objects, and we would have to update every relationship's clock value. So the 68K would have to update the Z80, PSG, YM2612 and VDP clocks. That's four expensive 64-bit multiply-adds per clock step event instead of one. As such, we have to account for both possibilities. The only way to do this is with a single time base. We do this like so: - setup: scalar = timeBase / frequency - step: clock += scalar * clocks Once per second, we look at every thread, find the smallest clock value. Then subtract that value from all threads. This prevents the clock counters from overflowing. Unfortunately, these oscillator values are psychotic, unpredictable, and often times repeating fractions. Even with a timeBase of 1,000,000,000,000,000,000 (one attosecond); we get rounding errors every ~16,300 synchronizations. Specifically, this happens with a CPU running at 21477273hz (rounded) and SMP running at 24576000hz. That may be good enough for most emulators, but ... you know how I am. Plus, even at the attosecond level, we're really pushing against the limits of 64-bit integers. Given the reciprocal inverse, a frequency of 1Hz (which does exist in higan!) would have a scalar that consumes 1/18th of the entire range of a uint64 on every single step. Yes, I could raise the frequency, and then step by that amount, I know. But I don't want to have weird gotchas like that in the scheduler core. Until I increase the accuracy to about 100 times greater than a yoctosecond, the rounding errors are too great. And since the only choice above 64-bit values is 128-bit values; we might as well use all the extra headroom. 2^-96 as a timebase gives me the ability to have both a 1Hz and 4GHz clock; and run them both for a full second; before an overflow event would occur. Another hastebin includes demonstration code: #include <libco/libco.h> #include <nall/nall.hpp> using namespace nall; // cothread_t mainThread = nullptr; const uint iterations = 100'000'000; const uint cpuFreq = 21477272.727272 + 0.5; const uint smpFreq = 24576000.000000 + 0.5; const uint cpuStep = 4; const uint smpStep = 5; // struct ThreadA { cothread_t handle = nullptr; uint64 frequency = 0; int64 clock = 0; auto create(auto (*entrypoint)() -> void, uint frequency) { this->handle = co_create(65536, entrypoint); this->frequency = frequency; this->clock = 0; } }; struct CPUA : ThreadA { static auto Enter() -> void; auto main() -> void; CPUA() { create(&CPUA::Enter, cpuFreq); } } cpuA; struct SMPA : ThreadA { static auto Enter() -> void; auto main() -> void; SMPA() { create(&SMPA::Enter, smpFreq); } } smpA; uint8 queueA[iterations]; uint offsetA; cothread_t resumeA = cpuA.handle; auto EnterA() -> void { offsetA = 0; co_switch(resumeA); } auto QueueA(uint value) -> void { queueA[offsetA++] = value; if(offsetA >= iterations) { resumeA = co_active(); co_switch(mainThread); } } auto CPUA::Enter() -> void { while(true) cpuA.main(); } auto CPUA::main() -> void { QueueA(1); smpA.clock -= cpuStep * smpA.frequency; if(smpA.clock < 0) co_switch(smpA.handle); } auto SMPA::Enter() -> void { while(true) smpA.main(); } auto SMPA::main() -> void { QueueA(2); smpA.clock += smpStep * cpuA.frequency; if(smpA.clock >= 0) co_switch(cpuA.handle); } // struct ThreadB { cothread_t handle = nullptr; uint128_t scalar = 0; uint128_t clock = 0; auto print128(uint128_t value) { string s; while(value) { s.append((char)('0' + value % 10)); value /= 10; } s.reverse(); print(s, "\n"); } //femtosecond (10^15) = 16306 //attosecond (10^18) = 688838 //zeptosecond (10^21) = 13712691 //yoctosecond (10^24) = 13712691 (hitting a dead-end on a rounding error causing a wobble) //byuusecond? ( 2^96) = (perfect? 79,228 times more precise than a yoctosecond) auto create(auto (*entrypoint)() -> void, uint128_t frequency) { this->handle = co_create(65536, entrypoint); uint128_t unitOfTime = 1; //for(uint n : range(29)) unitOfTime *= 10; unitOfTime <<= 96; //2^96 time units ... this->scalar = unitOfTime / frequency; print128(this->scalar); this->clock = 0; } auto step(uint128_t clocks) -> void { clock += clocks * scalar; } auto synchronize(ThreadB& thread) -> void { if(clock >= thread.clock) co_switch(thread.handle); } }; struct CPUB : ThreadB { static auto Enter() -> void; auto main() -> void; CPUB() { create(&CPUB::Enter, cpuFreq); } } cpuB; struct SMPB : ThreadB { static auto Enter() -> void; auto main() -> void; SMPB() { create(&SMPB::Enter, smpFreq); clock = 1; } } smpB; auto correct() -> void { auto minimum = min(cpuB.clock, smpB.clock); cpuB.clock -= minimum; smpB.clock -= minimum; } uint8 queueB[iterations]; uint offsetB; cothread_t resumeB = cpuB.handle; auto EnterB() -> void { correct(); offsetB = 0; co_switch(resumeB); } auto QueueB(uint value) -> void { queueB[offsetB++] = value; if(offsetB >= iterations) { resumeB = co_active(); co_switch(mainThread); } } auto CPUB::Enter() -> void { while(true) cpuB.main(); } auto CPUB::main() -> void { QueueB(1); step(cpuStep); synchronize(smpB); } auto SMPB::Enter() -> void { while(true) smpB.main(); } auto SMPB::main() -> void { QueueB(2); step(smpStep); synchronize(cpuB); } // #include <nall/main.hpp> auto nall::main(string_vector) -> void { mainThread = co_active(); uint masterCounter = 0; while(true) { print(masterCounter++, " ...\n"); auto A = clock(); EnterA(); auto B = clock(); print((double)(B - A) / CLOCKS_PER_SEC, "s\n"); auto C = clock(); EnterB(); auto D = clock(); print((double)(D - C) / CLOCKS_PER_SEC, "s\n"); for(uint n : range(iterations)) { if(queueA[n] != queueB[n]) return print("fail at ", n, "\n"); } } } ...and that's everything.]
2016-07-31 02:11:20 +00:00
auto result = ADD<Size, Extend>(source, target);
write<Size>(with, result);
Update to v100r15 release. byuu wrote: Aforementioned scheduler changes added. Longer explanation of why here: http://hastebin.com/raw/toxedenece Again, we really need to test this as thoroughly as possible for regressions :/ This is a really major change that affects absolutely everything: all emulation cores, all coprocessors, etc. Also added ADDX and SUB to the 68K core, which brings us just barely above 50% of the instruction encoding space completed. [Editor's note: The "aformentioned scheduler changes" were described in a previous forum post: Unfortunately, 64-bits just wasn't enough precision (we were getting misalignments ~230 times a second on 21/24MHz clocks), so I had to move to 128-bit counters. This of course doesn't exist on 32-bit architectures (and probably not on all 64-bit ones either), so for now ... higan's only going to compile on 64-bit machines until we figure something out. Maybe we offer a "lower precision" fallback for machines that lack uint128_t or something. Using the booth algorithm would be way too slow. Anyway, the precision is now 2^-96, which is roughly 10^-29. That puts us far beyond the yoctosecond. Suck it, MAME :P I'm jokingly referring to it as the byuusecond. The other 32-bits of precision allows a 1Hz clock to run up to one full second before all clocks need to be normalized to prevent overflow. I fixed a serious wobbling issue where I was using clock > other.clock for synchronization instead of clock >= other.clock; and also another aliasing issue when two threads share a common frequency, but don't run in lock-step. The latter I don't even fully understand, but I did observe it in testing. nall/serialization.hpp has been extended to support 128-bit integers, but without explicitly naming them (yay generic code), so nall will still compile on 32-bit platforms for all other applications. Speed is basically a wash now. FC's a bit slower, SFC's a bit faster. The "longer explanation" in the linked hastebin is: Okay, so the idea is that we can have an arbitrary number of oscillators. Take the SNES: - CPU/PPU clock = 21477272.727272hz - SMP/DSP clock = 24576000hz - Cartridge DSP1 clock = 8000000hz - Cartridge MSU1 clock = 44100hz - Controller Port 1 modem controller clock = 57600hz - Controller Port 2 barcode battler clock = 115200hz - Expansion Port exercise bike clock = 192000hz Is this a pathological case? Of course it is, but it's possible. The first four do exist in the wild already: see Rockman X2 MSU1 patch. Manifest files with higan let you specify any frequency you want for any component. The old trick higan used was to hold an int64 counter for each thread:thread synchronization, and adjust it like so: - if thread A steps X clocks; then clock += X * threadB.frequency - if clock >= 0; switch to threadB - if thread B steps X clocks; then clock -= X * threadA.frequency - if clock < 0; switch to threadA But there are also system configurations where one processor has to synchronize with more than one other processor. Take the Genesis: - the 68K has to sync with the Z80 and PSG and YM2612 and VDP - the Z80 has to sync with the 68K and PSG and YM2612 - the PSG has to sync with the 68K and Z80 and YM2612 Now I could do this by having an int64 clock value for every association. But these clock values would have to be outside the individual Thread class objects, and we would have to update every relationship's clock value. So the 68K would have to update the Z80, PSG, YM2612 and VDP clocks. That's four expensive 64-bit multiply-adds per clock step event instead of one. As such, we have to account for both possibilities. The only way to do this is with a single time base. We do this like so: - setup: scalar = timeBase / frequency - step: clock += scalar * clocks Once per second, we look at every thread, find the smallest clock value. Then subtract that value from all threads. This prevents the clock counters from overflowing. Unfortunately, these oscillator values are psychotic, unpredictable, and often times repeating fractions. Even with a timeBase of 1,000,000,000,000,000,000 (one attosecond); we get rounding errors every ~16,300 synchronizations. Specifically, this happens with a CPU running at 21477273hz (rounded) and SMP running at 24576000hz. That may be good enough for most emulators, but ... you know how I am. Plus, even at the attosecond level, we're really pushing against the limits of 64-bit integers. Given the reciprocal inverse, a frequency of 1Hz (which does exist in higan!) would have a scalar that consumes 1/18th of the entire range of a uint64 on every single step. Yes, I could raise the frequency, and then step by that amount, I know. But I don't want to have weird gotchas like that in the scheduler core. Until I increase the accuracy to about 100 times greater than a yoctosecond, the rounding errors are too great. And since the only choice above 64-bit values is 128-bit values; we might as well use all the extra headroom. 2^-96 as a timebase gives me the ability to have both a 1Hz and 4GHz clock; and run them both for a full second; before an overflow event would occur. Another hastebin includes demonstration code: #include <libco/libco.h> #include <nall/nall.hpp> using namespace nall; // cothread_t mainThread = nullptr; const uint iterations = 100'000'000; const uint cpuFreq = 21477272.727272 + 0.5; const uint smpFreq = 24576000.000000 + 0.5; const uint cpuStep = 4; const uint smpStep = 5; // struct ThreadA { cothread_t handle = nullptr; uint64 frequency = 0; int64 clock = 0; auto create(auto (*entrypoint)() -> void, uint frequency) { this->handle = co_create(65536, entrypoint); this->frequency = frequency; this->clock = 0; } }; struct CPUA : ThreadA { static auto Enter() -> void; auto main() -> void; CPUA() { create(&CPUA::Enter, cpuFreq); } } cpuA; struct SMPA : ThreadA { static auto Enter() -> void; auto main() -> void; SMPA() { create(&SMPA::Enter, smpFreq); } } smpA; uint8 queueA[iterations]; uint offsetA; cothread_t resumeA = cpuA.handle; auto EnterA() -> void { offsetA = 0; co_switch(resumeA); } auto QueueA(uint value) -> void { queueA[offsetA++] = value; if(offsetA >= iterations) { resumeA = co_active(); co_switch(mainThread); } } auto CPUA::Enter() -> void { while(true) cpuA.main(); } auto CPUA::main() -> void { QueueA(1); smpA.clock -= cpuStep * smpA.frequency; if(smpA.clock < 0) co_switch(smpA.handle); } auto SMPA::Enter() -> void { while(true) smpA.main(); } auto SMPA::main() -> void { QueueA(2); smpA.clock += smpStep * cpuA.frequency; if(smpA.clock >= 0) co_switch(cpuA.handle); } // struct ThreadB { cothread_t handle = nullptr; uint128_t scalar = 0; uint128_t clock = 0; auto print128(uint128_t value) { string s; while(value) { s.append((char)('0' + value % 10)); value /= 10; } s.reverse(); print(s, "\n"); } //femtosecond (10^15) = 16306 //attosecond (10^18) = 688838 //zeptosecond (10^21) = 13712691 //yoctosecond (10^24) = 13712691 (hitting a dead-end on a rounding error causing a wobble) //byuusecond? ( 2^96) = (perfect? 79,228 times more precise than a yoctosecond) auto create(auto (*entrypoint)() -> void, uint128_t frequency) { this->handle = co_create(65536, entrypoint); uint128_t unitOfTime = 1; //for(uint n : range(29)) unitOfTime *= 10; unitOfTime <<= 96; //2^96 time units ... this->scalar = unitOfTime / frequency; print128(this->scalar); this->clock = 0; } auto step(uint128_t clocks) -> void { clock += clocks * scalar; } auto synchronize(ThreadB& thread) -> void { if(clock >= thread.clock) co_switch(thread.handle); } }; struct CPUB : ThreadB { static auto Enter() -> void; auto main() -> void; CPUB() { create(&CPUB::Enter, cpuFreq); } } cpuB; struct SMPB : ThreadB { static auto Enter() -> void; auto main() -> void; SMPB() { create(&SMPB::Enter, smpFreq); clock = 1; } } smpB; auto correct() -> void { auto minimum = min(cpuB.clock, smpB.clock); cpuB.clock -= minimum; smpB.clock -= minimum; } uint8 queueB[iterations]; uint offsetB; cothread_t resumeB = cpuB.handle; auto EnterB() -> void { correct(); offsetB = 0; co_switch(resumeB); } auto QueueB(uint value) -> void { queueB[offsetB++] = value; if(offsetB >= iterations) { resumeB = co_active(); co_switch(mainThread); } } auto CPUB::Enter() -> void { while(true) cpuB.main(); } auto CPUB::main() -> void { QueueB(1); step(cpuStep); synchronize(smpB); } auto SMPB::Enter() -> void { while(true) smpB.main(); } auto SMPB::main() -> void { QueueB(2); step(smpStep); synchronize(cpuB); } // #include <nall/main.hpp> auto nall::main(string_vector) -> void { mainThread = co_active(); uint masterCounter = 0; while(true) { print(masterCounter++, " ...\n"); auto A = clock(); EnterA(); auto B = clock(); print((double)(B - A) / CLOCKS_PER_SEC, "s\n"); auto C = clock(); EnterB(); auto D = clock(); print((double)(D - C) / CLOCKS_PER_SEC, "s\n"); for(uint n : range(iterations)) { if(queueA[n] != queueB[n]) return print("fail at ", n, "\n"); } } } ...and that's everything.]
2016-07-31 02:11:20 +00:00
}
template<uint Size> auto M68K::AND(uint32 source, uint32 target) -> uint32 {
uint32 result = target & source;
r.c = 0;
r.v = 0;
r.z = clip<Size>(result) == 0;
r.n = sign<Size>(result) < 0;
return clip<Size>(result);
}
template<uint Size> auto M68K::instructionAND(EffectiveAddress from, DataRegister with) -> void {
auto source = read<Size>(from);
auto target = read<Size>(with);
auto result = AND<Size>(source, target);
write<Size>(with, result);
}
template<uint Size> auto M68K::instructionAND(DataRegister from, EffectiveAddress with) -> void {
auto source = read<Size>(from);
auto target = read<Size, Hold>(with);
auto result = AND<Size>(source, target);
write<Size>(with, result);
}
template<uint Size> auto M68K::instructionANDI(EffectiveAddress with) -> void {
auto source = readPC<Size>();
auto target = read<Size, Hold>(with);
auto result = AND<Size>(source, target);
write<Size>(with, result);
Update to v100r06 release. byuu says: Up to ten 68K instructions out of somewhere between 61 and 88, depending upon which PDF you look at. Of course, some of them aren't 100% completed yet, either. Lots of craziness with MOVEM, and BCC has a BSR variant that needs stack push/pop functions. This WIP actually took over eight hours to make, going through every possible permutation on how to design the core itself. The updated design now builds both the instruction decoder+dispatcher and the disassembler decoder into the same main loop during M68K's constructor. The special cases are also really psychotic on this processor, and I'm afraid of missing something via the fallthrough cases. So instead, I'm ordering the instructions alphabetically, and including exclusion cases to ignore binding invalid cases. If I end up remapping an existing register, then it'll throw a run-time assertion at program startup. I wanted very much to get rid of struct EA (EffectiveAddress), but it's too difficult to keep track of the internal effective address without it. So I split out the size to a separate parameter, since every opcode only has one size parameter, and otherwise it was getting duplicated in opcodes that take two EAs, and was also awkward with the flag testing. It's a bit more typing, but I feel it's more clean this way. Overall, I'm really worried this is going to be too slow. I don't want to turn the EA stuff into templates, because that will massively bloat out compilation times and object sizes, and will also need a special DSL preprocessor since C++ doesn't have a static for loop. I can definitely optimize a lot of EA's address/read/write functions away once the core is completed, but it's never going to hold a candle to a templatized 68K core. ---- Forgot to include the SA-1 regression fix. I always remember immediately after I upload and archive the WIP. Will try to get that in next time, I guess.
2016-07-16 08:39:44 +00:00
}
auto M68K::instructionANDI_TO_CCR() -> void {
auto data = readPC<Word>();
writeCCR(readCCR() & data);
}
auto M68K::instructionANDI_TO_SR() -> void {
if(!supervisor()) return;
auto data = readPC<Word>();
writeSR(readSR() & data);
}
template<uint Size> auto M68K::ASL(uint32 result, uint shift) -> uint32 {
bool carry = false;
uint32 overflow = 0;
for(auto _ : range(shift)) {
carry = result & msb<Size>();
uint32 before = result;
result <<= 1;
overflow |= before ^ result;
}
r.c = carry;
r.v = sign<Size>(overflow) < 0;
r.z = clip<Size>(result) == 0;
r.n = sign<Size>(result) < 0;
if(shift) r.x = r.c;
return clip<Size>(result);
}
template<uint Size> auto M68K::instructionASL(uint4 shift, DataRegister modify) -> void {
auto result = ASL<Size>(read<Size>(modify), shift);
write<Size>(modify, result);
}
template<uint Size> auto M68K::instructionASL(DataRegister shift, DataRegister modify) -> void {
auto count = read<Long>(shift) & 63;
auto result = ASL<Size>(read<Size>(modify), count);
write<Size>(modify, result);
}
auto M68K::instructionASL(EffectiveAddress modify) -> void {
auto result = ASL<Word>(read<Word, Hold>(modify), 1);
write<Word>(modify, result);
}
template<uint Size> auto M68K::ASR(uint32 result, uint shift) -> uint32 {
bool carry = false;
uint32 overflow = 0;
for(auto _ : range(shift)) {
carry = result & lsb<Size>();
uint32 before = result;
result = sign<Size>(result) >> 1;
overflow |= before ^ result;
}
r.c = carry;
r.v = sign<Size>(overflow) < 0;
r.z = clip<Size>(result) == 0;
r.n = sign<Size>(result) < 0;
if(shift) r.x = r.c;
return clip<Size>(result);
}
template<uint Size> auto M68K::instructionASR(uint4 shift, DataRegister modify) -> void {
auto result = ASR<Size>(read<Size>(modify), shift);
write<Size>(modify, result);
}
template<uint Size> auto M68K::instructionASR(DataRegister shift, DataRegister modify) -> void {
auto count = read<Long>(shift) & 63;
auto result = ASR<Size>(read<Size>(modify), count);
write<Size>(modify, result);
}
auto M68K::instructionASR(EffectiveAddress modify) -> void {
auto result = ASR<Word>(read<Word, Hold>(modify), 1);
write<Word>(modify, result);
}
auto M68K::instructionBCC(uint4 condition, uint8 displacement) -> void {
auto extension = readPC<Word>();
if(displacement) r.pc -= 2;
if(condition >= 2 && !testCondition(condition)) return;
if(condition == 1) push<Long>(r.pc);
r.pc += displacement ? (int8_t)displacement : (int16_t)extension - 2;
Update to v100r06 release. byuu says: Up to ten 68K instructions out of somewhere between 61 and 88, depending upon which PDF you look at. Of course, some of them aren't 100% completed yet, either. Lots of craziness with MOVEM, and BCC has a BSR variant that needs stack push/pop functions. This WIP actually took over eight hours to make, going through every possible permutation on how to design the core itself. The updated design now builds both the instruction decoder+dispatcher and the disassembler decoder into the same main loop during M68K's constructor. The special cases are also really psychotic on this processor, and I'm afraid of missing something via the fallthrough cases. So instead, I'm ordering the instructions alphabetically, and including exclusion cases to ignore binding invalid cases. If I end up remapping an existing register, then it'll throw a run-time assertion at program startup. I wanted very much to get rid of struct EA (EffectiveAddress), but it's too difficult to keep track of the internal effective address without it. So I split out the size to a separate parameter, since every opcode only has one size parameter, and otherwise it was getting duplicated in opcodes that take two EAs, and was also awkward with the flag testing. It's a bit more typing, but I feel it's more clean this way. Overall, I'm really worried this is going to be too slow. I don't want to turn the EA stuff into templates, because that will massively bloat out compilation times and object sizes, and will also need a special DSL preprocessor since C++ doesn't have a static for loop. I can definitely optimize a lot of EA's address/read/write functions away once the core is completed, but it's never going to hold a candle to a templatized 68K core. ---- Forgot to include the SA-1 regression fix. I always remember immediately after I upload and archive the WIP. Will try to get that in next time, I guess.
2016-07-16 08:39:44 +00:00
}
template<uint Size> auto M68K::instructionBCHG(DataRegister bit, EffectiveAddress with) -> void {
auto index = read<Size>(bit) & bits<Size>() - 1;
auto test = read<Size, Hold>(with);
r.z = test.bit(index) == 0;
test.bit(index) ^= 1;
write<Size>(with, test);
}
template<uint Size> auto M68K::instructionBCHG(EffectiveAddress with) -> void {
auto index = readPC<Word>() & bits<Size>() - 1;
auto test = read<Size, Hold>(with);
r.z = test.bit(index) == 0;
test.bit(index) ^= 1;
write<Size>(with, test);
}
template<uint Size> auto M68K::instructionBCLR(DataRegister bit, EffectiveAddress with) -> void {
auto index = read<Size>(bit) & bits<Size>() - 1;
auto test = read<Size, Hold>(with);
r.z = test.bit(index) == 0;
test.bit(index) = 0;
write<Size>(with, test);
}
template<uint Size> auto M68K::instructionBCLR(EffectiveAddress with) -> void {
auto index = readPC<Word>() & bits<Size>() - 1;
auto test = read<Size, Hold>(with);
r.z = test.bit(index) == 0;
test.bit(index) = 0;
write<Size>(with, test);
}
template<uint Size> auto M68K::instructionBSET(DataRegister bit, EffectiveAddress with) -> void {
auto index = read<Size>(bit) & bits<Size>() - 1;
auto test = read<Size, Hold>(with);
r.z = test.bit(index) == 0;
test.bit(index) = 1;
write<Size>(with, test);
}
template<uint Size> auto M68K::instructionBSET(EffectiveAddress with) -> void {
auto index = readPC<Word>() & bits<Size>() - 1;
auto test = read<Size, Hold>(with);
r.z = test.bit(index) == 0;
test.bit(index) = 1;
write<Size>(with, test);
}
template<uint Size> auto M68K::instructionBTST(DataRegister bit, EffectiveAddress with) -> void {
auto index = read<Size>(bit) & bits<Size>() - 1;
auto test = read<Size>(with);
r.z = test.bit(index) == 0;
}
template<uint Size> auto M68K::instructionBTST(EffectiveAddress with) -> void {
auto index = readPC<Word>() & bits<Size>() - 1;
auto test = read<Size>(with);
r.z = test.bit(index) == 0;
}
auto M68K::instructionCHK(DataRegister compare, EffectiveAddress maximum) -> void {
auto source = read<Word>(maximum);
auto target = read<Word>(compare);
r.z = clip<Word>(target) == 0;
r.n = sign<Word>(target) < 0;
if(r.n) return exception(Exception::BoundsCheck, Vector::BoundsCheck);
auto result = (uint64)target - source;
r.c = sign<Word>(result >> 1) < 0;
r.v = sign<Word>((target ^ source) & (target ^ result)) < 0;
r.z = clip<Word>(result) == 0;
r.n = sign<Word>(result) < 0;
if(r.n == r.v && !r.z) return exception(Exception::BoundsCheck, Vector::BoundsCheck);
}
template<uint Size> auto M68K::instructionCLR(EffectiveAddress ea) -> void {
read<Size, Hold>(ea);
write<Size>(ea, 0);
r.c = 0;
r.v = 0;
r.z = 1;
r.n = 0;
}
template<uint Size> auto M68K::CMP(uint32 source, uint32 target) -> uint32 {
auto result = (uint64)target - source;
r.c = sign<Size>(result >> 1) < 0;
r.v = sign<Size>((target ^ source) & (target ^ result)) < 0;
r.z = clip<Size>(result) == 0;
r.n = sign<Size>(result) < 0;
return clip<Size>(result);
}
template<uint Size> auto M68K::instructionCMP(DataRegister dr, EffectiveAddress ea) -> void {
auto source = read<Size>(ea);
auto target = read<Size>(dr);
CMP<Size>(source, target);
}
template<uint Size> auto M68K::instructionCMPA(AddressRegister ar, EffectiveAddress ea) -> void {
auto source = sign<Size>(read<Size>(ea));
Update to v101r12 release. byuu says: Changelog: - new md/bus/ module for bus reads/writes - abstracts byte/word accesses wherever possible (everything but RAM; forces all but I/O to word, I/O to byte) - holds the system RAM since that's technically not part of the CPU anyway - added md/controller and md/system/peripherals - added emulation of gamepads - added stub PSG audio output (silent) to cap the framerate at 60fps with audio sync enabled - fixed VSRAM reads for plane vertical scrolling (two bugs here: add instead of sub; interlave plane A/B) - mask nametable read offsets (can't exceed 8192-byte nametables apparently) - emulated VRAM/VSRAM/CRAM reads from VDP data port - fixed sprite width/height size calculations - added partial emulation of 40-tile per scanline limitation (enough to fix Sonic's title screen) - fixed off-by-one sprite range testing - fixed sprite tile indexing - Vblank happens at Y=224 with overscan disabled - unsure what happens when you toggle it between Y=224 and Y=240 ... probably bad things - fixed reading of address register for ADDA, CMPA, SUBA - fixed sign extension for MOVEA effect address reads - updated MOVEM to increment the read addresses (but not writeback) for (aN) mode With all of that out of the way, we finally have Sonic the Hedgehog (fully?) playable. I played to stage 1-2 and through the special stage, at least. EDIT: yeah, we probably need HIRQs for Labyrinth Zone. Not much else works, of course. Most games hang waiting on the Z80, and those that don't (like Altered Beast) are still royally screwed. Tons of features still missing; including all of the Z80/PSG/YM2612. A note on the perihperals this time around: the Mega Drive EXT port is basically identical to the regular controller ports. So unlike with the Famicom and Super Famicom, I'm inheriting the exension port from the controller class.
2016-08-21 22:11:24 +00:00
auto target = read<Long>(ar);
CMP<Long>(source, target);
}
template<uint Size> auto M68K::instructionCMPI(EffectiveAddress ea) -> void {
auto source = readPC<Size>();
auto target = read<Size>(ea);
CMP<Size>(source, target);
}
template<uint Size> auto M68K::instructionCMPM(EffectiveAddress ax, EffectiveAddress ay) -> void {
auto source = read<Size>(ay);
auto target = read<Size>(ax);
CMP<Size>(source, target);
}
auto M68K::instructionDBCC(uint4 condition, DataRegister dr) -> void {
auto displacement = readPC<Word>();
if(!testCondition(condition)) {
uint16 result = read<Word>(dr);
write<Word>(dr, result - 1);
if(result) r.pc -= 2, r.pc += sign<Word>(displacement);
}
}
template<bool Sign> auto M68K::DIV(uint16 divisor, DataRegister with) -> void {
auto dividend = read<Long>(with);
bool negativeQuotient = false;
bool negativeRemainder = false;
bool overflow = false;
if(divisor == 0) return exception(Exception::DivisionByZero, Vector::DivisionByZero);
if(Sign) {
negativeQuotient = (dividend >> 31) ^ (divisor >> 15);
if(dividend >> 31) dividend = -dividend, negativeRemainder = true;
if(divisor >> 15) divisor = -divisor;
}
auto result = dividend;
for(auto _ : range(16)) {
bool lb = false;
if(result >= (uint32)divisor << 15) result -= divisor << 15, lb = true;
bool ob = result >> 31;
result = result << 1 | lb;
if(ob) overflow = true;
}
if(Sign) {
if((uint16)result > 0x7fff + negativeQuotient) overflow = true;
}
if(result >> 16 >= divisor) overflow = true;
if(Sign && !overflow) {
if(negativeQuotient) result = ((-result) & 0xffff) | (result & 0xffff0000);
if(negativeRemainder) result = (((-(result >> 16)) << 16) & 0xffff0000) | (result & 0xffff);
}
if(!overflow) write<Long>(with, result);
r.c = 0;
r.v = overflow;
r.z = clip<Word>(result) == 0;
r.n = sign<Word>(result) < 0;
}
auto M68K::instructionDIVS(DataRegister with, EffectiveAddress from) -> void {
auto divisor = read<Word>(from);
DIV<1>(divisor, with);
}
auto M68K::instructionDIVU(DataRegister with, EffectiveAddress from) -> void {
auto divisor = read<Word>(from);
DIV<0>(divisor, with);
}
template<uint Size> auto M68K::EOR(uint32 source, uint32 target) -> uint32 {
uint32 result = target ^ source;
r.c = 0;
r.v = 0;
r.z = clip<Size>(result) == 0;
r.n = sign<Size>(result) < 0;
return clip<Size>(result);
}
template<uint Size> auto M68K::instructionEOR(DataRegister from, EffectiveAddress with) -> void {
auto source = read<Size>(from);
auto target = read<Size, Hold>(with);
auto result = EOR<Size>(source, target);
write<Size>(with, result);
}
template<uint Size> auto M68K::instructionEORI(EffectiveAddress with) -> void {
auto source = readPC<Size>();
auto target = read<Size, Hold>(with);
auto result = EOR<Size>(source, target);
write<Size>(with, result);
}
auto M68K::instructionEORI_TO_CCR() -> void {
auto data = readPC<Word>();
writeCCR(readCCR() ^ data);
}
auto M68K::instructionEORI_TO_SR() -> void {
if(!supervisor()) return;
auto data = readPC<Word>();
writeSR(readSR() ^ data);
}
auto M68K::instructionEXG(DataRegister x, DataRegister y) -> void {
auto z = read<Long>(x);
write<Long>(x, read<Long>(y));
write<Long>(y, z);
}
auto M68K::instructionEXG(AddressRegister x, AddressRegister y) -> void {
auto z = read<Long>(x);
write<Long>(x, read<Long>(y));
write<Long>(y, z);
}
auto M68K::instructionEXG(DataRegister x, AddressRegister y) -> void {
auto z = read<Long>(x);
write<Long>(x, read<Long>(y));
write<Long>(y, z);
}
template<> auto M68K::instructionEXT<Word>(DataRegister with) -> void {
auto result = (int8)read<Byte>(with);
write<Word>(with, result);
r.c = 0;
r.v = 0;
r.z = clip<Word>(result) == 0;
r.n = sign<Word>(result) < 0;
}
template<> auto M68K::instructionEXT<Long>(DataRegister with) -> void {
auto result = (int16)read<Word>(with);
write<Long>(with, result);
r.c = 0;
r.v = 0;
r.z = clip<Long>(result) == 0;
r.n = sign<Long>(result) < 0;
}
auto M68K::instructionILLEGAL() -> void {
r.pc -= 2;
Update to v101r04 release. byuu says: Changelog: - pulled the (u)intN type aliases into higan instead of leaving them in nall - added 68K LINEA, LINEF hooks for illegal instructions - filled the rest of the 68K lambda table with generic instance of ILLEGAL - completed the 68K disassembler effective addressing modes - still unsure whether I should use An to decode absolute addresses or not - pro: way easier to read where accesses are taking place - con: requires An to be valid; so as a disassembler it does a poor job - making it optional: too much work; ick - added I/O decoding for the VDP command-port registers - added skeleton timing to all five processor cores - output at 1280x480 (needed for mixed 256/320 widths; and to handle interlace modes) The VDP, PSG, Z80, YM2612 are all stepping one clock at a time and syncing; which is the pathological worst case for libco. But they also have no logic inside of them. With all the above, I'm averaging around 250fps with just the 68K core actually functional, and the VDP doing a dumb "draw white pixels" loop. Still way too early to tell how this emulator is going to perform. Also, the 320x240 mode of the Genesis means that we don't need an aspect correction ratio. But we do need to ensure the output window is a multiple 320x240 so that the scale values work correctly. I was hard-coding aspect correction to stretch the window an additional \*8/7. But that won't work anymore so ... the main higan window is now 640x480, 960x720, or 1280x960. Toggling aspect correction only changes the video width inside the window. It's a bit jarring ... the window is a lot wider, more black space now for most modes. But for now, it is what it is.
2016-08-12 01:07:04 +00:00
if(opcode >> 12 == 0xa) return exception(Exception::Illegal, Vector::IllegalLineA);
if(opcode >> 12 == 0xf) return exception(Exception::Illegal, Vector::IllegalLineF);
return exception(Exception::Illegal, Vector::Illegal);
}
auto M68K::instructionJMP(EffectiveAddress target) -> void {
r.pc = fetch<Long>(target);
}
auto M68K::instructionJSR(EffectiveAddress target) -> void {
auto pc = fetch<Long>(target);
push<Long>(r.pc);
r.pc = pc;
}
auto M68K::instructionLEA(AddressRegister ar, EffectiveAddress ea) -> void {
write<Long>(ar, fetch<Long>(ea));
Update to v100r06 release. byuu says: Up to ten 68K instructions out of somewhere between 61 and 88, depending upon which PDF you look at. Of course, some of them aren't 100% completed yet, either. Lots of craziness with MOVEM, and BCC has a BSR variant that needs stack push/pop functions. This WIP actually took over eight hours to make, going through every possible permutation on how to design the core itself. The updated design now builds both the instruction decoder+dispatcher and the disassembler decoder into the same main loop during M68K's constructor. The special cases are also really psychotic on this processor, and I'm afraid of missing something via the fallthrough cases. So instead, I'm ordering the instructions alphabetically, and including exclusion cases to ignore binding invalid cases. If I end up remapping an existing register, then it'll throw a run-time assertion at program startup. I wanted very much to get rid of struct EA (EffectiveAddress), but it's too difficult to keep track of the internal effective address without it. So I split out the size to a separate parameter, since every opcode only has one size parameter, and otherwise it was getting duplicated in opcodes that take two EAs, and was also awkward with the flag testing. It's a bit more typing, but I feel it's more clean this way. Overall, I'm really worried this is going to be too slow. I don't want to turn the EA stuff into templates, because that will massively bloat out compilation times and object sizes, and will also need a special DSL preprocessor since C++ doesn't have a static for loop. I can definitely optimize a lot of EA's address/read/write functions away once the core is completed, but it's never going to hold a candle to a templatized 68K core. ---- Forgot to include the SA-1 regression fix. I always remember immediately after I upload and archive the WIP. Will try to get that in next time, I guess.
2016-07-16 08:39:44 +00:00
}
auto M68K::instructionLINK(AddressRegister with) -> void {
auto displacement = (int16)readPC<Word>();
auto sp = AddressRegister{7};
push<Long>(read<Long>(with));
write<Long>(with, read<Long>(sp));
write<Long>(sp, read<Long>(sp) + displacement);
}
template<uint Size> auto M68K::LSL(uint32 result, uint shift) -> uint32 {
bool carry = false;
for(auto _ : range(shift)) {
carry = result & msb<Size>();
result <<= 1;
}
r.c = carry;
r.v = 0;
r.z = clip<Size>(result) == 0;
r.n = sign<Size>(result) < 0;
if(shift) r.x = r.c;
return clip<Size>(result);
}
template<uint Size> auto M68K::instructionLSL(uint4 immediate, DataRegister dr) -> void {
auto result = LSL<Size>(read<Size>(dr), immediate);
write<Size>(dr, result);
}
template<uint Size> auto M68K::instructionLSL(DataRegister sr, DataRegister dr) -> void {
auto shift = read<Long>(sr) & 63;
auto result = LSL<Size>(read<Size>(dr), shift);
write<Size>(dr, result);
}
auto M68K::instructionLSL(EffectiveAddress ea) -> void {
auto result = LSL<Word>(read<Word, Hold>(ea), 1);
write<Word>(ea, result);
}
template<uint Size> auto M68K::LSR(uint32 result, uint shift) -> uint32 {
bool carry = false;
for(auto _ : range(shift)) {
carry = result & lsb<Size>();
result >>= 1;
}
r.c = carry;
r.v = 0;
r.z = clip<Size>(result) == 0;
r.n = sign<Size>(result) < 0;
if(shift) r.x = r.c;
return clip<Size>(result);
}
template<uint Size> auto M68K::instructionLSR(uint4 immediate, DataRegister dr) -> void {
auto result = LSR<Size>(read<Size>(dr), immediate);
write<Size>(dr, result);
}
template<uint Size> auto M68K::instructionLSR(DataRegister shift, DataRegister dr) -> void {
auto count = read<Long>(shift) & 63;
auto result = LSR<Size>(read<Size>(dr), count);
write<Size>(dr, result);
}
auto M68K::instructionLSR(EffectiveAddress ea) -> void {
auto result = LSR<Word>(read<Word, Hold>(ea), 1);
write<Word>(ea, result);
}
template<uint Size> auto M68K::instructionMOVE(EffectiveAddress to, EffectiveAddress from) -> void {
auto data = read<Size>(from);
write<Size>(to, data);
Update to v100r06 release. byuu says: Up to ten 68K instructions out of somewhere between 61 and 88, depending upon which PDF you look at. Of course, some of them aren't 100% completed yet, either. Lots of craziness with MOVEM, and BCC has a BSR variant that needs stack push/pop functions. This WIP actually took over eight hours to make, going through every possible permutation on how to design the core itself. The updated design now builds both the instruction decoder+dispatcher and the disassembler decoder into the same main loop during M68K's constructor. The special cases are also really psychotic on this processor, and I'm afraid of missing something via the fallthrough cases. So instead, I'm ordering the instructions alphabetically, and including exclusion cases to ignore binding invalid cases. If I end up remapping an existing register, then it'll throw a run-time assertion at program startup. I wanted very much to get rid of struct EA (EffectiveAddress), but it's too difficult to keep track of the internal effective address without it. So I split out the size to a separate parameter, since every opcode only has one size parameter, and otherwise it was getting duplicated in opcodes that take two EAs, and was also awkward with the flag testing. It's a bit more typing, but I feel it's more clean this way. Overall, I'm really worried this is going to be too slow. I don't want to turn the EA stuff into templates, because that will massively bloat out compilation times and object sizes, and will also need a special DSL preprocessor since C++ doesn't have a static for loop. I can definitely optimize a lot of EA's address/read/write functions away once the core is completed, but it's never going to hold a candle to a templatized 68K core. ---- Forgot to include the SA-1 regression fix. I always remember immediately after I upload and archive the WIP. Will try to get that in next time, I guess.
2016-07-16 08:39:44 +00:00
r.c = 0;
r.v = 0;
r.z = clip<Size>(data) == 0;
r.n = sign<Size>(data) < 0;
Update to v100r06 release. byuu says: Up to ten 68K instructions out of somewhere between 61 and 88, depending upon which PDF you look at. Of course, some of them aren't 100% completed yet, either. Lots of craziness with MOVEM, and BCC has a BSR variant that needs stack push/pop functions. This WIP actually took over eight hours to make, going through every possible permutation on how to design the core itself. The updated design now builds both the instruction decoder+dispatcher and the disassembler decoder into the same main loop during M68K's constructor. The special cases are also really psychotic on this processor, and I'm afraid of missing something via the fallthrough cases. So instead, I'm ordering the instructions alphabetically, and including exclusion cases to ignore binding invalid cases. If I end up remapping an existing register, then it'll throw a run-time assertion at program startup. I wanted very much to get rid of struct EA (EffectiveAddress), but it's too difficult to keep track of the internal effective address without it. So I split out the size to a separate parameter, since every opcode only has one size parameter, and otherwise it was getting duplicated in opcodes that take two EAs, and was also awkward with the flag testing. It's a bit more typing, but I feel it's more clean this way. Overall, I'm really worried this is going to be too slow. I don't want to turn the EA stuff into templates, because that will massively bloat out compilation times and object sizes, and will also need a special DSL preprocessor since C++ doesn't have a static for loop. I can definitely optimize a lot of EA's address/read/write functions away once the core is completed, but it's never going to hold a candle to a templatized 68K core. ---- Forgot to include the SA-1 regression fix. I always remember immediately after I upload and archive the WIP. Will try to get that in next time, I guess.
2016-07-16 08:39:44 +00:00
}
template<uint Size> auto M68K::instructionMOVEA(AddressRegister ar, EffectiveAddress ea) -> void {
Update to v101r12 release. byuu says: Changelog: - new md/bus/ module for bus reads/writes - abstracts byte/word accesses wherever possible (everything but RAM; forces all but I/O to word, I/O to byte) - holds the system RAM since that's technically not part of the CPU anyway - added md/controller and md/system/peripherals - added emulation of gamepads - added stub PSG audio output (silent) to cap the framerate at 60fps with audio sync enabled - fixed VSRAM reads for plane vertical scrolling (two bugs here: add instead of sub; interlave plane A/B) - mask nametable read offsets (can't exceed 8192-byte nametables apparently) - emulated VRAM/VSRAM/CRAM reads from VDP data port - fixed sprite width/height size calculations - added partial emulation of 40-tile per scanline limitation (enough to fix Sonic's title screen) - fixed off-by-one sprite range testing - fixed sprite tile indexing - Vblank happens at Y=224 with overscan disabled - unsure what happens when you toggle it between Y=224 and Y=240 ... probably bad things - fixed reading of address register for ADDA, CMPA, SUBA - fixed sign extension for MOVEA effect address reads - updated MOVEM to increment the read addresses (but not writeback) for (aN) mode With all of that out of the way, we finally have Sonic the Hedgehog (fully?) playable. I played to stage 1-2 and through the special stage, at least. EDIT: yeah, we probably need HIRQs for Labyrinth Zone. Not much else works, of course. Most games hang waiting on the Z80, and those that don't (like Altered Beast) are still royally screwed. Tons of features still missing; including all of the Z80/PSG/YM2612. A note on the perihperals this time around: the Mega Drive EXT port is basically identical to the regular controller ports. So unlike with the Famicom and Super Famicom, I'm inheriting the exension port from the controller class.
2016-08-21 22:11:24 +00:00
auto data = sign<Size>(read<Size>(ea));
write<Long>(ar, data);
Update to v100r06 release. byuu says: Up to ten 68K instructions out of somewhere between 61 and 88, depending upon which PDF you look at. Of course, some of them aren't 100% completed yet, either. Lots of craziness with MOVEM, and BCC has a BSR variant that needs stack push/pop functions. This WIP actually took over eight hours to make, going through every possible permutation on how to design the core itself. The updated design now builds both the instruction decoder+dispatcher and the disassembler decoder into the same main loop during M68K's constructor. The special cases are also really psychotic on this processor, and I'm afraid of missing something via the fallthrough cases. So instead, I'm ordering the instructions alphabetically, and including exclusion cases to ignore binding invalid cases. If I end up remapping an existing register, then it'll throw a run-time assertion at program startup. I wanted very much to get rid of struct EA (EffectiveAddress), but it's too difficult to keep track of the internal effective address without it. So I split out the size to a separate parameter, since every opcode only has one size parameter, and otherwise it was getting duplicated in opcodes that take two EAs, and was also awkward with the flag testing. It's a bit more typing, but I feel it's more clean this way. Overall, I'm really worried this is going to be too slow. I don't want to turn the EA stuff into templates, because that will massively bloat out compilation times and object sizes, and will also need a special DSL preprocessor since C++ doesn't have a static for loop. I can definitely optimize a lot of EA's address/read/write functions away once the core is completed, but it's never going to hold a candle to a templatized 68K core. ---- Forgot to include the SA-1 regression fix. I always remember immediately after I upload and archive the WIP. Will try to get that in next time, I guess.
2016-07-16 08:39:44 +00:00
}
template<uint Size> auto M68K::instructionMOVEM_TO_MEM(EffectiveAddress to) -> void {
auto list = readPC<Word>();
auto addr = fetch<Long>(to);
Update to v100r06 release. byuu says: Up to ten 68K instructions out of somewhere between 61 and 88, depending upon which PDF you look at. Of course, some of them aren't 100% completed yet, either. Lots of craziness with MOVEM, and BCC has a BSR variant that needs stack push/pop functions. This WIP actually took over eight hours to make, going through every possible permutation on how to design the core itself. The updated design now builds both the instruction decoder+dispatcher and the disassembler decoder into the same main loop during M68K's constructor. The special cases are also really psychotic on this processor, and I'm afraid of missing something via the fallthrough cases. So instead, I'm ordering the instructions alphabetically, and including exclusion cases to ignore binding invalid cases. If I end up remapping an existing register, then it'll throw a run-time assertion at program startup. I wanted very much to get rid of struct EA (EffectiveAddress), but it's too difficult to keep track of the internal effective address without it. So I split out the size to a separate parameter, since every opcode only has one size parameter, and otherwise it was getting duplicated in opcodes that take two EAs, and was also awkward with the flag testing. It's a bit more typing, but I feel it's more clean this way. Overall, I'm really worried this is going to be too slow. I don't want to turn the EA stuff into templates, because that will massively bloat out compilation times and object sizes, and will also need a special DSL preprocessor since C++ doesn't have a static for loop. I can definitely optimize a lot of EA's address/read/write functions away once the core is completed, but it's never going to hold a candle to a templatized 68K core. ---- Forgot to include the SA-1 regression fix. I always remember immediately after I upload and archive the WIP. Will try to get that in next time, I guess.
2016-07-16 08:39:44 +00:00
for(uint n : range(16)) {
if(!list.bit(n)) continue;
//pre-decrement mode traverses registers in reverse order {A7-A0, D7-D0}
uint index = to.mode == AddressRegisterIndirectWithPreDecrement ? 15 - n : n;
if(to.mode == AddressRegisterIndirectWithPreDecrement) addr -= bytes<Size>();
auto data = index < 8 ? read<Size>(DataRegister{index}) : read<Size>(AddressRegister{index});
write<Size>(addr, data);
Update to v101r12 release. byuu says: Changelog: - new md/bus/ module for bus reads/writes - abstracts byte/word accesses wherever possible (everything but RAM; forces all but I/O to word, I/O to byte) - holds the system RAM since that's technically not part of the CPU anyway - added md/controller and md/system/peripherals - added emulation of gamepads - added stub PSG audio output (silent) to cap the framerate at 60fps with audio sync enabled - fixed VSRAM reads for plane vertical scrolling (two bugs here: add instead of sub; interlave plane A/B) - mask nametable read offsets (can't exceed 8192-byte nametables apparently) - emulated VRAM/VSRAM/CRAM reads from VDP data port - fixed sprite width/height size calculations - added partial emulation of 40-tile per scanline limitation (enough to fix Sonic's title screen) - fixed off-by-one sprite range testing - fixed sprite tile indexing - Vblank happens at Y=224 with overscan disabled - unsure what happens when you toggle it between Y=224 and Y=240 ... probably bad things - fixed reading of address register for ADDA, CMPA, SUBA - fixed sign extension for MOVEA effect address reads - updated MOVEM to increment the read addresses (but not writeback) for (aN) mode With all of that out of the way, we finally have Sonic the Hedgehog (fully?) playable. I played to stage 1-2 and through the special stage, at least. EDIT: yeah, we probably need HIRQs for Labyrinth Zone. Not much else works, of course. Most games hang waiting on the Z80, and those that don't (like Altered Beast) are still royally screwed. Tons of features still missing; including all of the Z80/PSG/YM2612. A note on the perihperals this time around: the Mega Drive EXT port is basically identical to the regular controller ports. So unlike with the Famicom and Super Famicom, I'm inheriting the exension port from the controller class.
2016-08-21 22:11:24 +00:00
if(to.mode != AddressRegisterIndirectWithPreDecrement) addr += bytes<Size>();
}
Update to v101r12 release. byuu says: Changelog: - new md/bus/ module for bus reads/writes - abstracts byte/word accesses wherever possible (everything but RAM; forces all but I/O to word, I/O to byte) - holds the system RAM since that's technically not part of the CPU anyway - added md/controller and md/system/peripherals - added emulation of gamepads - added stub PSG audio output (silent) to cap the framerate at 60fps with audio sync enabled - fixed VSRAM reads for plane vertical scrolling (two bugs here: add instead of sub; interlave plane A/B) - mask nametable read offsets (can't exceed 8192-byte nametables apparently) - emulated VRAM/VSRAM/CRAM reads from VDP data port - fixed sprite width/height size calculations - added partial emulation of 40-tile per scanline limitation (enough to fix Sonic's title screen) - fixed off-by-one sprite range testing - fixed sprite tile indexing - Vblank happens at Y=224 with overscan disabled - unsure what happens when you toggle it between Y=224 and Y=240 ... probably bad things - fixed reading of address register for ADDA, CMPA, SUBA - fixed sign extension for MOVEA effect address reads - updated MOVEM to increment the read addresses (but not writeback) for (aN) mode With all of that out of the way, we finally have Sonic the Hedgehog (fully?) playable. I played to stage 1-2 and through the special stage, at least. EDIT: yeah, we probably need HIRQs for Labyrinth Zone. Not much else works, of course. Most games hang waiting on the Z80, and those that don't (like Altered Beast) are still royally screwed. Tons of features still missing; including all of the Z80/PSG/YM2612. A note on the perihperals this time around: the Mega Drive EXT port is basically identical to the regular controller ports. So unlike with the Famicom and Super Famicom, I'm inheriting the exension port from the controller class.
2016-08-21 22:11:24 +00:00
AddressRegister with{to.reg};
if(to.mode == AddressRegisterIndirectWithPreDecrement ) write<Long>(with, addr);
if(to.mode == AddressRegisterIndirectWithPostIncrement) write<Long>(with, addr);
}
template<uint Size> auto M68K::instructionMOVEM_TO_REG(EffectiveAddress from) -> void {
auto list = readPC<Word>();
auto addr = fetch<Long>(from);
for(uint n : range(16)) {
if(!list.bit(n)) continue;
uint index = from.mode == AddressRegisterIndirectWithPreDecrement ? 15 - n : n;
if(from.mode == AddressRegisterIndirectWithPreDecrement) addr -= bytes<Size>();
auto data = read<Size>(addr);
data = sign<Size>(data);
index < 8 ? write<Long>(DataRegister{index}, data) : write<Long>(AddressRegister{index}, data);
Update to v101r12 release. byuu says: Changelog: - new md/bus/ module for bus reads/writes - abstracts byte/word accesses wherever possible (everything but RAM; forces all but I/O to word, I/O to byte) - holds the system RAM since that's technically not part of the CPU anyway - added md/controller and md/system/peripherals - added emulation of gamepads - added stub PSG audio output (silent) to cap the framerate at 60fps with audio sync enabled - fixed VSRAM reads for plane vertical scrolling (two bugs here: add instead of sub; interlave plane A/B) - mask nametable read offsets (can't exceed 8192-byte nametables apparently) - emulated VRAM/VSRAM/CRAM reads from VDP data port - fixed sprite width/height size calculations - added partial emulation of 40-tile per scanline limitation (enough to fix Sonic's title screen) - fixed off-by-one sprite range testing - fixed sprite tile indexing - Vblank happens at Y=224 with overscan disabled - unsure what happens when you toggle it between Y=224 and Y=240 ... probably bad things - fixed reading of address register for ADDA, CMPA, SUBA - fixed sign extension for MOVEA effect address reads - updated MOVEM to increment the read addresses (but not writeback) for (aN) mode With all of that out of the way, we finally have Sonic the Hedgehog (fully?) playable. I played to stage 1-2 and through the special stage, at least. EDIT: yeah, we probably need HIRQs for Labyrinth Zone. Not much else works, of course. Most games hang waiting on the Z80, and those that don't (like Altered Beast) are still royally screwed. Tons of features still missing; including all of the Z80/PSG/YM2612. A note on the perihperals this time around: the Mega Drive EXT port is basically identical to the regular controller ports. So unlike with the Famicom and Super Famicom, I'm inheriting the exension port from the controller class.
2016-08-21 22:11:24 +00:00
if(from.mode != AddressRegisterIndirectWithPreDecrement) addr += bytes<Size>();
}
Update to v101r12 release. byuu says: Changelog: - new md/bus/ module for bus reads/writes - abstracts byte/word accesses wherever possible (everything but RAM; forces all but I/O to word, I/O to byte) - holds the system RAM since that's technically not part of the CPU anyway - added md/controller and md/system/peripherals - added emulation of gamepads - added stub PSG audio output (silent) to cap the framerate at 60fps with audio sync enabled - fixed VSRAM reads for plane vertical scrolling (two bugs here: add instead of sub; interlave plane A/B) - mask nametable read offsets (can't exceed 8192-byte nametables apparently) - emulated VRAM/VSRAM/CRAM reads from VDP data port - fixed sprite width/height size calculations - added partial emulation of 40-tile per scanline limitation (enough to fix Sonic's title screen) - fixed off-by-one sprite range testing - fixed sprite tile indexing - Vblank happens at Y=224 with overscan disabled - unsure what happens when you toggle it between Y=224 and Y=240 ... probably bad things - fixed reading of address register for ADDA, CMPA, SUBA - fixed sign extension for MOVEA effect address reads - updated MOVEM to increment the read addresses (but not writeback) for (aN) mode With all of that out of the way, we finally have Sonic the Hedgehog (fully?) playable. I played to stage 1-2 and through the special stage, at least. EDIT: yeah, we probably need HIRQs for Labyrinth Zone. Not much else works, of course. Most games hang waiting on the Z80, and those that don't (like Altered Beast) are still royally screwed. Tons of features still missing; including all of the Z80/PSG/YM2612. A note on the perihperals this time around: the Mega Drive EXT port is basically identical to the regular controller ports. So unlike with the Famicom and Super Famicom, I'm inheriting the exension port from the controller class.
2016-08-21 22:11:24 +00:00
AddressRegister with{from.reg};
if(from.mode == AddressRegisterIndirectWithPreDecrement ) write<Long>(with, addr);
if(from.mode == AddressRegisterIndirectWithPostIncrement) write<Long>(with, addr);
}
template<uint Size> auto M68K::instructionMOVEP(DataRegister from, EffectiveAddress to) -> void {
auto address = fetch<Size>(to);
auto data = read<Long>(from);
uint shift = bits<Size>();
for(auto _ : range(bytes<Size>())) {
shift -= 8;
write<Byte>(address, data >> shift);
address += 2;
}
}
template<uint Size> auto M68K::instructionMOVEP(EffectiveAddress from, DataRegister to) -> void {
auto address = fetch<Size>(from);
auto data = read<Long>(to);
uint shift = bits<Size>();
for(auto _ : range(bytes<Size>())) {
shift -= 8;
data &= ~(0xff << shift);
data |= read<Byte>(address) << shift;
address += 2;
}
write<Long>(to, data);
}
auto M68K::instructionMOVEQ(DataRegister dr, uint8 immediate) -> void {
write<Long>(dr, sign<Byte>(immediate));
Update to v100r06 release. byuu says: Up to ten 68K instructions out of somewhere between 61 and 88, depending upon which PDF you look at. Of course, some of them aren't 100% completed yet, either. Lots of craziness with MOVEM, and BCC has a BSR variant that needs stack push/pop functions. This WIP actually took over eight hours to make, going through every possible permutation on how to design the core itself. The updated design now builds both the instruction decoder+dispatcher and the disassembler decoder into the same main loop during M68K's constructor. The special cases are also really psychotic on this processor, and I'm afraid of missing something via the fallthrough cases. So instead, I'm ordering the instructions alphabetically, and including exclusion cases to ignore binding invalid cases. If I end up remapping an existing register, then it'll throw a run-time assertion at program startup. I wanted very much to get rid of struct EA (EffectiveAddress), but it's too difficult to keep track of the internal effective address without it. So I split out the size to a separate parameter, since every opcode only has one size parameter, and otherwise it was getting duplicated in opcodes that take two EAs, and was also awkward with the flag testing. It's a bit more typing, but I feel it's more clean this way. Overall, I'm really worried this is going to be too slow. I don't want to turn the EA stuff into templates, because that will massively bloat out compilation times and object sizes, and will also need a special DSL preprocessor since C++ doesn't have a static for loop. I can definitely optimize a lot of EA's address/read/write functions away once the core is completed, but it's never going to hold a candle to a templatized 68K core. ---- Forgot to include the SA-1 regression fix. I always remember immediately after I upload and archive the WIP. Will try to get that in next time, I guess.
2016-07-16 08:39:44 +00:00
r.c = 0;
r.v = 0;
r.z = clip<Byte>(immediate) == 0;
r.n = sign<Byte>(immediate) < 0;
}
auto M68K::instructionMOVE_FROM_SR(EffectiveAddress ea) -> void {
auto data = readSR();
write<Word>(ea, data);
}
auto M68K::instructionMOVE_TO_CCR(EffectiveAddress ea) -> void {
auto data = read<Byte>(ea);
writeCCR(data);
}
auto M68K::instructionMOVE_TO_SR(EffectiveAddress ea) -> void {
if(!supervisor()) return;
auto data = read<Word>(ea);
writeSR(data);
}
auto M68K::instructionMOVE_FROM_USP(AddressRegister to) -> void {
if(!supervisor()) return;
write<Long>(to, r.sp);
}
auto M68K::instructionMOVE_TO_USP(AddressRegister from) -> void {
if(!supervisor()) return;
r.sp = read<Long>(from);
}
auto M68K::instructionMULS(DataRegister with, EffectiveAddress from) -> void {
auto source = read<Word>(from);
auto target = read<Word>(with);
auto result = (int16)source * (int16)target;
write<Long>(with, result);
r.c = 0;
r.v = 0;
r.z = clip<Long>(result) == 0;
r.n = sign<Long>(result) < 0;
}
auto M68K::instructionMULU(DataRegister with, EffectiveAddress from) -> void {
auto source = read<Word>(from);
auto target = read<Word>(with);
auto result = source * target;
write<Long>(with, result);
r.c = 0;
r.v = 0;
r.z = clip<Long>(result) == 0;
r.n = sign<Long>(result) < 0;
}
auto M68K::instructionNBCD(EffectiveAddress with) -> void {
auto source = 0u;
auto target = read<Byte, Hold>(with);
auto result = target - source - r.x;
bool v = false;
const bool adjustLo = (target ^ source ^ result) & 0x10;
const bool adjustHi = result & 0x100;
if(adjustLo) {
auto previous = result;
result -= 0x06;
v |= (previous & 0x80) & (~result & 0x80);
}
if(adjustHi) {
auto previous = result;
result -= 0x60;
v |= (previous & 0x80) & (~result & 0x80);
}
write<Byte>(with, result);
r.c = sign<Byte>(result >> 1) < 0;
r.v = v;
r.z = clip<Byte>(result) ? 0 : r.z;
r.n = sign<Byte>(result) < 0;
}
template<uint Size> auto M68K::instructionNEG(EffectiveAddress with) -> void {
Update to v101r11 release. byuu says: Changelog: - 68K: fixed NEG/NEGX operand order - 68K: fixed bug in disassembler that was breaking trace logging - VDP: improved sprite rendering (still 100% broken) - VDP: added horizontal/vertical scrolling (90% broken) Forgot: - 68K: fix extension word sign bit on indexed modes for disassembler as well - 68K: emulate STOP properly (use r.stop flag; clear on IRQs firing) I'm really wearing out fast here. The Genesis documentation is somehow even worse than Game Boy documentation, but this is a far more complex system. It's a massive time sink to sit here banging away at every possible combination of how things could work, only to see no positive improvements. Nothing I do seems to get sprites to do a goddamn thing. squee says the sprite Y field is 10-bits, X field is 9-bits. genvdp says they're both 10-bits. BlastEm treats them like they're both 10-bits, then masks off the upper bit so it's effectively 9-bits anyway. Nothing ever bothers to tell you whether the horizontal scroll values are supposed to add or subtract from the current X position. Probably the most basic detail you could imagine for explaining horizontal scrolling and yet ... nope. Nothing. I can't even begin to understand how the VDP FIFO functionality works, or what the fuck is meant by "slots". I'm completely at a loss as how how in the holy hell the 68K works with 8-bit accesses. I don't know whether I need byte/word handlers for every device, or if I can just hook it right into the 68K core itself. This one's probably the most major design detail. I need to know this before I go and implement the PSG/YM2612/IO ports-\>gamepads/Z80/etc. Trying to debug the 68K is murder because basically every game likes to start with a 20,000,000-instruction reset phase of checksumming entire games, and clearing out the memory as agonizingly slowly as humanly possible. And like the ARM, there's too many registers so I'd need three widescreen monitors to comfortably view the entire debugger output lines onscreen. I can't get any test ROMs to debug functionality outside of full games because every **goddamned** test ROM coder thinks it's acceptable to tell people to go fetch some toolchain from a link that died in the late '90s and only works on MS-DOS 6.22 to build their fucking shit, because god forbid you include a 32KiB assembled ROM image in your fucking archives. ... I may have to take a break for a while. We'll see.
2016-08-21 02:50:05 +00:00
auto result = SUB<Size>(read<Size, Hold>(with), 0);
write<Size>(with, result);
}
template<uint Size> auto M68K::instructionNEGX(EffectiveAddress with) -> void {
Update to v101r11 release. byuu says: Changelog: - 68K: fixed NEG/NEGX operand order - 68K: fixed bug in disassembler that was breaking trace logging - VDP: improved sprite rendering (still 100% broken) - VDP: added horizontal/vertical scrolling (90% broken) Forgot: - 68K: fix extension word sign bit on indexed modes for disassembler as well - 68K: emulate STOP properly (use r.stop flag; clear on IRQs firing) I'm really wearing out fast here. The Genesis documentation is somehow even worse than Game Boy documentation, but this is a far more complex system. It's a massive time sink to sit here banging away at every possible combination of how things could work, only to see no positive improvements. Nothing I do seems to get sprites to do a goddamn thing. squee says the sprite Y field is 10-bits, X field is 9-bits. genvdp says they're both 10-bits. BlastEm treats them like they're both 10-bits, then masks off the upper bit so it's effectively 9-bits anyway. Nothing ever bothers to tell you whether the horizontal scroll values are supposed to add or subtract from the current X position. Probably the most basic detail you could imagine for explaining horizontal scrolling and yet ... nope. Nothing. I can't even begin to understand how the VDP FIFO functionality works, or what the fuck is meant by "slots". I'm completely at a loss as how how in the holy hell the 68K works with 8-bit accesses. I don't know whether I need byte/word handlers for every device, or if I can just hook it right into the 68K core itself. This one's probably the most major design detail. I need to know this before I go and implement the PSG/YM2612/IO ports-\>gamepads/Z80/etc. Trying to debug the 68K is murder because basically every game likes to start with a 20,000,000-instruction reset phase of checksumming entire games, and clearing out the memory as agonizingly slowly as humanly possible. And like the ARM, there's too many registers so I'd need three widescreen monitors to comfortably view the entire debugger output lines onscreen. I can't get any test ROMs to debug functionality outside of full games because every **goddamned** test ROM coder thinks it's acceptable to tell people to go fetch some toolchain from a link that died in the late '90s and only works on MS-DOS 6.22 to build their fucking shit, because god forbid you include a 32KiB assembled ROM image in your fucking archives. ... I may have to take a break for a while. We'll see.
2016-08-21 02:50:05 +00:00
auto result = SUB<Size, Extend>(read<Size, Hold>(with), 0);
write<Size>(with, result);
}
auto M68K::instructionNOP() -> void {
}
template<uint Size> auto M68K::instructionNOT(EffectiveAddress with) -> void {
auto result = ~read<Size, Hold>(with);
write<Size>(with, result);
r.c = 0;
r.v = 0;
r.z = clip<Size>(result) == 0;
r.n = sign<Size>(result) < 0;
}
template<uint Size> auto M68K::OR(uint32 source, uint32 target) -> uint32 {
auto result = target | source;
r.c = 0;
r.v = 0;
r.z = clip<Size>(result) == 0;
r.n = sign<Size>(result) < 0;
return clip<Size>(result);
}
template<uint Size> auto M68K::instructionOR(EffectiveAddress from, DataRegister with) -> void {
auto source = read<Size>(from);
auto target = read<Size>(with);
auto result = OR<Size>(source, target);
write<Size>(with, result);
}
template<uint Size> auto M68K::instructionOR(DataRegister from, EffectiveAddress with) -> void {
auto source = read<Size>(from);
auto target = read<Size, Hold>(with);
auto result = OR<Size>(source, target);
write<Size>(with, result);
}
template<uint Size> auto M68K::instructionORI(EffectiveAddress with) -> void {
auto source = readPC<Size>();
auto target = read<Size, Hold>(with);
auto result = OR<Size>(source, target);
write<Size>(with, result);
}
auto M68K::instructionORI_TO_CCR() -> void {
auto data = readPC<Word>();
writeCCR(readCCR() | data);
}
auto M68K::instructionORI_TO_SR() -> void {
if(!supervisor()) return;
auto data = readPC<Word>();
writeSR(readSR() | data);
}
auto M68K::instructionPEA(EffectiveAddress from) -> void {
auto data = fetch<Long>(from);
push<Long>(data);
}
auto M68K::instructionRESET() -> void {
if(!supervisor()) return;
r.reset = true;
}
template<uint Size> auto M68K::ROL(uint32 result, uint shift) -> uint32 {
bool carry = false;
for(auto _ : range(shift)) {
carry = result & msb<Size>();
result = result << 1 | carry;
}
r.c = carry;
r.v = 0;
r.z = clip<Size>(result) == 0;
r.n = sign<Size>(result) < 0;
return clip<Size>(result);
}
template<uint Size> auto M68K::instructionROL(uint4 shift, DataRegister modify) -> void {
auto result = ROL<Size>(read<Size>(modify), shift);
write<Size>(modify, result);
}
template<uint Size> auto M68K::instructionROL(DataRegister shift, DataRegister modify) -> void {
auto count = read<Long>(shift) & 63;
auto result = ROL<Size>(read<Size>(modify), count);
write<Size>(modify, result);
}
auto M68K::instructionROL(EffectiveAddress modify) -> void {
auto result = ROL<Word>(read<Word, Hold>(modify), 1);
write<Word>(modify, result);
}
template<uint Size> auto M68K::ROR(uint32 result, uint shift) -> uint32 {
bool carry = false;
for(auto _ : range(shift)) {
carry = result & lsb<Size>();
result >>= 1;
if(carry) result |= msb<Size>();
}
r.c = carry;
r.v = 0;
r.z = clip<Size>(result) == 0;
r.n = sign<Size>(result) < 0;
return clip<Size>(result);
}
template<uint Size> auto M68K::instructionROR(uint4 shift, DataRegister modify) -> void {
auto result = ROR<Size>(read<Size>(modify), shift);
write<Size>(modify, result);
}
template<uint Size> auto M68K::instructionROR(DataRegister shift, DataRegister modify) -> void {
auto count = read<Long>(shift) & 63;
auto result = ROR<Size>(read<Size>(modify), count);
write<Size>(modify, result);
}
auto M68K::instructionROR(EffectiveAddress modify) -> void {
auto result = ROR<Word>(read<Word, Hold>(modify), 1);
write<Word>(modify, result);
}
template<uint Size> auto M68K::ROXL(uint32 result, uint shift) -> uint32 {
bool carry = r.x;
for(auto _ : range(shift)) {
bool extend = carry;
carry = result & msb<Size>();
result = result << 1 | extend;
}
r.c = carry;
r.v = 0;
r.z = clip<Size>(result) == 0;
r.n = sign<Size>(result) < 0;
r.x = r.c;
return clip<Size>(result);
}
template<uint Size> auto M68K::instructionROXL(uint4 shift, DataRegister modify) -> void {
auto result = ROXL<Size>(read<Size>(modify), shift);
write<Size>(modify, result);
}
template<uint Size> auto M68K::instructionROXL(DataRegister shift, DataRegister modify) -> void {
auto count = read<Long>(shift) & 63;
auto result = ROXL<Size>(read<Size>(modify), count);
write<Size>(modify, result);
}
auto M68K::instructionROXL(EffectiveAddress modify) -> void {
auto result = ROXL<Word>(read<Word, Hold>(modify), 1);
write<Word>(modify, result);
}
template<uint Size> auto M68K::ROXR(uint32 result, uint shift) -> uint32 {
bool carry = r.x;
for(auto _ : range(shift)) {
bool extend = carry;
carry = result & lsb<Size>();
result >>= 1;
if(extend) result |= msb<Size>();
}
r.c = carry;
r.v = 0;
r.z = clip<Size>(result) == 0;
r.n = sign<Size>(result) < 0;
r.x = r.c;
return clip<Size>(result);
}
template<uint Size> auto M68K::instructionROXR(uint4 shift, DataRegister modify) -> void {
auto result = ROXR<Size>(read<Size>(modify), shift);
write<Size>(modify, result);
}
template<uint Size> auto M68K::instructionROXR(DataRegister shift, DataRegister modify) -> void {
auto count = read<Long>(shift) & 63;
auto result = ROXR<Size>(read<Size>(modify), count);
write<Size>(modify, result);
}
auto M68K::instructionROXR(EffectiveAddress modify) -> void {
auto result = ROXR<Word>(read<Word, Hold>(modify), 1);
write<Word>(modify, result);
}
auto M68K::instructionRTE() -> void {
if(!supervisor()) return;
auto sr = pop<Word>();
r.pc = pop<Long>();
writeSR(sr);
}
auto M68K::instructionRTR() -> void {
writeCCR(pop<Word>());
r.pc = pop<Long>();
}
auto M68K::instructionRTS() -> void {
r.pc = pop<Long>();
}
auto M68K::instructionSBCD(EffectiveAddress with, EffectiveAddress from) -> void {
auto source = read<Byte>(from);
auto target = read<Byte, Hold>(with);
auto result = target - source - r.x;
bool v = false;
const bool adjustLo = (target ^ source ^ result) & 0x10;
const bool adjustHi = result & 0x100;
if(adjustLo) {
auto previous = result;
result -= 0x06;
v |= (previous & 0x80) & (~result & 0x80);
}
if(adjustHi) {
auto previous = result;
result -= 0x60;
v |= (previous & 0x80) & (~result & 0x80);
}
write<Byte>(with, result);
r.c = sign<Byte>(result >> 1) < 0;
r.v = v;
r.z = clip<Byte>(result) ? 0 : r.z;
r.n = sign<Byte>(result) < 0;
}
auto M68K::instructionSCC(uint4 condition, EffectiveAddress to) -> void {
uint8 result = testCondition(condition) ? ~0 : 0;
write<Byte>(to, result);
}
auto M68K::instructionSTOP() -> void {
if(!supervisor()) return;
auto sr = readPC<Word>();
writeSR(sr);
r.stop = true;
}
Update to v100r15 release. byuu wrote: Aforementioned scheduler changes added. Longer explanation of why here: http://hastebin.com/raw/toxedenece Again, we really need to test this as thoroughly as possible for regressions :/ This is a really major change that affects absolutely everything: all emulation cores, all coprocessors, etc. Also added ADDX and SUB to the 68K core, which brings us just barely above 50% of the instruction encoding space completed. [Editor's note: The "aformentioned scheduler changes" were described in a previous forum post: Unfortunately, 64-bits just wasn't enough precision (we were getting misalignments ~230 times a second on 21/24MHz clocks), so I had to move to 128-bit counters. This of course doesn't exist on 32-bit architectures (and probably not on all 64-bit ones either), so for now ... higan's only going to compile on 64-bit machines until we figure something out. Maybe we offer a "lower precision" fallback for machines that lack uint128_t or something. Using the booth algorithm would be way too slow. Anyway, the precision is now 2^-96, which is roughly 10^-29. That puts us far beyond the yoctosecond. Suck it, MAME :P I'm jokingly referring to it as the byuusecond. The other 32-bits of precision allows a 1Hz clock to run up to one full second before all clocks need to be normalized to prevent overflow. I fixed a serious wobbling issue where I was using clock > other.clock for synchronization instead of clock >= other.clock; and also another aliasing issue when two threads share a common frequency, but don't run in lock-step. The latter I don't even fully understand, but I did observe it in testing. nall/serialization.hpp has been extended to support 128-bit integers, but without explicitly naming them (yay generic code), so nall will still compile on 32-bit platforms for all other applications. Speed is basically a wash now. FC's a bit slower, SFC's a bit faster. The "longer explanation" in the linked hastebin is: Okay, so the idea is that we can have an arbitrary number of oscillators. Take the SNES: - CPU/PPU clock = 21477272.727272hz - SMP/DSP clock = 24576000hz - Cartridge DSP1 clock = 8000000hz - Cartridge MSU1 clock = 44100hz - Controller Port 1 modem controller clock = 57600hz - Controller Port 2 barcode battler clock = 115200hz - Expansion Port exercise bike clock = 192000hz Is this a pathological case? Of course it is, but it's possible. The first four do exist in the wild already: see Rockman X2 MSU1 patch. Manifest files with higan let you specify any frequency you want for any component. The old trick higan used was to hold an int64 counter for each thread:thread synchronization, and adjust it like so: - if thread A steps X clocks; then clock += X * threadB.frequency - if clock >= 0; switch to threadB - if thread B steps X clocks; then clock -= X * threadA.frequency - if clock < 0; switch to threadA But there are also system configurations where one processor has to synchronize with more than one other processor. Take the Genesis: - the 68K has to sync with the Z80 and PSG and YM2612 and VDP - the Z80 has to sync with the 68K and PSG and YM2612 - the PSG has to sync with the 68K and Z80 and YM2612 Now I could do this by having an int64 clock value for every association. But these clock values would have to be outside the individual Thread class objects, and we would have to update every relationship's clock value. So the 68K would have to update the Z80, PSG, YM2612 and VDP clocks. That's four expensive 64-bit multiply-adds per clock step event instead of one. As such, we have to account for both possibilities. The only way to do this is with a single time base. We do this like so: - setup: scalar = timeBase / frequency - step: clock += scalar * clocks Once per second, we look at every thread, find the smallest clock value. Then subtract that value from all threads. This prevents the clock counters from overflowing. Unfortunately, these oscillator values are psychotic, unpredictable, and often times repeating fractions. Even with a timeBase of 1,000,000,000,000,000,000 (one attosecond); we get rounding errors every ~16,300 synchronizations. Specifically, this happens with a CPU running at 21477273hz (rounded) and SMP running at 24576000hz. That may be good enough for most emulators, but ... you know how I am. Plus, even at the attosecond level, we're really pushing against the limits of 64-bit integers. Given the reciprocal inverse, a frequency of 1Hz (which does exist in higan!) would have a scalar that consumes 1/18th of the entire range of a uint64 on every single step. Yes, I could raise the frequency, and then step by that amount, I know. But I don't want to have weird gotchas like that in the scheduler core. Until I increase the accuracy to about 100 times greater than a yoctosecond, the rounding errors are too great. And since the only choice above 64-bit values is 128-bit values; we might as well use all the extra headroom. 2^-96 as a timebase gives me the ability to have both a 1Hz and 4GHz clock; and run them both for a full second; before an overflow event would occur. Another hastebin includes demonstration code: #include <libco/libco.h> #include <nall/nall.hpp> using namespace nall; // cothread_t mainThread = nullptr; const uint iterations = 100'000'000; const uint cpuFreq = 21477272.727272 + 0.5; const uint smpFreq = 24576000.000000 + 0.5; const uint cpuStep = 4; const uint smpStep = 5; // struct ThreadA { cothread_t handle = nullptr; uint64 frequency = 0; int64 clock = 0; auto create(auto (*entrypoint)() -> void, uint frequency) { this->handle = co_create(65536, entrypoint); this->frequency = frequency; this->clock = 0; } }; struct CPUA : ThreadA { static auto Enter() -> void; auto main() -> void; CPUA() { create(&CPUA::Enter, cpuFreq); } } cpuA; struct SMPA : ThreadA { static auto Enter() -> void; auto main() -> void; SMPA() { create(&SMPA::Enter, smpFreq); } } smpA; uint8 queueA[iterations]; uint offsetA; cothread_t resumeA = cpuA.handle; auto EnterA() -> void { offsetA = 0; co_switch(resumeA); } auto QueueA(uint value) -> void { queueA[offsetA++] = value; if(offsetA >= iterations) { resumeA = co_active(); co_switch(mainThread); } } auto CPUA::Enter() -> void { while(true) cpuA.main(); } auto CPUA::main() -> void { QueueA(1); smpA.clock -= cpuStep * smpA.frequency; if(smpA.clock < 0) co_switch(smpA.handle); } auto SMPA::Enter() -> void { while(true) smpA.main(); } auto SMPA::main() -> void { QueueA(2); smpA.clock += smpStep * cpuA.frequency; if(smpA.clock >= 0) co_switch(cpuA.handle); } // struct ThreadB { cothread_t handle = nullptr; uint128_t scalar = 0; uint128_t clock = 0; auto print128(uint128_t value) { string s; while(value) { s.append((char)('0' + value % 10)); value /= 10; } s.reverse(); print(s, "\n"); } //femtosecond (10^15) = 16306 //attosecond (10^18) = 688838 //zeptosecond (10^21) = 13712691 //yoctosecond (10^24) = 13712691 (hitting a dead-end on a rounding error causing a wobble) //byuusecond? ( 2^96) = (perfect? 79,228 times more precise than a yoctosecond) auto create(auto (*entrypoint)() -> void, uint128_t frequency) { this->handle = co_create(65536, entrypoint); uint128_t unitOfTime = 1; //for(uint n : range(29)) unitOfTime *= 10; unitOfTime <<= 96; //2^96 time units ... this->scalar = unitOfTime / frequency; print128(this->scalar); this->clock = 0; } auto step(uint128_t clocks) -> void { clock += clocks * scalar; } auto synchronize(ThreadB& thread) -> void { if(clock >= thread.clock) co_switch(thread.handle); } }; struct CPUB : ThreadB { static auto Enter() -> void; auto main() -> void; CPUB() { create(&CPUB::Enter, cpuFreq); } } cpuB; struct SMPB : ThreadB { static auto Enter() -> void; auto main() -> void; SMPB() { create(&SMPB::Enter, smpFreq); clock = 1; } } smpB; auto correct() -> void { auto minimum = min(cpuB.clock, smpB.clock); cpuB.clock -= minimum; smpB.clock -= minimum; } uint8 queueB[iterations]; uint offsetB; cothread_t resumeB = cpuB.handle; auto EnterB() -> void { correct(); offsetB = 0; co_switch(resumeB); } auto QueueB(uint value) -> void { queueB[offsetB++] = value; if(offsetB >= iterations) { resumeB = co_active(); co_switch(mainThread); } } auto CPUB::Enter() -> void { while(true) cpuB.main(); } auto CPUB::main() -> void { QueueB(1); step(cpuStep); synchronize(smpB); } auto SMPB::Enter() -> void { while(true) smpB.main(); } auto SMPB::main() -> void { QueueB(2); step(smpStep); synchronize(cpuB); } // #include <nall/main.hpp> auto nall::main(string_vector) -> void { mainThread = co_active(); uint masterCounter = 0; while(true) { print(masterCounter++, " ...\n"); auto A = clock(); EnterA(); auto B = clock(); print((double)(B - A) / CLOCKS_PER_SEC, "s\n"); auto C = clock(); EnterB(); auto D = clock(); print((double)(D - C) / CLOCKS_PER_SEC, "s\n"); for(uint n : range(iterations)) { if(queueA[n] != queueB[n]) return print("fail at ", n, "\n"); } } } ...and that's everything.]
2016-07-31 02:11:20 +00:00
template<uint Size, bool Extend> auto M68K::SUB(uint32 source, uint32 target) -> uint32 {
auto result = (uint64)target - source;
Update to v100r15 release. byuu wrote: Aforementioned scheduler changes added. Longer explanation of why here: http://hastebin.com/raw/toxedenece Again, we really need to test this as thoroughly as possible for regressions :/ This is a really major change that affects absolutely everything: all emulation cores, all coprocessors, etc. Also added ADDX and SUB to the 68K core, which brings us just barely above 50% of the instruction encoding space completed. [Editor's note: The "aformentioned scheduler changes" were described in a previous forum post: Unfortunately, 64-bits just wasn't enough precision (we were getting misalignments ~230 times a second on 21/24MHz clocks), so I had to move to 128-bit counters. This of course doesn't exist on 32-bit architectures (and probably not on all 64-bit ones either), so for now ... higan's only going to compile on 64-bit machines until we figure something out. Maybe we offer a "lower precision" fallback for machines that lack uint128_t or something. Using the booth algorithm would be way too slow. Anyway, the precision is now 2^-96, which is roughly 10^-29. That puts us far beyond the yoctosecond. Suck it, MAME :P I'm jokingly referring to it as the byuusecond. The other 32-bits of precision allows a 1Hz clock to run up to one full second before all clocks need to be normalized to prevent overflow. I fixed a serious wobbling issue where I was using clock > other.clock for synchronization instead of clock >= other.clock; and also another aliasing issue when two threads share a common frequency, but don't run in lock-step. The latter I don't even fully understand, but I did observe it in testing. nall/serialization.hpp has been extended to support 128-bit integers, but without explicitly naming them (yay generic code), so nall will still compile on 32-bit platforms for all other applications. Speed is basically a wash now. FC's a bit slower, SFC's a bit faster. The "longer explanation" in the linked hastebin is: Okay, so the idea is that we can have an arbitrary number of oscillators. Take the SNES: - CPU/PPU clock = 21477272.727272hz - SMP/DSP clock = 24576000hz - Cartridge DSP1 clock = 8000000hz - Cartridge MSU1 clock = 44100hz - Controller Port 1 modem controller clock = 57600hz - Controller Port 2 barcode battler clock = 115200hz - Expansion Port exercise bike clock = 192000hz Is this a pathological case? Of course it is, but it's possible. The first four do exist in the wild already: see Rockman X2 MSU1 patch. Manifest files with higan let you specify any frequency you want for any component. The old trick higan used was to hold an int64 counter for each thread:thread synchronization, and adjust it like so: - if thread A steps X clocks; then clock += X * threadB.frequency - if clock >= 0; switch to threadB - if thread B steps X clocks; then clock -= X * threadA.frequency - if clock < 0; switch to threadA But there are also system configurations where one processor has to synchronize with more than one other processor. Take the Genesis: - the 68K has to sync with the Z80 and PSG and YM2612 and VDP - the Z80 has to sync with the 68K and PSG and YM2612 - the PSG has to sync with the 68K and Z80 and YM2612 Now I could do this by having an int64 clock value for every association. But these clock values would have to be outside the individual Thread class objects, and we would have to update every relationship's clock value. So the 68K would have to update the Z80, PSG, YM2612 and VDP clocks. That's four expensive 64-bit multiply-adds per clock step event instead of one. As such, we have to account for both possibilities. The only way to do this is with a single time base. We do this like so: - setup: scalar = timeBase / frequency - step: clock += scalar * clocks Once per second, we look at every thread, find the smallest clock value. Then subtract that value from all threads. This prevents the clock counters from overflowing. Unfortunately, these oscillator values are psychotic, unpredictable, and often times repeating fractions. Even with a timeBase of 1,000,000,000,000,000,000 (one attosecond); we get rounding errors every ~16,300 synchronizations. Specifically, this happens with a CPU running at 21477273hz (rounded) and SMP running at 24576000hz. That may be good enough for most emulators, but ... you know how I am. Plus, even at the attosecond level, we're really pushing against the limits of 64-bit integers. Given the reciprocal inverse, a frequency of 1Hz (which does exist in higan!) would have a scalar that consumes 1/18th of the entire range of a uint64 on every single step. Yes, I could raise the frequency, and then step by that amount, I know. But I don't want to have weird gotchas like that in the scheduler core. Until I increase the accuracy to about 100 times greater than a yoctosecond, the rounding errors are too great. And since the only choice above 64-bit values is 128-bit values; we might as well use all the extra headroom. 2^-96 as a timebase gives me the ability to have both a 1Hz and 4GHz clock; and run them both for a full second; before an overflow event would occur. Another hastebin includes demonstration code: #include <libco/libco.h> #include <nall/nall.hpp> using namespace nall; // cothread_t mainThread = nullptr; const uint iterations = 100'000'000; const uint cpuFreq = 21477272.727272 + 0.5; const uint smpFreq = 24576000.000000 + 0.5; const uint cpuStep = 4; const uint smpStep = 5; // struct ThreadA { cothread_t handle = nullptr; uint64 frequency = 0; int64 clock = 0; auto create(auto (*entrypoint)() -> void, uint frequency) { this->handle = co_create(65536, entrypoint); this->frequency = frequency; this->clock = 0; } }; struct CPUA : ThreadA { static auto Enter() -> void; auto main() -> void; CPUA() { create(&CPUA::Enter, cpuFreq); } } cpuA; struct SMPA : ThreadA { static auto Enter() -> void; auto main() -> void; SMPA() { create(&SMPA::Enter, smpFreq); } } smpA; uint8 queueA[iterations]; uint offsetA; cothread_t resumeA = cpuA.handle; auto EnterA() -> void { offsetA = 0; co_switch(resumeA); } auto QueueA(uint value) -> void { queueA[offsetA++] = value; if(offsetA >= iterations) { resumeA = co_active(); co_switch(mainThread); } } auto CPUA::Enter() -> void { while(true) cpuA.main(); } auto CPUA::main() -> void { QueueA(1); smpA.clock -= cpuStep * smpA.frequency; if(smpA.clock < 0) co_switch(smpA.handle); } auto SMPA::Enter() -> void { while(true) smpA.main(); } auto SMPA::main() -> void { QueueA(2); smpA.clock += smpStep * cpuA.frequency; if(smpA.clock >= 0) co_switch(cpuA.handle); } // struct ThreadB { cothread_t handle = nullptr; uint128_t scalar = 0; uint128_t clock = 0; auto print128(uint128_t value) { string s; while(value) { s.append((char)('0' + value % 10)); value /= 10; } s.reverse(); print(s, "\n"); } //femtosecond (10^15) = 16306 //attosecond (10^18) = 688838 //zeptosecond (10^21) = 13712691 //yoctosecond (10^24) = 13712691 (hitting a dead-end on a rounding error causing a wobble) //byuusecond? ( 2^96) = (perfect? 79,228 times more precise than a yoctosecond) auto create(auto (*entrypoint)() -> void, uint128_t frequency) { this->handle = co_create(65536, entrypoint); uint128_t unitOfTime = 1; //for(uint n : range(29)) unitOfTime *= 10; unitOfTime <<= 96; //2^96 time units ... this->scalar = unitOfTime / frequency; print128(this->scalar); this->clock = 0; } auto step(uint128_t clocks) -> void { clock += clocks * scalar; } auto synchronize(ThreadB& thread) -> void { if(clock >= thread.clock) co_switch(thread.handle); } }; struct CPUB : ThreadB { static auto Enter() -> void; auto main() -> void; CPUB() { create(&CPUB::Enter, cpuFreq); } } cpuB; struct SMPB : ThreadB { static auto Enter() -> void; auto main() -> void; SMPB() { create(&SMPB::Enter, smpFreq); clock = 1; } } smpB; auto correct() -> void { auto minimum = min(cpuB.clock, smpB.clock); cpuB.clock -= minimum; smpB.clock -= minimum; } uint8 queueB[iterations]; uint offsetB; cothread_t resumeB = cpuB.handle; auto EnterB() -> void { correct(); offsetB = 0; co_switch(resumeB); } auto QueueB(uint value) -> void { queueB[offsetB++] = value; if(offsetB >= iterations) { resumeB = co_active(); co_switch(mainThread); } } auto CPUB::Enter() -> void { while(true) cpuB.main(); } auto CPUB::main() -> void { QueueB(1); step(cpuStep); synchronize(smpB); } auto SMPB::Enter() -> void { while(true) smpB.main(); } auto SMPB::main() -> void { QueueB(2); step(smpStep); synchronize(cpuB); } // #include <nall/main.hpp> auto nall::main(string_vector) -> void { mainThread = co_active(); uint masterCounter = 0; while(true) { print(masterCounter++, " ...\n"); auto A = clock(); EnterA(); auto B = clock(); print((double)(B - A) / CLOCKS_PER_SEC, "s\n"); auto C = clock(); EnterB(); auto D = clock(); print((double)(D - C) / CLOCKS_PER_SEC, "s\n"); for(uint n : range(iterations)) { if(queueA[n] != queueB[n]) return print("fail at ", n, "\n"); } } } ...and that's everything.]
2016-07-31 02:11:20 +00:00
if(Extend) result -= r.x;
r.c = sign<Size>(result >> 1) < 0;
r.v = sign<Size>((target ^ source) & (target ^ result)) < 0;
r.z = clip<Size>(result) ? 0 : (Extend ? r.z : 1);
r.n = sign<Size>(result) < 0;
r.x = r.c;
Update to v100r15 release. byuu wrote: Aforementioned scheduler changes added. Longer explanation of why here: http://hastebin.com/raw/toxedenece Again, we really need to test this as thoroughly as possible for regressions :/ This is a really major change that affects absolutely everything: all emulation cores, all coprocessors, etc. Also added ADDX and SUB to the 68K core, which brings us just barely above 50% of the instruction encoding space completed. [Editor's note: The "aformentioned scheduler changes" were described in a previous forum post: Unfortunately, 64-bits just wasn't enough precision (we were getting misalignments ~230 times a second on 21/24MHz clocks), so I had to move to 128-bit counters. This of course doesn't exist on 32-bit architectures (and probably not on all 64-bit ones either), so for now ... higan's only going to compile on 64-bit machines until we figure something out. Maybe we offer a "lower precision" fallback for machines that lack uint128_t or something. Using the booth algorithm would be way too slow. Anyway, the precision is now 2^-96, which is roughly 10^-29. That puts us far beyond the yoctosecond. Suck it, MAME :P I'm jokingly referring to it as the byuusecond. The other 32-bits of precision allows a 1Hz clock to run up to one full second before all clocks need to be normalized to prevent overflow. I fixed a serious wobbling issue where I was using clock > other.clock for synchronization instead of clock >= other.clock; and also another aliasing issue when two threads share a common frequency, but don't run in lock-step. The latter I don't even fully understand, but I did observe it in testing. nall/serialization.hpp has been extended to support 128-bit integers, but without explicitly naming them (yay generic code), so nall will still compile on 32-bit platforms for all other applications. Speed is basically a wash now. FC's a bit slower, SFC's a bit faster. The "longer explanation" in the linked hastebin is: Okay, so the idea is that we can have an arbitrary number of oscillators. Take the SNES: - CPU/PPU clock = 21477272.727272hz - SMP/DSP clock = 24576000hz - Cartridge DSP1 clock = 8000000hz - Cartridge MSU1 clock = 44100hz - Controller Port 1 modem controller clock = 57600hz - Controller Port 2 barcode battler clock = 115200hz - Expansion Port exercise bike clock = 192000hz Is this a pathological case? Of course it is, but it's possible. The first four do exist in the wild already: see Rockman X2 MSU1 patch. Manifest files with higan let you specify any frequency you want for any component. The old trick higan used was to hold an int64 counter for each thread:thread synchronization, and adjust it like so: - if thread A steps X clocks; then clock += X * threadB.frequency - if clock >= 0; switch to threadB - if thread B steps X clocks; then clock -= X * threadA.frequency - if clock < 0; switch to threadA But there are also system configurations where one processor has to synchronize with more than one other processor. Take the Genesis: - the 68K has to sync with the Z80 and PSG and YM2612 and VDP - the Z80 has to sync with the 68K and PSG and YM2612 - the PSG has to sync with the 68K and Z80 and YM2612 Now I could do this by having an int64 clock value for every association. But these clock values would have to be outside the individual Thread class objects, and we would have to update every relationship's clock value. So the 68K would have to update the Z80, PSG, YM2612 and VDP clocks. That's four expensive 64-bit multiply-adds per clock step event instead of one. As such, we have to account for both possibilities. The only way to do this is with a single time base. We do this like so: - setup: scalar = timeBase / frequency - step: clock += scalar * clocks Once per second, we look at every thread, find the smallest clock value. Then subtract that value from all threads. This prevents the clock counters from overflowing. Unfortunately, these oscillator values are psychotic, unpredictable, and often times repeating fractions. Even with a timeBase of 1,000,000,000,000,000,000 (one attosecond); we get rounding errors every ~16,300 synchronizations. Specifically, this happens with a CPU running at 21477273hz (rounded) and SMP running at 24576000hz. That may be good enough for most emulators, but ... you know how I am. Plus, even at the attosecond level, we're really pushing against the limits of 64-bit integers. Given the reciprocal inverse, a frequency of 1Hz (which does exist in higan!) would have a scalar that consumes 1/18th of the entire range of a uint64 on every single step. Yes, I could raise the frequency, and then step by that amount, I know. But I don't want to have weird gotchas like that in the scheduler core. Until I increase the accuracy to about 100 times greater than a yoctosecond, the rounding errors are too great. And since the only choice above 64-bit values is 128-bit values; we might as well use all the extra headroom. 2^-96 as a timebase gives me the ability to have both a 1Hz and 4GHz clock; and run them both for a full second; before an overflow event would occur. Another hastebin includes demonstration code: #include <libco/libco.h> #include <nall/nall.hpp> using namespace nall; // cothread_t mainThread = nullptr; const uint iterations = 100'000'000; const uint cpuFreq = 21477272.727272 + 0.5; const uint smpFreq = 24576000.000000 + 0.5; const uint cpuStep = 4; const uint smpStep = 5; // struct ThreadA { cothread_t handle = nullptr; uint64 frequency = 0; int64 clock = 0; auto create(auto (*entrypoint)() -> void, uint frequency) { this->handle = co_create(65536, entrypoint); this->frequency = frequency; this->clock = 0; } }; struct CPUA : ThreadA { static auto Enter() -> void; auto main() -> void; CPUA() { create(&CPUA::Enter, cpuFreq); } } cpuA; struct SMPA : ThreadA { static auto Enter() -> void; auto main() -> void; SMPA() { create(&SMPA::Enter, smpFreq); } } smpA; uint8 queueA[iterations]; uint offsetA; cothread_t resumeA = cpuA.handle; auto EnterA() -> void { offsetA = 0; co_switch(resumeA); } auto QueueA(uint value) -> void { queueA[offsetA++] = value; if(offsetA >= iterations) { resumeA = co_active(); co_switch(mainThread); } } auto CPUA::Enter() -> void { while(true) cpuA.main(); } auto CPUA::main() -> void { QueueA(1); smpA.clock -= cpuStep * smpA.frequency; if(smpA.clock < 0) co_switch(smpA.handle); } auto SMPA::Enter() -> void { while(true) smpA.main(); } auto SMPA::main() -> void { QueueA(2); smpA.clock += smpStep * cpuA.frequency; if(smpA.clock >= 0) co_switch(cpuA.handle); } // struct ThreadB { cothread_t handle = nullptr; uint128_t scalar = 0; uint128_t clock = 0; auto print128(uint128_t value) { string s; while(value) { s.append((char)('0' + value % 10)); value /= 10; } s.reverse(); print(s, "\n"); } //femtosecond (10^15) = 16306 //attosecond (10^18) = 688838 //zeptosecond (10^21) = 13712691 //yoctosecond (10^24) = 13712691 (hitting a dead-end on a rounding error causing a wobble) //byuusecond? ( 2^96) = (perfect? 79,228 times more precise than a yoctosecond) auto create(auto (*entrypoint)() -> void, uint128_t frequency) { this->handle = co_create(65536, entrypoint); uint128_t unitOfTime = 1; //for(uint n : range(29)) unitOfTime *= 10; unitOfTime <<= 96; //2^96 time units ... this->scalar = unitOfTime / frequency; print128(this->scalar); this->clock = 0; } auto step(uint128_t clocks) -> void { clock += clocks * scalar; } auto synchronize(ThreadB& thread) -> void { if(clock >= thread.clock) co_switch(thread.handle); } }; struct CPUB : ThreadB { static auto Enter() -> void; auto main() -> void; CPUB() { create(&CPUB::Enter, cpuFreq); } } cpuB; struct SMPB : ThreadB { static auto Enter() -> void; auto main() -> void; SMPB() { create(&SMPB::Enter, smpFreq); clock = 1; } } smpB; auto correct() -> void { auto minimum = min(cpuB.clock, smpB.clock); cpuB.clock -= minimum; smpB.clock -= minimum; } uint8 queueB[iterations]; uint offsetB; cothread_t resumeB = cpuB.handle; auto EnterB() -> void { correct(); offsetB = 0; co_switch(resumeB); } auto QueueB(uint value) -> void { queueB[offsetB++] = value; if(offsetB >= iterations) { resumeB = co_active(); co_switch(mainThread); } } auto CPUB::Enter() -> void { while(true) cpuB.main(); } auto CPUB::main() -> void { QueueB(1); step(cpuStep); synchronize(smpB); } auto SMPB::Enter() -> void { while(true) smpB.main(); } auto SMPB::main() -> void { QueueB(2); step(smpStep); synchronize(cpuB); } // #include <nall/main.hpp> auto nall::main(string_vector) -> void { mainThread = co_active(); uint masterCounter = 0; while(true) { print(masterCounter++, " ...\n"); auto A = clock(); EnterA(); auto B = clock(); print((double)(B - A) / CLOCKS_PER_SEC, "s\n"); auto C = clock(); EnterB(); auto D = clock(); print((double)(D - C) / CLOCKS_PER_SEC, "s\n"); for(uint n : range(iterations)) { if(queueA[n] != queueB[n]) return print("fail at ", n, "\n"); } } } ...and that's everything.]
2016-07-31 02:11:20 +00:00
return result;
}
template<uint Size> auto M68K::instructionSUB(EffectiveAddress source_, DataRegister target_) -> void {
auto source = read<Size>(source_);
auto target = read<Size>(target_);
auto result = SUB<Size>(source, target);
write<Size>(target_, result);
}
template<uint Size> auto M68K::instructionSUB(DataRegister source_, EffectiveAddress target_) -> void {
auto source = read<Size>(source_);
auto target = read<Size, Hold>(target_);
Update to v100r15 release. byuu wrote: Aforementioned scheduler changes added. Longer explanation of why here: http://hastebin.com/raw/toxedenece Again, we really need to test this as thoroughly as possible for regressions :/ This is a really major change that affects absolutely everything: all emulation cores, all coprocessors, etc. Also added ADDX and SUB to the 68K core, which brings us just barely above 50% of the instruction encoding space completed. [Editor's note: The "aformentioned scheduler changes" were described in a previous forum post: Unfortunately, 64-bits just wasn't enough precision (we were getting misalignments ~230 times a second on 21/24MHz clocks), so I had to move to 128-bit counters. This of course doesn't exist on 32-bit architectures (and probably not on all 64-bit ones either), so for now ... higan's only going to compile on 64-bit machines until we figure something out. Maybe we offer a "lower precision" fallback for machines that lack uint128_t or something. Using the booth algorithm would be way too slow. Anyway, the precision is now 2^-96, which is roughly 10^-29. That puts us far beyond the yoctosecond. Suck it, MAME :P I'm jokingly referring to it as the byuusecond. The other 32-bits of precision allows a 1Hz clock to run up to one full second before all clocks need to be normalized to prevent overflow. I fixed a serious wobbling issue where I was using clock > other.clock for synchronization instead of clock >= other.clock; and also another aliasing issue when two threads share a common frequency, but don't run in lock-step. The latter I don't even fully understand, but I did observe it in testing. nall/serialization.hpp has been extended to support 128-bit integers, but without explicitly naming them (yay generic code), so nall will still compile on 32-bit platforms for all other applications. Speed is basically a wash now. FC's a bit slower, SFC's a bit faster. The "longer explanation" in the linked hastebin is: Okay, so the idea is that we can have an arbitrary number of oscillators. Take the SNES: - CPU/PPU clock = 21477272.727272hz - SMP/DSP clock = 24576000hz - Cartridge DSP1 clock = 8000000hz - Cartridge MSU1 clock = 44100hz - Controller Port 1 modem controller clock = 57600hz - Controller Port 2 barcode battler clock = 115200hz - Expansion Port exercise bike clock = 192000hz Is this a pathological case? Of course it is, but it's possible. The first four do exist in the wild already: see Rockman X2 MSU1 patch. Manifest files with higan let you specify any frequency you want for any component. The old trick higan used was to hold an int64 counter for each thread:thread synchronization, and adjust it like so: - if thread A steps X clocks; then clock += X * threadB.frequency - if clock >= 0; switch to threadB - if thread B steps X clocks; then clock -= X * threadA.frequency - if clock < 0; switch to threadA But there are also system configurations where one processor has to synchronize with more than one other processor. Take the Genesis: - the 68K has to sync with the Z80 and PSG and YM2612 and VDP - the Z80 has to sync with the 68K and PSG and YM2612 - the PSG has to sync with the 68K and Z80 and YM2612 Now I could do this by having an int64 clock value for every association. But these clock values would have to be outside the individual Thread class objects, and we would have to update every relationship's clock value. So the 68K would have to update the Z80, PSG, YM2612 and VDP clocks. That's four expensive 64-bit multiply-adds per clock step event instead of one. As such, we have to account for both possibilities. The only way to do this is with a single time base. We do this like so: - setup: scalar = timeBase / frequency - step: clock += scalar * clocks Once per second, we look at every thread, find the smallest clock value. Then subtract that value from all threads. This prevents the clock counters from overflowing. Unfortunately, these oscillator values are psychotic, unpredictable, and often times repeating fractions. Even with a timeBase of 1,000,000,000,000,000,000 (one attosecond); we get rounding errors every ~16,300 synchronizations. Specifically, this happens with a CPU running at 21477273hz (rounded) and SMP running at 24576000hz. That may be good enough for most emulators, but ... you know how I am. Plus, even at the attosecond level, we're really pushing against the limits of 64-bit integers. Given the reciprocal inverse, a frequency of 1Hz (which does exist in higan!) would have a scalar that consumes 1/18th of the entire range of a uint64 on every single step. Yes, I could raise the frequency, and then step by that amount, I know. But I don't want to have weird gotchas like that in the scheduler core. Until I increase the accuracy to about 100 times greater than a yoctosecond, the rounding errors are too great. And since the only choice above 64-bit values is 128-bit values; we might as well use all the extra headroom. 2^-96 as a timebase gives me the ability to have both a 1Hz and 4GHz clock; and run them both for a full second; before an overflow event would occur. Another hastebin includes demonstration code: #include <libco/libco.h> #include <nall/nall.hpp> using namespace nall; // cothread_t mainThread = nullptr; const uint iterations = 100'000'000; const uint cpuFreq = 21477272.727272 + 0.5; const uint smpFreq = 24576000.000000 + 0.5; const uint cpuStep = 4; const uint smpStep = 5; // struct ThreadA { cothread_t handle = nullptr; uint64 frequency = 0; int64 clock = 0; auto create(auto (*entrypoint)() -> void, uint frequency) { this->handle = co_create(65536, entrypoint); this->frequency = frequency; this->clock = 0; } }; struct CPUA : ThreadA { static auto Enter() -> void; auto main() -> void; CPUA() { create(&CPUA::Enter, cpuFreq); } } cpuA; struct SMPA : ThreadA { static auto Enter() -> void; auto main() -> void; SMPA() { create(&SMPA::Enter, smpFreq); } } smpA; uint8 queueA[iterations]; uint offsetA; cothread_t resumeA = cpuA.handle; auto EnterA() -> void { offsetA = 0; co_switch(resumeA); } auto QueueA(uint value) -> void { queueA[offsetA++] = value; if(offsetA >= iterations) { resumeA = co_active(); co_switch(mainThread); } } auto CPUA::Enter() -> void { while(true) cpuA.main(); } auto CPUA::main() -> void { QueueA(1); smpA.clock -= cpuStep * smpA.frequency; if(smpA.clock < 0) co_switch(smpA.handle); } auto SMPA::Enter() -> void { while(true) smpA.main(); } auto SMPA::main() -> void { QueueA(2); smpA.clock += smpStep * cpuA.frequency; if(smpA.clock >= 0) co_switch(cpuA.handle); } // struct ThreadB { cothread_t handle = nullptr; uint128_t scalar = 0; uint128_t clock = 0; auto print128(uint128_t value) { string s; while(value) { s.append((char)('0' + value % 10)); value /= 10; } s.reverse(); print(s, "\n"); } //femtosecond (10^15) = 16306 //attosecond (10^18) = 688838 //zeptosecond (10^21) = 13712691 //yoctosecond (10^24) = 13712691 (hitting a dead-end on a rounding error causing a wobble) //byuusecond? ( 2^96) = (perfect? 79,228 times more precise than a yoctosecond) auto create(auto (*entrypoint)() -> void, uint128_t frequency) { this->handle = co_create(65536, entrypoint); uint128_t unitOfTime = 1; //for(uint n : range(29)) unitOfTime *= 10; unitOfTime <<= 96; //2^96 time units ... this->scalar = unitOfTime / frequency; print128(this->scalar); this->clock = 0; } auto step(uint128_t clocks) -> void { clock += clocks * scalar; } auto synchronize(ThreadB& thread) -> void { if(clock >= thread.clock) co_switch(thread.handle); } }; struct CPUB : ThreadB { static auto Enter() -> void; auto main() -> void; CPUB() { create(&CPUB::Enter, cpuFreq); } } cpuB; struct SMPB : ThreadB { static auto Enter() -> void; auto main() -> void; SMPB() { create(&SMPB::Enter, smpFreq); clock = 1; } } smpB; auto correct() -> void { auto minimum = min(cpuB.clock, smpB.clock); cpuB.clock -= minimum; smpB.clock -= minimum; } uint8 queueB[iterations]; uint offsetB; cothread_t resumeB = cpuB.handle; auto EnterB() -> void { correct(); offsetB = 0; co_switch(resumeB); } auto QueueB(uint value) -> void { queueB[offsetB++] = value; if(offsetB >= iterations) { resumeB = co_active(); co_switch(mainThread); } } auto CPUB::Enter() -> void { while(true) cpuB.main(); } auto CPUB::main() -> void { QueueB(1); step(cpuStep); synchronize(smpB); } auto SMPB::Enter() -> void { while(true) smpB.main(); } auto SMPB::main() -> void { QueueB(2); step(smpStep); synchronize(cpuB); } // #include <nall/main.hpp> auto nall::main(string_vector) -> void { mainThread = co_active(); uint masterCounter = 0; while(true) { print(masterCounter++, " ...\n"); auto A = clock(); EnterA(); auto B = clock(); print((double)(B - A) / CLOCKS_PER_SEC, "s\n"); auto C = clock(); EnterB(); auto D = clock(); print((double)(D - C) / CLOCKS_PER_SEC, "s\n"); for(uint n : range(iterations)) { if(queueA[n] != queueB[n]) return print("fail at ", n, "\n"); } } } ...and that's everything.]
2016-07-31 02:11:20 +00:00
auto result = SUB<Size>(source, target);
write<Size>(target_, result);
}
template<uint Size> auto M68K::instructionSUBA(AddressRegister to, EffectiveAddress from) -> void {
auto source = sign<Size>(read<Size>(from));
Update to v101r12 release. byuu says: Changelog: - new md/bus/ module for bus reads/writes - abstracts byte/word accesses wherever possible (everything but RAM; forces all but I/O to word, I/O to byte) - holds the system RAM since that's technically not part of the CPU anyway - added md/controller and md/system/peripherals - added emulation of gamepads - added stub PSG audio output (silent) to cap the framerate at 60fps with audio sync enabled - fixed VSRAM reads for plane vertical scrolling (two bugs here: add instead of sub; interlave plane A/B) - mask nametable read offsets (can't exceed 8192-byte nametables apparently) - emulated VRAM/VSRAM/CRAM reads from VDP data port - fixed sprite width/height size calculations - added partial emulation of 40-tile per scanline limitation (enough to fix Sonic's title screen) - fixed off-by-one sprite range testing - fixed sprite tile indexing - Vblank happens at Y=224 with overscan disabled - unsure what happens when you toggle it between Y=224 and Y=240 ... probably bad things - fixed reading of address register for ADDA, CMPA, SUBA - fixed sign extension for MOVEA effect address reads - updated MOVEM to increment the read addresses (but not writeback) for (aN) mode With all of that out of the way, we finally have Sonic the Hedgehog (fully?) playable. I played to stage 1-2 and through the special stage, at least. EDIT: yeah, we probably need HIRQs for Labyrinth Zone. Not much else works, of course. Most games hang waiting on the Z80, and those that don't (like Altered Beast) are still royally screwed. Tons of features still missing; including all of the Z80/PSG/YM2612. A note on the perihperals this time around: the Mega Drive EXT port is basically identical to the regular controller ports. So unlike with the Famicom and Super Famicom, I'm inheriting the exension port from the controller class.
2016-08-21 22:11:24 +00:00
auto target = read<Long>(to);
write<Long>(to, target - source);
}
template<uint Size> auto M68K::instructionSUBI(EffectiveAddress with) -> void {
auto source = readPC<Size>();
auto target = read<Size, Hold>(with);
auto result = SUB<Size>(source, target);
write<Size>(with, result);
}
template<uint Size> auto M68K::instructionSUBQ(uint4 immediate, EffectiveAddress with) -> void {
Update to v100r15 release. byuu wrote: Aforementioned scheduler changes added. Longer explanation of why here: http://hastebin.com/raw/toxedenece Again, we really need to test this as thoroughly as possible for regressions :/ This is a really major change that affects absolutely everything: all emulation cores, all coprocessors, etc. Also added ADDX and SUB to the 68K core, which brings us just barely above 50% of the instruction encoding space completed. [Editor's note: The "aformentioned scheduler changes" were described in a previous forum post: Unfortunately, 64-bits just wasn't enough precision (we were getting misalignments ~230 times a second on 21/24MHz clocks), so I had to move to 128-bit counters. This of course doesn't exist on 32-bit architectures (and probably not on all 64-bit ones either), so for now ... higan's only going to compile on 64-bit machines until we figure something out. Maybe we offer a "lower precision" fallback for machines that lack uint128_t or something. Using the booth algorithm would be way too slow. Anyway, the precision is now 2^-96, which is roughly 10^-29. That puts us far beyond the yoctosecond. Suck it, MAME :P I'm jokingly referring to it as the byuusecond. The other 32-bits of precision allows a 1Hz clock to run up to one full second before all clocks need to be normalized to prevent overflow. I fixed a serious wobbling issue where I was using clock > other.clock for synchronization instead of clock >= other.clock; and also another aliasing issue when two threads share a common frequency, but don't run in lock-step. The latter I don't even fully understand, but I did observe it in testing. nall/serialization.hpp has been extended to support 128-bit integers, but without explicitly naming them (yay generic code), so nall will still compile on 32-bit platforms for all other applications. Speed is basically a wash now. FC's a bit slower, SFC's a bit faster. The "longer explanation" in the linked hastebin is: Okay, so the idea is that we can have an arbitrary number of oscillators. Take the SNES: - CPU/PPU clock = 21477272.727272hz - SMP/DSP clock = 24576000hz - Cartridge DSP1 clock = 8000000hz - Cartridge MSU1 clock = 44100hz - Controller Port 1 modem controller clock = 57600hz - Controller Port 2 barcode battler clock = 115200hz - Expansion Port exercise bike clock = 192000hz Is this a pathological case? Of course it is, but it's possible. The first four do exist in the wild already: see Rockman X2 MSU1 patch. Manifest files with higan let you specify any frequency you want for any component. The old trick higan used was to hold an int64 counter for each thread:thread synchronization, and adjust it like so: - if thread A steps X clocks; then clock += X * threadB.frequency - if clock >= 0; switch to threadB - if thread B steps X clocks; then clock -= X * threadA.frequency - if clock < 0; switch to threadA But there are also system configurations where one processor has to synchronize with more than one other processor. Take the Genesis: - the 68K has to sync with the Z80 and PSG and YM2612 and VDP - the Z80 has to sync with the 68K and PSG and YM2612 - the PSG has to sync with the 68K and Z80 and YM2612 Now I could do this by having an int64 clock value for every association. But these clock values would have to be outside the individual Thread class objects, and we would have to update every relationship's clock value. So the 68K would have to update the Z80, PSG, YM2612 and VDP clocks. That's four expensive 64-bit multiply-adds per clock step event instead of one. As such, we have to account for both possibilities. The only way to do this is with a single time base. We do this like so: - setup: scalar = timeBase / frequency - step: clock += scalar * clocks Once per second, we look at every thread, find the smallest clock value. Then subtract that value from all threads. This prevents the clock counters from overflowing. Unfortunately, these oscillator values are psychotic, unpredictable, and often times repeating fractions. Even with a timeBase of 1,000,000,000,000,000,000 (one attosecond); we get rounding errors every ~16,300 synchronizations. Specifically, this happens with a CPU running at 21477273hz (rounded) and SMP running at 24576000hz. That may be good enough for most emulators, but ... you know how I am. Plus, even at the attosecond level, we're really pushing against the limits of 64-bit integers. Given the reciprocal inverse, a frequency of 1Hz (which does exist in higan!) would have a scalar that consumes 1/18th of the entire range of a uint64 on every single step. Yes, I could raise the frequency, and then step by that amount, I know. But I don't want to have weird gotchas like that in the scheduler core. Until I increase the accuracy to about 100 times greater than a yoctosecond, the rounding errors are too great. And since the only choice above 64-bit values is 128-bit values; we might as well use all the extra headroom. 2^-96 as a timebase gives me the ability to have both a 1Hz and 4GHz clock; and run them both for a full second; before an overflow event would occur. Another hastebin includes demonstration code: #include <libco/libco.h> #include <nall/nall.hpp> using namespace nall; // cothread_t mainThread = nullptr; const uint iterations = 100'000'000; const uint cpuFreq = 21477272.727272 + 0.5; const uint smpFreq = 24576000.000000 + 0.5; const uint cpuStep = 4; const uint smpStep = 5; // struct ThreadA { cothread_t handle = nullptr; uint64 frequency = 0; int64 clock = 0; auto create(auto (*entrypoint)() -> void, uint frequency) { this->handle = co_create(65536, entrypoint); this->frequency = frequency; this->clock = 0; } }; struct CPUA : ThreadA { static auto Enter() -> void; auto main() -> void; CPUA() { create(&CPUA::Enter, cpuFreq); } } cpuA; struct SMPA : ThreadA { static auto Enter() -> void; auto main() -> void; SMPA() { create(&SMPA::Enter, smpFreq); } } smpA; uint8 queueA[iterations]; uint offsetA; cothread_t resumeA = cpuA.handle; auto EnterA() -> void { offsetA = 0; co_switch(resumeA); } auto QueueA(uint value) -> void { queueA[offsetA++] = value; if(offsetA >= iterations) { resumeA = co_active(); co_switch(mainThread); } } auto CPUA::Enter() -> void { while(true) cpuA.main(); } auto CPUA::main() -> void { QueueA(1); smpA.clock -= cpuStep * smpA.frequency; if(smpA.clock < 0) co_switch(smpA.handle); } auto SMPA::Enter() -> void { while(true) smpA.main(); } auto SMPA::main() -> void { QueueA(2); smpA.clock += smpStep * cpuA.frequency; if(smpA.clock >= 0) co_switch(cpuA.handle); } // struct ThreadB { cothread_t handle = nullptr; uint128_t scalar = 0; uint128_t clock = 0; auto print128(uint128_t value) { string s; while(value) { s.append((char)('0' + value % 10)); value /= 10; } s.reverse(); print(s, "\n"); } //femtosecond (10^15) = 16306 //attosecond (10^18) = 688838 //zeptosecond (10^21) = 13712691 //yoctosecond (10^24) = 13712691 (hitting a dead-end on a rounding error causing a wobble) //byuusecond? ( 2^96) = (perfect? 79,228 times more precise than a yoctosecond) auto create(auto (*entrypoint)() -> void, uint128_t frequency) { this->handle = co_create(65536, entrypoint); uint128_t unitOfTime = 1; //for(uint n : range(29)) unitOfTime *= 10; unitOfTime <<= 96; //2^96 time units ... this->scalar = unitOfTime / frequency; print128(this->scalar); this->clock = 0; } auto step(uint128_t clocks) -> void { clock += clocks * scalar; } auto synchronize(ThreadB& thread) -> void { if(clock >= thread.clock) co_switch(thread.handle); } }; struct CPUB : ThreadB { static auto Enter() -> void; auto main() -> void; CPUB() { create(&CPUB::Enter, cpuFreq); } } cpuB; struct SMPB : ThreadB { static auto Enter() -> void; auto main() -> void; SMPB() { create(&SMPB::Enter, smpFreq); clock = 1; } } smpB; auto correct() -> void { auto minimum = min(cpuB.clock, smpB.clock); cpuB.clock -= minimum; smpB.clock -= minimum; } uint8 queueB[iterations]; uint offsetB; cothread_t resumeB = cpuB.handle; auto EnterB() -> void { correct(); offsetB = 0; co_switch(resumeB); } auto QueueB(uint value) -> void { queueB[offsetB++] = value; if(offsetB >= iterations) { resumeB = co_active(); co_switch(mainThread); } } auto CPUB::Enter() -> void { while(true) cpuB.main(); } auto CPUB::main() -> void { QueueB(1); step(cpuStep); synchronize(smpB); } auto SMPB::Enter() -> void { while(true) smpB.main(); } auto SMPB::main() -> void { QueueB(2); step(smpStep); synchronize(cpuB); } // #include <nall/main.hpp> auto nall::main(string_vector) -> void { mainThread = co_active(); uint masterCounter = 0; while(true) { print(masterCounter++, " ...\n"); auto A = clock(); EnterA(); auto B = clock(); print((double)(B - A) / CLOCKS_PER_SEC, "s\n"); auto C = clock(); EnterB(); auto D = clock(); print((double)(D - C) / CLOCKS_PER_SEC, "s\n"); for(uint n : range(iterations)) { if(queueA[n] != queueB[n]) return print("fail at ", n, "\n"); } } } ...and that's everything.]
2016-07-31 02:11:20 +00:00
auto source = immediate;
auto target = read<Size, Hold>(with);
Update to v100r15 release. byuu wrote: Aforementioned scheduler changes added. Longer explanation of why here: http://hastebin.com/raw/toxedenece Again, we really need to test this as thoroughly as possible for regressions :/ This is a really major change that affects absolutely everything: all emulation cores, all coprocessors, etc. Also added ADDX and SUB to the 68K core, which brings us just barely above 50% of the instruction encoding space completed. [Editor's note: The "aformentioned scheduler changes" were described in a previous forum post: Unfortunately, 64-bits just wasn't enough precision (we were getting misalignments ~230 times a second on 21/24MHz clocks), so I had to move to 128-bit counters. This of course doesn't exist on 32-bit architectures (and probably not on all 64-bit ones either), so for now ... higan's only going to compile on 64-bit machines until we figure something out. Maybe we offer a "lower precision" fallback for machines that lack uint128_t or something. Using the booth algorithm would be way too slow. Anyway, the precision is now 2^-96, which is roughly 10^-29. That puts us far beyond the yoctosecond. Suck it, MAME :P I'm jokingly referring to it as the byuusecond. The other 32-bits of precision allows a 1Hz clock to run up to one full second before all clocks need to be normalized to prevent overflow. I fixed a serious wobbling issue where I was using clock > other.clock for synchronization instead of clock >= other.clock; and also another aliasing issue when two threads share a common frequency, but don't run in lock-step. The latter I don't even fully understand, but I did observe it in testing. nall/serialization.hpp has been extended to support 128-bit integers, but without explicitly naming them (yay generic code), so nall will still compile on 32-bit platforms for all other applications. Speed is basically a wash now. FC's a bit slower, SFC's a bit faster. The "longer explanation" in the linked hastebin is: Okay, so the idea is that we can have an arbitrary number of oscillators. Take the SNES: - CPU/PPU clock = 21477272.727272hz - SMP/DSP clock = 24576000hz - Cartridge DSP1 clock = 8000000hz - Cartridge MSU1 clock = 44100hz - Controller Port 1 modem controller clock = 57600hz - Controller Port 2 barcode battler clock = 115200hz - Expansion Port exercise bike clock = 192000hz Is this a pathological case? Of course it is, but it's possible. The first four do exist in the wild already: see Rockman X2 MSU1 patch. Manifest files with higan let you specify any frequency you want for any component. The old trick higan used was to hold an int64 counter for each thread:thread synchronization, and adjust it like so: - if thread A steps X clocks; then clock += X * threadB.frequency - if clock >= 0; switch to threadB - if thread B steps X clocks; then clock -= X * threadA.frequency - if clock < 0; switch to threadA But there are also system configurations where one processor has to synchronize with more than one other processor. Take the Genesis: - the 68K has to sync with the Z80 and PSG and YM2612 and VDP - the Z80 has to sync with the 68K and PSG and YM2612 - the PSG has to sync with the 68K and Z80 and YM2612 Now I could do this by having an int64 clock value for every association. But these clock values would have to be outside the individual Thread class objects, and we would have to update every relationship's clock value. So the 68K would have to update the Z80, PSG, YM2612 and VDP clocks. That's four expensive 64-bit multiply-adds per clock step event instead of one. As such, we have to account for both possibilities. The only way to do this is with a single time base. We do this like so: - setup: scalar = timeBase / frequency - step: clock += scalar * clocks Once per second, we look at every thread, find the smallest clock value. Then subtract that value from all threads. This prevents the clock counters from overflowing. Unfortunately, these oscillator values are psychotic, unpredictable, and often times repeating fractions. Even with a timeBase of 1,000,000,000,000,000,000 (one attosecond); we get rounding errors every ~16,300 synchronizations. Specifically, this happens with a CPU running at 21477273hz (rounded) and SMP running at 24576000hz. That may be good enough for most emulators, but ... you know how I am. Plus, even at the attosecond level, we're really pushing against the limits of 64-bit integers. Given the reciprocal inverse, a frequency of 1Hz (which does exist in higan!) would have a scalar that consumes 1/18th of the entire range of a uint64 on every single step. Yes, I could raise the frequency, and then step by that amount, I know. But I don't want to have weird gotchas like that in the scheduler core. Until I increase the accuracy to about 100 times greater than a yoctosecond, the rounding errors are too great. And since the only choice above 64-bit values is 128-bit values; we might as well use all the extra headroom. 2^-96 as a timebase gives me the ability to have both a 1Hz and 4GHz clock; and run them both for a full second; before an overflow event would occur. Another hastebin includes demonstration code: #include <libco/libco.h> #include <nall/nall.hpp> using namespace nall; // cothread_t mainThread = nullptr; const uint iterations = 100'000'000; const uint cpuFreq = 21477272.727272 + 0.5; const uint smpFreq = 24576000.000000 + 0.5; const uint cpuStep = 4; const uint smpStep = 5; // struct ThreadA { cothread_t handle = nullptr; uint64 frequency = 0; int64 clock = 0; auto create(auto (*entrypoint)() -> void, uint frequency) { this->handle = co_create(65536, entrypoint); this->frequency = frequency; this->clock = 0; } }; struct CPUA : ThreadA { static auto Enter() -> void; auto main() -> void; CPUA() { create(&CPUA::Enter, cpuFreq); } } cpuA; struct SMPA : ThreadA { static auto Enter() -> void; auto main() -> void; SMPA() { create(&SMPA::Enter, smpFreq); } } smpA; uint8 queueA[iterations]; uint offsetA; cothread_t resumeA = cpuA.handle; auto EnterA() -> void { offsetA = 0; co_switch(resumeA); } auto QueueA(uint value) -> void { queueA[offsetA++] = value; if(offsetA >= iterations) { resumeA = co_active(); co_switch(mainThread); } } auto CPUA::Enter() -> void { while(true) cpuA.main(); } auto CPUA::main() -> void { QueueA(1); smpA.clock -= cpuStep * smpA.frequency; if(smpA.clock < 0) co_switch(smpA.handle); } auto SMPA::Enter() -> void { while(true) smpA.main(); } auto SMPA::main() -> void { QueueA(2); smpA.clock += smpStep * cpuA.frequency; if(smpA.clock >= 0) co_switch(cpuA.handle); } // struct ThreadB { cothread_t handle = nullptr; uint128_t scalar = 0; uint128_t clock = 0; auto print128(uint128_t value) { string s; while(value) { s.append((char)('0' + value % 10)); value /= 10; } s.reverse(); print(s, "\n"); } //femtosecond (10^15) = 16306 //attosecond (10^18) = 688838 //zeptosecond (10^21) = 13712691 //yoctosecond (10^24) = 13712691 (hitting a dead-end on a rounding error causing a wobble) //byuusecond? ( 2^96) = (perfect? 79,228 times more precise than a yoctosecond) auto create(auto (*entrypoint)() -> void, uint128_t frequency) { this->handle = co_create(65536, entrypoint); uint128_t unitOfTime = 1; //for(uint n : range(29)) unitOfTime *= 10; unitOfTime <<= 96; //2^96 time units ... this->scalar = unitOfTime / frequency; print128(this->scalar); this->clock = 0; } auto step(uint128_t clocks) -> void { clock += clocks * scalar; } auto synchronize(ThreadB& thread) -> void { if(clock >= thread.clock) co_switch(thread.handle); } }; struct CPUB : ThreadB { static auto Enter() -> void; auto main() -> void; CPUB() { create(&CPUB::Enter, cpuFreq); } } cpuB; struct SMPB : ThreadB { static auto Enter() -> void; auto main() -> void; SMPB() { create(&SMPB::Enter, smpFreq); clock = 1; } } smpB; auto correct() -> void { auto minimum = min(cpuB.clock, smpB.clock); cpuB.clock -= minimum; smpB.clock -= minimum; } uint8 queueB[iterations]; uint offsetB; cothread_t resumeB = cpuB.handle; auto EnterB() -> void { correct(); offsetB = 0; co_switch(resumeB); } auto QueueB(uint value) -> void { queueB[offsetB++] = value; if(offsetB >= iterations) { resumeB = co_active(); co_switch(mainThread); } } auto CPUB::Enter() -> void { while(true) cpuB.main(); } auto CPUB::main() -> void { QueueB(1); step(cpuStep); synchronize(smpB); } auto SMPB::Enter() -> void { while(true) smpB.main(); } auto SMPB::main() -> void { QueueB(2); step(smpStep); synchronize(cpuB); } // #include <nall/main.hpp> auto nall::main(string_vector) -> void { mainThread = co_active(); uint masterCounter = 0; while(true) { print(masterCounter++, " ...\n"); auto A = clock(); EnterA(); auto B = clock(); print((double)(B - A) / CLOCKS_PER_SEC, "s\n"); auto C = clock(); EnterB(); auto D = clock(); print((double)(D - C) / CLOCKS_PER_SEC, "s\n"); for(uint n : range(iterations)) { if(queueA[n] != queueB[n]) return print("fail at ", n, "\n"); } } } ...and that's everything.]
2016-07-31 02:11:20 +00:00
auto result = SUB<Size>(source, target);
write<Size>(with, result);
}
//Size is ignored: always uses Long
template<uint Size> auto M68K::instructionSUBQ(uint4 immediate, AddressRegister with) -> void {
auto result = read<Long>(with) - immediate;
write<Long>(with, result);
}
template<uint Size> auto M68K::instructionSUBX(EffectiveAddress with, EffectiveAddress from) -> void {
auto source = read<Size>(from);
auto target = read<Size, Hold>(with);
auto result = SUB<Size, Extend>(source, target);
write<Size>(with, result);
}
auto M68K::instructionSWAP(DataRegister with) -> void {
auto result = read<Long>(with);
result = result >> 16 | result << 16;
write<Long>(with, result);
r.c = 0;
r.v = 0;
r.z = clip<Long>(result) == 0;
r.n = sign<Long>(result) < 0;
}
auto M68K::instructionTAS(EffectiveAddress with) -> void {
auto data = read<Byte, Hold>(with);
write<Byte>(with, data | 0x80);
r.c = 0;
r.v = 0;
r.z = clip<Byte>(data) == 0;
r.n = sign<Byte>(data) < 0;
}
auto M68K::instructionTRAP(uint4 vector) -> void {
exception(Exception::Trap, vector);
}
auto M68K::instructionTRAPV() -> void {
if(r.v) exception(Exception::Overflow, Vector::Overflow);
}
template<uint Size> auto M68K::instructionTST(EffectiveAddress ea) -> void {
auto data = read<Size>(ea);
r.c = 0;
r.v = 0;
r.z = clip<Size>(data) == 0;
r.n = sign<Size>(data) < 0;
}
auto M68K::instructionUNLK(AddressRegister with) -> void {
auto sp = AddressRegister{7};
write<Long>(sp, read<Long>(with));
write<Long>(with, pop<Long>());
}