Using glMapBufferRange to read back the contents of the SSBO is extremely slow on NVIDIA drivers. This is more noticeable at higher internal resolutions. Using glGetBufferSubData instead does not seem to exhibit this slowdown.
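For illustration, a minimal sketch of a readback through glGetBufferSubData follows. This is not the code from this change: `BufferApi` and `ReadBackSSBO` are hypothetical names, and the GL typedefs and the enum value are re-declared locally so the snippet compiles without GL headers. Real code would call `glBindBuffer` and `glGetBufferSubData` directly on a current context.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Local stand-ins for the OpenGL typedefs (assumption: normally from the GL headers).
using GLenum = unsigned int;
using GLuint = unsigned int;
using GLintptr = std::ptrdiff_t;
using GLsizeiptr = std::ptrdiff_t;
constexpr GLenum kShaderStorageBuffer = 0x90D2;  // value of GL_SHADER_STORAGE_BUFFER

// Hypothetical table holding the two entry points the readback needs,
// with the same signatures as glBindBuffer and glGetBufferSubData.
struct BufferApi
{
  void (*BindBuffer)(GLenum target, GLuint buffer);
  void (*GetBufferSubData)(GLenum target, GLintptr offset, GLsizeiptr size, void* data);
};

// Copies `size` bytes starting at `offset` from the SSBO into CPU memory
// with a single copy call instead of mapping the buffer.
std::vector<std::uint8_t> ReadBackSSBO(const BufferApi& gl, GLuint ssbo, GLintptr offset,
                                       GLsizeiptr size)
{
  assert(offset >= 0 && size >= 0);
  std::vector<std::uint8_t> out(static_cast<std::size_t>(size));
  gl.BindBuffer(kShaderStorageBuffer, ssbo);
  gl.GetBufferSubData(kShaderStorageBuffer, offset, size, out.data());
  gl.BindBuffer(kShaderStorageBuffer, 0);  // leave the binding point unbound
  return out;
}
```

The function pointers only exist to keep the sketch independent of a GL loader; they do not change how the readback itself works.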
This implementation tries to be as accurate as the old SW implementation, while removing the dependency of our vertexloader on videosw.