If the inputs are both float singles, and the top half is known to be identical
to the bottom half, we can use packed arithmetic instead of scalar to skip
the movddup.
This is slower on a few rather old CPUs, plus the Atom+Silvermont, so detect
Atom and disable it in that case.
Also avoid PPC_FP on stores if we know that the output came from a float op.
Before this commit, the two were reversed ("cpu_string" had the brand, e.g. "AuthenticAMD"; and "brand_string" had the CPU type, e.g. "AMD Phenom II X4 925").
Our defines were never clear between what meant 64bit or x86_64
This makes a clear cut between bitness and architecture.
This commit also has the side effect of bringing up aarch64 compiling support.