mirror of https://github.com/xemu-project/xemu.git
![]() Until ESXi 6.5 VMware used the vmfsSparse format for snapshots (VMDK3 in QEMU). This format was lacking in the following: * Grain directory (L1) and grain table (L2) entries were 32-bit, allowing access to only 2TB (slightly less) of data. * The grain size (default) was 512 bytes - leading to data fragmentation and many grain tables. * For space reclamation purposes, it was necessary to find all the grains which are not pointed to by any grain table - so a reverse mapping of "offset of grain in vmdk" to "grain table" must be constructed - which takes large amounts of CPU/RAM. The format specification can be found in VMware's documentation: https://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf In ESXi 6.5, to support snapshot files larger than 2TB, a new format was introduced: SESparse (Space Efficient). This format fixes the above issues: * All entries are now 64-bit. * The grain size (default) is 4KB. * Grain directory and grain tables are now located at the beginning of the file. + seSparse format reserves space for all grain tables. + Grain tables can be addressed using an index. + Grains are located in the end of the file and can also be addressed with an index. - seSparse vmdks of large disks (64TB) have huge preallocated headers - mainly due to L2 tables, even for empty snapshots. * The header contains a reverse mapping ("backmap") of "offset of grain in vmdk" to "grain table" and a bitmap ("free bitmap") which specifies for each grain - whether it is allocated or not. Using these data structures we can implement space reclamation efficiently. * Due to the fact that the header now maintains two mappings: * The regular one (grain directory & grain tables) * A reverse one (backmap and free bitmap) These data structures can lose consistency upon crash and result in a corrupted VMDK. Therefore, a journal is also added to the VMDK and is replayed when the VMware reopens the file after a crash. Since ESXi 6.7 - SESparse is the only snapshot format available. Unfortunately, VMware does not provide documentation regarding the new seSparse format. This commit is based on black-box research of the seSparse format. Various in-guest block operations and their effect on the snapshot file were tested. The only VMware provided source of information (regarding the underlying implementation) was a log file on the ESXi: /var/log/hostd.log Whenever an seSparse snapshot is created - the log is being populated with seSparse records. Relevant log records are of the form: [...] Const Header: [...] constMagic = 0xcafebabe [...] version = 2.1 [...] capacity = 204800 [...] grainSize = 8 [...] grainTableSize = 64 [...] flags = 0 [...] Extents: [...] Header : <1 : 1> [...] JournalHdr : <2 : 2> [...] Journal : <2048 : 2048> [...] GrainDirectory : <4096 : 2048> [...] GrainTables : <6144 : 2048> [...] FreeBitmap : <8192 : 2048> [...] BackMap : <10240 : 2048> [...] Grain : <12288 : 204800> [...] Volatile Header: [...] volatileMagic = 0xcafecafe [...] FreeGTNumber = 0 [...] nextTxnSeqNumber = 0 [...] replayJournal = 0 The sizes that are seen in the log file are in sectors. Extents are of the following format: <offset : size> This commit is a strict implementation which enforces: * magics * version number 2.1 * grain size of 8 sectors (4KB) * grain table size of 64 sectors * zero flags * extent locations Additionally, this commit proivdes only a subset of the functionality offered by seSparse's format: * Read-only * No journal replay * No space reclamation * No unmap support Hence, journal header, journal, free bitmap and backmap extents are unused, only the "classic" (L1 -> L2 -> data) grain access is implemented. However there are several differences in the grain access itself. Grain directory (L1): * Grain directory entries are indexes (not offsets) to grain tables. * Valid grain directory entries have their highest nibble set to 0x1. * Since grain tables are always located in the beginning of the file - the index can fit into 32 bits - so we can use its low part if it's valid. Grain table (L2): * Grain table entries are indexes (not offsets) to grains. * If the highest nibble of the entry is: 0x0: The grain in not allocated. The rest of the bytes are 0. 0x1: The grain is unmapped - guest sees a zero grain. The rest of the bits point to the previously mapped grain, see 0x3 case. 0x2: The grain is zero. 0x3: The grain is allocated - to get the index calculate: ((entry & 0x0fff000000000000) >> 48) | ((entry & 0x0000ffffffffffff) << 12) * The difference between 0x1 and 0x2 is that 0x1 is an unallocated grain which results from the guest using sg_unmap to unmap the grain - but the grain itself still exists in the grain extent - a space reclamation procedure should delete it. Unmapping a zero grain has no effect (0x2 will not change to 0x1) but unmapping an unallocated grain will (0x0 to 0x1) - naturally. In order to implement seSparse some fields had to be changed to support both 32-bit and 64-bit entry sizes. Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com> Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com> Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com> Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com> Message-id: 20190620091057.47441-4-shmuel.eiderman@oracle.com Signed-off-by: Max Reitz <mreitz@redhat.com> |
||
---|---|---|
accel | ||
audio | ||
authz | ||
backends | ||
block | ||
bsd-user | ||
capstone@22ead3e0bf | ||
chardev | ||
contrib | ||
crypto | ||
default-configs | ||
disas | ||
docs | ||
dtc@88f18909db | ||
fpu | ||
fsdev | ||
gdb-xml | ||
hw | ||
include | ||
io | ||
libdecnumber | ||
linux-headers | ||
linux-user | ||
migration | ||
monitor | ||
nbd | ||
net | ||
pc-bios | ||
po | ||
python/qemu | ||
qapi | ||
qga | ||
qobject | ||
qom | ||
replay | ||
roms | ||
scripts | ||
scsi | ||
slirp@f0da672620 | ||
stubs | ||
target | ||
tcg | ||
tests | ||
trace | ||
ui | ||
util | ||
.cirrus.yml | ||
.dir-locals.el | ||
.editorconfig | ||
.exrc | ||
.gdbinit | ||
.gitignore | ||
.gitlab-ci.yml | ||
.gitmodules | ||
.gitpublish | ||
.mailmap | ||
.patchew.yml | ||
.shippable.yml | ||
.travis.yml | ||
CODING_STYLE | ||
COPYING | ||
COPYING.LIB | ||
Changelog | ||
HACKING | ||
Kconfig.host | ||
LICENSE | ||
MAINTAINERS | ||
Makefile | ||
Makefile.objs | ||
Makefile.target | ||
README | ||
VERSION | ||
arch_init.c | ||
balloon.c | ||
block.c | ||
blockdev-nbd.c | ||
blockdev.c | ||
blockjob.c | ||
bootdevice.c | ||
bt-host.c | ||
bt-vhci.c | ||
configure | ||
cpus-common.c | ||
cpus.c | ||
device-hotplug.c | ||
device_tree.c | ||
disas.c | ||
dma-helpers.c | ||
dump.c | ||
exec.c | ||
gdbstub.c | ||
gitdm.config | ||
hmp-commands-info.hx | ||
hmp-commands.hx | ||
hmp.h | ||
ioport.c | ||
iothread.c | ||
job-qmp.c | ||
job.c | ||
memory.c | ||
memory_ldst.inc.c | ||
memory_mapping.c | ||
module-common.c | ||
numa.c | ||
os-posix.c | ||
os-win32.c | ||
qdev-monitor.c | ||
qemu-bridge-helper.c | ||
qemu-deprecated.texi | ||
qemu-doc.texi | ||
qemu-edid.c | ||
qemu-ga.texi | ||
qemu-img-cmds.hx | ||
qemu-img.c | ||
qemu-img.texi | ||
qemu-io-cmds.c | ||
qemu-io.c | ||
qemu-keymap.c | ||
qemu-nbd.c | ||
qemu-nbd.texi | ||
qemu-option-trace.texi | ||
qemu-options-wrapper.h | ||
qemu-options.h | ||
qemu-options.hx | ||
qemu-seccomp.c | ||
qemu-tech.texi | ||
qemu.nsi | ||
qemu.sasl | ||
qtest.c | ||
replication.c | ||
replication.h | ||
rules.mak | ||
thunk.c | ||
tpm.c | ||
trace-events | ||
version.rc | ||
vl.c | ||
win_dump.c | ||
win_dump.h |
README
QEMU README =========== QEMU is a generic and open source machine & userspace emulator and virtualizer. QEMU is capable of emulating a complete machine in software without any need for hardware virtualization support. By using dynamic translation, it achieves very good performance. QEMU can also integrate with the Xen and KVM hypervisors to provide emulated hardware while allowing the hypervisor to manage the CPU. With hypervisor support, QEMU can achieve near native performance for CPUs. When QEMU emulates CPUs directly it is capable of running operating systems made for one machine (e.g. an ARMv7 board) on a different machine (e.g. an x86_64 PC board). QEMU is also capable of providing userspace API virtualization for Linux and BSD kernel interfaces. This allows binaries compiled against one architecture ABI (e.g. the Linux PPC64 ABI) to be run on a host using a different architecture ABI (e.g. the Linux x86_64 ABI). This does not involve any hardware emulation, simply CPU and syscall emulation. QEMU aims to fit into a variety of use cases. It can be invoked directly by users wishing to have full control over its behaviour and settings. It also aims to facilitate integration into higher level management layers, by providing a stable command line interface and monitor API. It is commonly invoked indirectly via the libvirt library when using open source applications such as oVirt, OpenStack and virt-manager. QEMU as a whole is released under the GNU General Public License, version 2. For full licensing details, consult the LICENSE file. Building ======== QEMU is multi-platform software intended to be buildable on all modern Linux platforms, OS-X, Win32 (via the Mingw64 toolchain) and a variety of other UNIX targets. The simple steps to build QEMU are: mkdir build cd build ../configure make Additional information can also be found online via the QEMU website: https://qemu.org/Hosts/Linux https://qemu.org/Hosts/Mac https://qemu.org/Hosts/W32 Submitting patches ================== The QEMU source code is maintained under the GIT version control system. git clone https://git.qemu.org/git/qemu.git When submitting patches, one common approach is to use 'git format-patch' and/or 'git send-email' to format & send the mail to the qemu-devel@nongnu.org mailing list. All patches submitted must contain a 'Signed-off-by' line from the author. Patches should follow the guidelines set out in the HACKING and CODING_STYLE files. Additional information on submitting patches can be found online via the QEMU website https://qemu.org/Contribute/SubmitAPatch https://qemu.org/Contribute/TrivialPatches The QEMU website is also maintained under source control. git clone https://git.qemu.org/git/qemu-web.git https://www.qemu.org/2017/02/04/the-new-qemu-website-is-up/ A 'git-publish' utility was created to make above process less cumbersome, and is highly recommended for making regular contributions, or even just for sending consecutive patch series revisions. It also requires a working 'git send-email' setup, and by default doesn't automate everything, so you may want to go through the above steps manually for once. For installation instructions, please go to https://github.com/stefanha/git-publish The workflow with 'git-publish' is: $ git checkout master -b my-feature $ # work on new commits, add your 'Signed-off-by' lines to each $ git publish Your patch series will be sent and tagged as my-feature-v1 if you need to refer back to it in the future. Sending v2: $ git checkout my-feature # same topic branch $ # making changes to the commits (using 'git rebase', for example) $ git publish Your patch series will be sent with 'v2' tag in the subject and the git tip will be tagged as my-feature-v2. Bug reporting ============= The QEMU project uses Launchpad as its primary upstream bug tracker. Bugs found when running code built from QEMU git or upstream released sources should be reported via: https://bugs.launchpad.net/qemu/ If using QEMU via an operating system vendor pre-built binary package, it is preferable to report bugs to the vendor's own bug tracker first. If the bug is also known to affect latest upstream code, it can also be reported via launchpad. For additional information on bug reporting consult: https://qemu.org/Contribute/ReportABug Contact ======= The QEMU community can be contacted in a number of ways, with the two main methods being email and IRC - qemu-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/qemu-devel - #qemu on irc.oftc.net Information on additional methods of contacting the community can be found online via the QEMU website: https://qemu.org/Contribute/StartHere -- End