if multiple sectors spanning multiple clusters are read the
function count_contiguous_clusters should ensure that the
cluster type should not change between the clusters.
Especially the for-loop should break when we have one
or more normal clusters followed by a compressed cluster.
Unfortunately the wrong macro was used in the mask to
compare the flags.
This was discovered while debugging a data corruption
issue when converting a compressed qcow2 image to raw.
qemu-img reads 2MB chunks which span multiple clusters.
CC: qemu-stable@nongnu.org
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If backing file doesn't exist, the error message is confusing and
misleading:
$ qemu /tmp/a.qcow2
qemu: could not open disk image /tmp/a.qcow2: Could not open file: No
such file or directory
But...
$ ls /tmp/a.qcow2
/tmp/a.qcow2
$ qemu-img info /tmp/a.qcow2
image: /tmp/a.qcow2
file format: qcow2
virtual size: 8.0G (8589934592 bytes)
disk size: 196K
cluster_size: 65536
backing file: /tmp/b.qcow2
Because...
$ ls /tmp/b.qcow2
ls: cannot access /tmp/b.qcow2: No such file or directory
This is not intuitive. It's better to have the missing file's name in
the error message. With this patch:
$ qemu-io -c 'read 0 512' /tmp/a.qcow2
qemu-io: can't open device /tmp/a.qcow2: Could not open backing
file: Could not open '/stor/vm/arch.raw': No such file or directory
no file open, try 'help open'
Which is a little bit better.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This adds support for VHDX image creation, for images of type "Fixed"
and "Dynamic". "Differencing" types (i.e., VHDX images with backing
files) are currently not supported.
Options for image creation include:
* log size:
The size of the journaling log for VHDX. Minimum is 1MB,
and it must be a multiple of 1MB. Invalid log sizes will be
silently fixed by rounding up to the nearest MB.
Default is 1MB.
* block size:
This is the size of a payload block. The range is 1MB to 256MB,
inclusive, and must be a multiple of 1MB as well. Invalid sizes
and multiples will be silently fixed. If '0' is passed, then
a sane size is chosen (depending on virtual image size).
Default is 0 (Auto-select).
* subformat:
- "dynamic"
An image without data pre-allocated.
- "fixed"
An image with data pre-allocated.
Default is "dynamic"
When creating the image file, the lettered sections are created:
-----------------------------------------------------------------.
| (A) | (B) | (C) | (D) | (E)
| File ID | Header1 | Header 2 | Region Tbl 1 | Region Tbl 2
| | | | |
.-----------------------------------------------------------------.
0 64KB 128KB 192KB 256KB 320KB
.---- ~ ----------- ~ ------------ ~ ---------------- ~ -----------.
| (F) | (G) | (H) |
| Journal Log | BAT / Bitmap | Metadata | .... data ......
| | | |
.---- ~ ----------- ~ ------------ ~ ---------------- ~ -----------.
1MB (var.) (var.) (var.)
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
VHDXPage83Data and VHDXParentLocatorHeader both incorrectly had their
MSGUID fields set as arrays of 16. This is incorrect (it stems from
an early version where those fields were uint_8 arrays). Those fields
were, up to this patch, unused.
Also, there were a couple of typos and incorrect wording in comments,
and those have been fixed up as well.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This is preperation for vhdx_create(). The ability to write headers,
and calculate the number of BAT entries will be needed within the
create() functions, so move this relevant code into helper functions.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
In preparation for vhdx_create(), move more endian translation
functions out to vhdx-endian.c.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Bit shifting can be fun, but in this case it was unnecessary. The
upper 44 bits of the 64-bit BAT entry is specifies the File Offset,
so we shifted the bits to get access to the value.
However, per the spec the value is in MB. So we dutifully shifted back
to the left by 20 bits, to convert to a true uint64_t file offset.
This replaces those steps with just a bit mask, to get rid of the lower
20 bits instead.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This adds support for writing to VHDX image files, using coroutines.
Writes into the BAT table goes through the VHDX log. Currently, BAT
table writes occur when expanding a dynamic VHDX file, and allocating a
new BAT entry.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This adds support for writing to the VHDX log.
For spec details, see VHDX Specification Format v1.00:
https://www.microsoft.com/en-us/download/details.aspx?id=34750
There are a few limitations to this log support:
1.) There is no caching yet
2.) The log is flushed after each entry
The primary write interface, vhdx_log_write_and_flush(), performs a log
write followed by an immediate flush of the log.
As each log entry sector is a minimum of 4KB, partial sector writes are
filled in with data from the disk write destination.
If the current file log GUID is 0, a new GUID is generated and updated
in the header.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Regions in the image file cannot overlap - the log, region tables,
and metdata must all be unique and non-overlapping.
This adds region checking by means of a QLIST; there can be a variable
number of regions and metadata (there may be metadata or region tables
that we do not recognize / know about, but are not required).
This adds the capability to register a region for later checking, and
to check against registered regions for any overlap.
Also, if neither the BAT or Metadata region tables are found, return
error.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This adds support for VHDX v0 logs, as specified in Microsoft's
VHDX Specification Format v1.00:
https://www.microsoft.com/en-us/download/details.aspx?id=34750
The following support is added:
* Log parsing, and validation - validate that an existing log
is correct.
* Log search - search through an existing log, to find any valid
sequence of entries.
* Log replay and flush - replay an existing log, and flush/clear
the log when complete.
The VHDX log is a circular buffer, with elements (sectors) of 4KB.
A log entry is a variably-length number of sectors, that is
comprised of a header and 'descriptors', that describe each sector.
A log may contain multiple entries, know as a log sequence. In a log
sequence, each log entry immediately follows the previous entry, with an
incrementing sequence number. There can only ever be one active and
valid sequence in the log.
Each log entry must match the file log GUID in order to be valid (along
with other criteria). Once we have flushed all valid log entries, we
marked the file log GUID to be zero, which indicates a buffer with no
valid entries.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Allow tracking of first file write in the VHDX image, as well as
the ability to update the GUID in the header. This is in preparation
for log support.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This moves the endian translation functions out from the vhdx.c source,
into a separate source file. In addition to the previously defined
endian functions, new endian translation functions for log support are
added as well.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This adds some magic number defines, and internal structure definitions
for VHDX log replay support. The struct VHDXLogEntries does not reflect
an on-disk data structure, and thus does not need to be packed.
Some minor code style fixes are applied as well.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
In preparation for VHDX log support, move these structures to the
header.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This adds the ability to update the headers in a VHDX image, including
generating a new MS-compatible GUID.
As VHDX depends on uuid.h, VHDX is now a configurable build option. If
VHDX support is enabled, that will also enable uuid as well. The
default is to have VHDX enabled.
To enable/disable VHDX: --enable-vhdx, --disable-vhdx
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Just a couple of minor comments to help note where allocated
buffers are freed, and a typo fix.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The below patch is needed to compile qemu trunk on FreeBSD with gcc48,
clang will fail.... ;). Host x84_64-freebsd.
Signed-off-by: Andreas Tobler <andreast@FreeBSD.org>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Replace the legacy cpu_to_be64wu() with stq_be_p().
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-id: 1383669517-25598-9-git-send-email-peter.maydell@linaro.org
Signed-off-by: Anthony Liguori <aliguori@amazon.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQIcBAABAgAGBQJScnrnAAoJEH8JsnLIjy/WhhsP/2No1yEGNzfhw0WLDsEGBJI7
zjG+QkRMO4q2t256SxNr84KBFJlYKBvGrx+W8xC66AdvR1feL5hmWdXAMTJovx6Z
3Qt59RI9iISZ2OEtc9FhdsC+dSdM/3qie17XuuSCqifsi4xLjIZK/s18+RnLa0t/
nRObYP4prRl0c3o1gKaUvNz2wkIqctQAIe8UQkn6R1vPC6D60m/H9dDj4Kj68HO0
ICsF4AXBR/V2a8gU36/PGexBVyfgC4HOeuN0qNSTgYOKxLuNR+SrlzzhHE+jZTs5
GASm3vg/vUgBOO1759X5T8hveO6yu8XL82l+/d5nIK4gYGORIQZT74dyV5JgQIlF
Y47d0cF28+C/fuL1jh7c+2HY5WmmJQosMi9CaCBj0lvH0k5caEjqwPeHtRBmEyu3
1wAcLQJowZrWB5ez9MjezsaL4sPCymvB/4F443xdz5V19mE41bLZGW2EIT7MXHY7
IcwLU/opx76GMOFfWVMA7jeQkjiPaqGeaQHJzdnGUzIthqyiTigQMfi5P3nXGDic
uQi+KrqP9lNpJlZk4xGQnFovHNmKZrnLhUvqOIPk7/wKMvlU6ewdzp5Fnwzqw4MW
uJ/6eBJYolMyY+q37AH3Q6ZUkwTJi9O1drCPA0Ogr/dJiCyAiOoKuL0N74VabpcD
AahXw+yYV0qh6H4YjOzW
=wGCx
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'kwolf/tags/for-anthony' into staging
Block patches for 1.7.0-rc0 (v2)
# gpg: Signature made Thu 31 Oct 2013 04:44:39 PM CET using RSA key ID C88F2FD6
# gpg: Can't check signature: public key not found
* kwolf/tags/for-anthony: (30 commits)
vmdk: Implment bdrv_get_specific_info
qapi: Add optional field 'compressed' to ImageInfo
qemu-iotests: prefill some data to test image
sheepdog: check simultaneous create in resend_aioreq
sheepdog: cancel aio requests if possible
sheepdog: make add_aio_request and send_aioreq void functions
sheepdog: try to reconnect to sheepdog after network error
coroutine: add co_aio_sleep_ns() to allow sleep in block drivers
sheepdog: reload inode outside of resend_aioreq
sheepdog: handle vdi objects in resend_aio_req
sheepdog: check return values of qemu_co_recv/send correctly
qemu-iotests: Test case for backing file deletion
qemu-iotests: drop duplicated "create_image"
qemu-iotests: Fix 051 reference output
block: Avoid unecessary drv->bdrv_getlength() calls
block: Disable BDRV_O_COPY_ON_READ for the backing file
ahci: fix win7 hang on boot
sheepdog: pass copy_policy in the request
sheepdog: explicitly set copies as type uint8_t
block: Don't copy backing file name on error
...
Message-id: 1383064269-27720-1-git-send-email-kwolf@redhat.com
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Implement .bdrv_get_specific_info to return the extent information.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
After reconnection happens, all the inflight requests are moved to the
failed request list. As a result, sd_co_rw_vector() can send another
create request before resend_aioreq() resends a create request from
the failed list.
This patch adds a helper function check_simultaneous_create() and
checks simultaneous create requests more strictly in resend_aioreq().
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Tested-by: Liu Yuan <namei.unix@gmail.com>
Reviewed-by: Liu Yuan <namei.unix@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch tries to cancel aio requests in pending queue and failed
queue. When the sheepdog driver cannot cancel the requests, it waits
for them to be completed.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Tested-by: Liu Yuan <namei.unix@gmail.com>
Reviewed-by: Liu Yuan <namei.unix@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
These functions no longer return errors. We can make them void
functions and simplify the codes.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Tested-by: Liu Yuan <namei.unix@gmail.com>
Reviewed-by: Liu Yuan <namei.unix@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This introduces a failed request queue and links all the inflight
requests to the list after network error happens. After QEMU
reconnects to the sheepdog server successfully, the sheepdog block
driver will retry all the requests in the failed queue.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Tested-by: Liu Yuan <namei.unix@gmail.com>
Reviewed-by: Liu Yuan <namei.unix@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This prepares for using resend_aioreq() after reconnecting to the
sheepdog server.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Tested-by: Liu Yuan <namei.unix@gmail.com>
Reviewed-by: Liu Yuan <namei.unix@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The current resend_aio_req() doesn't work when the request is against
vdi objects. This fixes the problem.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Tested-by: Liu Yuan <namei.unix@gmail.com>
Reviewed-by: Liu Yuan <namei.unix@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If qemu_co_recv/send doesn't return the specified length, it means
that an error happened.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Tested-by: Liu Yuan <namei.unix@gmail.com>
Reviewed-by: Liu Yuan <namei.unix@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The block layer generally keeps the size of an image cached in
bs->total_sectors so that it doesn't have to perform expensive
operations to get the size whenever it needs it.
This doesn't work however when using a backend that can change its size
without qemu being aware of it, i.e. passthrough of removable media like
CD-ROMs or floppy disks. For this reason, the caching is disabled when a
removable device is used.
It is obvious that checking whether the _guest_ device has removable
media isn't the right thing to do when we want to know whether the size
of the host backend can change. To make things worse, non-top-level
BlockDriverStates never have any device attached, which makes qemu
assume they are removable, so drv->bdrv_getlength() is always called on
the protocol layer. In the case of raw-posix, this causes unnecessary
lseek() system calls, which turned out to be rather expensive.
This patch completely changes the logic and disables bs->total_sectors
caching only for certain block driver types, for which a size change is
expected: host_cdrom and host_floppy on POSIX, host_device on win32; also
the raw format in case it sits on top of one of these protocols, but in
the common case the nested bdrv_getlength() call on the protocol driver
will use the cache again and avoid an expensive drv->bdrv_getlength()
call.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Currently copy_policy isn't used. Recent sheepdog supports erasure coding, which
make use of copy_policy internally, but require client explicitly passing
copy_policy from base inode to newly creately inode for snapshot related
operations.
If connected sheep daemon doesn't utilize copy_policy, passing it to sheep
daemon is just one extra null effect operation. So no compatibility problem.
With this patch, sheepdog can provide erasure coded volume for QEMU VM.
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <namei.unix@gmail.com>
Acked-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
'copies' is actually uint8_t since day one, but request headers and some helper
functions parameterize it as uint32_t for unknown reasons and effectively
reserve 24 bytes for possible future use. This patch explicitly set the correct
for copies and reserve the left bytes.
This is a preparation patch that allow passing copy_policy in request header.
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <namei.unix@gmail.com>
Acked-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Opening the qcow2 image with BDRV_O_NO_FLUSH prevents any flushes during
the image creation. This means that the image has not yet been flushed
to disk when qemu-img create exits. This flush is delayed until the next
operation on the image involving opening it without BDRV_O_NO_FLUSH and
closing (or directly flushing) it. For large images and/or images with a
small cluster size and preallocated metadata, this flush may take a
significant amount of time and may occur unexpectedly.
Reopening the image without BDRV_O_NO_FLUSH right before the end of
qcow2_create2() results in hoisting the potentially costly flush into
the image creation, which is expected to take some time (whereas
successive image operations may be not).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
compatiblity -> compatibility
continously -> continuously
existance -> existence
usefull -> useful
shoudl -> should
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
this adds a check that a dynamic VHD file has not been
accidently truncated (e.g. during transfer or upload).
Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Saving the VM state is done using bdrv_pwrite. This function may perform
a read-modify-write, which in this case results in data being read from
beyond the end of the virtual disk. Since we are actually trying to
access an area which is not a part of the virtual disk, zero_beyond_eof
has to be set to false before performing the partial write, otherwise
the VM state may become corrupted.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since df2a6f29a5, bdrv_co_do_writev increases the total_sectors value of
a growable block devices on writes after the current end. This leads to
the virtual disk apparently growing in qcow2_save_vmstate, which in turn
affects the disk size captured by the internal snapshot taken directly
afterwards through e.g. the HMP savevm command. Such a "grown" snapshot
cannot be loaded after reopening the qcow2 image, since its disk size
differs from the actual virtual disk size (writing a VM state does not
actually increase the virtual disk size).
Fix this by restoring total_sectors at the end of qcow2_save_vmstate.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The VMFS extent line in description file doesn't have start offset as
FLAT lines does, and it should be defaulted to 0. The flat_offset
variable is initialized to -1, so we need to set it in this case.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Previously cid of parent is parsed from image file for every IO request.
We already have L1/L2 cache and don't have assumption that parent image
can be updated behind us, so remove this to get more efficiency.
The parent CID is checked only for once after opening.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
On one occasion, hdev_open() returned -1 in case of an unknown error
instead of a proper -errno value. Adjust this to match the behavior of
raw_open() (in raw-win32), which is to return -EINVAL in this case.
Also, change the call to error_setg*() to match the one in raw_open() as
well.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
An extra 'p++' after while loop when *p == '\n' will move p to unknown
data position, risking parsing junk data or memory access violation.
Cc: qemu-stable@nongnu.org
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Convert "fprintf(stderr,..." and standardize error messages:
Remove a few local_error's and use errp.
Remove "VMDK:" or "Vmdk:" prefixes in error message and fix to upper
case.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Make use of the error parameter in the opening and creating functions in
block/raw-posix.c.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Make use of the error parameter in the opening and creating functions in
block/raw-win32.c.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Propagate errors in raw_create rather than directly reporting and
afterwards discarding them.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Evaluate the runtime overlap check options and set
BDRVQcowState.overlap_check appropriately.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Introduces the macros QCOW2_OL_CONSTANT and QCOW2_OL_ALL in addition to
the already existing QCOW2_OL_CACHED, signifying all metadata overlap
checks that can be performed in constant time (regardless of image size
etc.) and truly all available overlap checks, respectively.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add an array which assigns the option string to its corresponding
overlap check bit.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add runtime options to tune the overlap checks to be performed before
write accesses.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Replace the QCOW2_OL_DEFAULT macro by a variable overlap_check in
BDRVQcowState.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In qcow2_check_metadata_overlap and qcow2_pre_write_overlap_check,
change the parameter signifying the checks to perform from its current
positive form to a negative one, i.e., it will no longer explicitly
specify every check to perform but rather a mask of checks not to
perform.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When trying to find a new snapshot ID, the existing ones are converted
to integers using strtoul. This function returns an unsigned long,
therefore its result should be saved in an unsigned long as well.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If the new snapshot table could not be written in qcow2_snapshot_create,
the old snapshot table has to be restored in memory and the new one
released. This should include restoration of the old snapshot count as
well, which is added by this patch.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In qcow2_write_compressed, if the compression fails, a normal cluster is
written to disk. This is done through bdrv_write on the qcow2 BDS
itself (using the guest offset), thus it is wrong to do a metadata
overlap check before.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The error message in qcow2_downgrade about an unsupported refcount
order is missing a space. This patch adds it.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This field is used by blkverify to disable external snapshots creation.
It will also be used by block filters like quorum to disable external
snapshot creation.
Signed-off-by: Benoit Canet <benoit@irqsave.net>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
if a raw device like an iscsi target or host device is used
the current implementation makes a second call out to get
the block status of bs->file.
Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2_write_snapshots relies on the length of every snapshot ID and name
fitting into an unsigned 16 bit integer. This is currently ensured by
QEMU through generally only allowing 128 byte IDs and 256 byte names.
However, if this should change in the future, the length written to the
image file should not be silently truncated (though the name itself
would be written completely).
Since this is currently not an issue but might require attention due to
internal QEMU changes in the future, an assert ensuring sanity is enough
for now.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If an error occurs during qcow2_write_snapshots, the newly allocated
snapshot table clusters are leaked and should thus be freed.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2_write_snapshots does contain a fail label and there is no reason
not to use it on some errors; therefore, we should always jump there on
error.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In qcow2_free_any_clusters, preallocated zero clusters should be freed
just as normal clusters are.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Currently, qcow2_check_metadata_overlap uses bdrv_read to read inactive
L1 tables from disk. The number of sectors to read is calculated through
a truncating integer division, therefore, if the L1 table size is not a
multiple of the sector size, the final entries will not be read and
their entries in memory remain undefined (from the g_malloc).
Using bdrv_pread fixes this.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add a new ImageInfoSpecificQCow2 type as a subtype of ImageInfoSpecific.
This contains the compatibility level as a string and an optional
lazy_refcounts boolean (optional means mandatory for compat >= 1.1 and
not available for compat == 0.10).
Also, add qcow2_get_specific_info, which returns this information.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add a function for generically dumping the ImageInfoSpecific information
in a human-readable format to block/qapi.c.
Use this function in bdrv_image_info_dump and qemu-io-cmds.c:info_f to
allow qemu-img info resp. qemu-io -c info to print that format specific
information.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add a function for retrieving an ImageInfoSpecific object from a block
driver.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Switch the string to enum type BlockJobType in BlockJobDriver.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We will use BlockJobType as the enum type name of block jobs in QAPI,
rename current BlockJobType to BlockJobDriver, which will eventually
become a set of operations, similar to block drivers.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
# By Asias He (1) and Peter Lieven (1)
# Via Paolo Bonzini
* bonzini/scsi-next:
scsi: Allocate SCSITargetReq r->buf dynamically [CVE-2013-4344]
block/iscsi: reenable iscsi_co_get_block_status
Message-id: 1381332391-8781-1-git-send-email-pbonzini@redhat.com
Signed-off-by: Anthony Liguori <aliguori@amazon.com>
Commit f35c934a accidently disabled iscsi_co_get_block_status for all
libiscsi versions. Its not possible to check for enumeration constants
in the C preprocessor. This patch changes the check to the preprocessor
constant LIBISCSI_FEATURE_IOVECTOR which was introduced shortly after
get_lba_status support was added to libiscsi.
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If an error occurs in l2_allocate, the allocated (but unused) L2 cluster
should be freed.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Switching the L1 table in memory should be an atomic operation, as far
as possible. Calling qcow2_free_clusters on the old L1 table on disk is
not a good idea when the old L1 table is no longer valid and the address
to the new one hasn't yet been written into the corresponding
BDRVQcowState field. To be more specific, this can lead to segfaults due
to qcow2_check_metadata_overlap trying to access the L1 table during the
free operation.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This blocks migration for VHDX image files, until the
functionality can be supported.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
CHECK_OFLAG_COPIED as a parameter to check_refcounts_l1 and
check_refcounts_l2 is obselete now, since the OFLAG_COPIED consistency
check is actually no longer performed by these functions (but by
check_oflag_copied).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
If an inactive L1 table is loaded from disk, its entries are in big
endian and have to be converted to host byte order before using them.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
All callers pass start = 0, and it's doubtful if any other value would
actually do what you expect. Remove the parameter.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Compressed clusters can never be contiguous, therefore the corresponding
flag does not need to be given explicitly to count_contiguous_clusters.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The function is not intended to be used on compressed clusters and will
not work correctly, if used anyway, since L2E_OFFSET_MASK is not the
right mask for determining the offset of compressed clusters. Therefore,
assert that the first cluster is not compressed and always include the
compression flag in the mask of significant flags, i.e., stop the search
as soon as a compressed cluster occurs.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In expand_zero_clusters_in_l1, a new cluster is only allocated if it was
not already preallocated. On error, such preallocated clusters should
not be freed, but only the newly allocated ones.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Just returning -errno in some cases prevents
trace_qcow2_l2_allocate_done from being executed (and, in one case, also
the unused allocated L2 table from being freed). Always going down the
error path fixes this.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In l2_allocate, the fail path is executed if qcow2_cache_flush fails.
However, the L2 table has not yet been fetched from the L2 table cache.
The qcow2_cache_put in the fail path therefore basically gives an
undefined argument as the L2 table address (in this case).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since the expanded_clusters bitmap is addressed using host offsets in
the underlying image file, the correct size to use for allocating the
bitmap is not determined by the guest disk image but by the underlying
host image file.
Furthermore, this size may change during the expansion due to cluster
allocations on growable image files. In this case, the bitmap needs to
be resized as well to reflect the growth.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If qcow2_alloc_cluster_link_l2 is called with a QCowL2Meta describing a
request crossing L2 boundaries, a buffer overflow will occur. This is
impossible right now since such requests are never generated (every
request is shortened to L2 boundaries before) and probably also
completely unintended (considering the name "QCowL2Meta"), however, it
is still worth an assertion.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
QEDHeader is read, and written, directly from on-disk images
via bdrv_pread()/write(). To avoid any unintentional padding,
these structs should be packed.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
QCowHeader and QCowExtension are structs that reside in the on-disk
image format, and are read and written directly via bdrv_pread()/write(),
and as such should be packed to avoid any unintentional struct padding.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The VHD footer and header structs (vhd_footer and vhd_dyndisk_header)
are on-disk structures for the image format, and as such should be
packed.
Go ahead and make these typedefs as well, with the preferred QEMU
naming convention, so that the packed attribute is used consistently
with the struct.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The header struct VdiHeader is an on-disk structure for the image
format, and as such should be packed.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When there are no snapshots qemu_rbd_snap_list() returns 0 and the
snapshot table pointer is NULL. Don't forget to free the snaps buffer
we allocated for librbd rbd_snap_list().
When the function succeeds don't forget to free the snaps buffer after
calling rbd_snap_list_end().
Cc: qemu-stable@nongnu.org
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The patch fixes a warning from gcc (Debian 4.6.3-14+rpi1) 4.6.3:
block/stream.c:141:22: error:
‘copy’ may be used uninitialized in this function [-Werror=uninitialized]
This is not a real bug - a better compiler would not complain.
Now 'copy' has always a defined value, so the check for ret >= 0
can be removed.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Some drivers will have driver specifics options but no filename.
This new bool allow the block layer to treat them correctly.
The .bdrv_needs_filename is set in drivers not having .bdrv_parse_filename and
not having .bdrv_open.
The first exception to this rule will be the quorum driver.
Signed-off-by: Benoit Canet <benoit@irqsave.net>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We use the extent size as cluster size for flat extents (where no L1/L2
table is allocated so it's safe) reuse sector calculating code with
sparse extents.
Don't pass in the cluster size for adding flat extent, just set it to
sectors later, then the cluster size checking will not fail.
The cluster_sectors is changed to int64_t to allow big flat extent.
Without this, flat extent opening is broken:
# qemu-img create -f vmdk -o subformat=monolithicFlat /tmp/a.vmdk 100G
Formatting '/tmp/a.vmdk', fmt=vmdk size=107374182400 compat6=off subformat='monolithicFlat' zeroed_grain=off
# qemu-img info /tmp/a.vmdk
image: /tmp/a.vmdk
file format: raw
virtual size: 0 (0 bytes)
disk size: 4.0K
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When trying to update the refcounts for a snapshot, the return value of
update_refcount on a compressed cluster was pretty much ignored,
cancelling the update on error but returning 0. This is caused by an
inner "ret" variable shadowing the outer one (the latter is used in the
return statement).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
# By Stefan Hajnoczi (4) and others
# Via Stefan Hajnoczi
* stefanha/block:
virtio-blk: do not relay a previous driver's WCE configuration to the current
blockdev: do not default cache.no-flush to true
block: don't lose data from last incomplete sector
qcow2: Correct snapshots size for overlap check
coroutine: fix /perf/nesting coroutine benchmark
coroutine: add qemu_coroutine_yield benchmark
qemu-timer: do not take the lock in timer_pending
qemu-timer: make qemu_timer_mod_ns() and qemu_timer_del() thread-safe
qemu-timer: drop outdated signal safety comments
osdep: warn if open(O_DIRECT) on fails with EINVAL
libcacard: link against qemu-error.o for error_report()
Message-id: 1379698931-946-1-git-send-email-stefanha@redhat.com
# By Hervé Poussineau (5) and Stefan Weil (1)
# Via Paolo Bonzini
* bonzini/scsi-next:
block/iscsi: Drop iscsi_co_get_block_status for older versions of libiscsi
lsi: add 53C810 variant
lsi: remove todo
lsi: ignore write accesses to CTEST0 registers
lsi: check ssid versus sdid only if ssid is valid
lsi: use constant name instead of its value
Using s->snapshots_size instead of snapshots_size for the metadata
overlap check in qcow2_write_snapshots leads to the detection of an
overlap with the main qcow2 image header when deleting the last
snapshot, since s->snapshots_size has not yet been updated and is
therefore non-zero. However, the offset returned by qcow2_alloc_clusters
will be zero since snapshots_size is zero. Therefore, an overlap is
detected albeit no such will occur.
This patch fixes this by replacing s->snapshots_size by snapshots_size
when calling qcow2_pre_write_overlap_check.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Debian wheezy includes libiscsi-dev 1.4.0 which does not provide
SCSI_PROVISIONING_TYPE_DEALLOCATED. Drop iscsi_co_get_block_status
in this case to allow compilation without errors.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
# By Max Reitz (16) and others
# Via Kevin Wolf
* kwolf/for-anthony: (33 commits)
qemu-iotests: Fix test 038
block: Assert validity of BdrvActionOps
qemu-iotests: Cleanup test image in test number 007
qemu-img: fix invalid JSON
coroutine: add ./configure --disable-coroutine-pool
qemu-iotests: Adjustments due to error propagation
qcow2: Use Error parameter
qemu-img create: Emit filename on error
block: Error parameter for create functions
block: Error parameter for open functions
bdrv: Use "Error" for creating images
bdrv: Use "Error" for opening images
qemu-iotests: add 057 internal snapshot for block device test case
hmp: add interface hmp_snapshot_delete_blkdev_internal
hmp: add interface hmp_snapshot_blkdev_internal
qmp: add interface blockdev-snapshot-delete-internal-sync
qmp: add interface blockdev-snapshot-internal-sync
qmp: add internal snapshot support in qmp_transaction
snapshot: distinguish id and name in snapshot delete
snapshot: new function bdrv_snapshot_find_by_id_and_name()
...
Message-id: 1379073063-14963-1-git-send-email-kwolf@redhat.com
Replace .bdrv_aio_discard with .bdrv_co_discard so that discard
requests can be split in multiple parts, each for a small amount
of sectors.
This is useful because we expose a generic API with no limit
on the amount of sectors that can be unmapped in one request.
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add an Error ** parameter to bdrv_create and its associated functions to
allow more specific error messages.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Add an Error ** parameter to bdrv_open, bdrv_file_open and associated
functions to allow more specific error messages.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Add an Error ** parameter to BlockDriver.bdrv_open and
BlockDriver.bdrv_file_open to allow more specific error messages.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Snapshot creation actually already distinguish id and name since it take
a structured parameter *sn, but delete can't. Later an accurate delete
is needed in qmp_transaction abort and blockdev-snapshot-delete-sync,
so change its prototype. Also *errp is added to tip error, but return
value is kepted to let caller check what kind of error happens. Existing
caller for it are savevm, delvm and qemu-img, they are not impacted by
introducing a new function bdrv_snapshot_delete_by_id_or_name(), which
check the return value and do the operation again.
Before this patch:
For qcow2, it search id first then name to find the one to delete.
For rbd, it search name.
For sheepdog, it does nothing.
After this patch:
For qcow2, logic is the same by call it twice in caller.
For rbd, it always fails in delete with id, but still search for name
in second try, no change to user.
Some code for *errp is based on Pavel's patch.
Signed-off-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
To make it clear about id and name in searching, add this API
to distinguish them. Caller can choose to search by id or name,
*errp will be set only for exception.
Some code are modified based on Pavel's patch.
Signed-off-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Implement bdrv_amend_options for compat, size, backing_file, backing_fmt
and lazy_refcounts.
Downgrading images from compat=1.1 to compat=0.10 is achieved through
handling all incompatible flags accordingly, clearing all compatible and
autoclear flags and expanding all zero clusters.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Save the image refcount order in BDRVQcowState. This will be relevant
for future code supporting different refcount orders than four and also
for code that needs to verify a certain refcount order for an opened
image.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add functionality for expanding zero clusters. This is necessary for
downgrading the image version to one without zero cluster support.
For non-backed images, this function may also just discard zero clusters
instead of truly expanding them.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add a function for emptying a cache, i.e., flushing it and marking all
elements invalid.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It is a valid case that the read data's size is smaller than the
requested size since there could be files that are smaller than
the minimum block size (For ex. when a VMDK disk descriptor file)
Signed-off-by: Tal Kain <tal.kain@ravellosystems.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
During savevm, the VM state is written to the active L1 of the image and
then a snapshot is taken. After that, the VM state isn't needed any more
in the active L1 and should be discarded. This is implemented by this
patch.
The impact of not discarding the VM state is that a snapshot can never
become smaller than any previous snapshot (because it would be padded
with old VM state), and more importantly that future savevm operations
cause unnecessary COWs (with associated flushes), which makes subsequent
snapshots much slower.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
The function will be used internally instead of only being called for
guest discard requests.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
this patch adds a coroutine for .bdrv_co_block_status as well as
a generic framework that can be used to build coroutines in block/iscsi.
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The UUID is unique even across multiple hosts, thus it is
better than a VM name even if it is less user-friendly.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
These are created for example with XFS_IOC_ZERO_RANGE.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
For now, bdrv_get_block_status is just another name for bdrv_is_allocated.
The next patches will add more flags.
This also touches all block drivers with a mostly mechanical rename. The
sole exception is cow; because it calls cow_co_is_allocated from the read
code, we keep that function and make cow_co_get_block_status a wrapper.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Some bdrv_is_allocated callers do not expect errors, but the fallback
in qcow2.c might make other callers trip on assertion failures or
infinite loops.
Fix the callers to always look for errors.
Cc: qemu-stable@nongnu.org
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Now that bdrv_is_allocated detects coroutine context, the two can
use the same code.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
bdrv_is_allocated can detect coroutine context and go through a fast
path, similar to other block layer functions.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
As we change bdrv_is_allocated to gather more information from bs and
bs->file, it will become a bit slower. It is still appropriate for online
jobs, but not for reads/writes. Call the internal function instead.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Only sync once per write, rather than once per sector.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Do not do two reads for each sector; load each sector of the bitmap
and use bitmap operations to process it.
Writes are still dog slow!
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Manage BlockDriverState lifecycle with refcnt, so bdrv_delete() is no
longer public and should be called by bdrv_unref() if refcnt is
decreased to 0.
This is an identical change because effectively, there's no multiple
reference of BDS now: no caller of bdrv_ref() yet, only bdrv_new() sets
bs->refcnt to 1, so all bdrv_unref() now actually delete the BDS.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
BlockDriverState structure needs bdrv_new() to initialize refcnt, don't
allocate a local structure variable and memset to 0, becasue with coming
refcnt implementation, bdrv_unref will crash if bs->refcnt not
initialized to 1.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
we need bdrv_new() to properly initialize BDS, don't allocate memory
manually.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
QEMU failed to open host devices like \\.\PhysicalDrive0 (first hard disk)
since some time (commit 8a79380b8ef1b02d2abd705dd026a18863b09020?).
Those devices use hdev_open which did not use the latest API for options.
This resulted in a fatal runtime error:
Block protocol 'host_device' doesn't support the option 'filename'
Duplicate code from raw_open to fix this.
Cc: qemu-stable@nongnu.org
Reported-by: David Brenner <david.brenner3@gmail.com>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This feature can be used in case where users are avoiding the iops limit by
doing jumbo I/Os hammering the storage backend.
Signed-off-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The max parameter of the leaky bucket throttling algorithm can be used to
allow the guest to do bursts.
The max value is a pool of I/O that the guest can use without being throttled
at all. Throttling is triggered once this pool is empty.
Signed-off-by: Benoit Canet <benoit@irqsave.net>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
If no corruptions remain after an image repair (and no errors have been
encountered), clear the corrupt flag in qcow2_check.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If the refcount of a refcount block is greater than one, we can at least
try to repair that problem by duplicating the affected block.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Drop error code path which cannot be taken since qemu_bh_new() does not
return NULL.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Most typos were found using a modified version of codespell:
accross -> across
issueing -> issuing
TICNT_THRESHHOLD -> TICNT_THRESHOLD
bandwith -> bandwidth
VCARD_7816_PROPIETARY -> VCARD_7816_PROPRIETARY
occured -> occurred
gaurantee -> guarantee
sofware -> software
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Since the OFLAG_COPIED checks are now executed after the refcounts have
been repaired (if repairing), it is safe to assume that they are correct
but the OFLAG_COPIED flag may be not. Therefore, if its value differs
from what it should be (considering the according refcount), that
discrepancy can be repaired by correctly setting (or clearing that flag.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Move the OFLAG_COPIED checks out of check_refcounts_l1 and
check_refcounts_l2 and after the actual refcount checks/fixes (since the
refcounts might actually change there).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The pre-write overlap check function is now called before most of the
qcow2 writes (aborting it on collision or other error).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Two new functions are added; the first one checks a given range in the
image file for overlaps with metadata (main header, L1 tables, L2
tables, refcount table and blocks).
The second one should be used immediately before writing to the image
file as it calls the first function and, upon collision, marks the
image as corrupt and makes the BDS unusable, thereby preventing
further access.
Both functions take a bitmask argument specifying the structures which
should be checked for overlaps, making it possible to also check
metadata writes against colliding with other structures.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This adds an incompatible bit indicating corruption to qcow2. Any image
with this bit set may not be written to unless for repairing (and
subsequently clearing the bit if the repair has been successful).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Account for all cluster types in qcow2_update_snapshot_refcounts;
this prevents this function from updating the refcount of unallocated
zero clusters which effectively led to wrong adjustments of the refcount
of cluster 0 (the main qcow2 header). This in turn resulted in images
with (unallocated) zero clusters having a cluster 0 refcount greater
than one after creating a snapshot.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Currently if gluster AIO callback thread fails to notify the QEMU thread about
AIO completion, we try graceful recovery by marking the disk drive as
inaccessible. This error recovery code is race-prone as found by Asias and
Stefan. However as found out by Paolo, this kind of error is impossible and
hence simplify the code that handles this error recovery.
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
"Incoming" function prototypes and "outgoing" function calls must match
reality. Implemented using the "struct BlockDriver" definition in
"include/block/block_int.h", and gcc errors & warnings.
v1->v2:
On 08/20/13 09:51, Kevin Wolf wrote:
> Am 18.08.2013 um 16:29 hat Paolo Bonzini geschrieben:
>> Il 16/08/2013 16:15, Laszlo Ersek ha scritto:
>>> +static int raw_reopen_prepare(BDRVReopenState *reopen_state,
>>> + BlockReopenQueue *queue, Error **errp)
>>> {
>>> - return bdrv_reopen_prepare(bs->file);
>>> + BDRVReopenState tmp = *reopen_state;
>>> +
>>> + tmp.bs = tmp.bs->file;
>>> + return bdrv_reopen_prepare(&tmp, queue, errp);
>>> }
>>
>> This should just return zero, my fault.
>
> Which is because bdrv_reopen_queue() already queues bs->file for reopen.
> The simple return 0; implementation is shared by all other format drivers
> that support reopening images.
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
On 08/05/13 15:03, Paolo Bonzini wrote:
>
> [...]
>
> 5) Formats are registered with bdrv_register (takes a BlockDriver*). You
> also need to pass the caller of bdrv_register to block_init.
Fill in the BlockDriver structure with the raw_*() functions that have
been added to "block/raw_bsd.c", in the order the fields are defined in
"include/block/block_int.h".
I needed more explanation / naming examples for registering the driver
than what Paolo gave me, so I copied / adapted from "block/qcow2.c". The
parts I took as basis for modification are blamed on
commit 5efa9d5a8b
Author: Anthony Liguori <aliguori@us.ibm.com>
Date: Sat May 9 17:03:42 2009 -0500
Convert block infrastructure to use new module init functionality
commit 20d97356c9
Author: Blue Swirl <blauwirbel@gmail.com>
Date: Fri Apr 23 20:19:47 2010 +0000
Fix OpenBSD build
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
On 08/05/13 15:03, Paolo Bonzini wrote:
>
> [...]
>
> 4) There is another member, .create_options, which is an array of
> QEMUOptionParameter structs, terminated by an all-zero item. The only
> option you need is for the virtual disk size. You will find something
> to copy from in other block drivers, for example block/qcow2.c.
Code taken and adapted from "block/qcow2.c", as suggested. The code being
copied/modified is blamed on
commit 20d97356c9
Author: Blue Swirl <blauwirbel@gmail.com>
Date: Fri Apr 23 20:19:47 2010 +0000
Fix OpenBSD build
and
commit 7c80ab3f21
Author: Jes Sorensen <Jes.Sorensen@redhat.com>
Date: Fri Dec 17 16:02:39 2010 +0100
block/qcow2.c: rename qcow_ functions to qcow2_
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
On 08/05/13 15:03, Paolo Bonzini wrote:
>
> [...]
>
> 3) These members are special
>
> .format_name is the string "raw"
> .bdrv_open raw_open should set bs->sg to bs->file->sg and return 0
> .bdrv_close raw_close should do nothing
> .bdrv_probe raw_probe should just return 1.
v1->v2:
On 08/20/13 10:11, Kevin Wolf wrote:
> Am 16.08.2013 um 16:15 hat Laszlo Ersek geschrieben:
>> +static int raw_probe(void)
>> +{
>> + return 1;
>> +}
>
> Maybe add a comment here like "smallest possible positive score so that
> raw is used if and only if no other block driver works".
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
On 08/05/13 15:03, Paolo Bonzini wrote:
>
> [...]
>
> 2) This is also a simple forwarder function:
>
> .bdrv_create
>
> but there is no BlockDriverState argument so the forwarded-to function
> does not have a bs->file argument either. The forwarded-to function is
> bdrv_create_file.
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
On 08/05/13 15:03, Paolo Bonzini wrote:
>
> [...]
>
> 1) BlockDriver is a struct in which these function members are
> interesting:
>
> .bdrv_reopen_prepare
> .bdrv_co_readv
> .bdrv_co_writev
> .bdrv_co_is_allocated
> .bdrv_co_write_zeroes
> .bdrv_co_discard
> .bdrv_getlength
> .bdrv_get_info
> .bdrv_truncate
> .bdrv_is_inserted
> .bdrv_media_changed
> .bdrv_eject
> .bdrv_lock_medium
> .bdrv_ioctl
> .bdrv_aio_ioctl
> .bdrv_has_zero_init
>
> They should be implemented as simple forwarders (see above). There are
> 16 functions listed here, you can easily see how this already accounts
> for 100+ SLOC roughly...
>
> The implementations of bdrv_co_readv and bdrv_co_writev should also call
> BLKDBG_EVENT on bs->file too, before forwarding to bs->file. The events
> to be generated are BLKDBG_READ_AIO and BLKDBG_WRITE_AIO.
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
On 08/05/13 15:03, Paolo Bonzini wrote:
>
>
> ----- Original Message -----
>> From: "Laszlo Ersek" <lersek@redhat.com>
>> To: "Paolo Bonzini" <pbonzini@redhat.com>
>> Sent: Monday, August 5, 2013 2:43:46 PM
>> Subject: Re: [PATCH 1/2] raw: add license header
>>
>> On 08/02/13 00:27, Paolo Bonzini wrote:
>>> On 08/01/2013 10:13 AM, Christoph Hellwig wrote:
>>>> On Wed, Jul 31, 2013 at 08:19:51AM +0200, Paolo Bonzini wrote:
>>>>> Most of the block layer is under the BSD license, thus it is
>>>>> reasonable to license block/raw.c the same way. CCed people should
>>>>> ACK by replying with a Signed-off-by line.
>>>>
>>>> The coded was intended to be GPLv2.
>>>
>>> Laszlo, would you be willing to do clean-room reverse engineering?
>>>
>>> (No rants, please. :))
>>
>> What's the scope exactly?
>
> It's quite small, it's a file full of forwarders like
>
> static void raw_foo(BlockDriverState *bs)
> {
> return bdrv_foo(bs->file);
> }
>
> It's 170 lines of code, all as boring as this. I only picked you
> because I'm quite certain you have never seen the file (and the answer
> confirmed it).
>
> Basically:
>
> 1) BlockDriver is a struct in which these function members are
> interesting:
>
> .bdrv_reopen_prepare
> .bdrv_co_readv
> .bdrv_co_writev
> .bdrv_co_is_allocated
> .bdrv_co_write_zeroes
> .bdrv_co_discard
> .bdrv_getlength
> .bdrv_get_info
> .bdrv_truncate
> .bdrv_is_inserted
> .bdrv_media_changed
> .bdrv_eject
> .bdrv_lock_medium
> .bdrv_ioctl
> .bdrv_aio_ioctl
> .bdrv_has_zero_init
>
> They should be implemented as simple forwarders (see above).
> There are 16 functions listed here, you can easily see how this
> already accounts for 100+ SLOC roughly...
>
> The implementations of bdrv_co_readv and bdrv_co_writev should also
> call BLKDBG_EVENT on bs->file too, before forwarding to bs->file. The
> events to be generated are BLKDBG_READ_AIO and BLKDBG_WRITE_AIO.
>
> 2) This is also a simple forwarder function:
>
> .bdrv_create
>
> but there is no BlockDriverState argument so the forwarded-to function
> does not have a bs->file argument either. The forwarded-to function
> is bdrv_create_file.
>
> 3) These members are special
>
> .format_name is the string "raw"
> .bdrv_open raw_open should set bs->sg to bs->file->sg and return 0
> .bdrv_close raw_close should do nothing
> .bdrv_probe raw_probe should just return 1.
>
> 4) There is another member, .create_options, which is an array of
> QEMUOptionParameter structs, terminated by an all-zero item. The only
> option you need is for the virtual disk size. You will find something
> to copy from in other block drivers, for example block/qcow2.c.
>
> 5) Formats are registered with bdrv_register (takes a BlockDriver*).
> You also need to pass the caller of bdrv_register to block_init.
>
> 6) I'm not sure how to organize the patch series, so I'll leave this to
> your creativity. I guess in this case move/copy detection of git should
> be disabled. I would definitely include this spec in the commit
> message as a proof of clean-room reverse engineering.
>
> 7) Remember a BSD header like the one in block.c.
>
> Paolo
This patch implements the email up to the paragraph ending with "100+ SLOC
roughly". The skeleton is generated from the list there, with a simple
shell loop using "sed" and the raw_foo() template.
The BSD license block is copied (and reflowed) from
"util/qemu-progress.c".
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The expression "1LL << 63" tries to shift the 1 into the sign bit of a
'long long', which provokes a clang sanitizer warning:
runtime error: left shift of 1 by 63 places cannot be represented in type 'long long'
Use "1ULL << 63" as the definition of QCOW_OFLAG_COPIED instead
to avoid this. For consistency, we also update the other QCOW_OFLAG
definitions to use the ULL suffix rather than LL, though only the
shift by 63 is undefined behaviour.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
By the time that qemu 1.7 will be released, enough time will have passed
since qemu 1.1, which is the first version to understand version 3
images, that changing the default shouldn't hurt many people any more
and the benefits of using the new format outweigh the pain.
qemu-iotests already runs with compat=1.1 by default.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
The io_flush argument to qemu_aio_set_event_notifier() has been removed
since the block layer learnt to drain requests by itself. Fix the
Windows build for win32-aio.o by updating the
qemu_aio_set_event_notifier() call and dropping win32_aio_flush_cb().
Reviewed-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This is an autogenerated patch using scripts/switch-timer-api.
Switch the entire code base to using the new timer API.
Note this patch may introduce some line length issues.
Signed-off-by: Alex Bligh <alex@alex.org.uk>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Convert block_job_sleep_ns and co_sleep_ns to use the new timer
API.
Signed-off-by: Alex Bligh <alex@alex.org.uk>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
VMware ESX hosts also use different create and extent types for flat
files, respectively "vmfs" and "VMFS". This is not documented, but it
can be found at http://kb.vmware.com/kb/10002511 (Recreating a missing
virtual machine disk (VMDK) descriptor file).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
VMware ESX hosts use a variant of the VMDK3 format, identified by the
vmfsSparse create type ad the VMFSSPARSE extent type.
It has 16 KB grain tables (L2) and a variable-size grain directory (L1).
In addition, the grain size is always 512, but that is not a problem
because it is included in the header.
The format of the extents is documented in the VMDK spec. The format
of the descriptor file is not documented precisely, but it can be
found at http://kb.vmware.com/kb/10026353 (Recreating a missing virtual
machine disk (VMDK) descriptor file for delta disks).
With these patches, vmfsSparse files only work if opened through the
descriptor file. Data files without descriptor files, as far as I
could understand, are not supported by ESX.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
--
v2: Rebase to patch 01.
Change le64_to_cpu to le32_to_cpu.
Rename vmdk_open_vmdk3 to vmdk_open_vmfs_sparse, which represents the
current usage of this format.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
VMDK3 header has the field l1dir_size, but vmdk_open_vmdk3 hardcoded the
value. This patch honors the header field.
And the L2 table size is 4096 according to VMDK spec[1], instead of
1 << 9 (512).
[1]:
http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?src=vmdk
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This header check is common to VMDK3 and VMDK4, so move it into
vmdk_add_extent().
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
In 4146b46c42e0989cb5842e04d88ab6ccb1713a48 (block: Produce zeros when
protocols reading beyond end of file), we break qemu-iotests ./check
-qcow2 022. This happens because qcow2 temporarily sets ->growable = 1
for vmstate accesses (which are stored beyond the end of regular image
data).
We introduce the bs->zero_beyond_eof to allow qcow2_load_vmstate() to
disable ->zero_beyond_eof temporarily in addition to enable ->growable.
[Since the broken patch "block: Produce zeros when protocols reading
beyond end of file" has not been merged yet, I have applied this fix
*first* and will then apply the next patch to keep the tree bisectable.
-- Stefan]
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
By the time that qemu 1.7 will be released, enough time will have passed
since qemu 1.1, which is the first version to understand version 3
images, that changing the default shouldn't hurt many people any more
and the benefits of using the new format outweigh the pain.
qemu-iotests already runs with compat=1.1 by default.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The .io_flush() handler no longer exists and has no users. Drop the
io_flush argument to aio_set_fd_handler() and related functions.
The AioFlushEventNotifierHandler and AioFlushHandler typedefs are no
longer used and are dropped too.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
.io_flush() is no longer called so drop qemu_rbd_aio_flush_cb().
qemu_aio_count is unused now so drop it too.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
.io_flush() is no longer called so drop nbd_have_request(). We cannot
drop in_flight since it is still used by other block/nbd.c code.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
.io_flush() is no longer called so drop qemu_laio_completion_cb(). It
turns out that count is now unused so drop that too.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Since .io_flush() is no longer called we do not need
qemu_gluster_aio_flush_cb() anymore. It turns out that qemu_aio_count
is unused now and can be dropped.
Thanks to Bharata B Rao <bharata@linux.vnet.ibm.com> for catching a
build failure with CONFIG_GLUSTERFS_DISCARD, which has been fixed.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
.io_flush() is no longer called so drop curl_aio_flush(). The acb[]
array that the function checks is still used in other parts of
block/curl.c. Therefore we cannot remove acb[], it is needed.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
If a block driver has no file descriptors to monitor but there are still
active requests, it can return 1 from .io_flush(). This is used to spin
during synchronous I/O.
Stop relying on .io_flush() and instead check
QLIST_EMPTY(&bs->tracked_requests) to decide whether there are active
requests.
This is the first step in removing .io_flush() so that event loops no
longer need to have the concept of synchronous I/O. Eventually we may
be able to kill synchronous I/O completely by running everything in a
coroutine, but that is future work.
Note this patch moves bs->throttled_reqs initialization to bdrv_new() so
that bdrv_requests_pending(bs) can safely access it. In practice bs is
g_malloc0() so the memory is already zeroed but it's safer to initialize
the queue properly.
We also need to fix up block/stream.c:close_unused_images() to prevent
traversing a dangling pointer while it rearranges the backing file
chain. This is necessary since the new bdrv_drain_all() traverses the
backing file chain.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Most of the block layer is under the BSD license, thus it is reasonable
to license block/raw.c the same way. CCed people should ACK by replying
with a Signed-off-by line.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Jeff Cody <jcody@redhat.com>
Cc: Peter Lieven <pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1375251592-2537-2-git-send-email-pbonzini@redhat.com
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
num_gtes_per_gte is a historical typo, rename it to a more sensible
name. It means "number of GrainTableEntries per GrainTable".
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We should never grow the stack beyond 1 MB, otherwise we'll fall off the
end. Thread stacks and coroutine stacks (1 MB) do not grow.
get_cluster_offset() allocates a big stack offset, it will fail for big
cluster images, change to heap allocated buffer.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
L1 table size is calculated from capacity, granularity and l2 table
size. If capacity is too big or later two are too small, the L1 table
will be too big to allocate in memory. Limit it to a reasonable range.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
header.num_gtes_per_gte determines size for L2 table. Check for too big
value before using it. Limit to 512M entries (2GB per one L2 table).
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Granularity is used to calculate the cluster size and allocate r/w
buffer. Check the value from image before using it, so we don't abort()
for unbounded memory allocation.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The size and offset fields are all non-negative values, use uint64_t for
them to avoid getting negative in memory value by int overflow.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It's best to make it consistent that all on disk structures are
QEMU_PACKED.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit 3ac21627 changed the behaviour of bdrv_has_zero_init() to default
to 0. In the review for Sheepdog it turned out that enabling it is safe,
so that commit updated one BlockDriver definition of sheepdog to use
bdrv_has_zero_init_1, missed however that there are more BlockDrivers in
the driver. Fix these now.
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <namei.unix@gmail.com>
Reviewed-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The comment was truncated. Add the missing parts, especially explain why
we need zero_dry_run.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
The error on armv7hl was:
block/iscsi.c: In function ‘is_request_lun_aligned’:
block/iscsi.c:251:26: error: format ‘%ld’ expects argument of type ‘long int’, but argument 3 has type ‘int64_t’ [-Werror=format=]
iscsilun->block_size, sector_num, nb_sectors);
^
This also splits the long line to comply with qemu coding guidelines.
Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
'dprintf' is the name of a POSIX standard function so we should not be
stealing it for our debug macro. Rename to 'DPRINTF' (in line with
a number of other source files.)
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Andreas Färber <afaerber@suse.de>
Reviewed-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Acked-by: Richard Henderson <rth@twiddle.net>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1375100199-13934-2-git-send-email-peter.maydell@linaro.org
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
# By Stefan Hajnoczi (4) and others
# Via Stefan Hajnoczi
* stefanha/block:
dataplane: refuse to start if device is already in use
dataplane: enable virtio-blk x-data-plane=on live migration
migration: fix spice migration
migration: notify migration state before starting thread
block: Repair the throttling code.
gluster: Add image resize support
Message-id: 1375112172-24863-1-git-send-email-stefanha@redhat.com
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Implement .bdrv_truncate in GlusterFS block driver so that GlusterFS backend
can support image resizing.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Tested-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
All these typos were found by codespell.
sould -> should
emperical -> empirical
intialization -> initialization
successfuly -> successfully
gaurantee -> guarantee
Fix also another error (before before) in the same context.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
This patch adds sync-modes to the drive-backup interface and
implements the FULL, NONE and TOP modes of synchronization.
FULL performs as before copying the entire contents of the drive
while preserving the point-in-time using CoW.
NONE only copies new writes to the target drive.
TOP copies changes to the topmost drive image and preserves the
point-in-time using CoW.
For sync mode TOP are creating a new target image using the same backing
file as the original disk image. Then any new data that has been laid
on top of it since creation is copied in the main backup_run() loop.
There is an extra check in the 'TOP' case so that we don't bother to copy
all the data of the backing file as it already exists in the target.
This is where the bdrv_co_is_allocated() is used to determine if the
data exists in the topmost layer or below.
Also any new data being written is intercepted via the write_notifier
hook which ends up calling backup_do_cow() to copy old data out before
it gets overwritten.
For mode 'NONE' we create the new target image and only copy in the
original data from the disk image starting from the time the call was
made. This preserves the point in time data by only copying the parts
that are *going to change* to the target image. This way we can
reconstruct the final image by checking to see if the given block exists
in the new target image first, and if it does not, you can get it from
the original image. This is basically an optimization allowing you to
do point-in-time snapshots with low overhead vs the 'FULL' version.
Since there is no old data to copy out the loop in backup_run() for the
NONE case just calls qemu_coroutine_yield() which only wakes up after
an event (usually cancel in this case). The rest is handled by the
before_write notifier which again calls backup_do_cow() to write out
the old data so it can be preserved.
Signed-off-by: Ian Main <imain@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is what QMP wants to use. The options haven't been enabled in any
release yet, so we're still free to change them.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
s->qcow and s->qcow_filename are allocated but not freed on error. Fix the
possible leaks, remove unnecessary check for bdrv_new(), propagate ret code of
bdrv_create() and also the one of enable_write_target().
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Implement bdrv_aio_discard for gluster.
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
if the blocksize of an iSCSI LUN is bigger than the BDRV_SECTOR_SIZE
it is possible that sector_num or nb_sectors are not correctly
aligned.
to avoid corruption we fail requests which are misaligned.
Signed-off-by: Peter Lieven <pl@kamp.de>
Cc: qemu-stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
this hask is not working (anymore). support for misaligned offsets should
be handled at the block layer.
Signed-off-by: Peter Lieven <pl@kamp.de>
Cc: qemu-stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
the -ENOPSC case did not work due to the missing goto.
Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Peter Lieven <pl@kamp.de>
Cc: qemu-stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Don't assume that SG_IO is always invoked with a simple buffer,
check the iovec_count and if it is >= 1 then we need to pass an array
of iovectors to libiscsi instead of just a plain buffer.
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: qemu-stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
One of the major reasons for doing something new for -blockdev and
blockdev-add was that the old block layer code parses filenames instead
of just taking them literally. So we should really leave it untouched
when it's passing using the new interfaces (like -drive
file.filename=...).
This allows opening relative file names that contain a colon.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
CURL driver requests partial data from server on guest IO req. For HTTP
and HTTPS, it uses "Range: ***" in requests, and this will not work if
server not accepting range. This patch does this check when open.
* Removed curl_size_cb, which is not used: On one hand it's registered to
libcurl as CURLOPT_WRITEFUNCTION, instead of CURLOPT_HEADERFUNCTION,
which will get called with *data*, not *header*. On the other hand the
s->len is assigned unconditionally later.
In this gone function, the sscanf for "Content-Length: %zd", on
(void *)ptr, which is not guaranteed to be zero-terminated, is
potentially a security bug. So this patch fixes it as a side-effect. The
bug is reported as: https://bugs.launchpad.net/qemu/+bug/1188943
(Note the bug is marked "private" so you might not be able to see it)
* Introduced curl_header_cb, which is used to parse header and mark the
server as accepting range if "Accept-Ranges: bytes" line is seen from
response header. If protocol is HTTP or HTTPS, but server response has
no not this support, refuse to open this URL.
Note that python builtin module SimpleHTTPServer is an example of not
supporting range, if you need to test this driver, get a better server
or use internet URLs.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Depending on the subformat, has_zero_init queries underlying storage for
flat extent. If it has a flat extent and its underlying storage doesn't
have zero init, return 0. Otherwise return 1.
Aligns the operator assignments.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
.has_zero_init defaults to 1 for all formats and protocols.
this is a dangerous default since this means that all
new added drivers need to manually overwrite it to 0 if
they do not ensure that a device is zero initialized
after bdrv_create().
if a driver needs to explicitly set this value to
1 its easier to verify the correctness in the review process.
during review of the existing drivers it turned out
that ssh and gluster had a wrong default of 1.
both protocols support host_devices as backend
which are not by default zero initialized. this
wrong assumption will lead to possible corruption
if qemu-img convert is used to write to such a backend.
vpc and vmdk also defaulted to 1 altough they support
fixed respectively flat extends. this has to be addresses
in separate patches. both formats as well as the mentioned
ssh and gluster are turned to the default of 0 with this
patch for safety.
a similar problem with the wrong default existed for
iscsi most likely because the driver developer did
oversee the default value of 1.
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Depending on the subformat, has_zero_init on VHD must behave like raw
and query the underlying storage (fixed) or like other sparse formats
that can always return 1 (dynamic, differencing).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When creating image with backing file, the driver tries to calculate the
relative path from created image file to backing file, but the path
computation is incorrect. e.g.:
$ qemu-img create -f vmdk -b vmdk-data-disk.vmdk vmdk-data-snapshot1
Formatting 'vmdk-data-snapshot1', fmt=vmdk size=10737418240
backing_file='vmdk-data-disk.vmdk' compat6=off zeroed_grain=off
$ qemu-img info vmdk-data-snapshot1
image: vmdk-data-snapshot1
file format: vmdk
virtual size: 10G (10737418240 bytes)
disk size: 12K
-> backing file: disk.vmdk
The common part in file names, "vmdk-data-", is incorrectly forgotten by
relative_path(). As the VMDK specification has no restriction on
parentNameHint to be relative path, we simply remove this by using the
backing_file option.
Cc: qemu-stable@nongnu.org
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
GlusterFS volumes can be backed by block devices, in which case
bdrv_create() doesn't make sure that the image is zeroed out. It is
currently not possibly to detect whether a given image is backed by a
file or a block device, and incorrectly assuming that it is zeroed
corrupts images during qemu-img convert, so let's err on the side of
caution and always return 0.
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If the remote is a regular file, set it to true (ie. reads of
uninitialized areas in a newly created file will return zeroes).
If we can't prove that, return false (a safe default).
Tested by adding a debugging print statement [not part of this commit]
and creating a remote file and a remote block device:
$ ./qemu-img create ssh://localhost/tmp/new 100M
Formatting 'ssh://localhost/tmp/new', fmt=raw size=104857600
filename ssh://localhost/tmp/new: has_zero_init = 1
$ sudo lvcreate -L 1G -n tmp /dev/fedora
Logical volume "tmp" created
$ ./qemu-img create ssh://localhost/dev/fedora/tmp 1G
Formatting 'ssh://localhost/dev/fedora/tmp', fmt=raw size=1073741824
filename ssh://localhost/dev/fedora/tmp: has_zero_init = 0
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: qemu-stable@nongnu.org
Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
backup_start() creates a block job that copies a point-in-time snapshot
of a block device to a target block device.
We call backup_do_cow() for each write during backup. That function
reads the original data from the block device before it gets
overwritten. The data is then written to the target device.
Currently backup cluster size is hardcoded to 65536 bytes.
[I made a number of changes to Dietmar's original patch and folded them
in to make code review easy. Here is the full list:
* Drop BackupDumpFunc interface in favor of a target block device
* Detect zero clusters with buffer_is_zero() and use bdrv_co_write_zeroes()
* Use 0 delay instead of 1us, like other block jobs
* Unify creation/start functions into backup_start()
* Simplify cleanup, free bitmap in backup_run() instead of cb
* function
* Use HBitmap to avoid duplicating bitmap code
* Use bdrv_getlength() instead of accessing ->total_sectors
* directly
* Delete the backup.h header file, it is no longer necessary
* Move ./backup.c to block/backup.c
* Remove #ifdefed out code
* Coding style and whitespace cleanups
* Use bdrv_add_before_write_notifier() instead of blockjob-specific hooks
* Keep our own in-flight CowRequest list instead of using block.c
tracked requests. This means a little code duplication but is much
simpler than trying to share the tracked requests list and use the
backup block size.
* Add on_source_error and on_target_error error handling.
* Use trace events instead of DPRINTF()
-- stefanha]
Signed-off-by: Dietmar Maurer <dietmar@proxmox.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The raw-posix driver has code to provide a /dev/cdrom on OS X even
though it doesn't really exist. However, since commit c66a6157 the real
filename is dismissed after finding it, so opening /dev/cdrom fails.
Put the filename back into the options QDict to make this work again.
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Refuse to open higher version for safety.
Although we try to be compatible with published VMDK spec, VMware has
newer version from ESXi 5.1 exported OVF/OVA, which we have no knowledge
what's changed in it. And it is very likely to have more new versions in
the future, so it's not safe to open them blindly.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This optimises the discard operation for freed clusters by batching
discard requests (both snapshot deletion and bdrv_discard end up
updating the refcounts cluster by cluster).
Note that we don't discard asynchronously, but keep s->lock held. This
is to avoid that a freed cluster is reallocated and written to while the
discard is still in flight.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Deleted snapshots are discarded in the image file by default, discard
requests take their default from the -drive discard=... option and other
places that free clusters must always be enabled explicitly.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This adds a refcount update reason to all callers of update_refcounts(),
so that a follow-up patch can use this information to decide whether
clusters that reach a refcount of 0 should be discarded in the image
file.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
# By Paolo Bonzini (3) and others
# Via Paolo Bonzini
* bonzini/scsi-next:
iscsi: reorganize iscsi_readcapacity_sync
iscsi: simplify freeing of tasks
vhost-scsi: fix k->set_guest_notifiers() NULL dereference
scsi-disk: scsi-block device for scsi pass-through should not be removable
scsi-generic: check the return value of bdrv_aio_ioctl in execute_command
scsi-generic: fix sign extension of READ CAPACITY(10) data
scsi: reset cdrom tray statuses on scsi_disk_reset
Message-id: 1371565016-2643-1-git-send-email-pbonzini@redhat.com
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Avoid the goto, and use the same retry logic for the 10- and 16-
byte versions.
Cc: qemu-stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Always free them in the iscsi_aio_*_acb functions and remove the
checks in their callers. Remove ifs when the task struct was
previously dereferenced (spotted by Coverity).
Cc: qemu-stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Otherwise they would get passed to getaddrinfo and fail with:
address resolution failed for [::1]🔢 Name or service not known
(Broken by commit v1.4.0-736-gf17c90b)
Signed-off-by: Ján Tomko <jtomko@redhat.com>
Cc: qemu-stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
# By Luiz Capitulino
# Via Luiz Capitulino
* luiz/queue/qmp:
qerror: drop QERR_OPEN_FILE_FAILED macro
block: bdrv_reopen_prepare(): don't use QERR_OPEN_FILE_FAILED
savevm: qmp_xen_save_devices_state(): use error_setg_file_open()
dump: qmp_dump_guest_memory(): use error_setg_file_open()
cpus: use error_setg_file_open()
blockdev: use error_setg_file_open()
block: mirror_complete(): use error_setg_file_open()
rng-random: use error_setg_file_open()
error: add error_setg_file_open() helper
Message-id: 1371484631-29510-1-git-send-email-lcapitulino@redhat.com
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
the hard-coded 2k buffer on the stack won't allow reading big descriptor
files which can be generated when storing big images. For example 500G
vmdk splitted to 2G chunks.
Signed-off-by: Evgeny Budilovsky <evgeny.budilovsky@ravellosystems.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
(Found by Kamil Dudka)
Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
Cc: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Remember to byteswap VMDK4Header.desc_offset on big-endian machines.
Cc: qemu-stable@nongnu.org
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Just call sd_create_branch() in the snapshot_goto to rollback the image is good
enough. With this patch, 'loadvm' process for sheepdog is modified:
Suppose we have a snapshot chain A --> B --> C, we do 'loadvm A' so as to get
a new chain,
A --> B
|
V
C1
in the old code:
1 reload inode of A (in snapshot_goto)
2 read vmstate via A's vdi_id (loadvm_state)
3 delete C and create C1, reload inode of C1 (sd_create_branch on write)
with this patch applied:
1 reload inode of A, delete C and create C1 (in snapshot_goto)
2 read vmstate via C1's parent, that is A's vdi_id (loadvm_state)
This will fix the possible bug that QEMU exit between 2 and 3 in the old code
Cc: qemu-devel@nongnu.org
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <namei.unix@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is an old and obvious bug. We should pass snapshot_id to the
tag. Or simple command like 'qemu-img snapshot -a tag sheepdog:image' will fail
Cc: qemu-devel@nongnu.org
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Liu Yuan <namei.unix@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Now image info will be retrieved as an embbed json object inside
BlockDeviceInfo, backing chain info and all related internal snapshot
info can be got in the enhanced recursive structure of ImageInfo. New
recursive member *backing-image is added to reflect the backing chain
status.
Signed-off-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This patch adds function bdrv_query_image_info(), which will
retrieve image info in qmp object format. The implementation is
based on the code moved from qemu-img.c, but uses block layer
function to get snapshot info.
Signed-off-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This patch adds function bdrv_query_snapshot_info_list(), which will
retrieve snapshot info of an image in qmp object format. The implementation
is based on the code moved from qemu-img.c with modification to fit more
for qmp based block layer API.
Signed-off-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
bdrv_snapshot_dump() and bdrv_image_info_dump() do not dump to a buffer now,
some internal buffers are still used for format control, which have no
chance to be truncated. As a result, these two functions have no more issue
of truncation, and they can be used by both qemu and qemu-img with correct
parameter specified.
Signed-off-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch is a pure code move patch, except following modification:
1 get_human_readable_size() is changed to static function.
2 dump_human_image_info() is renamed to bdrv_image_info_dump().
3 in qmp_query_block() and qmp_query_blockstats, use bdrv_next(bs)
instead of direct traverse of global array 'bdrv_states'.
4 collect_snapshots() and collect_image_info() are renamed, unused parameter
*fmt in collect_image_info() is removed.
5 code style fix.
To avoid conflict and tip better, macro in header file is BLOCK_QAPI_H
instead of QAPI_H. Now block.h and snapshot.h are at the same level in
include path, block_int.h and qapi.h will both include them.
Signed-off-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
All snapshot related code, except bdrv_snapshot_dump() and
bdrv_is_snapshot(), is moved to block/snapshot.c. bdrv_snapshot_dump()
will be moved to another file later. bdrv_is_snapshot() is not related
with internal snapshot. It also fixes small code style errors reported
by check script.
Signed-off-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch is used to remove twice include of "qemu-common.h" in
block/win32-aio.c
Signed-off-by: Qiao Nuohan <qiaonuohan@cn.fujitsu.com>
Reviewed-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
This catches the situation that is described in the bug report at
https://bugs.launchpad.net/qemu/+bug/865518 and goes like this:
$ qemu-img create -f qcow2 huge.qcow2 $((1024*1024))T
Formatting 'huge.qcow2', fmt=qcow2 size=1152921504606846976 encryption=off cluster_size=65536 lazy_refcounts=off
$ qemu-io /tmp/huge.qcow2 -c "write $((1024*1024*1024*1024*1024*1024 - 1024)) 512"
Segmentation fault
With this patch applied the segfault will be avoided, however the case
will still fail, though gracefully:
$ qemu-img create -f qcow2 /tmp/huge.qcow2 $((1024*1024))T
Formatting 'huge.qcow2', fmt=qcow2 size=1152921504606846976 encryption=off cluster_size=65536 lazy_refcounts=off
qemu-img: The image size is too large for file format 'qcow2'
Note that even long before these overflow checks kick in, you get
insanely high memory usage (up to INT_MAX * sizeof(uint64_t) = 16 GB for
the L1 table), so with somewhat smaller image sizes you'll probably see
qemu aborting for a failed g_malloc().
If you need huge image sizes, you should increase the cluster size to
the maximum of 2 MB in order to get higher limits.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Use special offset to write zeroes efficiently, when zeroed-grain GTE is
available. If zero-write an allocated cluster, cluster is leaked because
its offset pointer is overwritten by "0x1".
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Previously VmdkMetaData.offset is stored little endian while other
fields are cpu endian. This changes offset to cpu endian and convert
before writing to image.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Add image create option "zeroed-grain" to enable zeroed-grain GTE
feature of vmdk sparse extents. When this option is on, header version
of newly created extent will be 2 and VMDK4_FLAG_ZERO_GRAIN flag bit
will be set.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Introduced support for zeroed-grain GTE, as specified in Virtual Disk
Format 5.0[1].
Recent VMware hosted platform products support a new “zeroed‐grain”
grain table entry (GTE). The zeroed‐grain GTE returns all zeros on
read. In other words, the zeroed‐grain GTE indicates that a grain
in the child disk is zero‐filled but does not actually occupy space
in storage. A sparse extent with zeroed‐grain GTE has the following
in its header:
* SparseExtentHeader.version = 2
* SparseExtentHeader.flags has bit 2 set
Other than the new flag and the possibly zeroed‐grain GTE, version 2
sparse extents are identical to version 1. Also, a zeroed‐grain GTE
has value 0x1 in the GT table.
[1] Virtual Disk Format 5.0, http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?src=vmdk
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Internal routines in vmdk.c previously return -1 on error and 0 on
success. More return values are useful for future changes such as
zeroed-grain GTE. Change all the magic `return 0` and `return -1` to
macro names:
* VMDK_OK 0
* VMDK_ERROR (-1)
* VMDK_UNALLOC (-2)
* VMDK_ZEROED (-3)
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This adds in read-only support to the VHDX image format. This supports
reads for fixed-size, and dynamic sized VHDX images.
Differencing files are still unsupported.
The image must be opened without BDRV_O_RDWR set, because we do not
yet update the headers. I.e., pass 'readonly=on' in the drive image
options from the QEMU commandline.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This is the initial block driver framework for VHDX image support
(i.e. Hyper-V image file formats), that supports opening VHDX files, and
parsing the headers.
This commit does not yet enable:
- reading
- writing
- updating the header
- differencing files (images with parents)
- log replay / dirty logs (only clean images)
This is based on Microsoft's VHDX specification:
"VHDX Format Specification v0.95", published 4/12/2012
https://www.microsoft.com/en-us/download/details.aspx?id=29681
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This is based on Microsoft's VHDX specification:
"VHDX Format Specification v0.95", published 4/12/2012
https://www.microsoft.com/en-us/download/details.aspx?id=29681
These structures define the various header, metadata, and other
block structures defined in the VHDX specification.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Currently the 'loadvm' opertaion works as following:
1. switch to the snapshot
2. mark current working VDI as a snapshot
3. rely on sd_create_branch to create a new working VDI based on the snapshot
This works not the same as other format as QCOW2. For e.g,
qemu > savevm # get a live snapshot snap1
qemu > savevm # snap2
qemu > loadvm 1 # This will steally create snap3 of the working VDI
Which will result in following snapshot chain:
base <-- snap1 <-- snap2 <-- snap3
^
|
working VDI
snap3 was unnecessarily created and might be annoying users.
This patch discard the unnecessary 'snap3' creation. and implement
rollback(loadvm) operation to the specified snapshot by
1. switch to the snapshot
2. delete working VDI
3. rely on sd_create_branch to create a new working VDI based on the snapshot
The snapshot chain for above example will be:
base <-- snap1 <-- snap2
^
|
working VDI
Cc: qemu-devel@nongnu.org
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Reviewed-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
When a snapshot is taken from out side of qemu (e.g. qemu-img
snapshot), write requests to the current vdi return SD_RES_READONLY.
In this case, the sheepdog block driver needs to update the current
inode to the latest one and resend the write requests.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This adds a helper function to update the current inode state with the
specified vdi object.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Sheepdog returns SD_RES_READONLY when qemu sends write requests to the
snapshot vdi. This adds the result code and makes sd_strerror() print
its error reason.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This makes 'filename' and 'tag' constant variables, and renames
'for_snapshot' to 'lock' to clear how it works.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Commit a9ccedc3 frees the QemuOpts for the driver-specific options
immediately, even though it still needs the filename string that is
contained there. This doesn't work. Move the deletion of the QemuOpts to
the end of the function where its content isn't needed any more.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The 'TRIM' command from VM that is to release underlying data storage for
better thin-provision is already supported by the Sheepdog.
This patch adds the TRIM support at QEMU part.
For older Sheepdog that doesn't support it, we return 0(success) to upper layer.
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
# By Kevin Wolf (16) and Stefan Hajnoczi (4)
# Via Kevin Wolf
* kwolf/for-anthony:
qemu-iotests: add 053 unaligned compressed image size test
block: Allow overriding backing.file.filename
block: Remove filename parameter from .bdrv_file_open()
vvfat: Use bdrv_open options instead of filename
sheepdog: Use bdrv_open options instead of filename
rbd: Use bdrv_open options instead of filename
iscsi: Use bdrv_open options instead of filename
gluster: Use bdrv_open options instead of filename
curl: Use bdrv_open options instead of filename
blkverify: Use bdrv_open options instead of filename
blkdebug: Use bdrv_open options instead of filename
raw-win32: Use bdrv_open options instead of filename
raw-posix: Use bdrv_open options instead of filename
block: Enable filename option
block: Add driver-specific options for backing files
block: Fail gracefully when using a format driver on protocol level
qemu-iotests: Fix _filter_qemu
qemu-img: do not zero-pad the compressed write buffer
qcow: allow sub-cluster compressed write to last cluster
qcow2: allow sub-cluster compressed write to last cluster
Message-id: 1366630294-18984-1-git-send-email-kwolf@redhat.com
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
# By Stefan Hajnoczi
# Via Paolo Bonzini
* bonzini/nbd-next:
nbd: set TCP_NODELAY
nbd: use TCP_CORK in nbd_co_send_request()
nbd: unlock mutex in nbd_co_send_request() error path
Message-id: 1366381830-11267-1-git-send-email-pbonzini@redhat.com
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This is only to convert the internal interface that is used for passing
the "filename" to be parsed, but converting to actual fine grained
options is left for another day, as it doesn't look trivial.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This is only to convert the internal interface that is used for passing
the "filename" to be parsed, but converting to actual fine grained
options is left for another day, as it doesn't look trivial.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This is only to convert the internal interface that is used for passing
the "filename" to be parsed, but converting to actual fine grained
options is left for another day, as it doesn't look trivial.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This is only to convert the internal interface that is used for passing
the "filename" to be parsed, but converting to actual fine grained
options is left for another day, as it doesn't look trivial.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
As a bonus, going through the QemuOpts QEMU_OPT_SIZE parser for the
readahead option gives us proper error reporting that the previous use
of atoi() lacked.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Options starting in "backing." are passed to the backing file now. If
you don't need to specify the filename for the backing file, you can add
it on the command line instead of in the image file:
$ qemu-nbd -t /tmp/test.img
$ qemu-img create -f qcow2 empty.qcow2 1G
$ qemu-system-x86_64 -drive file=empty.qcow2,backing.file.driver=nbd,\
backing.file.host=localhost
Note that this doesn't override the backing filename from the image. If
the image has one, this will fail because NBD doesn't want the options
and a filename at the same time.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Compression in qcow requires image length to be a multiple of the
cluster size. Lift this requirement by zero-padding the final cluster
when necessary. The virtual disk size is still not cluster-aligned, so
the guest cannot access the zero sectors.
Note that this is almost identical to the qcow2 version of this code.
qcow2's compression code is drawn from qcow.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Compression in qcow2 requires image length to be a multiple of the
cluster size. Lift this requirement by zero-padding the final cluster
when necessary. The virtual disk size is still not cluster-aligned, so
the guest cannot access the zero sectors.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Now gcc will check whether format string and variable arguments match.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Disable the Nagle algorithm to reduce latency. Note this means we must
also use TCP_CORK when sending header followed by payload to avoid
fragmenting lots of little packets. The previous patch took care of
that.
Suggested-by: Nick Thomas <nick@bytemark.co.uk>
Tested-by: Nick Thomas <nick@bytemark.co.uk>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use TCP_CORK to defer packet transmission until both the header and the
payload have been written.
Suggested-by: Nick Thomas <nick@bytemark.co.uk>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The existing bdrv_co_flush_to_disk implementation uses rbd_flush(),
which is sychronous and causes the main qemu thread to block until it
is complete. This results in unresponsiveness and extra latency for
the guest.
Fix this by using an asynchronous version of flush. This was added to
librbd with a special #define to indicate its presence, since it will
be backported to stable versions. Thus, there is no need to check the
version of librbd.
Implement this as bdrv_aio_flush, since it matches other aio functions
in the rbd block driver, and leave out bdrv_co_flush_to_disk when the
asynchronous version is available.
Reported-by: Oliver Francke <oliver@filoo.de>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
libssh2_sftp_fsync is an extension to libssh2 to support fsync(2) over
sftp, which is itself an extension of OpenSSH.
If both libssh2 and the ssh daemon support it, this will allow
bdrv_flush_to_disk to commit changes through to disk on the remote
server.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
qemu-system-x86_64 -drive file=ssh://hostname/some/image
QEMU will ssh into 'hostname' and open '/some/image' which is made
available as a standard block device.
You can specify a username (ssh://user@host/...) and/or a port number
(ssh://host:port/...). You can also use an alternate syntax using
properties (file.user, file.host, file.port, file.path).
Current limitations:
- Authentication must be done without passwords or passphrases, using
ssh-agent. Other authentication methods are not supported.
- Uses a single connection, instead of concurrent AIO with multiple
SSH connections.
This is implemented using libssh2 on the client side. The server just
requires a regular ssh daemon with sftp-server support. Most ssh
daemons on Unix/Linux systems will work out of the box.
Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
Cc: Stefan Hajnoczi <stefanha@gmail.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Directly pass the QEMUIOVector on instead of linearising it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Move aes.h from include/block to include/qemu to show it can be reused
by other subsystems.
Cc: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Edgar E. Iglesias <edgar.iglesias@gmail.com>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Many of these should be cleaned up with proper qdev-/QOM-ification.
Right now there are many catch-all headers in include/hw/ARCH depending
on cpu.h, and this makes it necessary to compile these files per-target.
However, fixing this does not belong in these patches.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It ignored the error code, and at least the 'goto fail' is obvious
nonsense as it creates an endless loop (if the next attempt doesn't
magically succeed) and leaves the in-memory L1 table in big-endian
instead of converting it back.
In error cases, there's no point in writing an updated L1 table, so
skip this part for them.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The fcntl(fd, F_SETFL, O_NONBLOCK) flag is not specific to sockets.
Rename to qemu_set_nonblock() just like qemu_set_cloexec().
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Instead of just checking once in exactly this order if there are
dependendies, non-COW clusters and new allocation, this starts looping
around these. This way we can, for example, gather non-COW clusters after
new allocations as long as the host cluster offsets stay contiguous.
Once handle_dependencies() is extended so that COW areas of in-flight
allocations can be overwritten, this allows to continue with gathering
other clusters (we wouldn't be able to do that without this change
because we would have missed a possible second dependency in one of the
next clusters).
This means that in the typical sequential write case, we can combine the
COW overwrite of one cluster with the allocation of the next cluster as
soon as something like Delayed COW gets actually implemented. It is only
by avoiding splitting requests this way that Delayed COW actually starts
improving performance noticably.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This patch is mainly to separate the indentation change from the
semantic changes. All that really changes here is that everything moves
into a while loop, all 'goto done' become 'break' and at the end of the
loop a new 'break is inserted.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Instead of expecting a single l2meta, have a list of them. This allows
to still have a single I/O request for the guest data, even though
multiple l2meta may be needed in order to describe both a COW overwrite
and a new cluster allocation (typical sequential write case).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This gets rid of the nb_clusters and keep_clusters and the associated
complicated calculations. Just advance the number of bytes that have
been processed and everything is fine.
This patch advances the variables even after the last operation even
though they aren't used any more afterwards to make things look more
uniform. A later patch will turn the whole thing into a loop and then
it actually starts making sense.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This makes handle_alloc() and handle_copied() return byte-granularity
host offsets instead of returning always the cluster start. This is
required so that qcow2_alloc_cluster_offset() can stop aligning
everything to cluster boundaries.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Look only for clusters that start at a given physical offset.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Now *bytes is used to return the length of the area that can be written
to without performing an allocation or COW.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
handle_copied() uses its bytes parameter now to determine how many
clusters it should try to find.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Things can be simplified a bit now. No semantic changes.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The interface works completely on a byte granularity now and duplicated
parameters are removed.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
handle_alloc() is now called with the offset at which the actual new
allocation starts instead of the offset at which the whole write request
starts, part of which may already be processed.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
We already communicate the same information in *bytes.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This moves some code that prepares the allocation of new clusters to
where the actual allocation happens. This is the minimum required to be
able to move it to a separate function in the next patch.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This is a more precise description of what really constitutes a
dependency. The behaviour doesn't change at this point because the COW
area of the old request is still aligned to cluster boundaries and
therefore an overlap is detected wheneven the requests touch any part of
the same cluster.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The old code detected an overlapping allocation even when the
allocations didn't actually overlap, but were only adjacent.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Handling overlapping allocations isn't just a detail of cluster
allocation. It is rather one of three ways to get the host cluster
offset for a write request:
1. If a request overlaps an in-flight allocations, the cluster offset
can be taken from there (this is what handle_dependencies will evolve
into) or the request must just wait until the allocation has
completed. Accessing the L2 is not valid in this case, it has
outdated information.
2. Outside overlapping areas, check the clusters that can be written to
as they are, with no COW involved.
3. If a COW is required, allocate new clusters
Changing the code to reflect this doesn't change the behaviour because
overlaps cannot exist for clusters that are kept in step 2. It does
however make it easier for later patches to work on clusters that belong
to an allocation that is still in flight.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The unlock wakes up the next coroutine, but the currently running
coroutine will lock it again before it yields, so this doesn't make a
lot of sense.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This should be based on the virtual disk size, not on the size of the
image.
Interesting observation: With some VM state stored in the image file,
percentages higher than 100% are possible, even though snapshots
themselves are ignored. This is a qcow2 bug to be fixed another day: The
VM state should be discarded in the active L2 tables after completing
the snapshot creation.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The new parameter is unused yet.
This part was missing in commit 787e4a8500.
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Commit 787e4a85 [block: Add options QDict to bdrv_file_open() prototypes] didn't
update rbd.c accordingly.
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Reviewed-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
A file name may only specified if no host or socket path is specified.
The latter two may not appear at the same time either.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
The URL method already takes care to apply the default port when none is
specfied. Directly specifying driver-specific options required the port
number until now. Allow leaving it out and apply the default.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
In order to achieve this, the .bdrv_probe callbacks of all drivers must
cope with this. The DMG driver is the only one that bases its decision
on the filename and it needs to be changed.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
If a driver needs structured data and not just a string, it can provide
a .bdrv_parse_filename callback now that parses the command line string
into separate options. Keeping this separate from .bdrv_open_filename
ensures that the preferred way of directly specifying the options always
works as well if parsing the string works.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
The existing parsers for the file name now parse everything into the
bdrv_open() options QDict. Instead of using these parsers, you can now
directly specify the options on the command line, like this:
qemu-system-x86_64 -drive file=nbd:,file.port=1234,file.host=::1
Clearly the file=... part could use further improvement, but it's a
start.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
The NBD block supports an URL syntax, for which a URL parser returns
separate hostname and port fields. It also supports the traditional qemu
syntax encoded in a filename. Until now, after parsing the URL to get
each piece of information, a new string is built to be fed to socket
functions.
Instead of building a string in the URL case that is immediately parsed
again, parse the string in both cases and use the QemuOpts interface to
qemu-sockets.c.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Need to pass an options QDict to qcow2_open() now. This fixes a segfault
on the migration target with qcow2.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Sheepdog (neither quorum nor unsafe mode) will refuse to serve IO requests when
number of alive nodes is less than that of copies specified by users. This will
return 0x19 to QEMU client which currently doesn't recognize it.
This patch adds an error description when QEMU client receives it, other than
plainly printing 'Invalid error code'
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Now that each AioContext has a ThreadPool and the main loop AioContext
can be fetched with bdrv_get_aio_context(), we can eliminate the concept
of a global thread pool from thread-pool.c.
The submit functions must take a ThreadPool* argument.
block/raw-posix.c and block/raw-win32.c use
aio_get_thread_pool(bdrv_get_aio_context(bs)) to fetch the main loop's
ThreadPool.
tests/test-thread-pool.c must be updated to reflect the new
thread_pool_submit() function prototypes.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
If an io_flush handler is not set, qemu_aio_wait doesn't invoke
callbacks.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Using a blocking socket in the coroutine context reduces the chance of
switching to other work. This patch makes the sheepdog driver use a
non-blocking fd always.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Otherwise, live migration of the top layer will miss zero clusters and
let the backing file show through. This also matches what is done in qed.
QCOW2_CLUSTER_ZERO clusters are invalid in v2 image files. Check this
directly in qcow2_get_cluster_offset instead of replicating the test
everywhere.
Cc: qemu-stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
We already flush when the function completes. There is no need to flush
after every compressed cluster.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The update_cluster_refcount() function increments/decrements a cluster's
refcount and then returns the new refcount value.
There is no need to flush since both update_cluster_refcount() callers
already take care of this:
1. qcow2_alloc_bytes() calls update_cluster_refcount() when compressed
sectors will be appended to an existing cluster with enough free
space. qcow2_alloc_bytes() already flushes so there is no need to do
so in update_cluster_refcount().
2. qcow2_update_snapshot_refcount() sets a cache dependency on refcounts
if it needs to update L2 entries. It also flushes before completing.
Removing this flush significantly speeds up qcow2 snapshot creation:
$ qemu-img create -f qcow2 test.qcow2 -o size=50G,preallocation=metadata
$ time qemu-img snapshot -c new test.qcow2
Time drops from more than 3 minutes to under 1 second.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Users of qcow2_update_snapshot_refcount() do not flush consistently.
qcow2_snapshot_create() flushes but qcow2_snapshot_goto() and
qcow2_snapshot_delete() do not.
Solve this by moving the bdrv_flush() into
qcow2_update_snapshot_refcount().
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Compressed writes use qcow2_alloc_bytes() to allocate space with byte
granularity. The affected clusters' refcounts will be incremented but
we do not need to flush yet.
Set a L2 cache dependency on the refcount block cache, so that the
refcounts get written out before the L2 updates.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since qcow2 metadata is cached we need to flush the caches, not just the
underlying file. Use bdrv_flush(bs) instead of bdrv_flush(bs->file).
Also add the error return path when bdrv_flush() fails and move the
flush after checking for qcow2_alloc_clusters() failure so that the
qcow2_alloc_clusters() error return value takes precedence.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
update_refcount() affects the refcount cache, it does not write to disk.
Therefore bdrv_flush(bs->file) does nothing. We need to flush the
refcount cache in order to write out the refcount updates!
While we're here also add error returns when qcow2_cache_flush() fails.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2 images now accept a boolean lazy_refcounts options. Use it like
this:
-drive file=test.qcow2,lazy_refcounts=on
If the option is specified on the command line, it overrides the default
specified by the qcow2 header flags that were set when creating the
image.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
It doesn't do anything yet except storing the options QDict in the
BlockDriverState.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
this patch adds iscsi_truncate which effectively allows for
online resizing of iscsi volumes. for this to work you have
to resize the volume on your storage and then call
block_resize command in qemu which will issue a
readcapacity16 to update the capacity.
v4:
- factor out complete readcapacity logic into a separate function
- handle capacity change check condition in readcapacity function
(this happens if the block_resize cmd is the first iscsi task
executed after a resize on the storage)
v3:
- remove switch statement in iscsi_open
- create separate patch for brdv_drain_all() in bdrv_truncate()
v2:
- add a general bdrv_drain_all() before bdrv_truncate() to avoid
in-flight AIOs while the device is truncated
- since no AIOs are in flight we can use a sync libiscsi call
to re-read the capacity
- factor out the readcapacity16 logic as it is redundant
to iscsi_open() and iscsi_truncate().
Signed-off-by: Peter Lieven <pl@kamp.de>
[allow any type of unit attention check condition in iscsi_readcapacity_sync(),
as in Message-ID: <51263A2A.6070304@dlhnet.de> - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
the storage might return a check condition status for various reasons.
(e.g. bus reset, capacity change, thin-provisioning info etc.)
currently all these informative status responses lead to an I/O error
which is populated to the guest. this patch introduces a retry mechanism
to avoid this.
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch adds support for a unix domain socket for a connection
between qemu and local sheepdog server. You can use the unix domain
socket with the following syntax:
$ qemu sheepdog+unix:///<vdiname>?socket=<socket path>[#snapid]
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This uses the form "<host>:<port>" for the representation of the
sheepdog server to use inet_connect.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The URI syntax is consistent with the NBD and Gluster syntax. The
syntax is
sheepdog[+tcp]://[host:port]/vdiname[#snapid|#tag]
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The qemu-img check command can display fragmentation statistics:
* Total number of clusters in virtual disk
* Number of allocated clusters
* Number of fragmented clusters
This patch adds fragmentation statistics support to qcow2.
Compressed and normal clusters count as allocated. Zero clusters are
not counted as allocated unless their L2 entry has a non-zero offset
(e.g. preallocation).
Only the current L1 table counts towards the statistics - snapshots are
ignored.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The check_refcounts_l1/l2() functions have a check_copied argument to
check that the QCOW_O_COPIED flag is consistent with refcount == 1.
This should be a bool, not an int.
However, the next patch introduces qcow2 fragmentation statistics and
also needs to pass an option to check_refcounts_l1/l2(). This is a good
opportunity to use an int flags field.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This patch adds the support for reporting the image end offset (in
bytes). This is particularly useful after a conversion (or a rebase)
where the destination is a block device in order to find the first
unused byte at the end of the image.
Signed-off-by: Federico Simoncelli <fsimonce@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The curl_easy_setopt(state->curl, CURLOPT_PROTOCOLS, ...) interface was
introduced in libcurl 7.19.4. Therefore we cannot protect against
CVE-2013-0249 when linking against an older libcurl.
This fixes the build failure introduced by
fb6d1bbd24.
Reported-by: Andreas Färber <afaerber@suse.de>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Andreas Färber <andreas.faeber@web.de>
Message-id: 1360743934-8337-1-git-send-email-stefanha@redhat.com
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This reverts commit f880defbb0.
Jeff Cody's testing revealed that the interpretation of size differs
even between VirtualPC and HyperV. Revert this so there is time to
consider the impact of any backwards incompatible behavior this change
creates.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Linux block devices can be set read-only with "blockdev --setro
<device>". The same thing can be done for LVM volumes using "lvchange
--permission r <volume>". This read-only setting is independent of
device node permissions. Therefore the device can still be opened
O_RDWR but actual writes will fail.
This results in odd behavior for QEMU. bdrv_open() is supposed to fail
if a read-only image is being opened with BDRV_O_RDWR. By not failing
for Linux block devices, the guest boots up but every write produces an
I/O error.
This patch checks whether the block device is read-only so that Linux
block devices behave like regular files.
Reported-by: Sibiao Luo <sluo@redhat.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The size calculated from the CHS values is not the real image (disk) size,
but usually a smaller value. This is caused by rounding effects.
Only older operating systems use CHS. Such guests won't be able to use
the whole disk. All modern operating systems use the real size.
This patch fixes https://bugs.launchpad.net/qemu/+bug/1105670/.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Message-id: 1360265212-22037-1-git-send-email-sw@weilnetz.de
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
There is a buffer overflow in libcurl POP3/SMTP/IMAP. The workaround is
simple: disable extra protocols so that they cannot be exploited. Full
details here:
http://curl.haxx.se/docs/adv_20130206.html
QEMU only cares about HTTP, HTTPS, FTP, FTPS, and TFTP. I have tested
that this fix prevents the exploit on my host with
libcurl-7.27.0-5.fc18.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Commit eeb6b45d48 (block: raw-posix image
file reopen) broke the build on OpenIndiana.
illumos has no O_ASYNC. Exclude it from flags to be compared
and instead assert that it is not set where defined.
Cf. e61ab1da7e for qemu-ga.
Cc: qemu-stable@nongnu.org (1.3.x)
Cc: Jeff Cody <jcody@redhat.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andreas Färber <andreas.faerber@web.de>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The previous scanf() format string stopped parsing the file name on the
first white white space, which seems to be allowed at least by VMware
Workstation.
Change the format string to collect everything between the first and
second quote as the file name, disallowing line breaks.
Signed-off-by: Philipp Hahn <hahn@univention.de>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Return -errno instead of -1 on errors. Hey, no memory leak to fix here
while we're touching it!
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The buffers are allocated with g_(re)alloc, so use g_free to free them.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Return -errno instead of -1 on errors and add error checks in some
places that didn't have one. Passing things by reference requires more
correct typing, replaced a few off_ts therefore - with a 32-bit off_t
this is even a fix for truncation bugs.
While touching the code, fix even some more memory leaks than in the
other drivers...
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Return -errno instead of -1 on errors. While touching the
code, fix a memory leak.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Return -errno instead of -1 on errors. While touching the
code, fix a memory leak.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Return -errno instead of -1 on errors. While touching the
code, fix a memory leak.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Sheep daemon needs vdi_id to identify which vdi is closed to release resources
such as object cache.
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Reviewed-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Introduce a new option "adapter_type" when converting to vmdk images.
It can be one of the following: ide (default), buslogic, lsilogic
or legacyESX (according to the vmdk spec from vmware).
In case of a non-ide adapter, heads is set to 255 instead of the 16.
The latter is used for "ide".
Also see LP#545089
Signed-off-by: Othmar Pasteka <pasteka@kabsi.at>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Once upon a time, it was decided that qemu_malloc(0) should abort.
Switching to glib retired that bright idea. Some code that was added
to cope with it (e.g. in commits 702ef63, b76b6e9) is still around.
Bury it.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
On a zero-sized disk we need to break out of the job successfully
before bdrv_dirty_iter_init is called, otherwise you will get an
assertion failure with the next patch.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
vdi_open did not check for a bad signature.
This check was only in vdi_probe.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
vdi_open returned -1 in case of any error, but it should return an
error code (negative value of errno or -EMEDIUMTYPE).
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The signature is a 32 bit value and needs up to 8 hex digits for printing.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This improves error reports for bochs, cow, qcow, qcow2, qed and vmdk
when a file with the wrong format is selected.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Yet another optimization is to extend the mirroring iteration to include more
adjacent dirty blocks. This limits the number of I/O operations and makes
mirroring efficient even with a small granularity. Most of the infrastructure
is already in place; we only need to put a loop around the computation of
the origin and sector count of the iteration.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
With AIO support in place, we can start copying more than one chunk
in parallel. This patch introduces the required infrastructure for
this: the buffer is split into multiple granularity-sized chunks,
and there is a free list to access them.
Because of copy-on-write, a single operation may already require
multiple chunks to be available on the free list.
In addition, two different iterations on the HBitmap may want to
copy the same cluster. We avoid this by keeping a bitmap of in-flight
I/O operations, and blocking until the previous iteration completes.
This should be a pretty rare occurrence, though; as long as there is
no overlap the next iteration can start before the previous one finishes.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This makes sense when the next commit starts using the extra buffer space
to perform many I/O operations asynchronously.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There is really no change in the behavior of the job here, since
there is still a maximum of one in-flight I/O operation between
the source and the target. However, this patch already introduces
the AIO callbacks (which are unmodified in the next patch)
and some of the logic to count in-flight operations and only
complete the job when there is none.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The desired granularity may be very different depending on the kind of
operation (e.g. continuous replication vs. collapse-to-raw) and whether
the VM is expected to perform lots of I/O while mirroring is in progress.
Allow the user to customize it, while providing a sane default so that
in general there will be no extra allocated space in the target compared
to the source.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When mirroring runs, the backing files for the target may not yet be
ready. However, this means that a copy-on-write operation on the target
would fill the missing sectors with zeros. Copy-on-write only happens
if the granularity of the dirty bitmap is smaller than the cluster size
(and only for clusters that are allocated in the source after the job
has started copying). So far, the granularity was fixed to 1MB; to avoid
the problem we detected the situation and required the backing files to
be available in that case only.
However, we want to lower the granularity for efficiency, so we need
a better solution. The solution is to always copy a whole cluster the
first time it is touched. The code keeps a bitmap of clusters that
have already been allocated by the mirroring job, and only does "manual"
copy-on-write if the chunk being copied is zero in the bitmap.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This actually uses the dirty bitmap in the block layer, and converts
mirroring to use an HBitmapIter.
Reviewed-by: Laszlo Ersek <lersek@redhat.com> (except block/mirror.c parts)
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch adds support for directly passing the iovec
array from QEMUIOVector if libiscsi supports it (1.8.0
or newer).
Signed-off-by: Peter Lieven <pl@kamp.de>
[Preserve the improvements from commit 4cc841b, iscsi: partly
avoid iovec linearization in iscsi_aio_writev, 2012-11-19 - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
acb->buf is freed in the WRITE(16) callback, but this may not
get called at all when commands are aborted. Add another
free in the ABORT TASK callback, which requires setting acb->buf
to NULL everywhere.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
# By Peter Lieven (3) and others
# Via Paolo Bonzini
* bonzini/scsi-next:
scsi: Drop useless null test in scsi_unit_attention()
lsi: use qbus_reset_all to reset SCSI bus
scsi: fix segfault with 0-byte disk
iscsi: add support for iSCSI NOPs [v2]
iscsi: partly avoid iovec linearization in iscsi_aio_writev
iscsi: add iscsi_create support
This patch will send NOP-Out PDUs every 5 seconds to the iSCSI target.
If a consecutive number of NOP-In replies fail a reconnect is initiated.
iSCSI NOPs help to ensure that the connection to the target is still operational.
This should not, but in reality may be the case even if the TCP connection is still
alive if there are bugs in either the target or the initiator implementation.
v2:
- track the NOPs inside libiscsi so libiscsi can reset the counter
in case it initiates a reconnect.
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
libiscsi expects all write16 data in a linear buffer. If the
iovec only contains one buffer we can skip the linearization
step as well as the additional malloc/free and pass the
buffer directly.
Reported-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch adds support for bdrv_create. This allows e.g.
to use qemu-img to convert from any supported device to
an iscsi backed storage as destination.
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
# By Kevin Wolf (4) and others
# Via Stefan Hajnoczi
* stefanha/block:
dataplane: support viostor virtio-pci status bit setting
dataplane: avoid reentrancy during virtio_blk_data_plane_stop()
win32-aio: use iov utility functions instead of open-coding them
win32-aio: Fix memory leak
win32-aio: Fix vectored reads
aio: Fix return value of aio_poll()
ide: Remove wrong assertion
block: fix null-pointer bug on error case in block commit
Fixes the build on OpenBSD among others.
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Andreas Färber <andreas.faerber@web.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
We have iov_from_buf() and iov_to_buf(), use them instead of
open-coding these in block/win32-aio.c
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The buffer is allocated for both reads and writes, and obviously it
should be freed even if an error occurs.
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Copying data in the right direction really helps a lot!
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This is a bug that was caught by a coverity run by Markus. In
the error case when we errored out to exit_restore_open early in the
function, 'overlay_bs' was still NULL at that point, although it is
used to look up flags and perform a bdrv_reopen().
Move the overlay_bs lookup to where it is needed, and check for NULL
before restoring the flags. Also get rid of the unneeded parameter
initialization.
Reported-By: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
It allocates with qemu_blockalign(), therefore it must free with
qemu_vfree(), not g_free().
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
win32_aio_submit() allocates it with qemu_blockalign(), therefore it
must be freed with qemu_vfree(), not g_free().
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The last two parameters of sd_aio_setup() are never used, so remove them.
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This will reduce sockfds connected to the sheep server to one, which simply the
future hacks.
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This is easy with the thread pool, because we can use s->is_xfs and
s->has_discard from the worker function.
QEMU has a widespread assumption that each I/O operation writes less
than 2^32 bytes. This patch doesn't fix it throughout of course,
but it starts correcting struct RawPosixAIOData so that there is
no regression with respect to the synchronous discard implementation.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Block devices use a ioctl instead of fallocate, so add a separate
implementation.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Avoid sending system calls repeatedly if they shall fail. This
does not apply to XFS: if the filesystem-specific ioctl fails,
something weird is happening.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Linux 2.6.38 introduced the filesystem independent interface to
deallocate part of a file. As of Linux 3.7, btrfs, ext4, ocfs2,
tmpfs and xfs support it.
Even though the system calls here are in practice issued on Linux,
the code is structured to allow plugging in alternatives for other Unix
variants. EOPNOTSUPP is used unconditionally in this patch, but it is
supported in both OpenBSD and Mac OS X since forever (see for example
http://lists.debian.org/debian-glibc/2006/02/msg00337.html).
Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
One of the recent refactoring patches (commit f50f88b9) didn't take care
to initialise l2meta properly, so with zero-length writes, which don't
even enter the write loop, qemu just segfaulted.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The qiov_is_aligned() function checks whether a QEMUIOVector meets a
BlockDriverState's alignment requirements. This is needed by
virtio-blk-data-plane so:
1. Move the function from block/raw-posix.c to block/block.c.
2. Make it public in block/block.h.
3. Rename to bdrv_qiov_is_aligned().
4. Change return type from int to bool.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When the raw-posix aio=thread code was moved from posix-aio-compat.c
to block/raw-posix.c, there was an unintended change to the ioctl code.
The code used to return the ioctl command, which posix_aio_read()
would later morph into a zero. This hack is not necessary anymore,
and in fact breaks scsi-generic (which expects a zero return code).
Remove it.
Cc: qemu-stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Sheepdog supports both writeback/writethrough write but has not yet supported
DIRECTIO semantics which bypass the cache completely even if Sheepdog daemon is
set up with cache enabled.
Suppose cache is enabled on Sheepdog daemon size, the new cache control is
cache=writeback # enable the writeback semantics for write
cache=writethrough # enable the emulated writethrough semantics for write
cache=directsync # disable cache competely
Guest WCE toggling on the run time to toggle writeback/writethrough is also
supported.
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@gmail.com>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This allows removing of MinGW specific code and improves
reentrancy for POSIX hosts.
[Removed unused ret variable in qemu_get_timedate() to fix warning:
vl.c: In function ‘qemu_get_timedate’:
vl.c:451:16: error: variable ‘ret’ set but not used [-Werror=unused-but-set-variable]
-- Stefan Hajnoczi]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Reviewed-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
For the error case such as SD_RES_NO_SPACE, we shouldn't update the inode bitmap
to avoid the scenario that the object is allocated but wasn't created at the
server side. This will result in VM's IO error on the failed object.
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Reviewed-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Commit fbcad04d6b added fprintf statements
with wrong format specifiers.
GetLastError() returns a DWORD which is unsigned long, so %lu must be used.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The raw_get_aio_fd() function allows virtio-blk-data-plane to get the
file descriptor of a raw image file with Linux AIO enabled. This
interface is really a layering violation that can be resolved once the
block layer is able to run outside the global mutex - at that point
virtio-blk-data-plane will switch from custom Linux AIO code to using
the block layer.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Touching char/char.h basically causes the whole of QEMU to
be rebuilt. Avoid this, it is usually unnecessary.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Various header files rely on qemu-char.h including qemu-config.h or
main-loop.h, but they really do not need qemu-char.h at all (particularly
interesting is the case of the block layer!). Clean this up, and also
add missing inclusions of qemu-char.h itself.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There's no reason for run_dependent_requests() to hold s->lock, and a
later patch will require that in fact the lock is not held.
Also, before this patch, run_dependent_requests() not only does what its
name suggests, but also removes the l2meta from the list of in-flight
requests. When changing this, it becomes an one-liner, so just inline it
completely.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is closer to where the dirty flag is really needed, and it avoids
having checks for special cases related to cluster allocation directly
in the writev loop.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Even for writes to already allocated clusters, an l2meta is allocated,
though it stays effectively unused. After this patch, only allocating
requests still have one. Each l2meta now describes an in-flight request
that writes to clusters that are not yet hooked up in the L2 table.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There's no real reason to have an l2meta for normal requests that don't
allocate anything. Before we can get rid of it, we must return the host
cluster offset in a different way.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
As soon as delayed COW is introduced, the l2meta struct is needed even
after completion of the request, so it can't live on the stack.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This makes it easier to address the areas for which a COW must be
performed. As a nice side effect, the COW code in
qcow2_alloc_cluster_link_l2 becomes really trivial.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The offset within the cluster is already present as n_start and this is
what the code uses. QCowL2Meta.offset is only needed at a cluster
granularity.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We want to use these events to suspend requests for testing concurrent
AIO requests. Suspending requests while they are holding the CoMutex is
rather boring for this purpose.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This allows more systematic AIO testing. The patch adds three new
operations to blkdebug:
* Setting a "breakpoint" on a blkdebug event. The next request that
triggers this breakpoint is suspended and is tagged with a name.
The breakpoint is removed after a request has triggered it.
* A suspended request (identified by it's tag) can be resumed
* It's possible to check whether a suspended request with a given
tag exists. This can be used for waiting for an event.
Ideally, we would instead tag requests right when they are created and
set breakpoints for individual requests. However, at this point the
block layer doesn't allow this easily, and breakpoints that trigger for
any request already allow a lot of useful testing.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The cleanup work to remove a rule depends on the type of the rule. It's
easy for the existing rules as there is no data that must be cleaned up
and is specific to a type yet, but the next patch will change this.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
As soon as new rules can be set during runtime, as introduced by the
next patch, blkdebug makes sense even without a config file.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
An error has occurred if the return value is invalid_set_file_pointer
and getlasterror doesn't return no_error.
Signed-off-by: Fabien Chouteau <chouteau@adacore.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
This one fixes a race which qemu had also in iscsi block driver
between cancellation and io completition.
qemu_rbd_aio_cancel was not synchronously waiting for the end of
the command.
To archieve this it introduces a new status flag which uses
-EINPROGRESS.
Signed-off-by: Stefan Priebe <s.priebe@profihost.ag>
Reviewed-by: Stefan Hajnoczi <stefanha@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
clang now warns about an unused function:
CC block/raw-posix.o
block/raw-posix.c:707:26: warning: unused function paio_ioctl
[-Wunused-function]
static BlockDriverAIOCB *paio_ioctl(BlockDriverState *bs, int fd,
^
1 warning generated.
because the only use of paio_ioctl() is inside a #if defined(__linux__)
guard and it is static now.
Reported-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The VHD specification allows for up to a 2 TB disk size. The current
implementation in qemu emulates EIDE and ATA-2 hardware which only allows
for up to 127 GB. This disk size limitation can be overridden by allowing
up to 255 heads instead of the normal 4 bit limitation of 16. Doing so
allows disk images to be created of up to nearly 2 TB. This change does
not violate the VHD format specification nor does it change how smaller
disks (ie, <=127GB) are defined.
[Charles Arnold also writes: "In analyzing a 160 GB VHD fixed disk image
created on Windows 2008 R2, it appears that MS is also ignoring the CHS
values in the footer geometry field in whatever driver they use for
accessing the image. The CHS values are set at 65535,16,255 which
obviously doesn't represent an image size of 160 GB." -- Stefan]
Signed-off-by: Charles Arnold <carnold@suse.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Initialize the uuid field in the footer with a generated uuid.
Signed-off-by: Charles Arnold <carnold@suse.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Without any complex checks we can't assume that an
iscsi target is initialized to zero.
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the connection is interrupted before the first login is successfully
completed qemu-kvm is waiting forever in qemu_aio_wait().
This is fixed by performing an sync login to the target. If the
connection breaks after the first successful login errors are
handled internally by libiscsi.
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If an invalid URL is specified iscsi_get_error(iscsi) is called
with iscsi == NULL.
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
rbd / rados tends to return pretty often length of writes
or discarded blocks. These values might be bigger than int.
The steps to reproduce are:
mkfs.xfs -f a whole device bigger than int in bytes. mkfs.xfs sends
a discard. Important is that you use scsi-hd and set
discard_granularity=512. Otherwise rbd disabled discard support.
Signed-off-by: Stefan Priebe <s.priebe@profihost.ag>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
It's poor symbol hygiene to provide a global symbols that collide with a
common library like libuuid. If QEMU links against a shared library
that depends on uuid_generate() it can end up calling our stub version
of the function.
This exact scenario happened with GlusterFS libgfapi.so, which depends
on libglusterfs.so's uuid_generate().
Scope the uuid stubs for vdi.c only and avoid affecting other shared
objects.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
For hdev, floppy, and cdrom, the reopen() handlers are the same as
for the file reopen handler. For floppy and cdrom types, however,
we keep O_NONBLOCK, as in the _open function.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Fixed a MAJOR BUG in VMDK files on file boundaries on reads
and ALSO ON WRITES WHICH MIGHT CORRUPT THE IMAGE AND DATA!!!!!!
Triggered for example with the following VMDK file (partly listed):
RW 4193792 FLAT "XP-W1-f001.vmdk" 0
RW 2097664 FLAT "XP-W1-f002.vmdk" 0
RW 4193792 FLAT "XP-W1-f003.vmdk" 0
RW 512 FLAT "XP-W1-f004.vmdk" 0
RW 4193792 FLAT "XP-W1-f005.vmdk" 0
RW 2097664 FLAT "XP-W1-f006.vmdk" 0
RW 4193792 FLAT "XP-W1-f007.vmdk" 0
RW 512 FLAT "XP-W1-f008.vmdk" 0
Patch includes:
1.) Patch fixes wrong calculation on extent boundaries. Especially it
fixes the relativeness of the sector number to the current extent.
Verfied correctness with:
1.) Converted either with Virtualbox to VDI and then with qemu-img and
then with qemu-img only:
VBoxManage clonehd --format vdi /VM/XP-W/new/XP-W1.vmdk ~/.VirtualBox/Harddisks/XP-W1-new-test.vdi
./qemu-img convert -O raw ~/.VirtualBox/Harddisks/XP-W1-new-test.vdi /root/QEMU/VM-XP-W1/XP-W1-via-VBOX.img
md5sum /root/QEMU/VM-XP-W/XP-W1-direct.img
md5sum /root/QEMU/VM-XP-W/XP-W1-via-VBOX.img
=> same MD5 hash
2.) Verified debug log files
3.) Run Windows XP successfully
4.) chkdsk run successfully without any errors
Signed-off-by: Gerhard Wiesinger <lists@wiesinger.com>
Acked-by: Fam Zheng <famcool@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Now that AIOPool no longer keeps a freelist, it isn't really a "pool"
anymore. Rename it to AIOCBInfo and make it const since it no longer
needs to be modified.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Versions before gcc-4.6 don't support unnamed fields in initializers
(see http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10676).
Offset and OffsetHigh belong to an unnamed struct which is part of an
unnamed union. Therefore the original code does not work with older
versions of gcc.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
A missing factor for the refcount table entry size in the calculation
could mean that too little memory was allocated for the in-memory
representation of the table, resulting in a buffer overflow.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Michael Tokarev <mjt@tls.msk.ru>
Tested-by: Michael Tokarev <mjt@tls.msk.ru>
The URI syntax is consistent with the Gluster syntax. Export names
are specified in the path, preceded by one or more (otherwise unused)
slashes.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
With the new support for EventNotifiers in the AIO event loop, we
can hook a completion port to every opened file and use asynchronous
I/O on them.
Wine's support is extremely inefficient, also because it really does
the I/O synchronously on regular files. (!) But it works, and it is
good to keep the Win32 and POSIX ports as similar as possible.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Making the qemu_paiocb specific to raw devices will let us access members
of the BDRVRawState arbitrarily.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The Win32 implementation will only accept EventNotifiers, thus a few
drivers are disabled under Windows. EventNotifiers are a good match
for the GSource implementation, too, because the Win32 port of glib
allows to place their HANDLEs in a GPollFD.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Error management is important for mirroring; otherwise, an error on the
target (even something as "innocent" as ENOSPC) requires to start again
with a full copy. Similar to on_read_error/on_write_error, two separate
knobs are provided for on_source_error (reads) and on_target_error (writes).
The default is 'report' for both.
The 'ignore' policy will leave the sector dirty, so that it will be
retried later. Thus, it will not cause corruption.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Switching to the target of the migration is done mostly asynchronously,
and reported to management via the BLOCK_JOB_COMPLETED event; the only
synchronous phase is opening the backing files. bdrv_open_backing_file
can always be done, even for migration of the full image (aka sync:
'full'). In this case, qmp_drive_mirror will create the target disk
with no backing file at all, and bdrv_open_backing_file will be a no-op.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch adds the implementation of a new job that mirrors a disk to
a new image while letting the guest continue using the old image.
The target is treated as a "black box" and data is copied from the
source to the target in the background. This can be used for several
purposes, including storage migration, continuous replication, and
observation of the guest I/O in an external program. It is also a
first step in replacing the inefficient block migration code that is
part of QEMU.
The job is possibly never-ending, but it is logically structured into
two phases: 1) copy all data as fast as possible until the target
first gets in sync with the source; 2) keep target in sync and
ensure that reopening to the target gets a correct (full) copy
of the source data.
The second phase is indicated by the progress in "info block-jobs"
reporting the current offset to be equal to the length of the file.
When the job is cancelled in the second phase, QEMU will run the
job until the source is clean and quiescent, then it will report
successful completion of the job.
In other words, the BLOCK_JOB_CANCELLED event means that the target
may _not_ be consistent with a past state of the source; the
BLOCK_JOB_COMPLETED event means that the target is consistent with
a past state of the source. (Note that it could already happen
that management lost the race against QEMU and got a completion
event instead of cancellation).
It is not yet possible to complete the job and switch over to the target
disk. The next patches will fix this and add many refinements to the
basic idea introduced here. These include improved error management,
some tunable knobs and performance optimizations.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This simplifies some code and error checking, and also fixes a bug.
bdrv_find_backing_image() should only be passed absolute filenames,
or filenames relative to the chain. In the QMP message handler for
block commit, when looking up the base do so from the determined top
image, so we know it is reachable from top.
Some of the error messages put out by block-commit have changed
slightly, which causes 2 tests cases for block-commit to fail.
This patch updates the test cases to look for the correct error
output.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* 'trivial-patches' of git://github.com/stefanha/qemu:
versatilepb: Use symbolic indices for ARM PIC
qdev: kill bogus comment
qemu-barrier: Fix compiler version check for future gcc versions
hw: Add missing 'static' attribute for QEMUMachine
cleanup useless return sentence
qemu-sockets: Fix compiler warning (regression for MinGW)
vnc: Fix spelling (hellmen -> hellman) in comment
slirp: Fix spelling in comment (enought -> enough, insure -> ensure)
tcg/arm: Use tcg_out_mov_reg rather than inline equivalent code
cpu: Add missing 'static' attribute to qemu_global_mutex
configure: Support empty target list (--target-list=)
hw: Fix return value check for bdrv_read, bdrv_write
This patch cleans up return sentences in the end of void functions.
Reported-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Amos Kong <akong@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@gmail.com>
Avoid strncpy+manual-NUL-terminate. Use pstrcpy instead.
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Jim Meyering <meyering@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
* parse_vdiname: Use pstrcpy, not strncpy, when the destination
buffer must be NUL-terminated.
* sd_open: Likewise, avoid buffer overrun.
* do_sd_create: Likewise. Leave the preceding memset, since
pstrcpy does not NUL-fill, and filename needs that.
* sd_snapshot_create: Add a comment/question.
* find_vdi_name: Remove a useless memset.
* sd_snapshot_goto: Remove a useless memset.
Use pstrcpy to NUL-terminate, because find_vdi_name requires
that its vdi arg (filename parameter) be NUL-terminated.
It seems ok not to NUL-fill the buffer.
Do the same for snapid: remove useless memset-0 (instead,
zero tag[0]). Use pstrcpy, not strncpy.
* sd_snapshot_list: Use pstrcpy, not strncpy to write
into the ->name member. Each must be NUL-terminated.
Acked-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Jim Meyering <meyering@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Currently it is impossible to write a blkdebug script that ping-pongs
between two states, because the second set-state rule will use the
state that is set in the first. If you have
[set-state]
event = "..."
state = "1"
new_state = "2"
[set-state]
event = "..."
state = "2"
new_state = "1"
for example the state will remain locked at 1. This can be fixed
by first processing all rules, and then setting the state.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch adds support for error management to streaming.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This will let block-stream reuse the enum. Places that used the enums
are renamed accordingly.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This adds the live commit coroutine. This iteration focuses on the
commit only below the active layer, and not the active layer itself.
The behaviour is similar to block streaming; the sectors are walked
through, and anything that exists above 'base' is committed back down
into base. At the end, intermediate images are deleted, and the
chain stitched together. Images are restored to their original open
flags upon completion.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch adds gluster as the new block backend in QEMU. This gives
QEMU the ability to boot VM images from gluster volumes. Its already
possible to boot from VM images on gluster volumes using FUSE mount, but
this patchset provides the ability to boot VM images from gluster volumes
by by-passing the FUSE layer in gluster. This is made possible by
using libgfapi routines to perform IO on gluster volumes directly.
VM Image on gluster volume is specified like this:
file=gluster[+transport]://[server[:port]]/volname/image[?socket=...]
'gluster' is the protocol.
'transport' specifies the transport type used to connect to gluster
management daemon (glusterd). Valid transport types are
tcp, unix and rdma. If a transport type isn't specified, then tcp
type is assumed.
'server' specifies the server where the volume file specification for
the given volume resides. This can be either hostname, ipv4 address
or ipv6 address. ipv6 address needs to be within square brackets [ ].
If transport type is 'unix', then 'server' field should not be specifed.
The 'socket' field needs to be populated with the path to unix domain
socket.
'port' is the port number on which glusterd is listening. This is optional
and if not specified, QEMU will send 0 which will make gluster to use the
default port. If the transport type is unix, then 'port' should not be
specified.
'volname' is the name of the gluster volume which contains the VM image.
'image' is the path to the actual VM image that resides on gluster volume.
Examples:
file=gluster://1.2.3.4/testvol/a.img
file=gluster+tcp://1.2.3.4/testvol/a.img
file=gluster+tcp://1.2.3.4:24007/testvol/dir/a.img
file=gluster+tcp://[1:2:3:4:5:6:7:8]/testvol/dir/a.img
file=gluster+tcp://[1:2:3:4:5:6:7:8]:24007/testvol/dir/a.img
file=gluster+tcp://server.domain.com:24007/testvol/dir/a.img
file=gluster+unix:///testvol/dir/a.img?socket=/tmp/glusterd.socket
file=gluster+rdma://1.2.3.4:24007/testvol/a.img
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* kwolf/for-anthony:
block: remove keep_read_only flag from BlockDriverState struct
block: convert bdrv_commit() to use bdrv_reopen()
block: vpc image file reopen
block: vdi image file reopen
block: vmdk image file reopen
block: qcow image file reopen
block: qcow2 image file reopen
block: qed image file reopen
block: raw image file reopen
block: raw-posix image file reopen
block: purge s->aligned_buf and s->aligned_buf_size from raw-posix.c
block: use BDRV_O_NOCACHE instead of s->aligned_buf in raw-posix.c
block: do not parse BDRV_O_CACHE_WB in block drivers
block: move open flag parsing in raw block drivers to helper functions
block: move aio initialization into a helper function
block: Framework for reopening files safely
block: make bdrv_set_enable_write_cache() modify open_flags
block: correctly set the keep_read_only flag
blockdev: preserve readonly and snapshot states across media changes
There is currently nothing that needs to be done for VPC image
file reopen.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There is currently nothing that needs to be done for VDI reopen.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch supports reopen for VMDK image files. VMDK extents are added
to the existing reopen queue, so that the transactional model of reopen
is maintained with multiple image files.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
These are the stubs for the file reopen drivers for the qcow format.
There is currently nothing that needs to be done by the qcow driver
in reopen.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
These are the stubs for the file reopen drivers for the qcow2 format.
There is currently nothing that needs to be done by the qcow2 driver
in reopen.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
These are the stubs for the file reopen drivers for the qed format.
There is currently nothing that needs to be done by the qed driver
in reopen.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
These are the stubs for the file reopen drivers for the raw format.
There is currently nothing that needs to be done by the raw driver
in reopen.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is derived from the Supriya Kannery's reopen patches.
This contains the raw-posix driver changes for the bdrv_reopen_*
functions. All changes are staged into a temporary scratch buffer
during the prepare() stage, and copied over to the live structure
during commit(). Upon abort(), all changes are abandoned, and the
live structures are unmodified.
The _prepare() will create an extra fd - either by means of a dup,
if possible, or opening a new fd if not (for instance, access
control changes). Upon _commit(), the original fd is closed and
the new fd is used. Upon _abort(), the duplicate/new fd is closed.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The aligned_buf pointer and aligned_buf size are no longer used in
raw_posix.c, so remove all references to them.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Rather than check for a non-NULL aligned_buf to determine if
raw_aio_submit needs to check for alignment, check for the presence
of BDRV_O_NOCACHE in the bs->open_flags.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Block drivers should ignore BDRV_O_CACHE_WB in .bdrv_open flags,
and in the bs->open_flags.
This patch removes the code, leaving the behaviour behind as if
BDRV_O_CACHE_WB was set.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Code motion, to move parsing of open flags into a helper function.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Move AIO initialization for raw-posix block driver into a helper function.
In addition to just code motion, the aio_ctx pointer is checked for NULL,
prior to calling laio_init(), to make sure laio_init() is only run once.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We no longer need to explicitely call qemu_notify_event() any more
since this is now done automatically any time the filehandles we listen
to change.
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We need to support SG_IO from the synchronous iscsi_ioctl() since
scsi-block uses this to do an INQ to the device to discover its properties
This patch makes scsi-block work with iscsi.
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
ccc-analyzer reports these warnings:
block/vdi.c:704:13: warning: Dereference of null pointer
bmap[i] = VDI_UNALLOCATED;
^
block/vdi.c:702:13: warning: Dereference of null pointer
bmap[i] = i;
^
Moving some code into the if block fixes this.
It also avoids calling function write with 0 bytes of data.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Report from smatch:
block/curl.c:546 curl_close(21) info: redundant null check on s->url calling free()
The check was redundant, and free was also wrong because the memory
was allocated using g_strdup.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch sets data to be sent to Sheepdog correctly and fixes savevm
and loadvm operations on a Sheepdog image.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* kwolf/for-anthony:
qemu-iotests: add backing file smaller than image test case
stream: complete early if end of backing file is reached
qed: refuse unaligned zero writes with a backing file
It is possible to create an image that is larger than its backing file.
Reading beyond the end of the backing file produces zeroes if no writes
have been made to those sectors in the image file.
This patch finishes streaming early when the end of the backing file is
reached. Without this patch the block job hangs and continually tries
to stream the first sectors beyond the end of the backing file.
To reproduce the hung block job bug:
$ qemu-img create -f qcow2 backing.qcow2 128M
$ qemu-img create -f qcow2 -o backing_file=backing.qcow2 image.qcow2 6G
$ qemu -drive if=virtio,cache=none,file=image.qcow2
(qemu) block_stream virtio0
(qemu) info block-jobs
The qemu-iotests 030 streaming test still passes.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Zero writes have cluster granularity in QED. Therefore they can only be
used to zero entire clusters.
If the zero write request leaves sectors untouched, zeroing the entire
cluster would obscure the backing file. Instead return -ENOTSUP, which
is handled by block.c:bdrv_co_do_write_zeroes() and falls back to a
regular write.
The qemu-iotests 034 test cases covers this scenario.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The number of blocks of the device is used to compute the device size
in bdrv_getlength()/iscsi_getlength().
For MMC devices, the ReturnedLogicalBlockAddress in the READCAPACITY10
has a special meaning when it is 0.
In this case it does not mean that LBA 0 is the last accessible LBA,
and thus the device has 1 readable block, but instead it means that the
disc is blank and there are no readable blocks.
This change ensures that when the iSCSI LUN is loaded with a blank
DVD-R disk or similar that bdrv_getlength() will return the correct
size of the device as 0 bytes.
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
* kwolf/for-anthony:
virtio-blk: hide VIRTIO_BLK_F_CONFIG_WCE from old machine types
Documentation: Warn against qemu-img on active image
vmdk: Read footer for streamOptimized images
vmdk: Fix header structure
Conflicts:
hw/virtio-blk.c
This patch fixes two main issues with block/iscsi.c:
1) iscsi_task_mgmt_abort_task_async calls iscsi_scsi_task_cancel which
was also directly called in iscsi_aio_cancel
2) a race between task completion and task abortion could happen cause
the scsi_free_scsi_task were done before iscsi_schedule_bh has finished.
To fix this, all the freeing of IscsiTasks and releasing of the AIOCBs
is centralized in iscsi_bh_cb, independent of whether the SCSI command
has completed or was cancelled.
3) iscsi_aio_cancel was not synchronously waiting for the end of the
command.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It is always used with the same callback, remove the argument. And
its return value is never used, assume allocation succeeds.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This reverts commit 64e69e8092. The commit
returned immediately from iscsi_aio_cancel, risking corruption in case the
following happens:
guest qemu target
=========================================================================
send write 1 -------->
send write 1 -------->
cancel write 1 ------>
cancel write 1 ------>
<------------------ cancellation processed
send write 2 -------->
send write 2 -------->
<---------------- completed write 2
<------------------ completed write 2
<---------------- completed write 1
<---------------- cancellation not done
Here, the guest would see write 2 superseding write 1, when in fact the
outcome could have been the opposite. The right behavior is to return
only after the target says whether the cancellation was done or not, and
it will be implemented by the next three patches.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The footer takes precedence over the header when it exists. It contains
the real grain directory offset that is missing in the header. Without
this patch, streamOptimized images with a footer cannot be read.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
This patch converts all block layer close calls, that correspond
to qemu_open calls, to qemu_close.
Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch converts all block layer open calls to qemu_open.
Note that this adds the O_CLOEXEC flag to the changed open paths
when the O_CLOEXEC macro is defined.
Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* kwolf/for-anthony:
qemu-iotests: skip 039 with ./check -nocache
block: add BLOCK_O_CHECK for qemu-img check
qcow2: mark image clean after repair succeeds
qed: mark image clean after repair succeeds
blockdev: flip default cache mode from writethrough to writeback
virtio-blk: disable write cache if not negotiated
virtio-blk: support VIRTIO_BLK_F_CONFIG_WCE
qemu-iotests: Save some sed processes
ahci: Fix sglist memleak in ahci_dma_rw_buf()
ahci: Fix ahci cdrom read corruptions for reads > 128k
virtio-blk: fix use-after-free while handling scsi commands
* bonzini/scsi-next:
scsi-disk: add support for the UNMAP command
scsi-disk: improve out-of-range LBA detection for WRITE SAME
scsi-disk: more assertions and resets for aiocb
virtio-scsi: do not compare 32-bit QEMU tags against 64-bit virtio-scsi tags
iscsi: Pick default initiator-name based on the name of the VM
iscsi: reorganize code for parse_initiator_name
iscsi: do not leak initiator_name
Image formats with a dirty bit, like qed and qcow2, repair dirty image
files upon open with BDRV_O_RDWR. Performing automatic repair when
qemu-img check runs is not ideal because the bdrv_open() call repairs
the image before the actual bdrv_check() call from qemu-img.c.
Fix this "double repair" since it leads to confusing output from
qemu-img check. Tell the block driver that this image is being opened
just for bdrv_check(). This skips automatic repair and qemu-img.c can
invoke it manually with bdrv_check().
Update the golden output for qemu-iotests 039 to reflect the new
qemu-img check output.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The dirty bit is cleared after image repair succeeds in qcow2_open().
Move this into qcow2_check() so that all callers benefit from this
behavior when fix mode is enabled.
This is necessary so qemu-img check can call .bdrv_check() and mark the
image clean.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The dirty bit is cleared after image repair succeeds in qed_open().
Move this into qed_check() so that all callers benefit from this
behavior when fix=true.
This is necessary so qemu-img check can call .bdrv_check() and mark the
image clean.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch updates the iscsi layer to automatically pick a 'unique'
initiator-name based on the name of the vm in case the user has not set
an explicit iqn-name to use.
Create a new function qemu_get_vm_name() that returns the name of the VM,
if specified.
This way we can thus create default names to use as the initiator name
based on the guest session.
If the VM is not named via the '-name' command line argument, the iscsi
initiator-name used wiull simply be
iqn.2008-11.org.linux-kvm
If a name for the VM was specified with the '-name' option, iscsi will
use a default initiatorname of
iqn.2008-11.org.linux-kvm:<name>
These names are just the default iscsi initiator name that qemu will
generate/use only when the user has not set an explicit initiator name
via the commandlines or config files.
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
The argument of iscsi_create_context is never freed by libiscsi,
which in fact calls strdup on it. Avoid a leak.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Lazy refcounts is a performance optimization for qcow2 that postpones
refcount metadata updates and instead marks the image dirty. In the
case of crash or power failure the image will be left in a dirty state
and repaired next time it is opened.
Reducing metadata I/O is important for cache=writethrough and
cache=directsync because these modes guarantee that data is on disk
after each write (hence we cannot take advantage of caching updates in
RAM). Refcount metadata is not needed for guest->file block address
translation and therefore does not need to be on-disk at the time of
write completion - this is the motivation behind the lazy refcount
optimization.
The lazy refcount optimization must be enabled at image creation time:
qemu-img create -f qcow2 -o compat=1.1,lazy_refcounts=on a.qcow2 10G
qemu-system-x86_64 -drive if=virtio,file=a.qcow2,cache=writethrough
Update qemu-iotests 031 and 036 since the extension header size changes
when we add feature bit table entries.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch adds an incompatible feature bit to mark images that have not
been closed cleanly. When a dirty image file is opened a consistency
check and repair is performed.
Update qemu-iotests 031 and 036 since the extension header size changes
when we add feature bit table entries.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
vvfat creates a virtual VFAT filesystem with a certain logical
geometry that depends on its options. It sets the "geometry hint" to
this geometry. It is the only block driver to do this.
The geometry hint is about about *physical* geometry, and used only by
certain hard disk device models.
vvfat's hint is normally invisible for device models, because
bdrv_open() puts a raw format on top of vvfat's fat protocol. That
raw format is where drive_init() puts the user's geometry (if any),
and where the device model gets it from.
Nobody complained, because the default physical geometry is the same
as vvfat's logical geometry:
opts LCHS def. PCHS
1024,16,63 same
:32: 1024,16,63 same
:16: 1024,16,63 same
:12: 64,16,63 same
Except when you specify :floppy:
opts LCHS def. PCHS
:floppy: 80, 2,36 5,16,63
:32:floppy: 80, 2,36 5,16,63
:16:floppy: 80, 2,36 5,16,63
:12:floppy: 80, 2,18 2,16,63
Silly thing to do for use with a hard disk.
However, the "raw" format can be suppressed by adding an
redundant-looking "format=vvfat" to "file=fat:FOO". Then, vvfat's
hint clobbers the user's geometry, i.e. -drive options cyls, heads,
secs get silently ignored. Don't do that.
No change without format=vvfat. With it, the user's hard disk
geometry (-drive options cyls, heads, secs) is now obeyed, and the
default hard disk geometry with :floppy: now matches the one without
format=vvfat.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Unless parameter ":floppy:" is given, vvfat creates a virtual image
with DOS MBR defining a single partition which holds the FAT file
system. The size of the virtual image depends on the width of the
FAT: 32 MiB (CHS 64, 16, 63) for 12 bit FAT, 504 MiB (CHS 1024, 16,
63) for 16 and 32 bit FAT, leaving (64*16-1)*63 = 64449 and
(1024*16-1)*64 = 1032129 sectors for the partition.
However, it screws up the end of the partition in the MBR:
FAT width param. start CHS end CHS start LBA size
:32: 0,1,1 1023,14,63 63 1032065
:16: 0,1,1 1023,14,55 63 1032057
:12: 0,1,1 63,14,55 63 64377
The actual FAT file system nevertheless assumes the partition has
1032129 or 64449 sectors. Oops.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Only buffers that map to unallocated blocks need to be zeroed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* mjt/mjt-iov2:
rewrite iov_send_recv() and move it to iov.c
cleanup qemu_co_sendv(), qemu_co_recvv() and friends
export iov_send_recv() and use it in iov_send() and iov_recv()
rename qemu_sendv to iov_send, change proto and move declarations to iov.h
change qemu_iovec_to_buf() to match other to,from_buf functions
consolidate qemu_iovec_copy() and qemu_iovec_concat() and make them consistent
allow qemu_iovec_from_buffer() to specify offset from which to start copying
consolidate qemu_iovec_memset{,_skip}() into single function and use existing iov_memset()
rewrite iov_* functions
change iov_* function prototypes to be more appropriate
virtio-serial-bus: use correct lengths in control_out() message
Conflicts:
tests/Makefile
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
* kwolf/for-anthony: (24 commits)
block: Factor bdrv_read_unthrottled() out of guess_disk_lchs()
qtest: Tidy up temporary files properly
fdc: Drop broken code for user-defined floppy geometry
fdc_test: introduce test_sense_interrupt
fdc_test: update media_change test
fdc: fix interrupt handling
fdc: rewrite seek and DSKCHG bit handling
block: introduce bdrv_swap, implement bdrv_append on top of it
block: copy over job and dirty bitmap fields in bdrv_append
raw: hook into blkdebug
blkdebug: optionally tie errors to a specific sector
blkdebug: store list of active rules
blkdebug: pass getlength to underlying file
blkdebug: tiny cleanup
blkdebug: remove sync i/o events
sheepdog: traverse pending_list from the first for each time
sheepdog: split outstanding list into inflight and pending
sheepdog: make sure we don't free aiocb before sending all requests
sheepdog: use coroutine based socket functions in coroutine context
sheepdog: restart I/O when socket becomes ready in do_co_req()
...
This makes blkdebug scripts more powerful, and independent of the
exact sequence of operations performed by streaming.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This prepares for the next patch, where some active rules may actually
not trigger depending on input to readv/writev. Store the active rules
in a SIMPLEQ (so that it can be emptied easily with QSIMPLEQ_INIT), and
fetch the errno/once/immediately arguments from there.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is required when using blkdebug with raw format. Unlike qcow2/QED,
raw asks blkdebug for the length of the file, it doesn't get it from
a header.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
These are unused, except (by mistake more or less) in QED.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The pending list can be modified in other coroutine context
sd_co_rw_vector, so we need to traverse the list from the first again
after we send the pending request.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
outstanding_list_head is used for both pending and inflight requests.
This patch splits it and improves readability.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch increments the pending counter before sending requests, and
make sures that aiocb is not freed while sending them.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This removes blocking network I/Os in coroutine context.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Currently, no one reenters the yielded coroutine. This fixes it.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This fixes warnings about dprintf format in debug mode.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When qcow2_alloc_clusters() error handling code was introduced in commit
5d757b563d, the value of free_byte_offset
was clobbered in the error case. This patch keeps free_byte_offset at 0
so we will try to allocate clusters again next time this function is
called.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The DEBUG_ALLOC qcow2.h macro enables additional consistency checks
throughout the code. This makes it easier to spot corruptions that are
introduced during development. Since consistency check is an expensive
operation the DEBUG_ALLOC macro is used to compile checks out in normal
builds and qcow2_check_refcounts() calls missed the addition of a new
function argument.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If the device we open is a SMC or SSC device, then force the use of sg. We
dont have any medium changer or tape emulation so only passthrough via
real sg or scsi-generic via iscsi would work anyway.
Forcing sg also makes qemu skip trying to read from the device to guess
the image format by reading from the device (find_image_format()).
SMC devices do not implement READ6/10/12/16 so it is not possible to
read from them (SSC have different CDBs).
With this patch I can successfully manage a SMC device wiht iscsi in
passthrough mode.
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
[Added TYPE_TAPE handling - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Update iscsi to allow passthrough of SG_IO scsi commands when the iscsi
device is forced to be scsi-generic.
Implement both bdrv_ioctl() and bdrv_aio_ioctl() in the iscsi backend,
emulate the SG_IO ioctl and pass the SCSI commands across to the
iscsi target.
This allows end-to-end passthrough of SCSI all the way from the guest,
to qemu, via scsi-generic, then libiscsi all the way to the iscsi target.
To activate this you need to specify that the iscsi lun should be treated
as a scsi-generic device.
Example:
-device lsi -device scsi-generic,drive=MyISCSI \
-drive file=iscsi://10.1.1.125/iqn.ronnie.test/1,if=none,id=MyISCSI
Note, you can currently not boot a qemu guest from a scsi device.
Note,
This only works when the host is linux, since the emulation relies on
definitions of SG_IO from the scsi-generic implementation in the
linux kernel.
It should be fairly easy to re-implement some structures similar enough
for non-linux hosts to do the same style of passthrough via a fake
scsi generic layer and libiscsi if need be.
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the declaration of s into the #ifdef sections that actually make
use of it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
The autoclear feature bits can be used for qcow2 file format features
that are safe to "drop" by old programs that do not understand the
feature. Upon opening the image file unknown autoclear feature bits are
cleared and the image file header is rewritten, but this was happening
too early in the code when critical header fields were not yet loaded.
Process autoclear feature bits after all necessary header information
has been loaded.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
avail_sectors should really be the number of sectors from the start of
the allocation, not from the start of the write request.
We're lucky enough that this mistake didn't cause any real bug.
avail_sectors is only used in the intialiser of QCowL2Meta:
.nb_available = MIN(requested_sectors, avail_sectors),
m->nb_available in turn is only used for COW at the end of the
allocation. A COW occurs only if the request wasn't cluster aligned,
which in turn would imply that requested_sectors was less than
avail_sectors (both in the original and in the fixed version). In this
case avail_sectors is ignored and therefore the mistake doesn't cause
any misbehaviour.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
copy_sectors() always uses the sum (cluster_offset + n_start) or
(start_sect + n_start), so if some value is added to both cluster_offset
and start_sect, and subtracted from n_start, it's cancelled out anyway.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Writethrough does not need special-casing anymore in the qcow2 caches.
The block layer adds flushes after every guest-initiated data write,
and these will also flush the qcow2 caches to the OS.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Writeback caching was added in Ceph 0.46, and writethrough will be in
0.47. These are controlled by general config options, so there's no
need to check for librbd version.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When any inconsistencies have been fixed, print the statistics and run
another check to make sure everything is correct now.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The QED block driver already provides the functionality to not only
detect inconsistencies in images, but also fix them. However, this
functionality cannot be manually invoked with qemu-img, but the
check happens only automatically during bdrv_open().
This adds a -r switch to qemu-img check that allows manual invocation
of an image repair.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
is_allocated_base has complex semantics that are not really usable
outside streaming. Split the check in two parts, where the allocated
state for the top bs is moved to the caller. The resulting function
is more generally useful.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Either FIEMAP, or SEEK_DATA+SEEK_HOLE can be used to implement the
is_allocated callback for raw files. On Linux ext4, btrfs and XFS
all support it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit 3948d1d4 removed the pointer argument we filled in with l2_offset
but forgot to remove the unnecessary l2_offset assignment.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Some gcc versions seem not to be able to figure out that the switch
statement covers all possible values and that c is therefore always
initialised. Add a default branch for them.
Reported-by: malc <av1474@comtv.ru>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: malc <av1474@comtv.ru>
The same as for non-coroutine versions in previous
patches: rename arguments to be more obvious, change
type of arguments from int to size_t where appropriate,
and use common code for send and receive paths (with
one extra argument) since these are exactly the same.
Use common iov_send_recv() directly.
qemu_co_sendv(), qemu_co_recvv(), and qemu_co_recv()
are now trivial #define's merely adding one extra arg.
qemu_co_sendv() and qemu_co_recvv() callers are
converted to different argument order and extra
`iov_cnt' argument.
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
It now allows specifying offset within qiov to start from and
amount of bytes to copy. Actual implementation is just a call
to iov_to_buf().
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
qemu_iovec_concat() is currently a wrapper for
qemu_iovec_copy(), use the former (with extra
"0" arg) in a few places where it is used.
Change skip argument of qemu_iovec_copy() from
uint64_t to size_t, since size of qiov itself
is size_t, so there's no way to skip larger
sizes. Rename it to soffset, to make it clear
that the offset is applied to src.
Also change the only usage of uint64_t in
hw/9pfs/virtio-9p.c, in v9fs_init_qiov_from_pdu() -
all callers of it actually uses size_t too,
not uint64_t.
One added restriction: as for all other iovec-related
functions, soffset must point inside src.
Order of argumens is already good:
qemu_iovec_memset(QEMUIOVector *qiov, size_t offset,
int c, size_t bytes)
vs:
qemu_iovec_concat(QEMUIOVector *dst,
QEMUIOVector *src,
size_t soffset, size_t sbytes)
(note soffset is after _src_ not dst, since it applies to src;
for memset it applies to qiov).
Note that in many places where this function is used,
the previous call is qemu_iovec_reset(), which means
many callers actually want copy (replacing dst content),
not concat. So we may want to add a wrapper like
qemu_iovec_copy() with the same arguments but which
calls qemu_iovec_reset() before _concat().
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Similar to
qemu_iovec_memset(QEMUIOVector *qiov, size_t offset,
int c, size_t bytes);
the new prototype is:
qemu_iovec_from_buf(QEMUIOVector *qiov, size_t offset,
const void *buf, size_t bytes);
The processing starts at offset bytes within qiov.
This way, we may copy a bounce buffer directly to
a middle of qiov.
This is exactly the same function as iov_from_buf() from
iov.c, so use the existing implementation and rename it
to qemu_iovec_from_buf() to be shorter and to match the
utility function.
As with utility implementation, we now assert that the
offset is inside actual iovec. Nothing changed for
current callers, because `offset' parameter is new.
While at it, stop using "bounce-qiov" in block/qcow2.c
and copy decrypted data directly from cluster_data
instead of recreating a temp qiov for doing that.
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
This patch combines two functions into one, and replaces
the implementation with already existing iov_memset() from
iov.c.
The new prototype of qemu_iovec_memset():
size_t qemu_iovec_memset(qiov, size_t offset, int fillc, size_t bytes)
It is different from former qemu_iovec_memset_skip(), and
I want to make other functions to be consistent with it
too: first how much to skip, second what, and 3rd how many
of it. It also returns actual number of bytes filled in,
which may be less than the requested `bytes' if qiov is
smaller than offset+bytes, in the same way iov_memset()
does.
While at it, use utility function iov_memset() from
iov.h in posix-aio-compat.c, where qiov was used.
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
In snapshot mode, bdrv_open creates an empty temporary file without
checking for mkstemp or close failure, and ignoring the possibility
of a buffer overrun given a surprisingly long $TMPDIR.
Change the get_tmp_filename function to return int (not void),
so that it can inform its two callers of those failures.
Also avoid the risk of buffer overrun and do not ignore mkstemp
or close failure.
Update both callers (in block.c and vvfat.c) to propagate
temp-file-creation failure to their callers.
get_tmp_filename creates and closes an empty file, while its
callers later open that presumed-existing file with O_CREAT.
The problem was that a malicious user could provoke mkstemp failure
and race to create a symlink with the selected temporary file name,
thus causing the qemu process (usually root owned) to open through
the symlink, overwriting an attacker-chosen file.
This addresses CVE-2012-2652.
http://bugzilla.redhat.com/CVE-2012-2652
Signed-off-by: Jim Meyering <meyering@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_save_vmstate and bdrv_load_vmstate should return the vmstate size
on success, and -errno on error.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* kwolf/for-anthony:
fdc-test: introduced qtest no_media_on_start and cmos qtest for floppy
fdc: fix media detection
fdc: floppy drive should be visible after start without media
qemu-iotests: mark 035 qcow2-only
qcow2: Check qcow2_alloc_clusters_at() return value
sheepdog: use heap instead of stack for BDRVSheepdogState
sheepdog: return -errno on error
sheepdog: mark image as snapshot when tag is specified
qemu-img: Explain how rebase operation can be used to perform a 'diff' operation.
qcow2: don't leak buffer for unexpected qcow_version in header
This allows using LUNs bigger than 2TB. Keep using READ10 for other
device types such as MMC.
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
This is needed to avoid READ CAPACITY(16) for MMC devices.
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Call qemu_notify_event() after updating events. Otherwise, If we add
an event for -is-writeable but the socket is already writeable there
may be a delay before the event callback is actually triggered.
Those delays would in particular hurt performance during BIOS boot and
when the GRUB bootloader reads the kernel and initrd.
But first call out to the socket write functions directly, and only set up
the write event if the socket is full. This will happen very rarely and
this improves performance.
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
When using qcow2_alloc_clusters_at(), the cluster allocation code
checked the wrong variable for an error code.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_create() is called in coroutine context now, so we cannot use
more stack than 1 MB in the function if we use ucontext coroutine.
This patch allocates BDRVSheepdogState, whose size is 4 MB, on the
heap in sd_create().
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
On error, BlockDriver APIs should return -errno instead of -1.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When a snapshot tag is specified in the filename, the opened image is
a snapshot.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Unallocated sectors should really never be accessed by the guest,
so there's no need to copy them during the streaming process.
If they are read by the guest during streaming, guest-initiated
copy-on-read will copy them (we're in the base == NULL case, which
enables copy on read). If they are read after we disconnect the
image from the base, they will read as zeroes anyway.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This fixes inability to make progress in streaming if the quota is set
to less than the amount of data that an I/O operation has to write.
In this case, limit->dispatched + n will always be above the quota and,
due to the "goto retry" to recheck cancellation and allocation, streaming
will livelock.
This can be reproduced with "block_job_set_speed ide0-hd0 1b". Of course,
with this patch the requested limit will not be obeyed. That could be
done with another patch that caps is_allocated's n argument by the slice
quota.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When an image is modified to point to the new backing file, the backing
file format is set to NULL, which means auto-probe. This is wrong, in
fact it is a small security problem.
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The limitation on not having I/O after cancellation cannot really be
kept. Even streaming has a very small race window where you could
cancel a job and have it report completion. If this window is hit,
bdrv_change_backing_file() will yield and possibly cause accesses to
dangling pointers etc.
So, let's just assume that we cannot know exactly what will happen
after the coroutine has set busy to false. We can set a very lax
condition:
- if we cancel the job, the coroutine won't set it to false again
(and hence will not call co_sleep_ns again).
- block_job_cancel_sync will wait for the coroutine to exit, which
pretty much ensures no race.
Instead, we track the coroutine that executes the job and put very
strict conditions on what to do while it is quiescent (busy = false).
First of all, the coroutine must never set busy = false while the job
has been cancelled. Second, the coroutine can be reentered arbitrarily
while it is quiescent, so you cannot really do anything but co_sleep_ns at
that time. This condition is obeyed by the block_job_sleep_ns function.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This function abstracts the pretty complex semantics of the "busy"
member of BlockJob.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
QED's opaque data includes a pointer back to the BlockDriverState.
This breaks when bdrv_append shuffles data between bs_new and bs_top.
To avoid this, add a "rebind" function that tells the driver about
the new relationship between the BlockDriverState and its opaque.
The patch also adds rebind to VVFAT for completeness, even though
it is not used with live snapshots.
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
These are needed to print "info block" output correctly. QCOW2 does this
because it needs it to write the header, but QED does not, and common code
is the right place to do it.
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This check applies to all drivers, but QED lacks it.
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* kwolf/for-anthony:
fdc: simplify media change handling
qcow2: lock on prealloc
block: make bdrv_create adopt coroutine
qcow2: Limit COW to where it's needed
sheepdog: switch to writethrough mode if cluster doesn't support flush
preallocate() will be locked. This is required because
qcow2_alloc_cluster_link_l2() assumes that it runs under a lock that it
can drop while COW is being performed.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This fixes a regression introduced in commit 250196f1. The bug leads to
data corruption, found during an Autotest run with a Fedora 8 guest.
Consider a write request whose first part is covered by an already
allocated cluster, but additional clusters need to be newly allocated.
When counting the number of clusters to allocate, the qcow2 code would
decide to do COW for all remaining clusters of the write request, even
if some of them are already allocated.
If during this COW operation another write request is issued that touches
the same cluster, it will still refer to the old cluster. When the COW
completes, the first request will update the L2 table and the second
write request will be lost. Note that the requests need not overlap, it's
enough for them to touch the same cluster.
This patch ensures that only clusters that really require COW are
considered for allocation. In this case any other request writing to the
same cluster will be an allocating write and gets serialised.
Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Tested-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is necessary for qemu to work with the older version of Sheepdog
which doesn't support SD_OP_FLUSH_VDI.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Update the configure test for libiscsi support to detect version 1.3
or later. Version 1.3 of libiscsi provides both READCAPACITY16 as well
as UNMAP commands.
Update the iscsi block layer to use READCAPACITY16 to detect the size of
the LUN instead of READCAPACITY10. This allows support for LUNs larger
than 2TB.
Update to implement bdrv_aio_discard() using the UNMAP command.
This allows us to use thin-provisioned LUNs from TGTD and other iSCSI
targets that support thin-provisioning.
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
[squashed in subsequent patch from Ronnie to fix off-by-one in LBA count]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Change the write flag to an operation type in RBDAIOCB, and make the
buffer optional since discard doesn't use it.
Discard is first included in librbd 0.1.2 (which is in Ceph 0.46).
If librbd is too old, leave out qemu_rbd_aio_discard entirely,
so the old behavior is preserved.
Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If cache references are held while the coroutine has yielded, the cache
may get used up and abort() when it can't find a free entry.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use __APPLE__ and __MACH__ macros instead of CONFIG_COCOA to detect Mac
OS X host. The patch is based on Ben Leslie's patch:
http://patchwork.ozlabs.org/patch/97859/
Signed-off-by: Ben Leslie <benno@benno.id.au>
Signed-off-by: Pavel Borzenkov <pavel.borzenkov@gmail.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Andreas Färber <andreas.faerber@web.de>
* qmp/queue/qmp:
qapi: fix qmp_balloon() conversion
qemu-iotests: add block-stream speed value test case
block: add 'speed' optional parameter to block-stream
block: change block-job-set-speed argument from 'value' to 'speed'
block: use Error mechanism instead of -errno for block_job_set_speed()
block: use Error mechanism instead of -errno for block_job_create()
Allow streaming operations to be started with an initial speed limit.
This eliminates the window of time between starting streaming and
issuing block-job-set-speed. Users should use the new optional 'speed'
parameter instead so that speed limits are in effect immediately when
the job starts.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
There are at least two different errors that can occur in
block_job_set_speed(): the job might not support setting speeds or the
value might be invalid.
Use the Error mechanism to report the error where it occurs.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
The block job API uses -errno return values internally and we convert
these to Error in the QMP functions. This is ugly because the Error
should be created at the point where we still have all the relevant
information. More importantly, it is hard to add new error cases to
this case since we quickly run out of -errno values without losing
information.
Go ahead and use Error directly and don't convert later.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
s->sock is assigned only afterwards, so we're really registering an
aio_fd_handler for file descriptor 0 here. Not exactly what we intended.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kwolf/for-anthony: (38 commits)
qemu-iotests: Fix test 031 for qcow2 v3 support
qemu-iotests: Add -o and make v3 the default for qcow2
qcow2: Zero write support
qemu-iotests: Test backing file COW with zero clusters
qemu-iotests: add a simple test for write_zeroes
qcow2: Support for feature table header extension
qcow2: Support reading zero clusters
qcow2: Version 3 images
qcow2: Ignore reserved bits in check_refcounts
qcow2: Ignore reserved bits in refcount table entries
qcow2: Simplify count_cow_clusters
qcow2: Refactor qcow2_free_any_clusters
qcow2: Ignore reserved bits in L1/L2 entries
qcow2: Fail write_compressed when overwriting data
qcow2: Ignore reserved bits in count_contiguous_clusters()
qcow2: Ignore reserved bits in get_cluster_offset
qcow2: Save disk size in snapshot header
Specification for qcow2 version 3
qcow2: Fix refcount block allocation during qcow2_alloc_cluster_at()
iotests: Resolve test failures caused by hostname
...
Instead of printing an ugly bitmask, qemu can now print a more helpful
string even for yet unknown features.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This adds the basic infrastructure to qcow2 to handle version 3 images.
It includes code to create v3 images, allow header updates for v3 images
and checks feature bits.
It still misses support for zero clusters, so this is not a fully
compliant implementation of v3 yet.
The default for creating new images stays at v2 for now.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Also don't infer the cluster type directly from the L2 entries, but use
qcow2_get_cluster_type() to keep everything in a single place.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
count_cow_clusters() tries to reuse existing functions, and all it
achieves is to make things much more complicated than they really are:
Everything needs COW, unless it's a normal cluster with refcount 1.
This patch implements the obvious way of doing this, and by using
qcow2_get_cluster_type() it gets rid of all flag magic.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Zero clusters will add another cluster type. Refactor the open-coded
cluster type detection into a switch of QCOW2_CLUSTER_* options so that
the detection is in a single place. This makes it easier to add new
cluster types.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This changes the still existing places that assume that the only flags
are QCOW_OFLAG_COPIED and QCOW_OFLAG_COMPRESSED to properly mask out
reserved bits.
It does not convert bdrv_check yet.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2_alloc_compressed_cluster_offset() already fails if the copied flag
is set, because qcow2_write_compressed() doesn't perform COW as it would
have to do to allow this.
However, what we really want to check here is whether the cluster is
allocated or not. With internal snapshots the copied flag may not be set
on allocated clusters. Check the cluster offset instead.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Until now, count_contiguous_clusters() has an argument that allowed to
specify flags that should be ignored in the comparison, i.e. that are
allowed to change between contiguous clusters.
This patch changes the function so that it ignores all flags by default
now and you need to pass the flags on which it should stop.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
With this change, reading from a qcow2 image ignores all reserved bits
that are set in an L1 or L2 table entry.
Now get_cluster_offset() assigns *cluster_offset only the offset without
any other flags. The cluster type is not longer encoded in the offset,
but a positive return value in case of success.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This allows that different snapshots of an image can have different
sizes, which is a requirement for enabling image resizing even with
images that have internal snapshots.
We don't do the actual support for it now, but make sure that the
additional field is present and not completely ignored in all version 3
images. When trying to load a snapshot of different size, it returns
an error.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Refcount block allocation and refcount table growth rely on
s->free_cluster_index pointing to somewhere after the current
allocation. Change qcow2_alloc_cluster_at() to fulfill this
assumption.
Without this change it could happen that a newly allocated refcount
block and the allocated data block point to the same area in the image
file, causing data corruption in the long run.
This fixes a bug that became first visible after commit 250196f1.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Right now, nbd_wr_sync will hang if no data at all is available on the
socket and the other side is not going to provide any. Relax this by
making it loop only for writes or partial reads. This fixes a race
where one thread is executing qemu_aio_wait() and another is executing
main_loop_wait(). Then, the select() call in main_loop_wait() can return
stale data and call the "readable" callback with no data in the socket.
Reported-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In the next patch we need to look at the return code of nbd_wr_sync.
To avoid percolating the socket_error() ugliness all around, let's
handle errors by returning negative errno values.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Someone forgot something in commit 29c1a730... Documenting the right
return value is not enough, you also need to actually return it in the
code.
This bug sometimes causes error return values even when everything has
succeeded: The new offset of the refcount block is truncated to 32 bits
and interpreted as signed. At least with small cluster sizes it's easy
to get a negative return value this way.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
If do_alloc_cluster_offset() fails, the error handling code tried to
remove the request from the in-flight queue, to which it wasn't added
yet, resulting in a NULL pointer dereference.
m->nb_clusters really only becomes != 0 when the request is in the list.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* kwolf/for-anthony: (46 commits)
qed: remove incoming live migration blocker
qed: honor BDRV_O_INCOMING for incoming live migration
migration: clear BDRV_O_INCOMING flags on end of incoming live migration
qed: add bdrv_invalidate_cache to be called after incoming live migration
blockdev: open images with BDRV_O_INCOMING on incoming live migration
block: add a function to clear incoming live migration flags
block: Add new BDRV_O_INCOMING flag to notice incoming live migration
block stream: close unused files and update ->backing_hd
qemu-iotests: Fix call syntax for qemu-io
qemu-iotests: Fix call syntax for qemu-img
qemu-iotests: Test unknown qcow2 header extensions
qemu-iotests: qcow2.py
sheepdog: fix send req helpers
sheepdog: implement SD_OP_FLUSH_VDI operation
block: bdrv_append() fixes
qed: track dirty flag status
qemu-img: add dirty flag status
qed: image fragmentation statistics
qemu-img: add image fragmentation statistics
block: document job API
...
From original commit with Patchwork-id: 31108 by
Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
"The QED image format includes a file header bit to mark images dirty.
QED normally checks dirty images on open and fixes inconsistent
metadata. This is undesirable during live migration since the dirty bit
may be set if the source host is modifying the image file. The check
should be postponed until migration completes.
Skip operations that modify the image file if the BDRV_O_INCOMING flag
is set."
Signed-off-by: Benoit Canet <benoit.canet@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The QED image is reopened to flush metadata and check consistency.
Signed-off-by: Benoit Canet <benoit.canet@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Close the now unused images that were part of the previous backing file
chain and adjust ->backing_hd, backing_filename and backing_format
properly.
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=801449
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We should return if reading of the header fails.
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Acked-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Flush operation is supposed to flush the write-back cache of
sheepdog cluster.
By issuing flush operation, we can assure the Guest of data
reaching the sheepdog cluster storage.
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There is no need to do this in every implementation of set_speed
(even though there is only one right now).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Streaming can issue I/O while qcow2_close is running. This causes the
L2 caches to become very confused or, alternatively, could cause a
segfault when the streaming coroutine is reentered after closing its
block device. The fix is to cancel streaming jobs when closing their
underlying device.
The cancellation must be synchronous, on the other hand qemu_aio_wait
will not restart a coroutine that is sleeping in co_sleep. So add
a flag saying whether streaming has in-flight I/O. If the busy flag
is false, the coroutine is quiescent and, when cancelled, will not
issue any new I/O.
This protects streaming against closing, but not against deleting.
We have a reference count protecting us against concurrent deletion,
but I still added an assertion to ensure nothing bad happens.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Finally reindent all code and change goto statements to a loop.
Acked-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reads and writes to the underlying file can also occur with the simple
non-vectored I/O interfaces.
Acked-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
vdi.c really works as if it implemented bdrv_read and bdrv_write. However,
because only vector I/O is supported by the asynchronous callbacks, it
went through extra pain to bounce-buffer the I/O. This can be handled
by the block layer now that the format is coroutine-based.
Acked-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Most of the AIOCB really holds local variables that need to persist
across callback invocation. It can go away now.
Acked-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Now inline the former AIO callbacks into vdi_co_readv and vdi_co_writev.
While many cleanups are possible, the code now really looks synchronous.
Acked-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The next step is to take code that only triggers after the first operation,
and move it at the end of vdi_aio_read_cb and vdi_aio_write_cb.
Acked-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Even a basic conversion changing the bdrv_aio_readv/bdrv_aio_writev calls
to bdrv_co_readv/bdrv_co_writev, and callbacks to goto statements can
eliminate a lot of code. This is because error handling is simplified
and indirections through bottom halves can go away.
After this patch, I/O to the underlying file already happens via
coroutines, but the code still looks a lot like if asynchronous I/O was
being used.
Acked-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
After validation check, the 'checksum' is not written back
to footer, which leave it with zero.
This results in errors while loadding it under Microsoft's
Hyper-V environment, and also errors from utilities like
Citrix's vhd-util.
Signed-off-by: Zhang Shengju <sean_zhang@trendmicro.com.cn>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since everything goes through the cache, callers don't use the L2 table
offset any more.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
The function usleep is not available for all supported platforms:
at least some versions of MinGW don't support it.
usleep was also declared obsolete by POSIX.1-2001.
The function g_usleep is part of glib2.0, so it is available for
all supported platforms.
Using nanosleep would also be possible but needs more code.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
If the first part of a write request is allocated, but the second isn't
and it can be allocated so that the resulting area is contiguous, handle
it at once. This is a common case for sequential writes.
After this patch, alloc_cluster_offset() only checks if the clusters are
already allocated or how many new clusters can be allocated contigouosly.
The actual cluster allocation is split off into a new function
do_alloc_cluster_offset().
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
This function allows to allocate clusters at a given offset in the image
file. This is useful if you want to allocate the second part of an area
that must be contiguous.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
qemu-img resize has some limitations with qcow2, but the user is only
told that "this image format does not support resize". Quite confusing,
so add some more detailed error_report() calls and change "this image
format" into "this image".
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
The L2 table cache reduces QED metadata reads that would be required
when translating LBAs to offsets into the image file. Since requests
execute in parallel it is possible to share an L2 table between multiple
requests.
There is a potential data corruption issue when an in-use L2 table is
evicted from the cache because the following situation occurs:
1. An allocating write performs an update to L2 table "A".
2. Another request needs L2 table "B" and causes table "A" to be
evicted.
3. A new read request needs L2 table "A" but it is not cached.
As a result the L2 update from #1 can overlap with the L2 fetch from #3.
We must avoid doing overlapping I/O requests here since the worst case
outcome is that the L2 fetch completes before the L2 update and yields
stale data. In that case we would effectively discard the L2 update and
lose data clusters!
Thanks to Benoît Canet <benoit.canet@gmail.com> for extensive testing
and debugging which lead to discovery of this bug.
Reported-by: Benoît Canet <benoit.canet@gmail.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Tested-by: Benoît Canet <benoit.canet@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Image files that make qemu-img info read several gigabytes into the
unknown header extensions list are bad. Just fail opening the image
if an extension claims to be larger than the header extension area.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
The spec says that the length of extensions is padded to 8 bytes, not
the offset. Currently this is the same because the header size is a
multiple of 8, so this is only about compatibility with future changes
to the header size.
While touching it, move the calculation to a common place instead of
duplicating it for each header extension type.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
The co_recv coroutine has two things that will try to enter it:
1. The select(2) read callback on the sheepdog socket.
2. The aio_add_request() blocking operations, including a coroutine
mutex.
This patch fixes it by setting NULL to co_recv before sending data.
In future, we should make the sheepdog driver fully coroutine-based
and simplify request handling.
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If we want header extensions to work as compatible extensions, we can't
destroy yet unknown header extensions when rewriting the header (e.g.
for changing the backing file). Save all unknown header extensions in a
list of blobs and include them in a new header.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In order to switch the backing file, qcow2 issues multiple write
requests that only changed a part of the image header. Any failure after
the first one would leave the header in an corrupted state. With this
patch, the whole header is written at once, so we can't fail in the
middle.
At the same time, this gives us a reusable functions that updates all
fields of the qcow2 header and not only the backing file.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The geometry calculation algorithm from the VHD spec rounds the image
size down if it doesn't exactly match a geometry. During image
conversion, this causes the image to be truncated. For dynamic images,
we already have code in place to round up instead, let's do the same for
fixed images.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The Virtual Hard Disk Image Format Specification allows for three
types of hard disk formats, Fixed, Dynamic, and Differencing. Qemu
currently only supports Dynamic disks. This patch adds support for
the Fixed Disk format.
Usage:
Example 1: qemu-img create -f vpc -o type=fixed <filename> [size]
Example 2: qemu-img convert -O vpc -o type=fixed <input filename> <output filename>
While it is also allowed to specify '-o type=dynamic', the default disk type
remains Dynamic and is what is used when the type is left unspecified.
Signed-off-by: Charles Arnold <carnold@suse.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch adds configuration variables for iSCSI to set
initiator-name to use when logging in to the target,
which type of header-digest to negotiate with the target
and username and password for CHAP authentication.
This allows specifying a initiator-name either from the command line
-iscsi initiator-name=iqn.2004-01.com.example:test
or from a configuration file included with -readconfig
[iscsi]
initiator-name = iqn.2004-01.com.example:test
header-digest = CRC32C|CRC32C-NONE|NONE-CRC32C|NONE
user = CHAP username
password = CHAP password
If you use several different targets, you can also configure this on a per
target basis by using a group name:
[iscsi "iqn.target.name"]
...
The configuration file can be read using -readconfig.
Example :
qemu-system-i386 -drive file=iscsi://127.0.0.1/iqn.ronnie.test/1
-readconfig iscsi.conf
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Zero writes are a dedicated interface for writing regions of zeroes into
the image file. If clusters are not yet allocated it is possible to use
an efficient metadata representation which keeps the image file compact
and does not store individual zero bytes.
Implementing this for the QED image format is fairly straightforward.
The only issue is that when a zero write touches an existing cluster we
have to allocate a bounce buffer and perform a regular write.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Per-request attributes like read/write are currently implemented as bool
fields in the QEDAIOCB struct. This becomes unwiedly as the number of
attributes grows. For example, the qed_aio_setup() function would have
to take multiple bool arguments and at call sites it would be hard to
distinguish the meaning of each bool.
Instead use a flags field with bitmask constants. This will be used
when zero write support is added.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since common file operation functions lack of error detection and use
much more I/O syscalls, so change them to bdrv series functions and
reduce I/O request.
Signed-off-by: Li Zhi Hui <zhihuili@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The new block was filled with zero when it was allocated by g_malloc0,
but when it was reused later and only partially used, data from the
previously allocated block were still present and written to the new
block.
This caused the problems reported by bug #919242
(https://bugs.launchpad.net/qemu/+bug/919242).
Now the unused parts of the new block which are before and after the data
are always filled with zero, so it is no longer necessary to zero the whole
block with g_malloc0.
I also updated the copyright comment.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add support for streaming data from an intermediate section of the
image chain (see patch and documentation for details).
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch implements rate-limiting for image streaming. If we've
exceeded the bandwidth quota for a 100 ms time slice we sleep the
coroutine until the next slice begins.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Most of the codebase as been converted to use glib memory allocation
functions. There are still a few instances of malloc/calloc in the
block layer and qemu-io. Replace them, especially since they do not
check the strdup/malloc/calloc return value.
Reported-by: Dr David Alan Gilbert <davidagilbert@uk.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
All files under GPLv2 will get GPLv2+ changes starting tomorrow.
event_notifier.c and exec-obsolete.h were only ever touched by Red Hat
employees and can be relicensed now.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Allow sending up to 16 requests, and drive the replies to the coroutine
that did the request. The code is written to be exactly the same as
before this patch when MAX_NBD_REQUESTS == 1 (modulo the extra mutex
and state).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
qemu-nbd has a limit of slightly less than 1M per request. Work
around this in the nbd block driver.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Outside coroutines, avoid busy waiting on EAGAIN by temporarily
making the socket blocking.
The API of qemu_recvv/qemu_sendv is slightly different from
do_readv/do_writev because they do not handle coroutines. It
returns the number of bytes written before encountering an
EAGAIN. The specificity of yielding on EAGAIN is entirely in
qemu-coroutine.c.
Reviewed-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The caller expects psn_tab to be NULL when there are no snapshots or
an error occurs. This results in calling g_free on an invalid address.
Reported-by: Oliver Francke <Oliver@filoo.de>
Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Li Zhi Hui <zhihuili@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Initially done with the following semantic patch:
@ rule1 @
expression E;
statement S;
@@
E = qemu_aio_get (...);
(
- if (E == NULL) { ... }
|
- if (E)
{ <... S ...> }
)
which however missed occurrences in linux-aio.c and posix-aio-compat.c.
Those were done by hand.
The change in vdi_aio_setup's caller was also done by hand.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Initially done with the following semantic patch:
@ rule1 @
expression E;
statement S;
@@
E =
(
bdrv_aio_readv
| bdrv_aio_writev
| bdrv_aio_flush
| bdrv_aio_discard
| bdrv_aio_ioctl
)
(...);
(
- if (E == NULL) { ... }
|
- if (E)
{ <... S ...> }
)
which however missed the occurrence in block/blkverify.c
(as it should have done), and left behind some unused
variables.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Double semicolons should be single.
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Now that bdrv_co_is_allocated() is available we can use it instead of
the synchronous bdrv_is_allocated() interface. This is a follow-up that
Kevin Wolf <kwolf@redhat.com> pointed out after applying the series that
introduces bdrv_co_is_allocated().
It is safe to make cow_read() a coroutine_fn because its only caller is
a coroutine_fn.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It's common to wake up all waiting coroutines. Introduce the
qemu_co_queue_restart_all() function to do this instead of looping over
qemu_co_queue_next() in every caller.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The cow block driver does not keep internal state for cluster lookups.
This means it is safe to perform cluster lookups in coroutine context
without risk of race conditions that corrupt internal state.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It is trivial to switch from the synchronous .bdrv_is_allocated()
interface to .bdrv_co_is_allocated() since vdi_is_allocated() does not
block.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It is trivial to switch from the synchronous .bdrv_is_allocated()
interface to .bdrv_co_is_allocated() since vvfat_is_allocated() does not
block.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The qcow2, qcow, and vmdk block drivers are based on coroutines. They have a
coroutine mutex which protects internal state. We can convert the
.bdrv_is_allocated() function to .bdrv_co_is_allocated() by holding the mutex
around the cluster lookup operation.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The bdrv_qed_is_allocated() function is a synchronous wrapper around
qed_find_cluster(), which performs the cluster lookup. In order to
convert the synchronous function to a coroutine function we yield
instead of using qemu_aio_wait(). Note that QED's cache is already safe
for parallel requests so no locking is needed.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If the bdrv_read() of the snapshot's L1 table fails, return the right
error code and make sure that the old L1 table is still loaded and we
don't break the BlockDriverState completely.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
First the snapshot must be deleted and only then the refcounts can be
decreased.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
The refcount updates must be moved so that in the worst case we can get
cluster leaks, but refcounts may never be too low.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Besides fixing the return code, this adds some comments that make clear
how the code works and that it potentially breaks images if we fail in
the wrong place. Actually fixing this is left for the next patch.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Increase refcounts only after allocating a new L1 table has succeeded in
order to make leaks less likely. If writing the snapshot table fails,
revert in-memory state to be consistent with that on disk.
While at it, make it return the real error codes instead of -1.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
sn->id_str could be leaked before this. The rest of this patch changes
comments, fixes coding style or removes checks that are unnecessary with
g_malloc.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Failing in the middle wouldn't help with the integrity of the image, so
doing everything in a single request seems better.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Doesn't immediately fix anything as the callers don't use the return
value, but they will be fixed next.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Looks better when reviewing these source files.
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since common file operation functions lack of error detection,
so change them to bdrv series functions.
Signed-off-by: Li Zhi Hui <zhihuili@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch is only to refactor some lines of codes to get better and more robust codes.
As you have seen, in qed_read_table_cb() it's nice to
use qiov->size because that function doesn't obviously use a single
struct iovec.
In other two functions, if qiov use more than one struct iovec, the existing way will get wrong nb_sectors.
To make the code more robust, it will be nicer to refactor the existing way as below.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Acked-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
A BlockDriverState should not issue requests on itself through the
public block layer interface. Nested, or reentrant, requests are
problematic because they do I/O throttling and request tracking twice.
Features like block layer copy-on-read use request tracking to avoid
race conditions between concurrent requests. The reentrant request will
have to "wait" for its parent request to complete. But the parent is
waiting for the reentrant request to make progress so we have reached
deadlock.
The solution is for block drivers to avoid the public block layer
interfaces for reentrant requests. Instead they should call their own
internal functions if they wish to perform reentrant requests.
This is also a good opportunity to make copy_sectors() a true
coroutine_fn. That means calling bdrv_co_writev() instead of
bdrv_write(). Behavior is unchanged but we're being explicit that this
executes in coroutine context.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Unlocking during COW allows for more parallelism. One change it requires is
that buffers are dynamically allocated instead of just using a per-image
buffer.
While touching the code, drop the synchronous qcow2_read() function and replace
it by a bdrv_read() call.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
vvfat caches more or less everything when in writable mode. For migration
to work, it would have to be invalidated. Block migration for now when
in writable mode (default is readonly).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
vdi caches the block map. For migration to work, it would have to be
invalidated. Block migration for now.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
s->lock should be unlocked before leaving add_aio_request.
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
zlib.h is not a local include file, therefore it should be included
using <> instead of "".
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Now when you try to migrate with qed, you get:
(qemu) migrate tcp:localhost:1025
Block format 'qed' used by device 'ide0-hd0' does not support feature 'live migration'
(qemu)
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
We don't reopen the actual file, but instead invoke the close and open routines.
We specifically ignore the backing file since it's contents are read-only and
therefore immutable.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
qcow2 has a writeback metadata cache, so flushing a qcow2 image actually
consists of writing back that cache to the protocol and only then flushes the
protocol in order to get everything stable on disk.
This introduces a separate bdrv_co_flush_to_os to reflect the split.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There are two different types of flush that you can do: Flushing one level up
to the OS (i.e. writing data to the host page cache) or flushing it all the way
down to the disk. The existing functions flush to the disk, reflect this in the
function name.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The Data Offset field in the Dynamic Disk Header is an 8 byte field.
Although the specification (2006-10-11) gives an example of initializing
only the first 4 bytes, images generated by Microsoft on Windows initialize
all 8 bytes.
Failure to initialize all 8 bytes results in errors from utilities
like Citrix's vhd-util which checks specifically for the proper Data
Offset field initialization.
Signed-off-by: Charles Arnold <carnold@suse.com>
Reviewed-by: Andreas Färber <afaerber@suse.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
vvfat used to directly call into the qcow2 block driver instead of using the
block.c wrappers. With the coroutine conversion, this stopped working.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
First determine FAT12/16/32, then compute geometry from that for both
FDD and HDD. For 1.44MB floppies, and 2.88MB floppies using FAT16,
change to 1 sector/cluster. The default remains 2.88MB with FAT12
and 2 sectors/cluster. Both DOS and mkdosfs by default format a 2.88MB
floppy as FAT12.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The sector count is stored in the partition and hence must not include the
sectors before its start. At the same time, remove the useless special
casing for 1.44 MB floppies. This fixes fsck on VVFAT hard disks,
which otherwise tries to seek past the end of the disk.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is consistent with what "real" floppies have, so file(1)
now actually recognizes the VVFAT image as a 1.44 MB floppy.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If the number of "faked sectors" + the number of sectors that are
part of a cluster does not sum up to the total number of sectors,
qemu-img convert fails. Read these spare sectors as all zeros.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When reading the address of the first free entry, you cannot
use array_get without first marking all entries as occupied.
This is visible if you change the sectors per cluster on a
floppy from 2 to 1.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Fix mismatching allocation and deallocation: g_free should be used to pair with
g_malloc.
Reviewed-by: Andreas Färber <afaerber@suse.de>
Reviewed_by: Ray Wang <raywang@linux.vnet.ibm.com>
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Fix coding style in block/cloop.c.
Reviewed-by: Andreas Färber <afaerber@suse.de>
Reviewed_by: Ray Wang <raywang@linux.vnet.ibm.com>
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If qcow2_cache_flush failed, s->lock will not be unlock.
Signed-off-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Data we read from the disk isn't necessarily null terminated and may not
contain the string we're looking for. The code needs to be a bit more careful
here.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
An entry in the VDI block map will hold an offset to the actual block if
the block is allocated, or one of two specially-interpreted values if
not allocated. Using VirtualBox terminology, value VDI_IMAGE_BLOCK_FREE
(0xffffffff) represents a never-allocated block (semantically arbitrary
content). VDI_IMAGE_BLOCK_ZERO (0xfffffffe) represents a "discarded"
block (semantically zero-filled). block/vdi knows only about
VDI_IMAGE_BLOCK_FREE. Teach it about VDI_IMAGE_BLOCK_ZERO.
Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This provides built-in support for iSCSI to QEMU.
This has the advantage that the iSCSI devices need not be made visible to the host, which is useful if you have very many virtual machines and very many iscsi devices.
It also has the benefit that non-root users of QEMU can access iSCSI devices across the network without requiring root privilege on the host.
This driver interfaces with the multiplatform posix library for iscsi initiator/client access to iscsi devices hosted at
git://github.com/sahlberg/libiscsi.git
The patch adds the driver to interface with the iscsi library.
It also updated the configure script to
* by default, probe is libiscsi is available and if so, build
qemu against libiscsi.
* --enable-libiscsi
Force a build against libiscsi. If libiscsi is not available
the build will fail.
* --disable-libiscsi
Do not link against libiscsi, even if it is available.
When linked with libiscsi, qemu gains support to access iscsi resources such as disks and cdrom directly, without having to make the devices visible to the host.
You can specify devices using a iscsi url of the form :
iscsi://[<username>[:<password>@]]<host>[:<port]/<target-iqn-name>/<lun>
When using authentication, the password can optionally be set with
LIBISCSI_CHAP_PASSWORD="password" to avoid it showing up in the process list
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
'ret' is unconditionally overwitten by qed_read_l1_table_sync()
Spotted by Clang Analyzer
Signed-off-by: Pavel Borzenkov <pavel.borzenkov@gmail.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Spotted by Clang Analyzer
[Note this memcpy call has always been safe because the length will be 0
when the pointer is NULL]
Signed-off-by: Pavel Borzenkov <pavel.borzenkov@gmail.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Since coroutine operation is now mandatory, convert both bdrv_discard
implementations to coroutines. For qcow2, this means taking the lock
around the operation. raw-posix remains synchronous.
The bdrv_discard callback is then unused and can be eliminated.
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since coroutine operation is now mandatory, convert all bdrv_flush
implementations to coroutines. For qcow2, this means taking the lock.
Other implementations are simpler and just forward bdrv_flush to the
underlying protocol, so they can avoid the lock.
The bdrv_flush callback is then unused and can be eliminated.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This does the first part of the conversion to coroutines, by
wrapping bdrv_write implementations to take the mutex.
Drivers that implement bdrv_write rather than bdrv_co_writev can
then benefit from asynchronous operation (at least if the underlying
protocol supports it, which is not the case for raw-win32), even
though they still operate with a bounce buffer.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>