Subject: Re: amrestore problem, headers ok but no data
From: Eric Siegerman <erics AT telepres DOT com>
To: amanda-users AT amanda DOT org
Date: Sat, 8 Jan 2005 18:20:59 -0500
On Fri, Jan 07, 2005 at 03:11:41PM -0500, Jon LaBadie wrote:
> That ENOMEM, reported as "insufficient memory" sometimes used to
> throw me for a loop.  Here is the situation as I understand it.
> 
> To enhance performance dd tries to do unbuffered I/O, meaning the
> data goes directly to memory in dd rather than through buffers
> in the OS and then to dd.  An upshot of this is that the buffer
> dd reads into must be at least as big as a block on the device.
> As there is no OS buffering, dd must get the entire block from
> the device in one shot, no taking little chunks at a time.  The
> devices do not send bytes on request, they send blocks.

This is correct, except for one minor quibble -- the claim that
"dd *tries* to do unbuffered I/O".  As I understand it, dd has no
choice in the matter.  Character devices (usually?) behave as
you've described:  there's no buffering within the kernel; the
data flows more-or-less directly from the hardware into the
process's user-space buffer, and vice versa during writes.
Further, each read() or write() call by the process is typically
translated into a single hardware I/O request.  This imposes a
restriction on the user process: the call must read and write
only whole hardware blocks.  Specifically:
  (a) each read() or write() call must start on a block boundary
  (b) it must also end on a block boundary

For character, aka "raw", disk devices, the read() and write()
cases are identical:
  (a) at the time of any read() or write() call, the seek offset
      must be a multiple of the disk's sector size
  (b) the requested length must (typically?) be a multiple of the
      same
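
To make that concrete, here's a minimal sketch of a raw-disk read
that obeys both rules.  The device path (/dev/rdisk0) and the
512-byte sector size are just assumptions for illustration; some
systems also want the buffer itself suitably aligned, which I'm
not showing:

        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        #define SECTOR  512                     /* assumed sector size */

        int main(void)
        {
            char buf[8 * SECTOR];               /* room for 8 whole sectors */
            int fd = open("/dev/rdisk0", O_RDONLY);  /* hypothetical raw device */
            if (fd < 0) { perror("open"); return 1; }

            /* (a) seek to a sector boundary -- here, sector 100 */
            if (lseek(fd, (off_t)100 * SECTOR, SEEK_SET) == (off_t)-1) {
                perror("lseek"); return 1;
            }

            /* (b) ask for a whole number of sectors -- here, 8 of them */
            ssize_t n = read(fd, buf, sizeof buf);
            if (n < 0) { perror("read"); return 1; }

            printf("read %ld bytes\n", (long)n);
            close(fd);
            return 0;
        }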

When writing to a traditional, variable-length tape device, each
write() call writes exactly one physical block to the tape:
  (a) the new block starts wherever the head was positioned at
      the time, or somewhere after that if the drive had to skip
      past a bad spot on the tape
  (b) the length of the new block is precisely the length passed
      to write()
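
For instance (a minimal sketch; the no-rewind device name
/dev/nrmt0 and the 32-KB block size are just assumptions), three
write() calls like these put exactly three 32-KB physical blocks
on the tape:

        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        #define BLKSIZE (32 * 1024)             /* one physical block per write() */

        int main(void)
        {
            static char block[BLKSIZE];
            int fd = open("/dev/nrmt0", O_WRONLY);   /* hypothetical tape device */
            if (fd < 0) { perror("open"); return 1; }

            memset(block, 'x', sizeof block);

            /* Three calls, three physical blocks, each exactly BLKSIZE long. */
            for (int i = 0; i < 3; i++) {
                if (write(fd, block, sizeof block) != (ssize_t)sizeof block) {
                    perror("write"); return 1;
                }
            }
            close(fd);
            return 0;
        }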

When reading from such a tape device, each read() call reads
exactly one physical block from the tape:
  (a) at the start of the call, the heads are guaranteed *not* to
      be positioned in the middle of a block; they might be
      precisely at the start of the next block to be read, or
      there might first be a bad spot in the tape to be skipped
      over, which doesn't contain any data.

      The hardware enforces this constraint, perhaps with help
      from the driver.  Unlike disk devices, tape devices aren't
      seekable; thus the user code has no way to position the
      tape incorrectly.  (Skipping backwards and forwards is
      always done in whole blocks -- hence the names,
      "backward/forward skip *record*".)

  (b) the read buffer must be at least large enough to contain
      the entire block.  The (single) block will be read in its
      entirety, and its actual length returned by read().
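
In code, the read side looks something like this minimal sketch
(again assuming the hypothetical /dev/nrmt0): the buffer is
bigger than any block we expect on the tape, and each read()
hands back exactly one block, whose real length is the return
value.  Make the buffer too small and the call fails outright --
which is where the "insufficient memory" from earlier in this
thread comes from:

        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        #define MAXBLK  (64 * 1024)     /* larger than any block we expect */

        int main(void)
        {
            static char buf[MAXBLK];
            int fd = open("/dev/nrmt0", O_RDONLY);   /* hypothetical tape device */
            if (fd < 0) { perror("open"); return 1; }

            ssize_t n;
            /* One physical block per read(); n is that block's actual
             * length.  read() returns 0 at a filemark. */
            while ((n = read(fd, buf, sizeof buf)) > 0)
                printf("got one block of %ld bytes\n", (long)n);

            if (n < 0)
                perror("read");
            close(fd);
            return 0;
        }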


Block devices are a different story; they go through the buffer
cache, the same as regular disk files do, so none of the above
restrictions apply.  The cost of that convenience is lower
performance; the kernel breaks up large read() and write() calls
into cache-buffer-sized chunks, so you might have to wait almost
an entire disk rotation for each sector, whereas with the
character device, you'd only have to pay that cost once for the
entire read() or write() call.  (Roughly analogous to
"shoeshining" a tape drive vs. letting it stream :-))

(I think Linux is a bit weird in this respect; its disk special
files are block devices, but seem to behave more like character
devices.  Let me be perfectly clear:  I don't know this for sure.
I'm inferring it from Linus's claim that "dump" can't get a
consistent view of a Linux file system.  If Linux's block disk
devices *acted* like block devices, AFAICT his claim would have
to be false.  But the claim must be true -- I mean, if anyone
knows, it's Linus, right?  Hence the initial assumption must be
false.  QED.  (His more general dissing of "dump" isn't fact,
it's opinion, and so has no bearing here.))

I remember once -- probably back in V6 days -- seeing a UNIX with
both block and character special files for tapes.  That is, each
drive had (at least) all of these variants:

            rewinding   non-rewinding
            ---------   -------------
block       /dev/mt0    /dev/nmt0
character   /dev/rmt0   /dev/nrmt0

The character version of a tape device worked as described above.
The block version went through the buffer cache like any other
block device, which resulted in tapes with 512-byte blocks, no
matter how much you write()ed -- uh, "wrote()"? :-) -- in one
call.  That'd waste a *lot* of tape; it's not surprising that I
haven't seen a block-special file for a tape in a very long time.

The only optimization that's left for dd to perform is a small one:
if ibs and obs are the same, it can save a tiny amount of CPU
time by not using an inner loop; it can just do something like
this (omitting all the error checking and handling for clarity):
        ssize_t actual;                         /* bytes from the last read() */
        long wholeBlocksRead = 0, partialBlocksRead = 0;
        long wholeBlocksWritten = 0, partialBlocksWritten = 0;

        while ((actual = read(infd, buf, bs)) > 0) {
            if (actual == bs)
                ++wholeBlocksRead;
            else
                ++partialBlocksRead;

            if (write(outfd, buf, actual) == actual)
                ++wholeBlocksWritten;
            else
                ++partialBlocksWritten;
        }

The variables being incremented are the source of the stats dd
prints at the end.

The optimization is so small that in practice, dd implementations
might not bother; they might just fold the ibs==obs case into one
of the other two cases.

If ibs and obs differ, the code has to be more complicated: a
bunch of small read()s to fill up a larger output buffer, or a
bunch of small write()s to empty out a larger input buffer
(possibly with padding and syncing and other data-munging if
specified, but none of that's relevant to this thread).
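
Here's a minimal sketch of the second case -- input blocks
larger than output blocks (say ibs=32k, obs=512): one read()
fills the big input buffer, then a run of small write()s drains
it.  The sizes are just examples, and the padding/conversion
machinery is left out, as above:

        #include <unistd.h>

        #define IBS  (32 * 1024)        /* input block size */
        #define OBS  512                /* output block size */

        static void split_output(int infd, int outfd)
        {
            char ibuf[IBS];
            ssize_t got;

            while ((got = read(infd, ibuf, IBS)) > 0) {
                /* Drain the input buffer OBS bytes at a time; the last
                 * piece may be shorter if the read itself came up short. */
                ssize_t off = 0;
                while (off < got) {
                    ssize_t chunk = got - off;
                    if (chunk > OBS)
                        chunk = OBS;
                    write(outfd, ibuf + off, chunk);
                    off += chunk;
                }
            }
        }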

> So if dd is left with a default 512 byte "ibs", input block size,
> but the device is using a larger block size, like an amanda tape
> of 32k, dd has allocated a 512 byte piece of memory to hold the
> input data.  But when dd requests the first block it unexpectedly
> gets 32k of data and has "insufficient memory" (ENOMEM).

Just so.  Or maybe an "invalid argument" (EINVAL) :-)
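
For the record, a minimal sketch of that failing case (same
hypothetical /dev/nrmt0, tape written with 32k blocks): the
buffer is dd's default 512 bytes, and the read() fails outright
instead of returning a partial block, with errno set to ENOMEM or
EINVAL depending on the system:

        #include <errno.h>
        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
            char buf[512];                      /* dd's default ibs */
            int fd = open("/dev/nrmt0", O_RDONLY);
            if (fd < 0) { perror("open"); return 1; }

            /* The block on the tape is 32k; asking for 512 bytes fails. */
            if (read(fd, buf, sizeof buf) < 0)
                printf("read failed: %s\n", strerror(errno));

            close(fd);
            return 0;
        }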

> The reverse is not really a problem.  Suppose you said "ibs=128k".
> dd would simply read sufficient device blocks until the buffer
> was filled, four blocks in the above example.

Yes.  As you've said, it would be dd that did this, *not* the
kernel.  dd would call read() enough times -- in this case four
-- to fill the buffer.  Each call would read one 32-KB physical
tape block.

> On output dd can make its own adjustments.  If the obs is larger,
> it can move multiple input buffers to the output buffer before
> doing the write.  If the reverse is true, input block size larger
> than output, it can copy part of the input block to the output
> buffer and do multiple outputs from a single input buffer.

Yes, except that in neither case does it need to copy the data
from one buffer to another.  It can just have a single buffer
that's max(ibs,obs) long, and do a number of read()s at
appropriate offsets within that one buffer, then one write() of
the whole thing; or vice versa.  The only time dd needs to copy
data internally is when it's doing more complex manipulations.
That's what the "conversion buffer", whose size is given by the
"cbs=" argument, is for; and why the man page bothers to discuss
when the conversion buffer is or is not needed.
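
A minimal sketch of that single-buffer trick, for the case where
ibs is smaller than obs (sizes again just examples): each read()
lands at the next free offset within the one buffer, and a single
write() pushes the whole thing out -- no internal copying
required:

        #include <unistd.h>

        #define IBS  512
        #define OBS  (32 * 1024)        /* the larger of the two */

        static void copy_without_shuffling(int infd, int outfd)
        {
            char buf[OBS];              /* one buffer, max(ibs,obs) bytes long */
            size_t filled = 0;
            ssize_t n;

            while ((n = read(infd, buf + filled, IBS)) > 0) {
                filled += (size_t)n;
                if (filled + IBS > OBS) {       /* no room for another read */
                    write(outfd, buf, filled);  /* one write of the whole lot */
                    filled = 0;
                }
            }
            if (filled > 0)
                write(outfd, buf, filled);      /* final, possibly short, buffer */
        }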

--

|  | /\
|-_|/  >   Eric Siegerman, Toronto, Ont.        erics AT telepres DOT com
|  |  /
The animal that coils in a circle is the serpent; that's why so
many cults and myths of the serpent exist, because it's hard to
represent the return of the sun by the coiling of a hippopotamus.
        - Umberto Eco, "Foucault's Pendulum"