On complex operating systems such as Unix, there is a frequent need to copy data from one place in memory to another. This can happen in order to copy userland data to the kernel, on architectures where the two use separate address spaces; to coalesce data from various sources when building a data structure returned to userland; or to copy data back and forth between memory and bounce buffers located in key memory areas, for the sake of DMA controllers which have limited-width address buses and cannot access all of the system's memory.
In the early days of Berkeley Unix, the byte (or block) copy routine was called bcopy. At the same time, the AT&T Unix flavour grew a similar copy routine called memcpy. They were identical, except that their arguments were swapped: bcopy takes source, destination, length ("copy source to destination for the given length"), while memcpy takes destination, source, length ("fill destination with data from source for the given length"). Both argument orders made sense, but memcpy won over bcopy in the long run, and was incorporated into the first edition of the C language standard, in 1989.
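The argument-order difference is easy to see side by side. A small sketch (copy_both_ways is a made-up name; bcopy comes from <strings.h> on systems that still provide this legacy interface):

```c
#include <string.h>   /* memcpy */
#include <strings.h>  /* bcopy: POSIX legacy, on systems that still provide it */

/* the two historical calling conventions, copying the same 5 bytes */
void copy_both_ways(const char *src, char *dst1, char *dst2)
{
    bcopy(src, dst1, 5);    /* BSD: source, destination, length */
    memcpy(dst2, src, 5);   /* AT&T/ISO C: destination, source, length */
}
```

Both calls copy the same bytes; only the reading order of the arguments differs.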
Copying memory from one zone to another is almost trivial work. There is, however, one difficult case: when the source and destination areas overlap.
The naive way to copy data would be, in pseudo-code:
while length is not 0 do
    *dst = *src        # write the byte read at src to dst
    dst = dst + 1
    src = src + 1
    length = length - 1
end while
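In C, the same naive loop would look like this (a sketch, not any particular libc's implementation):

```c
#include <stddef.h>

/* naive forward byte copy, with no regard for overlapping areas */
void naive_copy(unsigned char *dst, const unsigned char *src, size_t length)
{
    while (length != 0) {
        *dst = *src;    /* write the byte read at src to dst */
        dst = dst + 1;
        src = src + 1;
        length = length - 1;
    }
}
```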
But if the areas overlap, for example
src = 0, dst = 4, length = 8:
Initial state:

| address | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 |
| src     | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  |    |    |    |    |
| dest    |    |    |    |    |    |    |    |    |    |    |    |    |

After the first four bytes have been copied, the writes to dest have already clobbered the second half of src:

| address | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 |
| src     | 1  | 2  | 3  | 4  | 1  | 2  | 3  | 4  |    |    |    |    |
| dest    |    |    |    |    | 1  | 2  | 3  | 4  |    |    |    |    |

and at completion, dest contains a duplicated pattern instead of a copy of the original data:

| address | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 |
| src     | 1  | 2  | 3  | 4  | 1  | 2  | 3  | 4  |    |    |    |    |
| dest    |    |    |    |    | 1  | 2  | 3  | 4  | 1  | 2  | 3  | 4  |
So, in order to be able to copy things safely, these routines had to check whether the areas would overlap, and copy things differently (the easiest way being to copy backwards, from the end of the area to the beginning.)
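A minimal sketch of that overlap check and direction choice in C (safe_copy is a made-up name; note also that comparing pointers into different objects is formally undefined in ISO C, which real implementations either ignore or sidestep with integer casts):

```c
#include <stddef.h>

/* overlap-safe copy: pick the direction so that no byte is
 * overwritten before it has been read */
void safe_copy(unsigned char *dst, const unsigned char *src, size_t length)
{
    if (dst <= src) {
        while (length-- != 0)
            *dst++ = *src++;        /* forward copy */
    } else {
        dst += length;              /* backward copy, from the end */
        src += length;
        while (length-- != 0)
            *--dst = *--src;
    }
}
```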
But doing these overlap checks had a cost, and an overwhelming majority of these copy operations involved non-overlapping areas.
Because of this, on the BSD side, bcopy was split into two routines: a fast bcopy, which does not care about overlap and always copies in forward order, and a safer ovbcopy, which explicitly cares about overlap and copies either forwards or backwards depending upon its arguments. Likewise, memcpy was split into a fast memcpy (which does not care about overlap) and a safer memmove (which explicitly cares about overlap), and like memcpy, memmove was made part of the C standard.
A key point here is that BSD initially ran on the PDP-11, and then (from 3BSD onwards) on the VAX architecture.
On the VAX, bcopy and ovbcopy were implemented as wrappers around the powerful MOVC3 instruction. Its documentation states:
The destination string specified by the length and destination address operands is replaced by the source string specified by the length and source address operands. [...] The operation of the instruction is such that overlap of the source and destination strings does not affect the result.
In other words, this means that VAX bcopy was actually an ovbcopy.
Because of this feature, careless programmers could incorrectly use bcopy and still have their code work regardless of the addresses used at runtime, and explicit use of ovbcopy was rare, only serving as a hint that the author had realized that the memory areas involved might overlap.
In 2.11BSD, it was still common to see constructs such as
#define ovbcopy(a,b,c) bcopy(a,b,c)
And in userland, the implementation of bcopy for the PDP-11 (which did not have a special processor instruction to perform the copy) was written to handle overlaps correctly, in lib/libc/pdp/gen/bcopy.s:
/*
* Copyright (c) 1987 Regents of the University of California.
* All rights reserved. The Berkeley software License Agreement
* specifies the terms and conditions for redistribution.
*/
#ifdef LIBC_SCCS
<@(#)bcopy.s 1.2 (Berkeley) 8/21/87\0>
.even
#endif LIBC_SCCS
#include "DEFS.h"
/*
* bcopy(src, dst, length)
* caddr_t src, dst;
* u_int length;
*
* Copy length bytes from src to dst - handles overlapping moves. Bcopy uses
* a cost analysis heuristic to determine if it's worth while to see if a word
* copy should be tried. The test is if length > N, then try for a word
* copy. This routine will fail if N < 8. 10 is a good value for N:
*
* Algorithm Cost (in instructions)
* byte loop cost 2 * length
* word loop setup [worst case] 14 (instructions)
* word loop cost 0.625 * length, for length > 10
*
* N * 2 > 14 + N * 0.625
* N > 10
*/
#undef N
#define N 10.
ENTRY(bcopy)
mov 6(sp),r0 / if ((r0 = length) == 0)
beq 3f / return
mov r2,-(sp) / reserve a register for our use
mov 6(sp),r2 / r2 = dst
mov 4(sp),r1 / r1 = src
cmp r2,r1 / if (dst == src)
beq 2f / return
bhi 9f / if (dst > src)
/ do backwards copy
/*
* copy memory from src:r1 to dst:r2 for length:r0 bytes, FORWARDS ...
*/
cmp r0,$N / if (length >= N)
bhi 4f / try words
1:
movb (r1)+,(r2)+ / do *dst++ = *src++
sob r0,1b / while (--length)
2:
mov (sp)+,r2 / and return
3:
rts pc
/*
* The length of the copy justifies trying a word by word copy. If src and
* dst are of the same parity, we can do a word copy by handling any leading
* and trailing odd bytes separately.
*/
4:
bit $1,r1 / if (src&1 != dst&1)
beq 5f / do bytes
bit $1,r2
beq 1b / (src odd, dst even - do bytes)
movb (r1)+,(r2)+ / copy leading odd byte
dec r0
br 6f
5:
bit $1,r2
bne 1b / (src even, dst odd - do bytes)
6:
mov r0,-(sp) / save trailing byte indicator
asr r0 / length >>= 1 (unsigned)
bic $0100000,r0 / (length is now in remaining words to copy)
asr r0 / if (length >>= 1, wasodd(length))
bcc 7f / handle leading non multiple of four
mov (r1)+,(r2)+
7:
asr r0 / if (length >>= 1, wasodd(length))
bcc 8f / handle leading non multiple of eight
mov (r1)+,(r2)+
mov (r1)+,(r2)+
8:
mov (r1)+,(r2)+ / do
mov (r1)+,(r2)+ / move eight bytes
mov (r1)+,(r2)+
mov (r1)+,(r2)+
sob r0,8b / while (--length)
bit $1,(sp)+ / if (odd trailing byte)
beq 2b
movb (r1)+,(r2)+ / copy it
br 2b / and return
/*
* copy memory from src:r1 to dst:r2 for length:r0 bytes, BACKWARDS ...
*/
9:
add r0,r1 / src += length
add r0,r2 / dst += length
cmp r0,$N / if (length >= N)
bhi 4f / try words
1:
movb -(r1),-(r2) / do *--dst = *--src
sob r0,1b / while (--length)
2:
mov (sp)+,r2 / and return
3:
rts pc
/*
* The length of the copy justifies trying a word by word copy. If src and
* dst are of the same parity, we can do a word copy by handling any leading
* and trailing odd bytes separately.
*/
4:
bit $1,r1 / if (src&1 != dst&1)
beq 5f / do bytes
bit $1,r2
beq 1b / (src odd, dst even - do bytes)
movb -(r1),-(r2) / copy leading odd byte
dec r0
br 6f
5:
bit $1,r2
bne 1b / (src even, dst odd - do bytes)
6:
mov r0,-(sp) / save trailing byte indicator
asr r0 / length >>= 1 (unsigned)
bic $0100000,r0 / (length is now in remaining words to copy)
asr r0 / if (length >>= 1, wasodd(length))
bcc 7f / handle leading non multiple of four
mov -(r1),-(r2)
7:
asr r0 / if (length >>= 1, wasodd(length))
bcc 8f / handle leading non multiple of eight
mov -(r1),-(r2)
mov -(r1),-(r2)
8:
mov -(r1),-(r2) / do
mov -(r1),-(r2) / move eight bytes
mov -(r1),-(r2)
mov -(r1),-(r2)
sob r0,8b / while (--length)
bit $1,(sp)+ / if (odd trailing byte)
beq 2b
movb -(r1),-(r2) / copy it
br 2b / and return
Even if you're not fluent in PDP-11 assembly, the code is well commented, and the rationale for the size threshold above which larger-than-byte instructions are used is clear. The only possibly difficult instructions to read are the bic instructions following the asr instructions: asr performs an arithmetic shift to the right (dividing by two), preserving the sign bit if set, and the bic instruction immediately following clears the sign bit (bit 15) from the result (the 0100000 operand is octal for 32768 or, if you prefer, 0x8000 in hexadecimal.)
(I took the liberty of adding the missing "+" symbol in the "N * 2 > 14 + N * 0.625" equation, as it is not present in the 2.11BSD source.)
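The asr/bic pair above is simply building a logical (unsigned) right shift out of an arithmetic one. In C terms, for a 16-bit word (lsr1 is a made-up name, and the sketch assumes the common arithmetic behaviour of >> on signed values):

```c
#include <stdint.h>

/* logical shift right by one, the PDP-11 way: arithmetic shift,
 * then clear bit 15 in case the sign bit was duplicated into it */
uint16_t lsr1(uint16_t x)
{
    int16_t v = (int16_t)x;
    v >>= 1;                        /* asr */
    return (uint16_t)(v & 0x7fff);  /* bic $0100000 */
}
```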
Of course, there was a C version for architectures which did not have an assembly version, in lib/libc/gen/bcopy.c (note the pre-ANSI function declaration and the register keywords):
/*
* Copyright (c) 1980 Regents of the University of California.
* All rights reserved. The Berkeley software License Agreement
* specifies the terms and conditions for redistribution.
*/
#if defined(LIBC_SCCS) && !defined(lint)
static char sccsid[] = "@(#)bcopy.c 1.1 (Berkeley) 1/19/87";
#endif LIBC_SCCS and not lint
/*
* bcopy -- vax movc3 instruction
*/
bcopy(src, dst, length)
	register char *src, *dst;
	register unsigned int length;
{
	if (length && src != dst)
		if (dst < src)
			do
				*dst++ = *src++;
			while (--length);
		else {			/* copy backwards */
			src += length;
			dst += length;
			do
				*--dst = *--src;
			while (--length);
		}
	return(0);
}
When the BSD codebase was ported to more architectures, bcopy was still made identical to ovbcopy, for example for the iconic hp300 port using Motorola 680x0 processors:
/*
* {ov}bcopy(from, to, len)
*
* Works for counts up to 128K.
*/
ALTENTRY(ovbcopy, _bcopy)
ENTRY(bcopy)
movl sp@(12),d0 | get count
jeq Lcpyexit | if zero, return
movl sp@(4),a0 | src address
movl sp@(8),a1 | dest address
cmpl a1,a0 | src before dest?
jlt Lcpyback | yes, copy backwards (avoids overlap)
movl a0,d1
btst #0,d1 | src address odd?
jeq Lcfeven | no, go check dest
movb a0@+,a1@+ | yes, copy a byte
subql #1,d0 | update count
jeq Lcpyexit | exit if done
Lcfeven:
movl a1,d1
btst #0,d1 | dest address odd?
jne Lcfbyte | yes, must copy by bytes
movl d0,d1 | no, get count
lsrl #2,d1 | convert to longwords
jeq Lcfbyte | no longwords, copy bytes
subql #1,d1 | set up for dbf
Lcflloop:
movl a0@+,a1@+ | copy longwords
dbf d1,Lcflloop | til done
andl #3,d0 | get remaining count
jeq Lcpyexit | done if none
Lcfbyte:
subql #1,d0 | set up for dbf
Lcfbloop:
movb a0@+,a1@+ | copy bytes
dbf d0,Lcfbloop | til done
Lcpyexit:
rts
Lcpyback:
addl d0,a0 | add count to src
addl d0,a1 | add count to dest
movl a0,d1
btst #0,d1 | src address odd?
jeq Lcbeven | no, go check dest
movb a0@-,a1@- | yes, copy a byte
subql #1,d0 | update count
jeq Lcpyexit | exit if done
Lcbeven:
movl a1,d1
btst #0,d1 | dest address odd?
jne Lcbbyte | yes, must copy by bytes
movl d0,d1 | no, get count
lsrl #2,d1 | convert to longwords
jeq Lcbbyte | no longwords, copy bytes
subql #1,d1 | set up for dbf
Lcblloop:
movl a0@-,a1@- | copy longwords
dbf d1,Lcblloop | til done
andl #3,d0 | get remaining count
jeq Lcpyexit | done if none
Lcbbyte:
subql #1,d0 | set up for dbf
Lcbbloop:
movb a0@-,a1@- | copy bytes
dbf d0,Lcbbloop | til done
rts
In this particular case, the combination of the ALTENTRY and ENTRY macros would define the ovbcopy symbol at the exact same address as bcopy, making them truly the same routine.
(Also, the 128KB size limit mentioned in the comments is incorrect, and this may not be trivial to catch for people not familiar with m68k assembly. The dbf instruction used to control the loops (test condition, decrement and branch) looks at only the low 16 bits of its register argument, in this case d0 for the byte loops and d1 for the longword (32-bit) loops. So the real limit of this routine is 4 * 65535 bytes if it is able to perform a longword copy in Lcflloop or Lcblloop, but only 65535 bytes if it has to resort to the byte copy loops Lcfbloop and Lcbbloop. I can only guess that an earlier version of this routine only tried to do 16-bit copies and would thus be limited to 2 * 65535, which is close to 128KB; unfortunately this code was contributed to BSD by Mike Hibler from the University of Utah, and this earlier work is not freely available, let alone its source code, and I did not want to bother Mike to check the history of that routine to confirm that guess.)
Note that all these routines try to be fast by using larger-than-byte operations as soon as the source and destination addresses are suitably aligned, rather than doing byte copies in a loop. This helps for large copies (100 bytes or more), which are not unusual. It also explains why the assembly versions can be tens or hundreds of lines long.
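Stripped of the unrolling and of the backward direction, the general shape of these routines looks like this in C. This is only a sketch (aligned_copy is a made-up name, and the cast-based word access assumes the platform tolerates it with respect to strict aliasing):

```c
#include <stddef.h>
#include <stdint.h>

/* byte copy at the edges, word copy in the middle when alignment allows */
void aligned_copy(unsigned char *dst, const unsigned char *src, size_t len)
{
    /* leading bytes, until src reaches a word boundary */
    while (len > 0 && ((uintptr_t)src & (sizeof(uint32_t) - 1)) != 0) {
        *dst++ = *src++;
        len--;
    }
    /* if dst is now aligned as well, copy a word at a time */
    if (((uintptr_t)dst & (sizeof(uint32_t) - 1)) == 0) {
        while (len >= sizeof(uint32_t)) {
            *(uint32_t *)dst = *(const uint32_t *)src;
            dst += sizeof(uint32_t);
            src += sizeof(uint32_t);
            len -= sizeof(uint32_t);
        }
    }
    /* trailing bytes (or everything left, if the alignments did not match) */
    while (len-- != 0)
        *dst++ = *src++;
}
```

Real implementations also unroll the word loop and provide a mirrored backward version, which is exactly what makes them so long.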
However, Chris Torek's port to the SPARC had separate bcopy and ovbcopy routines, the former not handling overlap, as this sys/sparc/sparc/locore.s excerpt shows:
/*
* kernel bcopy/memcpy
* Assumes regions do not overlap; has no useful return value.
*
* Must not use %g7 (see copyin/copyout above).
*/
#define BCOPY_SMALL 32 /* if < 32, copy by bytes */
ENTRY(memcpy)
/*
* Swap args for bcopy. Gcc generates calls to memcpy for
* structure assignments.
*/
mov %o0, %o3
mov %o1, %o0
mov %o3, %o1
ENTRY(bcopy)
cmp %o2, BCOPY_SMALL
Lbcopy_start:
bge,a Lbcopy_fancy ! if >= this many, go be fancy.
btst 7, %o0 ! (part of being fancy)
/*
* Not much to copy, just do it a byte at a time.
*/
deccc %o2 ! while (--len >= 0)
bl 1f
EMPTY
0:
inc %o0
ldsb [%o0 - 1], %o4 ! (++dst)[-1] = *src++;
stb %o4, [%o1]
deccc %o2
bge 0b
inc %o1
1:
retl
nop
/* NOTREACHED */
/*
* Plenty of data to copy, so try to do it optimally.
*/
Lbcopy_fancy:
! check for common case first: everything lines up.
! btst 7, %o0 ! done already
bne 1f
EMPTY
btst 7, %o1
be,a Lbcopy_doubles
dec 8, %o2 ! if all lined up, len -= 8, goto bcopy_doubes
! If the low bits match, we can make these line up.
1:
xor %o0, %o1, %o3 ! t = src ^ dst;
btst 1, %o3 ! if (t & 1) {
be,a 1f
btst 1, %o0 ! [delay slot: if (src & 1)]
! low bits do not match, must copy by bytes.
0:
ldsb [%o0], %o4 ! do {
inc %o0 ! (++dst)[-1] = *src++;
inc %o1
deccc %o2
bnz 0b ! } while (--len != 0);
stb %o4, [%o1 - 1]
retl
nop
/* NOTREACHED */
! lowest bit matches, so we can copy by words, if nothing else
1:
be,a 1f ! if (src & 1) {
btst 2, %o3 ! [delay slot: if (t & 2)]
! although low bits match, both are 1: must copy 1 byte to align
ldsb [%o0], %o4 ! *dst++ = *src++;
stb %o4, [%o1]
inc %o0
inc %o1
dec %o2 ! len--;
btst 2, %o3 ! } [if (t & 2)]
1:
be,a 1f ! if (t & 2) {
btst 2, %o0 ! [delay slot: if (src & 2)]
dec 2, %o2 ! len -= 2;
0:
ldsh [%o0], %o4 ! do {
sth %o4, [%o1] ! *(short *)dst = *(short *)src;
inc 2, %o0 ! dst += 2, src += 2;
deccc 2, %o2 ! } while ((len -= 2) >= 0);
bge 0b
inc 2, %o1
b Lbcopy_mopb ! goto mop_up_byte;
btst 1, %o2 ! } [delay slot: if (len & 1)]
/* NOTREACHED */
! low two bits match, so we can copy by longwords
1:
be,a 1f ! if (src & 2) {
btst 4, %o3 ! [delay slot: if (t & 4)]
! although low 2 bits match, they are 10: must copy one short to align
ldsh [%o0], %o4 ! (*short *)dst = *(short *)src;
sth %o4, [%o1]
inc 2, %o0 ! dst += 2;
inc 2, %o1 ! src += 2;
dec 2, %o2 ! len -= 2;
btst 4, %o3 ! } [if (t & 4)]
1:
be,a 1f ! if (t & 4) {
btst 4, %o0 ! [delay slot: if (src & 4)]
dec 4, %o2 ! len -= 4;
0:
ld [%o0], %o4 ! do {
st %o4, [%o1] ! *(int *)dst = *(int *)src;
inc 4, %o0 ! dst += 4, src += 4;
deccc 4, %o2 ! } while ((len -= 4) >= 0);
bge 0b
inc 4, %o1
b Lbcopy_mopw ! goto mop_up_word_and_byte;
btst 2, %o2 ! } [delay slot: if (len & 2)]
/* NOTREACHED */
! low three bits match, so we can copy by doublewords
1:
be 1f ! if (src & 4) {
dec 8, %o2 ! [delay slot: len -= 8]
ld [%o0], %o4 ! *(int *)dst = *(int *)src;
st %o4, [%o1]
inc 4, %o0 ! dst += 4, src += 4, len -= 4;
inc 4, %o1
dec 4, %o2 ! }
1:
Lbcopy_doubles:
ldd [%o0], %o4 ! do {
std %o4, [%o1] ! *(double *)dst = *(double *)src;
inc 8, %o0 ! dst += 8, src += 8;
deccc 8, %o2 ! } while ((len -= 8) >= 0);
bge Lbcopy_doubles
inc 8, %o1
! check for a usual case again (save work)
btst 7, %o2 ! if ((len & 7) == 0)
be Lbcopy_done ! goto bcopy_done;
btst 4, %o2 ! if ((len & 4)) == 0)
be,a Lbcopy_mopw ! goto mop_up_word_and_byte;
btst 2, %o2 ! [delay slot: if (len & 2)]
ld [%o0], %o4 ! *(int *)dst = *(int *)src;
st %o4, [%o1]
inc 4, %o0 ! dst += 4;
inc 4, %o1 ! src += 4;
btst 2, %o2 ! } [if (len & 2)]
1:
! mop up trailing word (if present) and byte (if present).
Lbcopy_mopw:
be Lbcopy_mopb ! no word, go mop up byte
btst 1, %o2 ! [delay slot: if (len & 1)]
ldsh [%o0], %o4 ! *(short *)dst = *(short *)src;
be Lbcopy_done ! if ((len & 1) == 0) goto done;
sth %o4, [%o1]
ldsb [%o0 + 2], %o4 ! dst[2] = src[2];
retl
stb %o4, [%o1 + 2]
/* NOTREACHED */
! mop up trailing byte (if present).
Lbcopy_mopb:
bne,a 1f
ldsb [%o0], %o4
Lbcopy_done:
retl
nop
1:
retl
stb %o4,[%o1]
/*
* ovbcopy(src, dst, len): like bcopy, but regions may overlap.
*/
ENTRY(ovbcopy)
cmp %o0, %o1 ! src < dst?
bgeu Lbcopy_start ! no, go copy forwards as via bcopy
cmp %o2, BCOPY_SMALL! (check length for doublecopy first)
/*
* Since src comes before dst, and the regions might overlap,
* we have to do the copy starting at the end and working backwards.
*/
add %o2, %o0, %o0 ! src += len
add %o2, %o1, %o1 ! dst += len
bge,a Lback_fancy ! if len >= BCOPY_SMALL, go be fancy
btst 3, %o0
/*
* Not much to copy, just do it a byte at a time.
*/
deccc %o2 ! while (--len >= 0)
bl 1f
EMPTY
0:
dec %o0 ! *--dst = *--src;
ldsb [%o0], %o4
dec %o1
deccc %o2
bge 0b
stb %o4, [%o1]
1:
retl
nop
/*
* Plenty to copy, try to be optimal.
* We only bother with word/halfword/byte copies here.
*/
Lback_fancy:
! btst 3, %o0 ! done already
bnz 1f ! if ((src & 3) == 0 &&
btst 3, %o1 ! (dst & 3) == 0)
bz,a Lback_words ! goto words;
dec 4, %o2 ! (done early for word copy)
1:
/*
* See if the low bits match.
*/
xor %o0, %o1, %o3 ! t = src ^ dst;
btst 1, %o3
bz,a 3f ! if (t & 1) == 0, can do better
btst 1, %o0
/*
* Nope; gotta do byte copy.
*/
2:
dec %o0 ! do {
ldsb [%o0], %o4 ! *--dst = *--src;
dec %o1
deccc %o2 ! } while (--len != 0);
bnz 2b
stb %o4, [%o1]
retl
nop
3:
/*
* Can do halfword or word copy, but might have to copy 1 byte first.
*/
! btst 1, %o0 ! done earlier
bz,a 4f ! if (src & 1) { /* copy 1 byte */
btst 2, %o3 ! (done early)
dec %o0 ! *--dst = *--src;
ldsb [%o0], %o4
dec %o1
stb %o4, [%o1]
dec %o2 ! len--;
btst 2, %o3 ! }
4:
/*
* See if we can do a word copy ((t&2) == 0).
*/
! btst 2, %o3 ! done earlier
bz,a 6f ! if (t & 2) == 0, can do word copy
btst 2, %o0 ! (src&2, done early)
/*
* Gotta do halfword copy.
*/
dec 2, %o2 ! len -= 2;
5:
dec 2, %o0 ! do {
ldsh [%o0], %o4 ! src -= 2;
dec 2, %o1 ! dst -= 2;
deccc 2, %o0 ! *(short *)dst = *(short *)src;
bge 5b ! } while ((len -= 2) >= 0);
sth %o4, [%o1]
b Lback_mopb ! goto mop_up_byte;
btst 1, %o2 ! (len&1, done early)
6:
/*
* We can do word copies, but we might have to copy
* one halfword first.
*/
! btst 2, %o0 ! done already
bz 7f ! if (src & 2) {
dec 4, %o2 ! (len -= 4, done early)
dec 2, %o0 ! src -= 2, dst -= 2;
ldsh [%o0], %o4 ! *(short *)dst = *(short *)src;
dec 2, %o1
sth %o4, [%o1]
dec 2, %o2 ! len -= 2;
! }
7:
Lback_words:
/*
* Do word copies (backwards), then mop up trailing halfword
* and byte if any.
*/
! dec 4, %o2 ! len -= 4, done already
0: ! do {
dec 4, %o0 ! src -= 4;
dec 4, %o1 ! src -= 4;
ld [%o0], %o4 ! *(int *)dst = *(int *)src;
deccc 4, %o2 ! } while ((len -= 4) >= 0);
bge 0b
st %o4, [%o1]
/*
* Check for trailing shortword.
*/
btst 2, %o2 ! if (len & 2) {
bz,a 1f
btst 1, %o2 ! (len&1, done early)
dec 2, %o0 ! src -= 2, dst -= 2;
ldsh [%o0], %o4 ! *(short *)dst = *(short *)src;
dec 2, %o1
sth %o4, [%o1] ! }
btst 1, %o2
/*
* Check for trailing byte.
*/
1:
Lback_mopb:
! btst 1, %o2 ! (done already)
bnz,a 1f ! if (len & 1) {
ldsb [%o0 - 1], %o4 ! b = src[-1];
retl
nop
1:
retl ! dst[-1] = b;
stb %o4, [%o1 - 1] ! }
The i386 situation is a bit more complicated. In the CSRG repository at the University of California, Berkeley, the first stable version of locore.s (according to the commit message) only implements bcopy:
#
# bcopy (src,dst,cnt)
# NOTE: does not (yet) handle overlapped copies
#
.globl _bcopy
_bcopy:
pushl %esi
pushl %edi
movl 12(%esp),%esi
movl 16(%esp),%edi
movl 20(%esp),%ecx
# shll $2,%ecx
# cld
# rep
# movsl
# movl 20(%esp),%ecx
# andl $3,%ecx
rep
movsb
popl %edi
popl %esi
xorl %eax,%eax
ret
The implementation of ovbcopy is found near the end of the file:
_ovbcopy:
.byte 0xcc
This 0xcc byte is a debugger trap instruction (INT3); any attempt to call ovbcopy would trigger the kernel debugger, from which there would be no real way to recover.
A real ovbcopy implementation was finally added more than one year later:
#
# ovbcopy (src,dst,cnt)
# NOTE: does not (yet) work doing words at a time
#
.globl _ovbcopy
_ovbcopy:
pushl %esi
pushl %edi
movl 12(%esp),%esi
movl 16(%esp),%edi
movl 20(%esp),%ecx
addl %ecx,%esi /* copy from end to beginning */
addl %ecx,%edi
decl %esi
decl %edi
std /* decrementing as we go */
rep
movsb
xorl %eax,%eax
cld
ret
But note that this code is not correct! What it does is a backwards copy, regardless of whether the areas overlap, and this will misbehave in situations where the areas overlap but would be safe to copy forward (dst < src < dst + cnt).
If you remember my initial example of
src = 0, dst = 4, length = 8,
then let's try
src = 4, dst = 0, length = 8.
Initial state:

| address | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 |
| src     |    |    |    |    | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  |
| dest    |    |    |    |    |    |    |    |    |    |    |    |    |

A forward copy would be safe here. After the first four bytes, src is still intact:

| address | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 |
| src     |    |    |    |    | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  |
| dest    | 1  | 2  | 3  | 4  |    |    |    |    |    |    |    |    |

and at completion, dest correctly holds the original data, since the head of src is only overwritten after it has been read:

| address | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 |
| src     |    |    |    |    | 5  | 6  | 7  | 8  | 5  | 6  | 7  | 8  |
| dest    | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  |    |    |    |    |

The backwards copy performed by this ovbcopy, however, clobbers the head of src right away; after the first four bytes:

| address | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 |
| src     |    |    |    |    | 5  | 6  | 7  | 8  | 5  | 6  | 7  | 8  |
| dest    |    |    |    |    | 5  | 6  | 7  | 8  |    |    |    |    |

and it completes with a duplicated pattern in dest instead of a copy of the original data:

| address | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 |
| src     |    |    |    |    | 5  | 6  | 7  | 8  | 5  | 6  | 7  | 8  |
| dest    | 5  | 6  | 7  | 8  | 5  | 6  | 7  | 8  |    |    |    |    |
This is the code which can be found in the 0.0 release of 386BSD, which you can check here.
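The misbehaviour is easy to reproduce in C; this sketch mimics the unconditional std/rep/movsb sequence (backward_only_copy is a made-up name, keeping bcopy's source-first argument order):

```c
#include <stddef.h>

/* always copies backwards, like the 386BSD 0.0 ovbcopy above;
 * bcopy argument order: source, destination, count */
void backward_only_copy(const unsigned char *src, unsigned char *dst, size_t cnt)
{
    src += cnt;
    dst += cnt;
    while (cnt-- != 0)
        *--dst = *--src;
}
```

With src = 4, dst = 0, length = 8 as above, the destination ends up holding 5 6 7 8 5 6 7 8 instead of 1 through 8.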
However, in the next version (386BSD 0.1), bcopy and ovbcopy were replaced with a single, fixed version, correctly handling overlaps, contributed by Wolfgang Solfrank, which can be seen here:
/*
* (ov)bcopy (src,dst,cnt)
* ws@tools.de (Wolfgang Solfrank, TooLs GmbH) +49-228-985800
*/
.globl _bcopy,_ovbcopy
_ovbcopy:
_bcopy:
pushl %esi
pushl %edi
movl 12(%esp),%esi
movl 16(%esp),%edi
movl 20(%esp),%ecx
cmpl %esi,%edi /* potentially overlapping? */
jnb 1f
cld /* nope, copy forwards. */
shrl $2,%ecx /* copy by words */
rep
movsl
movl 20(%esp),%ecx
andl $3,%ecx /* any bytes left? */
rep
movsb
popl %edi
popl %esi
xorl %eax,%eax
ret
1:
addl %ecx,%edi /* copy backwards. */
addl %ecx,%esi
std
andl $3,%ecx /* any fractional bytes? */
decl %edi
decl %esi
rep
movsb
movl 20(%esp),%ecx /* copy remainder by words */
shrl $2,%ecx
subl $3,%esi
subl $3,%edi
rep
movsl
popl %edi
popl %esi
xorl %eax,%eax
cld
ret
In 386BSD 1.0, this implementation was again discarded, ovbcopy was completely removed, and an overlap-aware bcopy was implemented as an inline function in usr/src/kernel/include/i386/inline/string/bcopy.h:
/*
* Copyright (c) 1994 William F. Jolitz.
* 386BSD Copyright Restrictions Apply. All Other Rights Reserved.
*
* $Id: bcopy.h,v 1.1 94/06/09 18:19:53 bill Exp Locker: bill $
* BSD block copy.
*/
__INLINE void
bcopy(const void *fromaddr, void *toaddr, size_t maxlength)
{
	extern const int zero;			/* compiler bug workaround */
	const void *f = fromaddr + zero;	/* compiler bug workaround */

	/* possibly overlapping? */
	if ((u_long) toaddr < (u_long) fromaddr) {
		asm volatile ("cld ; repe ; movsl" :
		    "=D" (toaddr), "=S" (fromaddr) :
		    "0" (toaddr), "1" (fromaddr), "c" (maxlength / 4));
		asm volatile ("repe ; movsb" :
		    "=D" (toaddr), "=S" (fromaddr) :
		    "0" (toaddr), "1" (fromaddr), "c" (maxlength & 3));
	} else {
		toaddr += maxlength - 1;
		fromaddr += maxlength - 1;
		asm volatile ("std ; repe ; movsb" :
		    "=D" (toaddr), "=S" (fromaddr) :
		    "0" (toaddr), "1" (fromaddr), "c" (maxlength & 3));
		toaddr -= 3;
		fromaddr -= 3;
		asm volatile ("repe ; movsl ; cld" :
		    "=D" (toaddr), "=S" (fromaddr) :
		    "0" (toaddr), "1" (fromaddr), "c" (maxlength / 4));
	}
}
Anyway, the damage was done. Since bcopy used to allow overlap, and still did on the major platforms of the time (vax, hp300, mips), there was no way to trust code using bcopy to mean either ovbcopy or overlap-unsafe bcopy.
In other words, while code using ovbcopy could safely be converted to use memmove, code using bcopy could not be converted to use memcpy: it had to use memmove as well, unless there was enough confidence (from the author, or from knowledge of the code) that a non-overlapping routine was safe to use in that particular case.
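The conservative mechanical conversion is therefore to route every bcopy through memmove, in the spirit of the 2.11BSD ovbcopy define shown earlier (a sketch, not any project's actual compatibility header):

```c
#include <string.h>

/* assume any bcopy call site may involve overlapping areas */
#define bcopy(src, dst, len) memmove((dst), (src), (len))
```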
In the OpenBSD kernel, while new code would slowly prefer memcpy and memmove to bcopy and ovbcopy, there was still legacy code using bcopy. And there were ports to new architectures (such as PA-RISC or PowerPC) where people writing code for these new ports would eventually assume that it was safe to provide an overlap-safe ovbcopy routine and a faster, overlap-unsafe, bcopy routine.
And then there would be weird behaviours or memory corruption once in a blue moon, because one piece of code in the kernel had been written to use bcopy, assuming vax-times overlap handling, but was running on a more modern platform where the bcopy routine was assuming non-overlapping areas.
In order to protect against this, and until all the bcopy/ovbcopy users would get converted to memcpy/memmove, Theo de Raadt thought it would be safer to make sure all OpenBSD platforms had an overlap-safe bcopy routine, thus making bcopy identical to ovbcopy as used to be the case in the vax and hp300 days.
We were at a hackathon in Coimbra, Portugal, when he did the changes to the sparc port, removing the bcopy routine and making it an alias of ovbcopy, similarly to the hp300 code shown earlier. These changes were obviously correct.
From: Theo de Raadt
Date: Mon, 26 Nov 2007 08:13:48 +0000
To: source-changes mailing list
Subject: CVS: cvs.openbsd.org: src
CVSROOT: /cvs
Module name: src
Changes by: deraadt@cvs.openbsd.org 2007/11/26 01:13:48
Modified files:
sys/arch/sparc/sparc: locore.s
Log message:
the bcopy() found here was not handling overlapping. Merge it nicely with
the ovbcopy() code
ok miod
Yet when I updated my Tadpole SPARCbook 3 laptop I had with me, its network interface would now fail to work, with all packets having wrong checksums and being discarded in error.
A quick investigation showed that this apparently innocuous change was the cause of the problem, so it was reverted minutes later.
From: Theo de Raadt
Date: Mon, 26 Nov 2007 08:18:51 +0000
To: source-changes mailing list
Subject: CVS: cvs.openbsd.org: src
CVSROOT: /cvs
Module name: src
Changes by: deraadt@cvs.openbsd.org 2007/11/26 01:18:51
Modified files:
sys/arch/sparc/sparc: locore.s
Log message:
few mails later, miod asks me to wait
And then I added this issue to my todolist, and got carried away with other things.
Eventually, after a few months and the occasional reminder from Theo, I started investigating the problem.
It turns out that there was a small bug, possibly a typo, in the ovbcopy logic on sparc, which only affected copies done in 16-bit chunks (i.e. areas aligned to a 16-bit address, but with at least one of them not aligned to a 32-bit address). As it happens, the driver for the AMD Lance (AM7990) Ethernet controller, used on the SPARCbook (and many other SPARCstation systems) as the le(4) driver, was a heavy user of bcopy with such addresses.
The fix was a one-liner:
From: Miod Vallat
Date: Sat, 15 Mar 2008 13:28:43 +0000
To: source-changes mailing list
Subject: CVS: cvs.openbsd.org: src
CVSROOT: /cvs
Module name: src
Changes by: miod@cvs.openbsd.org 2008/03/15 07:28:43
Modified files:
sys/arch/sparc/sparc: locore.s
Log message:
After 15 years of fun, fix Torek's ovbcopy() operation when copying shorts
backwards.
If you look at the commit diff:
@@ -4995,11 +4995,11 @@ Lback_fancy:
dec 2, %o2 ! len -= 2;
5:
dec 2, %o0 ! do {
ldsh [%o0], %o4 ! src -= 2;
dec 2, %o1 ! dst -= 2;
- deccc 2, %o0 ! *(short *)dst = *(short *)src;
+ deccc 2, %o2 ! *(short *)dst = *(short *)src;
bge 5b ! } while ((len -= 2) >= 0);
sth %o4, [%o1]
b Lback_mopb ! goto mop_up_byte;
btst 1, %o2 ! (len&1, done early)
In that loop, register %o0 is the source address and register
%o2 is the length, so the code needs to operate on the length
at every loop iteration.
The deccc would subtract 2 from %o0 and update
the condition codes, causing the following bge instruction to
branch only if the computed value is greater than or equal to zero.
Regardless of whether the branch is taken, the store operation (the sth opcode, storing a halfword, thus 16 bits) is executed anyway: it sits in the branch's delay slot.
But since the deccc operation was incorrectly operating on the source
address instead, the loop would continue until the address became negative;
somehow fortunately, kernel addresses on sparc have their high bit set, so
instead of becoming an infinite loop and overflowing the destination buffer,
the loop would instead perform only one iteration and exit, copying only
2 bytes instead of the actual requested size.
That human error might have happened because the comment block on the right of the assembly instructions is misleading: the commented equivalent C statements are not in front of the assembly instructions they actually correspond to.
A possible way to rewrite that block with the comments matching the
assembly code for the loop would be:
dec 2, %o2 ! len -= 2;
5: ! do {
dec 2, %o0 ! src -= 2;
ldsh [%o0], %o4 ! tmp = *(short *)src;
dec 2, %o1 ! dst -= 2;
deccc 2, %o2 ! len -= 2;
bge 5b ! } while (len >= 0);
sth %o4, [%o1] ! delay slot: *(short *)dst = tmp;
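In plain C, the corrected loop is just a backward copy of 16-bit units; a sketch of its structure (copy_halfwords_backwards is a hypothetical helper, not OpenBSD code), with src and dst pointing just past the end of their areas:

```c
#include <stddef.h>
#include <stdint.h>

/* backward halfword copy of len bytes (len even and >= 2, areas 2-byte
 * aligned); mirrors the structure of the fixed sparc loop, including
 * the use of the length itself as the loop counter */
void copy_halfwords_backwards(uint16_t *dst, const uint16_t *src, size_t len)
{
    ptrdiff_t n = (ptrdiff_t)len - 2;   /* len -= 2; */
    do {
        *--dst = *--src;                /* *(short *)dst = *(short *)src; */
    } while ((n -= 2) >= 0);            /* } while (len >= 0); */
}
```

With the original bug, the decrement hit the source pointer instead of n, so the loop exited after a single iteration on kernel addresses with the high bit set.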
Unsurprisingly, the sparc64 port to the UltraSPARC family suffered from the same bug and received a similar fix at the same time.
I eventually wrote a quick'n'dirty test program for the memory copy routines, in order to spot bugs specific to certain alignment or sizes conditions.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
/* Comment this block out to test libc functions rather than your own impl */
#if 0
extern void asm_bcopy(const void *, void *, size_t);
#define bcopy asm_bcopy
extern void *asm_memcpy(void *, const void *, size_t);
#define memcpy asm_memcpy
extern void *asm_memmove(void *, const void *, size_t);
#define memmove asm_memmove
#endif
#define MAX_SIZE (64 * 1024)
/*
 * Test memory area. One reference area, and two playgrounds, one on each side.
 * Playgrounds are a bit larger because:
 * - we want to play with the relative alignment of src and/or dest areas.
 * - we want to check the 16 bytes before and after the dest area to make
 *   sure that no out-of-area writes occur.
 */
static struct {
	uint8_t play1[MAX_SIZE + 128];
	uint8_t ref[MAX_SIZE];
	uint8_t play2[MAX_SIZE + 128];
} d __attribute__((aligned(16)));
static char testname[128];
static unsigned int failures;
/*
* Safe non-overlapping copy routine, to be sure not to depend upon the
* code being tested.
*/
void
truthful_bcopy(uint8_t *dest, const uint8_t *src, size_t len)
{
	while (len-- != 0) {
		*dest++ = *src++;
	}
}
/*
* Compute the test description.
*/
void
test(const char *descr, size_t soffs, size_t doffs, size_t len)
{
	static const char *patterns[] = {
		"000",
		"001",
		"010",
		"011",
		"100",
		"101",
		"110",
		"111"
	};

	snprintf(testname, sizeof testname,
	    "%s, bit patterns src..%s dst..%s len..%s",
	    descr, patterns[soffs & 7], patterns[doffs & 7],
	    patterns[len & 7]);
}
void
blame(void)
{
	printf("ERROR: in test %s\n", testname);
}
/*
* Check a given 'test' area of length 'len' against 'ref', as well as the
* 16 bytes preceding it against 'before' and the 16 bytes following it
* against 'after'.
*/
int
check(const uint8_t *ref, const uint8_t *test, size_t len,
    const uint8_t *before, const uint8_t *after)
{
	size_t pos;
	int failed = 0;

	test -= 16;
	for (pos = 0; pos < 16; pos++, before++, test++) {
		if (*before != *test) {
			if (failed == 0) {
				failed = 1;
				blame();
			}
			printf("offset %zd: read %02x, expected %02x\n",
			    (ssize_t)pos - 16, *test, *before);
			failures++;
		}
	}
	for (pos = 0; pos < len; pos++, ref++, test++)
		if (*ref != *test) {
			if (failed == 0) {
				failed = 1;
				blame();
			}
			printf("offset 0x%zx: read %02x, expected %02x\n",
			    pos, *test, *ref);
			failures++;
		}
	for (pos = 0; pos < 16; pos++, after++, test++) {
		if (*after != *test) {
			if (failed == 0) {
				failed = 1;
				blame();
			}
			printf("offset size+%zu: read %02x, expected %02x\n",
			    pos, *test, *after);
			failures++;
		}
	}
	return failed;
}
/*
* Same as above, also checking the return value of the copy function.
*/
void
checkr(const uint8_t *ref, const uint8_t *test, size_t len, const void *ret,
const uint8_t *before, const uint8_t *after)
{
int failed = check(ref, test, len, before, after);
if (ret != test) {
if (failed == 0) {
blame();
}
printf("return value: got %p, expected %p\n", ret, test);
failures++;
}
}
void
test_bcopy(unsigned int lim, size_t size)
{
unsigned int i, j;
uint8_t effs[16];
memset(effs, 0xff, sizeof effs);
if (lim > 8)
lim = 8;
for (i = 0; i < lim; i++)
for (j = 0; j < lim; j++) {
test("bcopy, no overlap, src < dst", i, j, size);
/* + 32 to guarantee 16 valid ff bytes before and 16 after */
memset(d.play2, 0xff, size + 32 + lim);
/* + 16 to guarantee 16 valid ff bytes before */
bcopy(d.ref + i, d.play2 + 16 + j, size);
check(d.ref + i, d.play2 + 16 + j, size, effs, effs);
}
for (i = 0; i < lim; i++)
for (j = 0; j < lim; j++) {
test("bcopy, no overlap, dst < src", i, j, size);
/* + 32 to guarantee 16 valid ff bytes before and 16 after */
memset(d.play1, 0xff, size + 32 + lim);
/* + 16 to guarantee 16 valid ff bytes before */
bcopy(d.ref + i, d.play1 + 16 + j, size);
check(d.ref + i, d.play1 + 16 + j, size, effs, effs);
}
/*
* No overlap tests - either it's a real bcopy(), as opposed to
* ovbcopy(), in which case overlap tests do not make sense, or
* it's likely a wrapper around memmove(), for which overlap
* situations are tested below.
*/
}
void
test_memcpy(unsigned int lim, size_t size)
{
unsigned int i, j;
void *ret;
uint8_t effs[16];
memset(effs, 0xff, sizeof effs);
if (lim > 8)
lim = 8;
for (i = 0; i < lim; i++)
for (j = 0; j < lim; j++) {
test("memcpy, no overlap, src < dst", i, j, size);
/* + 32 to guarantee 16 valid ff bytes before and 16 after */
memset(d.play2, 0xff, size + 32 + lim);
/* + 16 to guarantee 16 valid ff bytes before */
ret = memcpy(d.play2 + 16 + j, d.ref + i, size);
checkr(d.ref + i, d.play2 + 16 + j, size, ret, effs, effs);
}
for (i = 0; i < lim; i++)
for (j = 0; j < lim; j++) {
test("memcpy, no overlap, dst < src", i, j, size);
/* + 32 to guarantee 16 valid ff bytes before and 16 after */
memset(d.play1, 0xff, size + 32 + lim);
/* + 16 to guarantee 16 valid ff bytes before */
ret = memcpy(d.play1 + 16 + j, d.ref + i, size);
checkr(d.ref + i, d.play1 + 16 + j, size, ret, effs, effs);
}
}
void
test_memmove(unsigned int lim, size_t size)
{
unsigned int i, j;
void *ret;
size_t doffs, soffs;
uint8_t before[16], after[16], effs[16];
memset(effs, 0xff, sizeof effs);
if (lim > 8)
lim = 8;
for (i = 0; i < lim; i++)
for (j = 0; j < lim; j++) {
test("memmove, no overlap, src < dst", i, j, size);
/* + 32 to guarantee 16 valid ff bytes before and 16 after */
memset(d.play2, 0xff, size + 32 + lim);
doffs = 16 + j;
truthful_bcopy(before, d.play2 + doffs - 16, 16);
truthful_bcopy(after, d.play2 + doffs + size, 16);
ret = memmove(d.play2 + doffs, d.ref + i, size);
checkr(d.ref + i, d.play2 + doffs, size, ret, before, after);
}
for (i = 0; i < lim; i++)
for (j = 0; j < lim; j++) {
test("memmove, no overlap, src > dst", i, j, size);
/* + 32 to guarantee 16 valid ff bytes before and 16 after */
memset(d.play1, 0xff, size + 32 + lim);
doffs = 16 + j;
truthful_bcopy(before, d.play1 + doffs - 16, 16);
truthful_bcopy(after, d.play1 + doffs + size, 16);
ret = memmove(d.play1 + doffs, d.ref + i, size);
checkr(d.ref + i, d.play1 + doffs, size, ret, before, after);
}
/*
* Overlap needing a reverse copy occurs if
* src < dst < src + len
*/
for (i = 0; i < lim; i++)
for (j = 0; j < lim; j++) {
test("memmove, overlap needing reverse copy", i, j, size);
/* + 32 to guarantee 16 valid ff bytes before and 16 after */
memset(d.play2, 0xff, size + 32 + 2 * lim);
doffs = 16 + i + j;
soffs = 16 + i;
memcpy(d.play2 + soffs, d.ref, size);
truthful_bcopy(before, d.play2 + doffs - 16, 16);
truthful_bcopy(after, d.play2 + doffs + size, 16);
ret = memmove(d.play2 + doffs, d.play2 + soffs, size);
checkr(d.ref, d.play2 + doffs, size, ret, before, after);
}
/*
* Overlap not needing a reverse copy occurs if
* dst < src < dst + len
*/
for (i = 0; i < lim; i++)
for (j = 0; j < lim; j++) {
test("memmove, overlap not needing reverse copy", i, j, size);
/* + 32 to guarantee 16 valid ff bytes before and 16 after */
memset(d.play2, 0xff, size + 32 + 2 * lim);
doffs = 16 + i;
soffs = 16 + i + j;
memcpy(d.play2 + soffs, d.ref, size);
truthful_bcopy(before, d.play2 + doffs - 16, 16);
truthful_bcopy(after, d.play2 + doffs + size, 16);
ret = memmove(d.play2 + doffs, d.play2 + soffs, size);
checkr(d.ref, d.play2 + doffs, size, ret, before, after);
}
}
int
main()
{
unsigned int i;
/*
* Compute a random pattern, and make sure none of its bytes is
* equal to 0xff, which is our "not part of the pattern" sentinel.
*/
arc4random_buf(d.ref, sizeof d.ref);
for (i = 0; i < sizeof d.ref; i++) {
if (d.ref[i] == 0xff)
d.ref[i] = 0;
}
/*
* Test all alignments with a few size combinations.
*
* Small lengths are used to check that we don't copy too much
* data if the relative alignment of src and dest allows for a
* large-copy operation.
*/
static const size_t sizes[] = {
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 32, 64, 128, 256, 512, 1024 //, 4096, 32768
};
for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
printf("Testing with size %zu\n", sizes[i]);
test_bcopy(8, sizes[i]);
test_memcpy(8, sizes[i]);
test_memmove(8, sizes[i]);
if (failures != 0)
break;
}
switch (failures) {
default:
printf("** %u TESTS FAILED **\n", failures);
break;
case 1:
printf("** 1 TEST FAILED **\n");
break;
case 0:
break;
}
/* benchmark if no errors */
if (failures == 0) {
struct timeval start, end, diff;
unsigned int i, iter, niters;
static const size_t allsizes[] = {
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 32, 64, 128, 256, 512, 1024, 4096, 8192, 32768
};
for (i = 0; i < sizeof(allsizes) / sizeof(allsizes[0]); i++) {
/* increase '100' below on fast hardware */
niters = (100 * 32768 / allsizes[i]);
gettimeofday(&start, NULL);
for (iter = niters; iter != 0; iter--) {
test_bcopy(1, allsizes[i]);
test_memcpy(1, allsizes[i]);
test_memmove(1, allsizes[i]);
}
gettimeofday(&end, NULL);
timersub(&end, &start, &diff);
printf("%u loops on %zu bytes, time elapsed: %ld seconds, %ld microseconds\n",
niters, allsizes[i], (long)diff.tv_sec, (long)diff.tv_usec);
}
}
return failures != 0;
}
This test program (and the crude benchmark capability built into it) turned out to be helpful while writing a reasonably fast yet small assembly implementation of the bcopy, memcpy and memmove routines for m88k some months ago.
In NetBSD, whose sparc port had also been based on Chris Torek's work, the ovbcopy bug remained unnoticed and unfixed until 2018, when all uses of ovbcopy were replaced with either an overlap-aware bcopy routine or memmove:
Module Name: src
Committed By: mrg
Date: Fri Mar 16 09:29:24 UTC 2018
Modified Files:
src/sys/arch/sparc/sparc: locore.s
Log Message:
remove obsolete ovbcopy(). it may be the very slightly bit
faster for larger copies, but slower for smaller ones.
i don't see any major benefit in keeping this code.
this is the final ovbcopy() reference in src. you're welcome :-)
You can now rest safely, knowing that ovbcopy no longer exists.
Well, almost. It still exists in FreeBSD, as a macro in sys/systm.h:
#define ovbcopy(f, t, l) bcopy((f), (t), (l))
...but this is only for the sake of a few files, and one can reasonably expect it to disappear completely in the near future.