On complex operating systems such as Unix, there is a frequent need to copy data from one place in memory to another. This can happen in order to copy userland data to the kernel, on architectures where the two use separate address spaces; to coalesce data from various sources when building a data structure returned to userland; or to copy data back and forth between memory and bounce buffers located in key memory areas, for the sake of DMA controllers which have limited-width address buses and cannot access all of the system's memory.
In the early days of Berkeley Unix, the byte (or block) copy routine was called bcopy. At the same time, the AT&T Unix flavour grew a similar copy routine called memcpy. They were identical, except that their arguments were swapped: bcopy takes source, destination, length ("copy source to destination for the given length"), while memcpy takes destination, source, length ("fill destination with data from source for the given length"). Both argument orders made sense, but memcpy won over bcopy in the long run, and was incorporated into the first edition of the C language standard, in 1989.
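The argument-order difference is easy to see side by side. A small sketch (copy_both_ways is a made-up name; bcopy comes from <strings.h> on systems that still provide this legacy interface):

```c
#include <string.h>   /* memcpy */
#include <strings.h>  /* bcopy: POSIX legacy, on systems that still provide it */

/* the two historical calling conventions, copying the same 5 bytes */
void copy_both_ways(const char *src, char *dst1, char *dst2)
{
    bcopy(src, dst1, 5);    /* BSD: source, destination, length */
    memcpy(dst2, src, 5);   /* AT&T/ISO C: destination, source, length */
}
```

Both calls copy the same bytes; only the reading order of the arguments differs.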
Copying memory from one zone to another is almost trivial work. There is, however, one difficult case: when the source and destination areas overlap.
The naive way to copy data would be, in pseudo-code:
while length is not 0 do
    *dst = *src        # write the byte read at src to dst
    dst = dst + 1
    src = src + 1
    length = length - 1
end while
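In C, the same naive loop would look like this (a sketch, not any particular libc's implementation):

```c
#include <stddef.h>

/* naive forward byte copy, with no regard for overlapping areas */
void naive_copy(unsigned char *dst, const unsigned char *src, size_t length)
{
    while (length != 0) {
        *dst = *src;    /* write the byte read at src to dst */
        dst = dst + 1;
        src = src + 1;
        length = length - 1;
    }
}
```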
But if the areas overlap, for example
src = 0, dst = 4, length = 8:
Initial state:

| address | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 |
| src     | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  |    |    |    |    |
| dest    |    |    |    |    |    |    |    |    |    |    |    |    |

After the first four bytes have been copied, the writes to dest have already clobbered the second half of src:

| address | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 |
| src     | 1  | 2  | 3  | 4  | 1  | 2  | 3  | 4  |    |    |    |    |
| dest    |    |    |    |    | 1  | 2  | 3  | 4  |    |    |    |    |

and at completion, dest contains a duplicated pattern instead of a copy of the original data:

| address | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 |
| src     | 1  | 2  | 3  | 4  | 1  | 2  | 3  | 4  |    |    |    |    |
| dest    |    |    |    |    | 1  | 2  | 3  | 4  | 1  | 2  | 3  | 4  |
So, in order to be able to copy things safely, these routines had to check whether the areas would overlap, and copy things differently (the easiest way being to copy backwards, from the end of the area to the beginning.)
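A minimal sketch of that overlap check and direction choice in C (safe_copy is a made-up name; note also that comparing pointers into different objects is formally undefined in ISO C, which real implementations either ignore or sidestep with integer casts):

```c
#include <stddef.h>

/* overlap-safe copy: pick the direction so that no byte is
 * overwritten before it has been read */
void safe_copy(unsigned char *dst, const unsigned char *src, size_t length)
{
    if (dst <= src) {
        while (length-- != 0)
            *dst++ = *src++;        /* forward copy */
    } else {
        dst += length;              /* backward copy, from the end */
        src += length;
        while (length-- != 0)
            *--dst = *--src;
    }
}
```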
But doing these overlap checks had a cost, and an overwhelming majority of these copy operations involved non-overlapping areas.
Because of this, on the BSD side, bcopy was split into two routines: a fast bcopy, which does not care about overlap and always copies in forward order, and a safer ovbcopy, which explicitly cares about overlap and copies either forwards or backwards depending upon its arguments. Likewise, memcpy was split into a fast memcpy (which does not care about overlap) and a safer memmove (which explicitly cares about overlap), and like memcpy, memmove was made part of the C standard.
A key point here is that BSD initially ran on the PDP-11, and then (from 3BSD onwards) on the VAX architecture.
On the VAX, bcopy and ovbcopy were implemented as wrappers around the powerful MOVC3 instruction. Its documentation states:
The destination string specified by the length and destination address operands is replaced by the source string specified by the length and source address operands. [...] The operation of the instruction is such that overlap of the source and destination strings does not affect the result.
In other words, this means that VAX bcopy was actually an ovbcopy.
Because of this feature, careless programmers could incorrectly use bcopy and still have their code work regardless of the addresses used at runtime, and explicit use of ovbcopy was rare, only serving as a hint that the author had realized that the memory areas involved might overlap.
In 2.11BSD, it was still common to see constructs such as
#define ovbcopy(a,b,c) bcopy(a,b,c)
And in userland, the implementation of bcopy for the PDP-11 (which did not have a special processor instruction to perform the copy) was written to handle overlaps correctly, in lib/libc/pdp/gen/bcopy.s:
/*
* Copyright (c) 1987 Regents of the University of California.
* All rights reserved. The Berkeley software License Agreement
* specifies the terms and conditions for redistribution.
*/
#ifdef LIBC_SCCS
<@(#)bcopy.s 1.2 (Berkeley) 8/21/87\0>
.even
#endif LIBC_SCCS
#include "DEFS.h"
/*
* bcopy(src, dst, length)
* caddr_t src, dst;
* u_int length;
*
* Copy length bytes from src to dst - handles overlapping moves. Bcopy uses
* a cost analysis heuristic to determine if it's worth while to see if a word
* copy should be tried. The test is if length > N, then try for a word
* copy. This routine will fail if N < 8. 10 is a good value for N:
*
* Algorithm Cost (in instructions)
* byte loop cost 2 * length
* word loop setup [worst case] 14 (instructions)
* word loop cost 0.625 * length, for length > 10
*
* N * 2 > 14 + N * 0.625
* N > 10
*/
#undef N
#define N 10.
ENTRY(bcopy)
mov 6(sp),r0 / if ((r0 = length) == 0)
beq 3f / return
mov r2,-(sp) / reserve a register for our use
mov 6(sp),r2 / r2 = dst
mov 4(sp),r1 / r1 = src
cmp r2,r1 / if (dst == src)
beq 2f / return
bhi 9f / if (dst > src)
/ do backwards copy
/*
* copy memory from src:r1 to dst:r2 for length:r0 bytes, FORWARDS ...
*/
cmp r0,$N / if (length >= N)
bhi 4f / try words
1:
movb (r1)+,(r2)+ / do *dst++ = *src++
sob r0,1b / while (--length)
2:
mov (sp)+,r2 / and return
3:
rts pc
/*
* The length of the copy justifies trying a word by word copy. If src and
* dst are of the same parity, we can do a word copy by handling any leading
* and trailing odd bytes separately.
*/
4:
bit $1,r1 / if (src&1 != dst&1)
beq 5f / do bytes
bit $1,r2
beq 1b / (src odd, dst even - do bytes)
movb (r1)+,(r2)+ / copy leading odd byte
dec r0
br 6f
5:
bit $1,r2
bne 1b / (src even, dst odd - do bytes)
6:
mov r0,-(sp) / save trailing byte indicator
asr r0 / length >>= 1 (unsigned)
bic $0100000,r0 / (length is now in remaining words to copy)
asr r0 / if (length >>= 1, wasodd(length))
bcc 7f / handle leading non multiple of four
mov (r1)+,(r2)+
7:
asr r0 / if (length >>= 1, wasodd(length))
bcc 8f / handle leading non multiple of eight
mov (r1)+,(r2)+
mov (r1)+,(r2)+
8:
mov (r1)+,(r2)+ / do
mov (r1)+,(r2)+ / move eight bytes
mov (r1)+,(r2)+
mov (r1)+,(r2)+
sob r0,8b / while (--length)
bit $1,(sp)+ / if (odd trailing byte)
beq 2b
movb (r1)+,(r2)+ / copy it
br 2b / and return
/*
* copy memory from src:r1 to dst:r2 for length:r0 bytes, BACKWARDS ...
*/
9:
add r0,r1 / src += length
add r0,r2 / dst += length
cmp r0,$N / if (length >= N)
bhi 4f / try words
1:
movb -(r1),-(r2) / do *--dst = *--src
sob r0,1b / while (--length)
2:
mov (sp)+,r2 / and return
3:
rts pc
/*
* The length of the copy justifies trying a word by word copy. If src and
* dst are of the same parity, we can do a word copy by handling any leading
* and trailing odd bytes separately.
*/
4:
bit $1,r1 / if (src&1 != dst&1)
beq 5f / do bytes
bit $1,r2
beq 1b / (src odd, dst even - do bytes)
movb -(r1),-(r2) / copy leading odd byte
dec r0
br 6f
5:
bit $1,r2
bne 1b / (src even, dst odd - do bytes)
6:
mov r0,-(sp) / save trailing byte indicator
asr r0 / length >>= 1 (unsigned)
bic $0100000,r0 / (length is now in remaining words to copy)
asr r0 / if (length >>= 1, wasodd(length))
bcc 7f / handle leading non multiple of four
mov -(r1),-(r2)
7:
asr r0 / if (length >>= 1, wasodd(length))
bcc 8f / handle leading non multiple of eight
mov -(r1),-(r2)
mov -(r1),-(r2)
8:
mov -(r1),-(r2) / do
mov -(r1),-(r2) / move eight bytes
mov -(r1),-(r2)
mov -(r1),-(r2)
sob r0,8b / while (--length)
bit $1,(sp)+ / if (odd trailing byte)
beq 2b
movb -(r1),-(r2) / copy it
br 2b / and return
Even if you're not fluent in PDP-11 assembly, the code is well commented, and the rationale for the size threshold above which larger-than-byte instructions are used is clear. The only possibly difficult instructions to read are the bic instructions following the asr instructions: asr performs an arithmetic shift to the right (dividing by two), preserving the sign bit if set, and the bic instruction immediately following clears the sign bit (bit 15) from the result (the 0100000 operand is octal for 32768 or, if you prefer, 0x8000 in hexadecimal.)
(I took the liberty of adding the missing "+" symbol in the "N * 2 > 14 + N * 0.625" equation, as it is not present in the 2.11BSD source.)
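The asr/bic pair above is simply building a logical (unsigned) right shift out of an arithmetic one. In C terms, for a 16-bit word (lsr1 is a made-up name, and the sketch assumes the common arithmetic behaviour of >> on signed values):

```c
#include <stdint.h>

/* logical shift right by one, the PDP-11 way: arithmetic shift,
 * then clear bit 15 in case the sign bit was duplicated into it */
uint16_t lsr1(uint16_t x)
{
    int16_t v = (int16_t)x;
    v >>= 1;                        /* asr */
    return (uint16_t)(v & 0x7fff);  /* bic $0100000 */
}
```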
Of course, there was a C version for architectures which did not have an assembly version, in lib/libc/gen/bcopy.c (note the pre-ANSI function declaration and the register keywords):
/*
* Copyright (c) 1980 Regents of the University of California.
* All rights reserved. The Berkeley software License Agreement
* specifies the terms and conditions for redistribution.
*/
#if defined(LIBC_SCCS) && !defined(lint)
static char sccsid[] = "@(#)bcopy.c 1.1 (Berkeley) 1/19/87";
#endif LIBC_SCCS and not lint
/*
* bcopy -- vax movc3 instruction
*/
bcopy(src, dst, length)
	register char *src, *dst;
	register unsigned int length;
{
	if (length && src != dst)
		if (dst < src)
			do
				*dst++ = *src++;
			while (--length);
		else {			/* copy backwards */
			src += length;
			dst += length;
			do
				*--dst = *--src;
			while (--length);
		}
	return(0);
}
When the BSD codebase was ported to more architectures, bcopy was still made identical to ovbcopy, for example for the iconic hp300 port using Motorola 680x0 processors:
/*
* {ov}bcopy(from, to, len)
*
* Works for counts up to 128K.
*/
ALTENTRY(ovbcopy, _bcopy)
ENTRY(bcopy)
movl sp@(12),d0 | get count
jeq Lcpyexit | if zero, return
movl sp@(4),a0 | src address
movl sp@(8),a1 | dest address
cmpl a1,a0 | src before dest?
jlt Lcpyback | yes, copy backwards (avoids overlap)
movl a0,d1
btst #0,d1 | src address odd?
jeq Lcfeven | no, go check dest
movb a0@+,a1@+ | yes, copy a byte
subql #1,d0 | update count
jeq Lcpyexit | exit if done
Lcfeven:
movl a1,d1
btst #0,d1 | dest address odd?
jne Lcfbyte | yes, must copy by bytes
movl d0,d1 | no, get count
lsrl #2,d1 | convert to longwords
jeq Lcfbyte | no longwords, copy bytes
subql #1,d1 | set up for dbf
Lcflloop:
movl a0@+,a1@+ | copy longwords
dbf d1,Lcflloop | til done
andl #3,d0 | get remaining count
jeq Lcpyexit | done if none
Lcfbyte:
subql #1,d0 | set up for dbf
Lcfbloop:
movb a0@+,a1@+ | copy bytes
dbf d0,Lcfbloop | til done
Lcpyexit:
rts
Lcpyback:
addl d0,a0 | add count to src
addl d0,a1 | add count to dest
movl a0,d1
btst #0,d1 | src address odd?
jeq Lcbeven | no, go check dest
movb a0@-,a1@- | yes, copy a byte
subql #1,d0 | update count
jeq Lcpyexit | exit if done
Lcbeven:
movl a1,d1
btst #0,d1 | dest address odd?
jne Lcbbyte | yes, must copy by bytes
movl d0,d1 | no, get count
lsrl #2,d1 | convert to longwords
jeq Lcbbyte | no longwords, copy bytes
subql #1,d1 | set up for dbf
Lcblloop:
movl a0@-,a1@- | copy longwords
dbf d1,Lcblloop | til done
andl #3,d0 | get remaining count
jeq Lcpyexit | done if none
Lcbbyte:
subql #1,d0 | set up for dbf
Lcbbloop:
movb a0@-,a1@- | copy bytes
dbf d0,Lcbbloop | til done
rts
In this particular case, the combination of the ALTENTRY and ENTRY macros would define the ovbcopy symbol at the exact same address as bcopy, making them truly the same routine.
(Also, the 128KB size limit mentioned in the comments is incorrect, and this may not be trivial to catch for people not familiar with m68k assembly. The dbf instruction used to control the loops (test condition, decrement and branch) looks at only the low 16 bits of its register argument, in this case d0 for the byte loops and d1 for the longword (32-bit) loops. So the real limit of this routine is 4 * 65535 bytes if it is able to perform a longword copy in Lcflloop or Lcblloop, but only 65535 bytes if it has to resort to the byte copy loops Lcfbloop and Lcbbloop. I can only guess that an earlier version of this routine only tried to do 16-bit copies and would thus be limited to 2 * 65535, which is close to 128KB; unfortunately this code was contributed to BSD by Mike Hibler from the University of Utah, and this earlier work is not freely available, let alone its source code, and I did not want to bother Mike to check the history of that routine to confirm that guess.)
Note that all these routines try to be fast by using larger-than-byte operations as soon as the source and destination addresses are suitably aligned, rather than doing byte copies in a loop. This helps for large copies (100 bytes or more), which are not unusual. It also explains why the assembly versions can be tens or hundreds of lines long.
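Stripped of the unrolling and of the backward direction, the general shape of these routines looks like this in C. This is only a sketch (aligned_copy is a made-up name, and the cast-based word access assumes the platform tolerates it with respect to strict aliasing):

```c
#include <stddef.h>
#include <stdint.h>

/* byte copy at the edges, word copy in the middle when alignment allows */
void aligned_copy(unsigned char *dst, const unsigned char *src, size_t len)
{
    /* leading bytes, until src reaches a word boundary */
    while (len > 0 && ((uintptr_t)src & (sizeof(uint32_t) - 1)) != 0) {
        *dst++ = *src++;
        len--;
    }
    /* if dst is now aligned as well, copy a word at a time */
    if (((uintptr_t)dst & (sizeof(uint32_t) - 1)) == 0) {
        while (len >= sizeof(uint32_t)) {
            *(uint32_t *)dst = *(const uint32_t *)src;
            dst += sizeof(uint32_t);
            src += sizeof(uint32_t);
            len -= sizeof(uint32_t);
        }
    }
    /* trailing bytes (or everything left, if the alignments did not match) */
    while (len-- != 0)
        *dst++ = *src++;
}
```

Real implementations also unroll the word loop and provide a mirrored backward version, which is exactly what makes them so long.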
However, Chris Torek's port to the SPARC had separate bcopy and ovbcopy routines, the former not handling overlap, as this sys/sparc/sparc/locore.s excerpt shows:
/*
* kernel bcopy/memcpy
* Assumes regions do not overlap; has no useful return value.
*
* Must not use %g7 (see copyin/copyout above).
*/
#define BCOPY_SMALL 32 /* if < 32, copy by bytes */
ENTRY(memcpy)
/*
* Swap args for bcopy. Gcc generates calls to memcpy for
* structure assignments.
*/
mov %o0, %o3
mov %o1, %o0
mov %o3, %o1
ENTRY(bcopy)
cmp %o2, BCOPY_SMALL
Lbcopy_start:
bge,a Lbcopy_fancy ! if >= this many, go be fancy.
btst 7, %o0 ! (part of being fancy)
/*
* Not much to copy, just do it a byte at a time.
*/
deccc %o2 ! while (--len >= 0)
bl 1f
EMPTY
0:
inc %o0
ldsb [%o0 - 1], %o4 ! (++dst)[-1] = *src++;
stb %o4, [%o1]
deccc %o2
bge 0b
inc %o1
1:
retl
nop
/* NOTREACHED */
/*
* Plenty of data to copy, so try to do it optimally.
*/
Lbcopy_fancy:
! check for common case first: everything lines up.
! btst 7, %o0 ! done already
bne 1f
EMPTY
btst 7, %o1
be,a Lbcopy_doubles
dec 8, %o2 ! if all lined up, len -= 8, goto bcopy_doubes
! If the low bits match, we can make these line up.
1:
xor %o0, %o1, %o3 ! t = src ^ dst;
btst 1, %o3 ! if (t & 1) {
be,a 1f
btst 1, %o0 ! [delay slot: if (src & 1)]
! low bits do not match, must copy by bytes.
0:
ldsb [%o0], %o4 ! do {
inc %o0 ! (++dst)[-1] = *src++;
inc %o1
deccc %o2
bnz 0b ! } while (--len != 0);
stb %o4, [%o1 - 1]
retl
nop
/* NOTREACHED */
! lowest bit matches, so we can copy by words, if nothing else
1:
be,a 1f ! if (src & 1) {
btst 2, %o3 ! [delay slot: if (t & 2)]
! although low bits match, both are 1: must copy 1 byte to align
ldsb [%o0], %o4 ! *dst++ = *src++;
stb %o4, [%o1]
inc %o0
inc %o1
dec %o2 ! len--;
btst 2, %o3 ! } [if (t & 2)]
1:
be,a 1f ! if (t & 2) {
btst 2, %o0 ! [delay slot: if (src & 2)]
dec 2, %o2 ! len -= 2;
0:
ldsh [%o0], %o4 ! do {
sth %o4, [%o1] ! *(short *)dst = *(short *)src;
inc 2, %o0 ! dst += 2, src += 2;
deccc 2, %o2 ! } while ((len -= 2) >= 0);
bge 0b
inc 2, %o1
b Lbcopy_mopb ! goto mop_up_byte;
btst 1, %o2 ! } [delay slot: if (len & 1)]
/* NOTREACHED */
! low two bits match, so we can copy by longwords
1:
be,a 1f ! if (src & 2) {
btst 4, %o3 ! [delay slot: if (t & 4)]
! although low 2 bits match, they are 10: must copy one short to align
ldsh [%o0], %o4 ! (*short *)dst = *(short *)src;
sth %o4, [%o1]
inc 2, %o0 ! dst += 2;
inc 2, %o1 ! src += 2;
dec 2, %o2 ! len -= 2;
btst 4, %o3 ! } [if (t & 4)]
1:
be,a 1f ! if (t & 4) {
btst 4, %o0 ! [delay slot: if (src & 4)]
dec 4, %o2 ! len -= 4;
0:
ld [%o0], %o4 ! do {
st %o4, [%o1] ! *(int *)dst = *(int *)src;
inc 4, %o0 ! dst += 4, src += 4;
deccc 4, %o2 ! } while ((len -= 4) >= 0);
bge 0b
inc 4, %o1
b Lbcopy_mopw ! goto mop_up_word_and_byte;
btst 2, %o2 ! } [delay slot: if (len & 2)]
/* NOTREACHED */
! low three bits match, so we can copy by doublewords
1:
be 1f ! if (src & 4) {
dec 8, %o2 ! [delay slot: len -= 8]
ld [%o0], %o4 ! *(int *)dst = *(int *)src;
st %o4, [%o1]
inc 4, %o0 ! dst += 4, src += 4, len -= 4;
inc 4, %o1
dec 4, %o2 ! }
1:
Lbcopy_doubles:
ldd [%o0], %o4 ! do {
std %o4, [%o1] ! *(double *)dst = *(double *)src;
inc 8, %o0 ! dst += 8, src += 8;
deccc 8, %o2 ! } while ((len -= 8) >= 0);
bge Lbcopy_doubles
inc 8, %o1
! check for a usual case again (save work)
btst 7, %o2 ! if ((len & 7) == 0)
be Lbcopy_done ! goto bcopy_done;
btst 4, %o2 ! if ((len & 4)) == 0)
be,a Lbcopy_mopw ! goto mop_up_word_and_byte;
btst 2, %o2 ! [delay slot: if (len & 2)]
ld [%o0], %o4 ! *(int *)dst = *(int *)src;
st %o4, [%o1]
inc 4, %o0 ! dst += 4;
inc 4, %o1 ! src += 4;
btst 2, %o2 ! } [if (len & 2)]
1:
! mop up trailing word (if present) and byte (if present).
Lbcopy_mopw:
be Lbcopy_mopb ! no word, go mop up byte
btst 1, %o2 ! [delay slot: if (len & 1)]
ldsh [%o0], %o4 ! *(short *)dst = *(short *)src;
be Lbcopy_done ! if ((len & 1) == 0) goto done;
sth %o4, [%o1]
ldsb [%o0 + 2], %o4 ! dst[2] = src[2];
retl
stb %o4, [%o1 + 2]
/* NOTREACHED */
! mop up trailing byte (if present).
Lbcopy_mopb:
bne,a 1f
ldsb [%o0], %o4
Lbcopy_done:
retl
nop
1:
retl
stb %o4,[%o1]
/*
* ovbcopy(src, dst, len): like bcopy, but regions may overlap.
*/
ENTRY(ovbcopy)
cmp %o0, %o1 ! src < dst?
bgeu Lbcopy_start ! no, go copy forwards as via bcopy
cmp %o2, BCOPY_SMALL! (check length for doublecopy first)
/*
* Since src comes before dst, and the regions might overlap,
* we have to do the copy starting at the end and working backwards.
*/
add %o2, %o0, %o0 ! src += len
add %o2, %o1, %o1 ! dst += len
bge,a Lback_fancy ! if len >= BCOPY_SMALL, go be fancy
btst 3, %o0
/*
* Not much to copy, just do it a byte at a time.
*/
deccc %o2 ! while (--len >= 0)
bl 1f
EMPTY
0:
dec %o0 ! *--dst = *--src;
ldsb [%o0], %o4
dec %o1
deccc %o2
bge 0b
stb %o4, [%o1]
1:
retl
nop
/*
* Plenty to copy, try to be optimal.
* We only bother with word/halfword/byte copies here.
*/
Lback_fancy:
! btst 3, %o0 ! done already
bnz 1f ! if ((src & 3) == 0 &&
btst 3, %o1 ! (dst & 3) == 0)
bz,a Lback_words ! goto words;
dec 4, %o2 ! (done early for word copy)
1:
/*
* See if the low bits match.
*/
xor %o0, %o1, %o3 ! t = src ^ dst;
btst 1, %o3
bz,a 3f ! if (t & 1) == 0, can do better
btst 1, %o0
/*
* Nope; gotta do byte copy.
*/
2:
dec %o0 ! do {
ldsb [%o0], %o4 ! *--dst = *--src;
dec %o1
deccc %o2 ! } while (--len != 0);
bnz 2b
stb %o4, [%o1]
retl
nop
3:
/*
* Can do halfword or word copy, but might have to copy 1 byte first.
*/
! btst 1, %o0 ! done earlier
bz,a 4f ! if (src & 1) { /* copy 1 byte */
btst 2, %o3 ! (done early)
dec %o0 ! *--dst = *--src;
ldsb [%o0], %o4
dec %o1
stb %o4, [%o1]
dec %o2 ! len--;
btst 2, %o3 ! }
4:
/*
* See if we can do a word copy ((t&2) == 0).
*/
! btst 2, %o3 ! done earlier
bz,a 6f ! if (t & 2) == 0, can do word copy
btst 2, %o0 ! (src&2, done early)
/*
* Gotta do halfword copy.
*/
dec 2, %o2 ! len -= 2;
5:
dec 2, %o0 ! do {
ldsh [%o0], %o4 ! src -= 2;
dec 2, %o1 ! dst -= 2;
deccc 2, %o0 ! *(short *)dst = *(short *)src;
bge 5b ! } while ((len -= 2) >= 0);
sth %o4, [%o1]
b Lback_mopb ! goto mop_up_byte;
btst 1, %o2 ! (len&1, done early)
6:
/*
* We can do word copies, but we might have to copy
* one halfword first.
*/
! btst 2, %o0 ! done already
bz 7f ! if (src & 2) {
dec 4, %o2 ! (len -= 4, done early)
dec 2, %o0 ! src -= 2, dst -= 2;
ldsh [%o0], %o4 ! *(short *)dst = *(short *)src;
dec 2, %o1
sth %o4, [%o1]
dec 2, %o2 ! len -= 2;
! }
7:
Lback_words:
/*
* Do word copies (backwards), then mop up trailing halfword
* and byte if any.
*/
! dec 4, %o2 ! len -= 4, done already
0: ! do {
dec 4, %o0 ! src -= 4;
dec 4, %o1 ! src -= 4;
ld [%o0], %o4 ! *(int *)dst = *(int *)src;
deccc 4, %o2 ! } while ((len -= 4) >= 0);
bge 0b
st %o4, [%o1]
/*
* Check for trailing shortword.
*/
btst 2, %o2 ! if (len & 2) {
bz,a 1f
btst 1, %o2 ! (len&1, done early)
dec 2, %o0 ! src -= 2, dst -= 2;
ldsh [%o0], %o4 ! *(short *)dst = *(short *)src;
dec 2, %o1
sth %o4, [%o1] ! }
btst 1, %o2
/*
* Check for trailing byte.
*/
1:
Lback_mopb:
! btst 1, %o2 ! (done already)
bnz,a 1f ! if (len & 1) {
ldsb [%o0 - 1], %o4 ! b = src[-1];
retl
nop
1:
retl ! dst[-1] = b;
stb %o4, [%o1 - 1] ! }
The i386 situation is a bit more complicated. In the CSRG repository at the University of California, Berkeley, the first stable version of locore.s (according to the commit message) only implements bcopy:
#
# bcopy (src,dst,cnt)
# NOTE: does not (yet) handle overlapped copies
#
.globl _bcopy
_bcopy:
pushl %esi
pushl %edi
movl 12(%esp),%esi
movl 16(%esp),%edi
movl 20(%esp),%ecx
# shll $2,%ecx
# cld
# rep
# movsl
# movl 20(%esp),%ecx
# andl $3,%ecx
rep
movsb
popl %edi
popl %esi
xorl %eax,%eax
ret
The implementation of ovbcopy is found near the end of the file:
_ovbcopy:
.byte 0xcc
This 0xcc byte is a debugger trap instruction (INT3); any attempt to call ovbcopy would trigger the kernel debugger, from which there would be no real way to recover.
A real ovbcopy implementation was finally added more than one year later:
#
# ovbcopy (src,dst,cnt)
# NOTE: does not (yet) work doing words at a time
#
.globl _ovbcopy
_ovbcopy:
pushl %esi
pushl %edi
movl 12(%esp),%esi
movl 16(%esp),%edi
movl 20(%esp),%ecx
addl %ecx,%esi /* copy from end to beginning */
addl %ecx,%edi
decl %esi
decl %edi
std /* decrementing as we go */
rep
movsb
xorl %eax,%eax
cld
ret
But note that this code is not correct! What it does is a backwards copy, regardless of whether the areas overlap, and this will misbehave in situations where the areas overlap but would be safe to copy forward (dst < src < dst + cnt).
If you remember my initial example of
src = 0, dst = 4, length = 8,
then let's try
src = 4, dst = 0, length = 8.
Initial state:

| address | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 |
| src     |    |    |    |    | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  |
| dest    |    |    |    |    |    |    |    |    |    |    |    |    |

A forward copy would be safe here. After the first four bytes, src is still intact:

| address | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 |
| src     |    |    |    |    | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  |
| dest    | 1  | 2  | 3  | 4  |    |    |    |    |    |    |    |    |

and at completion, dest correctly holds the original data, since the head of src is only overwritten after it has been read:

| address | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 |
| src     |    |    |    |    | 5  | 6  | 7  | 8  | 5  | 6  | 7  | 8  |
| dest    | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  |    |    |    |    |

The backwards copy performed by this ovbcopy, however, clobbers the head of src right away; after the first four bytes:

| address | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 |
| src     |    |    |    |    | 5  | 6  | 7  | 8  | 5  | 6  | 7  | 8  |
| dest    |    |    |    |    | 5  | 6  | 7  | 8  |    |    |    |    |

and it completes with a duplicated pattern in dest instead of a copy of the original data:

| address | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 |
| src     |    |    |    |    | 5  | 6  | 7  | 8  | 5  | 6  | 7  | 8  |
| dest    | 5  | 6  | 7  | 8  | 5  | 6  | 7  | 8  |    |    |    |    |
This is the code which can be found in the 0.0 release of 386BSD, which you can check here.
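The misbehaviour is easy to reproduce in C; this sketch mimics the unconditional std/rep/movsb sequence (backward_only_copy is a made-up name, keeping bcopy's source-first argument order):

```c
#include <stddef.h>

/* always copies backwards, like the 386BSD 0.0 ovbcopy above;
 * bcopy argument order: source, destination, count */
void backward_only_copy(const unsigned char *src, unsigned char *dst, size_t cnt)
{
    src += cnt;
    dst += cnt;
    while (cnt-- != 0)
        *--dst = *--src;
}
```

With src = 4, dst = 0, length = 8 as above, the destination ends up holding 5 6 7 8 5 6 7 8 instead of 1 through 8.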
However, in the next version (386BSD 0.1), bcopy and ovbcopy were replaced with a single, fixed version, correctly handling overlaps, contributed by Wolfgang Solfrank, which can be seen here:
/*
* (ov)bcopy (src,dst,cnt)
* ws@tools.de (Wolfgang Solfrank, TooLs GmbH) +49-228-985800
*/
.globl _bcopy,_ovbcopy
_ovbcopy:
_bcopy:
pushl %esi
pushl %edi
movl 12(%esp),%esi
movl 16(%esp),%edi
movl 20(%esp),%ecx
cmpl %esi,%edi /* potentially overlapping? */
jnb 1f
cld /* nope, copy forwards. */
shrl $2,%ecx /* copy by words */
rep
movsl
movl 20(%esp),%ecx
andl $3,%ecx /* any bytes left? */
rep
movsb
popl %edi
popl %esi
xorl %eax,%eax
ret
1:
addl %ecx,%edi /* copy backwards. */
addl %ecx,%esi
std
andl $3,%ecx /* any fractional bytes? */
decl %edi
decl %esi
rep
movsb
movl 20(%esp),%ecx /* copy remainder by words */
shrl $2,%ecx
subl $3,%esi
subl $3,%edi
rep
movsl
popl %edi
popl %esi
xorl %eax,%eax
cld
ret
In 386BSD 1.0, this implementation was again discarded, ovbcopy was completely removed, and an overlap-aware bcopy was implemented as an inline function in usr/src/kernel/include/i386/inline/string/bcopy.h:
/*
* Copyright (c) 1994 William F. Jolitz.
* 386BSD Copyright Restrictions Apply. All Other Rights Reserved.
*
* $Id: bcopy.h,v 1.1 94/06/09 18:19:53 bill Exp Locker: bill $
* BSD block copy.
*/
__INLINE void
bcopy(const void *fromaddr, void *toaddr, size_t maxlength)
{
	extern const int zero;			/* compiler bug workaround */
	const void *f = fromaddr + zero;	/* compiler bug workaround */

	/* possibly overlapping? */
	if ((u_long) toaddr < (u_long) fromaddr) {
		asm volatile ("cld ; repe ; movsl" :
		    "=D" (toaddr), "=S" (fromaddr) :
		    "0" (toaddr), "1" (fromaddr), "c" (maxlength / 4));
		asm volatile ("repe ; movsb" :
		    "=D" (toaddr), "=S" (fromaddr) :
		    "0" (toaddr), "1" (fromaddr), "c" (maxlength & 3));
	} else {
		toaddr += maxlength - 1;
		fromaddr += maxlength - 1;
		asm volatile ("std ; repe ; movsb" :
		    "=D" (toaddr), "=S" (fromaddr) :
		    "0" (toaddr), "1" (fromaddr), "c" (maxlength & 3));
		toaddr -= 3;
		fromaddr -= 3;
		asm volatile ("repe ; movsl ; cld" :
		    "=D" (toaddr), "=S" (fromaddr) :
		    "0" (toaddr), "1" (fromaddr), "c" (maxlength / 4));
	}
}
Anyway, the damage was done. Since bcopy used to allow overlap, and still did on the major platforms of the time (vax, hp300, mips), there was no way to trust code using bcopy to mean either ovbcopy or overlap-unsafe bcopy.
In other words, while code using ovbcopy could safely be converted to use memmove, code using bcopy could not be converted to use memcpy: it had to use memmove as well, unless there was enough confidence (from the author, or from knowledge of the code) that a non-overlapping routine was safe to use in that particular case.
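The conservative mechanical conversion is therefore to route every bcopy through memmove, in the spirit of the 2.11BSD ovbcopy define shown earlier (a sketch, not any project's actual compatibility header):

```c
#include <string.h>

/* assume any bcopy call site may involve overlapping areas */
#define bcopy(src, dst, len) memmove((dst), (src), (len))
```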
In the OpenBSD kernel, while new code would slowly prefer memcpy and memmove to bcopy and ovbcopy, there was still legacy code using bcopy. And there were ports to new architectures (such as PA-RISC or PowerPC) where people writing code for these new ports would eventually assume that it was safe to provide an overlap-safe ovbcopy routine and a faster, overlap-unsafe, bcopy routine.
And then there would be weird behaviours or memory corruption once in a blue moon, because one piece of code in the kernel had been written to use bcopy, assuming vax-times overlap handling, but was running on a more modern platform where the bcopy routine was assuming non-overlapping areas.
In order to protect against this, and until all the bcopy/ovbcopy users would get converted to memcpy/memmove, Theo de Raadt thought it would be safer to make sure all OpenBSD platforms had an overlap-safe bcopy routine, thus making bcopy identical to ovbcopy as used to be the case in the vax and hp300 days.
We were at a hackathon in Coimbra, Portugal, when he did the changes to the sparc port, removing the bcopy routine and making it an alias of ovbcopy, similarly to the hp300 code shown earlier. These changes were obviously correct.
From: Theo de Raadt
Date: Mon, 26 Nov 2007 08:13:48 +0000
To: source-changes mailing list
Subject: CVS: cvs.openbsd.org: src
CVSROOT: /cvs
Module name: src
Changes by: deraadt@cvs.openbsd.org 2007/11/26 01:13:48
Modified files:
sys/arch/sparc/sparc: locore.s
Log message:
the bcopy() found here was not handling overlapping. Merge it nicely with
the ovbcopy() code
ok miod
Yet when I updated my Tadpole SPARCbook 3 laptop I had with me, its network interface would now fail to work, with all packets having wrong checksums and being discarded in error.
A quick investigation showed that this apparently innocuous change was the cause of the problem, so it was reverted minutes later.
From: Theo de Raadt
Date: Mon, 26 Nov 2007 08:18:51 +0000
To: source-changes mailing list
Subject: CVS: cvs.openbsd.org: src
CVSROOT: /cvs
Module name: src
Changes by: deraadt@cvs.openbsd.org 2007/11/26 01:18:51
Modified files:
sys/arch/sparc/sparc: locore.s
Log message:
few mails later, miod asks me to wait
And then I added this issue to my todolist, and got carried away with other things.
Eventually, after a few months and the occasional reminder from Theo, I started investigating the problem.
It turns out that there was a small bug, possibly a typo, in the ovbcopy logic on sparc, which only affected copies done in 16-bit chunks (i.e. areas aligned to a 16-bit address, but with at least one of them not aligned to a 32-bit address). As it happens, the driver for the AMD Lance (AM7990) Ethernet controller, used on the SPARCbook (and many other SPARCstation systems) as the le(4) driver, was a heavy user of bcopy with such addresses.
The fix was a one-liner:
From: Miod Vallat
Date: Sat, 15 Mar 2008 13:28:43 +0000
To: source-changes mailing list
Subject: CVS: cvs.openbsd.org: src
CVSROOT: /cvs
Module name: src
Changes by: miod@cvs.openbsd.org 2008/03/15 07:28:43
Modified files:
sys/arch/sparc/sparc: locore.s
Log message:
After 15 years of fun, fix Torek's ovbcopy() operation when copying shorts
backwards.
If you look at the commit diff:
@@ -4995,11 +4995,11 @@ Lback_fancy:
dec 2, %o2 ! len -= 2;
5:
dec 2, %o0 ! do {
ldsh [%o0], %o4 ! src -= 2;
dec 2, %o1 ! dst -= 2;
- deccc 2, %o0 ! *(short *)dst = *(short *)src;
+ deccc 2, %o2 ! *(short *)dst = *(short *)src;
bge 5b ! } while ((len -= 2) >= 0);
sth %o4, [%o1]
b Lback_mopb ! goto mop_up_byte;
btst 1, %o2 ! (len&1, done early)
In that loop, register %o0 is the source address and register
%o2 is the length, so the code needs to operate on the length
at every loop iteration.
The deccc would subtract 2 from %o0 and update
the condition codes, causing the following bge instruction to
branch only if the computed value is greater than or equal to zero.
Regardless of whether the branch is taken, the store operation (the sth opcode, storing a halfword, thus 16 bits) is executed anyway: it sits in the branch's delay slot.
But since the deccc operation was incorrectly operating on the source
address instead, the loop would continue until the address became negative;
somehow fortunately, kernel addresses on sparc have their high bit set, so
instead of becoming an infinite loop and overflowing the destination buffer,
the loop would instead perform only one iteration and exit, copying only
2 bytes instead of the actual requested size.
That human error might have happened because the comment block on the right of the assembly instructions is misleading: the commented equivalent C statements are not in front of the assembly instructions they actually correspond to.
A possible way to rewrite that block with the comments matching the
assembly code for the loop would be:
dec 2, %o2 ! len -= 2;
5: ! do {
dec 2, %o0 ! src -= 2;
ldsh [%o0], %o4 ! tmp = *(short *)src;
dec 2, %o1 ! dst -= 2;
deccc 2, %o2 ! len -= 2;
bge 5b ! } while (len >= 0);
sth %o4, [%o1] ! delay slot: *(short *)dst = tmp;
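In plain C, the corrected loop is just a backward copy of 16-bit units; a sketch of its structure (copy_halfwords_backwards is a hypothetical helper, not OpenBSD code), with src and dst pointing just past the end of their areas:

```c
#include <stddef.h>
#include <stdint.h>

/* backward halfword copy of len bytes (len even and >= 2, areas 2-byte
 * aligned); mirrors the structure of the fixed sparc loop, including
 * the use of the length itself as the loop counter */
void copy_halfwords_backwards(uint16_t *dst, const uint16_t *src, size_t len)
{
    ptrdiff_t n = (ptrdiff_t)len - 2;   /* len -= 2; */
    do {
        *--dst = *--src;                /* *(short *)dst = *(short *)src; */
    } while ((n -= 2) >= 0);            /* } while (len >= 0); */
}
```

With the original bug, the decrement hit the source pointer instead of n, so the loop exited after a single iteration on kernel addresses with the high bit set.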
Unsurprisingly, the sparc64 port to the UltraSPARC family suffered from the same bug and received a similar fix at the same time.
I eventually wrote a quick'n'dirty test program for the memory copy routines, in order to spot bugs specific to certain alignment or sizes conditions.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
/* Comment this block out to test libc functions rather than your own impl */
#if 0
extern void asm_bcopy(const void *, void *, size_t);
#define bcopy asm_bcopy
extern void *asm_memcpy(void *, const void *, size_t);
#define memcpy asm_memcpy
extern void *asm_memmove(void *, const void *, size_t);
#define memmove asm_memmove
#endif
#define MAX_SIZE (64 * 1024)
/*
 * Test memory area. One reference area, and two playgrounds, one on each side.
 * Playgrounds are a bit larger because:
 * - we want to play with the relative alignment of src and/or dest areas.
 * - we want to check the 16 bytes before and after the dest area to make
 *   sure that no out-of-area writes occur.
 */
static struct {
	uint8_t play1[MAX_SIZE + 128];
	uint8_t ref[MAX_SIZE];
	uint8_t play2[MAX_SIZE + 128];
} d __attribute__((aligned(16)));
static char testname[128];
static unsigned int failures;
/*
* Safe non-overlapping copy routine, to be sure not to depend upon the
* code being tested.
*/
void
truthful_bcopy(uint8_t *dest, const uint8_t *src, size_t len)
{
	while (len-- != 0) {
		*dest++ = *src++;
	}
}
/*
* Compute the test description.
*/
void
test(const char *descr, size_t soffs, size_t doffs, size_t len)
{
	static const char *patterns[] = {
		"000",
		"001",
		"010",
		"011",
		"100",
		"101",
		"110",
		"111"
	};

	snprintf(testname, sizeof testname,
	    "%s, bit patterns src..%s dst..%s len..%s",
	    descr, patterns[soffs & 7], patterns[doffs & 7],
	    patterns[len & 7]);
}
void
blame(void)
{
	printf("ERROR: in test %s\n", testname);
}
/*
* Check a given 'test' area of length 'len' against 'ref', as well as the
* 16 bytes preceding it against 'before' and the 16 bytes following it
* against 'after'.
*/
int
check(const uint8_t *ref, const uint8_t *test, size_t len,
    const uint8_t *before, const uint8_t *after)
{
	size_t pos;
	int failed = 0;

	test -= 16;
	for (pos = 0; pos < 16; pos++, before++, test++) {
		if (*before != *test) {
			if (failed == 0) {
				failed = 1;
				blame();
			}
			printf("offset %zd: read %02x, expected %02x\n",
			    (ssize_t)pos - 16, *test, *before);
			failures++;
		}
	}
	for (pos = 0; pos < len; pos++, ref++, test++)
		if (*ref != *test) {
			if (failed == 0) {
				failed = 1;
				blame();
			}
			printf("offset 0x%zx: read %02x, expected %02x\n",
			    pos, *test, *ref);
			failures++;
		}
	for (pos = 0; pos < 16; pos++, after++, test++) {
		if (*after != *test) {
			if (failed == 0) {
				failed = 1;
				blame();
			}
			printf("offset size+%zu: read %02x, expected %02x\n",
			    pos, *test, *after);
			failures++;
		}
	}
	return failed;
}
/*
* Same as above, also checking the return value of the copy function.
*/
void
checkr(const uint8_t *ref, const uint8_t *test, size_t len, const void *ret,
const uint8_t *before, const uint8_t *after)
{
int failed = check(ref, test, len, before, after);
if (ret != test) {
if (failed == 0) {
blame();
}
printf("return value: got %p, expected %p\n", ret, test);
failures++;
}
}
void
test_bcopy(unsigned int lim, size_t size)
{
unsigned int i, j;
uint8_t effs[16];
memset(effs, 0xff, sizeof effs);
if (lim > 8)
lim = 8;
for (i = 0; i < lim; i++)
for (j = 0; j < lim; j++) {
test("bcopy, no overlap, src < dst", i, j, size);
/* + 32 to guarantee 16 valid ff bytes before and 16 after */
memset(d.play2, 0xff, size + 32 + lim);
/* + 16 to guarantee 16 valid ff bytes before */
bcopy(d.ref + i, d.play2 + 16 + j, size);
check(d.ref + i, d.play2 + 16 + j, size, effs, effs);
}
for (i = 0; i < lim; i++)
for (j = 0; j < lim; j++) {
test("bcopy, no overlap, dst < src", i, j, size);
/* + 32 to guarantee 16 valid ff bytes before and 16 after */
memset(d.play1, 0xff, size + 32 + lim);
/* + 16 to guarantee 16 valid ff bytes before */
bcopy(d.ref + i, d.play1 + 16 + j, size);
check(d.ref + i, d.play1 + 16 + j, size, effs, effs);
}
/*
* No overlap tests - either it's a real bcopy(), as opposed to
* ovbcopy(), in which case overlap tests do not make sense, or
* it's likely a wrapper around memmove(), for which overlap
* situations are tested below.
*/
}
void
test_memcpy(unsigned int lim, size_t size)
{
unsigned int i, j;
void *ret;
uint8_t effs[16];
memset(effs, 0xff, sizeof effs);
if (lim > 8)
lim = 8;
for (i = 0; i < lim; i++)
for (j = 0; j < lim; j++) {
test("memcpy, no overlap, src < dst", i, j, size);
/* + 32 to guarantee 16 valid ff bytes before and 16 after */
memset(d.play2, 0xff, size + 32 + lim);
/* + 16 to guarantee 16 valid ff bytes before */
ret = memcpy(d.play2 + 16 + j, d.ref + i, size);
checkr(d.ref + i, d.play2 + 16 + j, size, ret, effs, effs);
}
for (i = 0; i < lim; i++)
for (j = 0; j < lim; j++) {
test("memcpy, no overlap, dst < src", i, j, size);
/* + 32 to guarantee 16 valid ff bytes before and 16 after */
memset(d.play1, 0xff, size + 32 + lim);
/* + 16 to guarantee 16 valid ff bytes before */
ret = memcpy(d.play1 + 16 + j, d.ref + i, size);
checkr(d.ref + i, d.play1 + 16 + j, size, ret, effs, effs);
}
}
void
test_memmove(unsigned int lim, size_t size)
{
unsigned int i, j;
void *ret;
size_t doffs, soffs;
uint8_t before[16], after[16], effs[16];
memset(effs, 0xff, sizeof effs);
if (lim > 8)
lim = 8;
for (i = 0; i < lim; i++)
for (j = 0; j < lim; j++) {
test("memmove, no overlap, src < dst", i, j, size);
/* + 32 to guarantee 16 valid ff bytes before and 16 after */
memset(d.play2, 0xff, size + 32 + lim);
doffs = 16 + j;
truthful_bcopy(before, d.play2 + doffs - 16, 16);
truthful_bcopy(after, d.play2 + doffs + size, 16);
ret = memmove(d.play2 + doffs, d.ref + i, size);
checkr(d.ref + i, d.play2 + doffs, size, ret, before, after);
}
for (i = 0; i < lim; i++)
for (j = 0; j < lim; j++) {
test("memmove, no overlap, src > dst", i, j, size);
/* + 32 to guarantee 16 valid ff bytes before and 16 after */
memset(d.play1, 0xff, size + 32 + lim);
doffs = 16 + j;
truthful_bcopy(before, d.play1 + doffs - 16, 16);
truthful_bcopy(after, d.play1 + doffs + size, 16);
ret = memmove(d.play1 + doffs, d.ref + i, size);
checkr(d.ref + i, d.play1 + doffs, size, ret, before, after);
}
/*
* Overlap needing a reverse copy occurs if
* src < dst < src + len
*/
for (i = 0; i < lim; i++)
for (j = 0; j < lim; j++) {
test("memmove, overlap needing reverse copy", i, j, size);
/* + 32 to guarantee 16 valid ff bytes before and 16 after */
memset(d.play2, 0xff, size + 32 + 2 * lim);
doffs = 16 + i + j;
soffs = 16 + i;
memcpy(d.play2 + soffs, d.ref, size);
truthful_bcopy(before, d.play2 + doffs - 16, 16);
truthful_bcopy(after, d.play2 + doffs + size, 16);
ret = memmove(d.play2 + doffs, d.play2 + soffs, size);
checkr(d.ref, d.play2 + doffs, size, ret, before, after);
}
/*
* Overlap not needing a reverse copy occurs if
* dst < src < dst + len
*/
for (i = 0; i < lim; i++)
for (j = 0; j < lim; j++) {
test("memmove, overlap not needing reverse copy", i, j, size);
/* + 32 to guarantee 16 valid ff bytes before and 16 after */
memset(d.play2, 0xff, size + 32 + 2 * lim);
doffs = 16 + i;
soffs = 16 + i + j;
memcpy(d.play2 + soffs, d.ref, size);
truthful_bcopy(before, d.play2 + doffs - 16, 16);
truthful_bcopy(after, d.play2 + doffs + size, 16);
ret = memmove(d.play2 + doffs, d.play2 + soffs, size);
checkr(d.ref, d.play2 + doffs, size, ret, before, after);
}
}
int
main()
{
unsigned int i;
/*
* Compute a random pattern, and make sure none of its bytes is
* equal to 0xff, which is our "not part of the pattern" sentinel.
*/
arc4random_buf(d.ref, sizeof d.ref);
for (i = 0; i < sizeof d.ref; i++) {
if (d.ref[i] == 0xff)
d.ref[i] = 0;
}
/*
* Test all alignments with a few size combinations.
*
* Small lengths are used to check that we don't copy too much
* data if the relative alignment of src and dest allows for a
* large-copy operation.
*/
static const size_t sizes[] = {
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 32, 64, 128, 256, 512, 1024 //, 4096, 32768
};
for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
printf("Testing with size %zu\n", sizes[i]);
test_bcopy(8, sizes[i]);
test_memcpy(8, sizes[i]);
test_memmove(8, sizes[i]);
if (failures != 0)
break;
}
switch (failures) {
default:
printf("** %u TESTS FAILED **\n", failures);
break;
case 1:
printf("** 1 TEST FAILED **\n");
break;
case 0:
break;
}
/* benchmark if no errors */
if (failures == 0) {
struct timeval start, end, diff;
unsigned int i, iter, niters;
static const size_t allsizes[] = {
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 32, 64, 128, 256, 512, 1024, 4096, 8192, 32768
};
for (i = 0; i < sizeof(allsizes) / sizeof(allsizes[0]); i++) {
/* increase '100' below on fast hardware */
niters = (100 * 32768 / allsizes[i]);
gettimeofday(&start, NULL);
for (iter = niters; iter != 0; iter--) {
test_bcopy(1, allsizes[i]);
test_memcpy(1, allsizes[i]);
test_memmove(1, allsizes[i]);
}
gettimeofday(&end, NULL);
timersub(&end, &start, &diff);
printf("%u loops on %zu bytes, time elapsed: %ld seconds, %ld microseconds\n",
niters, allsizes[i], (long)diff.tv_sec, (long)diff.tv_usec);
}
}
return failures != 0;
}
This test program (and the crude benchmark capability built into it) turned out to be helpful while writing a reasonably fast yet small assembly implementation of the bcopy, memcpy and memmove routines for m88k some months ago.
In NetBSD, whose sparc port had also been based on Chris Torek's work, the ovbcopy bug remained unnoticed and unfixed until 2018, when all uses of ovbcopy were replaced with either an overlap-aware bcopy routine or memmove:
Module Name: src
Committed By: mrg
Date: Fri Mar 16 09:29:24 UTC 2018
Modified Files:
src/sys/arch/sparc/sparc: locore.s
Log Message:
remove obsolete ovbcopy(). it may be the very slightly bit
faster for larger copies, but slower for smaller ones.
i don't see any major benefit in keeping this code.
this is the final ovbcopy() reference in src. you're welcome :-)
You can now rest safely, knowing that ovbcopy no longer exists.
Well, almost. It still exists in FreeBSD, as a macro in sys/systm.h:
#define ovbcopy(f, t, l) bcopy((f), (t), (l))
...but this is only for the sake of a few files, and one can reasonably expect it to disappear completely in the near future.