OpenBSD on SGI, 3/6: The blowfish awakens

2004, OpenBSD

As usual, Fogelström disappeared for a while, and we thought once again that the SGI port would never materialize unless someone else took the lead (and likely ported the existing NetBSD code).

But on July 30th, it looked like things would happen for real.

<pefo> sooon a snapshot for you guys with SGI O2s will be ready to play with.
       R5K required for now though.
<miod> thanks pefo.
<miod> i still have to check the O2 here, but I think I have an R5k module on
       one of the i2 or indy.
<miod> thought they might not be suit for o2, come to think of it...
<pefo> no, O2 and Indy modules are not compatible.
<miod> damn. then i hope it's an r5k.

The promised snapshot appeared two days later.

<pefo> ALLRIGHT!! Everyone with SGI O2s R5K, line up and dust them off. Snapshot
       ready for ftp in 2 hrs!
<miod> pefo, I'll line up once I'm back home
<pefo> tsk, tsk... ;)
<miod> if you can pay the bills and the food I'll gladly stop working and spend
       my time home (-:
[...]
<pefo> OK, the SGI O2 snap is up and ready for download. Have fun!

Date: Mon, 2 Aug 2004 15:57:51 +0200
From: Per Fogelström
To: private OpenBSD mailing list
Subject: First SGI O2 snap now available!!!

OK, for all those who waited a long time, the moment has come!!!

This snap runs on O2s with R5K and i think R5K2 cpus and at least 64Mb of
sdram. You will need a NIC, preferably a fxp. Integrated ethernet support is
on the way. Read the README file for more info.

Get it at ftp.opsycon.se/pub/OpenBSD/3.5+/sgimips when it's fresh!

Have fun and report back to me!

Per

Theo de Raadt seized the opportunity to ask for hardware donations (it never hurts to try...).

Date: Wed, 04 Aug 2004 08:43:57 +0000
From: Theo de Raadt
To: openbsd-misc
Subject: SGI O2

A developer is about to import an SGI O2 codebase.

Anyone want that?  There's a catch.

We need at least 3 machines in Calgary.  There's a little known rule
in OpenBSD -- the "to make an official snapshot" rule -- which says
roughtly [sic]: in addition to the main developers of a new architecture, at
least Theo and Peter must have machines.  Otherwise all of us get
scared of the possibility of a new architecture languishing and
creating an unsightly pit of dead code in the tree.

We could use more than 3, since developers visit here once in a while,
and we like to send them on their way with full suitcases.

Is there anyone out there that wants to take care of this?
ie. Finding some machines, and getting them to Calgary.  It's a
mission.

I'm serious.  Sometimes we ask and nothing happens.  It is kind of
like I am asking our user community to invest some time in getting
access to some damn cheap (just check Ebay) machines and simply post
them to Calgary :) People seem to ask how they can help quite often;
here's an example.

Why add a new architecture?  SGI's mips machines are dead, aren't
they?  Well our expierence [sic] has been that every architecture we add has
helped us find bugs in shared code that affects other architectures.
(As long as it receives developer [attention] and does not rot like NetBSD's
architectures do).

Some very scary and major bugs have been dredged out almost
automatically.  In some cases, the effort is not worthwhile.  In this
case, I judge it to be valuable.  In particular, the SGI developer
will probably peek out quite a few busdma bugs, which will affect
driver support on any busdma architecture.

ps. m88k has been the most worthwhile example, since some insane parts
of that architecture permits Miod to find a MI bug weekly.

And now that there were some concrete bits, we could start addressing one of the most important problems in computing: naming things.

<deraadt> So, should it be called "sgimips" or "sgi".  Comments?
<pval> will we have any other mips in foreseeable future?
<pval> guess it doesn't matter for the name actually
<deraadt> Well I doubt we'll be supporting the sgi68k or sgi88k or sgisparc
          machines.
<miod> to be fair, I have been looking for an sgi68k for some years now, they
       are very very very rare.
<deraadt> I dislike typing long names, so I think it should just be sgi.
<miod> well, it was supposed to be sgi initally.
<deraadt> sgi was very successful at buying them back to give to their
          sledgehammer guy.
<deraadt> I met their California sledgehammer guy.
<miod> yup
<deraadt> I was introduced :)
<miod> did he hit you with his hammer?
<deraadt> Was told that they spent about 1 million dollars over 10 years keeping
          them off ebay and out of the hands of other resellers.
<miod> heh.
<miod> am not surprised.
<millert> why bother?
<deraadt> To kill the reseller market.
<deraadt> SGI machines had low value on buyback.
<deraadt> Unlike Sun, who put high value on buyback.
<miod> sgi tried to be very vocal on the theme "2nd hand sgi are bought as
       refurbished from sgi".

After one good night of thought...

<deraadt> so, sgi or sgimips?  I think sgi.
<drahn> sgi sounds good to me.
<kettenis> sgi also sells ia64 systems isn't it?
<deraadt> Yes, but those a standard ia64 systems.
<deraadt> Nothing SGI about them.
<kettenis> ah, ok
<kettenis> Hmm, NetBSD calls their SGI port sgimips though
<deraadt> Yes, and they call their amd64 port x86_64
[...]
<mickey> it seems to be more of an issue of hpcmips vs sgimips rather than sgi
         making any other than mips machines
<mickey> same as macppc vs mvmeppc
[...]
<deraadt> I like simpler.
[...]
<deraadt> sgi68k is not going to happen.
<drahn> right so sgi makes sense.
<mickey> on he other hand there is no catsarm
<deraadt> cats have legs, not arms, dummy
<mickey> paws you fluffy you
<miod> and tails.
[...]
<matthieu> but are sgimips and sgimips64 the same port or 2 different ones?
<miod> isn't sgimips 32bit only so far?
<deraadt> I think it is 64bit only.
<miod> NetBSD/sgimips is 32bit only as far as I understand.
[...]
<matthieu> so we want 2 ports sgi and sgi64
<miod> matthieu, likely.
<deraadt> We want to support the 32 bit machines?  Or just say forget it?
<miod> i'd like to. but then I have too much time on my hands as everyone knows.
<deraadt> so maybe sgi64?
<miod> or sgi for 64 bits, and sgi32 later (-:
<deraadt> sure
<deraadt> if.  when.
<miod> hell, you said you wanted to only have to type "sgi".
<deraadt> i prefer sgi :)
<todd> my vote is "sgi" == 64bits, all cpu's I own atm are supposedly 64bit
       capable
<art> it's "sg<TAB>" anyway. So who cares?

(catsarm in the discussion above is a reference to the ARM-based CATS board, which was used to bring back ARM support in OpenBSD in order to have a solid foundation before starting the SHARP Zaurus port.)

Fogelström came up with a nice summary the next day.

<pefo> hey, while i have a good night sleep you people can decide wether you
       want to call it sgimips, sgi, sgi64, pamela, or whatever! ;)

Eventually there was general agreement that ``sgi'' was the better name, and the source code was added to the OpenBSD source tree on August 6th.

Michael Shalayeff ported the NetBSD O2 on-board Ethernet driver a few days later, and a few other developers started working on code cleanup.

Date: Sat, 14 Aug 2004 15:40:01 +0200
From: Per Fogelström
To: private OpenBSD mailing list
Subject: New SGI snapshot available.

New snap available in ~pefo/sgisnap-0814

This snap has mec ethernet driver and a lot of other fixes.
When using mec for network be aware that there is a snag somewhere
we are searching for which makes the driver hang the system. Although
i have done complete make builds via nfs source using mec there is
something hiding in there. fxp is reliable though.

MACHINE is now sgi instead of sgimips, MACHINE_ARCH=mips64.
A little confusing since code is still LP32. As a consequence
of that cc and as does not agree on things and -mips1 or -mips2
has to be explicitly given as option. I will try to fix that in
the next snap coming in a couple of days. Alternatively pick up
the comp35 tar from ftp.opsycon.se and use binutils from there.

No disk boot yet. Code is ready but not tested yet.

Heading for LP64 now!!!

Per

A few days later:

Date: Tue, 17 Aug 2004 07:56:16 +0200
From: Per Fogelström
To: private OpenBSD mailing list
Subject: SGI snap update

The SGI snap in ~pefo on cvs is now updated. The toolchain should now
be working (nm(1)). The only tar which is updated is the comp36.tgz.

nm(1) doesn't work with binutils 2.14 and mips. It does not say 'T', 'U'
etc on shared lib symbols. 'W' works though. If someone could take a
look at this i would appreciate it since perl does no longer build
because of that.

The snap dir also contains a diff against the tree with the patches
currently needed to do a make build. Many of these will be obsolete
when the toolchain is fixed, among them the GOT separation and alignment.
Basically all MI fixes will be gone and only some MD will be left.

I will be on the mainland for a couple of day, back Thursday, so
until then have fun. :)


Per

Before performing the switch from 32 to 64 bits, there were a few less important things Fogelström wanted to address, which took the rest of August: a working standalone bootloader, and the switch of the system compiler from gcc 2.95 to gcc 3.3, which was a requirement for sufficiently reliable 64-bit code generation.

<pefo> SGI: diskbooting now works. code is not yet committed, i have to run. a
       new snap will be put up later today or tomorrow.
...
<pefo> ok, a new SGI snap is in ~pefo/sgisnap. the kernel does not yet
       autodetect the boot device but the one in the next snap will. it's
       building right now.
<pefo> don't forget to 'setenv OSLoader boot' otherwise it will try to start
       sash.
<pefo> (for you who don't read the install doc ;) )

On August 31st, things were ready for the 64-bit port to start.

<pefo> ahh! sgi fully migrated to gcc3 now.
<deraadt> oh really?  in mips32 or mips64?
<pefo> mips32. now going 64.
<deraadt> neat.

Date: Tue, 31 Aug 2004 22:03:46 +0200
From: Per Fogelström
To: private OpenBSD mailing list
Subject: gcc3 based sgisnap available

As usual in ~pefo at cvs. This is probably the last snap before going
full 64 bit.

Two days later, a 64-bit kernel was working.

<pefo> Loading ELF64 file
<pefo> 0x0:0xffffffff, Zero 0x339bb0:0xffffffff, 0x347bd0:0xffffffff, start at 0x801001d0
<pefo> Found SGI-IP32, setting up.
<pefo> Initial setup done, switching console.
<pefo> -Copyright (c) 1982, 1986, 1989, 1991, 1993
<pefo>         The Regents of the University of California.  All rights reserved.
<pefo> Copyright (c) 1995-2004 OpenBSD. All rights reserved.  http://www.OpenBSD.org
<pefo> OpenBSD 3.6 (GENERIC64) #24: Thu Sep  2 13:24:13 CEST 2004
<pefo>     root@moosehead.opsycon.se:/usr/src/sys/arch/sgi/compile/GENERIC64
<pefo> real mem = 134217728
<pefo> rsvd mem = 7020544
<pefo> avail mem = 108924928
<pefo> using 1638 buffers containing 6709248 bytes of memory
<pefo> mainbus0 (root)
...
<pefo> TADA!!!
<otto> hip hip hurray!
<pefo> well this was the easy part, migrating the kernel to 64 bits. now comes
       userland...
<pefo> i'm cheating a little though... not running on full address space yet.
<pefo> and don't tell theo, the same code is used to build a 64 or 32 bit kernel.
       just feed the compiler -32 or -64 and everything is taken care of. ;)
<otto>  /msg deraadt did you know pefo cheats? he uses the same code to build
       64 and 32 bit kernel!
<otto> oops ;-)

Userland took four more days, with some setbacks.

<pefo> wow! just went multiuser in 64 bit mode on sgi. userland is static since
       ld.so needs some fixing. but this is looking promising!
[...]
<pefo> ssh craps out on mips64
<pefo> RSA_public_decrypt failed: error:0407006A:rsa routines:RSA_padding_check_PKCS1_type_1:block type is not 01
<pefo> is there something in the libs that needs to be set to 64?
<drahn> my guess would be libssl/crypto/arch/mips64/opensslconf.h
<pefo> oh! thanks! it's a great timesaver when someone knows the answer or where
       to look!
<grange> dale was there with arm ;-)
<pefo> :)

This 64-bit mips work also led to unexpected discoveries.

<pefo> heh! found a new lever on my chair!
<pefo> this one adjusts the y-position of the seat.
<pefo> and i've had it for almost 3 years!
<miod> next week, you'll find the instructions manual!

The 64-bit adaptation work was committed on September 9th.

Kernel moves to 64 bit. A few more tweaks when binutils is updated.

And then we faced our first severe bug.

<pefo> oh crap!
<pefo> panic: pool_get(mbpl): free list modified: magic=e291a1a; page 0xffffffffc332f000; item addr 0xffffffffc332f880
[...]
<pefo> panic: pool_get(mbpl) happens everytime i try to ssh to the O2.
<miod> smells like alignment issue.
<miod> i.e. you allocate @0 but access @0+4 onwards...
<pefo> what is mbpl
<markus> mbuf pool
<pefo> ok. perhaps i should try using the old trusty fxp to see if it may be
       an mec driver problem.
<pefo> when i think about it's very possible. that driver have never run in
       64 bit mode before since netbsd have no mips64 yet.
[...]
<miod> do you still get the mbpl panic?
<pefo> haven't had time to check that any further. i switched to a fxp though
       but that one crashes in the driver with a messed up mbuf chain. funny
       thing is that dong nfs from the O2 works fine. but as soon as i try to ssh to
<pefo> the box it crashes.
<pefo> same sypthimh taht is with the fxp. either fxp or mec works fine nfs-client.
<pefo> sympthom that...
<pefo> never saw this problem with the 32bit kernel.
[...]
<miod> panic: pool_get(mbpl): free list modified: magic=56617018; page 0xffffffffc332f000; item addr 0xffffffffc332f580
<miod> still not triggered early by your diff Todd )-;
<millert> Oh well
<miod> still a nice idea.
<miod> it's from the MGET in m_prepend().
<millert> Yeah, I see
<pefo> miod, do you have a trace so you can see where pool_get is called ?
<miod> _pool_get+0x644 (1ffffce7,ffffffff803caab0,56617018,ffffffffc332f000) sp ffffffffc6c6b770 ra ffffffff801d3f6c, sz 64
<miod> m_prepend+0xb4 (1ffffce7,ffffffff803caab0,56617018,ffffffffc332f000) sp ffffffffc6c6b7b0 ra ffffffff80271c80, sz 48
<miod> udp_output+0x130 (1ffffce7,ffffffffc3293008,0,0) sp ffffffffc6c6b7e0 ra ffffffff80272608, sz 144
<miod> udp_usrreq+0x638 (1ffffce7,ffffffffc3293008,0,0) sp ffffffffc6c6b870 ra ffffffff801d933c, sz 64
<miod> sosend+0x62c (1ffffce7,0,0,ffffffffc332f480) sp ffffffffc6c6b8b0 ra ffffffff802b7d00, sz 128
<miod> nfs_send+0x88 (1ffffce7,0,0,ffffffffc332f480) sp ffffffffc6c6b930 ra ffffffff802b9430, sz 32
<miod> nfs_request+0x9e8 (ffffffffc3304de0,ffffffffc332f000,3c6c6bdd0,ffffffffc332f480) sp ffffffffc6c6b950 ra ffffffff802cdf10, sz 288
<miod> nfs_lookup+0x400 (c3304de0,ffffffffc332f000,3c6c6bdd0,ffffffffc332f480) sp ffffffffc6c6ba70 ra ffffffff801f4308, sz 464
<miod> VOP_LOOKUP+0x60 (ffffffffc3304de0,ffffffffc6c6be10,ffffffffc6c6be38,ffffffffc332f480) sp ffffffffc6c6bc40 ra ffffffff801e8c2c, sz 48
<miod> ddb> show pool mbpool
<miod> POOL mbpl: size 128, align 8, ioff 0, roflags 0x00000018
<miod>         alloc 0xffffffff80434538
<miod>         minitems 16, minpages 1, maxpages 8, npages 1
<miod>         itemsperpage 31, nitems 20, nout 11, hardlimit 4294967295
<miod>         nget 2638, nfail 0, nput 2627
<miod>         npagealloc 1, npagefree 0, hiwat 1, nidle 0
<miod>         currently entered from file /data/src/sys/kern/uipc_mbuf.c line 236
<miod> (sorry for flood)
<espie> flood expected for a pool overflow, you know.
<miod> go drown yourself.
<espie> ENOMEM: pool not big enough.
<miod> QUI EST GROS ?
<pefo> stfu, anything interesting scrolls off!
<pefo> ;)
<miod> echo -n "heh"
<pefo> you get this in nfs, eh?
<miod> ssh -1 machine, but the private key is on an nfs mounted share.
<miod> gonna reboot and try ssh -lroot...
<miod> but then nfs when logged on the console works.
<miod> (and this ssh works ~ 1 time out of 3)
<matthieu> I'm using nfs from the console too for my source tree.
<pefo> ah, ok. i'm strting to think it has to do with fragmentation. i get my
       crash in mec_start when it figures that the packet can't be send as is
       but must be revuilt.
<miod> pefo, but you had this with fxp as well, right?
<pefo> it crashes with the fxp, but it looks different. may be something else
       although related since nfs works fine but ssh to box crashes.
<pefo> to bad the R5K doesn't have the watch register feature. could have nailed
       this in less than an hour... :(
<miod> i thought you had an RM7k o2 too?
<pefo> no, r10k, and a r12k cpu module on its way.
<pefo> the r7k seems to be very rare...
<pefo> i have several other mips systems/boards with rm7k's though but they
       don't run mips64 yet.
[...]
<pefo> R10K have the watch register. looks like i'm going to add support for it
       a little earlier than i planned...

(``Qui est gros?'', French for ``Who is fat?'', in all caps above is an Obelix reference.)

It took several days and many people's brains to figure out: the panic was caused by a machine-dependent constant in the network stack, which ought to have been enlarged during the switch to 64 bits but had been left unchanged, so that assumptions made by the network stack no longer held.

The machine-independent nature of the bug was confirmed by testing with an incorrect value on other platforms and experiencing similar network memory corruption.

This was fixed on the 17th:

Crank MSIZE and NMBCLUSTERS, per other 64bit arches.

Further investigation into the causes of that bug also revealed that an earlier change in the network stack had been subtly incorrect, and that change was reverted as well.

Another good side-effect of that bug hunt was that Fogelström worked on R10000 processor family support earlier than initially intended.

<pefo> OpenBSD/sgi (moosehead12k.opsycon.se) (tty00)
<pefo> login:
<pefo> mickey! want a new kernel?
<miod> how come it sez (tty00) instead of (console)
<pefo> prolly something in conf.c? actually have no idea. my mind have been busy
       with other things. :)
<miod> probably your /etc/ttys
<miod> console "/usr/libexec/getty std.9600"   unknown off secure
<miod> tty00   "/usr/libexec/getty std.9600"   unknown on  secure
<miod> oh, i did not notice this earlier.
<pefo> go ahead and fix it if it's wrong.
<miod> no, actually it's ok until we get video console
<pefo> and that will be??? ;)
<miod> hopefully soon
<pefo> you have the stuff from glaurung's?
<miod> i have looked at it, yes
<miod> and then my hair turned white.
<pefo> miod, but you didn't lose your hair! it could have fallen off! ;)
<wvdputte> now we know why Miod wears a hat
<pefo> i'm running the 12K with dirty speculative still on in kernel mode, eg
       like the 10K will do. building a kernel right now and so far it seems to
       work OK.
<miod> actually, i should get a haircut soon.
<pefo> just powered up the Origin 200. pretty beefy machine. 225Mhz, 512Meg.
<miod> only 225 MHz?
<miod> that's a scam!
<pefo> for a R10K that is pretty good.
<miod> but my fastest r4.4k runs at 250MHz!
<miod> (ok, I'm sounding like an old record again here)
<pefo> you will be outrun anyway ;)
<pefo> building a kernel with the R12K was a little more than 3 times faster. i
       had hoped for about 4, but anyway.
<miod> wait till you have smp code working! (-:
<pefo> oh yeah! and 1 128 cpu Origin 2000 cluster! Only $10K on ebay! ;)

(glaurung in the discussion above is Vivien Chappelier.)

(Also note the name of Fogelström's system - Moosehead was the SGI code name for the O2, and 12k obviously is the processor type, a MIPS R12000.)

The mips64 codebase was about to enter another zone of turbulence, and this time I was the one to blame. The machine-dependent part of the virtual memory subsystem, known as the pmap module, still had parts inherited from the OpenBSD/arc code of years earlier, which had fallen behind many later changes. In particular, the data structures tracking the modified and referenced state of virtual memory pages (which has to be maintained in software on MIPS) could be improved. While working on these changes, I tried to kill too many birds with one stone (my first mistake) and did not test well enough (my second mistake), and I introduced several bugs which caused, among other things, random segmentation faults in userland binaries.

I should have reverted my changes and done them again in smaller steps, but I was so sure this would be fixed by minor changes (a ``one-liner'' or two) that I did not want to do it (my third mistake); it took pressure from several developers and a heated discussion with more curse words than should have been needed for me to revert the troublesome parts, and we had lost three days.

But this allowed a much more stable snapshot to be built and released the next day.

Date: Fri, 24 Sep 2004 13:30:18 +0200
From: Per Fogelström
To: private OpenBSD mailing list
Subject: New sgi snap available.

OK, after much bug digging a new snap is put up in ~pefo@cvs.
The kernel now seems stable wrt random core dumps. Problems were
found and fixed in pmap code and in ld.so.

This snap still lacks sendmail and friends. gcc still barfs on it.

binutils is a moving target and seems to have bugs which manifest
themselfs as failed linking of certain programs. it's being looked
at. however it means that a few programs fail to link or cores.
Most of these can be linked static but the major mess is gcc. don't
try to rebuild it unless you really need (wanna fix bugs?). in that
case contact me and i will explain how to build it. however if
something cores, try to build it -static. groff for example must
be linked static.

There are two extra files in the snap dir. One is the emulparam
that is going into ld. You only beed this if you plan to rebuild
binutils and especially ld. The other file contains the diffs i
have wrt head in my build tree. mostly binutils but also a "gross"
hack to ahc.c to achive full disk speed. A better fix for this is
coming. Note that this diff is not MI safe so be careful.

known problems beside those in the toolchain is that the mec
ethernet chip sometimes get stuck with its interrupt asserted.
a power cycle or a reboot from the maintenance console fixes it.

ahc craps out now and then, it seems. i'm not sure if this may
be related to the R10K speculative dirty problem. it would be
nice if people could test on both r10k's _and_ other machines
to see if the problem occurs over the entire line.

(The "R10k speculative dirty problem" is a reference to the speculative execution behaviour on this processor. Refer to the technical note I wrote earlier about it for details. In both NetBSD and OpenBSD, the cache invalidation discipline in the drivers, done as part of the bus_dma layer, turned out to be good enough to not suffer from speculative execution side effects. Linux, on the other hand, never was that lucky, and support for R10000 O2 has never been considered stable, to the point that you need to go out of your way in order to be able to build a kernel supporting that particular hardware configuration.)

Bugfixes kept coming, but we had to disable the stack-protector code on a few binaries (the libsmutil part of sendmail) as it would cause internal compiler errors.

Eventually Theo de Raadt was able to take over the snapshot builds in early November.

<deraadt> latest sgi snap is by me :)
<miod> theo, you mean the latest sgi snap is unreliable (-:
<deraadt> ?
<miod> you said you built it yourself!

Later that month, the O2 I had been using (courtesy of Wim Vandeputte, who had bought that machine and lent it to me earlier that year) failed.

<miod> damn! looks like the o2 here died
<miod> amber light, no startup sound...
<pval> no cereal?
<pval> mine did that, but it wasn't dead - there's a jumper you should try out
<pval> it's pretty close to the cpu when you take the board out, towards the
       edge, a single jumper used to reset everything to defaults
<deraadt> no, peter, that was because you did a setenv of a variable wrong
<deraadt> miod, there is a table that says what colours at boot mean what
<miod> i know
<deraadt> for amber, i have managed to let it sit off for an hour, and it worked
          again
<pval> yeah, i'm getting old as i forgot this so quickly
<deraadt> kind of scary
[...]
<kettenis> well, my o2 arrived completely dead.  After cleaning the motherboard
           <-> chassis connector even the disk works now.
<miod> unfortunately it is clean. i had cleaned the box when it was DOA, but here
       it just doesn't want to restart after being shut down one more time...
<deraadt> it's that 10mbit crap you are hooking it up to
<miod> no, it says "cpu board failure"
<miod> i'll let it rest the night.
<miod> wim?
<wvdputte> yo
<miod> do you remember who is the guy who lent the O2 which is at my place at the
       moment?
<wvdputte> yes, that would be me
<miod> no, you told me you got it as a lent [sic] from someone else.
<wvdputte> no, I bought it last year
<miod> oh.
<miod> want to attend the funeral?
<wvdputte> you broke it?
<miod> it died on me, apparently.
<pefo> amber led?
<miod> yes. not blinking.
<wvdputte> Open it up, try to fix it. Otherwise, I'll take it back and send it to
           my O2 doctor in .nl
<miod> according to http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=hdwr&db=bks&fname=/SGI_EndUser/books/O2_OG/sgi_html/ch04.html&srch=o@%20troubleshooting
       it's the cpu board.
<miod> wim, been there, done that.
<miod> and it's not the first time this happens, but today nothing seems to bring
       it back to life.
<miod> err, i said "cpu board failure" it's "system board failure" actually.
<wvdputte> RIP

Fortunately, I had connections at the local university, and I was able to get another O2 motherboard loaned to me one week later, thanks to François Delobel and David Delon.

Fall 2004 status

| SGI model | Common name | Linux | NetBSD | OpenBSD |
|---|---|---|---|---|
| IP12 | Indigo (R3000) | - | complete distribution; no graphics | - |
| IP20 | Indigo (R4000) | - | complete distribution; no graphics | - |
| IP22 | Indigo2 | complete distribution; XL (newport) graphics only | complete distribution; no graphics | - |
| IP24 | Indy | complete distribution; XL (newport) graphics only | complete distribution; XL (newport) graphics only, no X server | - |
| IP27 | Origin 200, Origin 2000 | complete distribution | - | - |
| IP28 | POWER Indigo2 R10000 | not-yet-integrated kernel patches; otherwise same as IP22 | - | - |
| IP30 | Octane | not-yet-integrated kernel patches; X server on Impact only | - | - |
| IP32 | O2 | not-yet-integrated kernel patches; no R10000 support | complete distribution; no graphics | complete distribution; no graphics; not public yet |

2005, OpenBSD

I spent some time trying to get the O2 frame buffer to work, with no success. At some point I vented my frustration.

Date: Fri, 18 Feb 2005 21:40:02 +0000
From: Miod Vallat
To: private OpenBSD mailinglist
Subject: O2 video

Ok, this is a nightmare.

It turns out the Linux driver has been written after noticing the O2
``GBE'' hardware is close to the SGI ``DBE'' found in the expensive
x86-based ``Visual Workstations'' they produced in '99 or so.

These chips are supposed to be smart because they have no frame buffer
memory, but instead do DMA blits from the main memory, laid out in
``tiles'', in order to allow the memory to be discontiguous in
practice. Just like on Zaurus (-: (except the Z uses a single contiguous
area)

So the Linux guy tinkered a bit, got something close to working,
tinkered more, and tada, it was deemed working.

It is obvious, though, that he never looked at the contents of all the
``DBE'' registers first - because the GBE layout is slightly different.

In particular, a few things in the Linux driver are clearly wrong, but
apparently they are lucky enough to not suffer from putting apples in
the pumpkins registers.

Too bad I have not been as lucky, so either with ``more correct'' code
or with the exact same sequence of operations as the Linux code, I
lose - either an unstable image or a nice black screen. Or an
interleaved madness which I can not make any use of )-: (not to mention
spurious interrupts, but this part is solved now).

I am trying to find more documentation about the GBE, and will probably
disassemble some IRIX code to help...

Maybe there are different revisions of this piece of hardware as well...

Anyway, knowing that breakthroughs usually happen *after* I send mails
about non-working code or problems I am stuck with, I wrote this mail
only to relax my mind and shuffle my ideas, in the hope of getting the
damn thing to work. If you have read till this sentence, you may discard
this mail and resume your regular slack^Wwork. Thanks for reading!

Miod

PS: Several versions of the non-working code are available upon request
if you want to play this game, too!

One month later, I had to return the O2 motherboard and was unable to do further tinkering. An Octane was lent to the OpenBSD project, and ended up at my place in May, but I had no time to tinker with it.

I relocated across the country in autumn, and thanks to the help of Matthieu Herrb, I was lent another O2 and could resume bug hunting on that platform.

Near the end of the year, I stumbled upon a funny bug introduced during the switch from 32 to 64 bits: in userland, there were implicit memory aliases every 2 GB. In other words, if you had a variable at address A, accessing memory at address (2GB + A) would not only fail to cause a segmentation fault, but would return the value of the variable. I wrote a simple program demonstrating this fact, and shared it along with the appropriate kernel fix.

Date: Fri, 16 Dec 2005 07:37:59 +0000
From: Miod Vallat
To: Per Fogelstrom, private OpenBSD mailinglist
Subject: [mips64] userland space aliases

The current mips64 codebase has an oddity introduced when switching from
32 to 64 bit: in userland, all addresses between 0 and
7fff.ffff.ffff.ffff are aliased every 2 GB.

Let's consider the following test program:

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

jmp_buf cheat;

void
segv(int signo)
{
        /*
         * Since the SEGV we're going to trigger can't be recovered from,
         * pretend it's safe to use non-signal-safe functions here.
         */
        longjmp(cheat, 0x666);
}

void
test(char *address)
{
        printf("accessing %p... ", address);
        if (setjmp(cheat) != 0)
                printf("segfaulted!\n");
        else
                printf("%d\n", *address);
}

int
main()
{
        vaddr_t va;
        char *c;

        signal(SIGSEGV, segv);

        /* pick a valid address on stack... */
        c = (char *)&c;
        va = (vaddr_t)c;
        test(c);

        /* try 1GB later */
        c = (char*)((1ULL << 30) + va);
        test(c);

        /* now 2GB */
        c = (char*)((2ULL << 30) + va);
        test(c);

        /* now 3GB */
        c = (char*)((3ULL << 30) + va);
        test(c);

        /* now 4GB */
        c = (char*)((4ULL << 30) + va);
        test(c);

        /* now 160GB */
        c = (char*)((40ULL << 32) + va);
        test(c);

        return (0);
}

On mips64, here is what you will get:

$ ./obj/amazing
accessing 0x7ffe0720... 0
accessing 0xbffe0720... segfaulted!
accessing 0xfffe0720... 0
accessing 0x13ffe0720... segfaulted!
accessing 0x17ffe0720... 0
accessing 0x287ffe0720... 0
$

Yet the userland address space (so far) is supposed to be restricted to
2GB...

The reason behind this is that the XTLB refill handler has been cloned
from the 32bit TLB refill handler, but needs more bounds checking.

In the 32bit world, the current pmap scheme makes sure that all virtual
addresses (0 to ffff.ffff) are managed in the pmap's pm_segtab. But in
the 64bit world, there is a hole between the 2GB userland limit (at
0000.0000.8000.0000) and the kernel space (at 8000.0000.0000.0000),
which is not checked by the TLB handler. Since the code does a logical
and operation to narrow the pm_segtab access, this means the upper bits
of the logical address are ignored.

In practice, this means any access to a particular address would end up
using the mapping for this address modulo 2GB.

The diff below is a suggested fix to this - we simply add this bounds
check in the XTLB refill handler. Note that the TLB refill handler does
not need to be modified, as it will only get invoked for faults in the
32-bit address space, where we are always within our bounds.

The test program will thus behave as expected:
$ ./obj/amazing
accessing 0x7fff4af0... 0
accessing 0xbfff4af0... segfaulted!
accessing 0xffff4af0... segfaulted!
accessing 0x13fff4af0... segfaulted!
accessing 0x17fff4af0... segfaulted!
accessing 0x287fff4af0... segfaulted!
$

Comments?

Miod
[...]

The fix was committed shortly afterwards.

2006, NetBSD

There was not much visible sgi-related activity in NetBSD in 2006.

During the second half of the year, Steve Rumble added support for some Fast Ethernet GIO expansion cards for Indigo, Indy and Indigo2, as well as for the LG (Indigo entry-level) frame buffer.

2006, OpenBSD

2006 was a quiet year for OpenBSD on the sgi front. I had noticed some odd things in the kernel, and stumbled upon worse problems every time I tried to clean up or fix them.

At the end of October, with stability issues on the rise on the O2, Theo de Raadt considered pulling the plug.

<deraadt> ready to give up on sgi.
