The Loeki's Strange Omniverse

Solaris 10 and Sun Ray 3.1 running nicely

posted Monday, 21 November 2005

Thursday November 5, 16.30 hours.

Thanks to a configuration change made automatically by WebRoot Spy Sweeper, the Windows 2003 Terminal Server refused to accept any new connections.
While I was working on a solution for that one, our Sun Fire 280R Solaris 8 server suddenly started racing. The two UltraSPARC IIIs quickly ran towards 100% usage and the entire system started acting weird.
Research showed something was utterly wrong. The root filesystem gave completely ridiculous readings for ls. Some files gave outright I/O errors, others were suddenly dozens of terabytes large, and yet others had a creation date of July 1, 1979 or something similar. Also, I had never seen files with file type "?" before. In short, everything was just completely fucked and broken.

After everybody left I hooked up the console to get some serious 1-on-1 with the server, which would last all night. Turns out there was nothing left to save. The entire root filesystem was dying a slow and painful death. After the inevitable reboot Solaris refused to give some error, indeed, even the maintenance mode was unavailable.
I was screwed, with nowhere left to go, nowhere to enter and nothing to access. Everything on the / filesystem was completely wasted and buggered up, and tomorrow everybody had to get to work again. Since only the mission-critical data was backed up, there really was only one solution: a complete reinstall.
As coincidence would have it, we were already in the advanced planning stages of a full upgrade from our deprecated Solaris 8/Sun Ray Server Software 1.3 stack to Solaris 10/Sun Ray Server Software 3.1, and I figured this was as good a time as any to do so.

The Sun Fire 280R came with a choice: either a CD-ROM drive or a DDS4 tape drive. We had chosen the latter for easy backups. Of course I could've tried JumpStart or some other mechanism of remote installation, but I had been dissatisfied with the current setup for quite some time, I didn't feel like inviting more unprecedented trouble, and I had just bought a couple of Sun Enterprise 250 boxes myself. So I decided to take a CD-ROM drive from one of those and replace the DDS4 with it, even if only temporarily.
Got a huge pot of coffee, dropped in the first Solaris 10 Install CD (courtesy of Sun Microsystems), and fired up the installation.

Lesson 1 on serial consoles: they're excruciatingly slow. Although the OpenBoot PROM (Sun's equivalent to the BIOS) allows for tweaking up the speed, I didn't feel like risking more trouble than I already had, so I stuck with the default 9600-baud setting.
Combining this with the Solaris installer and Windows HyperTerminal is already enough of a headache, and like I said, it's irritatingly slow, but heck, it worked.
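To put a rough number on that slowness, here is a back-of-the-envelope sketch. It assumes 8N1 framing (one start bit, eight data bits, one stop bit, so 10 bits on the wire per character) and a classic 80x24 terminal. Both are assumptions for illustration, and the function name is made up.

```python
def seconds_to_transmit(num_chars: int, baud: int = 9600, bits_per_char: int = 10) -> float:
    """Time needed to push num_chars over a serial line.

    bits_per_char=10 assumes 8N1 framing: 1 start + 8 data + 1 stop bit.
    """
    return num_chars * bits_per_char / baud


# One full redraw of an assumed 80x24 text screen.
full_screen = 80 * 24

for baud in (9600, 38400):
    print(f"{baud} baud: {seconds_to_transmit(full_screen, baud):.1f} s per full screen")
```

Under those assumptions a single full-screen redraw takes about two seconds at 9600 baud, which adds up quickly in a text-mode installer that repaints the screen on every step.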

For Sun Ray Server Software the folks at Sun strongly recommend a minimum installation level of "Entire Distribution". A big problem arose here. Cards Engineering, our suppliers, designed the system with a root slice of a mere 2 GiB, whilst an "Entire Distribution" install demands about 5 GiB.
Since we were already relatively short on space, this meant I had to scour for megabytes, which in turn meant plowing through the package lists to save space by deselecting unnecessary packages.
Eventually I had to redesign the whole file system layout. Cutting the swap slices in half on both hard disks and adding them to the swap system individually, instead of as the single mirror in which they were delivered to me by default, saved me a full 1100 MiB, increased swap performance dramatically thanks to striping and left my swap space as big as it was before. I took another full GiB from the data partition and ended up with 4.4 GiB of available installation space.
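Why halving the swap slices costs no swap space can be checked with a little arithmetic. The sketch below is illustrative only: the post doesn't state the original slice sizes, so the 2200 MiB figure is a hypothetical value picked so that halving frees 1100 MiB per disk, and the function name is made up.

```python
def swap_capacity_mib(slice_sizes_mib, mirrored):
    """Usable swap in MiB.

    A mirror stores the same data on every member, so it is only as big as
    its smallest member. Independent swap slices simply add up.
    """
    return min(slice_sizes_mib) if mirrored else sum(slice_sizes_mib)


# Hypothetical original layout: one swap slice per disk, mirrored.
original = [2200, 2200]
# New layout: each slice cut in half, added to swap individually.
halved = [size // 2 for size in original]

before = swap_capacity_mib(original, mirrored=True)
after = swap_capacity_mib(halved, mirrored=False)
freed_per_disk = original[0] - halved[0]

print(f"swap before: {before} MiB, after: {after} MiB, freed per disk: {freed_per_disk} MiB")
```

The price of this trade is redundancy: with unmirrored swap, losing one disk also takes out whatever was paged out to it.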

After some cutting and gutting in the package configuration we were finally ready to start off. Flawlessly and quite effortlessly one CD after another was installed onto the hard drive, and after a little over an hour uname -srv proudly reported SunOS 5.10 Generic_118822-20 sparc back to me.
After that I patched the OBP, the POST and the RSC to the latest versions, installed the latest Solaris 10 Recommended Patch Cluster and the platform was ready for duties and configuration.
This all sounds reasonably simple, but believe me, by that time it was already past 24.00 hours :-/

Next I copied the old data slice to the new, shrunken data slice. In the meantime I had the opportunity to set up Webmin, SWAT and Samba. After some messing around I found out that the included versions aren't installed in their default locations, and are pretty old.
Ah well, too late now. Fortunately Webmin took up its tasks effortlessly after I figured out there is such a thing as webminconfig, and Samba was quite friendly about getting configured as well.

After that the SunPCI ProII cards were up for enabling. Fortunately I had already done quite some research in the aforementioned planning stages, because otherwise it would've gotten quite a bit later in the morning than it already was. SunPCI ProII cards are Celeron-733 CPUs pasted on a PCI-X card, each with its own SO-DIMMs and GPU. They're configurable through the PCI Pro software. They run off virtual hard drives, which are basically binary images created on the host system's hard drive.
With the help of this hack, which, fortunately, worked without any problems, I was able to get the two cards we have to work once again, and even added them back into a Boot@Boot configuration, without any issue and in the default way. The Windows installations on these cards acted like nothing at all had happened.
Although the direct interface with these cards is somewhat crappy and buggy (partly fixable with a patch) they're perfect as low-key RDP-hosts.

Finally I added all the usernames, without passwords or whatever, to allow everybody to log back in.
By then it was already past 08.00 hours, many pots of coffee and quite some frustration later. But, the shit was up and running again! I packed my gear, time to get some shuteye. One last test run, just in case. Everything went fine, until... I suddenly noticed the SuSE Linux Pro 9.1 FileMaker 5.5 (which I got to work thanks to these brilliant hacks by the way) database-server had become quite grumpy in the whole process.

<<<SIGH>>>

Pull open a PuTTY, log in through SSH and find out that the NFS mount to the Solaris box (of course) wasn't functioning anymore.
That in itself wouldn't have been a disaster, had it not been for the fact that a small number of cron scripts were located on that NFS share. And cron quite hated me for that. Especially because the NFS mount had not become unmounted (as one would reasonably expect) but instead was timing out like an unemployed person standing in line for his welfare allocation.
Cron, on the other hand, of course tried to go on and, for lack of a status message, apparently just tried again. And again. And again. And it had been doing so since 18.00 hours the day before. This resulted in some 300 processes racing for CPU time just to use it to wait. And who do you suppose became a victim of that? Exactly, the half a dozen FileMaker processes. kill -9 was to no avail, and even an init 6 was just not going to work (at least not within 15 minutes).
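One common way to keep a pile-up like this from growing is to have every cron job take a non-blocking lock before it touches anything on the network. The sketch below is a general illustration of that idea, not what was running on this server. The names try_lock and run_once are made up, and the wrapper and its lock file are assumed to live on local disk, since a wrapper stored on the hung NFS share would stall before it ever reached the lock.

```python
import fcntl


def try_lock(path):
    """Return an open file holding an exclusive lock, or None if it is taken.

    LOCK_NB makes flock fail immediately instead of waiting, so a second
    instance gives up at once rather than queueing behind a stuck one.
    """
    handle = open(path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle


def run_once(lock_path, job):
    """Run job() only if no other instance holds the lock.

    Returns True if the job ran, False if it was skipped.
    """
    lock = try_lock(lock_path)
    if lock is None:
        return False
    try:
        job()
        return True
    finally:
        # Closing the file releases the lock.
        lock.close()
```

With a guard like this, a job that hangs on a dead NFS mount keeps holding its lock, and every later cron invocation of the same job exits immediately, so there is at most one stuck process per job instead of hundreds.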

<<<SIGH>>>

I pulled the plug. I was out of motivation, out of energy, out of clear thoughts, and I just pulled the plug. 30 seconds later I said my prayers and gave it juice again.
The good news: Linux returned. So did FileMaker, more or less.
The bad news was that the network interface was just non-existent. A few errors at boot and no hme0 in sight. Tried another reboot, a graceful one this time. No go. At that point the first colleagues started showing up.
Damn, I'd tried so hard to prevent that. After some messing around and searching on the internet I eventually tried to get the interface back up manually with ifconfig, which, after a few fruitless attempts ("WHAT DO YOU MEAN HME0 DOESN'T EXIST!?!?") and a little assistance from YaST, miraculously worked.
GOOD. Now to kick FileMaker into gear and rebind it to the interface and I can finally go home (at least that's what I thought).
And although I should've known better by then, the disappointment was again rather big when that turned out to be far less easy than I had imagined. FileMaker kicked itself up, bound to the interface, allowed connections, and... got completely stuck.
99% CPU, no response at all. I tried it a number of times and there was nothing I could do about it; it happened as soon as a connection was established.
One last flash of intelligence: I moved all the database files out of the directory and fired up FileMaker once again. O joyous day! It worked! Next I added every file manually through its administration interface. Turns out every file not only had to undergo a number of consistency checks and stuff, but all of the indexes were being rebuilt as well. This was all well and good and it didn't even cost too much effort (average load on the SuSE box is about a whopping 0.06), but apparently it had to be done with some degree of measured control.

And so, after manually adding 55 files to the database, at 10.30 hours I was f-i-n-a-l-l-y ready to go home. Somebody tried to tell me the e-mail system was down as well, but I had had enough (and I fixed it within 5 minutes the evening after by manually supplying the e-mail server with the signature file which usually, you guessed it, is available over a currently non-available NFS mount).

Afterwards I went to this good friend of mine (who had arrived at 23.00 hours that evening and provided support throughout the entire ordeal) and had a well-deserved beer, was just able to fend off a heart attack and that was it for the adventurous Thursday the 5th and Friday the 6th of November 2005.
