Windows NT Serviceability

A few years ago, I owned a lovely Beetle 1300 that only let me down about twenty or so times in the two years I owned it. As a result, I owned a great owner’s repair guide, written by an old hippie. It was a great read in its own right, and I used the book extensively. One of the things that stuck with me is that the author told of a time that he took apart and serviced a Buick auto transmission using the instructions for a Beetle auto transmission. It worked, and he learnt a lot during the process. In the same way, I am hoping that you’ll stick with me, think outside the square for a few minutes, and see if you can take an idea or two from my article and apply it to your own situation.

Serviceability

On October 20, Microsoft released Service Pack 4. This is the service pack that all NT shops have been dying for. After a considerable wait, it’s finally here, and it looks as if Microsoft has finally taken comments from the system administration community seriously. One of the bigger problems with software development, especially on a code base as complex as Windows NT, Solaris or Linux, is that it’s hard to separate new functionality from fixes. Microsoft has provided three different levels of update for SP4, based upon feedback garnered over the last few years.

The smallest update is just the fixes. In the minimal update, 641 new fixes (plus all the old ones) are provided in a single 260 KB file. That’s fine if you don’t need to tick the y2k box and don’t want any of the new features.

The intermediate update, 32 MB in size, not only fixes all known NT problems, but also provides a lot of extra fixes and some new functionality specifically asked for by NT security gurus, such as new versions of PPTP and LMv2 security. In many cases, to get really secure you need to ditch Windows 9x from your network. For the dubious still reading, 32 MB is comparable to the 41.9 MB in recommended patches for Solaris 2.5.1 or 20 MB for Solaris 2.6, which has not been out as long as NT 4.0 has.

NT 4.0 had two y2k bugs and about four or five cosmetic y2k bugs. Microsoft provides the 76 MB y2k fix to get as many customers as possible to the same supportable configuration. This huge meta-update contains IE 4.01 SP1, SP4, a data connector update, and some BackOffice fixes you need to make NT 4.0 y2k compliant.

SP4 is one of the easiest service packs to apply in a long time. Click a couple of boxes, and it munges away. The downtime window is very small: the time it takes your server to shut down and restart (mostly less than five minutes on the Intel and Alpha servers I’ve updated so far). But as always, prepare for the worst. Make an emergency repair disk (rdisk.exe), do a full backup, ensure that you understand your own disaster recovery plan (DISPLAN), and make sure you have your NT CD (and if you need them, the three boot disks) handy. The best case downtime window will be the same, but you’ll be in a much better position in case something goes wrong.

My success ratio with SP4 is good: it fixed one seriously ill server that was cruising for a bruising with the NT install CD. The only thing stopping me from reinstalling was that it was our primary domain controller. It wouldn’t give up being the primary domain controller, and the bandwidth to the box was approximately 9600 bps over a 100 Mbps full duplex Ethernet connection. However, NT is a very stable OS, and even though it was very sick, it stayed up for months on end, and it reliably serviced over 200,000 DNS and 142,000 WINS queries per week. Applying SP4 fixed both the promotion/demotion and the bandwidth issues, so it’s back to normal.

There was one “server” that didn’t take SP4 too well. It comes down to what we class as servers. This unit was an old HP Vectra VL 5/200 with 48 MB of RAM. It was servicing the Cisco 5200’s TACACS+ needs for the place I used to work at. I’m no great fan of using desktop PCs as servers. My basic requirement for a server is that if it’s important enough to dedicate a machine to, it’s important enough to do it right. This means providing the necessary infrastructure and support for a server level operating system: things like a CD-ROM drive, some way of backing up and restoring the server commensurate with its importance in the enterprise, and a tier one vendor that will support you when you have problems.

HP, like most tier one vendors (such as Sun, IBM, Compaq, Compaq née Digital, Apple, and others), has two or more separate product lines: a desktop line and a server line. My personal opinion is that sometimes the distinctions can just be marketing, but HP provides support for NT Server, SCO Unix, Solaris x86, OS/2 Warp Server and NetWare only on its NetServers. It does not support these OSes on its desktop PCs. For any corporation, the data or service is worth far more to the organisation than the hardware. That’s why I baulk at installing server level OSes on desktop PCs unless those PCs are going to be used by a single user under test conditions, and even then a desktop PC is no predictor of success when translated to the real thing. In a bad taste analogy, it’s like clinical tests on mice: some drugs that are benign to humans are fatal to mice, and vice versa.

If you’re not buying servers from tier one vendors, I’m sorry, but that’s not such a good idea. I know friends who have rolled their own servers, but let me relay to you what happened at my last site with a roll-your-own. The machine was massively built: it was a full tower with an Asus mainboard, a DPT caching RAID card, heaps of RAM, the works. The problem was that the drive cage was painted with non-conductive paint. After a year of heavy service, the insulation wore through from the vibration, and the drives started to earth their circuitry to the cage and died. First one drive died, and no one noticed because the box didn’t have any monitoring software loaded, nor did the RAID card have a $2 piezoelectric bleeper like the HP NetRAID cards do. So the DPT controller made up the difference using parity. Then the next drive died, and the server stopped. There were no backups of the box for a month because the DDS tape drive could not read its own tapes (which is why you verify). The excrement hit the fan and someone got the arse. The server cost only $2000 less than an equivalent HP server, which would have come with vendor support (i.e. if a component dies, they courier out a replacement) and true hot swap rather than just the cold swap of the roll-your-own. Is your job worth $2000? The month’s lost data was worth far more than $2000 (mid-six figures, actually). If you’re wondering, NT was not the NOS running this box, but that’s irrelevant to this recounting.
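To see why the box survived the first dead drive but not the second, here is a minimal sketch of single-parity reconstruction. It assumes a simple XOR parity scheme of the RAID 5 kind; the story above doesn’t say which RAID level the DPT card was running, and the block contents and function names are made up purely for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_parity(data_blocks):
    """The parity block is simply the XOR of every data block in the stripe."""
    return xor_blocks(data_blocks)

def rebuild_missing(surviving_blocks, parity):
    """With exactly one block lost, XOR of the survivors and parity restores it."""
    return xor_blocks(list(surviving_blocks) + [parity])

# Hypothetical stripe spread across three data drives (illustrative contents).
stripe = [b"NT40", b"SP04", b"DISK"]
parity = make_parity(stripe)

# First failure: drive 1 dies, and the controller rebuilds its data on the fly.
recovered = rebuild_missing([stripe[0], stripe[2]], parity)
assert recovered == stripe[1]

# Second failure: drives 1 and 2 are both gone. All that is left yields the XOR
# of the two lost blocks, which is not enough to recover either of them.
mixed = rebuild_missing([stripe[0]], parity)
assert mixed != stripe[1] and mixed != stripe[2]
print("one drive lost: recovered", recovered)
print("two drives lost: only", mixed, "is recoverable")
```

This is why a degraded array needs to make noise: once the first drive is gone, the redundancy is already spent.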

Server Availability Tips

Do not install any protocols, services or products that are not going to be used as part of the server. For example, do not install IPX/SPX on an Oracle DBMS as clients will not use this protocol to communicate with the server. Never install Simple TCP/IP services.
Always have a CD-ROM drive on your servers. They’re only about $100, and can save you hours of repair time. I’m not too fussy about ATAPI vs SCSI CD-ROMs these days; just make sure that your OS can read it without additional drivers. Panasonic 32x SCSI CD-ROMs are less than $300, so if you can afford the SCSI alternative, go for it.
Make emergency repair disks and disk partition disks on a regular basis. I do ERDs once a week, and disk partition disks about once a month, and I rotate the disks so I have more than one ERD per server. The reason is that floppies are terribly unreliable, and if you’re trusting a six or twelve month old floppy, you’re kidding yourself.
Try to avoid using the console at all. Domain Admin users are able to crash the server (just as in Unix, where root can cd / ; rm -rf * or kill -9 -1). There are some unavoidable reasons to use the console, so schedule this as part of your regular maintenance window.
Make sure you have a regular maintenance window. Never promise 100% uptime, as you’ll be setting unrealistic expectations. The aim is to have 100% availability for core hours. I worked in the hospital system, and we had the aim of 100% availability, but if we needed to, we could take some time from 4 am to 6 am on Sunday morning, or longer if arranged beforehand. As it was, we were in the high 99.994% uptime range (less than 30 minutes of unscheduled downtime per year) for the vast majority of our servers (NT, Novell and Digital Unix). If anyone says that these operating systems are unreliable, I have a bone to pick with them based upon real life experience in the mission critical, health care enterprise arena.
With Windows NT, as in many OSes, it’s worthwhile to separate the data from system files. This means at least two partitions on production servers. I have my own preference for partitioning, but to cut it short, you need about 1 GB for NT’s system partition (to hold the OS, a copy of the installation files, the page file, and drivers), and the rest can be partitioned for user files. If you’re doing a print server (in my book, a server servicing more than 50 or so printers, or one doing PostScript RIP stuff), move the spool to the data partition, as otherwise spooled user files can fill the system partition. The relevant Knowledge Base article is Q123747.
Practice your disaster recovery plans. If you don’t have a test server that’s exactly like your production servers, allocate some budget and buy one. It’ll pay for itself the first time you have a crash. Learn (and document) how to restore your systems as quickly and as reliably as possible. Practice, practice, practice. Don’t have a DISPLAN? Write one today or seek advice on getting one written. They’re living documents, so keep them up to date.
If you don’t have a TechNet subscription, get one. It’s about $800 per year, and worth every cent. If you have even one developer in house, get the MSDN Universal subscription (about $3500 per year at today’s prices). It comes with lots of goodies, including MSDN Library (some of the best answers to your problems are in MSDN) and you get pretty much all the MS products including betas.
NT Magazine is a must have subscription – don’t waste your time with the emasculated Australian edition – pay the extra fifty bucks and get the US one airmailed to you.
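The uptime figure in the maintenance window tip above is easy to sanity check. Here is a rough sketch of the arithmetic, assuming a 365 day year and counting only unscheduled downtime; the function names are mine, not from any tool.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def downtime_minutes_per_year(availability_percent):
    """Minutes of downtime per year implied by an availability percentage."""
    return MINUTES_PER_YEAR * (100.0 - availability_percent) / 100.0

def availability_percent(downtime_minutes):
    """Availability percentage implied by a yearly downtime budget in minutes."""
    return 100.0 * (1 - downtime_minutes / MINUTES_PER_YEAR)

# What 99.994% allows, and what 30 minutes of downtime works out to.
print(f"99.994% allows {downtime_minutes_per_year(99.994):.1f} minutes per year")
print(f"30 minutes per year is {availability_percent(30):.4f}% availability")

# For comparison, some figures that often get promised.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% allows {downtime_minutes_per_year(target):.0f} minutes per year")
```

99.994% works out to roughly 31.5 minutes a year, so staying under 30 minutes puts you just above that mark. The same sums make it obvious how little slack each extra nine leaves you.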

There are various NT resources all over the Net. My favourites include http://ntsecurity.ntadvice.com and http://ntbugtraq.ntadvice.com, both run by Russ Cooper, a featured 1997 SAGE-AU conference speaker.

Avoid letting staff with a little knowledge administer NT. It’s a recipe for disaster. Teach them a few things every month and bring their knowledge up, rather than let them just go for it. Management will dislike you because you’re “reducing productivity” or looking like a control freak (management speak: “You are not being a team player”), but the alternative is massive amounts of downtime. Make sure that they are interested in boosting their knowledge levels by making them go for the MCSE exams. The exams are $135 a pop and easy to pass as long as you actually use and understand the product (the instructor led courses can help, but they’re not mandatory). Under no circumstances give out Domain Admin privilege to those who do not need it.

In the next newsletter, I’ll explain how to use the resource kit utilities to administrate NT from a user level account (with access to a domain admin account, of course!).

Slagging Microsoft

Like many of you, I read Slashdot, although I am beginning to wonder why. Originally, Slashdot was a fun site that had many cool stories and lots of nifty Linux/Open Source articles. However, more and more often it has descended to outright MS bashing. Now I am not going to defend Microsoft for everything they do, because I personally find their marketing and monopolistic practices loathsome.

What’s the relevance? The problem is that SAGE-AU’s mailing list has descended to the lowest levels of Slashdot of late. The Executive will be making some announcements soon on measures to curtail the level of vendor bashing on the lists. This is because we are putting off people who might otherwise ask questions that are necessary for them to get their job done. For example, I haven’t seen a NetWare-specific question on the list this year. Is it because we have no NetWare people on the list, or is it because the NetWare people are fearful of being slagged by both the MS and Unix weenies? This is not professional behaviour, dudes!

Whatever the reason, the SAGE-AU Executive have decided to take some action to curtail advocacy or just plain emotive slagging. There’s no point in voicing the opinion that OS x is not stable or is unsuitable to a particular task, particularly when the admin asking the question might already be using OS x in that situation quite happily. They may only have a small problem that would make their life easier if someone else on the list has already solved it.
