There is currently no support for paravirtualised LCFG-managed virtual machines (because of the bootstrapping mechanism), so all deployed LCFG-managed virtual machines are fully virtualised. As a result, changing the number of CPUs of a VM requires a reboot, since such changes can’t be made on the fly as they can on paravirtualised VMs. That is why I had to reboot one of these systems: to increase the number of CPUs it uses.
After the reboot, Grub loaded and the infamous “Error 15: File not found…” appeared. I used ‘kpartx’ to access the partitions of the disk image, mounted the root partition under a local directory on the host system, and found that there was no vmlinuz or initrd under /boot. Pretty strange, as I couldn’t recall anything that could have caused that. I got the required files from another identical virtual machine (running on the same hardware, with the same specs and the same OS and kernel version) and copied them over. The faulty VM could now boot, but it printed two worrying messages:
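The check I did by hand (are the files Grub wants actually there?) can be sketched in a few lines of Python. This is only an illustration, not part of the original fix: the function names are made up, it assumes a GRUB legacy style config with ‘kernel’ and ‘initrd’ lines, and it assumes the paths on those lines are relative to the directory you pass in (the mounted /boot partition, or the mounted root if /boot is not a separate partition).

```python
import os


def referenced_boot_files(grub_conf_text):
    """Return the paths named on 'kernel' and 'initrd' lines of a GRUB legacy config."""
    paths = []
    for line in grub_conf_text.splitlines():
        parts = line.split()
        # The first word is the directive, the second is the file Grub will load.
        if len(parts) >= 2 and parts[0] in ("kernel", "initrd"):
            paths.append(parts[1])
    return paths


def missing_boot_files(grub_conf_text, boot_root):
    """Return the referenced paths that do not exist under boot_root.

    boot_root is wherever the guest partition is mounted on the host
    (a hypothetical mount point chosen by the admin).
    """
    missing = []
    for path in referenced_boot_files(grub_conf_text):
        full_path = os.path.join(boot_root, path.lstrip("/"))
        if not os.path.isfile(full_path):
            missing.append(path)
    return missing
```

Fed the text of the guest’s grub.conf and the host-side mount point, missing_boot_files returns the kernel and initrd entries that would trigger the “File not found” error.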
ERROR: DM multipath kernel driver not loaded
and
tg3 device eth0 does not seem to be present, delaying initialization
I checked the ethernet driver first, as DM was not in use anyway: the entry in /etc/modprobe.conf was there, intact as it should be. I then realised that /usr/src/kernel was completely empty and that most of the modules were missing from /lib/modules/2.6.18.123.4.1.el5/*. The latter could explain both the DM and the ethernet problems. I did the same as with the /boot files and copied the missing files over from another identical virtual machine. After the next reboot there were no DM or eth0 errors, a good sign. However, the LCFG components failed to start, most importantly the boot component. On top of that, the login prompt showed ‘localhost’ as the hostname instead of the machine’s real one. A look at /var/lcfg/log/boot revealed that the system couldn’t locate its configuration database because of the wrong hostname:
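Since there was an identical, healthy VM to copy from, the same comparison could also be used to see exactly which files had gone missing before copying anything. The following is a rough sketch under that assumption: both directory trees (for example the two guests’ module directories) are mounted somewhere on the host, and the function names are invented for this example.

```python
import os


def relative_files(top):
    """Return the set of file paths below top, relative to top."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            found.add(os.path.relpath(full_path, top))
    return found


def missing_from(reference_dir, suspect_dir):
    """List files present under reference_dir but absent under suspect_dir."""
    return sorted(relative_files(reference_dir) - relative_files(suspect_dir))
```

It only compares file names, not contents, so it shows what disappeared but says nothing about files that are present and damaged.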
qxprof: can't tie /var/lcfg/conf/profile/dbm/localhost.DB2.db :no such file or directory
I had a look at /etc/sysconfig/network and the file was empty! That explained ‘localhost’ as the hostname. I edited the file to match the system’s configuration and gave the machine one more kick. This time the system started up fine. The LCFG components did complain that they couldn’t find the group ‘lcfg’, which shows that the machine was seriously messed up. Nevertheless, all the components did their work as they should: all users and groups were sorted out automatically, the machine could bind to the NIS domain, the kernel updates were applied successfully and the machine was back to normal. Although it is working again, I still haven’t figured out exactly why the kernel packages were missing. A quick look at updaterpms (the LCFG component that takes care of the RPMs) shows that since two days ago the kernel packages had an ignore status. I now need to find out why that happened and whether it causes these packages to be uninstalled automatically.
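To make the hostname problem concrete, here is a small sketch of how an empty /etc/sysconfig/network ends up pointing at a profile file that does not exist. It is illustrative only: the function names are invented, the ‘localhost’ fallback simply mirrors what this machine showed, and the profile path pattern is copied from the qxprof error message and assumed (not verified against LCFG itself) to hold for other hostnames.

```python
def parse_sysconfig(text):
    """Parse simple KEY=value lines, ignoring blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Drop optional surrounding quotes around the value.
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def effective_hostname(text, default="localhost"):
    """Return HOSTNAME from the file contents, or the assumed fallback if unset."""
    return parse_sysconfig(text).get("HOSTNAME") or default


def profile_db_path(hostname):
    """Build the profile path using the pattern seen in the qxprof error (assumption)."""
    return "/var/lcfg/conf/profile/dbm/%s.DB2.db" % hostname
```

With an empty file, profile_db_path(effective_hostname("")) gives exactly the localhost.DB2.db path from the error, which is why fixing HOSTNAME was enough to get the components going again.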