Recovering messed LCFG SL5 Xen virtual machine

At the moment there is no support for paravirtualised LCFG-managed virtual machines (because of the bootstrapping mechanism), so all the deployed LCFG-managed virtual machines are fully virtualised. That means that changing the number of CPUs of a VM requires a reboot, as such changes can’t be made on the fly as they can on paravirtualised VMs. For that reason I had to reboot one of these systems to increase the number of CPUs in use.

After rebooting the VM and loading Grub, the infamous “Error 15: File not found…” appeared. Using ‘kpartx’ to access the partitions of the disk image, I mounted the root partition under a local directory on the host system and realised that there was no vmlinuz or initrd under /boot. Pretty strange, as I couldn’t recall anything that could have caused that. I got the required files from another identical virtual machine (running on the same hardware, with the same specs and the same OS and kernel version) and copied them over. The faulty VM was now able to boot, but printed two worrying messages:

ERROR: DM multipath kernel driver not loaded

and

tg3 device eth0 does not seem to be present, delaying initialization

I first checked the ethernet driver, as DM was not used anyway: the entry in /etc/modprobe.conf was there, intact as it should be. I then realised that /usr/src/kernel was totally empty and most of the modules were missing from /lib/modules/2.6.18.123.4.1.el5/*. The latter could explain both the DM and ethernet problems. I did the same as with the /boot files and copied those over from another identical virtual machine. After the next reboot there were no DM or eth0 errors, a good sign. However, the LCFG components failed to start, most importantly the boot component. In addition, the login prompt showed ‘localhost’ as the hostname instead of the real one. A look at /var/lcfg/log/boot revealed that the system couldn’t locate its configuration database because of the wrong hostname:
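Before copying files over from the healthy VM, it can help to list exactly what is missing. The following is only a rough sketch of that comparison, assuming both root filesystems are mounted somewhere on the host; the function name `missing_files` and the mount points in the usage note are made up for illustration and are not part of LCFG or Xen.

```python
import os


def missing_files(reference_root, faulty_root, subdir):
    """List files found under subdir in the reference tree but not in the faulty one.

    Paths are returned relative to subdir, sorted alphabetically.
    """
    ref_base = os.path.join(reference_root, subdir.lstrip("/"))
    bad_base = os.path.join(faulty_root, subdir.lstrip("/"))
    missing = []
    for dirpath, _dirnames, filenames in os.walk(ref_base):
        rel_dir = os.path.relpath(dirpath, ref_base)
        for name in filenames:
            rel = os.path.normpath(os.path.join(rel_dir, name))
            # lexists() so that a symlink counts as present even if dangling
            if not os.path.lexists(os.path.join(bad_base, rel)):
                missing.append(rel)
    return sorted(missing)
```

For example, with the healthy image mounted on a hypothetical /mnt/good and the faulty one on /mnt/bad, `missing_files("/mnt/good", "/mnt/bad", "/lib/modules")` would return the module files that need to be copied, and the same call with `"/boot"` would cover the kernel and initrd.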

qxprof: can't tie /var/lcfg/conf/profile/dbm/localhost.DB2.db :no such file or directory

I had a look at /etc/sysconfig/network and the file was empty! That explained ‘localhost’ as the hostname. I edited it according to the system’s configuration and gave it one more kick. This time the system started up fine. The LCFG components were complaining, though, that they couldn’t find the group ‘lcfg’, which shows that the machine was seriously messed up. Nevertheless, all the components did their work as they should: all users and groups were automatically sorted out, the machine could bind to the NIS domain, the kernel updates were applied successfully and the machine was back to normal. Despite its working status, I still haven’t figured out exactly why the kernel packages were missing. A quick look at updaterpms (the LCFG component that takes care of the RPMs) shows that, since two days ago, the kernel packages have had an ignore status. I now need to find out why that happened and whether it causes these packages to be uninstalled automatically.
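The empty /etc/sysconfig/network could have been spotted while the image was still mounted on the host. Here is a minimal sketch of such a check, assuming the usual KEY=value layout of that file; the helper name `read_sysconfig_hostname` and the mount point in the usage note are hypothetical.

```python
def read_sysconfig_hostname(path):
    """Return the HOSTNAME value from a KEY=value style file.

    Returns None if the file is missing, empty, or has no HOSTNAME entry.
    """
    try:
        with open(path) as fh:
            for raw in fh:
                line = raw.strip()
                if not line or line.startswith("#"):
                    continue  # skip blank lines and comments
                key, sep, value = line.partition("=")
                if sep and key.strip() == "HOSTNAME":
                    # drop optional surrounding quotes
                    return value.strip().strip("\"'") or None
    except FileNotFoundError:
        return None
    return None
```

Calling `read_sysconfig_hostname("/mnt/bad/etc/sysconfig/network")` on the mounted image (the /mnt/bad mount point is an assumption) returns None for an empty file like the one described here, which would explain the fallback to ‘localhost’.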

2 thoughts on “Recovering messed LCFG SL5 Xen virtual machine”

  1. Faidon Liambotis

    Mounting the VM image with DM from the host and doing changes is *very* dangerous. This is because (parts of) the image are in the host’s page cache and you’re changing the underlying data without flushing the cache.

    The best way to do that is to create and run a program that executes posix_fadvise with the FADV_DONTNEED flag each time you’re doing changes and before re-booting the VM.

  2. panoskrt

    Faidon, let me outline the steps, as I may be missing something here:

    1) shutdown vm
    2) create device mapper loop with kpartx -a .img
    3) mount /dev/mapper/loopp /mnt/
    4) do changes
    5) umount /mnt/
    6) remove device mapper loop with kpartx -d .img
    7) boot vm

    I assume that once the partition is unmounted and the loop device is removed, the VM’s disk image data are saved and flushed from the host’s cache. Isn’t that so? Or do you actually mean that the data of the previously running VM are still in the host’s cache and the same data may be used the next time the VM is launched?
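For reference, the posix_fadvise suggestion from the first comment could look roughly like the sketch below. It is an illustration only, assuming a Linux host and Python 3.3 or later; the function name `drop_cached_pages` is made up. The call is advisory, so the kernel may keep some pages anyway, and whether it is needed at all after a clean umount and kpartx -d is exactly the open question in the second comment.

```python
import os


def drop_cached_pages(image_path):
    """Ask the kernel to drop cached pages of image_path (advisory only)."""
    fd = os.open(image_path, os.O_RDONLY)
    try:
        # Dirty pages cannot be dropped, so write them back first.
        os.fsync(fd)
        # offset 0 and length 0 cover the whole file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

It would be run against the disk image file after the changes are made and before the VM is booted again.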
