I had accidentally given the same UUID to two guests and didn’t realise it until I tried to start the second one. I generated a new UUID, but the guest was still failing, this time with a different error, complaining that its disk image was in use by another guest:
# xm create node6
Using config file "/etc/xen/node6".
Error: Device 768 (vbd) could not be connected.
File /data/guests/node6.img is loopback-mounted through /dev/loop5,
which is mounted in a guest domain,
and so cannot be mounted now
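In case it’s useful, generating and applying a new UUID is a one-line affair; a minimal sketch, assuming an xm-style config file at /etc/xen/node6 with a uuid = line in it:

# uuidgen                    # prints a fresh UUID
# vi /etc/xen/node6          # and set:  uuid = "<the new value>"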
The error was a bit odd, as I was sure the disk image was not used by any other guest. After a couple of tries, I used the ‘lsof’ command to check which processes were using the guests’ images. I have all the disk images under /data/guests:
# lsof +D /data/guests/ | grep node6
qemu-dm  6250 root 6u REG 8,17 4194304000 161693706 /data/guests/node6.img
qemu-dm  7121 root 6u REG 8,17 4194304000 161693706 /data/guests/node6.img
qemu-dm  8906 root 6u REG 8,17 4194304000 161693706 /data/guests/node6.img
qemu-dm 11262 root 6u REG 8,17 4194304000 161693706 /data/guests/node6.img
Four different processes pointing at node6’s disk image. The first one was the original attempt to boot the guest with the wrong UUID, and the other three were my attempts to boot it after I changed the UUID. As none of these processes corresponded to a running instance of the guest, I killed them all:
# for i in `lsof +D /data/guests/ | grep node6 | awk '{print $2}'`; do kill -9 $i; done
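If you prefer not to jump straight to SIGKILL, a slightly gentler variant of the same loop (same assumptions: images under /data/guests, guest named node6) is to send a plain SIGTERM first and only force-kill whatever survives:

# for i in `lsof +D /data/guests/ | grep node6 | awk '{print $2}'`; do kill $i; sleep 2; kill -0 $i 2>/dev/null && kill -9 $i; done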
As a side note, if one of these had been the last, successful attempt to run the guest, I would have had to identify which one was the running instance by checking the PIDs. The following command would return the PIDs of the failed attempts to start the guest:
# lsof +D /data/guests/ | grep node6 | grep -v `ps aux | grep node6 | grep -v grep | awk '{print $2}'` | awk '{print $2}'
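The same thing reads a bit more easily split in two; just a sketch, assuming exactly one instance of node6 is actually running (the ‘[n]ode6’ trick keeps grep from matching itself, and -w keeps the PID match exact):

# RUNNING=`ps aux | grep '[n]ode6' | awk '{print $2}'`
# lsof +D /data/guests/ | grep node6 | awk '{print $2}' | grep -vw "$RUNNING"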
Once I killed the processes, I checked again with lsof and everything looked good:
# lsof +D /data/guests/
COMMAND  PID  USER FD TYPE DEVICE SIZE       NODE      NAME
qemu-dm 3345 root 6u REG  8,17 4194304000 161693699 /data/guests/node1.img
qemu-dm 3633 root 6u REG  8,17 4194304000 161693702 /data/guests/node2.img
qemu-dm 3788 root 6u REG  8,17 4194304000 161693703 /data/guests/node3.img
qemu-dm 3945 root 6u REG  8,17 4194304000 161693704 /data/guests/node4.img
qemu-dm 4154 root 6u REG  8,17 4194304000 161693705 /data/guests/node5.img
qemu-dm 4513 root 6u REG  8,17 4194304000 161693708 /data/guests/node8.img
qemu-dm 4996 root 6u REG  8,17 4194304000 161693707 /data/guests/node7.img
But trying to boot the guest again gave the same error. I then powered off all of the running guests (node1-5, 7, 8) and tried to boot node6 once more; it failed again with the same error. I was a bit puzzled, as there was no zombie instance listed by ‘xm list’ and no zombie guest ID listed in XenStore when running ‘xenstore-list backend/vbd’. Weirdly enough, though, previous experience told me that with no guests running the backend/vbd path shouldn’t be present at all, yet it was. The only thing I could think of was a “ghost” zombie instance keeping the disk image busy, since node6 had gone zombie at some point during my attempt to start it with the wrong UUID. My last try before rebooting the whole system was to remove the whole backend/vbd:
# xenstore-rm backend/vbd
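Before blowing the whole path away, it can be worth dumping what is actually left under it; a quick sketch of the checks I mean, using the same relative backend/vbd path as above:

# xm list                    # no zombie domain for node6 listed here
# xenstore-list backend/vbd  # lists the domain IDs that still have vbd backend entries
# xenstore-ls backend/vbd    # dumps the whole leftover subtree for a closer look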
Once I did, I restarted the xend daemon and tried to start node6 once again. It worked! I then booted the rest of the guests and every one of them came up happily.
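For completeness, the restart-and-retry step amounted to something like the following (the init script name may vary by distro; on RHEL-style systems xend is managed as a service):

# service xend restart
# xm create node6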
I still can’t determine the exact cause, but my guess is that it was the “ghost” zombie guest left behind by the failed startup attempt with the wrong UUID, while another guest instance was running with that same UUID.
Note: It’s really funny how “kills”, “zombies”, “daemons” and “ghosts” go along with computers 😛