HFS+ on Linux

My MacBook Pro seems to have died. Its hard disk holds a lot of data, most of which has been backed up, but the latest changes haven't. Before recovering anything I wanted to check the disk for bad sectors, and since I don't have another Mac to test it on, I had to do it on my Fedora box. My system (Fedora 15) would automatically detect the HFS+ hard disk and mount it read-only, but it was missing an fsck tool for HFS+ partitions. Getting one required downloading and installing hfsplus-tools via yum. However, fsck.hfs will refuse to scan the partition (unless you use the --force option) while journalling is on, which is the case for HFS+ partitions by default.

But how would I turn off journalling without a Mac system to attach my disk to? After a couple of minutes I came across a post on the Ubuntu forum which linked to this blog. The author provides C code that turns journalling off; a fixed version of this code can be found here. The compiled executable takes a single argument: the partition on which you want to turn journalling off.

# gcc journalling_off.c -o journalling_off
# ./journalling_off /dev/sdg2

The next step was to run the fsck check on the target disk:

# fsck.hfs /dev/sdg2
** /dev/sdg2
** Checking HFS Plus volume.
** Checking Extents Overflow file.
** Checking Catalog file.
** Checking multi-linked files.
** Checking Catalog hierarchy.
** Checking Extended Attributes file.
** Checking volume bitmap.
** Checking volume information.
** The volume OS X appears to be OK.

The disk looks OK. The next step is to mount it with read-write permissions:

# mount -t hfsplus -o rw,user /dev/sdg2 /mnt/osx
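If the disk is going to stay attached for a while, the mount can be made persistent with an /etc/fstab entry; a sketch, assuming the same device and the /mnt/osx mount point used here:

```
# /etc/fstab entry (hypothetical): mount read-write on demand, not at boot
/dev/sdg2   /mnt/osx   hfsplus   rw,user,noauto   0 0
```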

The next issue was the mismatch between my account's UID on the OS X system and the one on the Linux system. So the final step was to change the ownership of everything under my user directory on the OS X disk, so I could access it with write permissions from my Linux box without problems:

# find panoskrt/ -uid 501 -exec chown panoskrt {} \;
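The same find/-exec pattern can be tried safely as an unprivileged user before running it against the real disk; a sketch using a temporary directory and your own UID (paths hypothetical):

```shell
# Create a scratch directory with two files owned by the current user
tmp=$(mktemp -d)
touch "$tmp/a" "$tmp/b"
# Select files by numeric owner UID and hand each one to chown,
# just as done on the OS X disk (chown to ourselves is a no-op)
find "$tmp" -type f -uid "$(id -u)" -exec chown "$(id -un)" {} \;
# Both files still match the UID test after the chown
find "$tmp" -type f -uid "$(id -u)" | wc -l
rm -rf "$tmp"
```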

4th IC-SCCE

Last week I was in Athens for the 4th International Conference from Scientific Computing to Computational Engineering. There were many interesting talks from a wide range of areas. My talk was about “HPC Applications Performance on Virtual Clusters”. The main outcome from my perspective: we need to investigate GPU virtualisation. More and more scientists and researchers want to exploit such systems, and in the near future we'll need to deploy virtualised GPU systems in the same way we do with CPUs.

My paper
My presentation

Linux buffer cache state

Following Faidon’s comment on an earlier post, I came across this informative site concerning the state of Linux's buffer cache. In a nutshell, the following code will release all the cached data of a specified file.

#define _XOPEN_SOURCE 600
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int main(int argc, char *argv[]) {
    int fd;
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    fdatasync(fd);                                /* flush dirty pages to disk first */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED); /* then drop the cached pages */
    close(fd);
    return 0;
}

There are more useful samples and examples on the mentioned web page; the posix_fadvise description is here.

Linux software RAID

Recently I got two 80GB Maxtor disks. My existing external hard disk sometimes runs out of space, which can easily be sorted by erasing unnecessary data, but as it has been in constant use for around 4 years, I thought it might be a good idea to use the additional Maxtors to build a software RAID and back up the data it stores.

My setup is pretty simple, as all of the disks are external and connected to the main system via USB. The two 80GB disks are in RAID-1 (mirroring), syncing all the required data from the existing external hard disk via rsync in a cron job. To keep syncing simple, I created a single partition on each of the RAID disks (/dev/sdb1 and /dev/sdc1) and then created a RAID-1 device, /dev/md0.

I tried two different ways of configuring the RAID: one with raidtools2 and one with mdadm.

mdadm is straightforward and can be used directly from the command line, as below:

mdadm --create --verbose /dev/md0 --level=raid1 --raid-devices=2 /dev/sdb1 /dev/sdc1
mdadm: chunk size defaults to 64K
mdadm: /dev/sdb1 appears to contain an ext2fs file system
    size=78156192K  mtime=Thu Jan  1 01:00:00 1970
mdadm: /dev/sdc1 appears to contain an ext2fs file system
    size=78156192K  mtime=Thu Jan  1 01:00:00 1970
Continue creating array? (y/n) y
mdadm: array /dev/md0 started.

There is no need to explain the mdadm parameters, as it is pretty obvious what is happening; a look at the man page reveals all the possible options and what they stand for.

You can also check in /proc/mdstat to see if the RAID is running:

# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[1] sdb1[0]
      167772160 blocks 64k rounding

The other way is to use raidtools and declare the RAID setup in /etc/raidtab:

$ cat /etc/raidtab
raiddev /dev/md0
        raid-level              1
        nr-raid-disks           2
        nr-spare-disks          0
        chunk-size              4
        persistent-superblock   1
        device                  /dev/sdb1
        raid-disk               0
        device                  /dev/sdc1
        raid-disk               1

And then the RAID can be created:

# mkraid /dev/md0
handling MD device /dev/md0
analyzing super-block
disk 0: /dev/sdb1, 78156193kB, raid superblock at 78156096kB
disk 1: /dev/sdc1, 78156193kB, raid superblock at 78156096kB

Either way, raidtools or mdadm, you can then partition and format /dev/md0 in the normal way, using fdisk and mkfs.ext*. Once that is done, the partitions can be mounted like any other partition, and syncing between the external storage disk and the RAID can start.

I think that I’ll stick with mdadm as it is easier and more flexible than raidtools.
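To have the array assembled automatically at boot, the array definition can be recorded in /etc/mdadm.conf with `mdadm --detail --scan >> /etc/mdadm.conf`; the resulting entry looks roughly like this (the UUID below is a hypothetical placeholder, as each array gets its own):

```
DEVICE /dev/sdb1 /dev/sdc1
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=<array-uuid>
```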

Some useful links:

How to replace a failed disk on Linux software RAID-1
mdadm: A new tool for Linux software RAID management
The Software-RAID HOW-TO

Report disk space usage

A quick shell script that reports which disks use more than a specified percentage of their disk space. It can be used as a cron job to mail the results.

#!/bin/bash
############################################################################
# Copyright (C) 2009  Panagiotis Kritikakos <panoskrt@gmail.com>           #
#                                                                          #
#    This program is free software: you can redistribute it and/or modify  #
#    it under the terms of the GNU General Public License as published by  #
#    the Free Software Foundation, either version 3 of the License, or     #
#    (at your option) any later version.                                   #
#                                                                          #
#    This program is distributed in the hope that it will be useful,       #
#    but WITHOUT ANY WARRANTY; without even the implied warranty of        #
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
#    GNU General Public License for more details.                          #
#                                                                          #
#    You should have received a copy of the GNU General Public License     #
#    along with this program.  If not, see <http://www.gnu.org/licenses/>. #
############################################################################

if [ $# -ne 1 ]; then
  echo "Usage: ./disk_usage <percentage>";
  echo " Example: ./disk_usage 50";
  exit 1;
fi

# Collect the Use% column, stripping the '%' sign, the header and empty lines
space=(`df -h | awk '{sub(/%/,"");print $5}' | grep -v / | grep -v Use | grep -v ^$`)
spaceLen=${#space[*]}

i=0
echo "The disks below use ${1}% or more of their space" > /tmp/diskSpace
echo "-----------------------------------------------------" >> /tmp/diskSpace
while [ $i -lt $spaceLen ]; do
   checkval=${space[$i]}
   if [ $checkval -ge $1 ]; then
        df -h | grep $checkval | awk '{print $1 "\t" $5 "\t" $6}' >> /tmp/diskSpace
   fi
   let i++
done
/usr/bin/Mail -s "Disk space usage" foo@bar.com < /tmp/diskSpace
rm -f /tmp/diskSpace
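The array and grep bookkeeping above can also be collapsed into a single awk pass over the df output. A sketch against a hypothetical two-line df sample, so it can be run anywhere:

```shell
# Feed awk a fake df -h output: skip the header row, strip the '%' from the
# Use% column, and print device, usage and mount point for rows >= limit
printf '%s\n' \
  'Filesystem Size Used Avail Use% Mounted' \
  '/dev/sda1 50G 40G 10G 80% /' \
  '/dev/sdb1 100G 20G 80G 20% /data' |
awk -v limit=50 'NR > 1 { sub(/%/, "", $5); if ($5 + 0 >= limit) print $1 "\t" $5 "%\t" $6 }'
```

With a real system you would replace the printf with `df -h` and set `limit` from `$1`.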

Xen and loop devices continued

I created one more guest, making nine in total, but when I tried to power it on, I got the following error message:

# xm create node9
Using config file "/etc/xen/node9".
Error: Device 51712 (vbd) could not be connected. Failed to find an unused loop device

Once again a problem with the disk image in combination with a loop device, but a different one from yesterday's.
I checked which loop devices were in use:

# losetup -a
/dev/loop0: [0811]:161693699 (/data/guests/node1.img)
/dev/loop1: [0811]:161693702 (/data/guests/node2.img)
/dev/loop2: [0811]:161693703 (/data/guests/node3.img)
/dev/loop3: [0811]:161693704 (/data/guests/node4.img)
/dev/loop4: [0811]:161693705 (/data/guests/node5.img)
/dev/loop5: [0811]:161693706 (/data/guests/node6.img)
/dev/loop6: [0811]:161693707 (/data/guests/node7.img)
/dev/loop7: [0811]:161693708 (/data/guests/node8.img)

And then how many are available in total:

# ls -l /dev/loop*
brw-r----- 1 root disk 7, 0 Mar  9 22:08 /dev/loop0
brw-r----- 1 root disk 7, 1 Mar  9 22:08 /dev/loop1
brw-r----- 1 root disk 7, 2 Mar  9 22:08 /dev/loop2
brw-r----- 1 root disk 7, 3 Mar  9 22:08 /dev/loop3
brw-r----- 1 root disk 7, 4 Mar  9 22:08 /dev/loop4
brw-r----- 1 root disk 7, 5 Mar  9 22:08 /dev/loop5
brw-r----- 1 root disk 7, 6 Mar  9 22:08 /dev/loop6
brw-r----- 1 root disk 7, 7 Mar  9 22:08 /dev/loop7

All eight loop devices are used. If each guest had a second disk image as well, I'd have hit the problem after node4. The problem stems from the fact that the default number of loop devices provided by the kernel is eight, and every Xen guest that doesn't use the Xen blktap driver will use a loop device for every disk image it is assigned. The solution is to increase the number of loop devices. That can be done by editing /etc/modprobe.conf and adding a new definition for the maximum number of loop devices:

options loop max_loop=24

For that to take effect, you'll need to either reboot the system or reload the loop kernel module. To reload the module, the current loop devices must be free, so every guest should be powered off first. Then simply:

# rmmod loop
# modprobe loop
# ls -l /dev/loop*
brw-r----- 1 root disk 7,  0 Mar 24 12:25 /dev/loop0
brw-r----- 1 root disk 7,  1 Mar 24 12:25 /dev/loop1
brw-r----- 1 root disk 7, 10 Mar 24 12:27 /dev/loop10
brw-r----- 1 root disk 7, 11 Mar 24 12:27 /dev/loop11
brw-r----- 1 root disk 7, 12 Mar 24 12:27 /dev/loop12
brw-r----- 1 root disk 7, 13 Mar 24 12:27 /dev/loop13
brw-r----- 1 root disk 7, 14 Mar 24 12:27 /dev/loop14
brw-r----- 1 root disk 7, 15 Mar 24 12:27 /dev/loop15
brw-r----- 1 root disk 7, 16 Mar 24 12:27 /dev/loop16
brw-r----- 1 root disk 7, 17 Mar 24 12:27 /dev/loop17
brw-r----- 1 root disk 7, 18 Mar 24 12:27 /dev/loop18
brw-r----- 1 root disk 7, 19 Mar 24 12:27 /dev/loop19
brw-r----- 1 root disk 7,  2 Mar 24 12:25 /dev/loop2
brw-r----- 1 root disk 7, 20 Mar 24 12:27 /dev/loop20
brw-r----- 1 root disk 7, 21 Mar 24 12:27 /dev/loop21
brw-r----- 1 root disk 7, 22 Mar 24 12:27 /dev/loop22
brw-r----- 1 root disk 7, 23 Mar 24 12:27 /dev/loop23

Powering the guests back on and running losetup -a:

# losetup -a
/dev/loop0: [0811]:161693699 (/data/guests/node1.img)
/dev/loop1: [0811]:161693702 (/data/guests/node2.img)
/dev/loop10: [0811]:161693703 (/data/guests/node3.img)
/dev/loop11: [0811]:161693704 (/data/guests/node4.img)
/dev/loop12: [0811]:161693705 (/data/guests/node5.img)
/dev/loop13: [0811]:161693706 (/data/guests/node6.img)
/dev/loop14: [0811]:161693707 (/data/guests/node7.img)
/dev/loop15: [0811]:161693708 (/data/guests/node8.img)
/dev/loop16: [0811]:161693709 (/data/guests/node9.img)

Paravirtualised guests can use the blktap driver to access the virtual block device directly, without a loop device. To do so, the guest's configuration file must specify 'tap:aio:' instead of 'file:' in the disk entry. This is not the case for fully virtualised guests. However, both paravirtualised and fully virtualised guests can use physical partitions (defined as 'phy:'), which eliminates the use of loop devices, among other advantages.
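For reference, the three disk entry styles mentioned above look like this in a guest configuration file (image path and device names here follow this post, but treat the fragment as a sketch):

```
# loop-device backed - the style that exhausts /dev/loopN
disk = [ 'file:/data/guests/node9.img,xvda,w' ]

# blktap backed, paravirtualised guests only - no loop device used
disk = [ 'tap:aio:/data/guests/node9.img,xvda,w' ]

# physical partition - works for both paravirtualised and fully virtualised guests
disk = [ 'phy:/dev/sdb1,xvda,w' ]
```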

Xen guest and loop device being busy

I had accidentally given the same UUID to two guests and didn't realise it until I tried to start the second one. I generated a new UUID, but the guest was still failing, this time with a different error, as if its disk image were in use by another guest:

# xm create node6
Using config file "/etc/xen/node6".
Error: Device 768 (vbd) could not be connected.
File /data/guests/node6.img is loopback-mounted through /dev/loop5,
which is mounted in a guest domain,
and so cannot be mounted now

A bit odd, as I was sure that the disk image was not used by another guest. After a couple of tries, I used 'lsof' to check which processes were using the guests' images. I keep all the disk images under /data/guests:

# lsof +D /data/guests/ | grep node6
qemu-dm  6250 root    6u   REG   8,17 4194304000 161693706 /data/guests/node6.img
qemu-dm  7121 root    6u   REG   8,17 4194304000 161693706 /data/guests/node6.img
qemu-dm  8906 root    6u   REG   8,17 4194304000 161693706 /data/guests/node6.img
qemu-dm 11262 root    6u   REG   8,17 4194304000 161693706 /data/guests/node6.img

Four different processes pointing at the disk image of the node6 guest: the first from the initial attempt to boot the guest with the wrong UUID, and the rest from my attempts to boot it after I changed the UUID. As none of these processes corresponded to a running instance of the guest, I killed all of them:

# for i in `lsof +D /data/guests/ | grep node6 | awk {'print $2'}`;do kill -9 $i;done

As a side note, if one of these had been the last, successful attempt to run the guest, I would have had to identify the running instance by checking the PIDs. The following command returns the PIDs of the failed attempts to start the guest:

# lsof +D /data/guests/ | grep node6 | grep -v \
`ps aux | grep node6 | grep -v grep | awk {'print $2'}` | awk {'print $2'}

Once I killed the processes, I checked again with lsof and everything looked good:

# lsof +D /data/guests/
COMMAND  PID USER   FD   TYPE DEVICE       SIZE      NODE NAME
qemu-dm 3345 root    6u   REG   8,17 4194304000 161693699 /data/guests/node1.img
qemu-dm 3633 root    6u   REG   8,17 4194304000 161693702 /data/guests/node2.img
qemu-dm 3788 root    6u   REG   8,17 4194304000 161693703 /data/guests/node3.img
qemu-dm 3945 root    6u   REG   8,17 4194304000 161693704 /data/guests/node4.img
qemu-dm 4154 root    6u   REG   8,17 4194304000 161693705 /data/guests/node5.img
qemu-dm 4513 root    6u   REG   8,17 4194304000 161693708 /data/guests/node8.img
qemu-dm 4996 root    6u   REG   8,17 4194304000 161693707 /data/guests/node7.img

But trying to boot the guest again gave the same error. I then powered off all of the running guests (node1-5, 7, 8) and tried to boot node6. It failed again with the same error. I was a bit puzzled, as there was no zombie instance listed by 'xm list' and no zombie guest ID listed by XenStore when running 'xenstore-list backend/vbd'. Weirdly enough, previous experience says that with no guests running, backend/vbd shouldn't be present at all, but it was. The only thing I could think of was a “ghost” zombie instance keeping the disk image busy, since node6 had become a zombie at some point during my attempt to start it with the wrong UUID. My last try before rebooting the whole system was to remove the whole backend/vbd:

# xenstore-rm backend/vbd

Once I did, I restarted the xend daemon and tried to start node6 once again. It worked! I then booted the rest of the guests, and every one of them was happy.

I still can't determine the exact cause, but I guess it was the “ghost” zombie guest left over from the failed startup attempt with the wrong UUID while another guest instance was running with the same UUID.

Note: It’s really funny how “kills”, “zombies”, “daemons” and “ghosts” go along with computers 😛

Mounting partitions within image file

kpartx is a tool which creates device maps from partition tables. Assume you have a .img disk image which contains several partitions and you want to mount one or more of them. kpartx will map each partition to a device node under /dev/mapper, backed by a loop device, and will allow you to mount it on a local directory and access the files.
Mapping an image:

# kpartx -a /foo/bar/disk.img

Checking which loop device was used to map the partitions:

# kpartx -l /foo/bar/disk.img
loop1p1 : 0 8177022 /dev/loop1 63

The disk image has one partition, mapped to /dev/mapper/loop1p1. You can now mount it:

# mount /dev/mapper/loop1p1 /mnt/img

Once finished, you can unmount the partition and remove the mapper device:

# kpartx -d /foo/bar/disk.img
loop deleted : /dev/loop1

Basic AutoFS configuration

AutoFS is an automounter of storage devices for Linux and UNIX operating systems. An automounter enables the user to mount a directory only when it is needed, e.g. when it gets accessed. After some period of inactivity, the filesystem is unmounted again. The automounter's main file is /etc/auto.master (sometimes auto_master, mainly on Solaris).

/misc	        /etc/auto.misc
/net		-hosts
+auto.master

The interesting parts are the first two entries. The first field of an entry specifies the root directory which autofs will use for the mount points; the second specifies which map provides the mount point information. Look at the second entry, for instance: the /net directory will be used as the root directory, and the built-in -hosts map pulls the host list from /etc/hosts. That means autofs will make available, under /net, all the NFS exports of all the hosts specified in /etc/hosts.

Let's say now that you want to mount, when needed, specific NFS exports from your file server, and that you want them under /mount/nfs. First, you'll need to create the file that will contain the mount point information; it could be /etc/nfstab or whatever you like. You specify the entries in the following easily understandable format:

music	-rw			192.168.1.10:/exports/music
photos	-ro			192.168.1.10:/exports/photos
apps    -rw,nosuid 	        192.168.1.10:/exports/apps

You can omit the options entirely or specify as many as you need; the options that apply to 'mount' apply to AutoFS as well. Once you have created the map file, you need to add an entry for it to /etc/auto.master, which should then look like this:

/misc			/etc/auto.misc
/net			-hosts
/mount/nfs	 	/etc/nfstab
+auto.master

You can also use the automounter to mount non-network filesystems.
The next step is to restart the autofs daemon. Having done so, you should be able to access the three shares. Note that they may not be displayed under the directory until you try to access them.

Having a look in /etc/auto.misc gives a few examples:

cd      -fstype=iso9660,ro,nosuid,nodev	:/dev/cdrom

This will mount the CD/DVD drive under /misc/cd when the user, or a service, tries to access it.