Archive for January, 2009

RHCS, Xen, and GFS on RHEL 5.2

Monday, January 12th, 2009

I recently built a Red Hat Cluster with GFS and used the Xen kernel to create a more robust virtualization environment. Together, the three provide a shared storage platform and monitoring of guest operating systems. One benefit is that if a guest or a host dies, the cluster service restarts the guest(s) on another available host in the failover domain. While it is not as robust as VMware’s SRM and DRS tools, it certainly could be with just a little scripting. It is also more cost-effective.
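The host-selection step of that failover can be sketched as follows. This is a simplified model, not rgmanager’s code. It assumes an ordered failover domain in which each member has a priority (1 is most preferred) and the restricted/unrestricted distinction RHCS domains have. The function name, host names, and the alphabetical tie-break are illustrative assumptions.

```python
def pick_failover_host(domain, alive, restricted=True, all_hosts=()):
    """Choose where to restart a guest (simplified, hypothetical model).

    domain:     dict of host -> priority in an ordered failover domain
                (1 is the most preferred priority).
    alive:      set of hosts that are currently cluster members.
    restricted: if True, the guest may only run on domain members.
    all_hosts:  every cluster host; only consulted for unrestricted domains.
    Returns a host name, or None if no host is eligible.
    """
    # Prefer the lowest priority number among live domain members;
    # ties are broken alphabetically so the result is deterministic.
    candidates = sorted((prio, host) for host, prio in domain.items() if host in alive)
    if candidates:
        return candidates[0][1]
    if not restricted:
        # An unrestricted domain may fall back to any live cluster host.
        others = sorted(h for h in all_hosts if h in alive and h not in domain)
        if others:
            return others[0]
    return None
```

For example, with `{"node1": 1, "node2": 2}` and only `node2` alive, the guest would be restarted on `node2`.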

I’m not going to go into the details of how I built the cluster in this post, as a quick Google search turns up many guides. I just want to share the setup and the gotchas I ran into during the install, and provide a benchmark made with iozone.

The Setup

* 2 x Dell PowerEdge 2950, each with two quad-core 2.66GHz CPUs and 16GB RAM
* 2 x QLogic QLE2462 (PCI-Express to 4Gb FC)
* MPIO (device-mapper-multipath) instead of EMC PowerPath
* EMC CX3-80; RAID 6; SATA II; 14 disks; 1TB volume (not ideal, but what I had to use!). Write caching is enabled on the storage group.

The gotchas

1. Upgrading to the Xen kernel.

After upgrading the servers to the Xen kernel and rebooting, I realized the ramdisk didn’t include the required xenblk driver, so the system kept rebooting. I had to boot into the regular kernel and build a new ramdisk (as root) with the following commands.


KERNEL=2.6.18-92.el5xen
mkinitrd --with=xennet --preload=xenblk /boot/initrd-2.6.18-92.el5xenb.img "$KERNEL"

Then I changed /boot/grub/grub.conf to use that ramdisk, and the problem was solved.
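Before rebooting into the new ramdisk, it can be worth confirming that it really contains xenblk. The helper below is a hypothetical sketch, not part of mkinitrd. It assumes the initrd is a gzip-compressed cpio archive in the "newc" format, which is my understanding of what RHEL 5’s mkinitrd produces. It walks the cpio headers and lists member names.

```python
import gzip

NEWC_HEADER_LEN = 110  # 6-byte magic + 13 fields of 8 hex digits


def list_newc_cpio(data):
    """Return member names from an uncompressed newc cpio archive."""
    names = []
    pos = 0
    while pos + NEWC_HEADER_LEN <= len(data):
        header = data[pos:pos + NEWC_HEADER_LEN]
        if header[:6] not in (b"070701", b"070702"):
            raise ValueError(f"not a newc cpio header at offset {pos}")
        # Fields: ino, mode, uid, gid, nlink, mtime, filesize, devmajor,
        # devminor, rdevmajor, rdevminor, namesize, check.
        fields = [int(header[6 + i * 8:14 + i * 8], 16) for i in range(13)]
        filesize, namesize = fields[6], fields[11]
        name_start = pos + NEWC_HEADER_LEN
        # namesize includes the trailing NUL byte.
        name = data[name_start:name_start + namesize - 1].decode()
        pos = (name_start + namesize + 3) & ~3  # header+name padded to 4 bytes
        if name == "TRAILER!!!":
            break
        names.append(name)
        pos = (pos + filesize + 3) & ~3  # file data padded to 4 bytes
    return names


def initrd_has_module(path, module="xenblk.ko"):
    """True if the gzip-compressed initrd at path contains the module."""
    with gzip.open(path, "rb") as f:
        names = list_newc_cpio(f.read())
    return any(n == module or n.endswith("/" + module) for n in names)
```

On the host itself, `zcat /boot/initrd-2.6.18-92.el5xenb.img | cpio -t | grep xenblk` answers the same question.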

2. Recognizing the EMC CX3-80 with MPIO.

Be sure to view the multipath.conf.defaults file and copy the proper configuration into your multipath.conf:

view /usr/share/doc/device-mapper-multipath-0.4.7/multipath.conf.defaults

I added the following to /etc/multipath.conf:


devices {
    device {
        vendor                  "DGC"
        product                 ".*"
        product_blacklist       "LUN_Z"
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout            "/sbin/mpath_prio_emc /dev/%n"
        features                "1 queue_if_no_path"
        hardware_handler        "1 emc"
        path_grouping_policy    group_by_prio
        failback                immediate
        rr_weight               uniform
        no_path_retry           60
        rr_min_io               1000
        path_checker            emc_clariion
    }
}

After that, restart multipathing (service multipathd restart), and multipath -ll should show your disks.
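The `path_grouping_policy group_by_prio` and `failback immediate` lines decide which paths carry I/O. The following is a toy model of that choice, not device-mapper code. The device names and priority values are hypothetical; in practice the priorities come from the prio_callout (here `mpath_prio_emc`). The model omits round-robin within the active group (`rr_min_io`) and queueing when no path is left.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Path:
    dev: str    # block device for this path, e.g. "sdb" (hypothetical)
    prio: int   # priority as a prio callout would report it
    up: bool = True


def group_by_prio(paths):
    """Put paths with the same priority into one group, highest priority first."""
    ordered = sorted(paths, key=lambda p: p.prio, reverse=True)
    return [list(group) for _, group in groupby(ordered, key=lambda p: p.prio)]


def active_group(paths):
    """Return the usable paths of the group that would carry I/O.

    With 'failback immediate' this choice is re-made as soon as a path
    changes state, so I/O returns to the preferred group once it recovers.
    """
    for group in group_by_prio(paths):
        usable = [p.dev for p in group if p.up]
        if usable:
            return usable
    return []  # no usable path at all
```

For example, if the two high-priority paths fail, I/O moves to the lower-priority group. When either one comes back, `active_group` returns to it.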

3. Fencing

I was a little disappointed with how fencing works against Dell’s Remote Access Controller (DRAC). Instead of using SSH or HTTPS to reboot a fenced node, /sbin/fence_drac uses telnet by default. This is less than ideal for security, since telnet sends credentials in clear text, but it will have to work for now. It wouldn’t take long to rewrite the Perl script to use SSH, and the DRAC is already listening on port 22, but it’d be nice to have that included in the fenced RPM.
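As a starting point for the SSH rewrite, here is a hypothetical sketch of the command-building part only. Several assumptions need checking against your own fence_drac and DRAC firmware. First, the agent receives key=value lines on stdin, which is the convention RHCS fence agents follow. Second, the key names `ipaddr`, `login` and `action` apply. Third, the DRAC accepts `racadm serveraction` sub-commands over SSH. Authentication is left out: `BatchMode=yes` assumes key-based login to the DRAC works.

```python
# Hypothetical mapping from fence actions to racadm serveraction sub-commands.
RACADM_ACTIONS = {
    "reboot": "powercycle",
    "off": "powerdown",
    "on": "powerup",
    "status": "powerstatus",
}


def parse_fence_options(text):
    """Parse key=value lines, as a fence agent might read them from stdin."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments and malformed lines
        key, value = line.split("=", 1)
        opts[key.strip()] = value.strip()
    return opts


def build_ssh_command(opts):
    """Build an ssh argv asking the DRAC to perform the fence action."""
    action = opts.get("action", "reboot")
    if action not in RACADM_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    target = f"{opts['login']}@{opts['ipaddr']}"
    remote = f"racadm serveraction {RACADM_ACTIONS[action]}"
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # so the agent cannot hang waiting for input.
    return ["ssh", "-o", "BatchMode=yes", target, remote]
```

The resulting list could then be run with `subprocess.run(cmd, check=True)`. A non-zero exit would signal a failed fence.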

Benchmark of GFS Version 1 on Xen host

I performed some basic tuning on the file system (gfs_tool gettune, noatime, noquota, glock_purge, etc.) and then benchmarked it with iozone. The results are below. Note that the x-axis is file size (kB), the y-axis is throughput (kB/sec), and the z-axis is record size (kB).
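The best file size and record size can also be read off iozone’s text output instead of the charts. The sketch below assumes the layout of `iozone -a` output: each data row holds only integers, starting with file size (kB) and record length (kB), followed by throughput columns in kB/sec. Column order depends on the iozone version and which tests ran, so check the header before choosing `column`.

```python
def parse_iozone_rows(lines):
    """Return (file_kb, reclen_kb, [throughputs]) for each numeric data row."""
    rows = []
    for line in lines:
        fields = line.split()
        # Header lines, blank lines and option summaries contain non-digits.
        if len(fields) < 3 or not all(f.isdigit() for f in fields):
            continue
        nums = [int(f) for f in fields]
        rows.append((nums[0], nums[1], nums[2:]))
    return rows


def best_cell(rows, column=0):
    """Return (file_kb, reclen_kb, kB/sec) with the highest value in one column.

    column 0 is the first throughput column after file size and record length.
    """
    candidates = [(vals[column], f, r) for f, r, vals in rows if len(vals) > column]
    if not candidates:
        raise ValueError("no rows contain that column")
    value, file_kb, reclen_kb = max(candidates)
    return file_kb, reclen_kb, value
```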

1. The writer report shows the best performance with 4MB files and record sizes between 256KB and 1024KB.

[Figure: iozone writer report; x-axis is file size, y-axis is record size]

2. The re-writer report shows the positive effect of caching: performance is very high, even for large file sizes.

[Figures: iozone re-writer report]

3. The reader report.

[Figures: iozone reader report]

4. The re-reader report.

[Figures: iozone re-reader report]

5. Random read report.

[Figures: iozone random read report]

6. Random write report.

[Figures: iozone random write report]

Benchmark of GFS Version 1 from within the guest

Having benchmarked the file system from the Xen host, I thought it would be interesting to run the same benchmark from within the guest OS to see how the Xen hypervisor affects I/O. Again, the x-axis is file size (kB), the y-axis is throughput (kB/sec), and the z-axis is record size (kB).
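To put a number on the hypervisor’s effect instead of comparing charts by eye, the host and guest runs can be compared cell by cell. This is a minimal sketch. It assumes both runs have already been parsed into dicts keyed by (file size kB, record size kB) with kB/sec values; the function name is illustrative.

```python
def percent_change(host, guest):
    """Per-cell change of guest throughput relative to the host, in percent.

    host, guest: dicts mapping (file_kb, reclen_kb) -> kB/sec.
    Only cells present in both runs are compared; a negative value means
    the guest was slower than the host for that file/record size.
    """
    result = {}
    for key in sorted(host.keys() & guest.keys()):
        if host[key] == 0:
            continue  # avoid dividing by zero on empty cells
        result[key] = round(100.0 * (guest[key] - host[key]) / host[key], 1)
    return result
```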

1. Guest Writer Report

[Figures: iozone guest writer report]

Hadoop

Friday, January 9th, 2009

The Hadoop Distributed File System (HDFS) opens up some very interesting possibilities for organizations that want to reduce storage costs and processing time. Instead of a formal data warehouse or n-tier architecture, which suffers from the typical bottlenecks, HDFS uses a master/worker architecture: a NameNode master and DataNode workers. Besides letting organizations run a file system across commodity hardware, Hadoop provides the ability to run MapReduce jobs over the cluster. Of course, the best part is that it was named after the creator’s child’s stuffed elephant and is released under the Apache license. :)
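In that master/worker split, the NameNode keeps metadata about which blocks make up a file and which DataNodes hold each block. The DataNodes store the blocks themselves. The following toy model shows that bookkeeping using the defaults of the time: a 64 MB block size and a replication factor of 3. The round-robin placement is my simplification; real HDFS placement is rack-aware.

```python
from itertools import cycle, islice

BLOCK_SIZE = 64 * 1024 * 1024  # HDFS default block size (dfs.block.size) at the time
REPLICATION = 3                 # HDFS default replication factor (dfs.replication)


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Sizes of the blocks a file of file_size bytes is split into."""
    if file_size <= 0:
        return []
    count = (file_size + block_size - 1) // block_size  # ceiling division
    last = file_size - block_size * (count - 1)
    return [block_size] * (count - 1) + [last]


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy NameNode metadata: block index -> DataNodes holding a replica.

    Round-robin placement is a simplification for illustration only.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the replication factor")
    nodes = cycle(datanodes)
    # Consecutive picks from the cycle are distinct while
    # replication <= len(datanodes), so no node holds two copies of a block.
    return {block: list(islice(nodes, replication)) for block in range(num_blocks)}
```

For example, a 200 MB file becomes three 64 MB blocks plus one 8 MB block, each stored on three different DataNodes.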

[Figure: HDFS Architecture]

I was impressed with some of the included example jars, which could index the string data in several ASCII text files in just seconds. I used Michael Noll’s Running Hadoop on Ubuntu Linux as a quickstart guide. He even provides links to examples of a MapReduce job written in Python.
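Python MapReduce jobs like those typically use the Hadoop Streaming model. The mapper and reducer are ordinary programs that read lines on stdin and write tab-separated key/value lines on stdout. Hadoop sorts the mapper output by key before the reducer sees it. Here is a minimal word-count sketch in that style; the file name `wc.py` is hypothetical.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts for each word.

    Relies on the input being sorted by key, which Hadoop's shuffle/sort
    provides for a streaming reducer (and `sort` provides in a local test).
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate map -> sort (shuffle) -> reduce in a single process."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] in ("map", "reduce"):
    # Local pipeline test:
    #   cat input.txt | python wc.py map | sort | python wc.py reduce
    stage = mapper if sys.argv[1] == "map" else reducer
    for out in stage(sys.stdin):
        print(out)
```

Keeping the mapper and reducer as generators over lines makes them easy to test locally with `run_locally` before submitting a job to the cluster.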

Some practical uses are running analytics on large data sets (think credit transactions, bank statements, or even internal log data). The diagram below depicts ASCII data being dumped into an HDFS tier from a database tier. From there, MapReduce jobs can be run against the HDFS tier, with faster results than a typical data warehouse.
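The analytics case in the diagram follows the same map, shuffle, reduce pattern. The sketch below runs all three phases in a single process to show the data flow. It assumes the database tier dumps transactions as CSV lines with a hypothetical `account_id,date,amount` layout, and it totals the amount per account. Decimal avoids floating-point rounding on currency.

```python
import csv
from collections import defaultdict
from decimal import Decimal


def map_transaction(row):
    """Map one CSV row (account_id, date, amount) to an (account, amount) pair."""
    account_id, _date, amount = row
    return account_id, Decimal(amount)


def shuffle(pairs):
    """Group mapped values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_total(key, values):
    """Reduce all amounts for one account to a single total."""
    return key, sum(values, Decimal("0"))


def totals_per_account(lines):
    rows = csv.reader(lines)
    grouped = shuffle(map_transaction(row) for row in rows)
    return dict(reduce_total(k, v) for k, v in sorted(grouped.items()))
```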

[Figure: Using HDFS and MapReduce for Analytics]

I hope to build a 25-node cluster from a large set of unused desktops shortly. Even if I do, Yahoo’s latest HDFS cluster will have me beat by about 3,975 nodes. Check out the largest cluster here.
