====== Lustre filesystem over ZFS ======

==== Introduction ====

This article describes an upgrade of the following [[http://myitnotes.info/doku.php?id=en:jobs:lustrefs|cluster]], with a few changes.\\
Everything below applies to [[https://wiki.hpdd.intel.com/display/PUB/Lustre+Releases|Lustre 2.5.3]].\\

==== Scheme and equipment configuration ====

{{:ru:jobs:lustre-cluster-ex.png?600|300}}

**The following scheme was used:**\\
One MGS/MDS/OSS server and five OSS servers.\\

**Configuration of the MGS/MDS/OSS server:**\\
CPU: 2x Intel Xeon 56xx, 2.4 GHz\\
RAM: 72 GB*\\
Net: 6x 1 Gbit/s\\
SSD: 2x 120 GB\\
HDD: RAID6+HS, 24x 3 TB disks\\
*The larger amount of memory is needed because this node also serves SMB and NFS exports.

**Configuration of an OSS server:**\\
CPU: 2x Intel Xeon 56xx, 2.4 GHz\\
RAM: 12 GB\\
Net: 4x 1 Gbit/s\\
HDD: Adaptec RAID6+HS, 24x 3 TB disks\\

**Network:**\\
All servers are in a single VLAN (there is no separate backend or frontend network).

**OS on all servers:** CentOS 6.5

==== Preparing and tuning ====

A fair question is where the SSDs go if the chassis has only 24 hot-swap bays. The answer: they are connected to the motherboard and placed inside the server (there was free space). Our production requirements allow powering the hardware off for 10 minutes; if your requirements are stricter, use only hot-swap drives, and if they include 24/7 operation, use a fault-tolerant solution.

  * Install CentOS 6.5
  * Update the system and install packages:
<code bash>
yum --exclude=kernel* update -y
yum localinstall --nogpgcheck https://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
yum localinstall --nogpgcheck http://archive.zfsonlinux.org/epel/zfs-release.el6.noarch.rpm
yum install zfs strace sysstat man wget net-snmp openssh-clients ntp ntpdate tuned
</code>
Check that the zfs module was compiled.
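One way to verify this (a minimal sketch; it assumes the DKMS-built packages from the zfsonlinux repository configured above):
<code bash>
# confirm the spl/zfs modules were built for the running kernel (DKMS-based packages)
dkms status | grep -E 'spl|zfs'
# load the module and confirm it is present
modprobe zfs
lsmod | grep zfs
# show the module version (should report 0.6.x)
modinfo zfs | grep -i ^version
</code>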
(Lustre 2.5.3 is compatible with ZFS 0.6.3.)\\

  * Create a bond interface on the MGS/MDS/OSS and every OSS server:
<code>
bond0 BONDING_OPTS="miimon=100 mode=0"
</code>
  * Disable SELinux.
  * Install the following packages:
<code bash>
yum install mc openssh-clients openssh-server net-snmp man sysstat rsync htop trafshow bind-utils ntp
</code>
  * Configure ntp.
  * Use identical uid:gid mappings on all servers.
  * Set the scheduler profile: tuned-adm profile latency-performance
  * Tune sysctl.conf:
<code>
# increase Linux TCP buffer limits
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
# increase default and maximum Linux TCP buffer sizes
net.ipv4.tcp_rmem = 4096 262144 8388608
net.ipv4.tcp_wmem = 4096 262144 8388608
# increase max backlog to avoid dropped packets
net.core.netdev_max_backlog = 2500
net.ipv4.tcp_mem = 8388608 8388608 8388608
net.ipv4.tcp_ecn = 0
</code>

==== Installing Lustre ====

**For the servers:**

Download the utils:
<code bash>
wget -r https://downloads.hpdd.intel.com/public/e2fsprogs/1.42.9.wc1/el6/RPMS/x86_64/
</code>
and Lustre:
<code bash>
wget -r https://downloads.hpdd.intel.com/public/lustre/lustre-2.5.3/el6/server/RPMS/x86_64/
</code>

Install the utils. First remove the old ones:
<code bash>
rpm -e --nodeps e2fsprogs e2fsprogs-libs libcom_err libss
</code>
then install the new ones:
<code bash>
rpm -ivh libcom_err-1.42.9.wc1-7.el6.x86_64.rpm
rpm -ivh e2fsprogs-libs-1.42.9.wc1-7.el6.x86_64.rpm
rpm -ivh e2fsprogs-1.42.9.wc1-7.el6.x86_64.rpm
</code>

Install Lustre:
<code bash>
rpm -ivh --force kernel-2.6.32-431.23.3.el6_lustre.x86_64.rpm
rpm -ivh lustre-modules-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64.x86_64.rpm
rpm -ivh lustre-osd-zfs-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64.x86_64.rpm
rpm -ivh lustre-osd-ldiskfs-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64.x86_64.rpm
rpm -ivh lustre-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64.x86_64.rpm
</code>

Check that the Lustre kernel will be booted by default (see /boot/grub/grub.conf).\\

Configure LNET:
<code bash>
echo "options lnet networks=tcp0(bond0)" > /etc/modprobe.d/lustre.conf
</code>

Reboot the nodes:
<code bash>
reboot
</code>

**For the clients:**

Download and install the utils.\\
Update the kernel:
<code bash>
yum install -y kernel-2.6.32-431.23.3.el6
reboot
</code>
Download Lustre:
<code bash>
wget -r https://downloads.hpdd.intel.com/public/lustre/lustre-2.5.3/el6/client/RPMS/x86_64/
</code>
Install Lustre:
<code bash>
rpm -ivh lustre-client-modules-2.5.3-2.6.32_431.23.3.el6.x86_64.x86_64.rpm
rpm -ivh lustre-client-2.5.3-2.6.32_431.23.3.el6.x86_64.x86_64.rpm
</code>

==== Deploying Lustre ====

Deployment steps:\\
1. Create the MGS/MDS.\\
2. Create the OSS/OST.\\

**For MGS/MDS/OSS:**\\
Just in case:
<code bash>
ln -s /lib64/libzfs.so.2.0.0 libzfs.so.2
</code>
<code bash>
mkfs.lustre --reformat --mgs --backfstype=zfs --fsname=lustrerr rumrrlustre-mdt0msg/mgs mirror /dev/sdd /dev/sde
mkfs.lustre --mdt --backfstype=zfs --index=0 --fsname=lustrerr --mgsnode=192.168.5.182@tcp0 rumrrlustre-mdt0msg/mdt0
</code>

Create /etc/ldev.conf:
<code>
# example /etc/ldev.conf
#
# local  foreign/-  label  [md|zfs:]device-path  [journal-path]
#
ls-1 - MGS zfs:rumrrlustre-mdt0msg/mgs
ls-1 - lustrerr:MDT0000 zfs:rumrrlustre-mdt0msg/mdt0
ls-1 - lustrerr:OST0000 zfs:rumrrlustre-oss0/ost0
</code>
<code bash>
service lustre start MGS
service lustre start MDT0000
</code>

In case of problems, check LNET:\\
lctl list_nids - if there is no output, run\\
lctl network up

<code bash>
mkfs.lustre --ost --backfstype=zfs --index=0 --fsname=lustrerr --mgsnode=192.168.5.182@tcp0 rumrrlustre-oss0/ost0 /dev/ost-drive
</code>
/dev/ost-drive is the RAID6 volume, named by udev rules.
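The udev rule itself is not part of the original notes; a hypothetical /etc/udev/rules.d/99-ost-drive.rules entry that would create such a symlink might look like this (the ID_SERIAL value is a placeholder; take it from udevadm output on your own hardware):
<code>
# /etc/udev/rules.d/99-ost-drive.rules -- hypothetical example
# find the serial of the RAID volume with:
#   udevadm info --query=property --name=/dev/sdX | grep ID_SERIAL
KERNEL=="sd?", ENV{ID_SERIAL}=="PLACEHOLDER-RAID-SERIAL", SYMLINK+="ost-drive"
</code>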
Create the mount point and add the filesystem to /etc/fstab:
<code bash>
mkdir /lustre
</code>
/etc/fstab:
<code>
192.168.5.182@tcp0:/lustrerr /lustre lustre defaults,_netdev 0 0
</code>

**For the OSS servers:**
<code bash>
mkfs.lustre --ost --backfstype=zfs --index=N --fsname=lustrerr --mgsnode=192.168.5.182@tcp0 rumrrlustre-ossN/ost0 /dev/ost-drive
</code>
where N is the serial number of the server. Example:
<code bash>
mkfs.lustre --ost --backfstype=zfs --index=1 --fsname=lustrerr --mgsnode=192.168.5.182@tcp0 rumrrlustre-oss1/ost0 /dev/ost-drive
</code>
Create /etc/ldev.conf:
<code>
# example /etc/ldev.conf
#
# local  foreign/-  label  [md|zfs:]device-path  [journal-path]
#
ls-M - lustrerr:OST000N zfs:rumrrlustre-ossN/ost0
# where M = N+1
</code>

**For the clients:**
<code bash>
mkdir /lustre
</code>
/etc/fstab:
<code>
192.168.5.182@tcp0:/lustrerr /lustre lustre defaults,_netdev 0 0
</code>

On every server with the Lustre filesystem mounted:
<code>
lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
lustrerr-MDT0000_UUID      108.4G        2.1G      106.2G   2% /lustre[MDT:0]
lustrerr-OST0000_UUID       55.7T        6.7T       48.9T  12% /lustre[OST:0]
lustrerr-OST0001_UUID       55.7T        6.8T       48.9T  12% /lustre[OST:1]
lustrerr-OST0002_UUID       55.7T        6.8T       48.9T  12% /lustre[OST:2]
lustrerr-OST0003_UUID       55.7T        6.7T       48.9T  12% /lustre[OST:3]
lustrerr-OST0004_UUID       55.7T        6.9T       48.8T  12% /lustre[OST:4]
lustrerr-OST0005_UUID       55.7T        6.7T       48.9T  12% /lustre[OST:5]

filesystem summary:        334.0T       40.6T      293.4T  12% /lustre
</code>

==== Working with Lustre ====

The following tasks are covered: rebalancing data, deleting an OST, backup/restore, and restoring data from a snapshot.

1. Rebalancing data across OSTs after a new node was added

**Example (look at lustrerr-OST0005_UUID):**
<code>
lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
lustrerr-MDT0000_UUID      108.4G        2.1G      106.2G   2% /lustre[MDT:0]
lustrerr-OST0000_UUID       55.7T        6.7T       48.9T  12% /lustre[OST:0]
lustrerr-OST0001_UUID       55.7T        6.8T       48.9T  12% /lustre[OST:1]
lustrerr-OST0002_UUID       55.7T        6.8T       48.9T  12% /lustre[OST:2]
lustrerr-OST0003_UUID       55.7T        6.7T       48.9T  12% /lustre[OST:3]
lustrerr-OST0004_UUID       55.7T        6.9T       48.8T  12% /lustre[OST:4]
lustrerr-OST0005_UUID       55.7T       52.7T        5.0T  94% /lustre[OST:5]

filesystem summary:        334.0T       40.6T      293.4T  12% /lustre
</code>
Such an imbalance causes two problems:\\
1.1 Writing new data can fail because of a lack of free space on just one of the OSTs.\\
1.2 Increased I/O load on the new node.\\
Use the following algorithm to solve this:\\
  * Deactivate the OST (it becomes available read-only)
  * Move data to OSTs with more free space
  * Activate the OST again
Example:
<code bash>
lctl --device N deactivate
lfs find --ost {OST_UUID} -size +1G /lustre | lfs_migrate -y
lctl --device N activate
</code>

2. Deleting an OST

The same algorithm is used for this task:\\
  * Deactivate the OST (it becomes available read-only)
  * Move data to OSTs with more free space
  * Deactivate the OST permanently
<code bash>
lctl --device FS-OST0003_UUID deactivate                   # temporarily deactivate
lfs find --obd FS-OST0003_UUID /lustre | lfs_migrate -y    # migrate the data
lctl conf_param FS-OST0003_UUID.osc.active=0               # permanently deactivate
</code>
Result:
<code>
lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
lustrerr-MDT0000_UUID      108.4G        2.1G      106.2G   2% /lustre[MDT:0]
lustrerr-OST0000_UUID       55.7T        6.7T       48.9T  12% /lustre[OST:0]
lustrerr-OST0001_UUID       55.7T        6.8T       48.9T  12% /lustre[OST:1]
lustrerr-OST0002_UUID       55.7T        6.8T       48.9T  12% /lustre[OST:2]
lustrerr-OST0003_UUID     : inactive device
lustrerr-OST0004_UUID       55.7T        6.9T       48.8T  12% /lustre[OST:4]
lustrerr-OST0005_UUID       55.7T        6.7T       48.9T  12% /lustre[OST:5]
</code>

3. Backup and restore. Solved with ZFS snapshots; snapshots can be moved to different places.
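Before automating this with the script below, the mechanism can be tried by hand. A minimal sketch using the MDT dataset name from this article (the snapshot name manual-test is arbitrary):
<code bash>
# take a one-off snapshot of the MDT dataset and confirm it exists
zfs snapshot rumrrlustre-mdt0msg/mdt0@manual-test
zfs list -t snapshot | grep manual-test
# a snapshot can be serialized and stored (or shipped) anywhere
zfs send rumrrlustre-mdt0msg/mdt0@manual-test | gzip > /root/mdt0-manual-test.gz
# clean up the test snapshot
zfs destroy rumrrlustre-mdt0msg/mdt0@manual-test
</code>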
Example of an MDT backup (the OST lines are commented out).\\
vi /usr/local/bin/snapscript.sh
<code bash>
#!/bin/sh
currdate=`/bin/date +%Y-%m-%0e`
olddate=`/bin/date --date="21 days ago" +%Y-%m-%0e`
chk=`zfs list -t snapshot | grep $olddate`

# create snapshots of the MDT (and, if uncommented, OST) datasets
/sbin/zfs snapshot rumrrlustre-mdt0msg/mdt0@$currdate
#/sbin/zfs snapshot rumrrlustre-ossN/ost0@$currdate   # must be run on every OST node; can also be invoked remotely over ssh

# delete the 21-day-old snapshots (if they exist)
if [ -n "$chk" ]; then
    /sbin/zfs destroy rumrrlustre-mdt0msg/mdt0@$olddate
    #/sbin/zfs destroy rumrrlustre-ossN/ost0@$olddate  # for the OSTs
fi

# back up only the MDT
/sbin/zfs send -p rumrrlustre-mdt0msg/mdt0@$currdate | /bin/gzip > /root/meta-snap.gz
# the MDT and OSTs can also be backed up to a remote node, for example:
# zfs send -R rumrrlustre-mdt0msg/mdt0@$currdate | ssh some-node zfs receive rumrrlustre-mdt0msg/mdt0@$currdate
</code>

Restore from backup (example only for the MDT):
<code bash>
service lustre stop lustrerr:MDT0000
zfs rename rumrrlustre-mdt0msg/mdt0 rumrrlustre-mdt0msg/mdt0-old
gunzip -c /root/meta-snap.gz | zfs receive rumrrlustre-mdt0msg/mdt0
service lustre start lustrerr:MDT0000
</code>
Watch the logs:
<code>
tail -f /var/log/messages
Oct 14 14:12:44 ls-1 kernel: Lustre: lustrerr-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
Oct 14 14:13:08 ls-1 kernel: Lustre: 3937:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1413281557/real 1413281557] req@ffff880c60512c00 x1474855950917400/t0(0) o38->lustrerz-MDT0000-mdc-ffff880463edc000@0@lo:12/10 lens 400/544 e 0 to 1 dl 1413281588 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 14 14:13:08 ls-1 kernel: Lustre: 3937:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 71364 previous similar messages
Oct 14 14:13:08 ls-1 kernel: Lustre: lustrerr-MDT0000: Will be in recovery for at least 2:30, or until 1 client reconnects
Oct 14 14:13:08 ls-1 kernel: LustreError: 3937:0:(import.c:1000:ptlrpc_connect_interpret()) lustrerr-MDT0000_UUID went back in time (transno 55834576660 was previously committed, server now claims 55834576659)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646
Oct 14 14:13:08 ls-1 kernel: Lustre: lustrerr-MDT0000: Recovery over after 0:01, of 1 clients 1 recovered and 0 were evicted.
Oct 14 14:13:08 ls-1 kernel: Lustre: lustrerr-MDT0000-mdc-ffff880463edc000: Connection restored to lustrerz-MDT0000 (at 0@lo)
Oct 14 14:13:08 ls-1 kernel: Lustre: Skipped 1 previous similar message
Oct 14 14:13:08 ls-1 kernel: Lustre: lustrerr-OST0000: deleting orphan objects from 0x0:1571748 to 0x0:1571857
Oct 14 14:13:33 ls-1 kernel: LustreError: 167-0: lustrerz-MDT0000-lwp-OST0000: This client was evicted by lustrerz-MDT0000; in progress operations using this service will fail.
Oct 14 14:13:33 ls-1 kernel: Lustre: lustrerr-MDT0000-lwp-OST0000: Connection restored to lustrerz-MDT0000 (at 0@lo)
</code>

4. Restoring data from a snapshot. The same script as for backup/restore is used.\\
4.1.\\
vi /usr/local/bin/snapscript.sh
<code bash>
#!/bin/sh
currdate=`/bin/date +%Y-%m-%0e`
olddate=`/bin/date --date="21 days ago" +%Y-%m-%0e`
chk=`zfs list -t snapshot | grep $olddate`

# create snapshots of the MDT and OST datasets
/sbin/zfs snapshot rumrrlustre-mdt0msg/mdt0@$currdate
/sbin/zfs snapshot rumrrlustre-ossN/ost0@$currdate   # must be run on every OST node; can also be invoked remotely over ssh

# delete the 21-day-old snapshots (if they exist)
if [ -n "$chk" ]; then
    /sbin/zfs destroy rumrrlustre-mdt0msg/mdt0@$olddate
    /sbin/zfs destroy rumrrlustre-ossN/ost0@$olddate  # for the OSTs
fi
</code>
4.2.\\
For the MDT:
<code bash>
zfs clone -o canmount=off -o xattr=sa -o lustre:svname=lustrerr-MDT0000 -o lustre:mgsnode=192.168.5.182@tcp -o lustre:flags=1 -o lustre:fsname=lustrerr -o lustre:index=0 -o lustre:version=1 rumrrlustre-mdt0msg/mdt0@date rumrrlustre-mdt0msg/mdt00
</code>
Here date is the name of the snapshot to restore from; the same applies below.\\
4.3.\\
For an OST (N is the number of the OST):
<code bash>
zfs clone -o canmount=off -o xattr=sa -o lustre:svname=lustrerr-OST000N -o lustre:mgsnode=192.168.5.182@tcp -o lustre:flags=34 -o lustre:fsname=lustrerr -o lustre:index=N -o lustre:version=1 rumrrlustre-ossN/ost0@date rumrrlustre-ossN/ostN0
</code>
4.4.\\
Stop Lustre (on all nodes):
<code bash>
service lustre stop
</code>
4.5.\\
In /etc/ldev.conf (it must be edited on every server; below is an example for the first one):
<code>
ls-1.scanex.ru - lustrerr:MDT0000 zfs:rumrrlustre-mdt0msg/mdt00
ls-1.scanex.ru - lustrerr:OST000N zfs:rumrrlustre-ossN/ostN0
</code>
4.6.\\
Start Lustre (on all nodes):
<code bash>
service lustre start
</code>
4.7.\\
Copy the needed data to the chosen location (a local or remote path), then stop Lustre on all nodes:
<code bash>
service lustre stop
</code>
4.8.\\
Restore the initial /etc/ldev.conf and start Lustre:
<code bash>
service lustre start
</code>
4.9.\\
Copy the data from that location back to Lustre.\\
4.10.\\
Delete the ZFS clones with zfs destroy ([[http://docs.oracle.com/cd/E19253-01/819-5461/gammq/index.html|doc]]).

==== About author ====

[[https://www.linkedin.com/pub/alexey-vyrodov/59/976/16b|Profile]] of the author