====== Lustre filesystem over ZFS ======

==== Introduction ====

This article describes an upgrade of the following [[http://myitnotes.info/doku.php?id=en:jobs:lustrefs|cluster]], with a few changes.\\
Everything below applies to [[https://wiki.hpdd.intel.com/display/PUB/Lustre+Releases|Lustre 2.5.3]].\\

==== Scheme and equipment configuration ====

{{:ru:jobs:lustre-cluster-ex.png?600|300}}

**The following scheme was used:**\\
One MGS/MDS/OSS server and five OSS servers.\\

**Configuration of the MGS/MDS/OSS server:**\\
CPU: 2x Intel Xeon 56xx, 2.4 GHz\\
RAM: 72 GB*\\
Net: 6x 1 Gbit/s\\
SSD: 2x 120 GB\\
HDD: RAID6+HS, 24x 3 TB disks\\
*The larger amount of memory is needed because this node also serves SMB and NFS exports.

**Configuration of an OSS server:**\\
CPU: 2x Intel Xeon 56xx, 2.4 GHz\\
RAM: 12 GB\\
Net: 4x 1 Gbit/s\\
HDD: Adaptec RAID6+HS, 24x 3 TB disks\\

**Network:**\\
All servers are in a single VLAN (there is no separate backend or frontend network).

**OS on all servers:** CentOS 6.5

==== Preparing and tuning ====

A fair question is where the SSDs go if the chassis has only 24 hot-swap bays. The answer: they are connected to the motherboard and placed inside the server (there was free space). Our production requirements allow powering the hardware off for 10 minutes; if your requirements are stricter, use only hot-swap drives, and if they include 24/7 operation, use a fault-tolerant solution.

  * Install CentOS 6.5
  * Update the system and install packages:
<code bash>
yum --exclude=kernel* update -y
yum localinstall --nogpgcheck https://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
yum localinstall --nogpgcheck http://archive.zfsonlinux.org/epel/zfs-release.el6.noarch.rpm
yum install zfs strace sysstat man wget net-snmp openssh-clients ntp ntpdate tuned
</code>
Check that the zfs module was compiled.
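One way to verify this (a minimal sketch; it assumes the DKMS-built packages from the zfsonlinux repository configured above):
<code bash>
# confirm the spl/zfs modules were built for the running kernel (DKMS-based packages)
dkms status | grep -E 'spl|zfs'
# load the module and confirm it is present
modprobe zfs
lsmod | grep zfs
# show the module version (should report 0.6.x)
modinfo zfs | grep -i ^version
</code>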
(Lustre 2.5.3 is compatible with ZFS 0.6.3.)\\

  * Create a bond interface on the MGS/MDS/OSS and every OSS server:
<code>
bond0 BONDING_OPTS="miimon=100 mode=0"
</code>
  * Disable SELinux.
  * Install the following packages:
<code bash>
yum install mc openssh-clients openssh-server net-snmp man sysstat rsync htop trafshow bind-utils ntp
</code>
  * Configure ntp.
  * Use identical uid:gid mappings on all servers.
  * Set the scheduler profile: tuned-adm profile latency-performance
  * Tune sysctl.conf:
<code>
# increase Linux TCP buffer limits
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
# increase default and maximum Linux TCP buffer sizes
net.ipv4.tcp_rmem = 4096 262144 8388608
net.ipv4.tcp_wmem = 4096 262144 8388608
# increase max backlog to avoid dropped packets
net.core.netdev_max_backlog = 2500
net.ipv4.tcp_mem = 8388608 8388608 8388608
net.ipv4.tcp_ecn = 0
</code>

==== Installing Lustre ====

**For the servers:**

Download the utils:
<code bash>
wget -r https://downloads.hpdd.intel.com/public/e2fsprogs/1.42.9.wc1/el6/RPMS/x86_64/
</code>
and Lustre:
<code bash>
wget -r https://downloads.hpdd.intel.com/public/lustre/lustre-2.5.3/el6/server/RPMS/x86_64/
</code>

Install the utils. First remove the old ones:
<code bash>
rpm -e --nodeps e2fsprogs e2fsprogs-libs libcom_err libss
</code>
then install the new ones:
<code bash>
rpm -ivh libcom_err-1.42.9.wc1-7.el6.x86_64.rpm
rpm -ivh e2fsprogs-libs-1.42.9.wc1-7.el6.x86_64.rpm
rpm -ivh e2fsprogs-1.42.9.wc1-7.el6.x86_64.rpm
</code>

Install Lustre:
<code bash>
rpm -ivh --force kernel-2.6.32-431.23.3.el6_lustre.x86_64.rpm
rpm -ivh lustre-modules-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64.x86_64.rpm
rpm -ivh lustre-osd-zfs-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64.x86_64.rpm
rpm -ivh lustre-osd-ldiskfs-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64.x86_64.rpm
rpm -ivh lustre-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64.x86_64.rpm
</code>

Check that the Lustre kernel will be booted by default (see /boot/grub/grub.conf).\\

Configure LNET:
<code bash>
echo "options lnet networks=tcp0(bond0)" > /etc/modprobe.d/lustre.conf
</code>

Reboot the nodes:
<code bash>
reboot
</code>

**For the clients:**

Download and install the utils.\\
Update the kernel:
<code bash>
yum install -y kernel-2.6.32-431.23.3.el6
reboot
</code>
Download Lustre:
<code bash>
wget -r https://downloads.hpdd.intel.com/public/lustre/lustre-2.5.3/el6/client/RPMS/x86_64/
</code>
Install Lustre:
<code bash>
rpm -ivh lustre-client-modules-2.5.3-2.6.32_431.23.3.el6.x86_64.x86_64.rpm
rpm -ivh lustre-client-2.5.3-2.6.32_431.23.3.el6.x86_64.x86_64.rpm
</code>

==== Deploying Lustre ====

Deployment steps:\\
1. Create the MGS/MDS.\\
2. Create the OSS/OST.\\

**For MGS/MDS/OSS:**\\
Just in case:
<code bash>
ln -s /lib64/libzfs.so.2.0.0 libzfs.so.2
</code>
<code bash>
mkfs.lustre --reformat --mgs --backfstype=zfs --fsname=lustrerr rumrrlustre-mdt0msg/mgs mirror /dev/sdd /dev/sde
mkfs.lustre --mdt --backfstype=zfs --index=0 --fsname=lustrerr --mgsnode=192.168.5.182@tcp0 rumrrlustre-mdt0msg/mdt0
</code>

Create /etc/ldev.conf:
<code>
# example /etc/ldev.conf
#
# local  foreign/-  label  [md|zfs:]device-path  [journal-path]
#
ls-1 - MGS zfs:rumrrlustre-mdt0msg/mgs
ls-1 - lustrerr:MDT0000 zfs:rumrrlustre-mdt0msg/mdt0
ls-1 - lustrerr:OST0000 zfs:rumrrlustre-oss0/ost0
</code>
<code bash>
service lustre start MGS
service lustre start MDT0000
</code>

In case of problems, check LNET:\\
lctl list_nids - if there is no output, run\\
lctl network up

<code bash>
mkfs.lustre --ost --backfstype=zfs --index=0 --fsname=lustrerr --mgsnode=192.168.5.182@tcp0 rumrrlustre-oss0/ost0 /dev/ost-drive
</code>
/dev/ost-drive is the RAID6 volume, named by udev rules.
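The udev rule itself is not part of the original notes; a hypothetical /etc/udev/rules.d/99-ost-drive.rules entry that would create such a symlink might look like this (the ID_SERIAL value is a placeholder; take it from udevadm output on your own hardware):
<code>
# /etc/udev/rules.d/99-ost-drive.rules -- hypothetical example
# find the serial of the RAID volume with:
#   udevadm info --query=property --name=/dev/sdX | grep ID_SERIAL
KERNEL=="sd?", ENV{ID_SERIAL}=="PLACEHOLDER-RAID-SERIAL", SYMLINK+="ost-drive"
</code>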
Create the mount point and add the filesystem to /etc/fstab:
<code bash>
mkdir /lustre
</code>
/etc/fstab:
<code>
192.168.5.182@tcp0:/lustrerr /lustre lustre defaults,_netdev 0 0
</code>

**For the OSS servers:**
<code bash>
mkfs.lustre --ost --backfstype=zfs --index=N --fsname=lustrerr --mgsnode=192.168.5.182@tcp0 rumrrlustre-ossN/ost0 /dev/ost-drive
</code>
where N is the serial number of the server. Example:
<code bash>
mkfs.lustre --ost --backfstype=zfs --index=1 --fsname=lustrerr --mgsnode=192.168.5.182@tcp0 rumrrlustre-oss1/ost0 /dev/ost-drive
</code>
Create /etc/ldev.conf:
<code>
# example /etc/ldev.conf
#
# local  foreign/-  label  [md|zfs:]device-path  [journal-path]
#
ls-M - lustrerr:OST000N zfs:rumrrlustre-ossN/ost0
# where M = N+1
</code>

**For the clients:**
<code bash>
mkdir /lustre
</code>
/etc/fstab:
<code>
192.168.5.182@tcp0:/lustrerr /lustre lustre defaults,_netdev 0 0
</code>

On every server with the Lustre filesystem mounted:
<code>
lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
lustrerr-MDT0000_UUID      108.4G        2.1G      106.2G   2% /lustre[MDT:0]
lustrerr-OST0000_UUID       55.7T        6.7T       48.9T  12% /lustre[OST:0]
lustrerr-OST0001_UUID       55.7T        6.8T       48.9T  12% /lustre[OST:1]
lustrerr-OST0002_UUID       55.7T        6.8T       48.9T  12% /lustre[OST:2]
lustrerr-OST0003_UUID       55.7T        6.7T       48.9T  12% /lustre[OST:3]
lustrerr-OST0004_UUID       55.7T        6.9T       48.8T  12% /lustre[OST:4]
lustrerr-OST0005_UUID       55.7T        6.7T       48.9T  12% /lustre[OST:5]

filesystem summary:        334.0T       40.6T      293.4T  12% /lustre
</code>

==== Working with Lustre ====

The following tasks are covered: rebalancing data, deleting an OST, backup/restore, and restoring data from a snapshot.

1. Rebalancing data across OSTs after a new node was added

**Example (look at lustrerr-OST0005_UUID):**
<code>
lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
lustrerr-MDT0000_UUID      108.4G        2.1G      106.2G   2% /lustre[MDT:0]
lustrerr-OST0000_UUID       55.7T        6.7T       48.9T  12% /lustre[OST:0]
lustrerr-OST0001_UUID       55.7T        6.8T       48.9T  12% /lustre[OST:1]
lustrerr-OST0002_UUID       55.7T        6.8T       48.9T  12% /lustre[OST:2]
lustrerr-OST0003_UUID       55.7T        6.7T       48.9T  12% /lustre[OST:3]
lustrerr-OST0004_UUID       55.7T        6.9T       48.8T  12% /lustre[OST:4]
lustrerr-OST0005_UUID       55.7T       52.7T        5.0T  94% /lustre[OST:5]

filesystem summary:        334.0T       40.6T      293.4T  12% /lustre
</code>
Such an imbalance causes two problems:\\
1.1 Writing new data can fail because of a lack of free space on just one of the OSTs.\\
1.2 Increased I/O load on the new node.\\
Use the following algorithm to solve this:\\
  * Deactivate the OST (it becomes available read-only)
  * Move data to OSTs with more free space
  * Activate the OST again
Example:
<code bash>
lctl --device N deactivate
lfs find --ost {OST_UUID} -size +1G /lustre | lfs_migrate -y
lctl --device N activate
</code>

2. Deleting an OST

The same algorithm is used for this task:\\
  * Deactivate the OST (it becomes available read-only)
  * Move data to OSTs with more free space
  * Deactivate the OST permanently
<code bash>
lctl --device FS-OST0003_UUID deactivate                   # temporarily deactivate
lfs find --obd FS-OST0003_UUID /lustre | lfs_migrate -y    # migrate the data
lctl conf_param FS-OST0003_UUID.osc.active=0               # permanently deactivate
</code>
Result:
<code>
lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
lustrerr-MDT0000_UUID      108.4G        2.1G      106.2G   2% /lustre[MDT:0]
lustrerr-OST0000_UUID       55.7T        6.7T       48.9T  12% /lustre[OST:0]
lustrerr-OST0001_UUID       55.7T        6.8T       48.9T  12% /lustre[OST:1]
lustrerr-OST0002_UUID       55.7T        6.8T       48.9T  12% /lustre[OST:2]
lustrerr-OST0003_UUID     : inactive device
lustrerr-OST0004_UUID       55.7T        6.9T       48.8T  12% /lustre[OST:4]
lustrerr-OST0005_UUID       55.7T        6.7T       48.9T  12% /lustre[OST:5]
</code>

3. Backup and restore. Solved with ZFS snapshots; snapshots can be moved to different places.
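Before automating this with the script below, the mechanism can be tried by hand. A minimal sketch using the MDT dataset name from this article (the snapshot name manual-test is arbitrary):
<code bash>
# take a one-off snapshot of the MDT dataset and confirm it exists
zfs snapshot rumrrlustre-mdt0msg/mdt0@manual-test
zfs list -t snapshot | grep manual-test
# a snapshot can be serialized and stored (or shipped) anywhere
zfs send rumrrlustre-mdt0msg/mdt0@manual-test | gzip > /root/mdt0-manual-test.gz
# clean up the test snapshot
zfs destroy rumrrlustre-mdt0msg/mdt0@manual-test
</code>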
Example of an MDT backup (the OST lines are commented out).\\
vi /usr/local/bin/snapscript.sh
<code bash>
#!/bin/sh
currdate=`/bin/date +%Y-%m-%0e`
olddate=`/bin/date --date="21 days ago" +%Y-%m-%0e`
chk=`zfs list -t snapshot | grep $olddate`

# create snapshots of the MDT (and, if uncommented, OST) datasets
/sbin/zfs snapshot rumrrlustre-mdt0msg/mdt0@$currdate
#/sbin/zfs snapshot rumrrlustre-ossN/ost0@$currdate   # must be run on every OST node; can also be invoked remotely over ssh

# delete the 21-day-old snapshots (if they exist)
if [ -n "$chk" ]; then
    /sbin/zfs destroy rumrrlustre-mdt0msg/mdt0@$olddate
    #/sbin/zfs destroy rumrrlustre-ossN/ost0@$olddate  # for the OSTs
fi

# back up only the MDT
/sbin/zfs send -p rumrrlustre-mdt0msg/mdt0@$currdate | /bin/gzip > /root/meta-snap.gz
# the MDT and OSTs can also be backed up to a remote node, for example:
# zfs send -R rumrrlustre-mdt0msg/mdt0@$currdate | ssh some-node zfs receive rumrrlustre-mdt0msg/mdt0@$currdate
</code>

Restore from backup (example only for the MDT):
<code bash>
service lustre stop lustrerr:MDT0000
zfs rename rumrrlustre-mdt0msg/mdt0 rumrrlustre-mdt0msg/mdt0-old
gunzip -c /root/meta-snap.gz | zfs receive rumrrlustre-mdt0msg/mdt0
service lustre start lustrerr:MDT0000
</code>
Watch the logs:
<code>
tail -f /var/log/messages
Oct 14 14:12:44 ls-1 kernel: Lustre: lustrerr-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
Oct 14 14:13:08 ls-1 kernel: Lustre: 3937:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1413281557/real 1413281557] req@ffff880c60512c00 x1474855950917400/t0(0) o38->lustrerz-MDT0000-mdc-ffff880463edc000@0@lo:12/10 lens 400/544 e 0 to 1 dl 1413281588 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 14 14:13:08 ls-1 kernel: Lustre: 3937:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 71364 previous similar messages
Oct 14 14:13:08 ls-1 kernel: Lustre: lustrerr-MDT0000: Will be in recovery for at least 2:30, or until 1 client reconnects
Oct 14 14:13:08 ls-1 kernel: LustreError: 3937:0:(import.c:1000:ptlrpc_connect_interpret()) lustrerr-MDT0000_UUID went back in time (transno 55834576660 was previously committed, server now claims 55834576659)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646
Oct 14 14:13:08 ls-1 kernel: Lustre: lustrerr-MDT0000: Recovery over after 0:01, of 1 clients 1 recovered and 0 were evicted.
Oct 14 14:13:08 ls-1 kernel: Lustre: lustrerr-MDT0000-mdc-ffff880463edc000: Connection restored to lustrerz-MDT0000 (at 0@lo)
Oct 14 14:13:08 ls-1 kernel: Lustre: Skipped 1 previous similar message
Oct 14 14:13:08 ls-1 kernel: Lustre: lustrerr-OST0000: deleting orphan objects from 0x0:1571748 to 0x0:1571857
Oct 14 14:13:33 ls-1 kernel: LustreError: 167-0: lustrerz-MDT0000-lwp-OST0000: This client was evicted by lustrerz-MDT0000; in progress operations using this service will fail.
Oct 14 14:13:33 ls-1 kernel: Lustre: lustrerr-MDT0000-lwp-OST0000: Connection restored to lustrerz-MDT0000 (at 0@lo)
</code>

4. Restoring data from a snapshot. The same script as for backup/restore is used.\\
4.1.\\
vi /usr/local/bin/snapscript.sh
<code bash>
#!/bin/sh
currdate=`/bin/date +%Y-%m-%0e`
olddate=`/bin/date --date="21 days ago" +%Y-%m-%0e`
chk=`zfs list -t snapshot | grep $olddate`

# create snapshots of the MDT and OST datasets
/sbin/zfs snapshot rumrrlustre-mdt0msg/mdt0@$currdate
/sbin/zfs snapshot rumrrlustre-ossN/ost0@$currdate   # must be run on every OST node; can also be invoked remotely over ssh

# delete the 21-day-old snapshots (if they exist)
if [ -n "$chk" ]; then
    /sbin/zfs destroy rumrrlustre-mdt0msg/mdt0@$olddate
    /sbin/zfs destroy rumrrlustre-ossN/ost0@$olddate  # for the OSTs
fi
</code>
4.2.\\
For the MDT:
<code bash>
zfs clone -o canmount=off -o xattr=sa -o lustre:svname=lustrerr-MDT0000 -o lustre:mgsnode=192.168.5.182@tcp -o lustre:flags=1 -o lustre:fsname=lustrerr -o lustre:index=0 -o lustre:version=1 rumrrlustre-mdt0msg/mdt0@date rumrrlustre-mdt0msg/mdt00
</code>
Here date is the name of the snapshot to restore from; the same applies below.\\
4.3.\\
For an OST (N is the number of the OST):
<code bash>
zfs clone -o canmount=off -o xattr=sa -o lustre:svname=lustrerr-OST000N -o lustre:mgsnode=192.168.5.182@tcp -o lustre:flags=34 -o lustre:fsname=lustrerr -o lustre:index=N -o lustre:version=1 rumrrlustre-ossN/ost0@date rumrrlustre-ossN/ostN0
</code>
4.4.\\
Stop Lustre (on all nodes):
<code bash>
service lustre stop
</code>
4.5.\\
In /etc/ldev.conf (it must be edited on every server; below is an example for the first one):
<code>
ls-1.scanex.ru - lustrerr:MDT0000 zfs:rumrrlustre-mdt0msg/mdt00
ls-1.scanex.ru - lustrerr:OST000N zfs:rumrrlustre-ossN/ostN0
</code>
4.6.\\
Start Lustre (on all nodes):
<code bash>
service lustre start
</code>
4.7.\\
Copy the needed data to the chosen location (a local or remote path), then stop Lustre on all nodes:
<code bash>
service lustre stop
</code>
4.8.\\
Restore the initial /etc/ldev.conf and start Lustre:
<code bash>
service lustre start
</code>
4.9.\\
Copy the data from that location back to Lustre.\\
4.10.\\
Delete the ZFS clones with zfs destroy ([[http://docs.oracle.com/cd/E19253-01/819-5461/gammq/index.html|doc]]).

==== About author ====

[[https://www.linkedin.com/pub/alexey-vyrodov/59/976/16b|Profile]] of the author