GEC3 VM migration evaluation
From OpenFlow Wiki
Current State
- Trying to migrate jgn-vm works fine, but the Ubuntu VM gets intermittent timeouts after (exactly) 3 minutes of waiting at 10% progress
- Tried setting the advanced option Migrate.NetTimeout to 2000 (from 30); it had no effect
- Tried to change the VM infrastructure client to something higher, no effect
- on mvm-esx1, the command `esxcfg-vmknic -l` was *hanging*
- ran `/etc/init.d/vmware-vpsa restart`
- Had to turn off "Expose NX to Guest" on "Ubuntu" VM to get it to migrate to Japan!
Previous State
- Migrating a fully running VM takes MUCH longer than a non-running VM
- worse, it takes different amounts of time in different directions
- from stanford to japan: ~4 minutes (!!)
- from japan to stanford: ~2.5 minutes
- hypothesis: if vm destination is closer to nfs server, lower latency makes it go faster
- the bandwidth limited portion of the migration appears to take up 60% of the time
- still main focus of investigation
Statistics for migration within JGN
- We consistently observe the following behavior for a 32MB VM, with NFS located in JGN:
- NEC1->NEC2: 10% takes 60 secs, remaining takes 10 secs
- NEC2->NEC1: 10% takes 22 secs, remaining takes 9 secs
- Interruption time:
- NEC1->NEC2: Between 1-3 secs (for ping), No visible interruption for game
- NEC2->NEC1: Between 1-3 secs (for ping), No visible interruption for game
Outstanding Weirdness
- use the tcpdump filter 'ether[61] == 0x01' to dump only the ICMP traffic across the tunnel
- Conclusion: there is no direct path between nec-esx1 and nec-esx2
- on nec-esx{1,2} you cannot ping their *local* vmkernel interface, use vmkping
- there is no direct bridge between the two kernels
- 32MB is TOO small for the game server; game server wants 24 MB and if you turn off the swap (which I did) then it will not run
Resolved Weirdness
- Packet Duplication -- SOLVED: in Japan, nf2c0 and nf2c2 were on the same subnet!
- Pinging from my laptop at stanford to any address in japan gets no packet duplication (x1)
- pinging from esx{1,2} to nec-esx{1,2} gets an extra packet (x2)
- pinging from nec-esx1 to nec-esx2 gets x4 packet duplication and 2x the RTT from here to Japan
- Packet duplication seems to happen only in the Japan->Stanford direction
- look at a tcpdump of pinging from 171.67.74.59 to 171.67.74.70 on both sides
- Stanford --> Japan: no duplication
- Japan ---> Stanford: x2 duplication
Old State
- VM migration works fine for the 32MB Ubuntu VM. Takes 40 secs. The TCP throughput is still limited.
- Added `esxcfg-firewall -e nfsClient` for a 33% performance gain (more?)
- Are not able to replicate the 60s xfer times that we got previously, with nfsClient disabled; get 2 minutes instead
- Get 40s xfers from esx3->esx2, esx3->esx1, and esx2->esx3, BUT esx1--> esx3 takes ~1:40 !!
- Working on improving the throughput
- Now suspect send buffer bottleneck instead of TCP window bottleneck
- reason 1: we hacked a wscale=7 change into the tunnel during the initial handshake, so the advertised window is now 2^7 times bigger
- reason 2: the last packet between each delay has the PSH flag set, which is an OS level cue that this packet is at the end of a write()
- reason 3: actually forced all packets to windowsize of 64K; still no effect
- result: need to find a way to increase send buffer size and tune NFS
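The window-scale reasoning above can be checked with a little arithmetic. This is a minimal sketch, assuming standard TCP window scaling (the 16-bit window field is shifted left by the scale factor) and one full window sent per round trip. The 64K window and wscale=7 come from the notes; the 120 ms round-trip time is only an illustrative assumption, not a measured value.

```python
def effective_window(raw_window: int, wscale: int) -> int:
    """Window in bytes after applying the TCP window-scale shift."""
    return raw_window << wscale

def max_throughput(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on throughput (bytes/s) when one window is sent per RTT."""
    return window_bytes / rtt_s

RAW_WINDOW = 65535   # largest 16-bit window field value (the "64K" in the notes)
WSCALE = 7           # scale factor hacked into the handshake
RTT_S = 0.120        # ASSUMPTION: illustrative round-trip time in seconds

unscaled = max_throughput(RAW_WINDOW, RTT_S)
scaled = max_throughput(effective_window(RAW_WINDOW, WSCALE), RTT_S)

print(f"ceiling without scaling: {unscaled / 1e6:.2f} MB/s")
print(f"ceiling with wscale={WSCALE}: {scaled / 1e6:.2f} MB/s")
print(f"ratio: {scaled / unscaled:.0f}x")
```

If the receive window were the limit, a 128x higher ceiling should have changed the transfer time noticeably; seeing no change is what points the suspicion at the send buffer instead.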
NFS Tuning Thoughts
- `esxcfg-info -a` from the esx console will list all config options
- I find adding `-F perl` makes it easier to parse
- netapp config guide might be useful
- Ran `esxcfg-firewall -e nfsClient` ; wasn't on.. might do something but can't test because the cable went away again.. will try tomorrow
DATA : for LinuxOct7 VM
- found out this has no OS, so is faster :-(
- migrate LinuxOct7: 3->1: 1 min 13 sec
- migrate LinuxOct7: 1->3: 2 min 59 sec ; much slower than other dir
- Added
- esxcfg-advcfg -s 32 /NFS/MaxVolumes
- NFS.MaxVolumes --> 32
- NFS.ReceiveBufferSize -->264
- migrate LinuxOct7: 3->2: 1 min 5 sec
- migrate LinuxOct7: 2->3: 1 min 59 sec ; success!? faster!
- migrate LinuxOct7: 3->1: 1 min 0 sec
- migrate LinuxOct7: 1->3: 1 min 53 sec ; replicated!
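To put the before/after numbers for LinuxOct7 side by side, here is a small sketch that converts the times above to seconds and computes the reduction per direction. It only uses the 3->1 and 1->3 measurements listed above, and it assumes the NFS settings were the only thing that changed between the two runs.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a 'M min S sec' reading into seconds."""
    return minutes * 60 + seconds

def reduction_pct(before: float, after: float) -> float:
    """Percentage by which the migration time dropped."""
    return 100.0 * (before - after) / before

# (before tuning, after tuning), copied from the notes above
runs = {
    "3->1": (to_seconds(1, 13), to_seconds(1, 0)),
    "1->3": (to_seconds(2, 59), to_seconds(1, 53)),
}

for direction, (before, after) in runs.items():
    pct = reduction_pct(before, after)
    print(f"{direction}: {before}s -> {after}s ({pct:.0f}% less)")

# The direction asymmetry remains after tuning.
asymmetry = runs["1->3"][1] / runs["3->1"][1]
print(f"1->3 still takes {asymmetry:.2f}x as long as 3->1")
```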
DATA: for "Small Debian (On Doemail)" VM
- has running OS, takes longer
- migrate: 1->3 : 4 min 2 sec ; !!!
- migrate: 3->1 : 2 min 56 sec
- set NFS.DisableLock to 1 (yes, disable locking)
- migrate: 1->3 : 4 min 11 sec; took longer, but within noise(?)
- migrate: 3->1 : 2 min 46 sec; slightly faster; in the noise
- conclusion: nfs locking doesn't matter :-(
DATA: for "Small Debian (On nf-test3)" VM
- might take a different amount of time; diff NFS tuning params
- migrate 2->3 : 4 min 1 sec ; doesn't look different
- migrate 3->2 : 2 min 45 sec ; doesn't look different
VMware ESX issues
During VMotion, we ran into a whole horde of ESX issues. The following are the ways we resolved them.
- License issue: When you get error "There are not enough licenses installed to perform this operation", you need to open the License Features in the Configuration tab of the ESX server. In that dialog, add the new license server (mvm-vc.stanford.edu). Then change "Unlicensed" to "Licensed" so as to enable the Add-ons.
- Swap file: The migration used to proceed to 95% and then exit with error "Failed to open the swap file". The reason was that the swap file used by Source was different than the one expected by Dest.
- The swap file name is typically derived from the name of the datastore. So, we had to delete and re-add each of the datastores.
- VMware recommends that you use the IP address of the NFS server when adding storage, instead of the FQDN.
- Note that the swapfile must always be located at the same place as the VM files, for easier migration.
- HA agent: The HA agent kept failing quite often and prevented migration. We disabled it through the Cluster settings. Also, fixing hostname and dns on esx1 (in Japan) is required.
- CPU mismatch: The two ESX hosts had different CPU types (Xeon vs Core2). We had to tell the ESX hosts not to care that they do not have the same type of CPU.
- In CPUMask of Guest OS Options, add "---- ---- ---- 00-- ---- ---- ---- ----" to ecx1 Level 1 field.
- Must hit "DOWN" not "RETURN" when entering number (WTF!?)
Network Tunnel issues
- MTU problem: Some links in the tunnel have a low MTU, and the tunnel header needs room as well, so we had to use an MTU of 1400. Changing the MTU at the end hosts causes NFS issues because we have no access to the NFS server. We resolved this with MTU frobbing in the capsulator code.
- Setup capsulator
- grab a *current* copy of capsulator from yuba
- git clone yuba.stanford.edu:/usr/local/git/capsulator
- make sure it has the new -w and -m options
- setup the capsulator to frob the window scaling and mss values on decapsulation
- from nf-test5@stanford to netfpga3@nec: `./capsulator -t nf2c0 -f 60.45.91.58 -b nf2c1:1 -b nf2c2:2 -b nf2c3:3 -w 7 -m 1400`
- from netfpga3@nec to nf-test5@stanford: `./capsulator -t nf2c0 -f 171.67.74.58 -b nf2c1:1 -b nf2c2:2 -b nf2c3:3 -w 7 -m 1400`
- connect "tunnel" to bottom port (nf2c0)
- connect vmkernel to 2nd from bottom (nf2c1)
- connect console to 3rd from bottom (nf2c2)
MTU Frobbing Commands
To make the packets small enough for the software tunnel, we set the MTU of all devices to 1400 bytes. A more optimal value might be 1416, but that depends on the tunnel software we use, so we decided to be conservative. The voodoo for changing the MTU varies by device:
- ESX: you can't change the virtual devices, but you can change the vswitches
- (from esx command line) esxcfg-vswitch -m 1400 vswitch0 (and repeat for vswitch1 and vswitch2)
- IMPORTANT: *don't* do this... it breaks the ESX internally
- saw evidence of locally dropping packets in esx2:/var/log/vmk*
- also was unhappy with the tcp offloading thinking mss=1448 and mtu=1400
- For mvm-kvm (the VMware Infrastructure Client machine)
- run regedit, HKEY_LOCAL_MACHINE -> SYSTEM -> CurrentControlSet -> Services -> Tcpip -> Parameters -> Interfaces -> $INTERFACE -> add an "MTU" DWORD and set it to 1400
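The MSS/MTU mismatch noted above is easier to see with the header arithmetic written out. This is a minimal sketch assuming IPv4 and TCP headers without options (20 bytes each) and a 12-byte TCP timestamp option; whether timestamps were actually enabled on these hosts is an assumption, and the tunnel header is not modeled because its size is not given in these notes.

```python
IPV4_HEADER = 20      # bytes, no IP options
TCP_HEADER = 20       # bytes, no TCP options
TCP_TIMESTAMPS = 12   # bytes, timestamp option padded to a 32-bit boundary

def mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload that fits in one IP packet of size mtu."""
    return mtu - IPV4_HEADER - TCP_HEADER

def payload_with_timestamps(mtu: int) -> int:
    """Payload per segment when the timestamp option is carried."""
    return mss_for_mtu(mtu) - TCP_TIMESTAMPS

def ip_packet_size(payload: int, timestamps: bool = True) -> int:
    """Size of the IP packet that carries the given TCP payload."""
    options = TCP_TIMESTAMPS if timestamps else 0
    return payload + IPV4_HEADER + TCP_HEADER + options

for mtu in (1500, 1400):
    print(f"MTU {mtu}: MSS {mss_for_mtu(mtu)}, "
          f"payload with timestamps {payload_with_timestamps(mtu)}")

# A sender that still believes in 1448-byte payloads overshoots a 1400 MTU.
packet = ip_packet_size(1448)
print(f"1448-byte payload -> {packet}-byte packet, {packet - 1400} over MTU 1400")
```

Under these assumptions, 1448 is exactly the per-segment payload for a 1500-byte MTU, which would explain why a stack that kept that value after the vswitch MTU was lowered ended up dropping packets locally.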
Emulating Delay using WANem
| Bridging Delay (ms) | Only Console: ESX2->ESX1 (secs) | Only Console: Reverse (secs) | Only VMKernel: ESX2->ESX1 (secs) | Only VMKernel: Reverse (secs) |
| 0ms | 25 | 29 | 144 | 151 |
| 25ms | 29 | 35 | 514 | 548 |
| 50ms | 31 | 34 | 982 | 1,030 |
| 120ms | 36 | 31 | 2,181 | 2,280 |
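The table suggests that VMkernel migration time grows roughly linearly with the bridging delay, while the console case barely moves. The following sketch quantifies that with a plain least-squares slope over the ESX2->ESX1 columns. Reading the slope as a count of serialized waits is an interpretation of the numbers, not something that was measured.

```python
# Migration times copied from the WANem table above (ESX2->ESX1 direction).
delays_ms = [0, 25, 50, 120]
console_secs = [25, 29, 31, 36]
vmkernel_secs = [144, 514, 982, 2181]

def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

console_slope = least_squares_slope(delays_ms, console_secs)
vmkernel_slope = least_squares_slope(delays_ms, vmkernel_secs)

print(f"console:  {console_slope:.2f} s per ms of delay")
print(f"vmkernel: {vmkernel_slope:.2f} s per ms of delay")

# One extra second per millisecond of delay corresponds to 1000 waits
# that each last one delay period.
print(f"implied serialized waits (vmkernel): {vmkernel_slope * 1000:.0f}")
```

A slope of that size for the VMkernel path is what a small fixed window would produce: every window's worth of data pays the delay once.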
- Note: tcpdump at the bridge reveals that the TCP window is limited to 33304. Will try to enable TCP window scaling on the ESX servers
- The traffic was directly between the two ESX servers for the most part. There was some exchange with NFS initially and intermittently. Of the 1,121,016 packets captured (both directions), 3812 packets were to/from the NFS server
- I also performed a bulk transfer of a 512MB file over https from the server behind the delay line. I observed the following:
| Delay | Tput | Duration |
| Direct | 3.12 MB/s | 224s |
| 0 ms | 3.00 MB/s | 232s |
| 50 ms | 0.91 MB/s | 578s |
| 100 ms | 0.53 MB/s | 987s |
I observed that in all cases the cwnd advertised in the tcpdump was 7504. I am not sure why that number came up.
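One way to sanity-check the advertised value is to work backwards from the measured throughput: the data in flight per round trip is roughly throughput times round-trip time. The sketch below does that for the two delayed transfers. It assumes that "MB/s" means 10^6 bytes per second and that the configured delay approximates the full round-trip time; neither is stated in the notes.

```python
BYTES_PER_MB = 1_000_000  # ASSUMPTION: "MB/s" means 10^6 bytes per second

def implied_window_bytes(tput_mb_s: float, rtt_ms: float) -> float:
    """Bytes that must be in flight per round trip to sustain tput_mb_s."""
    return tput_mb_s * BYTES_PER_MB * (rtt_ms / 1000.0)

# (delay in ms, measured throughput in MB/s) from the transfer results above
measurements = [(50, 0.91), (100, 0.53)]

for delay_ms, tput in measurements:
    window = implied_window_bytes(tput, delay_ms)
    print(f"{delay_ms} ms: about {window:.0f} bytes in flight per round trip")
```

Both figures come out several times larger than 7504, which would be consistent with a window-scale factor being applied to the value shown by tcpdump, though the notes do not confirm that.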
- TCP tuning with the following sysctl parameters works well for the console stack, but does not have much effect on the VMkernel. Others have reported the same observation.
net.ipv4.tcp_window_scaling = 1
net.core.wmem_max = 108544
net.core.rmem_max = 108544
net.ipv4.tcp_rmem = "4096 1048576 4194304"
net.ipv4.tcp_wmem = "4096 1048576 4194304"
net.ipv4.tcp_mem = "98304 131072 196608"
- Following results were seen for migration across a 50ms delay line:
| System | Memory allocated | Migration time |
| Ubuntu | 512MB memory, 104.72MB overhead | 982 secs |
| Debian1 | 128MB memory, 67.24MB overhead | 309 secs |
| Debian2 | 64MB memory, 66.30MB overhead | 187 secs |
| Debian3 | 32MB memory, 65.83MB overhead | 105 secs |
For 100ms delay, the 32MB memory system took about 226 seconds to migrate.
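The 50 ms results can be compared against the ceiling that a 33304-byte window would impose. This is a rough sketch under several assumptions that the notes do not confirm: the window value is in bytes, the 50 ms delay approximates the round-trip time, MB means 10^6 bytes, and the amount transferred is memory plus overhead.

```python
BYTES_PER_MB = 1_000_000  # ASSUMPTION: MB means 10^6 bytes

def window_bound_mb_s(window_bytes: int, rtt_ms: float) -> float:
    """Throughput ceiling (MB/s) when one window is sent per round trip."""
    return window_bytes / (rtt_ms / 1000.0) / BYTES_PER_MB

def effective_mb_s(memory_mb: float, overhead_mb: float, secs: float) -> float:
    """Average throughput if memory + overhead is what gets transferred."""
    return (memory_mb + overhead_mb) / secs

# (name, memory MB, overhead MB, migration secs) from the 50 ms results above
results = [
    ("Ubuntu", 512, 104.72, 982),
    ("Debian1", 128, 67.24, 309),
    ("Debian2", 64, 66.30, 187),
    ("Debian3", 32, 65.83, 105),
]

bound = window_bound_mb_s(33304, 50)
print(f"window bound: {bound:.3f} MB/s")
for name, mem, overhead, secs in results:
    print(f"{name}: {effective_mb_s(mem, overhead, secs):.3f} MB/s")
```

With these numbers the two larger VMs land slightly below the bound and the two smaller ones above it, so the memory-plus-overhead estimate probably overstates what is actually sent for small VMs. Treat the comparison as a plausibility check only.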
