Tuesday, 4 March 2014

ESXTOP



ESXTOP is a fantastic tool for the VMware administrator troubleshooting performance issues in a vSphere environment. ESXTOP has a somewhat steep learning curve, but it is well worth the effort. In this post I want to help you get a head start with ESXTOP. If you want a really good read, I recommend Duncan’s very comprehensive post on the same subject.
ESXTOP is available in two ways: through the ESXi Shell, or through the vSphere Management Assistant with the command RESXTOP. In this article I will focus on ESXTOP from the ESXi Shell. Getting access to ESXTOP is very simple.
Step 1: Get access to the ESXi Shell. Open your vSphere Client, go to the host, then Configuration, then Security Profile, and start the ESXi Shell service on the specific ESXi host. To connect remotely as in step 2, the SSH service must be running as well.
Step 2: Download PuTTY (or another SSH client) and create an SSH connection on port 22 to your ESXi host. Log in with root and your password.
Step 3: Type the command esxtop and hit return.
Step 4: You are now looking at ESXTOP. It should look similar to this:
[Screenshot: the esxtop CPU screen]

What you are looking at is the CPU screen in ESXTOP, which shows the CPU-specific counters. You can browse through the different screens: type m to see memory metrics, n for network, and so on. Type ? to see all available commands. By default ESXTOP shows a lot of “worlds”; a world is similar to a process in Windows Task Manager. To filter out the “vmkernel worlds”, type an upper-case V. You then see only the virtual machines running on this specific ESXi host.
Now that you are inside ESXTOP, let’s focus on some good counters to use for performance troubleshooting.

CPU 

When troubleshooting CPU performance for your virtual machines, the following counters are the most important:
%USED, %RDY, %CSTP
%USED tells you how much time the virtual machine spent executing CPU cycles on the physical CPU.
%RDY is a Key Performance Indicator! Always start with this one. It tells you how much time your virtual machine wanted to execute CPU cycles but could not get access to a physical CPU, in other words how much time it spent in a “queue”. I normally expect this value to stay below 5% (this equals 1000 ms in the vCenter real-time performance graphs).
%CSTP (co-stop) tells you how much time a vCPU of a multi-vCPU virtual machine is stopped while waiting for its other vCPUs to catch up. If this number is higher than 3%, you should consider lowering the number of vCPUs in your virtual machine.
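The relation between the 5% guideline and the 1000 ms figure can be checked with a little arithmetic. The sketch below assumes the vCenter real-time charts use a 20-second sample interval, which is what makes 1000 ms equal 5%; the function names are hypothetical and chosen only for this illustration.

```python
def ready_ms_to_percent(ready_ms, interval_s=20):
    """Convert a CPU ready value in milliseconds to a percentage of the sample interval.

    interval_s=20 is an assumption: the sample interval of the real-time charts.
    """
    return ready_ms * 100.0 / (interval_s * 1000.0)


def percent_to_ready_ms(percent, interval_s=20):
    """Inverse conversion: a %RDY value to milliseconds within one sample interval."""
    return percent * interval_s * 1000.0 / 100.0


if __name__ == "__main__":
    print(ready_ms_to_percent(1000))  # the 1000 ms from the text
    print(percent_to_ready_ms(5))     # the 5% guideline from the text
```

For charts with a different roll-up interval, pass that interval in seconds instead of the default.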

Memory

When troubleshooting memory performance, these are the counters you want to focus on from a virtual machine perspective:
MCTL?, MCTLSZ, SWCUR, SWR/s, SWW/s
MCTL? This column is either yes or no. Yes means that the balloon driver is installed. The balloon driver is installed automatically with VMware Tools and should be present in every virtual machine. If this column says no, figure out why.
MCTLSZ This column shows you how far the balloon is inflated in the virtual machine. If it says 500MB, the balloon driver inside the guest operating system has “stolen” 500MB from Windows/Linux etc. You would expect to see a value of 0 (zero) in this column.
SWCUR tells you how much of the virtual machine’s memory currently sits in the .vswp file. If you see 500MB here, it means that 500MB of the virtual machine’s memory is in the swap file. This does not necessarily equal bad performance. To figure out if your virtual machine is suffering from hypervisor swapping, you need to look at the next two counters. In a healthy environment you would want this value to be 0 (zero).
SWR/s This value tells you the read activity on your swap file. If you see a number here, your virtual machine is suffering from hypervisor swapping.
SWW/s This value tells you the write activity on your swap file. You want to see the number 0 (zero) here. Every number above 0 is BAD.

If you have made it this far, I suggest you look at the following document, which details ALL of the counters in ESXTOP. I call it the ESXTOP Bible :-)

ESXTOP



This page is solely dedicated to one of the best tools in the world for ESX; esxtop.

Intro

I am a huge fan of esxtop! I read a couple of pages of the esxtop bible every day before I go to bed. Something I am always struggling with, however, is the “thresholds” of specific metrics. I fully understand that it is not black and white; performance is the perception of a user in the end.
There must be a certain threshold, however. For instance, it must be safe to say that when %RDY constantly exceeds the value of 20 it is very likely that the VM responds sluggishly. I want to use this article to “define” these thresholds, but I need your help. There are many people reading these articles, and together we must know at least a dozen metrics. Let’s collect and document them, with possible causes if known.
Please keep in mind that these should only be used as a guideline when doing performance troubleshooting! Also be aware that some metrics are not part of the default view. You can add fields to an esxtop view by pressing “f” followed by the corresponding character.
I used VMworld presentations, VMware whitepapers, VMware documentation, VMTN topics and of course my own experience as sources, and these are the metrics and thresholds I have come up with so far. Please comment and help build the main source for esxtop thresholds.

Metrics and Thresholds

Display | Metric | Threshold | Explanation
CPU | %RDY | 10 | Overprovisioning of vCPUs, excessive usage of vSMP or a limit (check %MLMTD) has been set. See Jason’s explanation for vSMP VMs.
CPU | %CSTP | 3 | Excessive usage of vSMP. Decrease amount of vCPUs for this particular VM. This should lead to increased scheduling opportunities.
CPU | %SYS | 20 | The percentage of time spent by system services on behalf of the world. Most likely caused by high IO VM. Check other metrics and VM for possible root cause.
CPU | %MLMTD | 0 | The percentage of time the vCPU was ready to run but deliberately wasn’t scheduled because that would violate the “CPU limit” settings. If larger than 0 the world is being throttled due to the limit on CPU.
CPU | %SWPWT | 5 | VM waiting on swapped pages to be read from disk. Possible cause: Memory overcommitment.
MEM | MCTLSZ | 1 | If larger than 0 host is forcing VMs to inflate balloon driver to reclaim memory as host is overcommitted.
MEM | SWCUR | 1 | If larger than 0 host has swapped memory pages in the past. Possible cause: Overcommitment.
MEM | SWR/s | 1 | If larger than 0 host is actively reading from swap (vswp). Possible cause: Excessive memory overcommitment.
MEM | SWW/s | 1 | If larger than 0 host is actively writing to swap (vswp). Possible cause: Excessive memory overcommitment.
MEM | CACHEUSD | 0 | If larger than 0 host has compressed memory. Possible cause: Memory overcommitment.
MEM | ZIP/s | 0 | If larger than 0 host is actively compressing memory. Possible cause: Memory overcommitment.
MEM | UNZIP/s | 0 | If larger than 0 host is accessing compressed memory. Possible cause: Previously host was overcommitted on memory.
MEM | N%L | 80 | If less than 80 VM experiences poor NUMA locality. If a VM has a memory size greater than the amount of memory local to each processor, the ESX scheduler does not attempt to use NUMA optimizations for that VM and “remotely” uses memory via “interconnect”. Check “GST_ND(X)” to find out which NUMA nodes are used.
NETWORK | %DRPTX | 1 | Dropped packets transmitted, hardware overworked. Possible cause: very high network utilization.
NETWORK | %DRPRX | 1 | Dropped packets received, hardware overworked. Possible cause: very high network utilization.
DISK | GAVG | 25 | Look at “DAVG” and “KAVG” as the sum of both is GAVG.
DISK | DAVG | 25 | Disk latency most likely to be caused by array.
DISK | KAVG | 2 | Disk latency caused by the VMkernel, high KAVG usually means queuing. Check “QUED”.
DISK | QUED | 1 | Queue maxed out. Possibly queue depth set too low. Check with array vendor for optimal queue depth value.
DISK | ABRTS/s | 1 | Aborts issued by guest (VM) because storage is not responding. For Windows VMs this happens after 60 seconds by default. Can be caused for instance when paths failed or array is not accepting any IO for whatever reason.
DISK | RESETS/s | 1 | The number of commands reset per second.
DISK | CONS/s | 20 | SCSI Reservation Conflicts per second. If many SCSI Reservation Conflicts occur performance could be degraded due to the lock on the VMFS.
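As a rough illustration of how such guideline values could be applied to a sample of counters, here is a minimal sketch. It covers only a subset of the metrics listed above, and the comparison rule (flag when a value is above the threshold, or below it for N%L) is an assumption made for this example, as are the function and variable names.

```python
# Subset of the guideline thresholds from the list above.
# "above" flags values greater than the limit; "below" flags values less than it.
THRESHOLDS = {
    "%RDY": ("above", 10),
    "%CSTP": ("above", 3),
    "%MLMTD": ("above", 0),
    "N%L": ("below", 80),
    "DAVG": ("above", 25),
    "KAVG": ("above", 2),
}


def flag_metrics(sample):
    """Return the sorted names of metrics in `sample` that cross their guideline threshold."""
    flagged = []
    for metric, value in sample.items():
        if metric not in THRESHOLDS:
            continue  # no guideline known for this metric
        direction, limit = THRESHOLDS[metric]
        if direction == "above" and value > limit:
            flagged.append(metric)
        elif direction == "below" and value < limit:
            flagged.append(metric)
    return sorted(flagged)


if __name__ == "__main__":
    # Hypothetical values, typed in by hand for illustration.
    print(flag_metrics({"%RDY": 12.5, "%CSTP": 0.4, "N%L": 65, "KAVG": 0.1}))
```

A flagged metric is only a starting point for investigation, in line with the note that these values are guidelines.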

Running esxtop

Although understanding all the metrics esxtop provides seems impossible, using esxtop is fairly simple. When you get the hang of it you will notice yourself staring at the metrics/thresholds more often than ever. The following keys are the ones I use the most.
Open a console session or SSH to the ESX(i) host and type:
esxtop
By default the screen is refreshed every 5 seconds. Change this, for example to 2 seconds, by typing:
s 2
Changing views is easy. Type the following keys for the associated views:
c = cpu
m = memory
n = network
i = interrupts
d = disk adapter
u = disk device (includes NFS as of 4.0 Update 2)
v = disk VM
p = power states

V = only show virtual machine worlds
e = Expand/Rollup CPU statistics, show details of all worlds associated with group (GID)
k = kill world, for tech support purposes only!
l = limit display to a single group (GID), enables you to focus on one VM
# = limit the number of entities, for instance the top 5

2 = highlight a row, moving down
8 = highlight a row, moving up
4 = remove selected row from view
e = statistics broken down per world
6 = statistics broken down per world
Add/Remove fields:
f
Changing the order:
o
Saving all the settings you’ve changed:
W
Keep in mind that when you don’t change the file-name it will be saved and used as default settings.
Help:
?
In very large environments esxtop can cause high CPU utilization due to the amount of data that needs to be gathered and the calculations that need to be done. If the CPU appears to be highly utilized due to the number of entities (VMs, LUNs, etc.), a command-line option can be used that locks specific entities and keeps esxtop from gathering specific info, which limits the amount of CPU power needed:
esxtop -l
More info about this command line option can be found here.

Capturing esxtop results

First things first. Make sure you only capture relevant info and ditch the metrics you don’t need. In other words, run esxtop and add or remove (f) fields until only the ones you need are shown. When you are finished, make sure to write (W) the configuration to disk. You can either write it to the default config file (esxtop4rc) or write the configuration to a new file.
Now that you have configured esxtop as needed, run it in batch mode and save the results to a .csv file:
esxtop -b -d 2 -n 100 > esxtopcapture.csv
Where “-b” stands for batch mode, “-d 2” is a delay of 2 seconds and “-n 100” means 100 iterations. In this specific case esxtop will log all metrics for 200 seconds. If you want to record all metrics, make sure to add “-a” to your string.
Or what about directly zipping the output as well? These .csv files can grow fast, and by zipping them a lot of precious disk space can be saved!
esxtop -b -a -d 2 -n 100 | gzip -9c > esxtopoutput.csv.gz
Please note that when a new VM is powered on, a VM is vMotioned to the host, or a new world is created, it will not show up within esxtop when “-b” is used, as the entities are locked! This behavior is similar to starting esxtop with “-l”.
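The capture length follows directly from the delay and the number of iterations. The sketch below only assembles the argument list and computes the duration; it does not run esxtop. The flags (-b, -a, -d, -n) are the ones described above, while the helper names are hypothetical.

```python
def batch_command(delay_s, iterations, all_metrics=False):
    """Build the esxtop batch-mode argument list described in the text."""
    args = ["esxtop", "-b"]
    if all_metrics:
        args.append("-a")  # record all metrics, not only the configured fields
    args += ["-d", str(delay_s), "-n", str(iterations)]
    return args


def capture_duration_s(delay_s, iterations):
    """Total capture time in seconds: one sample every delay_s seconds."""
    return delay_s * iterations


if __name__ == "__main__":
    print(" ".join(batch_command(2, 100)))
    print(capture_duration_s(2, 100), "seconds")
```

This makes it easy to work backwards: for a one-hour capture at a 5-second delay, the number of iterations is 3600 / 5 = 720.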

Analyzing results

You can use multiple tools to analyze the captured data.
  1. VisualEsxtop
  2. perfmon
  3. excel
  4. esxplot
So what is VisualEsxtop? It is a relatively new tool (published 1st of July 2013), described as follows:
VisualEsxtop is an enhanced version of resxtop and esxtop. VisualEsxtop can connect to VMware vCenter Server or ESX hosts, and display ESX server stats with a better user interface and more advanced features.
That sounds nice, right? Let’s have a look at how it works. This is what I did to get it up and running:
  • Go to “http://labs.vmware.com/flings/visualesxtop” and click “download”
  • Unzip “VisualEsxtop.zip” into the folder where you want to store the tool
  • Go to the folder
  • Double click “visualesxtop.bat” when running Windows (Or follow William’s tip for the Mac)
  • Click “File” and “Connect to Live Server”
  • Enter the “Hostname”, “Username” and “Password” and hit “Connect”
  • That is it…
Now some simple tips:
  • By default the refresh interval is set to 5 seconds. You can change this by hitting “Configuration” and then “Change Interval”
  • You can also load batch output. This might come in handy when, for instance, you are a consultant and a customer sends you captured data. You can do this under: File -> Load Batch Output
  • You can filter output, which is very useful if you are looking for info on a specific virtual machine / world! See the filter section.
  • When you click “Charts” and double click “Object Types” you will see a list of metrics that you can create a chart with. Just unfold the ones you need and double click them to add them to the right pane
There are a bunch of other cool features in there, like color-coding of important metrics. The fact that you can show multiple windows at the same time is also useful, as are the tooltips that provide a description of each counter. If you ask me, it is a tool everyone should download and check out.
Let’s continue with my second favorite tool, perfmon. I’ve used perfmon (part of Windows, also known as “Performance Monitor”) multiple times, and it’s probably the easiest option as many people are already familiar with it. You can import a CSV as follows:
  1. Run: perfmon
  2. Right click on the graph and select “Properties”.
  3. Select the “Source” tab.
  4. Select the “Log files:” radio button from the “Data source” section.
  5. Click the “Add” button.
  6. Select the CSV file created by esxtop and click “OK”.
  7. Click the “Apply” button.
  8. Optionally: reduce the range of time over which the data will be displayed by using the sliders under the “Time Range” button.
  9. Select the “Data” tab.
  10. Remove all Counters.
  11. Click “Add” and select appropriate counters.
  12. Click “OK”.
  13. Click “OK”.
The result of the above would be:
[Screenshot: esxtop data imported into perfmon]
With MS Excel it is also possible to import the data as a CSV. Keep in mind, though, that the amount of captured data is huge, so you might want to limit it by first importing it into perfmon, selecting the correct timeframe and counters, and exporting the result to a CSV. When you have done so, you can import the CSV as follows:
  1. Run: excel
  2. Click on “Data”
  3. Click “Import External Data” and click “Import Data”
  4. Select “Text files” as “Files of Type”
  5. Select file and click “Open”
  6. Make sure “Delimited” is selected and click “Next”
  7. Deselect “Tab” and select “Comma”
  8. Click “Next” and “Finish”
All data should be imported and can be shaped / modelled / diagrammed as needed.
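Since the captured files can become very wide, trimming them to a handful of counters before opening them in a spreadsheet can help. The following is a minimal sketch of that idea using only the standard csv module. It assumes the first column is the timestamp and that counter names follow a perfmon-style layout such as \\host\Group Cpu(id:vmname)\% Ready; the host name, VM names and values in the sample are made up for illustration.

```python
import csv
import io


def filter_columns(source, keep_terms):
    """Yield rows reduced to the timestamp column plus columns whose name contains every term."""
    reader = csv.reader(source)
    header = next(reader)
    keep = [0] + [
        i for i, name in enumerate(header)
        if i > 0 and all(term in name for term in keep_terms)
    ]
    yield [header[i] for i in keep]
    for row in reader:
        yield [row[i] for i in keep]


# Hypothetical sample in the assumed layout (not real capture data).
SAMPLE = "\n".join([
    r'"(PDH-CSV 4.0) (UTC)(0)","\\esx01\Group Cpu(1234:web01)\% Ready",'
    r'"\\esx01\Group Cpu(1234:web01)\% Used","\\esx01\Group Cpu(5678:db01)\% Ready"',
    '"03/04/2014 10:00:00","1.20","35.00","0.40"',
    '"03/04/2014 10:00:02","6.80","41.50","0.30"',
])

if __name__ == "__main__":
    for row in filter_columns(io.StringIO(SAMPLE), ["web01", "% Ready"]):
        print(row)
```

For a real capture, open the file with open(path, newline="") and pass the file object instead of the StringIO sample.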
Another option is to use a tool called “esxplot”. It hasn’t been updated in a while, and I am not sure what the state of the tool is. You can still download the latest version, but personally I would recommend using VisualEsxtop instead of esxplot, simply because it is more recent.
  1. Run: esxplot
  2. Click File -> Import -> Dataset
  3. Select file and click “Open”
  4. Double click host name and click on metric
[Screenshot: using esxplot for esxtop data]
As you can clearly see in the screenshot, the legend (to the right of the graph) is too long. You can modify that as follows:
  1. Click on “File” -> preferences
  2. Select “Abbreviated legends”
  3. Enter appropriate value
For those using a Mac: esxplot uses specific libraries which are only available for the 32-bit version of Python. For esxplot to function correctly, set the following environment variable:
export VERSIONER_PYTHON_PREFER_32_BIT=yes

Limiting your view

In environments with a very high consolidation ratio (a high number of VMs per host), the VM you need performance counters for may not be shown on your screen, simply because the height of the screen limits what can be displayed. Unfortunately there is currently no command-line option for esxtop to specify the VMs that need to be displayed. However, you can export the current list of worlds and import it again to limit the number of VMs shown.
esxtop -export-entity filename
Now you should be able to edit your file and comment out the worlds that do not need to be displayed.
esxtop -import-entity filename
I figured that there should be a way to get the info through the command line as well, and this is what I came up with. Please note that <vmname> needs to be replaced with the name of the virtual machine that you need the GID for.
VMWID=`vm-support -x | grep <vmname> |awk '{gsub("wid=", "");print $1}'`
VMXCARTEL=`vsish -e cat /vm/$VMWID/vmxCartelID`
vsish -e cat /sched/memClients/$VMXCARTEL/SchedGroupID
Now you can use the outcome within esxtop to limit (l) your view to that single GID. William Lam wrote an article a couple of days after I added the GID section. The following is a lot simpler than what I came up with, thanks William!
VM_NAME=STA202G ;grep "${VM_NAME}" /proc/vmware/sched/drm-stats  | awk '{print $1}'


Changelog

07-01-2010 | decreased %RDY from 20 to a value of 10
22-01-2010 | added CPU –> TIMER/S
22-01-2010 | added MEM –> N%L
24-01-2010 | added sections (howto)
02-02-2010 | expanded analyze section and included screenshots
10-02-2010 | decreased %CSTP from 100 to 5
10-02-2010 | decreased KAVG from 5 to 2
23-03-2010 | increase %SWPWT from 1 to 5
23-03-2010 | added “e”, “V”, “i”, “2”, “4”, “6”, “8” in the “views” section
16-06-2010 | added “-l” functionality and stressed NFS added option
08-11-2010 | added “l”, “e”, “l”, “#”, “%SYS”, “ZIP/s”, “UNZIP/s”, “CACHEUSD”
11-11-2010 | added threshold for “CONS/s”
12-03-2011 | Redid some of the formatting
25-05-2011 | added “limiting the view section”
03-01-2012 | added NUMA details
08-07-2013 | added VisualEsxtop

vSphere 5.0 – what’s new for esxtop





I was just playing around with esxtop in vSphere 5.0 and spotted something that had changed. I figured there must be more, so I started digging. I didn’t dig too deep, as there is a great VMworld session (VSP1999) on this topic by Krishna Raj Raja and I figured why re-invent the wheel. Anyway, here are the things I noticed which will definitely come in handy at some point while troubleshooting performance issues:
  • Each display type now shows the number of Worlds, VMs and vCPUs on the host on the first line. This will allow you to quickly identify why, for instance, there is a high %RDY.
  • %VMWAIT is a derivative of %WAIT; however, it does not include idle time, only %SWPWT and “blocked” time. A world could for instance also be blocked when the connectivity to the storage device has failed.
  • In the Power display there’s a new line, PSTATE MHZ. This shows you the clock frequency per state. For instance “2395” is the clock frequency of %P0 and “1596” is the clock frequency of %P7. Please note that “%USED” is based on the base frequency (%P0) of your CPU. %UTIL is the utilization in its current state (%Px), so in this case that could be 40% of %P7 (1596), which is 638.
  • In the “Device Display” there are new stats starting with “F”, for example FCMDs, which show the failed I/Os. A fairly quick way to see if there are any I/O errors.
  • Two new counters in the “Memory Display”, LLSWR/s and LLSWW/s, show the amount of memory being read from or written to host cache. Useful when you have enabled this feature and want to know if it is actively being used. Of course there are also vCenter stats for this one.
I love esxtop, and with 5.0 it has become even better. Especially “%VMWAIT” and the PSTATE details will come in handy at some point in time!
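The P-state arithmetic from the Power display can be written out as a small calculation. This is only a sketch of the example numbers given above (2395 MHz for %P0, 1596 MHz for %P7, 40% utilization); expressing the result relative to the base frequency is a simplified reading of the difference between %UTIL and %USED, and it ignores other factors that influence %USED. The function names are hypothetical.

```python
def effective_mhz(util_percent, pstate_mhz):
    """MHz actually consumed at a given utilization of the current P-state frequency."""
    return util_percent / 100.0 * pstate_mhz


def percent_of_base(util_percent, pstate_mhz, base_mhz):
    """The same work expressed as a percentage of the base (P0) frequency."""
    return effective_mhz(util_percent, pstate_mhz) / base_mhz * 100.0


if __name__ == "__main__":
    mhz = effective_mhz(40, 1596)
    print(int(mhz))                               # the 638 from the text, truncated
    print(round(percent_of_base(40, 1596, 2395), 1))
```

The point of the comparison is that 40% of a slow P-state is much less work than 40% of the base frequency.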

Re: Memory Compression



I was just reading Scott Drummonds’ article on Memory Compression. Scott explains where Memory Compression comes into play. The part I want to reply to is the following:
VMware’s long-term prioritization for managing the most aggressively over-committed memory looks like this:
  1. Do not swap if possible.  We will continue to leverage transparent page sharing and ballooning to make swapping a last resort.
  2. Use ODMC to a predefined cache to decrease memory utilization.*
  3. Swap to persistent memory (SSD) installed locally in the server.**
  4. Swap to the array, which may benefit from installed SSDs.
(*) Demonstrated in the lab and coming in a future product.
(**) Part of our vision and not yet demonstrated.
I just love it when we give insight into upcoming features, but I am not sure I agree with the prioritization. I think there are several things that one needs to keep in mind. In other words, there’s a cost associated with these decisions / features, and your design needs to be adjusted for these associated effects.
  1. TPS -> Although TPS is an amazing way of reducing the memory footprint, you will need to figure out what the ratio of deduplication is. Especially when you are using Nehalem processors there’s a serious decrease. The reasons for the decrease in TPS effectiveness are the following:
    • NUMA – By default there is no inter-node transparent page sharing (read Frank’s article for more info on this topic).
    • Large Pages – By default TPS does not share large (2MB) pages; it only shares small (4KB) pages. It will break large pages down into small pages when memory is scarce, but it is definitely something you need to be aware of (for more info read my article on this topic).
  2. Use ODMC -> I haven’t tested with ODMC yet and I don’t know what the associated cost is at the moment.
  3. Swap on local SSD -> Swap on local SSD will most definitely improve the speed when swapping occurs. However, as Frank already described in his article, there is an associated cost:
    • Disk space – You will need to make sure you will have enough disk space available to power on VMs or migrate VMs as these swap files will be created at power on or at migration.
    • Defaults – By default .vswp files are stored in the same folder as the .vmx. Changing this needs to be documented and taken into account during upgrades and design changes.
  4. Swap to array (SSD) -> This is the option that most customers use for the simple reason that it doesn’t require a local SSD disk. There are no changes needed to enable it and it’s easier to increase a SAN volume than it is to increase a local disk when needed. The associated costs however are:
    • Costs – Shared storage is relatively expensive compared to local disks
    • Defaults – If .vswp files need to be SSD based, you will need to separate the .vswp files from the rest of the VM files and create dedicated shared SSD volumes.
I fully agree with Scott that it’s an exciting feature and I can’t wait for it to be available. Keep in mind, though, that there is a trade-off for every decision you make and that the result of a decision might not always end up as you expected. Even though Scott’s list makes total sense, there is more than meets the eye.

Swapping, esxtop and /proc/vmware/sched/mem






At a customer site we noticed that the ESX hosts were swapping; Nagios generated a nice alarm. After some research it seemed that certain VMs were swapping to the VMFS volume, so this was VMware swap usage rather than swapping inside the guest OS. A closer look at the system revealed that we weren’t overcommitting. There was over 6GB of memory free and there were no limits set on the specific VMs. Could it be just Nagios or… No, esxtop with the following commands “s2 m f j” (a 2-second refresh, the memory view, and the swap statistics fields) revealed the following:

The column swcur displays the current swap file usage; I marked the values higher than 0 in red.
After a couple of searches it seemed that there is little info about swcur. But Kit Colbert, a VMware employee, posted on the VMTN forum about checking your current memory / swap usage in the file “/proc/vmware/sched/mem”. With cat you can easily display this, and with “watch -n 1” you can refresh your view every second. The following output was retrieved via the command “watch -n 1 cat /proc/vmware/sched/mem”:

We migrated a VM which was swapping according to esxtop and Nagios to another host, and as expected the swap remained. We powered down a VM that was swapping and powered it on again, and although the host had more than enough free memory available, the swap returned. It was less than before but still… The funny thing is that according to Kit it’s all about the column “swap out”, and we did not see much action going on there.

Swapping?


We had a discussion internally about performance and swapping. I started writing this article and asked Frank if it made sense. Frank’s reply: “just guess what I am writing about at the moment”. As both of us had a different approach, we decided to launch both articles at the same time and refer to each other’s post. So here’s the link to Frank’s take on the discussion, and I highly recommend reading it: “Re: Swapping”.
As always, the common theme of the discussion was “swapping bad”. I don’t necessarily disagree, but I do want to note that it is important to figure out whether the system is actually actively swapping or not.
In many cases “bad performance” is blamed on swapping. However, this is not always the case. As described in my section on “ESXTOP”, there are multiple metrics on “swap” itself. Only a few of those relate to performance degradation due to swapping. I’ve listed the important metrics below.
Host:
MEM - SWAP/MB “curr” = Total swapped machine memory of all the groups including virtual machines.
MEM - SWAP/MB “target” = The expected swap usage.
MEM - SWAP/MB “r/s” = The rate at which memory is swapped in from disk.
MEM - SWAP/MB “w/s” = The rate at which machine memory is swapped out to disk.
VM:
MEM - SWCUR = If larger than 0 host has swapped memory pages from this VM in the past.
MEM - SWTGT = The expected swap usage.
MEM - SWR/s (J) = If larger than 0 host is actively reading from swap (vswp).
MEM - SWW/s (J) = If larger than 0 host is actively writing to swap (vswp).
So which metrics do really matter when your customer complains about degradation of performance?
First metric to check:
SWR/s (J) = If larger than zero the ESX host is actively reading from swap(vswp).
Associated to that metric I would recommend looking at the following metric:
%SWPWT = The percentage of time the world is waiting for the ESX VMkernel to swap memory in.
So what about all those other metrics? Why don’t they really matter?
Take “Current Swap”: as long as it is not being “read”, it might just be one of those sporadically used pages which is sitting there doing nothing. Will it hurt performance? Maybe, but as long as it is not being read… no, it will most likely not hurt. Even writing to swap does not necessarily hurt performance, although it might. Those metrics should just be used as indicators that the system is severely overcommitted and that performance might be degraded in the future when pages are being read!
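The reasoning above boils down to an order of checks: swap reads first, then swap writes, then the amount currently swapped. A minimal sketch of that decision logic follows. The counter names match the esxtop columns discussed here, but the function, its labels and the example values are hypothetical.

```python
def swap_state(swcur_mb, swr_per_s, sww_per_s=0.0):
    """Classify a VM's swap situation using the priority described in the text."""
    if swr_per_s > 0:
        # Pages are being read back from the .vswp file: this is what hurts right now.
        return "actively reading from swap"
    if sww_per_s > 0:
        # Writing does not necessarily hurt yet, but signals overcommitment.
        return "actively writing to swap"
    if swcur_mb > 0:
        # Swapped in the past; harmless as long as the pages are not read.
        return "swapped in the past"
    return "no swap"


if __name__ == "__main__":
    print(swap_state(swcur_mb=500, swr_per_s=0.0))
    print(swap_state(swcur_mb=500, swr_per_s=1.2))
```

Pairing the first case with a look at %SWPWT, as suggested above, shows how much time the VM actually spends waiting.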

Understanding VMware Ballooning



VMware ballooning is one of the hard concepts to grasp, and there are a lot of misunderstandings out there about this feature. I have been discussing it with customers and students for the last 5 years. This is my attempt to explain ballooning.
VMware ballooning is a memory reclamation technique used when an ESXi host is running low on memory. You should not see ballooning if your host is performing like it should. To understand ballooning, take a look at the following picture:
[Figure: the three levels of memory in a virtual environment]

This picture shows the three levels of memory in a virtual environment. In a physical world we would only have the two top levels (virtual memory and guest physical memory), but in the virtual world we also have the host physical memory. What is important to know is that the hypervisor (ESXi) has no knowledge of what is happening inside the virtual machine (grey area). The hypervisor maps memory when the virtual machine asks for it. The hypervisor will then give it memory from “host physical memory”, but only if memory is available. If memory is not available, the memory can be mapped to the .vswp file on a VMFS or NFS datastore. The virtual machine has no knowledge of whether the memory is mapped to physical memory or to a disk. This is called hypervisor swapping, and it is the last resort for the vmkernel.
Ballooning, in short, is a process where the hypervisor reclaims memory from the virtual machine. Ballooning happens when the ESXi host is running out of physical memory: the demand of the virtual machines is too high for the host to handle.
Let’s take a high-level example:
  1. Inside a virtual machine you start an application, for instance Solitaire.
  2. Solitaire as an application will ask the guest operating system (in this case Windows) for memory. Windows will give it memory and map it from virtual memory -> guest physical memory.
  3. What happens next is that the hypervisor sees the request for memory and maps guest physical memory -> host physical memory.
  4. Now everything is perfect. You play Solitaire for a few hours, and then you close it down.
  5. When you close Solitaire the guest operating system will mark the memory as “free” and make it available for other applications. BUT since the hypervisor does not have access to Windows’ “free memory” list, the memory will still be mapped in “host physical memory”, putting memory load on the ESXi host.
  6. This is where ballooning comes into play. When an ESXi host is running low on memory, the hypervisor will ask the “balloon” driver installed inside the virtual machine (with VMware Tools) to “inflate”.
  7. The balloon driver will inflate, and because it is “inside” the operating system it will start by getting memory from the “free list”. The hypervisor will detect what memory the balloon driver has claimed and will free it up on the “host physical memory” layer!
The balloon driver can inflate up to a maximum of 65% of the VM’s memory. For instance, for a VM with 1000MB memory the balloon can inflate to 650MB. In case of full inflation for this particular VM, the hypervisor gets 650MB of memory reclaimed. The downside is that you risk your VM doing guest OS swapping to its page file! Just remember that page file swapping is better than hypervisor swapping: hypervisor swapping happens without the guest operating system being aware of it, whereas with page file swapping it is the OS that decides which pages to swap to disk. The way to avoid ballooning is not to uninstall the balloon driver but to create a “Memory Reservation” for the virtual machine.
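The 65% figure translates into a simple upper bound on how much memory the balloon can take from a VM. The sketch below only reproduces that arithmetic for the 1000MB example; the function names are hypothetical, and the 65% default is taken from the text rather than read from any host setting.

```python
def max_balloon_mb(configured_mb, max_percent=65):
    """Upper bound of the balloon size for a VM, using the 65% maximum from the text."""
    return configured_mb * max_percent / 100


def guest_memory_left_mb(configured_mb, max_percent=65):
    """Memory the guest OS keeps for itself when the balloon is fully inflated."""
    return configured_mb - max_balloon_mb(configured_mb, max_percent)


if __name__ == "__main__":
    print(max_balloon_mb(1000))        # full inflation for the 1000MB example
    print(guest_memory_left_mb(1000))  # what remains for the guest OS
```

The second number is why full inflation carries the risk of guest page file swapping: the guest has to fit its working set into what is left.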
To check for ballooning you can either open ESXTOP or the vCenter Performance Graphs.


ADDING A SECONDARY NIC TO THE VCENTER 5.1 APPLIANCE (VCSA)






While building my lab environment, I ran into a situation where I wanted to have a completely sealed off networking segment that had no outside access.
This is a trivial task on its own: just create a vSwitch with no physical NICs attached to it, and then connect the VMs to it. The VMs will then have interconnectivity, but no outside network access at all.
In this particular case, I was setting up a couple of nested ESXi servers that I wanted to connect to the “outside” vCenter Appliance (VCSA). This VCSA instance was not connected to the internal-only vSwitch, but rather to the existing vSwitch that has local network access.
Naturally, the solution would be to add a secondary NIC to the VCSA, and connect that to the internal-only vSwitch.
It turns out that adding a secondary NIC to a VCSA instance isn’t as straightforward as you might think. Sure, adding a new NIC is no problem through either the vSphere Client or the vSphere Web Client, but getting the NIC configured inside the VCSA is another matter.
If you add a secondary NIC, it will turn up in the VCSA management web page, but you will not be able to save the configuration, since the required configuration file for eth1 is missing.
In order to rectify this, I performed the following steps:
  1. Connect to the VCSA via SSH (the default username and password are root/vmware)
  2. Copy /etc/sysconfig/networking/devices/ifcfg-eth0 to /etc/sysconfig/networking/devices/ifcfg-eth1
  3. Edit ifcfg-eth1 and replace the networking information with your values, here is how mine looks:
    DEVICE=eth1
    BOOTPROTO='static'
    STARTMODE='auto'
    TYPE=Ethernet
    USERCONTROL='no'
    IPADDR='172.16.1.52'
    NETMASK='255.255.255.0'
    BROADCAST='172.16.1.255'
  4. Create a symlink for this file in /etc/sysconfig/network:
    ln -s /etc/sysconfig/networking/devices/ifcfg-eth1 /etc/sysconfig/network/ifcfg-eth1
  5. Restart the networking service to activate the new setup:
    service network restart
  6. Check the VCSA web management interface to verify that the new settings are active
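When adapting the ifcfg-eth1 file to your own network, the broadcast address has to match the IP address and netmask. The sketch below derives it with Python's standard ipaddress module and renders a file with the same keys as the example in step 3. The function name is hypothetical, and the output is meant to be reviewed and copied by hand, not written to the appliance automatically.

```python
import ipaddress


def render_ifcfg(device, ip, netmask):
    """Render an ifcfg file body with the same keys as the example in step 3."""
    iface = ipaddress.ip_interface(f"{ip}/{netmask}")
    broadcast = iface.network.broadcast_address  # derived from IP and netmask
    lines = [
        f"DEVICE={device}",
        "BOOTPROTO='static'",
        "STARTMODE='auto'",
        "TYPE=Ethernet",
        "USERCONTROL='no'",
        f"IPADDR='{iface.ip}'",
        f"NETMASK='{iface.netmask}'",
        f"BROADCAST='{broadcast}'",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Values from the example above.
    print(render_ifcfg("eth1", "172.16.1.52", "255.255.255.0"))
```

With the values from the example this yields the same broadcast address as the hand-written file, which is a quick sanity check before restarting the network service.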
By adding a secondary NIC, configuring it and connecting it to the isolated vSwitch I was now able to add my sequestered nested ESXi hosts to my existing VCSA installation.

There may be several reasons for a setup like this. Perhaps you want your VCSA to be available on a management VLAN but reach ESXi hosts on another VLAN without having routing in place between the segmented networks, or you just want to play around with it like I am in this lab environment.
Disclaimer:
Is this supported by VMware? Probably not, but I simply don’t know. Caveat emptor, and all that jazz.