[nylug-talk] After meeting tonight, anyone want to discuss XEN stuff? (Also a bunch of misc thoughts and notes).

Bryan J. Smith b.j.smith at ieee.org
Thu Mar 27 21:41:31 EDT 2008


On Wed, 2008-03-26 at 17:20 -0400, Brian Gupta wrote:
> 1) Closed source Nvidia drivers do not work with a XEN kernel. (Or, if
> they do it requires stronger pixie dust than I had access to).

Xen is a pre-emptive kernel that is not part of Linux.

Remember, ATI and nVidia's closed source drivers are 2 parts:  
- User-space, and, most importantly ...
- Kernel-space

Because Intel has refused to treat Graphics Processing Units (GPU) as
peers to microprocessors at the hardware-level on the system
interconnect with related, hardware coherency (which is a major
undertaking, although AMD has done it with HTX for Infiniband and now
has forthcoming ATI GPUs for it as well), any GPU connected to the AGP
or PCIe peripheral interconnect has two choices ...

1.  Use traditional burst memory access (slow for GPUs)

2.  Use direct memory access with associated coherency issues solved by
a complex set of software hacks

ATI, nVidia and even Intel (under Windows) do #2.  It's software and IP
owned by more than just ATI and nVidia themselves, but some of the major
IP is owned by Intel, Microsoft and SGI.  After receiving cease'n desist
letters from Intel, Microsoft and SGI after releasing their "unified
object" source code in the late '90s for the kernel and XFree86 3.3.x,
nVidia introduced a new unified object and kernel loader for all
platforms.  ATI followed suit several years later when it was clear the
prior Weather Service funded DRI driver was extremely far behind in
features, let alone performance, without the kernel-memory software
coherency hacks.

Because of those IP issues, the kernel driver must stay closed.  Intel
has refused to release a driver for Linux, hence why their driver's
performance is pathetic under Linux compared to Windows -- which is
already pathetic versus ATI/nVidia (the latest X48 is 7-11x slower than
the AMD 780G at most 3D titles under Windows, and not compatible at all
versus the 780G against most OpenGL APIs under Linux/UNIX).  ATI, as I
mentioned, more recently joined nVidia in the unified object with
general loader.  There are 3rd party engineering efforts to create an
Intel kernel-memory driver, as well as renewed efforts to update the
nVidia kernel-memory driver from the original NV0x (TNT2-GeForce[1]) era
source code release.

In any and all cases, open source drivers will always be well behind and
less reliable than mature closed source drivers when it comes to the
kernel-memory interface.  The option to not use that interface kills
performance, and Intel knows all-too-well.  I'm still hoping AMD-ATI
pulls through with their HTX plans, although they seem to be only
integrated at this point.  But with true hardware coherency between GPU
and CPU over HTX, it removes all of the software IP non-sense, and
should allow such a GPU to work in Linux -- at full performance --
without such a required software hack.

> 2) virt-manager is an emerging GUI tool that works with XEN, KVM, and
> QEMU. It looks like it is loosely based on VMWares management GUI.
> It seems that it is in various states of working out the box with XEN.
> CentOS 5.1 seems to have the best out of box experience, but the
> virt-manager is old and not as feature rich as other options.
> 3) virt-manager uses libvirt, which provides a common management
> framework for opensource virtualization technologies.

Which Red Hat started developing years ago because they knew the Xen
microkernel wouldn't be the only option.  The idea was to abstract the
VM aspects as much as possible, so switching virtualization
implementations would be easy.  Canonical (Ubuntu) has joined the fray
in saying "this is a good idea" and it's always good to see the broad
adoption.

E.g., it's very likely KVM will overtake Xen as the preferred full-virt
implementation for RHEL 6, which means leveraging libvirt in RHEL 5
makes it far less painful to move.  Fedora 9 has already made this
decision, so it's likely to stick for RHEL 6.  I'd still give para-virt
to Xen at this point, and it may be the case for a long time.

Also understand that libvirt gives a generic, usable ABI/API for VMs,
something that VMWare has completely _ignored_.  Their API solution is
rather fluid and broken and we've had to discard it time and time again.

E.g., Over 90% of VMWare's customers are running Windows, as VMWare has
long been the "solution to avoid running self-toasty Windows directly on
iron."  As such, the stack for GUI management is mature, whereas VMWare
really doesn't care about CLI/scripted users.

Hence why VMWare is basically "writing themselves off" when it comes to
Linux/UNIX users.  Although if that's less than 10% of their market,
it's understandable.

> 5) virt-manager is probably not ready for primetime, thus learning the
> CLI is probably still mandatory for getting started with XEN.

One should learn libvirt anyway.  Remember, a lot of Linux/UNIX-centric
implementations have come about because VMWare has largely ignored
scripted and other API-interface automation, which libvirt is
addressing.  So virt-manager is not as much of a priority and will
be matured later.
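
A minimal sketch of what that abstraction buys a scripted user, assuming
the libvirt Python bindings are installed.  The helper names are
hypothetical, and the URIs should be checked against your libvirt
version (older Xen setups used "xen:///").  The point is only that the
calling code does not change when the hypervisor does:

```python
# Hypothetical mapping from a hypervisor name to a libvirt connection URI.
# Assumption: these URIs match your libvirt build; verify before relying on them.
LIBVIRT_URIS = {
    "xen": "xen:///system",
    "kvm": "qemu:///system",
    "qemu": "qemu:///session",
}


def connection_uri(hypervisor):
    """Return the libvirt URI for a hypervisor name (case-insensitive)."""
    try:
        return LIBVIRT_URIS[hypervisor.lower()]
    except KeyError:
        raise ValueError("unknown hypervisor: %r" % hypervisor)


def list_guest_names(hypervisor):
    """List guests over a read-only connection; same code for Xen or KVM."""
    import libvirt  # imported lazily so connection_uri() works without it

    conn = libvirt.openReadOnly(connection_uri(hypervisor))
    try:
        return [dom.name() for dom in conn.listAllDomains()]
    finally:
        conn.close()
```

Moving a script from Xen to KVM would then be a one-word change in the
call, e.g. list_guest_names("kvm") instead of list_guest_names("xen").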

People who want GUI management should probably stick with VMWare for
now, except for para-virt, which VMWare is not a good solution for (see
below).

> 6) Hardy Heron 8.04 beta is pretty slick. Wait for Automatix support
> before upgrading. Envy support is there, assuming you don't need XEN
> support. (Anyone know if closed ATI drivers work on a XEN kernel).

Closed source ATI drivers rely on the same kernel-memory object hacks
that nVidia has perfected over the last 9 years, so it won't work on the
Xen microkernel either.

Only the legacy Weather-DRI release for the R100 (which has been adopted
for the R200+) is usable, as it doesn't leverage a kernel-memory
software hack.  AMD's recent releases of technical information will
continue to assist R300/400/500+ developments, but the open source
community will never "keep up" in OpenGL ABI/API implementation, much
less performance.

Imagine, if you will, Intel changing its IA-32e Instruction Set
Architecture (ISA) every 18 months, instead of leaving it largely be
(sans extensions) for the past 14 years.  That's the world of GPUs,
quite unlike CPUs.  ;)

> 7) KVM is the preferred virtualization technology for Ubuntu, and the
> Linux kernel development team.

That's because KVM is the Linux kernel, not the Xen Source microkernel.
It's also the preferred full-virt kernel in Fedora 9, although Xen is
still the preferred para-virt kernel, and probably will be for some
time.

Full-virt is for running Windows or legacy Linux/UNIX.  Para-virt is an
interesting option for running recent or same Linux or very compatible
para-virt/Xen implementations.  Xen's origin and purpose from the start
is very different than VMWare or KVM, people forget that.  ;)

> XEN is out.

Not in the Red Hat world, the two options complement each other well.
For my client, we want para-virt, because full-virt really hits
performance when it doesn't need to -- i.e., we're running modern
Linux on Linux.  E.g., RHEL 4 and 5 on RHEL 5.

> (But XEN is also cross platform, and not strictly tied to Linux
> implementations, as Solaris and BSDes will also be providing Dom0
> support) Dom0 = VMWare host.

Loosely defined, yes.

> KVM promises to be much easier to use than XEN.

'Easier' in what way?

I think a "desktop bias" is coming through, which is not uncommon with a
preference towards Ubuntu (or Windows for that matter).  Most of us in
the Red Hat world (or even Debian for that matter, although I'm years
removed from being a maintainer now) see the constant balance between
server and desktop.  Fedora is feature-focused, but it feeds Red Hat
Enterprise Linux (RHEL), so some things cannot be crossed.  Some people
see that as a "bad thing."  I don't.

One thing I continue to love about Red Hat is that when it comes to
critical matters, sound design wins -- from GLibC 2 to ANSI C++ to NPTL
to SELinux -- they pushed (and hard) adoption, broke things, fixed them
and worked with everyone else in the community to do so (despite
demonizations).  I spent my first few years (coming from OS/2,
SunOS/Solaris and SCO) from the original NT 3.1 Beta through NT 3.5/3.51
"Daytona" watching a decent NT kernel with MAC/RBAC being slaughtered by
the "Chicago" group to the point "Cairo" became vaporware.  From then on
I vowed that desktop features would never override good design.

It's one of the reasons I'm a huge proponent of the full MAC/RBAC and
auditing model of SELinux, which is really the backbone of system-level
security much like NetFilter is for network-level.  The problem is that
people try to "directly use" SELinux, instead of using the tools and
common practices for it.  Using SELinux "directly" is like trying to use
NetFilter directly, instead of "iptables" or a few other apps which do
98% of what you need for NetFilter.  It's also why I strongly believe
people should write tools for SELinux like they do NetFilter, instead of
proposing different models.

If you don't like SELinux, you can always set it "permissive" and still
get the auditing.  That's gold from a defense or financial perspective.
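
As a small illustration, assuming the usual /etc/selinux/config layout
with a SELINUX= line (the function names are hypothetical), a script can
check which mode a box is configured for and whether denials will still
be logged:

```python
def selinux_mode(config_text):
    """Return the SELINUX= value from /etc/selinux/config style text."""
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "SELINUX":
            return value.strip().lower()
    return None  # no SELINUX= line found


def keeps_auditing(mode):
    """Enforcing and permissive both log denials; disabled logs nothing."""
    return mode in ("enforcing", "permissive")


sample = "# managed by hand\nSELINUX=permissive\nSELINUXTYPE=targeted\n"
print(selinux_mode(sample), keeps_auditing(selinux_mode(sample)))
```

On a running system, "getenforce" reports the current mode and
"setenforce 0" switches to permissive until the next reboot.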

> It will probably become the virtualization technology of choice for
> most desktop oriented Linux distros.

Ala "full-virt," as well as leveraging the Linux kernel for its VM
engine, instead of a separate microkernel like Xen.

It's clear now that Fedora 9 is full-force on KVM for "full-virt," with
little focus on Xen "full-virt" anymore.  But Xen is still going to be a
significant strategy, because it's still one of the best "para-virt"
implementations around.

> 8) You can change the allocation of resources to Dom0. IE: I can tell
> my Dom0 that it only has two virtual CPUs, and 1GB of RAM. Dom0 seems
> to be just a special privileged management VM, as XEN really takes
> over the metal.

Correct, it's the microkernel.
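
For reference, a sketch of how those limits are usually expressed,
assuming the dom0_mem and dom0_max_vcpus hypervisor options and a boot
loader entry you edit by hand.  The helper name is made up, and where
the string goes (e.g. the xen.gz line in grub) depends on the distro:

```python
def xen_boot_args(dom0_mem_mb, dom0_vcpus):
    """Build the argument string appended to the Xen hypervisor boot line."""
    if dom0_mem_mb <= 0 or dom0_vcpus <= 0:
        raise ValueError("memory and vCPU count must be positive")
    # Assumption: the hypervisor accepts dom0_mem=<n>M and dom0_max_vcpus=<n>.
    return "dom0_mem=%dM dom0_max_vcpus=%d" % (dom0_mem_mb, dom0_vcpus)


# The 1GB / two-vCPU case from the quoted text above.
print(xen_boot_args(1024, 2))
```

These are arguments to the hypervisor itself, not to the dom0 Linux
kernel, which fits the picture of dom0 as just another (privileged)
guest.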

> 9) SE-Linux is an annoyance when you are learning new technologies.

SELinux was designed, from day 1, to be a massive annoyance.  There is
no "easy" MAC/RBAC.  Microsoft learned this the "hard way."  NT had
solid MAC/RBAC.  Its application developers utterly ignored it for 10
years.  The result?  Absolutely no usable MAC/RBAC, and people running
with full privileges just so things can operate, so we have the
interactive "what are you doing Dave?" non-sense, instead of a
sustainable MAC/RBAC model like the one in SELinux.

> 10) XEN on (Open)Solaris will be very attractive, but it's not yet
> ready for production use. (ZFS integration will add exciting
> snapshotting, cloning and other functionality).

Xen is still very much Linux Dom0, and still very much para-virt.

> 11) Ubuntu and Gentoo Linux are the preferred platforms for Ruby on
> Rails production deployment. (Mac OS X with the Textmate editor, being
> the preferred development platform).

Gentoo is the preferred, leading edge development platform period IMPO.
Daniel took the ports approach from BSD and put it on steroids.  Once
Gentoo came out, I stopped doing Linux From Scratch (LFS) and NetBSD
altogether.  No need.  Gentoo is a "ports" approach.

Gentoo is often, and incorrectly, compared to a "packages" approach.
It's not even comparable.  So when people compare Gentoo to Debian,
Fedora, Ubuntu, etc... I just shake my head.  Gentoo is just great for
free-form development, period.

The only problem I have, and I see Daniel and others do as well, is when
people oversell Gentoo as solving everything.  People ignore the entire
aspect of maintaining ABI/API compatibility, integration and -- even more
so -- regression testing, etc...  It's great for when your company is an
"Internet technology" company, and you need features, and do your own
integration/regression testing, modifications of the vertical stack,
etc...  But when you're using other projects, 3rd party software,
etc..., ABI/API become a major consideration, and regressions are just
not acceptable.

> 14) There is a new open-source management platform in closed beta
> right now that lets you build your own EC2-like cluster/cloud.
> http://www.enomalism.com/ Bonus - At least one of the project leads is
> located in New York City. I'd love to hear them present at a future
> NYLUG meeting.

A lot of these developments, true "standards" in the VM space which
VMWare has utterly ignored (largely for reasons of their stack/GUI
sales), are why VMWare will lose the Linux/UNIX realm.  But, again,
that's less than 10% of their "bread'n butter," so they really don't
care, and I can't say I blame them.  ;)

> 16) XEN is not easy, and it's rapidly evolving. Plan to spend some
> time cutting your teeth getting up to speed. (I am still in this
> process).

Xen para-virt requires "VMWare [full-virt] deprogramming," much like
Linux requires "Windows deprogramming," if you've used the latter in
each case.  If you've never used VMWare or Windows, then Xen
para-virt and Linux are easier to learn, respectively.

> XEN is however, ready for production, if you put enough
> homework into it. (It just may not be worth it unless you are
> deploying it with some scale. It may be better to buy a VMWare or
> XenSource shrink wrapped solution.)

If you want the GUI.  VMWare has a great set of GUI tools built around
their $3-5K/node stack.  It's worth it.  But if you want to automate
things at a CLI/script or API level, and I'm ready to be proven
wrong, VMWare sux.  They know it.  The OEMs know it.  Etc...

> It all depends on how much your
> time is worth, vs. how widely you plan to deploy. I see this changing
> by the end of this year, but XEN management still requires a lot of
> homework.

If you're coming from VMWare, yes.  "Deprogramming" is required.

If you're not, then it's just the normal VM learning experience, much
like Linux if you've never touched a computer.

> 17) MySQL proxy is nearing the point where it is ready for production
> use. Its features outweigh its newness. Check it out, it's a very
> nifty piece of tech.
> 18) PCI-Express is very different than PCI-X and PCI.

Actually it's not.  PCI-Express (PCIe) is absolutely and logically
32-bit (datapath) PCI except at the physical level.  It's only the
physical implementation, and the peripheral-system interconnect that is
different.

AMD uses HyperTransport (its multi-point system interconnect)
tunnels/bridges, whereas Intel bridges PCIe channels from its Memory
Controller Hub (MCH, its unified front-side bus approach -- although
that's finally changing for servers this year).  You can bridge
PCIe-to/from-PCI-X quite easily.  In fact, that's how my inexpensive
server mainboard does it for AMD.

I.e., instead of using a native, but costly, AMD8131/8132 HyperTransport
to dual-PCI-X tunnels (separate from the HyperTransport to PCIe tunnels
also used), I have a board that takes PCIe x8 and converts it into a
single PCI-X channel via an inexpensive Intel ESB ASIC -- same as Intel
does it for its server mainboards.  It's a single socket mainboard so
it's really not a performance/latency consideration for me.

> PCI Express is the future of most PC based expansion card technologies.

Understand PCIe _is_ the staple of Intel, period, and has been for years
now.

PCI or PCI-X is bridged to/from PCIe for Intel, and has been for years.
Intel is just finally bringing its first, new, multipoint system
interconnect out this year, and it will be funny to watch all of their
cache, TLB and other coherency issues take form (I've already run into
many of them, and Intel has not been as "open" as they were pre-2007 on
the errata).

AMD has a generic, tunneled system interconnect that started with the
crossbar EV6 from Digital and became even more generic.  It has also had
its own set of issues, which get fixed, and AMD is very public with its
errata on the matter.  E.g., they were open with their initial TLB bug
in the 10h processors, and didn't ship their multi-socket units as a
result (whereas the coherency bug is virtually unseen in uni-socket).

Intel only recently admitted publicly that its TLB issues in its G0
steppings are quite extensive.  We've run into them quite a lot, and it
was a tri-vendor mess (I know others were in the same boats).  At least
Intel offers microcode updates with a loader at the OS level, so you
don't have to wait on the BIOS hacks.

> (And there are different kinds of PCI-Express cards x1 being the most
> common...

Not really true.  PCIe x4 and x8 are extremely common.  And a PCIe
"channel" is a PCIe "channel" and there is nothing "special" about it.
You can combine them x1, x2, x4, x8 and x16 as you wish at the
peripheral design level.  Not nearly as flexible as the HyperTransport
system interconnect for a system designer, but definitely gives Intel a
lot of options in a peripheral interconnect to its MCH without having to
design a full system interconnect (the newer CPU core and other options
make up for this issue though). 

Also note that PCIe is electrically up'n down compatible.  PCIe x4 cards
will work in x1, x16 in x4, x4 in x8, x1 in x16, etc...  The mechanical
issues are the only concern.  Frankly I wish Intel had just
pushed for everything to always have a x16 (or x4 when size was a
consideration), and then come up with a "color code" and "number"
standard to indicate what is electrically x1, x2, x4, x8 and x16.
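
The electrical-versus-mechanical distinction can be sketched in a few
lines.  This is a simplification under stated assumptions: the link is
taken to train to the narrower of the two ends, only the widths listed
above are modeled, and "open_ended" is a hypothetical flag for slots
with the back of the connector cut away:

```python
PCIE_WIDTHS = (1, 2, 4, 8, 16)  # the widths named in the paragraph above


def negotiated_lanes(card_lanes, slot_electrical_lanes):
    """Simplified model: the link runs at the narrower of the two widths."""
    for lanes in (card_lanes, slot_electrical_lanes):
        if lanes not in PCIE_WIDTHS:
            raise ValueError("not a modeled PCIe width: x%d" % lanes)
    return min(card_lanes, slot_electrical_lanes)


def fits_mechanically(card_lanes, slot_mechanical_lanes, open_ended=False):
    """A card fits a connector at least as long as itself, or an open-ended one."""
    return open_ended or card_lanes <= slot_mechanical_lanes


# An x4 card in an x1 electrical slot runs as x1 if it physically fits.
print(negotiated_lanes(4, 1), fits_mechanically(4, 1, open_ended=True))
```

Real link training has more corner cases than min(), so treat this as a
picture of the idea rather than of the spec.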

The only exception to all this is the "power" standard.  PCIe normally
only supports up to 25W (or is it 50W?), but PCIe x16 slots for GPUs are
designed to support an additional 100W (125W, although it may be 150W in
some specs).  Ironically enough Host Bus Adapters (HBA) are pushing
50W+ so they need a "GPU-type" PCIe slot these days in servers.

> The various PCI technologies is definitely a worthy topic of
> discussion.).

Actually not.  PCIe is simple, it's people who make it more complex than
it needs to be.  PCIe really simplifies a lot of things, although with
0.8V being pretty much the lowest voltage you can go before running into
legacy diode incompatibility, PCIe should -- once and for-all -- solve
the voltage issues of PCI, PCI-X and -- God help us -- AGP (I still hate
Intel for "trade secret" AGP, don't get me started ;).

> Please excuse me for stating the obvious, but my last
> couple rounds of personal tech buying have been laptops, it's been
> about over 6 years since I built or bought my last "desktop".

Portables have ExpressCard (electrically PCIe x1), /34 (~34mm, not
really, long story) and /54 (~54mm, again, not exactly), but the "end"
is the same (as the /34).

> 19) DDR-3 is still way too expensive, and DDR2 is dirt cheap right
> now.

DDR3 is still really more marketing, and far less JEDEC standardization.
DDR2 is commodity, although there are plenty of JEDEC violations (e.g.,
JEDEC spec DDR2 is _always_ 1.8V ;).  Both are QDR technologies in any
case.

The bigger issue is Unbuffered v. Registered v. Fully Buffered DDR2.
Intel and AMD both use the first for desktops and entry-level servers,
although AMD pushes the Unbuffered (non-registered) ECC option for its
uni-socket LGA-1207 servers (and it's not much of a premium).  AMD then
just uses registered for servers, since they have NUMA and 128-bit
dual-channel per CPU, so they don't need a concentrator like Intel.
Which is why Intel pushes Fully Buffered DIMMs (FB-DIMMs), with all their
negatives, because it has that legacy, single Memory Controller Hub
(MCH).  Fortunately, they are opening a new option this year.

But frankly, IBM's X architecture has always been better IMPO, including
their new X4 with "slow" PC2-4200 (DDR2-533), which is more of a
NUMA-like design.  I.e., instead of a huge, fully buffered bank, they go
twice as wide with more distributed memory banking, at a slower (and
more reliable, and far lower latency) clock than Intel and FB-DIMMs.
But that's another story.

> Buying 8GB of non-ECC RAM is very affordable. (I of course went
> with ECC DDR2 PC800 RAM, which is pretty hard to find, but still
> cheaper than DDR3 RAM)

Buying 8GB of _ECC_ JEDEC PC2 (DDR2) is _also_ very affordable, as long
as it's standard, unbuffered, which is commonplace for uni-socket
servers.

BTW, if you are using four (4), unbuffered, 2GB DDR2 DIMMs at PC2-6400
(DDR2-800), you are _violating_ JEDEC specs.  At 200MHz QDR,
you're only supposed to use one (1) DIMM per 64-bit channel, so only two
(2) DIMMS in LGA-775 and Socket-AM2[+] for Intel and AMD uni-socket.  A
proper server/workstation mainboard will "slow" to 166MHz QDR, JEDEC
PC2-5300 (DDR2-667) if you have four (4) 2GB DDR2 DIMMs that are not
registered.
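
The naming arithmetic behind those numbers, as a rough sketch.  The
function names are hypothetical; the 8 bytes per transfer comes from the
64-bit channel, the "divide by 4" follows the base-clock shorthand used
above, and speed_for_population() just encodes the rule of thumb stated
in this email rather than anything checked against the JEDEC documents:

```python
def ddr2_module(data_rate_mts, channels=1):
    """Return (module name, peak MB/s) for a DDR2 data rate in MT/s."""
    per_channel = data_rate_mts * 8  # 64-bit channel = 8 bytes per transfer
    # Module names round down to the nearest 100 (5336 -> PC2-5300).
    module = "PC2-%d" % (per_channel // 100 * 100)
    return module, per_channel * channels


def base_clock_mhz(data_rate_mts):
    """Base clock in the 'x4' shorthand: DDR2-800 -> 200 MHz."""
    return data_rate_mts / 4.0


def speed_for_population(dimms_per_channel):
    """Data rate a conservative board would pick for unbuffered DIMMs."""
    return 800 if dimms_per_channel <= 1 else 667


# Four unbuffered DIMMs on a dual-channel board = two per channel.
rate = speed_for_population(2)
print(rate, ddr2_module(rate, channels=2))
```

So the step down from DDR2-800 to DDR2-667 costs roughly a sixth of the
peak bandwidth in exchange for staying inside the loading limits.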

For multi-socket Intel LGA-771 and AMD LGA-1207 servers, you have the
FB-DIMM and registered DDR2, which is a whole other story -- especially
with AMD's NUMA approach (with associated process-I/O affinity latency
considerations) and Intel's buffering concentrator (and associated,
general latency considerations).

> 20) The Dell PowerEdge 2950-III seems to be a very attractively priced
> piece of server kit. (Not cheap enough for hobbyists though).

My biggest recommendations "on-the-cheap" are $40 Socket-AM2 (DDR2)
Athlon x2 processors which have AMD's SVM (aka AMD-V, unlike Socket-939
which doesn't), in a single HyperTransport tunnel nForce Pro 3000 series
(basically the nForce 600 series for professionals) with PCIe x8 bridged
out into an Intel ESB for a single PCI-X channel option.  They can be
had for under $200 and give you both PCIe x8 and PCI-X options.

nVidia has consistently proven to me that anything SPP/IGP02+MCP02
(nForce 2) or later has maximum I2C compatibility with outstanding GPL
support.  This has much to do with nVidia's semiconductor focus versus
anyone else, including Intel.  Over 90% of Intel's "profits" (after
margins) are from the dirt-cheap desktop market.  Although Intel puts a
lot of engineers on Linux drivers, it's ex-post-facto, and the semi
engineers seem to change things incompatibly on a dime, breaking older
kernel compatibility, because "oh, we'll ship Windows drivers to add the
support."  On the other side, over 50% of nVidia's "profits" (after
margins) are from its high mark-up nForce Professional, which are
actually little more than the same things as their consumer nForce ASICs
(just the ones that test to better tolerances, trace lengths, etc...),
which is a staple Linux seller.  I.e., nVidia avoids changing semi in
incompatible ways because they know it will break Linux compatibility,
and a lot of workstations/servers run RHEL, SLES, etc... with older
kernel features.  

> 21) XEN has a neat technology that allows you to delegate hardware
> resources down to individual PCI cards to DomU guest OSes.
> (PCI-delegation). I can see this being useful down the road.

Actually, it's less "neat" and more "mandatory."  I wish the Xen
microkernel could do more auto-detection, but it's not the same OS as
the dom0, so it can't -- unlike KVM.  But yes, it's very useful for us
in para-virt, as we can have all sorts of storage, network, HBAs and
other options and assign them direct.
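
A sketch of what the assignment looks like on the guest side, assuming
an xm-style domU config file (which is Python syntax) and a device that
has already been hidden from dom0 via the pciback driver.  The helper
name and the example addresses are made up for illustration:

```python
import re

# [domain:]bus:slot.function, e.g. 0000:01:00.0 or 01:00.0
_BDF = re.compile(r"^(?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$")


def domu_pci_line(devices):
    """Render a pci = [...] line for a domU config from PCI addresses."""
    for dev in devices:
        if not _BDF.match(dev):
            raise ValueError("not a PCI address: %r" % dev)
    return "pci = [ %s ]" % ", ".join("'%s'" % dev for dev in devices)


# Hypothetical addresses; take the real ones from lspci in dom0.
print(domu_pci_line(["01:00.0", "0000:02:00.1"]))
```

The hiding step in dom0 is the part that varies between distros and Xen
versions, so check how your kernel exposes pciback before relying on
this.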

> 22) OpenSource RDP clients seem less performant than the Microsoft
> ones. This may be due to my Linux video driver not being a fully
> accelerated closed source driver.

I've never had a performance issue with RDP under Linux.

It's not the closed driver, but the X.org MIT 2D "nv" driver that hasn't
"caught up" to whatever new NVGxx core you're using.  nVidia is good
about working on that with the community, since there is no 3rd party IP
involved it can't expose, so it's only a matter of time before the X.org
MIT 2D "nv" driver "catches up."

I've had no issue with NV4x and similar IGP51/61 (GeForce 6000/7000
chipset integrated).  In fact, you can even get video out options
without the closed source "nvidia" driver, although with far fewer
options.  I've found it better than Intel in many cases (especially back
in the i800 series of GPUs, especially in notebooks, before the
community hacks).

> Anyone interested in forming an unofficial NYLUG/New York XEN special
> interest group?

I think virtualization is virtualization.  I don't think we need a
Xen-specific one, but I'm new around here.  I think the key with
virtualization is not to make big deals about technologies any more than
I like "distro pissing contests."  Many distros have their focus, but
they're all 99% of the same concepts, so it's really a matter of that
focus and tools, not really that they are "different."

As always, my opinion, although some may be based on hearing things 2nd
hand from the authorities on the matter (like Red Hat's JMH and libvirt
developers).  I'm no expert.





-- 
Bryan J  Smith              Professional, Technical Annoyance
mailto:b.j.smith at ieee.org  http://www.linkedin.com/in/bjsmith
-------------------------------------------------------------
           Fission Power:  An Inconvenient Solution


