[nylug-talk] LOWMEM vs. HIGHMEM performance advantage?

Bryan J. Smith b.j.smith at ieee.org
Sun Apr 20 02:08:39 EDT 2008


Ajai Khattri wrote:
> http://kerneltrap.org/node/2450

That was a good article for its time, 2004.  Systems were just starting
to regularly break 1GiB of memory for most people (although my desktops
broke that before 2000, and my servers well before that).  It was also
just the beginning of kernel 2.6 and the first releases of the Opteron.

Unfortunately, things changed greatly within just a year after that.
Not only due to the Opteron and x86-64, but virtually all PCs had
processors with paging, let alone when PAE was turned on, on i686
kernels.  And various issues came about.

This includes major patches to the 2.6 kernel to deal with different
memory models.  There were some massive bugs to PAE as you crossed
4-8GiB (so bad that, again, Red Hat pushed a 4G/4G model).  Although
most of these issues have been partially addressed, there is still the
very strong argument to move to x86-64 when your programs are tapping
1GiB of RAM or more.

A lot of this has to do with the fact that paging becomes rather
extensive and quite constrained in the i686 kernel.  The biggest issues
with x86-64 aren't performance.  For different segments (code, heap,
data as well as IPC, kernel entry, etc...), 48-bit virtual pointers
(which are then normalized to 32-bit flat, assuming your program isn't
using PAE, even if the kernel is) are already in use.  So switching to
48-bit flat pointers (x86-64) changes nothing in that regard.

I've said it before and I'll say it again, the system-level performance
and considerations have nothing to do with the application-level
structures.  Application-level developers think there is a 32-bit/4GiB
"flat" space.  That is hardly the case at all when it actually comes to
system-level performance.  Even 32-bit programs use 48-bit virtual
addressing.


Before I respond to Alex, I will comment on this comment ... ;)

Ruben Safir wrote:  
> Making more friends in high places again....  

In all honesty, how people react is of little consequence to me.  All
that matters is the technical info, in an attempt to try to help people.

People are free to assume everything I'm saying is pulled right out of
my rectum.  I know it takes _years_ of providing constant, repeat,
technically accurate information over and over to "earn" trust (let
alone being quick to admit you are mistaken, which I do have to do from
time-to-time when someone clearly has more experience in an area than
I).  Credentials, a few posts, etc... aren't anything compared to that,
so I don't try to sell myself on anything short of _years_ of posts.  I
just started here, so most people can and should assume nothing I post
is correct.  Only time can change that.

I recommended considering x86-64 from experience with i686 kernels when
my programs were constantly crossing 1GiB in heap and other usage, let
alone were Java or Python-based (with a VM involved).
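
For anyone who wants to see where their own Python program stands
before deciding, here is a minimal sketch (not from the original
thread).  It reports the interpreter's pointer width and the process's
peak resident set size.  It assumes a Unix-like system where the
standard `resource` module is available; `ru_maxrss` is in kilobytes on
Linux.  The helper names are made up for illustration, and peak RSS is
only a rough proxy for address-space pressure.

```python
import resource
import struct


def pointer_bits() -> int:
    """Width in bits of a C pointer in the running interpreter."""
    return struct.calcsize("P") * 8


def peak_rss() -> int:
    """Peak resident set size of this process (kilobytes on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def near_limit(rss_kib: int, limit_gib: float = 1.0) -> bool:
    """True once usage reaches the chosen threshold (1 GiB by default)."""
    return rss_kib >= limit_gib * 2**20  # 1 GiB = 2**20 KiB


print(f"{pointer_bits()}-bit interpreter, peak RSS {peak_rss()} (KiB on Linux)")
```

The 1GiB default in `near_limit` simply mirrors the threshold discussed
above; pick whatever fits your own workload.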


Alex Pilosov wrote: 
> I told you to spare me.

I can't spare your assumptions.  Remember, you responded to my
recommendation to consider x86-64.  If you're going to do that, please
step back and realize why someone like myself would point out that it
is not the same.  ;)

Alex Pilosov wrote: 
> Everyone calls it x64. as abbreviation for x86-64.

For the first three (3) years of its existence, _everyone_ called it
x86-64, and some called the hardware AMD64.  Officially Intel calls it
the IA-32e programming model, and their hardware EM64T.

Only Microsoft introduced the name "x64" and only a few vendors, who
have cross-platform products, drivers, etc..., call it "x64."  Your
insistence on calling it "x64" shows you haven't been around x86-64
until recently.

I was deploying Linux/x86-64 back in 2004, because it wasn't optional
for what I was doing in the financial industry.  No one refers to it as
"x64" in the Enterprise Linux space.  Only a few people who continually
apply a Microsoft focus do, not realizing the technical difference.

Alex Pilosov wrote: 
> What the heck does it have to do with anything.

Because 99% of what you'll find on "x64" is Windows, and there are
serious issues with Microsoft's approach.  I think that's part of your
issue, you've been researching "x64" -- hence your view "everyone calls
it x64."

If you read Microsoft's technical information on the matter, you'll get
exactly the views you have as a result.  ;)

Alex Pilosov wrote: 
> Blah blah. No, it really does bloat it up. Yes, seriously. Pointers take 
> more space. Done and done. Data alignment matters but not as much.

Pointers are 48-bit virtual/segmented (normalized to 32-bit) in i686,
48-bit flat (non-segmented 48-bit) in x86-64.  That is the reality of
what the operating system deals with, especially when changing segments
in i686 between code, data, heap, let alone IPC, system calls, etc...
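
As a side note on what "48-bit flat" looks like in practice: on x86-64
with 4-level paging, 48 bits of a virtual address are significant, and
the remaining upper bits of the 64-bit value must be a sign extension of
bit 47 (the "canonical" form).  The sketch below is only an
illustration of that rule; `VA_BITS` and `is_canonical` are names
invented here, not anything from the kernel.

```python
VA_BITS = 48  # virtual-address width assumed here (4-level paging)


def is_canonical(addr: int, va_bits: int = VA_BITS) -> bool:
    """True if bits 63..(va_bits - 1) of a 64-bit value are all equal."""
    if not 0 <= addr < 2**64:
        raise ValueError("expected a 64-bit unsigned value")
    top = addr >> (va_bits - 1)               # bits 63..47 when va_bits is 48
    all_ones = (1 << (64 - va_bits + 1)) - 1  # 17 one-bits when va_bits is 48
    return top == 0 or top == all_ones


for value in (0x00007FFFFFFFFFFF, 0x0000800000000000, 0xFFFF800000000000):
    print(hex(value), is_canonical(value))
```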

Again, we're talking about operating system performance here, and the
programs running atop it.  That's why the _original_ poster is
seeing decreased performance.

If all we cared about was application-level considerations, then yeah,
"oh, I have 4GiB that are flat."

Reality:  When your application starts crossing into GiBs, things
change, radically.  May not seem so from the application-level, but it
does.

Alex Pilosov wrote: 
> What the hell are you talking about. You are on crack.
> I'm not talking about windows. I'm talking about any 64-bit mode with
> 64-bit pointers.  They take more space. Pointers are large chunk of heap
> and binaries. Thus, code is larger, and data segment is larger. Thus
> blowing out cache. Done and done.

Again, you're making assumptions here.  Please do not.  Please read the
AMD x86-64 or Intel IA-32e programmer manuals.  That's just to start.
Then read up further on AMD64 and EM64T hardware implementation
differences, especially in the kernel documentation and/or list.

Segments are a concept of i686.  And because of segments, that's why
your pointers in i686 are _48_ bit, _not_ 32-bit.  You think one thing,
but you are so far outside the context of this thread, I can't help you
until you help yourself with some reading.  Again, for code, data, heap
as well as IPC, system calls, etc..., _48_ bit pointers are used on
i686.

Pointers are not what you think in an application, let alone definitely
not at the system-level.  Too many people think x86-64 is 64-bit
addressing and i686 is 32-bit addressing from the standpoint of
pointers.  That is just a total falsehood, and anyone who starts
dissecting i686 binaries from x86-64 binaries (let alone gets into i686
and i686 36-bit PAE paging versus x86-64 52-bit PAE) gets a rude
awakening.
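
To put the bit widths mentioned in this thread into sizes, here is a
small arithmetic sketch.  The widths are the architectural figures
quoted above; what a particular CPU or kernel actually exposes can be
smaller, and the labels are just for illustration.

```python
GIB = 2**30
TIB = 2**40
PIB = 2**50


def addressable_bytes(bits: int) -> int:
    """Number of distinct byte addresses for an address `bits` wide."""
    return 2**bits


WIDTHS = {
    "i686 linear address (32-bit)": 32,
    "i686 PAE physical address (36-bit)": 36,
    "x86-64 virtual address (48-bit)": 48,
    "x86-64 physical address limit (52-bit)": 52,
}

for label, bits in WIDTHS.items():
    print(f"{label}: {addressable_bytes(bits) // GIB} GiB")
```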

For a x86-64 OS/kernel to support 64-bit pointers, it would _break_ i686
compatibility.  That's why 48-bit is used, for i486/686 paging/TLB and
other compatibility.  With 48-bit pointers (one segmented/virtual, one
flat), 6 bytes are used.  Data alignment considerations (especially so
for x86-64 targets) will often cause these to take up 8 bytes if only
one pointer is used in a structure.  Optimizing compilers for i686 often
do the same thing, align on 8 bytes for their 48-bit pointers.
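
Rather than take anyone's word on sizes and padding, you can ask the C
ABI directly.  The sketch below uses Python's `ctypes` to report the
pointer size, its alignment, and the padded size of a structure holding
a single pointer.  `OnePointer` is a hypothetical structure invented
for this example, and the numbers describe only the ABI the running
interpreter was built for; they say nothing about segment selectors or
page-table formats.

```python
import ctypes


class OnePointer(ctypes.Structure):
    # Hypothetical layout: a one-byte tag followed by a single pointer.
    _fields_ = [("tag", ctypes.c_char), ("ptr", ctypes.c_void_p)]


def layout_report() -> dict:
    """Sizes and offsets, in bytes, as seen by the platform's C ABI."""
    return {
        "pointer_bytes": ctypes.sizeof(ctypes.c_void_p),
        "pointer_alignment": ctypes.alignment(ctypes.c_void_p),
        "struct_bytes": ctypes.sizeof(OnePointer),
        "ptr_offset": OnePointer.ptr.offset,
    }


print(layout_report())
```

Running it under a 32-bit and a 64-bit interpreter on the same box is
the quickest way to compare the two.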

Alex Pilosov wrote: 
> I'm talking about lib64 obviously.

Yes, and what size pointers does x86-64 use?  ;)


Alex Pilosov wrote:
>
> In response to what Bryan J Smith wrote:
> > I686: 6 bytes (48-bit segmented) 
> > X86-64: 6 bytes (48-bit flat)
>
> Blah blah blah.

Yes, "Blah blah blah" -- it's everything!  ;)

Alex Pilosov wrote:
> Bryan, cut the crap, seriously.

I cannot.  The original poster, among others, was wondering why there
are performance issues, even when the program is using under 4GiB.
Many, including yourself, seem to believe that as long as it's under
32-bit/4GiB, it's all the same.

It's not.  It's not at all.

Furthermore, I tried to impress upon the reality that there are no less
than three (3) more, major factors to software performance that go
beyond the application itself, which is an additional factor.  This is
the hallmark of performance when applications are using GiBs of memory.

Your argument that "oh, bigger pointers make the program slower" is not
only untrue, but your statement is based on a pointer size assumption
that is _dead_wrong_.  ;)

Alex Pilosov wrote:
> You do know some things,

I wish you could focus on the technical details, instead of switching
between "questionable" statements and patronizing like this.  Just focus
on the technical details please.  But if your combination of direct
and passive-aggressive approaches is intended to frustrate or upset
me, I'm sorry, all I care about is technical reality.

In other words, act however you want, just know it won't bother me or
get me to drop to another level.

Alex Pilosov wrote:
> but you like to bring up technical details that absolutely have no
> relevance to topic discussed.

While I don't disagree that I include other performance or compatibility
considerations, or even a little history, in some of my posts, I do at
least maintain a 100% technical or at least "this may be of interest"
approach to my posting.  If that's a fault, then so be it.  But no one
can fault me for being anything but technical or someone who at least
considers what someone may be trying to do.

Such as a user with a Python application that is running into using GiBs
of memory.

You may wish to consider that before complaining about what I do.  After
all, I'm not here complaining about how you treat my posts, how you try
to upset me or others, with your approach.  You've called me everything
from an AMD fanboy (even though I'm recommending and deploying Intel in
many cases) to all that other junk I won't even acknowledge.  I guess
you're trying to keep from pointing the finger at yourself, I don't
know.  All I care about is the technical nature of the discussion.
Sometimes this crosses into areas where other factors are of major
concern, despite your belief they are not.

In this case, your assumptions on how pointers are used for i686 and
x86-64 are not only _dead_wrong_, but your assumption that anything
under 4GiB is "flat" at the system-level (100% pertaining to the
original poster's performance issue) is not true in the most remote
sense.  As far as your "spare me" on your use of "x64," it didn't
"bother" me at all -- but I finally realized that was why you were
making your assumptions (because you've read a lot of Windows "x64"
issues and assume they applied to x86-64 in general).

Virtually all of the "x64" documentation I've read pertains to Win32 v.
Win64 performance.  In that case, factoring in the WoW issues, you are
100% correct in your prior posts.  And they apply 0% to Linux/x86-64.

Alex Pilosov wrote:  
> What the hell does it have to do with ALU. Go take Architecture 101
> class and find out what ALU is.

I'm not going to even get into my resume, because the problem with
credentials and experience is that someone else always has more.

First off, and I'll re-quote what you were responding to ...

> Bryan J Smith wrote:  
> > You're trying to apply a "from afar" assumption about ALU
> > realities to a discussion on paging. This is about paging,
> > not what user-space application developers think what is
> > going on.

'You're trying to apply a "from afar" assumption about ALU realities.'

I used the phrase ALU because it's clear you continue to make
assumptions that because the AMD64/EM64T processors have a 64-bit ALU,
that the x86-64/IA-32e programming models must use 64-bit pointers.
That was my point.

'to a discussion on paging.  This is about paging, not what user-space
application developers think what is going on.'

Without looking at i686, 32-bit seemingly can do everything in 4GiB,
including code, data, heap, system, other program, etc... access.
Everything seems 32-bit.

When looking at i686, the realities become clear.  "Oh crap, there are
segment pointers" -- one that is stored in the structures as well.  You
spoke of "heap" -- a program is allocated and assigned its heap with
6-byte (48-bit) pointers because it is in a different segment, paged by
the processor completely different, than the program itself.  This is
even more so when you start talking of VMs, IPC, etc...

Now that's before we start looking at how the kernel works, pages,
etc... for i686 -- be it using PAE or not.  It's quite messy when your
programs are using GiBs of data.

CASE-IN-POINT ... SO LISTEN UP ...

That's why I originally stated -- and you took issue with -- that if
your programs are accessing GiBs, x86-64 is a far better move than
trying to find the right i686 memory model for the kernel.  We've had to
deal with this first-hand at my client on Wall Street, like other
financial clients I've had before -- and even some of the most stubborn
developers finally come to grips with it and port their damn apps.

They used the same arguments you did.  I had to avoid rolling my eyes.
These were 4G programming language developers.  Some didn't know the
first thing about basic C structures, and even the ones who did
definitely didn't understand that the kernel wasn't providing a "flat"
space under 4GiB in the first place.  You seemingly have the same
attitude.

"No x64 is bad!  It's inefficient!  It takes up more space!"

Sigh.  There's a lot of this out there for Windows x64, and it readily
applies, but for very different reasons.  Heck, even the old Digital
Alpha, with its anal "don't let any instruction execute more than a few
cycles" used to cause major bloat in code.  None of this is applicable
at all to the x86-64/IA-32e programming models on the AMD64/EM64T
processors.

Your continued belief that it's 32-bit and 64-bit pointers and nothing
more tells me you do not understand the first thing about not only
x86-64, but even i686, and how the kernel uses them.  AGAIN -- This
thread started about i686 kernels and programs using more than 1GiB and
then you took issue with my recommendation to look at x86-64 in most of
these situations.  Assumptions, assumptions, assumptions have been your
responses (at least the seemingly technical responses ;).

Alex Pilosov wrote:  
> What the hell does context switching have to do with anything
> discussed above?!

Two-fold ...

One, in an i686 kernel, with an application using more than 1GiB, there
can be serious paging performance hits if anything else is running
(including services and the like, beyond just normal "background
processes").

Two, the original poster mentioned this is a Python application, so
there are further VM considerations.  In the constrained user-space of
an i686 kernel, and lots of paging overhead going, this quickly becomes
a major consideration.

Again, this all "adds up" to a very, very constrained user-space on an
i686 kernel with the LOWMEM/HIGHMEM memory model.  Lots of paging.  Lots
of other issues.  And then they are compounded by any context switching,
let alone the VM overhead that causes even more.
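
If you want to see how a given i686 box is actually split, kernels
built with highmem support report `LowTotal` and `HighTotal` in
/proc/meminfo.  The following is a minimal parsing sketch; the helper
names are invented here, and the `SAMPLE` numbers are made up purely
for illustration.  On kernels without highmem (x86-64 included) those
fields are absent, which the sketch reports as `None`.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def low_high_split(fields: dict):
    """Return (low_kib, high_kib), or None if the kernel omits them."""
    if "LowTotal" in fields and "HighTotal" in fields:
        return fields["LowTotal"], fields["HighTotal"]
    return None


# Made-up sample numbers, for illustration only.
SAMPLE = """MemTotal:        4147200 kB
HighTotal:       3276800 kB
LowTotal:         870400 kB
"""

print(low_high_split(parse_meminfo(SAMPLE)))
```

On a live system you would feed it the real file instead, e.g.
`parse_meminfo(open("/proc/meminfo").read())`.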

AMD started scoring a lot of customers and major design wins the second
Opteron hit back in 2004.  There is a major reason for that.
Applications were already crossing 1GiB of memory usage and the
traditional i686 approaches for user-space, kernel-space, etc... just
don't fit well.

Alex Pilosov wrote:  
> Are you saying that context switch on 64-bit platform is more
> expensive than 32-bit platform? Or that somehow doing 3/1 split
> results in more expensive context switches than 1/3 split? I have
> no idea. In either case, I call bs.

It's clear you're not even following me.  Context switches can make
paging worse.  And when I say paging, I don't mean "OS paging."  I mean
the addressing done by the CPU, mitigated (as best as they can) by its
TLB, etc...  It's obvious you haven't looked at how the i686 kernel, and
its various memory model implementations, deal with the CPU paging, its
TLB (and possible flushes for coherency), etc...
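
A back-of-the-envelope sketch of why the TLB matters once a program
works over GiBs: compare how much memory the TLB can map at once (its
"reach") with the number of pages in the working set.  The entry count
below is a hypothetical figure chosen for illustration, not the
specification of any particular processor.

```python
KIB = 2**10
GIB = 2**30


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of memory the TLB can map without a miss."""
    return entries * page_size


def pages_needed(working_set: int, page_size: int) -> int:
    """Pages required to cover a working set (rounded up)."""
    return -(-working_set // page_size)  # ceiling division


ENTRIES = 64      # hypothetical TLB size, for illustration only
PAGE = 4 * KIB    # standard small page on i686 and x86-64

print("reach (KiB):", tlb_reach(ENTRIES, PAGE) // KIB)
print("pages in a 1 GiB working set:", pages_needed(GIB, PAGE))
```

Every context switch that flushes the TLB means those entries have to
be refilled by page-table walks, which is the overhead being described
above.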

So in all honesty, please don't take one of my recommendations and
balloon it into a discussion that you are not ready to follow.

I will only respond to two more comments ...

Alex Pilosov wrote:  
> Why are you switching from I/O to graphics cards now?
> I've pointed out that bounce buffers are irrelevant, what *else*?

What do you think graphics cards are?  Or better yet, what do you think
the problem with graphics is, when it comes to coherency?  And how did
Intel solve it?

Hell, people wonder why AMD bought ATI.  For anyone who has been around
x86-64, let alone prior RISC/UNIX solutions, the answer is extremely
obvious when you start looking at I/O, coherency, etc...  People wonder
why there are proprietary ATI and nVidia drivers, and why Intel's Linux
performance tanks compared to Windows.

Most people don't understand a lot about I/O, not a lot at all.

Alex Pilosov wrote:  
> I note that there aren't any SPEC benchmarks of series 460/X4 hardware.  
> Wonder why. In fact, the only published benchmark that I can find is for 
> TPC/C for a 16-socket series460 hardware.

Some of us actually have hardware in-house that is not publicly
available yet.  It might have to do with the type of clients I work for
and the customers I have.  I'm under NDA on a lot of things, but I can
refer to things like IBM's hardware and their designs that are publicly
available.

I'm not into reading enthusiast pages or "x64" sites.  I'm into reading
engineering technical documentation and first-hand experience.  That
includes understanding the complexities of the processor and its system
interconnects.  Heck, even the TLB is a performance hack -- one both AMD
and Intel are having major issues with in their new designs right
now.  ;)

Of course, I won't hold anything against anyone who thinks I'm pulling
all of this out of my rectum.  That goes for people who think
x86-64/IA-32e is 100% marketing, and that i686 kernels and programs run
faster than x86-64 kernels and programs when a single program (let
alone multiple, along with services, etc...) is using 1GiB or more.  I
can only offer my
nearly 4 years experience in deploying Linux/x86-64 for financial and,
to a lesser extent, engineering applications (most of my engineering
experience is pre-2000).

And I have no "history" on this list to point to, so my word on my
experience means little.  I know this.



-- 
Bryan J  Smith              Professional, Technical Annoyance
mailto:b.j.smith at ieee.org  http://www.linkedin.com/in/bjsmith
-------------------------------------------------------------
           Fission Power:  An Inconvenient Solution


