[KLUG Members] Re: REDHAT: My suggestion on organizing binary CDs for x86 chip-specific optimizations

Bryan J. Smith members@kalamazoolinux.org
16 Apr 2002 11:21:01 -0400


On Tue, 2002-04-16 at 10:52, Bruce Smith wrote:
> Sure, your needs are slightly different than mine.  
> That's a good approach for what you're doing.

Right.  I'm surprised more people don't do NFS installs.

> I think it'd be a good idea for a startup (like how Mandrake started),
> offering CPU optimized Redhat distributions.  I've thought about it...

I think it would serve RedHat well to do it itself.

> If nobody is going to buy these optimized CDs from KLUG, it's not worth
> the hassle.  That's the bottom line for me.  Nobody is buying my SGI CD,
> so it won't be offered for 7.3/8.0 and beyond.

But it would be nice to have them as an option from RedHat.  God knows
I'd download and distribute the i586/686 and Athlon/x86-64 CD #1s
_instead_ of the i386 "Default" ones.

> Is that Mandrake's numbers?  I take them with a grain of salt.
> "_up_to_30%" can also mean 1% or less.  

Well, it's Intel in general.  I have attached a portion of a post from
another list that goes into this deeper (see below at the end of this
E-mail).

> No, that's called me being a smart a$$.   :-)

So was I.  ;-P

> Again, I'm a very bad case for needing more throughput.  My employer
> makes big metal things that most people in the world never heard of, nor
> do they care.  We upgraded the web server from a P2-400 to a dual P3-733
> to keep marketing happy, but the P2-400 was overkill for our traffic. 
> Believe me, we're NO yahoo.com.

It all depends.  I've seen 4-way P4 Xeon systems installed that cannot
match a dual-P3 in throughput.  I've also seen far too many people
upgrade to dual processors when their real problem was I/O throughput,
which a good mainboard selection would have addressed.

> What kind of applications are the "some"?  Do GUI desktops get more
> improvement than server applications like squid, apache/php, ... ?

Again, see my post below ...

> Nope, I don't have a clue.  And PLEASE _DON'T_ EXPLAIN!!!
> I would still be clueless after your explanation!    :-)

Well, see my post below ...

> I'm happy when I don't have to wait for normal desktop operations.  
> Evolution, Galeon, and VI are plenty responsive for my needs on the
> hardware I'm running.  Those three applications make up over 90% of 
> what I do on my desktop.

Me too.  But I also like to capture video.

> I can see where it would help Quake.  BUT, I'm not a gamer,
> so I'm still a bad case study for needing this.

That's just a _well_known_example_!  ;-P

> I'm not going to do it for myself, but I might be talked into doing it
> if there is enough interest.

No, no, the post was more for RedHat, NOT small fry.  ;-P

-- Bryan

--- FORWARDED MESSAGE ---

Yes, there are various gcc optimization flags to consider when
targeting a specific processor.  Some are backward compatible -- e.g.,
-m486 optimized binaries will still run on 386 processors.  But even in
those cases, such optimizations can actually
_reduce_performance_significantly_ on older hardware, or on competing
architectures.
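As a rough sketch, per-target builds of the same source look something
like the following (app.c is a placeholder name, and the exact flag
spellings vary by gcc version -- older gccs used -m486/-mcpu where
modern ones use -march/-mtune):

```shell
# Hypothetical per-target builds of one source file (app.c is a placeholder).
# -march=i386 produces lowest-common-denominator code that runs anywhere;
# -march=athlon lets gcc schedule for the Athlon's pipelines but the binary
# may no longer run on older chips.
build_cmd() {
    echo "gcc -O2 -march=$1 -o app-$1 app.c"
}

for target in i386 i586 i686 athlon; do
    build_cmd "$target"
done
```

This is exactly the kind of per-CPU build matrix a distributor would
need to maintain to ship i586/i686/Athlon CD sets alongside the i386
default.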

My favorite is Id Software's "Pentium optimization" for register
loading.  Loading 32-bit values from memory into Intel's ALU was so damn
slow that they found it faster to use the FPU to do it, even though
additional instructions were required to move the values from FPU
registers over to ALU ones.  Such is the case with many "Pentium
optimizations" -- little more than "hacks" to address clear Pentium
_design_flaws_.  ;->

The result is an "efficiency" problem on non-Intel architectures.
AMD's K6 core could do register loads 3x faster than the
Pentium, and two at the same time.  Not only could it load 3 times as
many values, but it didn't tie up the FPU.  Furthermore, the "Pentium
optimizations" made the assumption that the FPU would be pipelined, and
the K6 was not.  So not only was the approach a poor use of the K6 chip,
but its assumptions resulted in code being optimized to produce a stall
situation on non-Intel chips.

Even today with the Athlon chip, it is still far from ideal.  The Athlon
can load 3 integers simultaneously far faster than a P4, and can do two
complex FPU operations plus one FPU add/multiply operation, whereas the
P4 only has one complex FPU pipe and one ADD-only FPU pipe (in addition
to the fact that the Athlon has a pipelined FPU).  While the Athlon's
newer design helps offset some of the "Pentium Optimization" load with
register renaming and run-time out-of-order execution, Athlon-optimized
binaries could better fill all 6 pipes (3 ALU, 3 FPU) instead of just 2
FPU ones.  Those familiar with the Alpha 364's development can attest
to the power of combining run-time optimization with compile-time
optimization (and most of those Alpha 364 engineers are now at AMD ;-).

AMD has regularly shown that an _average_overall_system_performance_
increase of 40% is _easily_achievable_ with Athlon optimizations over
standard i386 ones.  Factor in the _reversal_ of no longer running
Pentium-optimized binaries (which probably decrease Athlon performance
by 10-40% overall), and it could amount to almost a 2x improvement
_overall_.  Athlon optimization has even more potential in games and
engineering applications where AMD's FPU clearly dominates Intel's.  As
such, x86-64 is not only an effort to create an x86/IA-32 compatible
64-bit architecture, but one to finally _optimize_ binaries for Athlon
by default, instead of having them "de-optimized by default" in a world
dominated by binaries with "Pentium Optimizations."

P.S.  Furthermore, _unlike_ Intel, which uses dedicated "lossy math"
logic to "accelerate" SSE instructions, AMD uses its FPU for SSE/3DNow,
so it is "lossless."  Anyone who has seen DivX/MPEG-4 encoding with
SSE-aware encoders done on a Pentium 4 versus an Athlon XP can attest to
the loss in image quality on the former.  Intel has always _sucked_ when
it came to designing an FPU, despite the quite INcorrect belief that it
was faster (all because of a "hack" that assumed the FPU was pipelined).