Why was the Itanium processor difficult to write a compiler for?

It’s commonly stated that Intel’s Itanium 64-bit processor architecture failed because the revolutionary EPIC instruction set was very difficult to write a good compiler for, which meant a lack of good developer tools for IA64, which meant a lack of developers creating programs for the architecture, and so no one wanted to use hardware without much software for it, and so the platform failed, and all for the want of a horseshoe nail: good compilers.

But why was the compiler stuff such a difficult technical problem? It seems to me that if the explicit parallelism in EPIC was difficult for compiler vendors to implement… why put that burden on them in the first place? It’s not like a good, well-understood solution to this problem didn’t already exist: put that burden on Intel instead and give the compiler-writers a simpler target.

Itanium came out in 1997. By this point, the UCSD P-Code bytecode system was nearly 20 years old, the Z-machine just slightly younger, and the JVM was the hot new rising star in the world of programming languages. Is there any reason why Intel didn’t specify a “simple Itanium bytecode” language, and provide a tool that converts this bytecode into optimized EPIC code, leveraging their expertise as the folks who designed the system in the first place?


As I recall at the time, the issue was not just the particulars of IA64, it was the competition with AMD’s x86-64 instruction set. By making their architecture backwards compatible with the x86 instruction set, AMD was able to leverage the existing tools and developer skill sets. AMD’s move was so successful that Intel (and Via) were essentially forced to adopt the x86-64 architecture.

The big barrier at the time was 4 GB RAM on desktop PCs (more realistically ~3.4GB usable on Windows). x86-64 smashed that barrier and opened up higher power computing to everyone. Had AMD never come up with x86-64, I’m sure Intel would have been happy to have everyone who wanted to jump to 4GB+ RAM pay a hefty premium for years for that privilege. Demonstrating how slowly markets move, it has taken years for applications to catch up to 64-bit, multi-threaded programming, and even now 4GB RAM is standard on low-end PCs.

In short, Intel tried to make a revolutionary leap with the IA64 architecture, and AMD made an evolutionary step with x86-64. In an established market, evolutionary steps that allow knowledge workers to leverage existing skills will win over revolutionary steps that require everyone to learn new skills. Regardless of the qualitative differences between the architectures, IA64 could not overcome the momentum of Intel’s own x86 platform once AMD added the x86-64 extensions.

I don’t buy the explanation that IA64 was too difficult to program for. It was only difficult relative to the alternatives. @delnan’s point about low-level IR is smack on, I just don’t think it would have made a difference.

As to why Intel didn’t try to shoulder that burden themselves, who knows? They were the market power at the time. AMD was something of a threat but Intel was the king of the hill. Maybe they thought that IA64 would be so much better than anything else that they could move the entire market. Maybe they were trying to make a premium tier and leave AMD, VIA, etc. in the second tier fighting over low-margin commodity hardware – a strategy that both Intel and Apple have employed quite successfully.

Was Itanium a deliberate attempt to make a premium platform and pull the rug out from under AMD, VIA, etc.? Of course, that’s how business works.


The Wikipedia article on EPIC has already outlined the many perils common to VLIW and EPIC.

If anyone does not catch the sense of fatalism from that article, let me highlight this:

Load responses from a memory hierarchy which includes CPU caches and DRAM do not have a deterministic delay.

In other words, any hardware design that fails to cope with (*) the non-deterministic latency from memory access will just become a spectacular failure.

(*) By “cope with”, I mean achieving reasonably good execution performance (in other words, being “cost-competitive”), which requires not letting the CPU fall idle for tens to hundreds of cycles every so often.

Note that the coping strategy employed by EPIC (mentioned in the Wikipedia article linked above) does not actually solve the issue. It merely says that the burden of indicating data dependency now falls on the compiler. That’s fine; the compiler already has that information, so it is straightforward for the compiler to comply. The problem is that the CPU is still going to idle for tens to hundreds of cycles over a memory access. In other words, it externalizes a secondary responsibility, while still failing to cope with the primary responsibility.

The question can be rephrased as: “Given a hardware platform that is destined to be a failure, why (1) didn’t, or (2) couldn’t, the compiler writers make a heroic effort to redeem it?”

I hope my rephrasing will make the answer to that question obvious.


There is a second aspect of the failure which is also fatal.

The coping strategies (mentioned in the same article) assume that software-based prefetching can be used to recover at least part of the performance loss due to non-deterministic latency from memory access.

In reality, prefetching is only profitable if you are performing streaming operations (reading memory in a sequential, or highly predictable manner).

(That said, if your code makes frequent access to some localized memory areas, caching will help.)

However, most general-purpose software must make plenty of random memory accesses. Consider the following steps:

  • Calculate the address, and then
  • Read the value, and then
  • Use it in some calculations

For most general-purpose software, these three must be executed in quick succession. In other words, it is not always possible (within the confines of software logic) to calculate the address up front, or to find enough work to do to fill up the stalls between these three steps.

To help explain why it is not always possible to find enough work to fill up the stalls, here is how one could visualize it.

  • Let’s say, to effectively hide the stalls, we need to find 100 instructions which do not depend on memory (and so will not suffer from additional latency).
  • Now, as a programmer, please load up any software of your choice into a disassembler. Choose a random function for analysis.
  • Can you identify anywhere a sequence of 100 consecutive instructions (*) which is entirely free of memory accesses?

(*) If we could ever make NOP do useful work …
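
To make the serial address → load → use chain concrete, here is a minimal C sketch (hypothetical code, not from the original post) of the pointer-chasing pattern that general-purpose software is full of; the struct and function names are illustrative:

    /* Summing a linked list: each iteration must (1) obtain the next
       address, (2) load from it, and (3) use the loaded value, and
       step (1) of the next iteration is itself the result of a load.
       The chain is serial, so a cache miss on n->next leaves nothing
       independent nearby to fill the stall with. */
    struct node { struct node *next; long value; };

    long sum_list(const struct node *n) {
        long total = 0;
        while (n) {
            total += n->value;   /* (3) use the loaded value           */
            n = n->next;         /* (1)+(2) the next address is a load */
        }
        return total;
    }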


Modern CPUs cope with the same problem using dynamic information – by concurrently tracking the progress of each instruction as it circulates through the pipelines. As I mentioned above, part of that dynamic information arises from non-deterministic memory latency, and therefore it cannot be predicted to any degree of accuracy by compilers. In general, there is simply not enough information available at compile time to make decisions that could possibly fill up those stalls.


In response to the answer by AProgrammer

It is not that “compiler … extracting parallelism is hard”.

The reordering of memory and arithmetic instructions by modern compilers is evidence that they have no problem identifying operations that are independent and thus concurrently executable.

The main problem is that non-deterministic memory latency means that whatever “instruction pairing” one has encoded for the VLIW/EPIC processor will end up being stalled by memory access.

Optimizing instructions that do not stall (register-only, arithmetic) will not help with the performance issues caused by instructions that are very likely to stall (memory access).

It is an example of failure to apply the 80-20 rule of optimization: Optimizing things that are already fast will not meaningfully improve overall performance, unless the slower things are also being optimized.


In response to the answer by Basile Starynkevitch

It is not “… (whatever) is hard”, it is that EPIC is unsuitable for any platform that has to cope with high dynamism in latency.

For example, if a processor has all of the following:

  • No direct memory access;
    • Any memory access (read or write) has to be scheduled by DMA transfer;
  • Every instruction has the same execution latency;
  • In-order execution;
  • Wide / vectorized execution units;

Then VLIW/EPIC will be a good fit.

Where does one find such processors? DSP. And this is where VLIW has flourished.
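
To make “streaming operations” concrete, here is a minimal C sketch (the function name and parameters are illustrative) of the kind of DSP kernel that fits the profile above:

    /* A FIR filter: purely sequential, predictable memory access, a
       fixed amount of work per output sample, and no data-dependent
       branches, so a compiler can schedule loads far ahead and keep
       every issue slot of a VLIW machine busy.
       Assumes x holds at least n + taps - 1 samples. */
    void fir_filter(const float *x, const float *h, float *y,
                    int n, int taps) {
        for (int i = 0; i < n; i++) {
            float acc = 0.0f;
            for (int t = 0; t < taps; t++)
                acc += h[t] * x[i + t];   /* streaming, prefetchable reads */
            y[i] = acc;
        }
    }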


In hindsight, the failure of Itanium (and the continued pouring of R&D effort into a failure, despite obvious evidence) is an example of organizational failure, and deserves to be studied in depth.

Granted, the vendor’s other ventures, such as hyperthreading, SIMD, etc., appear to be highly successful. The investment in Itanium may have had an enriching effect on the skills of its engineers, which may have enabled them to create the next generation of successful technology.


TL;DR: 1/ there are other aspects to the failure of Itanium than the compiler issues, and they may very well be enough to explain it; 2/ a bytecode would not have solved the compiler issues.

It’s commonly stated that Intel’s Itanium 64-bit processor architecture failed because the revolutionary EPIC instruction set was very difficult to write a good compiler for

Well, they were also late (planned for ’98, first shipment in 2001), and when they finally delivered the hardware, I’m not even sure it delivered what had been promised for the earlier date (IIRC, they at least dropped part of the x86 emulation which was initially planned). So I’m not sure that even if the compilation problems had been solved (and AFAIK, they still have not been), they would have succeeded. The compiler aspect was not the only aspect that was overly ambitious.

Is there any reason why Intel didn’t specify a “simple Itanium bytecode” language, and provide a tool that converts this bytecode into optimized EPIC code, leveraging their expertise as the folks who designed the system in the first place?

I’m not sure where you place the tool.

If it is in the processor, you have just another micro-architecture, and there is no reason not to use x86 as the public ISA (at least for Intel, the incompatibility has a higher cost than whatever benefit a cleaner public ISA could bring).

If it is external, starting from a bytecode makes the job even harder than starting from a higher-level language. The issue with EPIC is that it can use only the parallelism that a compiler can find, and extracting that parallelism is hard. Knowing the language rules gives you more possibilities than being constrained by something already scheduled. My (admittedly unreliable, and from someone who followed this from afar) recollection is that what HP (*) and Intel failed to achieve on the compiler front was the language-level extraction of parallelism, not the low-level extraction that would have been present in a bytecode.

You are perhaps underestimating the cost at which current processors achieve their performance. OOO is more effective than the other possibilities, but it is surely not efficient. EPIC wanted to use the area budget consumed by the implementation of OOO to provide more raw computing power, hoping that compilers would be able to make use of it. As written above, not only are we still unable (AFAIK, even in theory) to write compilers with that ability, but Itanium had enough other hard-to-implement features that it was late, and its raw power was not even competitive (except perhaps in some niche markets with lots of FP computation) with the other high-end processors when it got out of the fab.


(*) You also seem to underestimate HP’s role in EPIC.


A few things.

IPF was in-order, for one. This meant you couldn’t rely on reordering to save you in the event of a cache miss or other long-running event. As a result, you ended up needing to rely on speculative features – namely, speculative loads (loads that were allowed to fail – useful if you didn’t know whether you’d need a load result) and advanced loads (loads that could be re-run, using recovery code, if a hazard occurred). Getting these right was hard, advanced loads especially! There were also branch and cache prefetch hints that could really only be used intelligently by an assembly programmer or with profile-guided optimization, not generally by a traditional compiler.

Other machines at the time – namely UltraSPARC – were in-order, but IPF had other considerations too. One was encoding space. Itanium instructions were, by nature, not especially dense – a 128-bit bundle contained three operations and a 5-bit template field, which described the operations in the bundle and whether they could all issue together. This made for an effective 42.6-bit operation size – compare to 32 bits for most of the commercial RISCs’ operations at the time. (This was before Thumb2, et al. – RISC still meant fixed-length rigidity.) Even worse, you didn’t always have enough ILP to fit the template you were using – so you’d have to NOP-pad to fill out the template or the bundle. This, combined with the existing relatively low density, meant that getting a decent i-cache hit rate was a) really important, and b) hard – especially since I2 only had a 16KB L1I (although it was quite fast).
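
A back-of-the-envelope check of those density figures (a sketch, not a measurement; a 128-bit IA-64 bundle holds three 41-bit operation slots plus a 5-bit template):

    #include <stdio.h>

    int main(void) {
        const double bundle_bits = 128.0;
        /* Amortizing the template over three slots gives ~42.7 bits per
           operation, versus 32 bits for a classic fixed-width RISC. */
        printf("bits per operation, full bundle:  %.1f\n", bundle_bits / 3);
        /* If one of the three slots has to be a NOP for lack of ILP,
           the useful density gets even worse. */
        printf("bits per useful op, one NOP slot: %.1f\n", bundle_bits / 2);
        return 0;
    }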

While I’ve always felt that the argument of “the compiler was the one and only problem” was overblown – there were legitimate microarchitectural issues that really did I2 no favors for general-purpose code – it was not especially fun to generate code for compared to the narrower, higher-clocked OoO machines of the day. When you could really properly fill it, which often involved either PGO or hand-coding, it did great – but a lot of the time, performance from compilers was really just uninspiring. IPF didn’t make it easy to generate great code, and it was unforgiving when code wasn’t great.

What killed Itanium was shipment delays that opened the door for AMD64 to step in before software vendors committed to migrating to IA64 for 64-bit apps.

Leaving optimization to the compiler was a good idea. A lot of stuff can be done statically that is otherwise inefficient in hardware. The compilers became quite good at it, especially when using PGO profiling (I worked at HP, and HP’s compiler tended to outperform Intel’s). PGO was a hard sell, however; it’s a difficult process for production code.

IPF was meant to be backwards compatible, but once AMD64 launched it became moot; the battle was lost, and I believe the x86 hardware in the CPU was simply stripped out to retarget it as a server CPU.

Itanium as an architecture was not bad; the three instructions per word were not an issue. What was an issue was the hyper-threading implementation: swapping stacks during memory I/O was too slow (it had to empty and reload the pipeline) until Montecito and later parts, which prevented it from competing with out-of-order PowerPC CPUs. The compilers had to patch up late-to-detect flaws in the CPU implementations, and some of the performance edge was lost to hard-to-predict mistakes.

The architecture allowed Itanium to be relatively simple while providing tools for the compiler to eke performance out of it. If the platform had lived, the CPUs would have become more complex, and eventually become threaded, out of order, etc., like x86. However, the first generations spent their transistor count on other performance schemes, since the compiler handled a lot of the hard stuff.

The IPF platform bet on the compiler and tools, and it was the first architecture to expose an extremely complete and powerful Performance Monitoring Unit (PMU) design, which was later ported back to Intel x86. It is so powerful that tool developers still don’t use it to its full ability to profile code.

If you look at ISA successes, it’s often not the technical side that rolls the dice; it’s the ISA’s place in time and market forces. Look at SGI MIPS, DEC Alpha… Itanium was supported only by the losers, SGI and HP servers, companies whose managements piled on strategic business mistakes. Microsoft was never all-in and embraced AMD64 so as not to be boxed in with only Intel as a player, and Intel didn’t play right with AMD to give them a way to live in the ecosystem, as they intended to snuff AMD out.

If you look at where we are today, x86’s complex hardware has led it to an evolutionary dead end so far. We’re stuck at 3+ GHz, and piling on cores without enough use for them. Itanium’s simpler design would have pushed more of the work onto the compiler (room for growth), allowing thinner, faster pipelines to be built. At the same generation and fab technology, it would have run faster and hit the same cap, just a bit higher, with perhaps other doors open to push Moore’s law.

Well, at least the above is my belief 🙂


Itanium was announced in 1997 (as Merced, at the time), but it didn’t ship until 2001, which is what eventually doomed it, really.

The real reason for this epic failure was the phenomenon called “too much invested to quit” (also see the Dollar Auction) with a side of Osborne effect.

Our story really begins in 1990 (!). HP is trying to answer the question: what’s next after PA-RISC? They started a visionary research project using personnel and IP from two notable VLIW companies of the 1980s, Cydrome and Multiflow (the Multiflow Trace is, by the way, a negative answer to the question posed in the title: it had a successful VLIW compiler); this was the Precision Architecture Wide-Word. By 1993 they decide it’s worth developing into a product and start looking for a semiconductor manufacturing partner, and in 1994 they announce their partnership with Intel. It is still not at all evident that x86 will win over everything; for example, the DEC Alpha AXP looked much more like the future of the high end.

They continue development and announce EPIC at the Microprocessor Forum in 1997, but the ISA won’t be released until February 1999, making it impossible to create any tools for it before then. Catastrophe hits in October 1999, when AMD announces x86-64. With the Alpha chip design team now at AMD, the Athlon had already shown their ability to deliver competitive performance, and x86-64 takes away the 64-bit advantage. While Intel’s own Pentium 4 was not yet public, it also showed how far x86 could go performance-wise. To make things worse, McKinley had been announced back in 1998 with a 2001 shipment date, and as this ZDNet article from March 1999 mentions, “Word on the street suggests Merced is more likely to be a development platform with few commercial shipments — most will wait for McKinley”.

What to do at this juncture? HP had been at this since 1988, when they acquired Cydrome IP and hired Bob Rau and Michael Schlansker from the company when it collapsed (see Historical background for EPIC instruction set architectures and EPIC: An Architecture for Instruction-Level Parallel Processors). Do they just scrap a decade-plus, multibillion-dollar project because it’s visibly too late?

Later, further fuelling the Osborne effect, at the beginning of 2002, after Itanium sales got off to a slow start, one could read analysts saying “One problem is that McKinley…is expensive to manufacture. It also means yields are lower … Not until you get into Madison and Deerfield in 2003 do you start talking about volume.” So people who had been strung along from 1998 to 2002 to wait for McKinley were, now that the year of McKinley had arrived, told: wait, that one is too expensive; the next one will be better, or if not, then the one after. But Opteron launched two months before Madison, and that’s approximately where this whole charade should’ve ended.

But why was the compiler stuff such a difficult technical problem? It seems to me that if the explicit parallelism in EPIC was difficult for compiler vendors to implement… why put that burden on them in the first place? It’s not like a good, well-understood solution to this problem didn’t already exist: put that burden on Intel instead and give the compiler-writers a simpler target.

What you describe is a bit like what Transmeta tried to do with their code morphing software (which dynamically translated x86 “bytecode” into Transmeta’s internal machine code).

As to why Intel failed to make a good enough compiler for IA64… my guess is that they did not have enough compiler expertise in house (even if, of course, they did have some very good compiler experts, but probably not enough to form a critical mass). I guess their management underestimated the effort needed to make such a compiler.

AFAIK, Intel EPIC failed because compilation for EPIC is really hard, and also because, as compiler technology slowly and gradually improved, other competitors were also able to improve their compilers (e.g. for AMD64), sharing some compiler know-how.

BTW, I wish AMD64 had been a more RISCy instruction set. It could have been something like POWERPC64 (but it probably wasn’t because of patent issues, Microsoft’s demands at that time, etc.). The x86-64 instruction set architecture is really not a “very good” architecture for compiler writers (but it is somehow “good enough”).

Also, the IA64 architecture has some strong limitations built in; e.g., the 3 instructions/word worked well as long as the processor had 3 functional units to process them, but once Intel moved to newer IA64 chips they added more functional units, and instruction-level parallelism became hard to achieve once again.

Perhaps RISC-V (which is an open-source ISA) will gradually succeed enough to become competitive with other processors.


  1. Compilers have access to optimization info that OOO hardware won’t have at run time, but OOO hardware has access to information that is not available to the compiler, such as unanticipated memory latency costs. OOO hardware optimizations were able to battle EPIC compiler optimizations to a draw on enough tasks that EPIC’s primary advantage was not a clear winner. http://www.cs.virginia.edu/~skadron/cs654/cs654_01/slides/ting.ppt

  2. Itanium’s VLIW instruction bundles frequently increased code size by a factor of 3 to 6 compared to CISC, especially in cases when the compiler could not find parallelism. This ate into available memory bandwidth, which was becoming an increasingly limited resource at the time Itanium was released. http://web.eece.maine.edu/~vweaver/papers/iccd09/iccd09_density.pdf

  3. Itanium’s VLIW instruction bundles offered speculative execution to avoid failed branch prediction costs, but the practice of executing calculations that were discarded most of the time ate into the CPU power budget, which was becoming an increasingly limited resource at the time Itanium was released.

  4. It was hard to make a single binary that performed optimally on multiple generations of Itanium processors. This was challenging for shrink wrapped software vendors and increased the cost/risk of upgrading an Itanium platform to the current generation.

  5. Itanium never achieved the price/performance advantage necessary to overcome “platform inertia” because it was frequently delayed to compensate for issues 1-4.

  6. Itanium never achieved the economy of scale that x86 & x64 was able to leverage to lower R&D costs per unit because of issue 5.

As Robert Munn pointed out, it was the lack of backward compatibility that killed the Itanium (and many other “new” technologies).

While writing a new compiler might have been hard, you only need a few of them. A C compiler which produces optimized code is a must; otherwise you will not have a usable operating system. You need a C++ compiler, Java, and, given that the main user base would be Windows, some sort of Visual Basic. So this was not really a problem. There was a decent operating system (NT) and a good C compiler available.

What would seem like a trivial effort for a company offering a software product (recompile and retest your C code base, and at that time most would have been written in pure C!) was not that simple: converting a large set of C programs which assumed 32-bit integers and 32-bit addressing to a native 64-bit architecture was full of pitfalls. Had IA64 become a dominant chip (or even a popular one!), most software companies would have bitten the bullet and made the effort.
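
As one small, hypothetical illustration of those pitfalls (not drawn from any particular code base): code that stashes a pointer in an int compiles cleanly on a 32-bit target but silently truncates the address on a 64-bit one.

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        int x = 42;
        int *p = &x;
        /* Common 32-bit-era habit: store a pointer in an int-sized slot. */
        unsigned int stored = (unsigned int)(uintptr_t)p;  /* upper 32 bits lost */
        int *q = (int *)(uintptr_t)stored;                 /* may no longer point at x */
        printf("p=%p q=%p equal=%d\n", (void *)p, (void *)q, p == q);
        return 0;
    }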

So: a fast chip with a reasonable OS, but a very limited set of software available; therefore not many people bought it, and therefore not many software companies provided products for it.

It’s commonly stated that Intel’s Itanium 64-bit processor architecture failed because the revolutionary EPIC instruction set was very difficult to write a good compiler for, which meant a lack of good developer tools for IA64, which meant a lack of developers creating programs for the architecture, and so no one wanted to use hardware without much software for it, and so the platform failed, and all for the want of a horseshoe nail: good compilers.

That’s an oversimplified view. I worked on one of the first two Merced systems in the UK from early 2000, trying to port a mathematical modeller to its 64-bit Windows. I reported a few dozen compiler bugs, and got most of them fixed, so I speak from some practical experience. I had a pretty good relationship with my Intel customer engineer, and with Microsoft’s compiler team.

It isn’t especially hard to write a compiler that produces working code for IPF. It is very hard, with frequent excursions into “impossible”, to write a compiler that makes IPF run fast. The reasons for this are actually fundamental to the architecture.

The compiler lacks information that is available at run-time

The idea that the compiler, with plenty of time available, can do a better job of scheduling memory accesses than hardware can at run-time is wrong.

It would be true for single-core, single-hardware-thread machines without processor caches. That takes us back to 8-bit and early 16-bit machines; it is not true of fast PCs since about 1990, which have had caches. It may be that some of the senior figures at Intel in the late 1990s who’d moved into management and come back to engineering so as to be involved with the company’s “next great success” were still thinking that way. If so, that’s a terrible failure of project management.

In a cached, multi-core system, running multi-threaded programs, it is impossible for the compiler to know for sure what is in cache and what isn’t. Out-of-order execution addresses that problem, very effectively, by issuing loads, allowing instructions to proceed as soon as their data arrives, and in the meantime running other instructions. A compiler can’t do that.

You need to generate “bundles” of instructions that don’t clash

The idea of “explicitly parallel instruction computing” is that you generate bundles of instructions, and all the instructions in a bundle are wholly independent of each other. They can’t write to any of the same registers or memory locations, and they can’t use each other’s results in any other way. This is harder than it sounds.

Doing this with inherently serial languages like C, C++, and many styles of Fortran requires the compiler to develop a sufficiently comprehensive model of the code it’s compiling that it can find operations that can safely be run in parallel. This is called “instruction-level parallelism” (“ILP”). It had been an objective of compiler design since the early 1970s. HP and Intel did not have a general method of solving it in the late 1990s, and nor does anyone else, as of 2023. It is a genuinely hard problem in computer science, and if it is ever solved, that will require some great insight.
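
As a small, hypothetical illustration of why proving independence is harder than it sounds in C: without a restrict qualifier, the compiler must assume the two pointers below may overlap, so each store can change the value of a later load, which forces a conservative, serial ordering and leaves bundle slots to be padded with NOPs.

    /* The loop looks trivially parallel, but dst[i] might alias
       src[i + 1], so the compiler cannot freely hoist or bundle the
       loads ahead of the stores. */
    void scale(float *dst, const float *src, float k, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * k;
    }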

The Intel compiler people were very over-optimistic

They knew they did not have a solution to the ILP problem, but they seem to have thought that, given lots of developers, they could develop collections of heuristic rules that would be good enough. This, of course, let them expand their empire within Intel. They also told the hardware planners, when those guys hit difficult problems, “We can handle that in the compiler.” That let them be a vital part of the huge project which Intel was sure was going to dominate the world of computing.

This is a very career- and company-orientated view, which neglects the engineering problem of actually delivering solutions. They presumably “had people to do that.”

There were many flaws in the hardware design

It does not seem to have been simulated sufficiently comprehensively, because many of these problems would have shown up there. When I was working with an IPF simulator in 1999, I asked if it was generated from the abstract model of the architecture. Nobody at Intel seemed to understand that question, even after a lot of explaining.

The instruction set was very bulky. Someone appeared to have decided that in an era of gigabyte memories, that didn’t matter. However, it mattered a lot on two pain points: memory bandwidth, and cache size. IPF always needed more of those than x86-64, and they’re expensive.

There was a design flaw with speculative access to floating-point data in memory. It’s a bit complicated to describe here, but it seemed to be based in an assumption that floating-point work would only happen in leaf functions. When that wasn’t true, the code got a lot bigger and slower.

The compilers were buggy

I’d had good support from Intel on builds I did with their x86-32 compiler, but some resistance from my ISV customers, who all used the Microsoft compiler, to using them. I expected Intel to do a better job with an Itanium compiler, owing to more expertise with the architecture. So I started out doing parallel builds on Microsoft and Intel compilers. Initially, both of them were quite buggy, and I reported bugs and got fixes. After a while, Microsoft pulled ahead.

When I could run my “was it built right?” tests successfully in an optimised build with the Microsoft compiler, and could not in a de-optimised build with the Intel compiler, I dropped Intel. They weren’t very pleased about that, but could not argue with the reasoning.

De-optimised Itanium code is very de-optimised. There are three “slots” for instructions in a bundle, and some instructions had to be in the middle slot. De-optimised code only puts a real instruction in the middle slot, and no-ops in the other two. It is very large, and very, very slow. Shipping it to end-users is out of the question. It’s just too slow.

Intel suggested that using link-time code generation based on profile-guided optimisation would achieve excellent performance. I tried it just once, and it took over an hour to link my main DLL. Intel tried to claim that “you only do that for your production link” but I responded “I’ve spent over a year digging out compiler bugs, there are still plenty, and you want me to give the compiler new and difficult challenges? Which will add an hour to each iteration of my modify-build-test-debug cycle?”

Developer tools and support

Intel seemed to appreciate that code would need revising to perform well on IPF, and said they’d be willing to help, but the price was that they’d acquire rights in the software. No ISV was going to agree to that.

There was no IDE: developers for Windows got a cross-compiler and the platform SDK. This did not bother me at all, but a lot of ISVs decided they’d wait for an IDE – which never appeared for Windows.

Intel handed out a lot of Merced-based developer machines, something like 15,000 of them. Of course, many of them went unused because of the lack of an IDE. They weren’t very fast, but they worked. However, if you wanted to upgrade them to C-series Merced processors with the errata fixed, that was $1,000 per processor. Many organisations decided not to bother and carried on using B-series processors with some errata, and using the compiler workarounds. Of course, those made things slower.

When HP started selling zx2000 and zx6000 workstations with McKinley processors, those were faster. But they were also expensive, and Intel dropped support on the Merced systems a few months after the HPs shipped. ISVs were suddenly being asked to spend a lot of money on something that wasn’t going nearly as well as they’d been led to expect. There was no price competition on McKinley systems, because HP were the only company with a motherboard chipset for 1-4 processor systems, and they were only selling it in complete systems.

Critical mass was never achieved

Lots of ISVs gave up on Itanium because of the compiler bugs, the lack of an IDE, the expensive hardware, and the competition from AMD. Athlon and Opteron, the first x86-64 processors, were much easier to work with.

That meant some ISVs who were still in the game could not get tools, libraries, or other software that they needed to complete their products. So they gave up. The situation snowballed, and very little Itanium software was ever produced.

Companies that wanted to run it found that maybe they could get some of the applications they wanted, but they could not get others. Or a management tool they had standardised on wasn’t available. Meanwhile, the AMD processors would run everything that 32-bit x86 would, quickly, and would integrate with existing networks. Itanium just died of neglect, in the mass market.

But what about that bytecode idea?

The bytecode would have been complex, because it would have to express all the semantics of all the languages that could be compiled into it, so that the bytecode processor could generate non-clashing bundles. Normal bytecodes describe very simple abstract processors, and are much easier to work with.

A bytecode could not have solved the problem of extracting the instruction-level parallelism, because that problem remains unsolved to this day.

It would have been possible to create a bytecode that was just a representation of IPF instructions and bundles, but there’s no point: why not just use the real instruction set?

My memory is getting vague… Itanium had some great ideas that would have needed great compiler support. The problem was that it wasn’t one feature, it was many. Each one wasn’t a big deal; all of them together were.

For example, there was a looping feature where one iteration of the loop would operate on registers from different iterations. x86 handles the same problem through massive out-of-order capability.

At that time Java and JVMs were in fashion. What IBM said was that with PowerPC, you could compile bytecode quickly and the CPU would make it fast. Not on Itanium.

With knowledge of the state of the art in 2023: The real problem of Itanium is that it can’t compete with OOO processors. I’ll give an example:

I wrote a bit of code taking four FP arguments and running about 20 instructions, with 40 cycles of latency, on an ARM M1. FP instructions have 4 cycles of latency and you can dispatch 3 per cycle, so 120 ops are possible within those 40 cycles of latency. All in a loop. The throughput was about one iteration every 8 cycles, 5 times faster, because of OOO processing. So without any particular cleverness in the compiler I got 20 flops in 8 cycles out of a theoretical 24.

On Itanium, in a loop, that would take superhuman effort. Each loop body would have to perform operations from five different loop iterations to cover the latency. That’s tough. You start with one op from iteration 1. Then the first op of iteration 2 and the second of iteration 1. Then 3 ops from 3 iterations, 4 ops from 4 iterations, five ops from five iterations, and then you iterate in that steady state. But you also have to handle the case of only 1, 2, 3 or 4 iterations. And if there is a branch in your loop it is worse. But now consider that there is no loop. You’ll have to find ops to fill the latency. Now consider that your function returns. Nothing you can do now on the Itanium. You’re going to waste a lot of time. That ARM processor will just do it for you. It can gather several hundred operations and execute them when they are ready.
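
A hedged C reconstruction of the kind of loop being described (the original code is not shown, so the names and the exact operation mix are illustrative): within one iteration the operations form a dependent chain, but different iterations are independent, so an out-of-order core overlaps them on its own, whereas an in-order EPIC machine needs the compiler to interleave several iterations by hand.

    void kernel(const double *a, const double *b, const double *c,
                const double *d, double *out, int n) {
        for (int i = 0; i < n; i++) {
            double t = a[i];
            t = t * b[i] + c[i];   /* each step waits on the previous one  */
            t = t * d[i] + a[i];
            t = t * b[i] + c[i];
            t = t * d[i] + a[i];
            t = t * b[i] + c[i];
            out[i] = t;            /* iterations i and i+1 are independent */
        }
    }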

Now that was one feature. Itanium had several features of similar complexity. The processor just cannot keep up with a modern processor, no matter how good the compiler is.

What I do wonder is how much of the Itanium design could be used today. For example, OOO execution on the ARM M1 is not perfect. If the available operations are close to the limit set by latency, then you get delays, so handling two iterations of a loop in one pass could be beneficial. Instruction words of 32 bits might not be optimal. Maybe 128-bit words, each holding three slots of either one 42-bit or two 21-bit instructions, plus two bits for hints, would be better. Maybe memory prefetch instructions would help. As long as code doesn’t have to be compiled for one specific model of the CPU.
