TRIPS was a microprocessor architecture designed by a team at the University of Texas at Austin in conjunction with IBM, Intel, and Sun Microsystems. TRIPS uses an instruction set architecture designed to be easily broken down into large groups of instructions (graphs) that can run on independent processing elements. The design collects related data into the graphs, attempting to avoid expensive data reads and writes and to keep the data in high-speed memory close to the processing elements. The prototype TRIPS processor contains 16 such elements. According to papers published from 2003 to 2006, the project aimed to reach 1 TFLOPS on a single processor.^[1]

EDGE

Main article: Explicit Data Graph Execution

TRIPS is a processor based on the Explicit Data Graph Execution (EDGE) concept. EDGE attempts to bypass certain performance bottlenecks that have come to dominate modern systems.^[2]

EDGE is based on the processor being able to better understand the instruction stream being sent to it, seeing it not as a linear stream of individual instructions but as blocks of instructions related to a single task using isolated data. EDGE attempts to run all of these instructions as a block, distributing them internally along with any data they need to process.^[3] The compilers examine the code and find blocks of code that share information in a specific way. These are then assembled into compiled "hyperblocks" and fed into the CPU. Because the compiler guarantees that these blocks have specific interdependencies between them, the processor can isolate the code in a single functional unit with its own local memory.

For example, consider a program that adds two numbers from memory, then adds that result to another value in memory. A traditional processor would have to notice the dependency and schedule the instructions to run one after the other, storing the intermediate results in the registers. In an EDGE processor, the interdependencies between the data in the code would be noticed by the compiler, which would compile these instructions into a single block. That block would then be fed, along with all the data it needed to complete, into a single functional unit with its own private set of registers. This ensures that no additional memory fetching is required, and it keeps the registers physically close to the functional unit that needs those values.
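
The two-add example above can be sketched as a tiny dataflow block. This is only an illustrative model, not the TRIPS instruction encoding: the `Instr` class, the `run_block` helper, and the `"out"` target name are all hypothetical. It assumes each instruction names the consumers of its result, so the intermediate sum travels directly from the first add to the second rather than through a shared register.

```python
import operator
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Instr:
    op: Callable                 # two-input operation to apply
    targets: list                # (consumer name, operand slot) pairs for the result
    operands: dict = field(default_factory=dict)  # operands received so far

def run_block(instrs, inputs):
    """Fire each instruction once both operands have arrived.

    `inputs` is a list of (instruction name, slot, value) triples loaded
    with the block. Values sent to the hypothetical "out" target are
    collected and returned as the block's outputs.
    """
    outputs = {}
    pending = list(inputs)
    while pending:
        name, slot, value = pending.pop()
        if name == "out":
            outputs[slot] = value
            continue
        ins = instrs[name]
        ins.operands[slot] = value
        if len(ins.operands) == 2:          # both operands present: fire
            result = ins.op(ins.operands[0], ins.operands[1])
            pending.extend((t, s, result) for t, s in ins.targets)
    return outputs

# (2 + 3) + 10, with the intermediate value forwarded from add1 to add2.
block = {
    "add1": Instr(operator.add, [("add2", 0)]),
    "add2": Instr(operator.add, [("out", "sum")]),
}
result = run_block(block, [("add1", 0, 2), ("add1", 1, 3), ("add2", 1, 10)])
print(result)  # {'sum': 15}
```

Note that the order in which operands arrive does not matter in this sketch: `add2` simply waits until `add1` has delivered its result.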

Code that did not rely on this intermediate data would be compiled into separate hyperblocks. Of course, it is possible that an entire program would use the same data, so the compilers also look for instances where data is handed off to other code and then effectively abandoned by the original block, which is a common access pattern. In this case the compiler will still produce two separate hyperblocks, but explicitly encode the hand-off of the data rather than simply leaving it stored in some shared memory location. In doing so, the processor can "see" these communication events and schedule them to run in the proper order. Blocks that have considerable interdependencies are rearranged by the compiler to spread out the communications and avoid bottlenecking the transport.
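
Once hand-offs are explicit, ordering the blocks amounts to ordering a dependency graph. The sketch below is a hedged illustration using Python's standard `graphlib` module; the block names and the `schedule` helper are made up for the example, and a real EDGE scheduler works in hardware with very different constraints.

```python
from graphlib import TopologicalSorter

# Hypothetical hyperblocks. Each entry maps a block to the set of blocks
# whose handed-off values it consumes.
handoffs = {
    "load_inputs": set(),
    "sum_pairs": {"load_inputs"},
    "scale": {"load_inputs"},
    "combine": {"sum_pairs", "scale"},
}

def schedule(handoffs):
    """Group blocks into waves; blocks in one wave have no hand-offs between them."""
    sorter = TopologicalSorter(handoffs)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every block whose inputs are available
        waves.append(ready)
        sorter.done(*ready)
    return waves

waves = schedule(handoffs)
print(waves)  # [['load_inputs'], ['scale', 'sum_pairs'], ['combine']]
```

Blocks that land in the same wave could, in principle, run on separate functional units at the same time, since neither waits on a value from the other.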

This greatly increases the isolation of the individual functional units. EDGE processors are limited in parallelism by the capabilities of the compiler, not the on-chip systems. Whereas modern processors are reaching a plateau at four-wide parallelism, EDGE designs can scale out much wider. They can also scale "deeper", handing off blocks from one unit to another in a chain that is scheduled to reduce the contention due to shared values.

TRIPS

The University of Texas at Austin's implementation of the EDGE concept is the TRIPS processor, the Tera-op, Reliable, Intelligently adaptive Processing System. A TRIPS CPU is built by repeating a single basic functional unit as many times as needed. The TRIPS design's use of hyperblocks that are loaded en masse allows for dramatic gains in speculative execution. Whereas a traditional design might have a few hundred instructions to examine for possible scheduling into the functional units, the TRIPS design has thousands: hundreds of instructions per hyperblock, and hundreds of hyperblocks being examined. This leads to greatly improved functional unit utilization; compared with a typical four-issue superscalar design, TRIPS can process about three times as many instructions per cycle.

Traditional designs use a variety of specialized functional units, allowing more parallelism than the four-wide schedulers would otherwise allow. However, to keep all of those units active, the instruction stream has to include all of the corresponding types of instructions. As this is often not the case in practice, traditional CPUs often have many idle functional units. In TRIPS, the individual units are general purpose, allowing any instruction to run on any core. Not only does this avoid the need to carefully balance the number of different kinds of cores, but it also means that a TRIPS design can be built with any number of cores needed to reach a particular performance requirement. A single-core TRIPS CPU, with a simplified (or eliminated) scheduler, will run a set of hyperblocks exactly like one with hundreds of cores, only slower.

The performance is also not dependent on the types of data being fed in, meaning that a TRIPS CPU will run a much wider variety of tasks at the same level of performance. For instance, if a traditional CPU is fed a math-heavy workload, it will bog down as soon as all the floating point units are busy, with the integer units lying idle. If it is fed a data-intensive program like a database job, the floating point units will lie idle while the integer units bog down. In a TRIPS CPU every functional unit will add to the performance of every task, because every task can run on every unit. The designers refer to this as a "polymorphic processor".
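
The utilization argument can be made concrete with a deliberately crude model. The sketch below assumes every operation takes one cycle and ignores dependencies, memory, and scheduling overhead; the unit counts and workload mixes are invented for illustration and are not measurements of any real processor.

```python
from math import ceil

def cycles_specialized(int_ops, fp_ops, int_units=2, fp_units=2):
    """Cycles when integer and floating point work can only use their own units."""
    return max(ceil(int_ops / int_units), ceil(fp_ops / fp_units))

def cycles_general(int_ops, fp_ops, units=4):
    """Cycles when any unit can execute any operation."""
    return ceil((int_ops + fp_ops) / units)

# Hypothetical workloads as (integer ops, floating point ops).
workloads = {
    "math-heavy": (10, 90),
    "database": (90, 10),
    "balanced": (50, 50),
}

for name, (int_ops, fp_ops) in workloads.items():
    spec = cycles_specialized(int_ops, fp_ops)
    gen = cycles_general(int_ops, fp_ops)
    print(f"{name}: specialized={spec} cycles, general-purpose={gen} cycles")
```

In this toy model the two designs tie only on the balanced mix. On the skewed mixes the specialized design is held back by whichever unit type is saturated while the other sits idle.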

Like TRIPS, DSPs gain additional performance by limiting data interdependencies, but unlike TRIPS they do so by allowing only a very limited workflow to run on them. TRIPS would be just as fast as a custom DSP on these workloads, but equally able to run other workloads at the same time. Though a TRIPS processor likely could not replace highly customized designs like the GPUs in modern graphics cards, it may be able to replace or outperform many lower-performance chips like those used for media processing.

The reduction of the global register file also results in non-obvious gains. The addition of new circuitry to modern processors has meant that their overall size has remained about the same even as they move to smaller process sizes. As a result, the relative distance to the register file has grown, and this limits the possible cycle speed due to communication delays. In EDGE the data is generally more local or isolated in well-defined inter-core links, eliminating large "cross-chip" delays. This means the individual cores can be run at higher speeds, limited by the signalling time of the much shorter data paths.

The combination of these two design changes greatly improves system performance. However, as of 2008, GPUs from ATI and NVIDIA had already exceeded the 1 TFLOPS barrier (albeit for specialized applications). As for traditional CPUs, a contemporary (2007) Mac Pro using a 2-core Intel Xeon could perform only about 5 GFLOPS on single applications.^[4]

In 2003, the TRIPS team started implementing a prototype chip. Each chip has two complete cores, each one with 16 functional units in a four-wide, four-deep arrangement. In the current implementation, the compiler constructs hyperblocks of 128 instructions each, and the system keeps eight blocks "in flight" at the same time, for a total of 1,024 instructions per core. Up to 32 chips can be interconnected in the basic design, approaching 500 GFLOPS.^[5]
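
The window sizes quoted above follow from simple multiplication, worked through below. The constant names are ours, and the per-unit figure assumes instructions are spread evenly across the functional units, which the text does not state.

```python
# Figures taken from the prototype description above.
FUNCTIONAL_UNITS = 4 * 4      # four-wide, four-deep arrangement per core
BLOCK_SIZE = 128              # instructions per hyperblock
BLOCKS_IN_FLIGHT = 8          # hyperblocks in flight at once
CORES_PER_CHIP = 2

window_per_core = BLOCK_SIZE * BLOCKS_IN_FLIGHT
window_per_chip = window_per_core * CORES_PER_CHIP

# Assumption: an even spread of in-flight instructions over the units.
slots_per_unit = window_per_core // FUNCTIONAL_UNITS

print(window_per_core)  # 1024
print(window_per_chip)  # 2048
print(slots_per_unit)   # 64
```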
