# Computer Science Department Technical Report University of California Los Angeles, CA 90024-1596

HIGH-PERFORMANCE ADAPTIVE ROUTING IN MULTICOMPUTERS USING DYNAMIC VIRTUAL CIRCUITS

Yuval Tamir Yoshio F. Turner September 1990 CSD-900026

# High-Performance Adaptive Routing in Multicomputers Using Dynamic Virtual Circuits†

Yuval Tamir and Yoshio F. Turner

Computer Science Department
4731 Boelter Hall
University of California
Los Angeles, California 90024-1596
U.S.A.

Phone: (213)825-4033 E-mail: tamir@cs.ucla.edu

#### Abstract

An effective message transport mechanism, which provides high-bandwidth low-latency interprocessor communication, is the key to the ability of multicomputers to achieve high performance by exploiting parallelism. For both high performance and high reliability, it is important for the message transport mechanism to adapt to changing system conditions and route messages around congested areas or failed links or nodes. In order to efficiently support short messages, it is desirable to minimize the time to route and forward messages through intermediate nodes, as well as the required addressing and control information that must be attached to each message.

We introduce a new message transport mechanism, called *Dynamic Virtual Circuits*, that combines the best features of the traditional circuit switching, packet switching, and static virtual circuits. Routing through intermediate nodes usually requires only a single lookup in a small table, packets include minimal control information, and packets are delivered in FIFO order. Due to the ability of nodes in the middle of a dynamic virtual circuit to break it and later re-establish it through a different physical path, the system has the full advantages of adaptive routing while maintaining the virtual circuit semantics. We present the basic algorithms for dynamic virtual circuits and describe the required hardware support in the context of a VLSI communication coprocessor for multicomputers.

<sup>†</sup>This research is supported by the SDIO Innovative Science and Technology Office, managed by the Office of Naval Research, contracted to the Jet Propulsion Laboratory under task plan #80-2984. Y. Turner is supported by a Hertz Foundation Graduate Fellowship.

### I. Introduction

Multicomputers, composed of thousands of VLSI computing nodes, interconnected by point-to-point links, are likely to achieve the goal of tera-op performance within the next few years [1,4]. Such systems have the fundamental advantage that there is no single component whose operation is critical to the entire system and is thus a potential performance or reliability "bottleneck." An additional advantage is that the high-speed point-to-point links can be implemented at a relatively low cost, thus allowing for *cost-effective* "super computing" in a wide variety of environments. High-bandwidth low-latency communication between processors is critical to the ability of multicomputers to achieve high performance by exploiting both coarse grain and fine grain parallelism.

The "message transport" mechanism used in a multicomputer can have a dramatic effect on system performance and on its ability to handle changes in the workload or in the availability of resources. A packet often passes through intermediate nodes on its way from its source to its destination. For low latency communication, the delay added by each intermediate node must be minimized. This can be done using buffers that support virtual cut through [9, 14] and a routing mechanism that can quickly determine the output port to which the packet should be forwarded [4]. The routing scheme should direct the packet through the lowest latency path from source to destination. This implies that it should take into account the topology of the network and adapt to the current load and resource availability on the system to route packets around congested or faulty areas [12, 10, 4]. It is important for the message transport scheme to minimize the addressing and control information that must be sent with each packet as well as to maximize the availability of network resources for active connections.

With message transport based on *virtual circuits* [11,2], packets are sent through pre-established paths so that there is no need to recompute the route for every packet at each intermediate node, and packets do not need to include complete routing and control information. The paths are established by storing routing information in tables of intermediate nodes along the way from the source to the destination. The physical resources (the links) can be time-shared by many virtual circuits. Since each virtual circuit maintains first-in-first-out (FIFO) ordering of its packets and most of the routing information is stored in the nodes, the routing information that must be included with each packet is minimized.

The problem with virtual circuits is that the paths are static and cannot be easily changed in response to changes in the system (congestion or failure). This paper introduces a new message transport mechanism, called *Dynamic Virtual Circuits*, that has the advantages of virtual circuits but allows individual nodes to make a local decision to break or reroute an existing circuit. The scheme is based on

keeping at each node sufficient information to re-establish broken circuits while maintaining the FIFO ordering of packets through each circuit. Since routing with Dynamic Virtual Circuits is based on tables, the scheme does not depend on a regular system topology. Hence, efficient routing can be performed even after a large number of system nodes or links have failed.

The research described in this paper is part of the UCLA ComCoBB (Communication Coprocessor Building-Block) project, whose focus is the design and implementation of a high-performance communication coprocessor for VLSI multicomputers. A critical aspect of the ComCoBB chip design is support for efficient routing for a wide variety of network topologies. Hence, schemes based on routing tables [12] must be supported.

Conventional static virtual circuits are described in Section II. The disadvantages of static virtual circuits are discussed in Section III, where the basic technique of Dynamic Virtual Circuits is described. The hardware support for Dynamic Virtual Circuits in the ComCoBB chip is described in Section IV.

#### II. Routing in Multicomputers Using Static Virtual Circuits

The two fundamental approaches to message routing in multicomputers are circuit switching and packet switching [11,2]. With circuit switching, a static physical path is set up between the sender and receiver before communication takes place. Once the path is set, data can be transmitted quickly at nearly the full bandwidth of the links with almost no redundant control information [4]. On the other hand, with packet switching, the data is partitioned into small packets which are routed independently from source to destination. The advantage of circuit switching is that, once the circuit is established, data is sent to its destination with minimum latency and no bandwidth is wasted for transmitting routing and sequencing control information with each packet. The disadvantage of circuit switching is that physical links are statically allocated to a particular circuit for the lifetime of the circuit. Even when no information is being sent through the established circuit, the links cannot be used for other circuits.

Packet switching improves link utilization relative to circuit switching since links are never allocated to an idle circuit. The links along the paths chosen by a packet are used only during the time required to send the packet's bytes across the link. One disadvantage of packet switching is that each packet must contain complete addressing information as well as message and packet sequence numbers. Since packets may arrive at their destination in any order, they need to be buffered and reordered before they can be processed by the destination application. Message latency through a packet switching network is significantly larger than the latency through an established circuit since each packet of the message must be delayed at each node long enough to determine the appropriate next step on its path to the destination.

A key advantage of packet switching is that the routing can adapt to changes in the system, potentially sending different packets with the same source and the same destination through different paths. With adaptive routing, packets are routed around congested or failed links and nodes, thus providing higher reliability, higher throughput, and lower message latency than can be achieved with fixed routing. For example, simulations of 2D and 3D meshes and tori have shown that fixed routing limits maximum throughput to about 40% of the bisection bandwidth while adaptive routing can provide up to 80% of the bisection bandwidth [10]. Similarly, adaptive routing on a hypercube has been shown to provide low message latency as a result of routing around congested regions [4].

Virtual circuits are used for message transport in an attempt to combine the best features from circuit and packet switching [11,2]. As with circuit switching, paths are established and stored in routing tables along the way. However, each physical link can be time shared between multiple virtual circuits. The physical link is logically divided into multiple virtual channels, where a field in the header byte of each incoming packet indicates the virtual channel number used by the packet. Virtual circuits are paths through the network consisting of a sequence of virtual channels. At each node, mapping tables describe the established virtual circuits passing through the node.

To establish a virtual circuit, a source node generates a Circuit Establishment Packet (CEP) that is then transmitted, as in a packet switching network, to the destination node. At each node along the way, the CEP arrives on an unused input channel of an input port. Based on the final destination of the CEP, the node routes it to one of its output ports. In the most general case, this routing utilizes large off-chip routing tables, which include information regarding the topology of the system with the cost of links weighted according to recent traffic loads [12, 2]. In addition to choosing an appropriate output port, the routing of the CEP also involves choosing an output channel from among the currently unused channels at the output port. The mapping table is then set to route future packets arriving on the same input channel and input port to the chosen output port and output channel. An example of an established virtual circuit is shown in Figure 1.

After sending the CEP, the source node sends data packets along the circuit. Each data packet contains in its header byte the input channel number on which it arrives. This is the only overhead associated with the packet, representing a significant reduction from a pure packet switching mechanism. At each node, the data packets arrive on input ports and input channels and are routed according to the entries in the mapping table to output ports and output channels. This routing is much faster than the routing of the CEP since all it requires is a single lookup in the small on-chip mapping table. Since all packets are transmitted through the circuit along the same physical path, it is possible to guarantee FIFO



Figure 1: A virtual circuit from process A to process B. The circuit uses channel 3 of the first link and channel 2 of the second.

order of delivery of packets at the destination. Once the virtual circuit is no longer needed, the source node removes it from the system by sending a Circuit Destruction Packet (CDP) that traverses the circuit from source to destination, invalidating the appropriate mapping table entries at each node.

## III. Dynamic Virtual Circuits

The problem with the traditional virtual circuits scheme is that the paths are static and cannot be changed in response to changes in the system (congestion or failure). Throughout a virtual circuit's lifetime, it occupies the same resources, namely one virtual channel on each physical link the circuit uses. These resources are allocated for the exclusive use of the virtual circuit, even if the circuit is idle for long time periods. In addition, circuits may permanently occupy resources if nodes or links fail or if processes terminate without tearing down their circuits. Since the number of virtual channels per link is limited, new virtual circuits may be prevented from becoming established on desired links even if idle circuits are holding resources unnecessarily.

Dynamic Virtual Circuits (DVCs) overcome the limitations of static virtual circuits by allowing circuits to change the paths they occupy during their lifetimes. This flexibility guarantees the establishment of circuits, even when there are no free virtual channels on a desired link. Resources

allocated to idle DVCs are eventually deallocated and used for active DVCs as needed. In addition, if a particular circuit becomes slow or blocked, due to congestion or failure, a node can make a <u>local</u> decision to break the circuit and reestablish it using an operational, less congested route.

In the simplest case, the Dynamic Virtual Circuit mechanism is identical to the static virtual circuit mechanism. When a CEP arrives at a node, it is routed to determine the desired output port. For a regular network topology, this routing may be algorithmic [3]. For irregular networks, a more complex scheme, based on large routing tables, may be used [12]. If there is a free output channel on that output port, it is used by the new circuit. The mapping tables are modified to route future packets arriving on the circuit to the chosen output port and output channel, and the CEP is sent to the next node. When the process at the source node no longer needs the circuit, it generates a CDP, which follows the path of the circuit, invalidating the appropriate mapping table entries at each node. When the CDP arrives, the destination node releases all node resources associated with that circuit.

If a CEP arrives at a node and is routed to a link with no free virtual channels, the static virtual circuit mechanism cannot be used. Instead, the Dynamic Virtual Circuit mechanism chooses an established DVC on the desired link as a victim for temporary destruction. The victim DVC is, ideally, the circuit whose next packet will enter the node furthest in the future. Once a victim is chosen, the node generates and sends a CDP along the victim channel. The CDP is marked as *nonterminal* to inform the destination node that the circuit is being torn down temporarily from an intermediate node, as opposed to permanently from the source node. After the generated CDP is sent through the output port, the CEP can be sent, establishing the new DVC.

When a DVC is first established, information regarding the ultimate destination (node identifier) of the new circuit is kept at each intermediate node. When a packet arrives, on an established DVC, at a node where the DVC was previously cut, the information regarding the ultimate destination of the cut circuit is used to reestablish the DVC. The node chooses an output port on which to reestablish the cut circuit, creates a CEP, updates the mapping tables, and sends the CEP, reestablishing the circuit on the new path to its destination. The data packet that triggered the reestablishment of the circuit as well as future packets on the same circuit can then be sent along the new path.

Since there are multiple input and output ports operating and interacting concurrently at each node, the DVC mechanism description above, though accurate at a high level of abstraction, is overly simplistic. Care must be taken to prevent on-chip components of the chip from entering inconsistent states due to improper ordering of events or from becoming stuck forever waiting for each other to complete some operation. These issues are discussed further in Section IV.

Although DVCs provide most of the advantages and overcome the difficulties of static virtual circuits, they do not guarantee the <u>physical</u> FIFO packet arrival at the destination node as with static virtual circuits. For example, a circuit that has been torn down from an intermediate node may be reestablished on a different path before the nonterminal CDP reaches the destination. In this case, the packets on the reestablished branch of the circuit arrive at the destination before all the packets on the torn-down branch arrive. Since proper packet ordering is vital, some mechanism must be provided to allow the destination node to determine the order in which packets were sent by the source node. Two such mechanisms, one based on packet sequence numbers and one based on logical timestamps, are described and compared here. Both mechanisms rely on providing enough information in CDPs and CEPs to allow the destination node to identify and order arriving branches belonging to the same circuit.

With the packet sequence number mechanism, for each circuit established at a node, a packet count register records how many packets have arrived at the node on the circuit. A sliding window protocol can be used to bound the maximum value of the packet counters. Also stored at each intermediate node is other information needed to uniquely identify the particular DVC, such as source and destination process and processor identifiers. When a circuit is reestablished, the packet count at the node initiating the reestablishment is sent to the circuit destination, together with the information originally used to establish the circuit (source process id, destination process id, etc). The destination node accepts packets on reestablished circuits only after its local packet counter indicates that all previous packets have already been received. The main disadvantage of this scheme is that it requires dedicated hardware at each input port to store and increment the packet counters.

The timestamp mechanism avoids the use of dedicated packet counters for each incoming circuit at each input port. The mechanism requires only a small amount of information to uniquely identify branches, and it updates circuit information at the intermediate node only when a circuit is reestablished or disestablished from that intermediate node. Because the circuit information is rarely updated, it is not necessary to store it on-chip or provide dedicated hardware for updating. Each node maintains a count of the number of nonterminal circuit destructions the node has initiated. This count serves as a logical timestamp that, in conjunction with the identifier of the node where the circuit was broken, uniquely identifies the circuit destruction event. The counter has to be sufficiently large (40-64 bits) so that there would be no danger of the counter "wrapping around" leading to possible incorrect packet ordering. When the torn down DVC is to be reestablished, the stored destruction timestamp and the node identifier are sent with the CEP. When a circuit branch CEP arrives at the destination node, a matching circuit branch CDP must be found with the same timestamp and node identifier values. The only such CDP must be the one terminating the branch to be ordered just prior to the CEP's branch.



Figure 2: A multicomputer node.

#### IV. Implementation of Dynamic Virtual Circuits

Figure 2 shows a multicomputer node with the application processor, local memory used by the application processor, the ComCoBB chip, and a special routing processor with its memory. The routing processor is a general-purpose processor which is used as a dedicated controller to perform some of the infrequent but complex operation that are needed to support DVCs. Frequent operations, such as routing and forwarding of a packet on an established circuit, are handled entirely within the ComCoBB chip, using dedicated hardware. The routing processor handles tasks such as, initiating circuit destruction, reestablishing a circuit, updating of global routing tables [12], and resolution of deadlocks. It should be noted that, on a large chip, the routing processor and its memory may be implemented as a dedicated programmable controller on the same chip with the communication switch.

Tables which are accessed for every packet forwarded through the switch are stored at the input and output ports of the ComCoBB chip (see Figures 4 and 5). However, less frequently accessed tables are stored in the private memory of the routing processor. The tables in the routing processor's memory are: (1) the Circuit Destruction Table, which records timestamps of circuit teardowns initiated at the node, (2) the Inverse Output Mapping Table, which maps each valid output channel to the corresponding input port and input channel number, and (3) the routing table used to determine the path when the DVC is established. In addition, for DVCs which originate in each node, the node maintains a single Source

Table, which maps logical DVC identifiers to channel numbers for the first hop of the DVCs. At each node, for DVCs whose destination is the node, there is a single Destination Table, which maintains the information (CEP and CDP timstamps) necessary for ordering packets arriving over different paths of the same DVC. On its way from the source node to the destination node, each packet requires one access to a Source Table and one access to a Destination Table. Hence, the ComCoBB chip's processor interface must support fast access to these tables.



Figure 3: ComCoBB chip components.

The ComCoBB chip consists of four input ports, four output ports, the Processor Interface (PIF) to the application processor, and the Routing Processor Interface (RPI). The basic chip organization is shown in Figure 3. The input and output ports each consist of eight data lines and one flow control line. The input port uses the flow control line to stop the output port from sending data when, for example, the buffer at the input port becomes full. The RPI translates read and write requests by the routing processor to both read/write operations on storage elements inside the ComCoBB chip and commands affecting the behavior of ComCoBB modules. The RPI also fields interrupt requests raised by ComCoBB modules and passes them on to the routing processor.

When a packet arrives, it is placed in the input buffer and routed to determine the desired output port. Once the packet reaches the head of the buffer, the buffer makes a request to the crossbar for a connection to the desired output port. After the request is granted, the packet is removed from the buffer and sent through the crossbar switch and the output port to the neighboring ComCoBB chip.



Figure 4: Input port routing hardware.

An Input Mapping Table at each input port is used to route packets arriving on established DVCs. The 32-entry mapping table is addressed by the input channel numbers of incoming packets and contains the output port and output channel numbers the incoming packets are to use. Thus, once a DVC is set up, all packets arriving at nodes on the established circuit simply access the Input Mapping Table to rapidly determine the output port and output channel to use.

Figure 4 shows a block diagram of the hardware located at each input port of the ComCoBB. The figure shows four main components: the Synchronizer[13], the dynamically-allocated multi-queue (DAMQ) input buffer[14], the Auxiliary Buffer, and the Input Mapping Table. The Synchronizer produces eight bits of data synchronized to the local clock. These signals are input to both the DAMQ buffer, which is the main packet buffer at the input port, and to the Auxiliary Buffer, which is a much smaller FIFO buffer and usually holds the first eight bytes of the most recent packet arriving through that input port. In normal operation, as packets arrive they are placed in both the DAMQ buffer and the Auxiliary Buffer. In addition, the header byte of each incoming packet is forwarded to the Input Mapping

Table for lookup. If the lookup references a valid entry, the header is modified to contain the output channel number, and the new header is latched into the DAMQ buffer. If the access references an invalid entry, the Input Mapping Table raises an interrupt for the routing processor and causes the DAMQ buffer control to use the flow control line to stop traffic into the input port. A packet arriving on an input channel that has no valid mapping is either a Circuit Establishment Packet or a packet arriving on a circuit that has been disestablished from this node. In either case, routing processor intervention is required and incoming packet flow must be stopped.

The DAMQ buffer normally takes its input from the output of the Synchronizer, but it can also take its input either from the Auxiliary Buffer or from a packet buffer located at the RPI via the routing processor data bus. The DAMQ input comes from the RPI when, for example, the routing processor needs to insert a CDP into the circuit. The DAMQ input comes from the Auxiliary Buffer when forwarding a CEP which was held in the buffer while the corresponding IMT entry was set up. Care must be taken to ensure that flow from the input port is halted when the input to the DAMQ buffer is taken from one of the other sources. Otherwise, packets arriving on the input port will be lost.

In some cases, the routing processor requires information contained in the body of the packet. For example, when a CEP arrives, the header byte indicates the input channel, and subsequent bytes of the packet indicate the desired final destination. To access these bytes, the routing processor can read the Auxiliary Buffer whenever it is not being written by the input port.

The CDP that destroys a circuit is generated by the routing processor. However, circuit destruction originating in a remote node can be handled without the intervention of the routing processor. To support this fast handling of CDPs at the input port, an *Invalidate* input is added to the Input Mapping Table. This signal causes the table lookup to mark the entry referenced as invalid. A CDP automatically triggers this operation.

Figure 5 shows a block diagram of the hardware at each output port. This hardware consists of a table and logic for picking victim channels. The table is used to keep track of valid and invalid output channels and to maintain output channel use information. There are 32 entries in the table, one per output channel. The output port table is normally accessed when packets arrive at the output port from the crossbar—the entry corresponding to the channel number of the packet is updated. In addition, the table is accessed by the routing processor when a DVC is established. Each table entry consists of two bits: valid and use. The valid bit specifies whether the output channel is part of a circuit. The use bit indicates whether a packet has been sent on the corresponding output port recently.

The information in the output port table drives the circuit that selects a victim when there is a need



Figure 5: Output port logic. Invalidates circuits and picks victim output channels.

to find a free channel for use in establishing or reestablishing a DVC through the output port. The victim selection module is a combinational circuit that continuously computes the victim output channel number. The victim number is placed in the Victim Register, from which it can be read by the routing processor. If there are any invalid output channels (*i.e.*, channels not on established circuits), these are picked by the victim selection logic. If all the output channels are allocated to established DVCs, one of those channels is picked and the corresponding DVC is disestablished, starting from this node. The "clock" replacement algorithm, commonly used for page replacement in virtual memory [6], is used to pick the victim established DVC.

#### A. Sequencing of ComCoBB Operations

Some of the operations performed by the ComCoBB involve several sequential steps. As mentioned in Section III, proper ordering of on-chip events is crucial for avoiding inconsistent states. For example, when a DVC is established through a switch, it might be necessary to first identify a victim channel on the appropriate output port, disestablish the DVC currently using the channel, and only then proceed with forwarding of the CEP to the next node. A straightforward but incorrect procedure for performing this operation would have the routing processor reset the *valid* bit for the victim channel at both the Input Mapping Table and at the output port table as soon as a victim is selected. Then, the routing processor would create a CDP and send it directly out the output port. Once the CDP is sent, the pending circuit establishment request would be fulfilled.

The procedure above is wrong because there may be packets enqueued in the DAMQ buffer at the input port used by the victim circuit. If the mapping tables are changed and the CDP sent before these

enqueued packets exit the node, the packets will be forwarded out the same output port and output channel as the packets on the circuit being established, thus mixing packets of different circuits. Also, one of the enqueued packets may be a CDP, rendering the creation of a CDP by the routing processor redundant. When the routing processor needs to pick a victim output channel for use in a new DVC, it reads the Victim Register, that contains the selected channel number as well as the corresponding valid bit. If the entry is valid, the routing processor checks tables, stored in its private memory, to ensure that this channel had not already been reserved for a different DVC destruction. If it had, the routing processor reads the Victim Register again, to get a different victim. Assuming that the output port table entry for the chosen victim is valid, the correct procedure for this operation is as follows: (1) the routing processor stops incoming packets from arriving to the corresponding input port (using the flow control line), (2) if the corresponding IMT entry is valid, the routing processor inserts a CDP at the tail of the DAMQ buffer, just as if the CDP had arrived on the victim circuit's input port, (3) the routing processor re-enables arrival of packets to the input port, (4) the CDP causes the Input Mapping Table entry for the victim circuit to be invalidated, and only after all packets previously enqueued for the victim circuit have been sent does the CDP reach the output port and cause the invalidation of the victim channel, (5) if a CDP arrives at an output port, causing the invalidation of one of the output table entries, while an establishment request for that port is still pending, the routing processor is interrupted and it sets the mapping tables to use the invalidated channel for the establishment.

#### B. Support for Deadlock Resolution

In general, multicomputer interconnection networks are susceptible to *deadlock* situations when no messages can advance towards their destinations due to full buffers in one or more cycles of communication switches [11]. With a global view of the interconnection topology, it is possible to guarantee deadlock-free routing by restricting the routing algorithm and the use of the buffers at each node [7]. When deadlocks do not occur, deadlock-free routing results in inefficient use of system resources. An alternative approach to dealing with deadlocks is to detect and resolve them when they occur [8,5]. With this approach there is no restriction on the routing and all network resources are used for improved performance rather than reserved for eliminating deadlocks. Since Dynamic Virtual Circuits are designed to support distributed adaptive routing in arbitrary topologies [12], the choice of the latter approach to dealing with deadlocks is clear.

When an input DAMQ buffer remains full for some time without transmitting any packets, a deadlock may exist. Following the algorithm proposed in [8], when a node decides that it may be in a deadlock, it first attempts to determine if there is a cycle of nodes with full buffers which may form a

potential deadlock. If such a cycle is found, the packets are rotated along the cycle so that they move towards their destinations. Eventually, this process leads to one of the buffers becoming not-full, thus resolving the deadlock. Further details of the basic algorithm [8] will not be described here.

The cycle detection and deadlock resolution algorithms require, at each switch, a fixed amount of storage that is available even when the normal buffer is full [8]. This Auxiliary Buffer is used to receive control packets as well as for receiving a data packet during the rotation phase of the algorithm. The control packets and fragments of data packets can be read by the routing processor from the Auxiliary Buffer. In addition, it is necessary for the routing processor to be able to send a control packet directly to a specified output port. All the connection and controls necessary for these operations are provided in the ComCoBB design (Figures 3, 4, and 5).

The cycle detection and deadlock resolution algorithms require the switch to receive packets even if the DAMQ input buffer is full. Thus, it is not sufficient to implement a simple flow control line that inhibits any new packets from being transmitted by the neighbor whenever the buffer is full. Specifically, when a node is ready to participate in cycle detection or deadlock resolution, it must be possible for the routing processor to allow transmission of one control packet or one 8-byte fragment of a data packet. The simplest solution to the problem of enabling and disabling control packet reception, is to use two flow control lines per port, where one line is used for normal operation and the other flow control line is used solely to handle enabling and disabling when the DAMQ buffer is full. If pin limitations prevent this option, then a protocol requiring only the single flow control line per port can be used, with very little loss in overall performance. Specifically, when an input buffer is full, but the node is ready to receive eight bytes to be placed in the auxiliary buffer, the node sends a special control message to its neighbor. This control message is not placed in the DAMQ buffer or the auxiliary buffer of the receiver. Instead, it simply sets a single-bit state variable. This technique relies on the ability of the routing processor to transmit a packet to the output port without using the crossbar. The special control message can be sent only when the output port is not otherwise occupied. Hence, it may have to wait for completion of the current packet transmission.

#### V. Summary and Conclusions

Dynamic virtual circuits combine the best features of the traditional circuit switching, packet switching, and static virtual circuits. This new message transport mechanism minimizes the addressing and control information sent with each packet and the latency of forwarding packets through each intermediate node. Unlike static virtual circuits, DVCs can adapt to congestion or failures in the system by allowing nodes to make local decisions regarding the possible need to reroute the circuit through a

different physical path. We have presented the basic techniques for allowing circuits to be broken and reestablished while maintaining the semantics of traditional virtual circuits.

We have presented a brief overview of the hardware necessary to support DVCs in the context of a complete communication coprocessor for multicomputers. Dedicated hardware is used on-chip to handle the critical frequent case, while an off-chip routing processor is used for more complex but less frequent tasks. The proposed hardware supports a deadlock resolution algorithm that does not require limiting the flexibility of the routing scheme and leads to efficient utilization of system resources during normal operation.

#### Acknowledgements

Some of the ideas presented in this paper were explored by Gregory Frazier and Tiffany Frazier in course term projects at UCLA in 1988. Tiffany Frazier suggested the idea of using logical timestamps for maintaining packet ordering. Helpful suggestions regarding this work were also made by David Rennels, Bill Smith, and Vance Tyree.

#### References

- 1. W. C. Athas and C. L. Seitz, "Multicomputers: Message-Passing Concurrent Computers," *Computer* 21(8), pp. 9-24 (August 1988).
- 2. D. Bertsekas and R. Gallager, Data Networks, Prentice Hall (1987).
- 3. M.-S. Chen, K. G. Shin, and D. D. Kandlur, "Addressing, Routing, and Broadcasting in Hexagonal Mesh Multiprocessors," *IEEE Transactions on Computers* **39**(1), pp. 10-18 (January 1990).
- 4. E. Chow, H. Madan, J. Peterson, D. Grunwald, and D. Reed, "Hyperswitch Network for the Hypercube Computer," 15th Annual International Symposium on Computer Architecture, Honolulu, Hawaii, pp. 90-99 (May 1988).
- 5. I. Cidon, J. M. Jaffe, and M. Sidi, "Local Distributed Deadlock Detection by Cycle Detection and Clustering," *IEEE Transactions on Software Engineering* SE-13(1), pp. 3-14 (January 1987).
- 6. F. J. Corbato, "A Paging Experiment with the MULTICS System," Project MAC Memo MAC-M-384, MIT, Cambridge, MA (July 1968).
- 7. W. J. Dally and C. L. Seitz, "Deadlock-Free Message Routing in Multiprocessor Interconnection Networks," *IEEE Transactions on Computers* C-36(5), pp. 547-553 (May 1987).
- 8. J. M. Jaffe and M. Sidi, "Distributed Deadlock Resolution in Store-and-Forward Networks," *Algorithmica* 4(3), pp. 417-436 (1989).
- 9. P. Kermani and L. Kleinrock, "Virtual Cut Through: A New Computer Communication Switching Technique," *Computer Networks* 3(4), pp. 267-286 (September 1979).
- 10. J. Y. Ngai, "A Framework for Adaptive Routing in Multicomputer Networks," Computer Science Technical Report 89-09, California Institute of Technology, Pasadena, CA (May 1989).

- 11. D. A. Reed and R. M. Fujimoto, Multicomputer Networks: Message-Based Parallel Processing, The MIT Press (1987).
- 12. W. D. Tajibnapis, "A Correctness Proof of a Topology Information Maintenance Protocol for a Distributed Computer Network," Communications of the ACM 20(7), pp. 477-485 (July 1977).
- 13. Y. Tamir and J. C. Cho, "Design and Implementation of High-Speed Asynchronous Communication Ports for VLSI Multicomputer Nodes," *International Symposium on Circuits and Systems*, Espoo, Finland, pp. 805-809 (June 1988).
- 14. Y. Tamir and G. L. Frazier, "High-Performance Multi-Queue Buffers for VLSI Communication Switches," 15th Annual International Symposium on Computer Architecture, Honolulu, Hawaii, pp. 343-354 (May 1988).