On-Chip Network
From Bits
Contents |
Topology
Still experimenting with DualThing1...
Bus
One big bus connecting every node. Can send one packet per clock at most.
Dual bus
Observation: Routers send only to queues and trafficgens, while trafficgens and queues only send to routers. The nodes can be partitioned onto two buses that don't interfere.
Under this scheme, every node only attaches to one bus, so complexity is identical to the single bus. Having two buses with fewer nodes on each allows two concurrent packet transfers per clock (depending on node type), may improve fmax (fewer nodes on bus), without increasing the design complexity (same as single bus).
Simulation on config2_saturated.dat suggest dual bus can be significantly faster than single bus.
Crossbar
Every node can send one packet per clock to any destination node as long as each destination node receives no more than one packet.
Much too expensive to build in hardware, but a good upper bound on what might be conceptually constructable. Experiments on ts20-4 (~100 nodes) show that crossbar gets very close to ideal with ready_delta as small as 3, while the about 70% of the benefit can be achieved with ready_delta = 1.
Perhaps some scheme involving smaller crossbars may work.
DualThing1
Two separate interconnects, like the Dual bus.
Partition source and destination nodes into 4 groups. Interconnect each group using a 4x4 crossbar. Diagram shows only one of the 4 destination groups.
Performance isn't as good as desired. One reason is because some nodes (thus, partitions) get more traffic than others. Another possible problem is head-of-line blocking, since we use arbitrary scheduling of packets across the switch with single-entry input queues.
The "solution" to out of order packets on the crossbar also hurts performance.
May try to see whether some variation of iSLIP scheduling is feasible.
DualThing2
Similar to DualThing1, but each destination partition now has a current and a (T+1) queue (assuming rd = 1). Packets arriving at the destination partition are queued appropriately. Each clock cycle, the bus delivers a packet from the current queue to the destination node if the queue is not empty. Otherwise, it delivers the packet from the crossbar output if its timestamp is T. The bus can dequeue the (T+1) queue if the src_can_increment signal for this bus is high, which happens when this bus is stalled by the other bus. The destination bus raises the dest_can_increment[s] signal when the current queue is empty. The global clock is only incremented when both pairs of src_can_increment and dest_can_increment are 1. The current and (T+1) queues are swapped in each time step, which is similar to double-buffering.
Most of the performance gain in DualThing2 comes after allowing the (T+1) queue to be processed when src_can_increment is high.
DualThing3
There are two interconnects: Bus 0 (routers -> other nodes) and Bus 1 (other nodes -> routers). Bus 0's destinations have only one source, so contention and interleaving on the crossbar do not cause out of order packets. We simplify that interconnect by using one set of registers at the destination partitions, since any packet delivered over the crossbar can always be delivered on the destination bus. Also, Bus 0 does not block simulation progress, as delays and interleaving of packets does not affect correctness on Bus 0.
On Bus 1, the destination partitions after the crossbar are queued into multiple queues (like Thing2). The number of queues is (ready_delta+1). Unlike Thing2, the queues are not rotated/swapped at every timestep. Instead, dequeuing logic looks at the head of each queue, and dequeues from the one with the smallest timestamp.
We need (ready_delta+1) queues because any timestep, packets that arrive on the crossbar must have timestamps that fall between T and T+ready_delta, inclusive. Thus, queuing them in (ready_delta+1) queues, and only dequeuing those with timestamp <= T will ensure packets stay in order.
Tweak: We use static priorities when destination partition chooses packets from the source partitions. Rotate the priorities around for each destination instead of always starting the search from source 0. Zero cost to implement, gains ~0.13% performance.
Thing3 needs pipelining. This could be challenging: Crossbar scheduling must occur within the first 1-2 stages, as ready_delta = 1. The critical loop starts with node's ready signals, through source partition's search and selection, crossbar scheduling (competes with other partitions), then dequeuing of the source node.
Pipelined Thing3
Rev 119: Added one pipeline stage to simulator after s2_nodes. This is functionally identical to adding a pipeline stage immediately before s2_nodes, but this was easier to code. The difference is re-timing the s2 multiplexer across stages.
As of Rev 119, the extra pipeline stage caused only very minor changes in metrics: 3 less delivered packets (still in pipeline registers at simulation end), a slight increase in late packets on Bus0 (71k vs 69k), and lower avg. destination queue length on Bus 1 (about halved).
Rev 134: Disabled delivery from crossbar. Less logic and fewer long logic paths. 0.004% worse timesteps/clk.
On-Chip Network Implementation Ideas
ready_delta > 0 [1 seems good]
If we know that an event can only cause a new event some time (dt > 0) in the future, this could allow some amount of pipelining or aggregation of packet events in the on-chip network.
Need to explore the amount of serialization caused by lock-step simulation vs. dt. This can be done with the simulator using an ideal (i.e. crossbar) interconnect, and varying the value of dt, allowing packets earlier than time (t + dt) to be sent across the interconnect. When dt == 0, this reduces to lock-step, where packets can only traverse the interconnect when its timestamp is equal to the current simulation time t.
- One big bus. Poll every node at each cycle, and send each packet sequentially. Increment simulation time when all packets for the current timestep has been sent. Even with low fmax goal of 10 MHz, it may still be possible to achieve 10 Mpkt/s despite complete serialization.
- Some form of priority queue, treating packets as "events". Simply "polling" every node for a ready bit seems wasteful (essentially a one-bit wide CAM). Can techniques used for efficient priority queues be applied in hardware? Will it serialize all packet transfers?
- Hierarchical network with multiple queues to aggregate packets sourced from one cluster of nodes, sorted by destination cluster (local cluster vs. remote cluster?). Highest-level interconnect could still be a crossbar (e.g. 8x8). (dt > 0) could allow pipelining/buffering within the interconnect.
- Or some other network which takes advantage of the window between t and t + dt...
Simulations on config2_saturated.dat on bus interconnects suggest that ready_delta = 1 already gives most of the speedup as larger values of ready_delta. ready_delta > 0 is more important when there isn't an excess of packets to send. Compare config2_saturated.dat with config2_critical.dat.
Bus interconnect bubble [yes]
Bus interconnects send one packet per clock, advancing the simulation timestep when there are no more packets to be sent during the current timstep. This requires that the hardware to determine whether there are a) no packets needing to be sent: increment timestep, b) exactly one packet needing to be sent: send it and increment timestep, or c) more than one packet needing to be sent: send it and do not increment timestep.
An alternative approach could distinguish only between two cases: a) no packets need to be sent: increment timestep, or b) one or more packets need to be sent: send it and do not increment timestep. This causes an idle cycle one clock after the last (or only) packet for a timestep is sent, if there are no more packets currently needing to be sent. This scheme allows hardware to be simpler, checking only for the existence of one ready packet, instead of one and at least one other ready packet. Scanning for ready packets is anticipated to be expensive, so not needing to scan for two packets may save a lot of hardware.
Simulations on config2_saturated.dat suggest that ready_delta >= 1 recovers most of the performance loss for the latter scheme.
Circuits
- Scan for ready packet: Use even/odd chains each fed with all inputs. Two half-length chains better than one long chain.
- A[1:0] s1 mux: Use carry chain for speed. Each node has inputs A[0] and ready. Full-length chain, unfortunately.
Status
As can be seen from the plots, current interconnect designs give a significant improvement over the bus, but are still quite far from the crossbar.
Problems
How does this scale to multiple chips? The current scheme still appears to require global synchronization at every clock cycle.


