ECE1373 Status

WW7, Feb. 12, 2009

WW9, Feb. 25, 2009

  • Changed the default skin on the wiki :P
  • Played with the GT-ITM network topology generator
  • Perl script that generates the simulation config file from GT-ITM output
  • graphviz visualization of topology
  • Ran some experiments on a larger 60-node network to compare the two bus implementations (see Bubble in on-chip network)
    • Larger network saturates the capacity of dual bus (~2 pkts delivered / simulation cycle)
    • Bubbles are more easily hidden with ready_delta=1

WW10, Mar. 5, 2009

  • Experimented with dual hybrid crossbar-bus on-chip network
    • Better than bus, but still much worse than full crossbar (graphs)
    • Also significantly worse than ideal performance of the partial crossbar (e.g. 1.7 vs. ideally 4 pkts/clk)
    • Out-of-order packet "fix" limits performance
  • Interesting problem of the week: Improving the interconnect without excessive cost. Other than load imbalance, why is the hybrid crossbar-bus performing so poorly? (Scheduling?)
  • Next
    • Perfect send-ahead (to fix out-of-order issue)
    • Perfect crossbar allocation (upper bound for hybrid crossbar performance)
    • Some cheap approximation of perfect crossbar allocation (?)
    • Clock interconnect faster than other logic?

WW11, Mar. 12, 2009

  • Hybrid crossbar-bus interconnect
    • Crossbar scheduling is not a significant bottleneck (8x4: ~4% at rd=1)
    • Out-of-order "fix" is a significant problem (8x4: ~11% at rd=1, ~20% at rd=2)
    • Load imbalance is the last performance limiter. We can nearly reach this limit (0.98 pkts/clk at rd=3, ideal send-ahead, scheduled). We'll deal with this later.
  • Thing3 needs pipelining.
    • Timestamp algorithm needs to allow "delayed" packet delivery

WW12, Mar. 19, 2009

  • Thing3 pipelined. Almost no change in metrics. Expecting 50-100 MHz.
  • Added mechanism to stop incrementing timestamp if packets become late beyond some limit.
    • We can prove that if both buses are monitored for late packets, as long as the PQ latency is >= 3 (including bandwidth component), the max delay is bounded by (delay_limit + 2)
  • Minimum router buffer size should be 2.
    • Clocks/timestep metric is 1.7% better with router buffer size = 2 than with size = 1
    • No further improvement when buffer size is increased
  • Using bits[2:1] of timestamp to select the earliest packet in both s1_nodes and s2_nodes muxes results in no change in performance
  • Decided on routing table: Use an array, 8 bit entries x 256 nodes = 2Kbit < 1 block memory
  • Timestamping algorithm still works.

WW13, Mar. 26, 2009

  • Installing ns2 on itchy.eecg
    • otcl-1.13 and TclCl-1.19 required fresh installation of tcl/tk 8.4 and the following modifications
      • Manually copied tclInt.h and tclIntDecls.h from tcl source directory to $HOME/tools/include
      • Manually created the X11 header file WinUtil.h and put into $HOME/tools/include. Modified xwd.c in the nam-1.13 distribution to include <WinUtil.h> instead of <X11/Xmu/WinUtil.h>
      • Use the following parameters to circumvent the "Can't find X include files" error
 cd into otcl-1.13 source directory
 ./configure --prefix=$HOME/tools --with-tcl=$HOME/tools --x-includes=/usr/include --x-libraries=/usr/lib
 cd into tclcl-1.19 source directory
 ./configure --prefix=$HOME/tools --with-tcl=$HOME/tools --x-includes=/usr/include --x-libraries=/usr/lib
 cd into ns directory
 ./configure --prefix=$HOME/tools --with-tcl=$HOME/tools --x-includes=/usr/include --x-libraries=/usr/lib --with-tclcl=$HOME/tools
  • Add $HOME/tools/lib (tcl lib installation directory) to LD_LIBRARY_PATH
  • Specification on hardware modules

WW14, Apr. 2, 2009

  • Packet descriptor doesn't fit in 36 bits! Using 72 bits we can only afford 64-entry buffers for each PQ
  • Working on block RAM sharing between PacketQueues
  • Found a random number generator (Tausworthe, taus88)
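
For reference, a minimal software sketch of the three-component combined Tausworthe generator usually called taus88, assuming that is the generator meant above. The class name and default seeds are illustrative only and do not come from the project code.

```python
MASK32 = 0xFFFFFFFF


class Taus88:
    """Combined Tausworthe generator with three 32-bit state words."""

    def __init__(self, s1=12345, s2=67890, s3=424242):
        # Default seeds are arbitrary illustrative values (an assumption).
        s1, s2, s3 = s1 & MASK32, s2 & MASK32, s3 & MASK32
        # The recurrence needs s1 >= 2, s2 >= 8, s3 >= 16.
        if s1 < 2 or s2 < 8 or s3 < 16:
            raise ValueError("seeds must satisfy s1 >= 2, s2 >= 8, s3 >= 16")
        self.s1, self.s2, self.s3 = s1, s2, s3

    def next_u32(self):
        """Advance the state and return the next 32-bit unsigned value."""
        b = (((self.s1 << 13) & MASK32) ^ self.s1) >> 19
        self.s1 = (((self.s1 & 0xFFFFFFFE) << 12) & MASK32) ^ b
        b = (((self.s2 << 2) & MASK32) ^ self.s2) >> 25
        self.s2 = (((self.s2 & 0xFFFFFFF8) << 4) & MASK32) ^ b
        b = (((self.s3 << 3) & MASK32) ^ self.s3) >> 11
        self.s3 = (((self.s3 & 0xFFFFFFF0) << 17) & MASK32) ^ b
        return self.s1 ^ self.s2 ^ self.s3
```

Each step uses only shifts, masks and XORs on three 32-bit registers, which is why it maps cheaply to hardware.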

WW15, Apr. 9, 2009

  • Implemented SuperPacketQueue in simulator
    • Multiple packet queues share a common buffer
    • Measured max buffer size when shared by 7 nodes in our benchmark is < 64 (so 1 block RAM = 36 bits x 128 entries can serve all)
    • Problem: fragmentation due to random access.
    • Internal free list is the solution? Requires 1 read + 1 write for each enqueue or dequeue -> double pump the RAM?
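
A rough model of the free-list idea, assuming the free-list links are stored in the unused slots of the same RAM as the packets. The class and method names are made up for illustration, and the per-queue bookkeeping (which slot belongs to which PacketQueue) is left out.

```python
class FreeListBuffer:
    """Shared buffer whose free slots form a linked list inside the buffer."""

    NIL = -1

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        # Every slot starts free; slot i holds the index of the next free slot.
        self.mem = list(range(1, capacity)) + [self.NIL]
        self.free_head = 0
        self.reads = 0   # RAM reads performed so far
        self.writes = 0  # RAM writes performed so far

    def alloc(self, item):
        """Enqueue side: store item in a free slot, return its index or NIL."""
        slot = self.free_head
        if slot == self.NIL:
            return self.NIL
        self.free_head = self.mem[slot]  # read the next-free link
        self.reads += 1
        self.mem[slot] = item            # overwrite the link with the packet
        self.writes += 1
        return slot

    def release(self, slot):
        """Dequeue side: read the packet out and return its slot to the list."""
        item = self.mem[slot]            # read the packet
        self.reads += 1
        self.mem[slot] = self.free_head  # write the link back into the slot
        self.writes += 1
        self.free_head = slot
        return item
```

In this model every alloc and every release costs exactly one read and one write, which matches the 1 read + 1 write per operation noted above, and slots can be freed in any order without fragmenting the buffer.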

WW16, Apr. 16, 2009

  • Wrote verilog for the random number generator and router components
  • Working on testbenches for the system
  • Next: TrafficGen and PacketQueue

WW17, Apr. 23, 2009

  • Configuration constraint: max_delivery_lateness <= (min_PQ_delay - 2), min_PQ_delay >= 3
  • Select-earliest at bus 0 source partitions (replacing carry chain)
    • 2-bit timestamp: 45 LUT, ~163 MHz, 4.16 pkts/clk
    • 3-bit timestamp: 64 LUT, ~137 MHz, 4.17 pkts/clk
  • Removed the output register in the FIFO path of the router and made the FIFO have a 2-cycle minimum delay (414 LEs, ~212 MHz)
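
The configuration constraint at the top of this week's list could be checked when a configuration is loaded. A small sketch follows; the function name and error strings are hypothetical, only the two inequalities come from the note above.

```python
def check_timing_config(max_delivery_lateness, min_pq_delay):
    """Return a list of violated constraints (empty list means valid)."""
    errors = []
    if min_pq_delay < 3:
        errors.append("min_PQ_delay must be >= 3")
    if max_delivery_lateness > min_pq_delay - 2:
        errors.append("max_delivery_lateness must be <= min_PQ_delay - 2")
    return errors
```

For example, check_timing_config(1, 3) returns an empty list, while check_timing_config(2, 3) reports the lateness constraint.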

WW18, Apr. 30, 2009

  • TrafficGen and PacketQueue verilog done
  • Verification/debugging in progress

WW19, May 7, 2009

  • 12-bit divide
    • Combinational: 198 LEs, 32.48 MHz...
    • Pipelined, 2-cycle: 235 LEs, 63.21 MHz
    • Pipelined, 3-cycle: 246 LEs, 85.59 MHz
    • Pipelined, 4-cycle: 249 LEs, 101.22 MHz
  • Divide-based calc_delay (4-cycle pipelined): 264 LEs, 103.58 MHz
  • Divide-based PQ (4-cycle pipelined): 503 LEs, 109.28 MHz
  • More corner cases found in the software PQ model, mainly due to the sequential ordering of events
    • To do: rewrite software model to completely separate out current state and updates to next state
  • If router has 2-cycle latency, performance suffers by 3% if buffer size is 2.
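
As a software reference for the 12-bit divide measured at the top of this list, here is a shift-and-subtract (restoring) divider. This is only one common way to build such a divider and is not necessarily how the hardware version is implemented; the function name is illustrative.

```python
WIDTH = 12


def divide_restoring(dividend, divisor):
    """Unsigned WIDTH-bit restoring division; returns (quotient, remainder)."""
    limit = 1 << WIDTH
    if not (0 <= dividend < limit and 0 <= divisor < limit):
        raise ValueError("operands must fit in %d bits" % WIDTH)
    if divisor == 0:
        raise ZeroDivisionError("divisor is zero")
    quotient = 0
    remainder = 0
    # One iteration per dividend bit, most significant bit first.
    for i in range(WIDTH - 1, -1, -1):
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        quotient <<= 1
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder
```

The loop body is a compare, a subtract and a shift, so a pipelined version can group several iterations into each stage.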

WW20, May 14, 2009

  • 72-bit 8-to-1 MUX
    • encoded select: 360 LEs
    • decoded select: 369 LEs
  • Limit bus 1 destination in queue size. When queue is full, do not update the pipeline registers.
    • Before change, max destination in queue size = 19, average size < 1
    • At FIFO size = 16: 1159 Regs + 881 LEs
    • Limit to 8, no observable performance change; 582 Regs + 376 LEs
    • Limit to 4, -2% (3.99 pkts/cycle); 157 Regs + 293 LEs
  • Simulator: node addresses now encode the destination bus (1 bit), partition (2 bits) and node index (5 bits)
  • Finished verilog coding for interconnect
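
A sketch of the simulator's 8-bit node address encoding mentioned above. The field widths (1 + 2 + 5 bits) are from the note; the bit order (bus in the top bit, node index in the low bits) and the function names are assumptions.

```python
BUS_BITS, PART_BITS, NODE_BITS = 1, 2, 5


def encode_address(bus, partition, node):
    """Pack (bus, partition, node index) into one 8-bit address."""
    if not 0 <= bus < (1 << BUS_BITS):
        raise ValueError("bus out of range")
    if not 0 <= partition < (1 << PART_BITS):
        raise ValueError("partition out of range")
    if not 0 <= node < (1 << NODE_BITS):
        raise ValueError("node index out of range")
    # Assumed layout: [bus | partition | node], most significant field first.
    return (bus << (PART_BITS + NODE_BITS)) | (partition << NODE_BITS) | node


def decode_address(addr):
    """Unpack an 8-bit address into (bus, partition, node index)."""
    node = addr & ((1 << NODE_BITS) - 1)
    partition = (addr >> NODE_BITS) & ((1 << PART_BITS) - 1)
    bus = (addr >> (PART_BITS + NODE_BITS)) & ((1 << BUS_BITS) - 1)
    return bus, partition, node
```

The three fields total 8 bits, so the 256 possible addresses line up with the 8-bit x 256-entry routing table chosen in WW12.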

WW21, May 21, 2009

  • Trying to catch up to what Danyao worked on the last few weeks (lots!)
  • Started to build a full-system verilog testbench for all the modules Danyao has written and individually tested. It's progressing, but slowly, as I'm not yet familiar with the HDL code... :)

WW22, May 28, 2009

  • top_tb now contains one TG, and both stages of Bus1 interconnect (one each) (May 25)
  • Added one Router to top_tb and both stages of Bus0 interconnect
  • Next:
    • Build top-level module
    • Build interface between FPGA board and PC, and develop the PC side software

WW23, Jun. 4, 2009

  • CAD tool chain (place, route, and byte-stream generation) done
  • Working on top-level module

WW24, Jun. 11, 2009

  • Finished top level module
  • Added stats counters
  • Control interface done
  • Removed encoders from Bus1SourcePart and Bus1DestPart and use the decoded signals to drive a decoded MUX
    • New fmax: 55.74 MHz
  • Presentation
  • To-Do:
    • Daisy needs to pad the byte stream after the routing table entries so the entire stream is a multiple of 32-bit words (done, r322)
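
The padding step amounts to rounding the stream length up to the next multiple of 4 bytes. A minimal sketch, assuming zero bytes are used as filler (the note does not say what the pad value is) and using a made-up function name:

```python
def pad_to_word(stream, word_bytes=4, fill=0):
    """Append fill bytes until len(stream) is a multiple of word_bytes."""
    remainder = len(stream) % word_bytes
    if remainder == 0:
        return bytes(stream)
    return bytes(stream) + bytes([fill]) * (word_bytes - remainder)
```

A stream that is already word-aligned is returned unchanged, so the call is safe to apply unconditionally.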

WW25, Jun. 15, 2009

  • Ported sim to use daisy format configuration file
  • Packets delivered / cycle can differ by up to 10% (4.08 vs. 3.70) using different placement/routing. Need to investigate the cause and whether a good placement and/or routing strategy exists
  • Software / HDL simulation diff
    • sim starts injecting at T=1, hdl starts injecting at T=0. This difference is consistent, and is taken into account below
    • Injection traces of sim and hdl are identical (good)
    • Packet enqueue times can be different by several time steps. This applies to all types of nodes. Most likely due to node priorities on the interconnect, which may be different between sim and hdl
    • Packet dequeue times at PQs are mostly the same, with minor differences possibly due to the same ordering problem and the fact that nodes are serialized at the PQ
    • Total number of packets received at each TG is the same (ran both simulations for 690 ms)
    • Difference in average end-to-end latency is approximately 35.3 parts per million (35ms for every 28min, negligible?)
    • Conclusion: HW appears to be working (mostly)
  • Changed the stats_pushed counter in TG to stats_received so the counter is incremented when a packet appears at the output of the input queue instead of when it enters the input queue. This is because the push (enqueue) time can be different due to node ordering in the interconnect, but the receive time (used to calculate end-to-end latency) should be the same irrespective of the ordering in the interconnect.
  • Error in ready-signal after T=65500 (0xFFDC) at PQ(246) with packet ts=65412. Likely due to wrap-around of sim_time (Fixed in max2 and by making PQ drop tail in r361)
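
To illustrate the wrap-around problem, here is a wrap-aware timestamp comparison. It assumes sim_time is a 16-bit counter (inferred from the failure near T=65500) and it is not the fix actually made in max2 or the PQ; the function names are hypothetical.

```python
TIME_BITS = 16            # assumed width of sim_time
MOD = 1 << TIME_BITS
HALF = MOD >> 1


def time_diff(a, b):
    """Signed value of (a - b) when both are TIME_BITS-wide counters."""
    d = (a - b) % MOD
    return d - MOD if d >= HALF else d


def is_due(ts, now):
    """True if timestamp ts is at or before now, allowing for wrap-around."""
    return time_diff(now, ts) >= 0
```

With ts=65412, a plain ts <= now test passes at now=65500 but fails once sim_time wraps to a small value such as 10, whereas is_due still reports the packet as due. The comparison is only meaningful while the two times are less than half the counter range apart.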

WW26, Jun. 22, 2009

  • Report done!
  • Infinite loop bug in C simulator