Updated 2018-03-12
Timing closure is just a fancy term for ensuring that the design meets its timing constraints. So far this has not been complicated for the simple labs that we have done.
To find out how fast our circuit can run in Quartus, we go to the report produced by TimeQuest Timing Analysis. The value we’re looking for is Fmax, which tells us the maximum frequency our design can run at. It is based on register-to-register delays: the tool looks at all the FFs and works out the delay of the critical path.
If multiple clocks exist in the design, the report has one Fmax entry per clock (two entries for two clocks). This is because the analysis tool finds a separate critical path for each clock.
How do we tell the tool which frequency to use? We use a timing constraint, which specifies a desired clock frequency. The TimeQuest timing analysis wizard lets us specify it (in assignment menu -> timing analysis wizard).
If the constraint is not set, the tool will aim for an unachievable clock frequency such as 1 GHz.
How does the timing constraint relate to optimization?
Slow clock: leads to small circuit with more weighting on optimizing for logic utilization
Fast clock: leads to larger circuit with more things happening in parallel
Fundamentally, the architecture is all about trade-offs. So choose a desired clock frequency and a corresponding circuit size based on the requirements.
Example: optimization of lab 1 design
| Scenario | Fmax | Area of circuit |
| --- | --- | --- |
| No constraints | 133 MHz | 148 |
| 130 MHz constraint | 130 MHz | 143 |
| Timing constraint of 10 MHz | 112 MHz | 135 |
page 10
Fmax is really slow and no clock works for this design.
Solution: find the longest path. The tool can display Worst-Case Timing Paths. We see that there are paths that have negative slack (meaning that they violate the timing constraints).
Click on “Report worst-case path” (in the TimeQuest UI) for the path.
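As a rough illustration of what negative slack means, the sketch below computes setup slack as the required time minus the data arrival time. The path names and delays are hypothetical and are not taken from the lab design; a real timing report also accounts for clock-to-q time and clock network delays.

```python
def slack_ns(clock_period_ns, data_arrival_ns, setup_ns=0.0):
    """Setup slack: time to spare before the capturing clock edge.

    A negative result means the path violates the timing constraint.
    """
    required_ns = clock_period_ns - setup_ns
    return required_ns - data_arrival_ns


# Hypothetical paths and arrival times, for illustration only.
paths = {"counter -> adder": 6.0, "i -> divider -> led": 14.5}
period_ns = 10.0  # a 100 MHz constraint

# The critical path is the one with the smallest slack.
worst = min(paths, key=lambda name: slack_ns(period_ns, paths[name]))
failing = [name for name in paths if slack_ns(period_ns, paths[name]) < 0]
```

Here `worst` names the path with the least slack, and `failing` lists every path whose slack is negative.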
We see that the delay could potentially be caused by the combinatorial divider. page 13
We go into the code and find that it could potentially be caused by the expression % 256, but after we fix that and recompile, nothing has changed. The division is not actually occurring here: 256 is a power of 2, so the modulo only needs the low 8 bits and no divider is built. The division is happening elsewhere, so we keep looking.
page 16
We find another expression, i%3. Since 3 is not a power of 2 and i is 32 bits wide, the tool made a 32-bit divider. Fixing this solved the problem, and Fmax increased.
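The difference between the two modulo expressions can be checked with a small sketch. It is written in Python rather than the lab's HDL, and `mod_pow2` is a hypothetical helper name: a modulo by a power of two only keeps the low bits, while a modulo by 3 has no such shortcut.

```python
def mod_pow2(i, k):
    """Compute i % 2**k for a non-negative integer i by keeping the low k bits."""
    return i & ((1 << k) - 1)


# i % 256 is the same as keeping the low 8 bits, so no divider is needed.
for i in (0, 1, 255, 256, 257, 2**32 - 1):
    assert mod_pow2(i, 8) == i % 256

# The same trick does not work for 3: the low bits of 4 (0b100) are 0,
# but 4 % 3 is 1, so real division logic is required.
low_two_bits_of_4 = 4 & 0b11
remainder_of_4_by_3 = 4 % 3
```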
It is possible to run faster than the reported Fmax, since it is only a conservative estimate. However, there is no guarantee that the design will work, and doing so is very dangerous: operating conditions such as temperature affect the timing.
If a critical path is too long to satisfy timing closure, then we can try pipelining: cutting the critical path into pieces and connecting them with FFs.
For instance, if the critical path used to take 5 ns and we split it into two paths of 2 ns and 3 ns, we have improved Fmax from 200 MHz to 333 MHz. However, it now takes 2 cycles to complete the computation.
Because of this, pipelining would be helpful if:
Pipelining trades extra latency for a faster clock and higher throughput.
Note: we need to change the rest of the design to account for the fact that the output now takes more cycles to appear.
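The 5 ns example above can be checked numerically. This is a minimal sketch that assumes the clock period equals the slowest stage delay and ignores setup and clock-to-q time; the function names are hypothetical.

```python
def fmax_mhz(stage_delays_ns):
    """Maximum clock frequency in MHz, limited by the slowest stage."""
    return 1000.0 / max(stage_delays_ns)


def latency_ns(stage_delays_ns):
    """Total time for one result: every stage takes one full clock period."""
    return len(stage_delays_ns) * max(stage_delays_ns)


unpipelined = [5.0]       # one 5 ns critical path
pipelined = [2.0, 3.0]    # the same path cut into 2 ns and 3 ns stages

fmax_before = fmax_mhz(unpipelined)   # 200 MHz
fmax_after = fmax_mhz(pipelined)      # about 333 MHz
```

Under these assumptions the latency grows from 5 ns (one 5 ns cycle) to 6 ns (two 3 ns cycles), while a new result can be produced every 3 ns instead of every 5 ns.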
We cannot add too many pipeline stages, because the setup time and clock-to-q time of each FF are fixed physically. As the clock frequency rises, these two overheads take up more and more of the clock period, until there is no time left for logic.
This is especially true for FPGAs, since the chains of flip-flops need to be routed across many logic blocks, which adds more propagation delay.
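The diminishing return from extra stages can be sketched as follows. The model assumes the logic splits perfectly evenly across stages, and the 0.3 ns clock-to-q and 0.2 ns setup values are made-up numbers for illustration, not figures for any real device.

```python
def pipelined_fmax_mhz(logic_ns, stages, t_clk_to_q_ns, t_setup_ns):
    """Fmax in MHz when logic_ns of logic is split evenly across `stages` stages.

    Each stage still pays the full clock-to-q and setup overhead.
    """
    period_ns = t_clk_to_q_ns + logic_ns / stages + t_setup_ns
    return 1000.0 / period_ns


# Assumed overheads, for illustration only.
T_CQ, T_SETUP = 0.3, 0.2

# Fmax can never exceed this ceiling, no matter how many stages are added.
ceiling_mhz = 1000.0 / (T_CQ + T_SETUP)

for n in (1, 2, 5, 10, 100):
    print(n, round(pipelined_fmax_mhz(5.0, n, T_CQ, T_SETUP), 1))
```

Each additional stage buys less, and Fmax approaches but never reaches the ceiling set by the FF overheads alone.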
Quartus will not pipeline the design automatically, but it will retime.
The key to pipelining is to make every stage balanced (in terms of delay).
page 31
Notice that the first path takes 2 ns and the critical path takes 4 ns. What we can do is move the inverter from behind the FF to in front of it, so that each path takes 3 ns. Repositioning the inverter does not change the function of the circuit.
But be careful if the output of the inverter is immediately used elsewhere, in which case, the logic in the second path needs to be adjusted.
Ultimately, retiming can increase Fmax.
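The retiming example above can be expressed as a small calculation. This sketch assumes the inverter contributes 1 ns (implied by the 2 ns / 4 ns paths becoming 3 ns / 3 ns) and ignores FF overheads; `retime` is a hypothetical helper.

```python
def retime(before_ff_ns, after_ff_ns, moved_gate_ns):
    """Move a gate from the stage after the FF to the stage before it.

    Returns the new (before, after) stage delays.
    """
    return before_ff_ns + moved_gate_ns, after_ff_ns - moved_gate_ns


def fmax_mhz(stage_delays_ns):
    """Maximum clock frequency in MHz, limited by the slowest stage."""
    return 1000.0 / max(stage_delays_ns)


original = (2.0, 4.0)
balanced = retime(2.0, 4.0, moved_gate_ns=1.0)  # assumed 1 ns inverter

fmax_original = fmax_mhz(original)   # limited by the 4 ns path
fmax_balanced = fmax_mhz(balanced)   # limited by a 3 ns path
```

The total logic delay is unchanged at 6 ns; only its distribution across the FF changes, which is why balancing raises Fmax without adding latency in cycles.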
Limit of Retiming
page 34
Consider that we have some initial value and we want to perform retiming.
We move the combinational logic to the left so that it feeds a single output register. It is easy for the compiler to work out the initial value of that output register from the initial values of the original registers.
However, this is much more difficult in reverse. So the rule of thumb is that initial values for registers may limit opportunities for retiming.
The clock arrives later in some parts of the circuit than in others, because of physical limits such as wire delay. There is no guarantee that a clock edge arrives at all flip-flops simultaneously.
The implications of clock skew include
page 41
The clock period expression needs to be adjusted to account for clock delay:
\[t_{clock}\geq t_{clk-to-q}+t_{logic}+t_{setup}-D\]
In this case, the clock skew D actually improves the design: the clock reaches the end of the critical path later, so the logic in the critical path has more time.
page 43
If the clock arrives from the opposite direction, then we do not have this advantage.
\[t_{clock}\geq t_{clk-to-q}+t_{logic}+t_{setup}+D\]
This decreases our maximum clock frequency.
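Both clock-period expressions can be folded into one function with a signed skew. In this sketch a positive `skew_ns` means the clock reaches the destination FF later than the source FF (the -D case), and a negative value is the opposite direction (the +D case). The delay values are made up for illustration.

```python
def min_clock_period_ns(t_clk_to_q_ns, t_logic_ns, t_setup_ns, skew_ns=0.0):
    """Smallest clock period that still meets setup time.

    skew_ns > 0: the destination FF's clock arrives later (helps setup).
    skew_ns < 0: the destination FF's clock arrives earlier (hurts setup).
    """
    return t_clk_to_q_ns + t_logic_ns + t_setup_ns - skew_ns


# Assumed delays, for illustration only.
T_CQ, T_LOGIC, T_SETUP, D = 0.5, 4.0, 0.5, 0.5

no_skew = min_clock_period_ns(T_CQ, T_LOGIC, T_SETUP)
helpful = min_clock_period_ns(T_CQ, T_LOGIC, T_SETUP, skew_ns=+D)
harmful = min_clock_period_ns(T_CQ, T_LOGIC, T_SETUP, skew_ns=-D)
```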
Clock skew can also cause hold time violations when the skew is large compared to the delay of the data path. If the clock at the destination FF is late, the new value from the source FF can arrive before the destination FF has finished holding the old one, and thus a wrong value is sampled.
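A common textbook form of the hold check, which is not taken from these notes, is that the fastest data path must satisfy t_clk-to-q + t_logic,min >= t_hold + D, where D is how much later the clock reaches the destination FF. The sketch below uses made-up delay values to show a path that passes without skew and fails with it.

```python
def hold_slack_ns(t_clk_to_q_ns, t_logic_min_ns, t_hold_ns, skew_ns):
    """Hold slack for the fastest data path between two FFs.

    skew_ns is how much later the clock reaches the destination FF.
    A negative result means a hold time violation.
    """
    return t_clk_to_q_ns + t_logic_min_ns - (t_hold_ns + skew_ns)


# Assumed delays: two FFs connected directly, with no logic in between.
without_skew = hold_slack_ns(0.5, 0.0, 0.25, skew_ns=0.0)
with_skew = hold_slack_ns(0.5, 0.0, 0.25, skew_ns=1.0)
```

Unlike a setup violation, a hold violation cannot be fixed by slowing the clock, because the clock period does not appear in the check.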
page 46
The moral is to avoid clock skew at all times.
If the components are hooked up so that the clock follows a clock tree such as an H-tree (a fractal layout), and every component sits at the same level of the tree, then all components should see the same clock delay.
In an FPGA, a Phase Locked Loop (PLL) is used.
Phase locked loops exist in modern FPGAs. A PLL is a mixed-signal (analog and digital) circuit that generates an output clock aligned to an input clock.
Consider generating a 25 MHz clock from a 50 MHz clock. We use a counter that counts to 1 (a flip-flop fed back through a NOT gate). Note that the flip-flop has its own delay, which creates two problems:
page 62
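The divide-by-two counter described above can be modelled as a register that toggles on every rising edge of the input clock. This is an idealized behavioural sketch in Python with no flip-flop delay, so it does not show the problems caused by that delay; `divide_by_two` is a hypothetical name.

```python
def divide_by_two(input_clock):
    """Toggle a 1-bit register on every rising edge of input_clock.

    input_clock is a list of 0/1 samples; the result has the same length.
    """
    q, prev, out = 0, 0, []
    for clk in input_clock:
        if prev == 0 and clk == 1:  # rising edge of the input clock
            q ^= 1                  # q <= NOT q
        out.append(q)
        prev = clk
    return out


clk_50 = [0, 1] * 4                 # four periods of the fast clock
clk_25 = divide_by_two(clk_50)      # [0, 1, 1, 0, 0, 1, 1, 0]
```

The output repeats every four samples instead of every two, which is half the input frequency.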
A glitch is a brief, unintended change in a circuit's output that happens while the combinational logic is still settling to its final value.
page 65
page 66
Glitches usually don’t matter if we’re not constantly observing the output. However, this matters if we care about energy use since it takes energy to perform transitions.
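A glitch can be reproduced with a tiny discrete-time model. The circuit here is an assumed example, not one from the slides: f = a AND (NOT a), which is logically always 0, but the inverter output lags by one time step, so f pulses briefly when a rises.

```python
def and_with_delayed_inverter(a_samples, inverter_delay=1):
    """Simulate f = a AND (NOT a) where the inverter output lags by inverter_delay steps."""
    out = []
    for t, a in enumerate(a_samples):
        # The inverter still reflects the value of a from inverter_delay steps ago.
        a_seen = a_samples[t - inverter_delay] if t >= inverter_delay else a_samples[0]
        not_a = 1 - a_seen
        out.append(a & not_a)
    return out


def count_transitions(samples):
    """Number of 0->1 or 1->0 changes; each one costs switching energy."""
    return sum(1 for x, y in zip(samples, samples[1:]) if x != y)


a = [0, 0, 1, 1, 1, 1]
f = and_with_delayed_inverter(a)    # [0, 0, 1, 0, 0, 0]
wasted = count_transitions(f)       # 2 transitions on a signal that should stay 0
```

If f is only sampled after it settles, the pulse is harmless, but the two extra transitions still consume energy.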