   ## NOTES ON THE ALGORITHM AND ITS IMPLEMENTATION

It might help to think of CoDel as two separate algorithms being run simultaneously. One is time-based and keeps the delay low & utilization high when the bottleneck bandwidth is large (MSS-packet time << target). The other is backlog based and keeps utilization high and delay as low as possible when the bottleneck bandwidth is small (MSS-packet time >> target). The two terms of the "if" are these two different algorithms. (Figure 6 of the Queue paper sort of shows these two regimes in action and the transition between them.)

The comments around the maxpacket stuff may be inadequate since what's going on is possibly a bit subtle. If you test for !curq (i.e., "Is the queue empty?"), this can completely hose link utilization on low-speed links with mixed bulk data/web traffic or bi-directional traffic (intermixed acks with the bulk data packets). Figure 2 of the Queue paper sort of hints at why - TCP's "self clocking" tends to make bulk-data packets arriving at the head of the queue be spaced a bottleneck time apart. So if only one packet is left on the queue and it's small, after it's sent the link will often go idle, waiting a maxpacket bottleneck time for the next packet to show up. To get high utilization on low-speed links (links where the time to send an MSS-sized packet is bigger than 'target'), one has to leave enough backlog on the queue to keep the link busy for a worst-case inter-arrival time. The appropriate amount of backlog can be inferred directly from the data stream - just use the maximum outgoing packet size. Remember that our goal was to get the queue as small as possible without impacting utilization.

### 05.11.12: Calculating the inverse square root in real life

There's no need for a lookup table. We're computing an inverse square root, which is even cheaper than a square root (it only requires multiplies). And we generate the roots sequentially, which means you can start from the previous value and remove about half the computation. See Wikipedia.

E.g., if you save the previous invsqrt (and set it to 1 when you set count to 1) and use one iteration of Newton's method, the computation would be:

    invsqrt = (invsqrt >> 1) * (3 - count * invsqrt * invsqrt)

The worst case error is 0.02% (when count is 2) and goes down quadratically with count (the error is 4 parts in 1,000,000 when count is 1000). If you're doing the computation in 32-bit ints, then since invsqrt is < 1 for count > 1, you prescale invsqrt by 2^16 (i.e., treat it as a fixed-point binary number with the binary point at the 16th bit), so the Newton step is the above and the control law computation becomes:

    (interval * invsqrt) >> 16

Just shifts, adds & integer multiplies - can't get much simpler hardware than that. It's maybe 2 bytes of sequencer state if the design uses an ALU, and <2000 gates if the whole thing is unrolled as discrete logic.