Speculative Packet Dispatch for Virtual Output Queuing Architecture Using LSTM Recurrent Neural Network

Virtual Output Queuing (VOQ) is an architecture widely employed in modern networking products. Traffic from every ingress port is stored in a set of queues mirroring the structure of the egress ports. This architecture allows congestion on one egress port to be isolated from the other ports. A request-grant protocol is used to route packets from ingress to egress. When a packet is received, a request signal is issued. After the request reaches the egress side, a grant signal is generated based on a fixed threshold indicating that there is space in the egress buffer to absorb the largest packet size dispatched from ingress. The buffer must be sized deep enough to accommodate in-flight traffic associated with a scenario where heavy congestion is found after the grant is issued. Waiting for a grant signal to arrive before dispatching packets incurs significant end-to-end latency. To alleviate this problem, a speculative packet dispatch (SPD) approach is proposed in which the request-grant protocol is completely eliminated. Packets are dispatched speculatively from ingress to egress based on predictions that there is enough space in the egress buffer. This is achieved by incorporating an LSTM recurrent neural network as part of the VOQ controller. The LSTM is trained on time-series data sets generated from past observations of the queue occupancy. The experimental results show that SPD lowers latency, raises throughput, reduces buffering requirements and preserves the per-port isolation property of VOQ.


Introduction
Demands on network switches and routers have continued to increase since the early days of the Internet. With the emergence of new generations of use cases such as cloud computing, high-definition video streaming, artificial intelligence applications and 5G mobile communications, switching devices with larger capacity and higher bandwidth are required to meet these demands [1][2][3].
The fundamental technology for high-speed packet switching is based on a switch with $N$ input (ingress) ports and $N$ output (egress) ports. Packets are received by the ingress ports and placed in the ingress queues. After they are classified by a forwarding engine, the packets are forwarded via an $N \times N$ cross-bar to the appropriate egress queues corresponding to the destination egress ports. There are various ways to implement this architecture. Some implementations assign a set of dedicated buffers for the ingress queues and another set for the egress queues. Other implementations employ more of a shared buffer approach where the memory that stores the packets is shared among the ingress and egress ports [4,6,7].
Two modes of operation normally supported by modern switches are cut-through and store-and-forward [2]. In the cut-through mode, packets are sent to the egress ports as soon as possible to minimize the input-output delay. Because packet inspection and packet transmission occur at the same time, a mechanism is put in place to indicate to the receiver whether the packet being transmitted is a good packet or a corrupted packet. If it is a corrupted packet, the error checking mechanism puts invalid error check bits at the end of the transmitted packet. The store-and-forward mode, on the other hand, does not send the packet out until the entire packet has been received and inspected, which causes a significant increase in the input-output delay. This mode allows the switch to drop corrupted packets internally. Hence, the switch guarantees that only good packets are sent out. Regardless of whether the architecture is cut-through or store-and-forward, packet queuing is mandatory to manage congestion. Assuming the ingress and egress ports run at the same speed, traffic from two or more ingress ports destined to the same egress port will create congestion. If a downstream device asserts a flow control signal to an egress port, congestion is also created because the ingress rate is now larger than the egress rate. In both cases, packets need to be temporarily stored in the ingress or egress queues. To avoid packet loss in the switch, the queues must be large enough to absorb the in-flight traffic during congestion, which is assumed to be transitory. Packet loss cannot be avoided in sustained congestion since avoiding it would require an infinite amount of buffering [14].
A simple ingress-egress queueing structure has an inherent problem known as Head-of-Line (HOL) blocking, in which congestion in one egress port can cause considerable throughput degradation in other egress ports [10,11]. To solve this problem, an architecture known as Virtual Output Queuing (VOQ) was developed [5,6]. In this architecture every ingress port has a set of queues, each of which is associated with an egress port. Received packets are classified and placed in their appropriate virtual output queues. Although referred to as output queues, these queues are actually located on the ingress side, hence the name virtual output queues. The packets wait in these queues until there is space in the egress queues, at which point they are dispatched to the egress side. This technique ensures that egress ports that are congested are isolated.
VOQ is a powerful architecture that has been employed in networking products for many years. However, its performance is heavily dependent on various factors such as round-trip latency, egress queue occupancy and packet size [15]. This paper suggests a way to overcome these limitations by proposing a method called Speculative Packet Dispatch (SPD) where packets are speculatively dispatched from a VOQ on the ingress side. The effectiveness of this method is largely determined by the prediction accuracy of a dispatch scheduler that constantly monitors the status of every destination egress queue. The scheduler is driven by a Long Short Term Memory (LSTM) network, which is a class of Recurrent Neural Networks (RNN). The LSTM is trained on the egress queue occupancy status represented as time-series data.
The rest of this paper is organized as follows: sections 2 and 3 describe the VOQ and Time-Series LSTM, respectively, section 4 presents the proposed technique, section 5 discusses the experimental results and section 6 ends the paper with some concluding remarks.

Virtual Output Queuing
Most modern switches provide deep packet buffering before the cross-bar and shallow buffering on the output side before the egress ports. The forwarding engine and the scheduler control how to fetch a packet from an input queue and route it through the cross-bar to a particular destination port. If none of the egress ports are congested, then traffic can flow at line-rate and port fairness is achieved. Preserving port fairness in the presence of congestion is a non-negotiable requirement for modern switches and routers. Congestion that occurs on one egress port should not affect the traffic going to other unrelated egress ports. Port fairness cannot be guaranteed in the traditional method of ingress buffering. Consider the scenario depicted in Figure 1. Packets from Ingress Port 1 (IP1) are going to Egress Port 1 (EP1) and Egress Port 2 (EP2). A flow control issued by a congested downstream device on EP1 stops the switch from transmitting on that port, but traffic from IP1 to EP2 should still flow. However, this flow control on EP1 causes all packets to wait in IP1's queue, including those packets whose destination is EP2, until the flow control is deasserted [13]. This condition is called Head-of-Line Blocking (HOLB).
There is another scenario that creates performance degradation without the presence of HOLB. Consider four flows: IP1 to EP1, IP1 to EP2, IP2 to EP1 and IP2 to EP2. Assume 75% of the packets from IP1 and IP2 are destined for EP1, while the rest go to EP2. The throughput of each flow to EP1 will drop by 25 percentage points (from 75% to 50% of line rate), which is acceptable since EP1 is 50% oversubscribed. However, the throughput of the other two flows to EP2 is also affected. This is not acceptable since EP2 is not oversubscribed.
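The arithmetic of this scenario can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the authors' model: it assumes each congested egress port is shared equally among the ingress ports, and that a single FIFO per ingress port drains in arrival order so that the EP2 flow is slowed by the same factor as the EP1 flow sharing its queue. The function name `fifo_vs_voq` is hypothetical.

```python
from fractions import Fraction

def fifo_vs_voq(offered_ep1, offered_ep2, n_ingress=2, egress_capacity=Fraction(1)):
    """Per-flow throughput (fraction of line rate) for one ingress port,
    with a single FIFO per ingress versus one virtual output queue per egress."""
    # Assumption: a congested egress port is split evenly among the ingress ports.
    ep1_share = min(offered_ep1, egress_capacity / n_ingress)
    # FIFO: packets leave in arrival order, so the EP2 flow is slowed by the
    # same factor as the throttled EP1 flow that shares its queue.
    slowdown = ep1_share / offered_ep1
    fifo = {"to_EP1": ep1_share, "to_EP2": offered_ep2 * slowdown}
    # VOQ: the EP2 flow has its own queue and is limited only by EP2 itself.
    ep2_share = min(offered_ep2, egress_capacity / n_ingress)
    voq = {"to_EP1": ep1_share, "to_EP2": ep2_share}
    return fifo, voq

# 75% of each ingress port's traffic goes to EP1, 25% to EP2.
fifo, voq = fifo_vs_voq(Fraction(3, 4), Fraction(1, 4))
print("FIFO:", fifo)
print("VOQ: ", voq)
```

Under these assumptions each EP1 flow falls from 3/4 to 1/2 of line rate in both cases, while the EP2 flow falls from 1/4 to 1/6 with a shared FIFO and keeps its full 1/4 with separate queues.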
The solution for all these limitations is to create a set of output queues placed on the ingress side that mirrors the egress ports as shown in Figure 2 [5,10]. These queues are referred to as virtual queues [8,9,12], hence the term Virtual Output Queue (VOQ). It is easy to see that a flow control on EP1, which blocks all of EP1's virtual output queues from sending packets, does not prevent the switch from routing packets destined for EP2 since those packets are stored in different virtual queues. Furthermore, EP2 is also impervious to an oversubscription condition on EP1. This is an elegant architecture that provides port fairness across all ingress and egress ports. The cost is a significant increase in the number of queues in the switch. If a switch has $N$ ingress ports and $N$ egress ports, then $N^2$ queues are required to support VOQ instead of $N$ queues in the non-VOQ architecture.
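The queue organization described above can be illustrated with a small data-structure sketch. It is a toy model only: the class name `VoqSwitch`, the fixed-priority arbitration (lowest ingress index first) and the one-packet-per-egress dispatch round are assumptions made for brevity and do not come from the paper. Ports are indexed from 0, so index 0 stands for IP1 or EP1.

```python
from collections import deque

class VoqSwitch:
    """Toy switch with one virtual output queue per (ingress, egress) pair."""

    def __init__(self, n_ports):
        self.n_ports = n_ports
        # voq[i][e] holds packets received on ingress i and destined for egress e.
        self.voq = [[deque() for _ in range(n_ports)] for _ in range(n_ports)]
        self.paused = set()  # egress ports currently under flow control

    def enqueue(self, ingress, egress, packet):
        self.voq[ingress][egress].append(packet)

    def dispatch_round(self):
        """Send at most one packet to each egress port that is not paused."""
        sent = []
        for egress in range(self.n_ports):
            if egress in self.paused:
                continue  # only this egress port's VOQs are held back
            for ingress in range(self.n_ports):
                if self.voq[ingress][egress]:
                    sent.append((ingress, egress, self.voq[ingress][egress].popleft()))
                    break
        return sent

    def queue_count(self):
        return sum(len(row) for row in self.voq)

switch = VoqSwitch(n_ports=2)
switch.enqueue(0, 0, "pkt-A")   # IP1 -> EP1
switch.enqueue(0, 1, "pkt-B")   # IP1 -> EP2
switch.paused.add(0)            # flow control asserted on EP1
sent = switch.dispatch_round()
print(sent)                     # pkt-B is dispatched even though pkt-A arrived first
print(switch.queue_count())     # number of virtual output queues in the switch
```

The packet for EP2 is dispatched while the packet for EP1 stays queued, which is the isolation property the text describes. A production scheduler would replace the fixed-priority loop with a fair arbitration scheme.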

Long Short Term Memory for Time Series Predictions
An artificial neural network consists of an input layer, an output layer and one or more in-between layers referred to as hidden layers [16][17][18]. The output of a given layer, which is a function of the weighted sum of its input, becomes the input of the next layer. The weights are learned using training examples via gradient descent based optimization. The gradients are computed starting from the output layer backwards all the way to the input layer. This is called backward propagation. A recurrent neural network is a neural network that has an additional path connecting the output back to itself. For each time step $t$, a recurrent neuron receives the input as well as its own output from the previous step $t-1$, as shown in Figure 3 (left). The network can be unfolded as depicted in Figure 3 (right) to explicitly show the information flow [19][21][22][23]. Training an RNN is achieved by backward propagation through time [21]. Although it is conceptually straightforward, in practice it is very challenging due to the vanishing and exploding gradient problems. As the gradients are computed recurrently, they can become extremely small or extremely large, making the training ineffective because the network can no longer capture long-term dependencies.
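The vanishing and exploding gradient problem can be made concrete by writing out the recurrence for a simple RNN. The $\tanh$ activation and the symbols $W$, $U$, $x_t$ and $h_t$ are assumptions introduced here for illustration; the discussion above does not fix a particular activation or notation.

```latex
h_t = \tanh\left(W x_t + U h_{t-1}\right), \qquad
\frac{\partial h_t}{\partial h_k}
  = \prod_{j=k+1}^{t} \operatorname{diag}\left(1 - h_j^{2}\right) U
```

The Jacobian is a product of $t-k$ factors, so its norm tends to shrink toward zero or grow without bound as the gap between $k$ and $t$ widens, which is why dependencies across many time steps are hard to learn.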
Long Short Term Memory (LSTM) is a special type of RNN shown in Figure 4 [19]. In addition to the output $h_t$, a cell state $c_t$ is provided. The architecture is designed to mitigate the vanishing and exploding gradients by incorporating gates to regulate the information flow into and out of the cell. A gate consists of a sigmoid function followed by a pointwise multiplication operation.
An LSTM cell has three types of gates: forget gate, input gate and output gate [20]. The forget gate, controlled by $f_t$, controls what information should be discarded from the cell.
$$f_t = \sigma\left(W_f x_t + U_f h_{t-1}\right)$$
The input gate, controlled by $i_t$, controls how much new information should be added into the cell state from the input.
$$i_t = \sigma\left(W_i x_t + U_i h_{t-1}\right)$$
$$\tilde{c}_t = \tanh\left(W_c x_t + U_c h_{t-1}\right)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
The output gate, controlled by $o_t$, determines what to output from the cell.
$$o_t = \sigma\left(W_o x_t + U_o h_{t-1}\right)$$
$$h_t = o_t \odot \tanh\left(c_t\right)$$
where $W_f$, $W_i$, $W_c$, $W_o$ are the weight matrices of each layer corresponding to the input $x_t$, and $U_f$, $U_i$, $U_c$, $U_o$ are the weight matrices of each layer corresponding to the previous time step.
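The gate operations described above can be sketched as a single time step in plain Python. This is a minimal illustration of the standard LSTM cell, not the network used in the experiments: the function name `lstm_step`, the dictionary keys `"f"`, `"i"`, `"c"`, `"o"` and the omission of bias terms are assumptions made here.

```python
import math

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def matvec(m, v):
    return [sum(w * a for w, a in zip(row, v)) for row in m]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def lstm_step(x, h_prev, c_prev, W, U):
    """One LSTM time step. W[g] and U[g] are the input and recurrent weight
    matrices for g in {"f", "i", "c", "o"}; bias terms are omitted."""
    pre = {g: add(matvec(W[g], x), matvec(U[g], h_prev)) for g in "fico"}
    f = sigmoid(pre["f"])                       # forget gate
    i = sigmoid(pre["i"])                       # input gate
    c_tilde = [math.tanh(a) for a in pre["c"]]  # candidate cell state
    o = sigmoid(pre["o"])                       # output gate
    # New cell state: keep part of the old state, add part of the candidate.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]
    # Output: gated, squashed view of the cell state.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

# One input feature, one hidden unit, all weights zero: every gate is 0.5.
zeros = {g: [[0.0]] for g in "fico"}
h, c = lstm_step(x=[0.3], h_prev=[0.0], c_prev=[2.0], W=zeros, U=zeros)
print(h, c)
```

With all weights set to zero every gate evaluates to 0.5, so the cell state is simply halved, which makes the role of the forget gate easy to check by hand.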

Proposed Architecture
As discussed earlier, VOQ is the architecture of choice in a networking device because it transmits packets efficiently from the ingress side to the egress side. It is generally understood that higher egress buffer occupancy can be caused by either a flow control assertion from a downstream device or an oversubscription due to the dynamic nature of the traffic. As shown in Figure 5, buffer occupancy has a direct impact on throughput and latency, which constitute the overall system performance. The performance drops rapidly as the buffer fills up, indicating that the system cannot drain the data fast enough. Without intervention, this condition eventually leads to an overflow causing packet drop. Dropped packets have to be retransmitted from ingress, incurring further performance hits. Alternatively, ingress packet transmission is stopped if a predetermined threshold on the egress buffer is exceeded. This approach prevents the egress buffer from overflowing but it requires ingress to have prior knowledge of the buffer occupancy before sending the packet. In other words, ingress sends a request, waits for a grant from egress, then dispatches the packet. This sequence of events is typically referred to as a request-grant handshake. Again, performance degradation is unavoidable.
The proposed architecture presented in this paper minimizes the performance hit by eliminating the need for a request-grant handshake. Ingress speculatively dispatches full or partial packets based on the predicted egress buffer occupancy, hence the name Speculative Packet Dispatch (SPD). A time-series LSTM network functions as a predictor. Based on a series of occupancy samples over time, the LSTM predicts the current value. A flow chart is shown in Figure 6.
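The dispatch decision can be sketched as follows. This is an illustration only and does not reproduce the flow chart in Figure 6: the predictor below is a simple linear extrapolation standing in for the trained LSTM, and the names `predict_occupancy`, `should_dispatch` and `guard_band` are hypothetical.

```python
def predict_occupancy(history):
    """Stand-in predictor: linear extrapolation of the last two samples.
    In the proposed scheme a trained LSTM would take this role."""
    if len(history) < 2:
        return history[-1]
    return max(0, history[-1] + (history[-1] - history[-2]))

def should_dispatch(history, packet_len, buffer_depth, guard_band=0):
    """Dispatch speculatively if the predicted occupancy leaves room for the packet."""
    predicted = predict_occupancy(history)
    return predicted + packet_len + guard_band <= buffer_depth

# Hypothetical egress buffer of 4096 bytes, occupancy sampled at fixed intervals.
filling = [1000, 1800, 2600]    # rising by 800 bytes per sample
draining = [2600, 1800, 1000]   # falling by 800 bytes per sample
print(should_dispatch(filling, packet_len=1518, buffer_depth=4096))
print(should_dispatch(draining, packet_len=1518, buffer_depth=4096))
```

In the filling case the extrapolated occupancy is 3400 bytes, so a 1518-byte packet would not fit and the dispatch is held back; in the draining case the extrapolated occupancy is 200 bytes and the packet is sent without waiting for a grant.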
Assuming store-and-forward and neglecting the delay through the cross-bar, the performance of a typical VOQ switch is given by the pseudo-code shown in Figure 7, where $T_{pkt}$ is the ingress packet time (the time it takes to transmit the entire packet), $T_{rg}$ is the request-grant latency, $D_{buf}$ is the depth of the buffer and $L_{pkt}$ is the length of the packet.
In a non-speculative approach, when ingress is ready to send a packet, it issues a request then waits for a grant before dispatching the packet. Another factor to be considered is the size of the egress buffer. Unlike SPD, which is a predictive algorithm, a non-speculative approach must allocate a fixed amount of storage space to absorb the in-flight data. The minimum depth of the buffer is given by
$$D_{min} = \left(1 - \frac{R_{egr}}{R_{ing}}\right) \times D_{inflight} + M$$
where $R_{ing}$ is the ingress rate, $R_{egr}$ is the egress rate, $D_{inflight}$ is the in-flight data and $M$ is the safety margin. The term $1 - R_{egr}/R_{ing}$ describes the average congestion experienced by the system.
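The buffer sizing relation can be evaluated numerically. The sketch below is one reading of that relation (in-flight data scaled by the average congestion term, plus a safety margin); the function name, the parameter names and all numeric values are hypothetical and are not taken from the paper's tables.

```python
def min_buffer_depth(ingress_rate, egress_rate, in_flight_bytes, margin_bytes):
    """Minimum egress buffer depth, in bytes, for a non-speculative system."""
    # Average congestion: 0 when egress keeps up with ingress, 1 when fully blocked.
    congestion = 1.0 - egress_rate / ingress_rate
    return congestion * in_flight_bytes + margin_bytes

# Hypothetical case: 10 Gbps ingress, egress throttled to 5 Gbps,
# 2000 bytes in flight and a 1518-byte safety margin.
depth = min_buffer_depth(10e9, 5e9, in_flight_bytes=2000, margin_bytes=1518)
print(depth)
```

When the egress rate equals the ingress rate the congestion term vanishes and only the safety margin remains, which matches the intuition that an uncongested port needs little egress buffering.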

Experimental Results and Analysis
Four areas were investigated to verify the viability of the proposed architecture: egress buffer occupancy prediction, request-grant delay, VOQ performance and minimum buffer size requirements. Simulations were performed using TensorFlow running on Google Colab for the LSTM-based egress buffer occupancy prediction and using System Verilog

Egress Buffer Occupancy Prediction
The first five rows of Table 2 are cases where the congestion starts to take place. Data is entering into the buffer at a higher rate than it is exiting causing the buffer to fill up. Eventually, ingress will stop sending traffic. The last five rows correspond to the scenarios where the egress rate is higher than the ingress rate because either the congestion has begun to subside or the ingress traffic has slowed down.
Given the past buffer occupancies (column 1), the Time-Series LSTM predicts the most likely current occupancy level (column 2). These predictions closely track the actual buffer occupancy listed in column 3. This is important because prediction accuracy relates directly to the overall performance of the system.
Figure 8 shows the system throughput as a function of the request-grant round trip time (RTT) normalized by the ingress packet time. In non-speculative systems the request-grant RTT becomes the critical path as soon as this ratio exceeds 1. If the ingress packet time is larger than the request-grant RTT, the grant for the next packet is received while the transmission of the current packet is still in progress. The transmission of the next packet can proceed immediately, yielding 100% throughput. However, if the request-grant RTT is larger than the packet time, an idle period (bubble) corresponding to the waiting time for the grant is inserted. This translates to permanent bandwidth loss that cannot be recovered, causing the throughput to suffer. The severity of the throughput drop depends on the ratio of the request-grant RTT to the ingress packet time. Two factors that affect the ingress packet time are packet size and port speed. For the same packet size, packet time is inversely proportional to port speed. Any additional overhead associated with internal processing can be compensated by other means, such as internal speedups; hence, it can be neglected when considering the system performance. On the other hand, speculative systems do not have this issue since the request-grant loop has been removed completely and ingress now relies on the LSTM predictor. As expected, 100% throughput is maintained regardless of the size of the packet or the port speed. Throughput can also be observed on the egress port rates as depicted in Figure 9. The request-grant RTT does not vary with port speeds since it is an internal system parameter.
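The bubble argument above can be captured in a short model. This is a simplified sketch, not the simulation behind Figure 8: it assumes a single outstanding request per ingress port, so that each packet occupies the larger of the packet time and the request-grant RTT, and the function names are hypothetical.

```python
def request_grant_throughput(packet_time, rtt):
    """Fraction of line rate when each packet must wait for its own grant."""
    # Each packet occupies max(packet_time, rtt) of wall-clock time, of which
    # only packet_time carries data; the remainder is the idle bubble.
    return packet_time / max(packet_time, rtt)

def speculative_throughput(packet_time, rtt):
    """No grant to wait for, so no bubble is inserted."""
    return 1.0

# Sweep the RTT-to-packet-time ratio for a fixed packet time of 100 time units.
for rtt in (50, 100, 200, 400):
    ratio = rtt / 100
    print(ratio, request_grant_throughput(100, rtt), speculative_throughput(100, rtt))
```

In this model the non-speculative throughput stays at 100% while the ratio is at most 1 and then falls as the inverse of the ratio, whereas the speculative throughput is independent of the RTT.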
If the request-grant RTT is 25 ns, then the minimum packet size for a 10 Gbps port and a 40 Gbps port to reach 100% egress rate is 250 bytes and 1000 bytes, respectively, in non-speculative VOQ systems. The packet time from a 100 Gbps port is always less than 25 ns. Therefore, the egress rate can never reach 100%. Once again, speculative VOQ systems can sustain 100% egress rates throughout because they are not sensitive to packet size.
In addition to throughput, the overall VOQ system performance is also measured with respect to end-to-end latency, which is the elapsed time from the moment a packet arrives on an ingress port until it exits an egress port. Since a packet cannot be dispatched until its grant is received in a non-speculative system, if the request-grant RTT is longer than the packet time, then the latency becomes unbounded. This can be viewed as an unrecoverable loss of bandwidth. In practice, latency is bounded by the size of the input buffer. This buffer limitation governs the ingress rate at which packets can be processed from the source. At steady state the ingress rate must be equal to the egress rate. Although latency seems to be better with a smaller buffer size, a system with smaller buffering is less tolerant of bursty traffic. Figure 10 shows the latency values for different packet sizes as a function of the ingress buffer depth for a fixed request-grant RTT of 791 ns. That particular request-grant RTT is chosen to match the packet time of a 791-byte packet, which represents an average size between 64 bytes and 1518 bytes, the minimum and the maximum Ethernet packet sizes, respectively. For non-speculative VOQ with store-and-forward, the latency grows exponentially as the ingress buffer depth increases for smaller packet sizes, where the packet times are less than the request-grant RTT. For larger packet sizes, whose packet times are greater than the request-grant RTT, the latency grows linearly.
A speculative system behaves much better. The latency is lower and grows linearly in proportion to the packet size regardless of the depth of the ingress buffer.

VOQ Performance
The buffer must be sized to accommodate at least one maximum-size packet (1518 bytes), also known as the Maximum Transmission Unit (MTU). The simulation results for a 10 Gbps port with no congestion are shown in Table 3. The request-grant RTT of 791 ns is again assumed. For non-speculative systems, the latency grows exponentially with smaller packet sizes since these packets are affected much more severely by the request-grant RTT. The throughput increases with the packet size because larger packets incur a smaller idle period while waiting for the grant. Beyond 791 bytes, the latency is just the sum of the packet time and the request-grant RTT. Consequently, the throughput reaches 100%. Speculative systems do not have this limitation, again because the request-grant loop has been eliminated.
Table 5 shows the required minimum egress buffer size relative to different levels of flow control for 10 Gbps ingress traffic. For non-speculative systems, the egress buffer must be large enough to absorb one MTU. Higher congestion requires a deeper egress buffer in order to avoid an overrun condition. Table 6 shows the sustained congestion results caused by oversubscription. The 2x oversubscription non-speculative result is the same as that of 50% flow control in Table 5 since it effectively creates an identical system behavior. Intermittent congestion can be alleviated by providing large enough buffering in the system such that the source traffic does not need to be halted. Since an infinite amount of buffering is not possible, sustained congestion will eventually cause the source traffic to be intermittently stopped. Speculative systems inherently can provide guaranteed space without the risk of buffer overrun. Therefore, shallow egress buffers that incorporate some guard banding are sufficient, as seen in Tables 5 and 6.

Conclusion and Future Work
The throughput and latency of a VOQ-based system that employs the request-grant protocol are susceptible to packet size variation, which is typically solved by introducing considerable internal bandwidth speed-up and deep buffering. This solution is expensive and leads to a significantly more complex implementation. This paper proposes an alternative referred to as the speculative packet dispatch (SPD) method. It speculatively dispatches packets from the ingress side to the egress side based on the predictions from an LSTM time-series network that has been trained on queue occupancy data sets.
The experimental results confirm that SPD-based systems have lower latency and higher throughput regardless of the size of the packet. They also require a smaller amount of buffering and preserve the congestion-related per-port isolation property of VOQ.
Investigation into how to apply SPD to other networking problems is warranted. Since SPD can accurately predict various congestion scenarios, it might be possible to use it as a system-level congestion avoidance algorithm. SPD could also be altered to perform speculative priority-based arbitration tasks to support quality of service for certain classes of traffic.