Slicing

From OpenFlow Wiki



High Level

We assume that a future version of OpenFlow will support assigning a Quality of Service (QoS) to a flow. QoS is a big topic: on one hand, OpenFlow probably needs some basic QoS support; on the other hand, requiring too much would make OpenFlow too complex and take away one of its biggest advantages.

Currently, we see two fundamental types of QoS services: rate limiting at ingress and minimum bandwidth guarantee on egress. The first is typically done with a meter that is associated with an ingress port or with ingress flows; once a certain rate is exceeded, packets are dropped according to some algorithm.

We call minimum bandwidth guarantee for a queue Slicing because we think of each queue as being given a slice of the available network bandwidth. This is related to, but should not be confused with, the FlowVisor notion of slicing the flow space into separate pieces.

As a compromise between the complexity of supporting full per-flow QoS and the simplicity of no QoS, and with the OpenFlow campus trials in mind, we define here a bare minimum of QoS support for OpenFlow: QoS-related Network Slicing. Our goal is to include the slicing mechanism in the OpenFlow 1.0 release. (We discuss only the QoS issues related to slicing; for all other issues, refer to the FlowVisor project.)

From early deployments of OpenFlow it has been clear that a slicing mechanism is necessary to provide isolation between different experiments or even different types of traffic/flows within a single experiment. A "network slice" should be able to secure a certain amount of network capacity, no matter what the other slices that co-exist in the network are doing.

This proposal describes a QoS-based mechanism that aims to provide the aforementioned functionality. We do not aim for a complete QoS scheme, though we build our framework so that other parties can extend it to support more complicated scenarios. We define a default set of QoS options that (in time) all OpenFlow switches will be required to support, hence providing a bare minimum that can be assumed to be present in every switch, router, access point and base station. In addition, the spec should allow vendors to add more QoS support (e.g. by supporting more primitives, service classes, and/or additional policies) so as to differentiate their products from others.

Slicing Mechanism Proposal

The slicing mechanism is based on queues attached to egress ports. The user (controller) is able to set up and configure queues and then map flows to a specific queue. The queue configuration dictates how a packet will be treated.

There are two distinct parts that form the slicing mechanism:

  • Configuration
  • Flow-Queue Mapping/Forwarding: Exposing the queues to forwarding actions

Configuration

Queue configuration is not part of the OpenFlow protocol. We assume that it will be integrated into the upcoming configuration protocol. The proposed implementation for 1.0 adds configuration as a vendor extension.

  • We expect that within the same switch different linecards may have different capabilities. Thus, QoS description refers to each port individually and not to the whole switch.
  • A queue is configured on a per port basis and it is characterized by the mechanism that dictates its behavior.
  • Currently, we require only min-rate queues, which provide minimum datarate guarantees for flows that are mapped into them. A min-rate queue is associated with a percentage of the link datarate. Usage of residual bandwidth depends on the implementation, and there are no guarantees about how this is done; for example, one could implement an internal priority scheme, or fair-share allocation of the remaining bandwidth.
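
As a rough illustration of what a min-rate percentage means in absolute terms, the sketch below converts a configured rate into bits per second and checks that the min-rates on one port do not exceed the link. It assumes the rate is expressed in tenths of a percent (so 700 means 70%), which is inferred from the values used in the FlowVisor test script on this page rather than stated here; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: a min-rate value is given in tenths of a percent of the
 * link datarate, so 1000 units correspond to the whole link. */
#define RATE_UNITS_PER_LINK 1000u

/* Guaranteed rate, in bits per second, of a min-rate queue. */
static uint64_t min_rate_bps(uint64_t link_bps, uint16_t rate_tenths)
{
    assert(rate_tenths <= RATE_UNITS_PER_LINK);
    return link_bps * rate_tenths / RATE_UNITS_PER_LINK;
}

/* Returns 1 if all min-rates configured on one port can be honoured at
 * the same time, i.e. they add up to at most the whole link. */
static int min_rates_feasible(const uint16_t *rates_tenths, int n)
{
    uint32_t total = 0;
    for (int i = 0; i < n; i++)
        total += rates_tenths[i];
    return total <= RATE_UNITS_PER_LINK;
}
```

On a 10Mbps link, a queue configured with 700 would be guaranteed 7Mbps and one configured with 300 would be guaranteed 3Mbps; whatever is left over is shared in an implementation-specific way.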

Flow-Queue Mapping and Forwarding

(All items in this subsection are to be included in the spec. Please check the related branch in the public OpenFlow repo (devel/yiannisy/slicing) for the updated spec, marked in blue, and the development process.)

After the queues have been configured, flows can be mapped to queues and packets will be forwarded through them.

The basic structure for queues defines a single property or characteristic of the queue. Currently, only one such property is defined: The minimum rate to be guaranteed by the scheduler.

enum ofp_queue_properties {
  OFPQT_NONE = 0,       /* no property defined for queue (default) */
  OFPQT_MIN,            /* minimum datarate guaranteed */
                        /* other types should be added here 
                          (i.e. max rate, precedence, etc) */
};
  • The controller gets information about configured queues by querying the port with an OFPQ_GET_CONFIG_REQUEST. The command includes the port under query.
struct ofp_queue_get_config_request {
  struct ofp_header header;
  uint16_t port;         /* port to be queried */
  uint8_t pad[2];        /* 32-bit alignment */
}; 
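
A minimal sketch of how a controller might fill in this request is shown below. OpenFlow messages are sent in network byte order and begin with the standard 8-byte OpenFlow header; the function name is hypothetical, and the protocol version and message type are left as caller-supplied parameters because their numeric values are not given in this section.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Standard 8-byte OpenFlow header. */
struct ofp_header {
    uint8_t  version;
    uint8_t  type;
    uint16_t length;   /* total message length, header included */
    uint32_t xid;      /* transaction id, echoed in the reply */
};

struct ofp_queue_get_config_request {
    struct ofp_header header;
    uint16_t port;     /* port to be queried */
    uint8_t  pad[2];   /* 32-bit alignment */
};
_Static_assert(sizeof(struct ofp_queue_get_config_request) == 12,
               "unexpected struct padding");

/* Fill `req` for one port. `version` and `msg_type` come from the caller
 * (hypothetical parameters; this page does not define their values). */
static void build_queue_get_config_request(
    struct ofp_queue_get_config_request *req,
    uint8_t version, uint8_t msg_type, uint32_t xid, uint16_t port)
{
    assert(req != NULL);
    memset(req, 0, sizeof *req);           /* also zeroes the padding */
    req->header.version = version;
    req->header.type = msg_type;
    req->header.length = htons((uint16_t)sizeof *req);
    req->header.xid = htonl(xid);
    req->port = htons(port);
}
```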

The switch replies with an OFPQ_GET_CONFIG_REPLY containing a list of configured queues. If a queue is enabled but has no properties configured, it will appear with the property OFPQT_NONE. The number of properties in the list is inferred from the length of the message. The port field should refer to a physical port (i.e. port < OFPP_ALL).

struct ofp_queue_get_config_reply {
  struct ofp_header header;
  uint16_t port;
  uint8_t pad[6];
  struct ofp_packet_queue queues[]; /* list of configured queues */
};

Each queue is characterized by a set of properties; currently only a min_rate property is defined:

struct ofp_packet_queue {
  uint32_t queue_id;        /* id for the specific queue */
  uint16_t len;             /* length in bytes */
  uint8_t pad[2];           /* 64-bit alignment */
  struct ofp_queue_prop_header properties[0]; /* list of properties */
};
OFP_ASSERT(sizeof(struct ofp_packet_queue) == 8);

struct ofp_queue_prop_header {
  uint16_t property;        /* one of OFPQT_ */
  uint16_t len;             /* length of property, including this header */
  uint8_t pad[4];           /* 64-bit alignment */
};
OFP_ASSERT(sizeof(struct ofp_queue_prop_header) == 8);

struct ofp_queue_prop_min_rate {
  struct ofp_queue_prop_header prop_header; /* property is OFPQT_MIN, len is 16 */
  uint16_t rate;           /* parameter for the property */
  uint8_t pad[6];           /* 64-bit alignment */
};
OFP_ASSERT(sizeof(struct ofp_queue_prop_min_rate) == 16);
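
Because every property starts with a header that carries its own length, a receiver can walk the property list of a queue without knowing every property type. The sketch below looks for the min-rate property in such a list; it is only an illustration, the function name and return codes are hypothetical, and OFPQT_MIN = 1 follows from the enum above.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { OFPQT_NONE = 0, OFPQT_MIN = 1 };

struct ofp_queue_prop_header {
    uint16_t property;
    uint16_t len;
    uint8_t  pad[4];
};

struct ofp_queue_prop_min_rate {
    struct ofp_queue_prop_header prop_header;
    uint16_t rate;
    uint8_t  pad[6];
};

/* Scan the property list of one queue. `props` points just past the
 * 8-byte ofp_packet_queue header and `props_len` is the queue's len
 * minus 8. Returns 0 and stores the rate if OFPQT_MIN is present,
 * 1 if the list is well formed but has no min-rate, -1 if malformed. */
static int find_min_rate(const uint8_t *props, size_t props_len,
                         uint16_t *rate)
{
    size_t off = 0;

    assert(props != NULL && rate != NULL);
    while (off + sizeof(struct ofp_queue_prop_header) <= props_len) {
        struct ofp_queue_prop_header hdr;
        memcpy(&hdr, props + off, sizeof hdr);   /* avoid unaligned reads */
        size_t len = ntohs(hdr.len);
        if (len < sizeof hdr || len > props_len - off)
            return -1;                           /* bad or truncated length */
        if (ntohs(hdr.property) == OFPQT_MIN) {
            struct ofp_queue_prop_min_rate mr;
            if (len < sizeof mr)
                return -1;
            memcpy(&mr, props + off, sizeof mr);
            *rate = ntohs(mr.rate);
            return 0;
        }
        off += len;                              /* skip unknown property */
    }
    return off == props_len ? 1 : -1;
}
```

Skipping by the advertised length is what lets future properties (max rate, precedence, etc.) be added without breaking older parsers.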
  • The mapping from flow to queues takes place within the OpenFlow protocol.
  • Assuming that a queue is already configured, the user can associate a flow with an OFPAT_ENQUEUE action, which forwards the packet through the specific queue on that port. Note that an enqueue action should override any TOS/VLAN_PCP-related behavior that is potentially defined in the switch. In all cases, the packet should not change due to an enqueue action. If the switch needs to set TOS/PCP bits for internal handling, the original values should be restored before sending the packet out.

The following structure is used for enqueuing a flow to a queue:

struct ofp_action_enqueue {
  uint16_t type;         /* OFPAT_ENQUEUE */
  uint16_t len;          /* len is 16 */
  uint16_t port;         /* port that queue belongs to */
  uint8_t pad[6];        /* 64-bit alignment */
  uint32_t queue_id;     /* where to enqueue the packets */
};
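
The following sketch fills in such an action in network byte order. The fields as declared occupy 16 bytes (2 + 2 + 2 + 6 + 4), which the static assertion checks. The function name is hypothetical, and the numeric value of OFPAT_ENQUEUE is passed in by the caller since it is not given in this section.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct ofp_action_enqueue {
    uint16_t type;       /* OFPAT_ENQUEUE */
    uint16_t len;        /* length of this action in bytes */
    uint16_t port;       /* port that the queue belongs to */
    uint8_t  pad[6];     /* 64-bit alignment */
    uint32_t queue_id;   /* where to enqueue the packets */
};
_Static_assert(sizeof(struct ofp_action_enqueue) == 16,
               "unexpected struct padding");

/* Build an enqueue action for (port, queue_id). `action_type` is the
 * value of OFPAT_ENQUEUE, supplied by the caller (assumption: it is
 * taken from the OpenFlow headers in use). */
static void make_enqueue_action(struct ofp_action_enqueue *a,
                                uint16_t action_type,
                                uint16_t port, uint32_t queue_id)
{
    assert(a != NULL);
    memset(a, 0, sizeof *a);               /* zero the padding bytes */
    a->type = htons(action_type);
    a->len = htons((uint16_t)sizeof *a);
    a->port = htons(port);
    a->queue_id = htonl(queue_id);
}
```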
  • If the queue is not configured, the switch replies with an OFPBAC_BAD_ARGUMENT and rejects the flow.
  • The switch keeps statistics for each queue, which can be retrieved using the OFPST_QUEUE stats command.
  • The stats request takes two arguments: the port and the id of the queue we refer to.
struct ofp_queue_stats_request {
   uint16_t port_no;     /* OFPP_NONE for all ports. */
   uint8_t pad[2];       /* Align to 32-bits. */
   uint32_t queue_id;    /* OFPQ_NONE for all queues. */
};
  • The switch should report back with the following structure:
struct ofp_queue_stats {
   uint16_t port_no;
   uint8_t pad[2];       /* Align to 32-bits. */
   uint32_t queue_id;    /* queue id */
   uint64_t tx_bytes;    /* Number of transmitted bytes. */
   uint64_t tx_packets;  /* Number of transmitted packets. */
   uint64_t tx_error;    /* Number of packets dropped due to overrun. */
};
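
One way to use these counters for checking a slice is to poll tx_bytes twice and turn the difference into a rate. The helper below is a sketch with a hypothetical name; it takes counter values that have already been converted to host byte order and assumes the counter did not wrap or reset between the two samples.

```c
#include <assert.h>
#include <stdint.h>

/* Average transmit rate of a queue, in Mbps, between two samples of
 * its tx_bytes counter taken `interval_sec` seconds apart. */
static double queue_rate_mbps(uint64_t tx_bytes_prev, uint64_t tx_bytes_now,
                              double interval_sec)
{
    assert(interval_sec > 0.0);
    assert(tx_bytes_now >= tx_bytes_prev);   /* no wrap or reset */
    return (double)(tx_bytes_now - tx_bytes_prev) * 8.0
           / interval_sec / 1e6;
}
```

For example, a queue that transmitted 750000 bytes during one second carried 6Mbps on average over that interval.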

VLAN PCP/ TOS bits

A switch may support only queues that are tied to specific PCP/TOS bits. In that case, we cannot map an arbitrary flow to a specific queue, and therefore the ENQUEUE action is not supported. The user can still configure queues through the configuration protocol, and map flows to these queues by setting the relevant fields (see the SET_TOS and SET_VLAN_PCP OpenFlow commands).

Reference Implementation

Architecture

The slicing mechanism is provided with the userspace reference implementation. We use Linux traffic control (tc), a userspace front-end, to configure kernel queueing.

Queue configuration can be done through dpctl. It is implemented as an OpenFlow vendor extension, so there shouldn't be any problem setting queues through a controller (this has not been tested, though).

Queues are integrated into the netdev port abstraction. The reference implementation supports up to 8 queues per port. Each queue is a class under a Linux queue discipline. To share redundant bandwidth efficiently between queues, we need to place all of them under a common root class.

Port-queue.png

Each queue is mapped to a write-only socket. We set SO_PRIORITY=class_id on these sockets, and therefore traffic sent through them is queued at the class pointed to by SO_PRIORITY. Note that this introduces an overhead of 9 sockets per port. Implementations that have access to sk_buffs can do the mapping through the skb->priority field, without the need for multiple sockets.

From the OpenFlow datapath perspective, packets that do not belong to queues are sent to the default class. All others are pushed into the appropriate socket.

Testing without FlowVisor

Slicing testing requires a specific topology and rate measurements, which are not currently available in the OpenFlow regression suite.

We need to oversubscribe a link and make sure that traffic is treated as dictated by the queue configuration. The setup includes the OpenFlow switch under test and two PCs that run instances of iperf servers and clients. The instructions below go through the test process so that it can be easily reproduced.

This is the topology we used. PC-1 has two ports, eth1 and eth2.

ifconfig eth1 192.168.10.32 netmask 255.255.255.0
ifconfig eth2 192.168.11.32 netmask 255.255.255.0

PC-2 has one port, with two IP addresses configured over it.

ip address flush dev eth1
ifconfig eth1:0 192.168.10.34 netmask 255.255.255.0
ifconfig eth1:1 192.168.11.34 netmask 255.255.255.0
Slicing-setup.png

We start the OpenFlow switch:

sudo ofdatapath -i nf2c0,nf2c1,nf2c2,nf2c3 ptcp:

We throttle port 3 of the OpenFlow switch to 10Mbps. This step is necessary to ensure that link bandwidth is the bottleneck, and not switch performance (which is the case with the reference implementation). Note that the throttling is done using tc, and it has to be done after we start the OpenFlow instance (the reference implementation removes any existing tc configuration when it starts).

sudo /sbin/tc class change dev nf2c2 parent 1: classid 1:ffff htb rate 10000kbit ceil 10000kbit

We set up two queues at port 3, q1=6Mbps, q2=4Mbps.

dpctl add-queue tcp:localhost 3 1 6
dpctl add-queue tcp:localhost 3 2 4

We install the necessary flows to handle the iperf traffic (note that queues are not used):

arp,in_port=1,idle_timeout=0,actions=output:3
arp,in_port=2,idle_timeout=0,actions=output:3
arp,in_port=3,idle_timeout=0,actions=output:1,output:2
icmp,in_port=1,idle_timeout=0,actions=output:3
icmp,in_port=2,idle_timeout=0,actions=output:3
icmp,in_port=3,nw_dst=192.168.10.32,idle_timeout=0,actions=output:1
icmp,in_port=3,nw_dst=192.168.11.32,idle_timeout=0,actions=output:2
ip,in_port=1,nw_dst=192.168.10.34,idle_timeout=0,actions=output:3
ip,in_port=2,nw_dst=192.168.11.34,idle_timeout=0,actions=output:3
ip,in_port=3,nw_dst=192.168.10.32,idle_timeout=0,actions=output:1
ip,in_port=3,nw_dst=192.168.11.32,idle_timeout=0,actions=output:2

We start the iperf servers at PC-2:

iperf -s -p 8010 -f m -i 1 # tcp server
iperf -s -p 8011 -u -f m --reportstyle C -i 1 # udp server

We start the iperf clients at PC-1:

iperf -c 192.168.10.34 -i 1 -p 8010 -f m -t 120 --reportstyle C # tcp client
iperf -c 192.168.11.34 -p 8011 -i 1 -f m -t 120 -u -b 10000000

The following steps verify slicing functionality and are shown in the figure below:


Slicing Testing.png
  • t=0: traffic is forwarded through the default, best-effort queue at port 3. As expected, TCP backs off and UDP traffic dominates.
  • t=30: we send iperf traffic through the queues.
ip,in_port=1,nw_dst=192.168.10.34,idle_timeout=0,actions=enqueue:3:1
ip,in_port=2,nw_dst=192.168.11.34,idle_timeout=0,actions=enqueue:3:2

We can see TCP traffic taking its share and UDP limited to 4Mbps.

  • t=60: we modify q1 and set its minimum rate at 4Mbps.
dpctl mod-queue tcp:localhost 3 1 4

We then see UDP and TCP sharing the link fairly (5Mbps each). The reference implementation shares the redundant bandwidth (2Mbps in our case, since q1=q2=4Mbps) in proportion to the current queue configuration. This behavior is not required: the spec doesn't define how redundant bandwidth is used; it is implementation specific.
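
The arithmetic behind these numbers can be written out as a small sketch. It models only the proportional sharing described above for the reference implementation, under the assumption that every queue always has traffic to send; the function name is hypothetical and other implementations may distribute the leftover bandwidth differently.

```c
#include <assert.h>
#include <stddef.h>

/* Give each queue its minimum rate plus a share of the leftover link
 * bandwidth proportional to that minimum rate. All rates in Mbps. */
static void share_link(double link_mbps, const double *min_mbps,
                       double *out_mbps, size_t n)
{
    double total_min = 0.0;

    for (size_t i = 0; i < n; i++)
        total_min += min_mbps[i];
    assert(total_min > 0.0 && total_min <= link_mbps);

    double residual = link_mbps - total_min;
    for (size_t i = 0; i < n; i++)
        out_mbps[i] = min_mbps[i] + residual * min_mbps[i] / total_min;
}
```

With a 10Mbps link and q1=q2=4Mbps, the residual is 2Mbps and each queue ends up with 4 + 2 * 4/8 = 5Mbps. With q1=6Mbps and q2=4Mbps there is no residual, so the queues get exactly 6Mbps and 4Mbps.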

  • t=90: we reset the flow entries, and all traffic goes through the default queue again.

Testing with FlowVisor

High Level

Note: This section assumes familiarity with FlowVisor and setting up slices. For related info, refer to the FlowVisor wiki page.

  • Setup flowvisor with two slices: one "production" and the other "meanie"
  • The high level idea is to have the "meanie" slice try to consume arbitrary bandwidth and see if it impacts the production slice
Setup
  • Meanie would get 30% of bandwidth reserved and TCP and UDP traffic on port 8011
    • meanie slice flowvisor config file
ID: 2
Host: tcp:localhost:7002
BandwidthSlice: 2
# see arp
FlowSpace: allow: dl_type: 2054 limit: 10000
# full for to/from iperf server
FlowSpace: allow: ip_src: 172.24.74.70 tp_src: 8011 limit: 10000
FlowSpace: allow: ip_dst: 172.24.74.70 tp_dst: 8011 limit: 10000
  • Production would get 70% of bandwidth reserved and everything but Meanie's traffic
    • production slice flowvisor config file:
ID: 1
Host: tcp:localhost:7001
BandwidthSlice: 1
FlowSpace: deny: ip_src: 172.24.74.70 tp_src: 8011 limit: 10000
FlowSpace: deny: ip_dst: 172.24.74.70 tp_dst: 8011 limit: 10000
FlowSpace: allow: limit: 10000
  • Meanie would run iperf with UDP (as above) and production would run iperf over TCP (as above)
  • The network topology was 2 laptops (acting as iperf clients) directly connected to a PC running the reference switch. The PC/switch then connected to a second PC that acted as an iperf server for both clients, making the link between PC/switch and PC/server the bottleneck link.
  • On the switch, we set the queues with the script /etc/init.d/openflow_slicing:
#!/bin/sh
switch=unix:/var/run/dp0
dpctl=/root/openflow.git/utilities/dpctl
ports="1 2 3 4 5 6 7"
production_slice=1
production_slice_speed=700
other_slice=2
other_slice_speed=300
case "$1" in
       start)
               for port in $ports ; do
                       $dpctl add-queue $switch $port $production_slice $production_slice_speed
                       sleep 0.1
                       $dpctl add-queue $switch $port $other_slice $other_slice_speed
                       sleep 0.1
               done
       ;;
       stop)
               for port in $ports ; do
                       $dpctl del-queue $switch $port $production_slice
                       sleep 0.1
                       $dpctl del-queue $switch $port $other_slice
                       sleep 0.1
               done
       ;;
       show)
               $dpctl dump-queue $switch
       ;;
       *)
               echo "Unknown command: $1" >&2
               exit 1
esac


  • To ensure that the limit was the bandwidth rather than the reference switch's performance, we throttled the bottleneck link down to 10Mb/s using /sbin/tc
    • Note that we also attempted to use /sbin/ethtool to set the interface to 10Mb/s, and the results were different
    • We also attempted the same experiment with a Soekris box, and it was not able to enforce the slicing correctly either. We suspect that the Soekris box was running into memory bandwidth limits
    • See the Known Issues section below for further details.
Results
  • Slicing was correctly enforced
  • Production was not able to take up 100% of its allocated 70% because of TCP's performance characteristics, so Meanie was able to get slightly more than 30% of bandwidth on average
  • This is not a bad thing; we are only guaranteeing minimum bandwidth, not absolute or maximum

Slicing-flowvisor-PC.png

Known Issues - TODO

Link Throttling

The way we throttle the link seems to affect slicing behavior. For the following we assume 100Mbps ethernet cards, two slices setup as 70% and 30% respectively, with iperf (UDP and/or TCP) on each slice.

  • Using ethtool to go down to a 10Mbps link speed
    • UDP vs UDP works fine and slice rates are respected.
    • UDP vs TCP does not work: UDP dominates. We need to check whether packets are dropped at the OF switch, or whether a kernel buffer (socket, tc, ...?) is too long and causes TCP retransmission and backoff.
  • Using tc to throttle the link to 10Mbps
    • Both UDP vs UDP and UDP vs TCP work fine.
  • No throttling (operating at 100Mbps)
    • Switch CPU becomes the bottleneck. With a single slice we seem to get ~65Mbps. Slices are respected, since UDP goes up to 30Mbps (as set by its queue configuration) and TCP gets the remaining 35Mbps of bandwidth.

One main difference between the ethtool and tc cases is that when we use tc to throttle the link, the buffers are still drained at the 100Mbps rate (and therefore much faster).

Things to investigate:

  • Check whether there are dropped packets at the switch at netdev_send (does TCP back off due to dropped or delayed packets?)
  • Put a limited PFIFO under the HTB classes (mentioned here).
  • Spot the bottleneck queue at the kernel and log the minimum info (read-write bit, time offset from last operation) to replay its utilization.
Switch Processing Constraints

On a low-end switch (Soekris box) we observed accuracy/starvation issues. Even though the rates were respected on average, there were long starvation periods for individual flows (up to 20 secs). CPU utilization didn't hit 100%, so it's possible that this is a memory limitation.

r2q error

We get r2q warnings while configuring the queues; we should probably relate r2q to the class rate.

burst configuration

We don't explicitly configure burst, so tc sets the smallest possible value. We may have to set a burst (either a default or one related to each class's rate).

Additional Notes

  • The slicing mechanism may benefit from the port-group feature due in the OpenFlow 1.0 release.
  • Configuration of queues should allow for easy extension towards primitives other than min-rate (e.g. max rate)
  • This is the proposed timeline:

Slicing timeline.png