[ovs-dev] OVN architecture

Ben Pfaff blp at nicira.com
Tue Jan 13 11:29:41 PST 2015


Open Virtual Network (OVN) Proposed Architecture
================================================

The Open vSwitch team is pleased to announce OVN, a new subproject in
development within the Open vSwitch project.  The full project
announcement is at Network Heresy and reproduced at:

        http://openvswitch.org/pipermail/dev/2015-January/050379.html

OVN complements the existing capabilities of OVS to add native support
for virtual network abstractions, such as virtual L2 and L3 overlays
and security groups.  Just like OVS, our design goal is to have a
production-quality implementation that can operate at significant
scale.  This post outlines the proposed high-level architecture for
OVN.

This document mainly discusses the design of OVN on hypervisors
(including container systems).  The complete OVN system will also
include support for software and hardware gateways for
logical-physical integration, and perhaps also for service nodes to
offload multicast replication from hypervisors.  Each of these classes
of devices has much in common, so our discussion here refers to them
collectively as "chassis" or "transport nodes".  


Layering
========

From lowest to highest level, OVN comprises the following layers.

Open vSwitch
------------

The lowest layer is Open vSwitch, that is, ovs-vswitchd and
ovsdb-server.  OVN will use standard Open vSwitch, not some kind of
specially patched or modified version.  OVN will use some of the Open
vSwitch extensions to OpenFlow, since many of those extensions were
introduced to solve problems in network virtualization.  For that
reason, OVN will probably not work with OpenFlow implementations other
than Open vSwitch.

On hypervisors, we expect OVN to use the hypervisor integration
features described in IntegrationGuide.md in the OVS repository.
These features allow controllers, like ovn-controller, to associate
vifs instantiated on a given hypervisor with configured VMs and their
virtual interfaces.  The same interfaces allow for container
integration.

ovn-controller
--------------

The layer just above Open vSwitch proper consists of ovn-controller,
an additional daemon that runs on every chassis (hardware VTEPs are a
special case; they might use something different).  Southbound, it
talks to ovs-vswitchd over the OpenFlow protocol (with extensions) and
to ovsdb-server over the OVSDB protocol.


                               OVN Database
                                     |
                                     |
                             (OVSDB Protocol)
                                     |
   +-------------------------------------------------------------------+
   |                                 |                                 |
   |                                 |                                 |
   |                           ovn-controller                          |
   |                              |     |                              |
   |                              |     |                              |
   |               +--------------+     +--------------+               |
   |               |                                   |               |
   |               |                                   |               |
   |       (OVSDB Protocol)                        (OpenFlow)          |
   |               |                                   |               |
   |               |                                   |               |
   |         ovsdb-server                         ovs-vswitchd         |
   |                                                                   |
   +---------------------------- Hypervisor ---------------------------+


ovn-controller does not interact directly with the Open vSwitch kernel
module (or DPDK or any other datapath).  Instead, it uses the same
public OpenFlow and OVSDB interfaces used by any other controller.
This avoids entangling OVN and OVS.

Each version of ovn-controller will require some minimum version of
Open vSwitch.  It may be necessary to pair matching versions of
ovn-controller and OVS (which is likely feasible, since they run on
the same physical machine), but it is probably possible, and better,
to tolerate some version skew.

Northbound, ovn-controller talks to the OVN database (described in the
following section) using a database protocol.

ovn-controller has the following tasks:

  * Translate the configuration and state obtained from the OVN
    database into OpenFlow flows and other state, and push that state
    down into Open vSwitch over the OpenFlow and OVSDB protocols.
    This occurs in response to network configuration updates, not in
    reaction to data packets arriving from virtual or physical
    interfaces.  Examples include translation of logical datapath
    flows into OVS flows (see Logical Datapath Flows below) via
    OpenFlow to ovs-vswitchd, and instantiation of tunnels via OVSDB
    to the chassis's ovsdb-server.

  * Populate the bindings component of the OVN database (described
    later) with chassis state relevant to OVN.  On a hypervisor, this
    includes the vifs instantiated on the hypervisor at any given
    time, to allow other chassis to correctly forward packets destined
    to VMs on this hypervisor.  On a gateway (software or hardware),
    this includes MAC learning state from physical ports.

  * In corner cases, respond to packets arriving from virtual
    interfaces (via OpenFlow).  For example, ARP suppression may
    require observing packets from VMs through OpenFlow "packet-in"
    messages.

ovn-controller is not a centralized controller but what we refer to as
a "local controller", since an independent instance runs locally on
every hypervisor.  ovn-controller is not a general purpose "SDN
controller"--it performs the specific tasks outlined above in support
of the virtual networking functionality of OVN.

OVN database
------------

The OVN database contains three classes of data with different properties:

  * Physical Network (PN): information about the chassis nodes in the
    system.  This contains all the information necessary to wire the
    overlay, such as IP addresses, supported tunnel types, and
    security keys.

    The amount of PN data is small (O(n) in the number of chassis) and
    it changes infrequently, so it can be replicated to every chassis.

  * Logical Network (LN): the topology of logical switches and
    routers, ACLs, firewall rules, and everything needed to describe
    how packets traverse a logical network, represented as logical
    datapath flows (see Logical Datapath Flows, below).

    LN data may be large (O(n) in the number of logical ports, ACL
    rules, etc.).  Thus, to improve scaling, each chassis should
    receive only data related to logical networks in which that
    chassis participates.  Past experience shows that in the presence
    of large logical networks, even finer-grained partitioning of
    data, e.g. designing logical flows so that only the chassis
    hosting a logical port needs related flows, pays off scale-wise.
    (This is not necessary initially but it is worth bearing in mind
    in the design.)

    One may view LN data in at least two different ways.  In one view,
    it is an ordinary database that must support all the traditional
    transactional operations that databases ordinarily include.  From
    another viewpoint, the LN is a slave of the cloud management
    system running northbound of OVN.  That CMS determines the entire
    OVN logical configuration and therefore the LN's content at any
    given time is a deterministic function of the CMS's configuration.
    From that viewpoint, it might be necessary only to have a single
    master (the CMS) provide atomic changes to the LN.  Even
    durability may not be important, since the CMS can always provide
    a replacement snapshot.

    LN data is likely to change more quickly than PN data.  This is
    especially true in a container environment where VMs are created
    and destroyed (and therefore added to and deleted from logical
    switches) quickly.

  * Bindings: the current placement of logical components (such as VMs
    and vifs) onto chassis and the bindings between logical ports and
    MACs.

    Bindings change frequently, at least every time a VM powers up or
    down or migrates, and especially quickly in a container
    environment.  The amount of data per VM (or vif) is small.

    Each chassis is authoritative about the VMs and vifs that it hosts
    at any given time and can efficiently flood that state to a
    central location, so the consistency needs are minimal.
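
As a rough illustration of the LN partitioning idea above (each chassis
receives only data for logical networks in which it participates), the
following Python sketch joins LN data with Bindings.  The dictionaries,
names such as "ls-web" and "hv1", and both helper functions are
hypothetical; they are not part of any OVN schema.

```python
# Hypothetical toy data; the names and shapes are illustrative only.
# Logical port -> logical datapath it belongs to (LN data).
PORT_TO_DATAPATH = {"lp1": "ls-web", "lp2": "ls-web", "lp3": "ls-db"}
# Logical port -> chassis currently hosting it (Bindings data).
PORT_TO_CHASSIS = {"lp1": "hv1", "lp2": "hv2", "lp3": "hv2"}


def datapaths_for_chassis(chassis):
    """Return the set of logical datapaths in which `chassis` participates."""
    return {PORT_TO_DATAPATH[port]
            for port, host in PORT_TO_CHASSIS.items()
            if host == chassis}


def flows_for_chassis(chassis, flows):
    """Keep only the logical flows whose datapath matters to `chassis`."""
    wanted = datapaths_for_chassis(chassis)
    return [flow for flow in flows if flow["datapath"] in wanted]


ALL_FLOWS = [
    {"datapath": "ls-web", "match": "eth.dst == 00:00:00:00:00:01"},
    {"datapath": "ls-db", "match": "eth.dst == 00:00:00:00:00:03"},
]
```

In this toy data, "hv1" hosts only a port on "ls-web", so
flows_for_chassis("hv1", ALL_FLOWS) keeps only the first flow, while
"hv2" needs both.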


     +----------------------------------------+
     |        Cloud Management System         |     
     +----------------------------------------+
              |                     |
              |                     |
     +------------------+  +------------------+  +------------------+
     | Physical Network |  |  Logical Network |  |     Bindings     |
     |       (PN)       |  |       (LN)       |  |                  |
     +------------------+  +------------------+  +------------------+
             |  |                 |  |                   |  |
             |  |                 |  |                   |  |
             +----------+---------+----------------------+  |
                |       |            |                      |
                +-------|------------+----------+-----------+
                        |                       |
                +----------------+      +----------------+  
                |                |      |                |
                |  Hypervisor 1  |      |  Hypervisor 2  |
                |                |      |                |
                +----------------+      +----------------+


An important design decision is the choice of database.  It is also
possible to choose multiple databases, dividing the data according to
its different uses as described above.  Some factors behind the choice
of database:

  * Availability.  Clustering could be helpful, but the ability to
    resynchronize cheaply from a rebooted database server (e.g. using
    "Difference Digests" described in Epstein et al., "What's the
    Difference? Efficient Set Reconciliation Without Prior Context")
    might be just as important.

  * The bindings database and to a lesser extent the LN database
    should support a high write rate.

  * The database should scale to a large number of connections
    (thousands, for a large OVN deployment).

  * The database should have C bindings.

We initially plan to use OVSDB as the OVN database.  ovsdb-server does
not yet include clustering, nor does it have cheap resynchronization,
nor does it scale to thousands of connections.  None of these is
fundamental to its design, so as bottlenecks arise we will add the
necessary features as part of OVN development.  (None of these
features should cause backward incompatibility with existing OVSDB
clients.)  If this proves impracticable we will switch to an
alternative database. The interfaces and partitioning of the system
state are more important to get right; the implementations behind the
interfaces are then simple to change.

Cloud Management System
-----------------------

OVN requires integration with the cloud management system in use.  We
will write a plugin to integrate OVN into OpenStack.  The primary job
of the plugin is to translate the CMS configuration, which forms the
northbound API, into logical datapath flows in the OVN LN database.
The CMS plugin may also update the PN.

A significant amount of the code to translate the CMS configuration
into logical datapath flows may be independent of the CMS in use.  It
should be possible to reuse this code from one CMS plugin to another.


Logical Pipeline
================

In OVN, packet processing follows this process:

  * Physical ingress, from a VIF, a tunnel, or a gateway physical
    port.

  * Logical ingress. OVN identifies the packet's logical datapath and
    logical port.

  * Logical datapath processing.  The packet passes through each of
    the stages in the ingress logical datapath.  In the end, the
    logical datapath flows output the packet to zero or more logical
    egress ports.

  * Further logical datapath processing.  If an egress logical port
    connects to another logical datapath, then the packet passes
    through that logical datapath in the same way as the initial
    logical datapath.  Logical datapaths can be connected into a
    logical topology, e.g. one that represents a network of connected
    logical routers and switches.

  * Logical egress: Eventually, a packet that is not dropped is output
    to a logical port that has a physical realization.  OVN identifies
    how to send a packet to the physical egress.

  * Physical egress, to a VIF, a tunnel, or a gateway physical port.

The pipeline processing is split between the ingress and egress
transport nodes.  In particular, the logical egress processing may
occur at either hypervisor.  Processing the logical egress on the
ingress hypervisor requires more state about the egress vif's
policies, but reduces traffic on the wire that would eventually be
dropped.  Processing on the egress hypervisor, by contrast, can reduce
broadcast traffic on the wire by doing local replication.  We
initially plan to process logical egress on the egress hypervisor so
that less state needs to be replicated.  However, we may change this
behavior once we gain some experience writing the logical flows.

Logical Datapath Flows
----------------------

The LN database specifies the logical topology as a set of logical
datapath flows (as computed by OVN's CMS plugin).  A logical datapath
flow is much like an OpenFlow flow, except that the flows are written
in terms of logical ports and logical datapaths instead of physical
ports and physical datapaths.  ovn-controller translates logical flows
into physical flows.  The translation process helps to ensure
isolation between logical datapaths.

The Pipeline table in the LN database stores the logical datapath
flows.  It has the following columns:

  * table_id: An integer that designates a stage in the logical
    pipeline, analogous to an OpenFlow table number.

  * priority: An integer between 0 and 65535 that designates the
    flow's priority.  Flows with numerically higher priority take
    precedence over those with lower. If two logical datapath flows
    with the same priority both match, then the one actually applied
    to the packet is undefined.

  * match: A string specifying a matching expression (see below) that
    determines which packets the flow matches.

  * actions: A string specifying a sequence of actions (see below) to
    execute when the matching expression is satisfied.

The default action when no flow matches is to drop packets.
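
The lookup semantics of the Pipeline table can be sketched in Python as
follows.  This is only an illustration under stated assumptions: the
LogicalFlow class and lookup() function are hypothetical, and the match
column is modeled as a Python callable rather than a parsed matching
expression.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class LogicalFlow:
    """One row of the Pipeline table (match is a callable stand-in)."""
    table_id: int
    priority: int                    # 0..65535, numerically higher wins
    match: Callable[[Dict], bool]    # stand-in for the match string
    actions: str


def lookup(flows: List[LogicalFlow], table_id: int, packet: Dict) -> str:
    """Return the actions of the highest-priority matching flow."""
    hits = [f for f in flows if f.table_id == table_id and f.match(packet)]
    if not hits:
        return "drop"                # default when no flow matches
    # Ties between equal priorities are undefined in the proposal; max()
    # here just returns the first flow with the highest priority.
    return max(hits, key=lambda f: f.priority).actions


flows = [
    LogicalFlow(0, 100, lambda p: p.get("vlan.present", False), "drop"),
    LogicalFlow(0, 0, lambda p: True, "resubmit"),
]
```

With these two flows, a VLAN-tagged packet in table 0 hits the
priority-100 flow, any other packet falls through to the priority-0
flow, and a lookup in an empty table falls back to the default drop.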

Matching Expressions
--------------------

Matching expressions provide a superset of OpenFlow matching
capabilities across packets in a logical datapath.  Expressions use a
syntax similar to Boolean expressions in a programming language.

Matching expressions have two kinds of primaries: fields and
constants.  A field names a piece of data or metadata.  The supported
fields are:

        metadata reg0 ... reg7 xreg0 ... xreg3
        inport outport queue
        eth.src eth.dst eth.type
        vlan.tci vlan.vid vlan.pcp vlan.present
        ip.proto ip.dscp ip.ecn ip.ttl ip.frag
        ip4.src ip4.dst
        ip6.src ip6.dst ip6.label
        arp.op arp.spa arp.tpa arp.sha arp.tha
        tcp.src tcp.dst tcp.flags
        udp.src udp.dst
        sctp.src sctp.dst
        icmp4.type icmp4.code
        icmp6.type icmp6.code
        nd.target nd.sll nd.tll

Subfields may be addressed using a [] suffix, e.g. tcp.src[0..7]
refers to the low 8 bits of the TCP source port.  A subfield may be
used in any context a field is allowed.
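
The meaning of a subfield can be shown with a few lines of Python.  The
subfield() helper is hypothetical; it assumes bit 0 is the least
significant bit and that both ends of the range are inclusive, which is
what the tcp.src[0..7] example implies.

```python
def subfield(value: int, lo: int, hi: int) -> int:
    """Return bits lo..hi (inclusive, bit 0 = least significant)."""
    if lo < 0 or lo > hi:
        raise ValueError("need 0 <= lo <= hi")
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


# TCP source port 8080 is 0x1f90.
TCP_SRC = 8080
LOW_BYTE = subfield(TCP_SRC, 0, 7)     # tcp.src[0..7]
HIGH_BYTE = subfield(TCP_SRC, 8, 15)   # tcp.src[8..15]
```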

Some fields have prerequisites.  These are satisfied by implicitly
adding clauses.  For example, "arp.op == 1" is equivalent to "eth.type
== 0x0806 && arp.op == 1", and "tcp.src == 80" is equivalent to
"(eth.type == 0x0800 || eth.type == 0x86dd) && ip.proto == 6 &&
tcp.src == 80".
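
A minimal sketch of how prerequisites might be added implicitly,
covering only the two cases spelled out above.  The PREREQS table and
the with_prereqs() helper are hypothetical, and the sketch works on
strings rather than on a parsed expression.

```python
# Prerequisite clauses keyed by field prefix.  Only the two prefixes
# described in the text are listed; a real table would cover every field.
PREREQS = {
    "arp": "eth.type == 0x0806",
    "tcp": "(eth.type == 0x0800 || eth.type == 0x86dd) && ip.proto == 6",
}


def with_prereqs(field: str, clause: str) -> str:
    """Prefix `clause` with the prerequisite implied by `field`, if any."""
    prefix = field.split(".")[0]
    prereq = PREREQS.get(prefix)
    return f"{prereq} && {clause}" if prereq else clause
```

For example, with_prereqs("arp.op", "arp.op == 1") yields the expanded
ARP expression given above, while a field with no entry in the table,
such as eth.src, is returned unchanged.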

Constants may be expressed in several forms: decimal integers,
hexadecimal integers prefixed by 0x, dotted-quad IPv4 addresses, IPv6
addresses in their standard forms, and Ethernet addresses as
colon-separated hex digits.  A constant in any of these forms may be
followed by a slash and a second constant (the mask) in the same form,
to form a masked constant.  IPv4 and IPv6 masks may be given as
integers, to express CIDR prefixes.
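
To illustrate the two mask spellings for IPv4, the sketch below turns a
constant into a (value, mask) pair of 32-bit integers using Python's
standard ipaddress module.  The parse_ipv4_constant() function is a
hypothetical helper, and treating an unmasked constant as having an
all-ones mask is an assumption of this sketch.

```python
import ipaddress

ALL_ONES = 0xFFFFFFFF


def parse_ipv4_constant(text: str):
    """Parse 'a.b.c.d', 'a.b.c.d/prefix' or 'a.b.c.d/m.m.m.m'."""
    addr, _, mask = text.partition("/")
    value = int(ipaddress.IPv4Address(addr))
    if not mask:
        return value, ALL_ONES            # no mask: match every bit
    if mask.isdigit():                    # integer mask: CIDR prefix length
        prefix = int(mask)
        if not 0 <= prefix <= 32:
            raise ValueError("prefix length must be 0..32")
        return value, (ALL_ONES << (32 - prefix)) & ALL_ONES
    return value, int(ipaddress.IPv4Address(mask))   # dotted-quad mask
```

Under these assumptions, "10.0.0.0/8" and "10.0.0.0/255.0.0.0" parse to
the same pair.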

The available operators, from highest to lowest precedence, are:

        ()
        ==   !=   <   <=   >   >=   in   not in
        !
        &&
        ||

The () operator is used for grouping.

The equality operator == is the most important operator.  Its operands
must be a field and an optionally masked constant, in either order.
The == operator yields true when the field's value equals the
constant's value for all the bits included in the mask.  The ==
operator translates simply and naturally to OpenFlow.
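
The semantics of == with a masked constant amount to one bitwise
comparison, sketched below.  The masked_eq() name is hypothetical and
fields are modeled as plain integers.

```python
def masked_eq(field_value: int, constant: int, mask: int) -> bool:
    """True when field and constant agree on every bit set in mask."""
    return (field_value & mask) == (constant & mask)


# ip4.dst == 10.0.0.0/8, with addresses written as 32-bit integers.
NET = 0x0A000000
MASK = 0xFF000000
```

Here masked_eq(0x0A010203, NET, MASK) is true because 10.1.2.3 lies in
10.0.0.0/8, and masked_eq(0x0B010203, NET, MASK) is false.  An unmasked
comparison is the same test with an all-ones mask.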

The inequality operator != yields the inverse of == but its syntax and
use are the same.  Implementation of the inequality operator is
expensive.

The relational operators are <, <=, >, and >=.  Their operands must be
a field and a constant, in either order; the constant must not be
masked.  These operators are most commonly useful for L4 ports,
e.g. "tcp.src < 1024".  Implementation of the relational operators is
expensive.
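
One reason a relational operator can be expensive is that a range
generally has to be covered by several masked equality matches.  The
sketch below shows a common prefix-decomposition technique; it is an
assumption about how such a translation could work, not a description
of what OVN will do.  The range_to_masked() function is hypothetical.

```python
def range_to_masked(lo: int, hi: int, width: int = 16):
    """Cover the integer range [lo, hi] with (value, mask) pairs."""
    if not 0 <= lo <= hi < (1 << width):
        raise ValueError("range out of bounds")
    full = (1 << width) - 1
    pairs = []
    while lo <= hi:
        # Largest power-of-two block that is aligned at lo ...
        size = lo & -lo if lo else 1 << width
        # ... and that does not extend past hi.
        while size > hi - lo + 1:
            size >>= 1
        pairs.append((lo, full & ~(size - 1)))
        lo += size
    return pairs


LOW_PORTS = range_to_masked(0, 1023)        # tcp.src < 1024
HIGH_PORTS = range_to_masked(1024, 65535)   # tcp.src >= 1024
```

In this sketch "tcp.src < 1024" happens to need only one masked match
(value 0 with mask 0xfc00), but "tcp.src >= 1024" needs six, and a
range with unaligned endpoints needs more.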

The set membership operator "in", with syntax "<field> in {
<constant1>, <constant2>, ... }", is syntactic sugar for "(<field>
== <constant1> || <field> == <constant2> || ...)".  Conversely
"<field> not in { <constant1>, <constant2>, ... }" is syntactic
sugar for "(<field> != <constant1> && <field> != <constant2> &&
...)".
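
The expansion of "in" and "not in" can be sketched as a simple string
rewrite.  The desugar_in() helper is hypothetical and operates on
already-tokenized constants.

```python
def desugar_in(field: str, constants, negate: bool = False) -> str:
    """Expand '<field> in {...}' (or 'not in') into ==/!= clauses."""
    constants = list(constants)
    if not constants:
        raise ValueError("the set must contain at least one constant")
    op, joiner = ("!=", " && ") if negate else ("==", " || ")
    return "(" + joiner.join(f"{field} {op} {c}" for c in constants) + ")"
```

For example, desugar_in("tcp.dst", ["80", "443"]) gives
"(tcp.dst == 80 || tcp.dst == 443)", and passing negate=True gives
"(tcp.dst != 80 && tcp.dst != 443)".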

The unary prefix operator ! yields its operand's inverse.

The logical AND operator && yields true only if both of its operands
are true.

The logical OR operator || yields true if at least one of its operands
is true.

(The above is pretty ambitious.  It probably makes sense to initially
implement only a subset of this specification.  The full specification
is written out mainly to get an idea of what a fully general matching
expression language could include.)

Actions
-------

Below, a <value> is either a <constant> or a <field>.  The following
actions seem most likely to be useful:

    drop                    syntactic sugar for no actions
    output(<value>)         output to port
    broadcast               output to every logical port except ingress port
    resubmit                execute next logical datapath table as subroutine
    set(<field>=<value>)    set data or metadata field, or copy between fields

The following are not well thought out:

    learn
    conntrack
    with(<field>=<value>) { <action>, ... }
                            execute actions with temporary changes to fields
    dec_ttl { <action>, ... } { <action>, ... }
                            decrement TTL; execute first set of actions if
                            successful, second set if TTL decrement fails
    icmp_reply { <action>, ... }
                            generate ICMP reply from packet, execute <action>s

Other actions can be added as needed (e.g. push_vlan, pop_vlan,
push_mpls, pop_mpls, ...).

Some of the OVN actions do not map directly to OpenFlow actions, e.g.:

  * with: Implemented as "stack_push", "set", <actions>, "stack_pop".

  * dec_ttl: Implemented as dec_ttl followed by the successful
    actions.  The failure case has to be implemented by ovn-controller
    interpreting packet-ins.  It might be difficult to identify the
    particular place in the processing pipeline in ovn-controller;
    maybe some restrictions will be necessary.

  * icmp_reply: Implemented by sending the packet to ovn-controller,
    which generates the ICMP reply and sends the packet back to
    ovs-vswitchd.
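
The lowering of "with" described above can be sketched as a list
rewrite.  The lower_with() function is hypothetical, and the action
spellings it emits are placeholders that follow the names used in the
text, not actual OpenFlow or Open vSwitch syntax.

```python
def lower_with(field: str, value: str, body):
    """Lower with(<field>=<value>) { body } to a flat action list.

    The field is saved on the stack, set to its temporary value for
    the duration of the body, then restored.
    """
    return ([f"stack_push({field})", f"set({field}={value})"]
            + list(body)
            + [f"stack_pop({field})"])


LOWERED = lower_with("reg0", "5", ["output(reg1)"])
```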


Implementing Features
=====================

Each of the OVN logical network features is implemented as a table
containing logical datapath flows and arranged into a pipeline.  These
are not fully fleshed out but here are some examples.

Ingress Admissibility Check
---------------------------

Some invariants of valid packets can be checked at ingress into the
pipeline, e.g.:

  * Discard packets with multicast source: eth.src[40] == 1

  * Discard packets with malformed VLAN header: eth.type == 0x8100 &&
    !vlan.present

  * Discard BPDUs: eth.dst == 01:80:c2:00:00:00/ff:ff:ff:ff:ff:f0

  * We don't plan to implement logical switch VLANs for the first
    version of OVN, so drop VLAN-tagged packets: vlan.present

A low-priority flow resubmits to the next pipeline stage.
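
The effect of this stage can be sketched as a Python function over a
packet modeled as a dictionary.  The admit() and mac_to_int() helpers
and the dictionary keys are hypothetical, and only the multicast-source
and VLAN checks are included to keep the sketch short.

```python
def mac_to_int(mac: str) -> int:
    """Convert 'aa:bb:cc:dd:ee:ff' to a 48-bit integer."""
    return int(mac.replace(":", ""), 16)


def admit(packet: dict) -> str:
    """Return 'drop' or 'resubmit' for the ingress admissibility stage."""
    # eth.src[40] == 1: the group bit is bit 40 of the 48-bit address,
    # i.e. the least significant bit of the first octet.
    if (mac_to_int(packet["eth.src"]) >> 40) & 1:
        return "drop"
    # Malformed VLAN header: eth.type == 0x8100 && !vlan.present.
    if packet["eth.type"] == 0x8100 and not packet["vlan.present"]:
        return "drop"
    # No logical switch VLANs in the first version: vlan.present.
    if packet["vlan.present"]:
        return "drop"
    # Low-priority fall-through to the next pipeline stage.
    return "resubmit"
```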

ACLs
----

Logical datapath flows for ACLs correspond closely to the ACLs
themselves.  "deny" ACLs drop packets, "allow" ACLs resubmit to the
next pipeline stage, and a default drop or allow is expressed as a
low-priority flow that drops or resubmits.
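
A minimal sketch of that correspondence follows.  The ACL dictionaries,
the acl_to_flows() function, and the use of an empty match string to
mean "match every packet" are all assumptions made for illustration.

```python
VERDICT_TO_ACTIONS = {"allow": "resubmit", "deny": "drop"}


def acl_to_flows(acls, default_verdict: str):
    """Translate ACLs into flows for a single pipeline stage."""
    flows = [
        {"priority": acl["priority"],
         "match": acl["match"],
         "actions": VERDICT_TO_ACTIONS[acl["verdict"]]}
        for acl in acls
    ]
    # Catch-all at the lowest priority.  The empty match string is an
    # assumed spelling for "match every packet".
    flows.append({"priority": 0, "match": "",
                  "actions": VERDICT_TO_ACTIONS[default_verdict]})
    return flows


ACLS = [
    {"priority": 200, "match": "tcp.dst == 22", "verdict": "deny"},
    {"priority": 100, "match": "ip4.src == 10.0.0.0/8", "verdict": "allow"},
]
FLOWS = acl_to_flows(ACLS, default_verdict="deny")
```

This sketch assumes every ACL priority is greater than 0 so that the
catch-all flow never shadows an explicit ACL.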

L2 Switching
------------

Many logical L2 switches do not need to do MAC learning, because the
MAC addresses of all of the VMs or logical routers on the switch are
known.  The flows required to process packets in this case are very
simple:

    For each known (<mac>, <logical-port>) pair: eth.dst == <mac>,
    actions=set(reg0=<logical-port>), resubmit

Multicast and broadcast are handled by repeating the actions above for
every logical port (the "broadcast" action may be useful in some
cases):

    eth.dst[40] == 1, actions=set(reg0=<logical-port-1>), resubmit,
    set(reg0=<logical-port-2>), resubmit, ...

The above assumes that we use reg0 to designate the logical output
port, but the particular register assignment doesn't matter as long as
the logical datapath flows are consistent.
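
The flows for a switch with known MAC addresses could be generated
along the following lines.  The l2_flows() function is hypothetical, it
follows the text in using reg0 for the logical output port, and it
returns (match, actions) pairs as plain strings.

```python
def l2_flows(mac_to_port: dict):
    """Build (match, actions) pairs for a logical switch without learning."""
    # One unicast flow per known (mac, logical port) pair.
    flows = [(f"eth.dst == {mac}", f"set(reg0={port}), resubmit")
             for mac, port in mac_to_port.items()]
    # Multicast and broadcast: repeat the output actions for every port.
    flood = ", ".join(f"set(reg0={port}), resubmit"
                      for port in mac_to_port.values())
    flows.append(("eth.dst[40] == 1", flood))
    return flows


SWITCH = {"00:00:00:00:00:01": 1, "00:00:00:00:00:02": 2}
FLOWS = l2_flows(SWITCH)
```

The flood flow in this sketch does not exclude the ingress port, so it
relies on that exclusion being handled elsewhere.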

OpenFlow by default prevents a packet received on a particular
OpenFlow port from being output back to the same OpenFlow port.  We
will want to do the same thing for logical ports in logical datapath
switching; it could be implemented either in the definition of the
logical datapath "output" and "broadcast" actions or in the logical
datapath flows themselves.

