ovn-architecture(7)           Open vSwitch Manual          ovn-architecture(7)



NAME
       ovn-architecture - Open Virtual Network architecture

DESCRIPTION
       OVN,  the  Open Virtual Network, is a system to support virtual network
       abstraction. OVN complements the existing capabilities of  OVS  to  add
       native support for virtual network abstractions, such as virtual L2 and
       L3 overlays and security groups. Services such as DHCP are also  desir‐
       able  features.  Just  like OVS, OVN’s design goal is to have a produc‐
       tion-quality implementation that can operate at significant scale.

       An OVN deployment consists of several components:

              ·      A Cloud Management System (CMS), which is OVN’s  ultimate
                     client  (via  its users and administrators). OVN integra‐
                     tion  requires  installing  a  CMS-specific  plugin   and
                     related software (see below). OVN initially targets Open‐
                     Stack as CMS.

                     We generally speak of ``the’’ CMS, but  one  can  imagine
                     scenarios  in which multiple CMSes manage different parts
                     of an OVN deployment.

              ·      An OVN Database physical or virtual node (or, eventually,
                     cluster) installed in a central location.

              ·      One  or more (usually many) hypervisors. Hypervisors must
                     run Open vSwitch and implement the interface described in
                     IntegrationGuide.rst in the OVS source tree. Any hypervi‐
                     sor platform supported by Open vSwitch is acceptable.

              ·      Zero or more gateways. A gateway extends  a  tunnel-based
                     logical  network  into a physical network by bidirection‐
                     ally forwarding packets between tunnels  and  a  physical
                     Ethernet  port.  This  allows non-virtualized machines to
                     participate in logical networks. A gateway may be a phys‐
                     ical  host,  a virtual machine, or an ASIC-based hardware
                     switch that supports the vtep(5) schema. (Support for the
                     latter will come later in OVN implementation.)

                      Hypervisors and gateways are together called transport
                      nodes or chassis.

       The diagram below shows how the major components  of  OVN  and  related
       software interact. Starting at the top of the diagram, we have:

              ·      The Cloud Management System, as defined above.

              ·      The  OVN/CMS  Plugin  is  the  component  of the CMS that
                     interfaces to OVN. In OpenStack, this is a Neutron  plug‐
                     in.  The  plugin’s main purpose is to translate the CMS’s
                     notion of logical network configuration,  stored  in  the
                     CMS’s  configuration  database  in a CMS-specific format,
                     into an intermediate representation understood by OVN.

                     This component is  necessarily  CMS-specific,  so  a  new
                     plugin  needs  to be developed for each CMS that is inte‐
                     grated with OVN. All of the components below this one  in
                     the diagram are CMS-independent.

              ·      The  OVN  Northbound  Database  receives the intermediate
                     representation of logical  network  configuration  passed
                     down  by the OVN/CMS Plugin. The database schema is meant
                     to be ``impedance matched’’ with the concepts used  in  a
                     CMS,  so  that  it  directly  supports notions of logical
                     switches, routers, ACLs, and so  on.  See  ovn-nb(5)  for
                     details.

                     The  OVN  Northbound  Database  has only two clients: the
                     OVN/CMS Plugin above it and ovn-northd below it.

              ·      ovn-northd(8) connects to  the  OVN  Northbound  Database
                     above  it  and  the  OVN Southbound Database below it. It
                     translates the logical network configuration in terms  of
                     conventional  network concepts, taken from the OVN North‐
                     bound Database, into logical datapath flows  in  the  OVN
                     Southbound Database below it.

              ·      The  OVN Southbound Database is the center of the system.
                      Its clients are ovn-northd(8) above it and
                      ovn-controller(8) on every transport node below it.

                     The OVN Southbound Database contains three kinds of data:
                     Physical Network (PN) tables that specify  how  to  reach
                     hypervisor  and  other nodes, Logical Network (LN) tables
                     that describe the logical network in terms  of  ``logical
                     datapath  flows,’’  and  Binding tables that link logical
                     network components’ locations to  the  physical  network.
                     The  hypervisors populate the PN and Port_Binding tables,
                     whereas ovn-northd(8) populates the LN tables.

                     OVN Southbound Database performance must scale  with  the
                     number  of transport nodes. This will likely require some
                     work on  ovsdb-server(1)  as  we  encounter  bottlenecks.
                     Clustering for availability may be needed.

       The remaining components are replicated onto each hypervisor:

              ·      ovn-controller(8)  is  OVN’s agent on each hypervisor and
                     software gateway. Northbound,  it  connects  to  the  OVN
                     Southbound  Database to learn about OVN configuration and
                      status and to populate the PN table and the Chassis
                      column in the Binding table with the hypervisor’s status. South‐
                     bound, it connects to ovs-vswitchd(8) as an OpenFlow con‐
                     troller,  for  control  over  network traffic, and to the
                     local ovsdb-server(1) to allow it to monitor and  control
                     Open vSwitch configuration.

              ·      ovs-vswitchd(8) and ovsdb-server(1) are conventional com‐
                     ponents of Open vSwitch.

                                         CMS
                                          |
                                          |
                              +-----------|-----------+
                              |           |           |
                              |     OVN/CMS Plugin    |
                              |           |           |
                              |           |           |
                              |   OVN Northbound DB   |
                              |           |           |
                              |           |           |
                              |       ovn-northd      |
                              |           |           |
                              +-----------|-----------+
                                          |
                                          |
                                +-------------------+
                                | OVN Southbound DB |
                                +-------------------+
                                          |
                                          |
                       +------------------+------------------+
                       |                  |                  |
         HV 1          |                  |    HV n          |
       +---------------|---------------+  .  +---------------|---------------+
       |               |               |  .  |               |               |
       |        ovn-controller         |  .  |        ovn-controller         |
       |         |          |          |  .  |         |          |          |
       |         |          |          |     |         |          |          |
       |  ovs-vswitchd   ovsdb-server  |     |  ovs-vswitchd   ovsdb-server  |
       |                               |     |                               |
       +-------------------------------+     +-------------------------------+


   Information Flow in OVN
       Configuration data in OVN flows from north to south. The  CMS,  through
       its  OVN/CMS  plugin,  passes  the  logical  network  configuration  to
       ovn-northd via the northbound database. In  turn,  ovn-northd  compiles
       the  configuration  into a lower-level form and passes it to all of the
       chassis via the southbound database.

       Status information in OVN flows from south to north. OVN currently pro‐
       vides  only  a few forms of status information. First, ovn-northd popu‐
       lates the up column in the northbound Logical_Switch_Port table:  if  a
       logical  port’s  chassis column in the southbound Port_Binding table is
       nonempty, it sets up to true, otherwise to false. This allows  the  CMS
       to detect when a VM’s networking has come up.
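
        For example, assuming a logical port whose name is the hypothetical
        vif-id vif1, the CMS or an administrator can read this status flag
        directly from the northbound database:

              ovn-nbctl get Logical_Switch_Port vif1 up

        The command prints the port’s up column, which ovn-northd sets to true
        once the corresponding southbound chassis column becomes nonempty.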

       Second, OVN provides feedback to the CMS on the realization of its con‐
       figuration, that is, whether the configuration provided by the CMS  has
        taken effect. This feature requires the CMS to participate in a
        sequence number protocol, which works the following way (a brief
        example of checking the resulting counters appears after the steps):

              1.
                When the CMS updates the configuration in the northbound data‐
                base, as part of the same transaction, it increments the value
                of the nb_cfg column in the NB_Global  table.  (This  is  only
                necessary  if the CMS wants to know when the configuration has
                been realized.)

              2.
                When ovn-northd updates the southbound  database  based  on  a
                given  snapshot  of  the northbound database, it copies nb_cfg
                from  northbound  NB_Global  into  the   southbound   database
                SB_Global  table,  as  part of the same transaction. (Thus, an
                observer monitoring both  databases  can  determine  when  the
                southbound database is caught up with the northbound.)

              3.
                After  ovn-northd  receives  confirmation  from the southbound
                database server that its changes have  committed,  it  updates
                sb_cfg in the northbound NB_Global table to the nb_cfg version
                that was pushed down. (Thus, the CMS or another  observer  can
                determine  when the southbound database is caught up without a
                connection to the southbound database.)

              4.
                The  ovn-controller  process  on  each  chassis  receives  the
                updated  southbound  database,  with  the updated nb_cfg. This
                process in turn updates the physical flows  installed  in  the
                chassis’s  Open  vSwitch instances. When it receives confirma‐
                tion from Open vSwitch  that  the  physical  flows  have  been
                updated,  it  updates  nb_cfg in its own Chassis record in the
                southbound database.

              5.
                ovn-northd monitors the nb_cfg column in all  of  the  Chassis
                records in the southbound database. It keeps track of the min‐
                imum value among all the records and copies it into the hv_cfg
                column  in  the  northbound NB_Global table. (Thus, the CMS or
                another observer can determine when  all  of  the  hypervisors
                have caught up to the northbound configuration.)
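
        As a minimal sketch of this protocol, assuming a CMS driven by hand
        with ovn-nbctl (a real CMS plugin speaks the OVSDB protocol directly),
        the counters can be bumped and read back as follows:

              # CMS side: increment nb_cfg, normally in the same transaction
              # as the configuration change it wants to track.
              ovn-nbctl set NB_Global . nb_cfg=42

              # Has the southbound database caught up?
              ovn-nbctl get NB_Global . sb_cfg

              # Have all hypervisors caught up?
              ovn-nbctl get NB_Global . hv_cfg

        Recent ovn-nbctl versions wrap this protocol in the sync command, e.g.
        ovn-nbctl --wait=hv sync blocks until hv_cfg catches up with nb_cfg.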

   Chassis Setup
       Each  chassis  in  an  OVN  deployment  must be configured with an Open
       vSwitch bridge dedicated for OVN’s use, called the integration  bridge.
       System  startup  scripts  may  create  this  bridge  prior  to starting
       ovn-controller if desired. If this bridge does not exist when  ovn-con‐
       troller  starts, it will be created automatically with the default con‐
       figuration  suggested  below.  The  ports  on  the  integration  bridge
       include:

              ·      On  any  chassis,  tunnel ports that OVN uses to maintain
                     logical  network   connectivity.   ovn-controller   adds,
                     updates, and removes these tunnel ports.

              ·      On a hypervisor, any VIFs that are to be attached to log‐
                     ical networks. The hypervisor itself, or the  integration
                     between  Open  vSwitch  and  the hypervisor (described in
                     IntegrationGuide.rst) takes care of this.  (This  is  not
                     part  of OVN or new to OVN; this is pre-existing integra‐
                     tion work that has already been done on hypervisors  that
                     support OVS.)

              ·      On  a gateway, the physical port used for logical network
                     connectivity. System startup scripts add this port to the
                     bridge  prior  to  starting ovn-controller. This can be a
                     patch port to another bridge, instead of a physical port,
                     in more sophisticated setups.

       Other  ports  should not be attached to the integration bridge. In par‐
       ticular, physical ports attached to the underlay network (as opposed to
       gateway  ports,  which are physical ports attached to logical networks)
       must not be attached to the integration bridge. Underlay physical ports
       should instead be attached to a separate Open vSwitch bridge (they need
       not be attached to any bridge at all, in fact).

       The integration bridge should be configured  as  described  below.  The
       effect    of    each    of    these    settings    is   documented   in
       ovs-vswitchd.conf.db(5):

              fail-mode=secure
                     Avoids switching packets between  isolated  logical  net‐
                     works  before  ovn-controller  starts  up. See Controller
                     Failure Settings in ovs-vsctl(8) for more information.

              other-config:disable-in-band=true
                     Suppresses in-band  control  flows  for  the  integration
                     bridge.  It  would  be  unusual for such flows to show up
                     anyway, because OVN uses a local controller (over a  Unix
                     domain  socket) instead of a remote controller. It’s pos‐
                     sible, however, for some other bridge in the same  system
                     to  have  an  in-band remote controller, and in that case
                     this suppresses the  flows  that  in-band  control  would
                     ordinarily  set  up.  Refer to the documentation for more
                     information.

       The customary name for the integration bridge is  br-int,  but  another
       name may be used.
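
        For reference, a minimal sketch of creating the integration bridge by
        hand with the settings above (ovn-controller creates an equivalent
        bridge automatically if none exists):

              ovs-vsctl -- --may-exist add-br br-int \
                        -- set Bridge br-int fail-mode=secure \
                           other-config:disable-in-band=true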

   Logical Networks
       A  logical  network  implements the same concepts as physical networks,
       but they are insulated from the physical network with tunnels or  other
       encapsulations.  This  allows  logical networks to have separate IP and
       other address spaces that overlap, without conflicting, with those used
       for physical networks. Logical network topologies can be arranged with‐
       out regard for the topologies of the physical networks  on  which  they
       run.

        Logical network concepts in OVN include the following (the sketch after
        this list shows how a CMS or administrator might create them with
        ovn-nbctl):

              ·      Logical   switches,   the  logical  version  of  Ethernet
                     switches.

              ·      Logical routers, the logical version of IP routers. Logi‐
                     cal  switches and routers can be connected into sophisti‐
                     cated topologies.

              ·      Logical datapaths are the logical version of an  OpenFlow
                     switch. Logical switches and routers are both implemented
                     as logical datapaths.

              ·      Logical ports represent the points of connectivity in and
                     out  of logical switches and logical routers. Some common
                     types of logical ports are:

                     ·      Logical ports representing VIFs.

                     ·      Localnet ports represent the points of  connectiv‐
                            ity between logical switches and the physical net‐
                            work. They are  implemented  as  OVS  patch  ports
                            between  the  integration  bridge and the separate
                            Open vSwitch bridge that underlay  physical  ports
                            attach to.

                     ·      Logical  patch  ports represent the points of con‐
                            nectivity between  logical  switches  and  logical
                            routers,  and  in  some cases between peer logical
                            routers. There is a pair of logical patch ports at
                            each such point of connectivity, one on each side.
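
        As an illustration, a CMS plugin or administrator might create some of
        these objects with ovn-nbctl as sketched below, using the hypothetical
        names sw0 (a logical switch), vif1 (a VIF’s logical port), physnet1 (a
        physical network), and br-phys (the separate Open vSwitch bridge):

              ovn-nbctl ls-add sw0
              ovn-nbctl lsp-add sw0 vif1
              ovn-nbctl lsp-set-addresses vif1 "00:00:00:00:00:01 10.0.0.11"

              # A localnet port connecting sw0 to physnet1.
              ovn-nbctl lsp-add sw0 sw0-physnet1
              ovn-nbctl lsp-set-type sw0-physnet1 localnet
              ovn-nbctl lsp-set-addresses sw0-physnet1 unknown
              ovn-nbctl lsp-set-options sw0-physnet1 network_name=physnet1

              # On each chassis that provides the physical connectivity, map
              # physnet1 to br-phys; ovn-controller then creates the OVS
              # patch ports mentioned above.
              ovs-vsctl set Open_vSwitch . \
                        external-ids:ovn-bridge-mappings=physnet1:br-phys

        Logical routers and logical patch ports are illustrated in Logical
        Routers and Logical Patch Ports, below.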

   Life Cycle of a VIF
       Tables and their schemas presented in isolation are difficult to under‐
       stand. Here’s an example.

        A VIF on a hypervisor is a virtual network interface attached to either
        a VM or a container running directly on that hypervisor (as opposed to
        the interface of a container running inside a VM).

       The steps in this example refer often to details of  the  OVN  and  OVN
       Northbound  database  schemas.  Please  see  ovn-sb(5)  and  ovn-nb(5),
       respectively, for the full story on these databases.

              1.
                A VIF’s life cycle begins when a CMS administrator  creates  a
                new  VIF  using the CMS user interface or API and adds it to a
                switch (one implemented by OVN as a logical switch).  The  CMS
                 updates its own configuration. This includes associating a
                 unique, persistent identifier vif-id and an Ethernet address
                 mac with the VIF.

              2.
                The  CMS plugin updates the OVN Northbound database to include
                the new VIF, by adding a row to the Logical_Switch_Port table.
                 In the new row, name is vif-id, mac is mac, switch points to
                 the OVN logical switch’s Logical_Switch record, and other
                 columns are initialized appropriately. (The sketch after these
                 steps shows equivalent ovn-nbctl commands.)

              3.
                 ovn-northd receives the OVN Northbound database update. In
                 turn, it makes the corresponding updates to the OVN Southbound
                 database, by adding rows to the OVN Southbound database
                 Logical_Flow table to reflect the new port, e.g. adding a flow
                 to recognize that packets destined to the new port’s MAC
                 address should be delivered to it, and updating the flow that
                 delivers broadcast and multicast packets to include the new
                 port. It also creates a record in the Binding table and
                 populates all its columns except the column that identifies
                 the chassis.

              4.
                 On every hypervisor, ovn-controller receives the Logical_Flow
                 table updates that ovn-northd made in the previous step. As
                 long as the VM that owns the VIF is powered off, ovn-controller
                 cannot do much; it cannot, for example, arrange to send packets
                 to or receive packets from the VIF, because the VIF does not
                 actually exist anywhere.

              5.
                 Eventually, a user powers on the VM that owns the VIF. On the
                 hypervisor where the VM is powered on, the integration between
                 the hypervisor and Open vSwitch (described in
                 IntegrationGuide.rst) adds the VIF to the OVN integration
                 bridge and stores vif-id in external_ids:iface-id to indicate
                 that the interface is an instantiation of the new VIF. (None of
                 this code is new in OVN; this is pre-existing integration work
                 that has already been done on hypervisors that support OVS.)

              6.
                 On the hypervisor where the VM is powered on, ovn-controller
                 notices external_ids:iface-id in the new Interface. In
                 response, in the OVN Southbound DB, it updates the Binding
                 table’s chassis column for the row whose logical port matches
                 external_ids:iface-id, linking the logical port to the
                 hypervisor. Afterward, ovn-controller updates the local
                 hypervisor’s OpenFlow tables so that packets to and from the
                 VIF are properly handled.

              7.
                Some  CMS  systems, including OpenStack, fully start a VM only
                 when its networking is ready. To support this, ovn-northd
                 notices the newly updated chassis column for the row in the
                 Binding table and pushes this upward by updating the up column
                 in the
                OVN  Northbound  database’s Logical_Switch_Port table to indi‐
                cate that the VIF is now up. The CMS, if it uses this feature,
                can then react by allowing the VM’s execution to proceed.

              8.
                On  every  hypervisor  but  the  one  where  the  VIF resides,
                ovn-controller notices the completely  populated  row  in  the
                Binding table. This provides ovn-controller the physical loca‐
                tion of the logical port, so each instance updates  the  Open‐
                Flow  tables of its switch (based on logical datapath flows in
                the OVN DB Logical_Flow table) so that packets to and from the
                VIF can be properly handled via tunnels.

              9.
                Eventually, a user powers off the VM that owns the VIF. On the
                hypervisor where the VM was powered off, the  VIF  is  deleted
                from the OVN integration bridge.

              10.
                On the hypervisor where the VM was powered off, ovn-controller
                notices that the VIF was deleted. In response, it removes  the
                Chassis  column  content  in the Binding table for the logical
                port.

              11.
                On every hypervisor, ovn-controller notices the empty  Chassis
                column  in  the Binding table’s row for the logical port. This
                means that ovn-controller no longer knows the  physical  loca‐
                tion  of  the logical port, so each instance updates its Open‐
                Flow table to reflect that.

              12.
                Eventually, when the VIF (or  its  entire  VM)  is  no  longer
                needed  by  anyone, an administrator deletes the VIF using the
                CMS user interface or API. The CMS updates its own  configura‐
                tion.

              13.
                The  CMS  plugin removes the VIF from the OVN Northbound data‐
                base, by deleting its row in the Logical_Switch_Port table.

              14.
                ovn-northd receives the OVN  Northbound  update  and  in  turn
                updates  the  OVN Southbound database accordingly, by removing
                 or updating the rows from the OVN Southbound database
                 Logical_Flow table and Binding table that were related to the now-
                destroyed VIF.

              15.
                On every hypervisor, ovn-controller receives the  Logical_Flow
                table  updates  that  ovn-northd  made  in  the previous step.
                ovn-controller updates OpenFlow tables to reflect the  update,
                although  there  may  not  be  much  to  do, since the VIF had
                 already become unreachable when it was removed from the
                 Binding table in a previous step.
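
        Continuing the hypothetical sw0/vif1 example, the northbound side of
        steps 1 and 2 and the hypervisor side of step 5 correspond roughly to
        the commands below. (A real CMS plugin and hypervisor integration
        perform these operations over the OVSDB protocol rather than with the
        command-line tools; the interface name tap-vif1 is illustrative.)

              # Steps 1-2: the CMS plugin adds the VIF’s logical port.
              ovn-nbctl lsp-add sw0 vif1
              ovn-nbctl lsp-set-addresses vif1 "00:00:00:00:00:01 10.0.0.11"

              # Step 5: the hypervisor integration attaches the instantiated
              # VIF to the integration bridge and records the vif-id.
              ovs-vsctl add-port br-int tap-vif1 \
                        -- set Interface tap-vif1 external_ids:iface-id=vif1

              # Step 7: the CMS can poll for the result.
              ovn-nbctl get Logical_Switch_Port vif1 up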

   Life Cycle of a Container Interface Inside a VM
        OVN provides virtual network abstractions by converting information
        written in the OVN_NB database into OpenFlow flows in each hypervisor.
        Secure virtual networking for multiple tenants can only be provided if
        OVN controller is the only entity that can modify flows in Open
        vSwitch. When
       the  Open vSwitch integration bridge resides in the hypervisor, it is a
       fair assumption to make that tenant workloads running inside VMs cannot
       make any changes to Open vSwitch flows.

       If  the infrastructure provider trusts the applications inside the con‐
       tainers not to break out and modify the Open vSwitch flows,  then  con‐
       tainers  can be run in hypervisors. This is also the case when contain‐
       ers are run inside the VMs and Open  vSwitch  integration  bridge  with
       flows  added  by  OVN  controller  resides in the same VM. For both the
       above cases, the workflow is the same as explained with an  example  in
       the previous section ("Life Cycle of a VIF").

       This  section talks about the life cycle of a container interface (CIF)
       when containers are created in the VMs and the Open vSwitch integration
       bridge resides inside the hypervisor. In this case, even if a container
       application breaks out, other tenants are not affected because the con‐
       tainers  running  inside  the  VMs  cannot modify the flows in the Open
       vSwitch integration bridge.

       When multiple containers are created inside a VM,  there  are  multiple
        CIFs associated with them. The network traffic associated with these
        CIFs needs to reach the Open vSwitch integration bridge running in the
       hypervisor  for OVN to support virtual network abstractions. OVN should
       also be able to distinguish network traffic coming from different CIFs.
       There are two ways to distinguish network traffic of CIFs.

       One  way  is  to  provide one VIF for every CIF (1:1 model). This means
       that there could be a lot of network devices in  the  hypervisor.  This
       would slow down OVS because of all the additional CPU cycles needed for
       the management of all the VIFs. It would also mean that the entity cre‐
       ating  the  containers in a VM should also be able to create the corre‐
       sponding VIFs in the hypervisor.

       The second way is to provide a single VIF  for  all  the  CIFs  (1:many
       model).  OVN could then distinguish network traffic coming from differ‐
       ent CIFs via a tag written in every packet. OVN uses this mechanism and
       uses VLAN as the tagging mechanism.
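
        As a sketch, reusing the hypothetical logical switch sw0 and VM port
        vif1 from the earlier examples, an entity creating a container
        interface cif1 that rides vif1 with VLAN tag 42 could populate the
        northbound database like this:

              ovn-nbctl lsp-add sw0 cif1 vif1 42
              ovn-nbctl lsp-set-addresses cif1 "00:00:00:00:00:02 10.0.0.12"

        The third and fourth arguments to lsp-add set the new port’s
        parent_name and tag columns, whose use the steps below describe.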

              1.
                 A CIF’s life cycle begins when a container is spawned inside
                 a VM by either the same CMS that created the VM, a tenant that
                 owns that VM, or even a container orchestration system
                 different from the CMS that initially created the VM. Whatever
                 the entity is, it will need to know the vif-id associated with
                 the network interface of the VM through which the container
                 interface’s network traffic is expected to go. The entity that
                 creates the container interface will also need to choose an
                 unused VLAN inside that VM.

              2.
                The  container spawning entity (either directly or through the
                CMS that manages the underlying  infrastructure)  updates  the
                OVN  Northbound  database  to include the new CIF, by adding a
                 row to the Logical_Switch_Port table. In the new row, name is
                 any unique identifier, parent_name is the vif-id of the VM
                 through which the CIF’s network traffic is expected to go, and
                 tag is the VLAN tag that identifies the network traffic of
                 that CIF.

              3.
                ovn-northd receives the OVN  Northbound  database  update.  In
                turn, it makes the corresponding updates to the OVN Southbound
                 database, by adding rows to the OVN Southbound database’s
                 Logical_Flow table to reflect the new port and also by creating a
                new row in the Binding table and populating  all  its  columns
                except the column that identifies the chassis.

              4.
                 On every hypervisor, ovn-controller subscribes to the changes
                 in the Binding table. When ovn-northd creates a new row with a
                 value in the parent_port column of the Binding table, the
                 ovn-controller on the hypervisor whose OVN integration bridge
                 has that same vif-id in external_ids:iface-id updates the local
                 hypervisor’s OpenFlow tables so that packets to and from the
                 VIF with the particular VLAN tag are properly handled.
                 Afterward it updates the chassis column of the Binding table to
                 reflect the physical location.

              5.
                 One can only start the application inside the container after
                 the underlying network is ready. To support this, ovn-northd
                 notices the updated chassis column in the Binding table and
                 updates the up column in the OVN Northbound database’s
                 Logical_Switch_Port table to indicate that the CIF is now up.
                 The entity responsible for starting the container application
                 queries this value and starts the application.

              6.
                 Eventually the entity that created and started the container
                 stops it. The entity, through the CMS (or directly), deletes
                 its row in the Logical_Switch_Port table.

              7.
                ovn-northd  receives  the  OVN  Northbound  update and in turn
                updates the OVN Southbound database accordingly,  by  removing
                 or updating the rows from the OVN Southbound database
                 Logical_Flow table that were related to the now-destroyed CIF. It
                also deletes the row in the Binding table for that CIF.

              8.
                On  every hypervisor, ovn-controller receives the Logical_Flow
                table updates that  ovn-northd  made  in  the  previous  step.
                ovn-controller updates OpenFlow tables to reflect the update.

   Architectural Physical Life Cycle of a Packet
       This section describes how a packet travels from one virtual machine or
       container to another through OVN. This description focuses on the phys‐
       ical treatment of a packet; for a description of the logical life cycle
       of a packet, please refer to the Logical_Flow table in ovn-sb(5).

        This section mentions several data and metadata fields, summarized here
        for clarity (the note after the list shows how to observe them on a
        chassis):

              tunnel key
                     When  OVN encapsulates a packet in Geneve or another tun‐
                     nel, it attaches extra data to it to allow the  receiving
                     OVN  instance to process it correctly. This takes differ‐
                     ent forms depending on the particular encapsulation,  but
                     in  each  case we refer to it here as the ``tunnel key.’’
                     See Tunnel Encapsulations, below, for details.

              logical datapath field
                     A field that denotes the logical datapath through which a
                     packet  is being processed. OVN uses the field that Open‐
                     Flow 1.1+ simply (and confusingly) calls ``metadata’’  to
                     store  the logical datapath. (This field is passed across
                     tunnels as part of the tunnel key.)

              logical input port field
                     A field that denotes the  logical  port  from  which  the
                     packet  entered  the logical datapath. OVN stores this in
                     Open vSwitch extension register number 14.

                     Geneve and STT tunnels pass this field  as  part  of  the
                     tunnel  key.  Although  VXLAN  tunnels  do not explicitly
                     carry a logical input port, OVN only uses VXLAN to commu‐
                     nicate  with gateways that from OVN’s perspective consist
                     of only a single logical port, so that OVN  can  set  the
                     logical  input  port  field to this one on ingress to the
                     OVN logical pipeline.

              logical output port field
                     A field that denotes the  logical  port  from  which  the
                     packet  will leave the logical datapath. This is initial‐
                     ized to 0 at the beginning of the logical  ingress  pipe‐
                     line.  OVN stores this in Open vSwitch extension register
                     number 15.

                      Geneve and STT tunnels pass this field as part of the
                      tunnel key. VXLAN tunnels do not. Because a VXLAN tunnel
                      key carries no logical output port field, when an OVN
                      hypervisor receives a packet from a VXLAN tunnel, it
                      resubmits the packet to table 16 to determine the output
                      port(s); when the packet reaches table 32, it is
                      resubmitted to table 33 for local delivery by checking the
                      MLF_RCV_FROM_VXLAN flag, which is set when the packet
                      arrives from a VXLAN tunnel.

              conntrack zone field for logical ports
                     A  field  that  denotes  the connection tracking zone for
                     logical ports. The value only has local significance  and
                     is not meaningful between chassis. This is initialized to
                     0 at the beginning of the logical ingress  pipeline.  OVN
                     stores this in Open vSwitch extension register number 13.

              conntrack zone fields for routers
                     Fields  that  denote  the  connection  tracking zones for
                     routers. These values only have  local  significance  and
                     are  not  meaningful between chassis. OVN stores the zone
                      information for DNATting in Open vSwitch extension
                      register number 11 and zone information for SNATting in Open
                     vSwitch extension register number 12.

              logical flow flags
                     The logical flags are intended to handle keeping  context
                     between  tables  in order to decide which rules in subse‐
                     quent tables are matched. These values  only  have  local
                     significance  and are not meaningful between chassis. OVN
                     stores the logical flags in Open vSwitch extension regis‐
                     ter number 10.

              VLAN ID
                      The VLAN ID is used as an interface between OVN and
                      containers nested inside a VM (see Life Cycle of a
                      Container Interface Inside a VM, above, for more
                      information).
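
        As a rough guide to reading flow dumps on a chassis, these fields show
        up in OpenFlow matches and actions as metadata (logical datapath),
        reg14 (logical input port), reg15 (logical output port), reg13
        (conntrack zone for logical ports), reg11 and reg12 (router conntrack
        zones), and reg10 (logical flow flags), for example in the output of:

              ovs-ofctl -O OpenFlow13 dump-flows br-int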

       Initially,  a  VM or container on the ingress hypervisor sends a packet
       on a port attached to the OVN integration bridge. Then:

              1.
                OpenFlow table 0 performs physical-to-logical translation.  It
                matches  the  packet’s  ingress port. Its actions annotate the
                packet with logical metadata, by setting the logical  datapath
                field  to  identify  the  logical  datapath that the packet is
                traversing and the logical input port field  to  identify  the
                ingress port. Then it resubmits to table 16 to enter the logi‐
                cal ingress pipeline.

                Packets that originate from a container nested within a VM are
                treated in a slightly different way. The originating container
                can be distinguished based on the VIF-specific VLAN ID, so the
                physical-to-logical  translation  flows  additionally match on
                VLAN ID and the actions strip the VLAN header. Following  this
                step,  OVN  treats packets from containers just like any other
                packets.

                Table 0 also processes packets that arrive from other chassis.
                It  distinguishes  them  from  other  packets by ingress port,
                which is a tunnel. As with packets just entering the OVN pipe‐
                line, the actions annotate these packets with logical datapath
                and logical ingress port metadata. In  addition,  the  actions
                set  the logical output port field, which is available because
                in OVN tunneling occurs  after  the  logical  output  port  is
                known. These three pieces of information are obtained from the
                tunnel encapsulation metadata (see Tunnel  Encapsulations  for
                encoding  details).  Then  the actions resubmit to table 33 to
                enter the logical egress pipeline.

              2.
                OpenFlow tables 16 through  31  execute  the  logical  ingress
                pipeline  from  the  Logical_Flow  table in the OVN Southbound
                database. These tables are expressed entirely in terms of log‐
                ical  concepts like logical ports and logical datapaths. A big
                part of ovn-controller’s job is to translate them into equiva‐
                lent  OpenFlow (in particular it translates the table numbers:
                Logical_Flow tables 0 through 15  become  OpenFlow  tables  16
                through 31).

                Each  logical  flow  maps  to  one  or more OpenFlow flows. An
                actual packet ordinarily matches only one of  these,  although
                in some cases it can match more than one of these flows (which
                is not a problem because all of them have the  same  actions).
                ovn-controller  uses  the  first 32 bits of the logical flow’s
                 UUID as the cookie for its OpenFlow flow or flows. (This is
                 not necessarily unique, since the first 32 bits of a logical
                 flow’s UUID are not necessarily unique.)

                Some logical flows can map to the Open  vSwitch  ``conjunctive
                 match’’ extension (see ovs-fields(7)). Flows with a
                 conjunction action use an OpenFlow cookie of 0, because they can cor‐
                respond  to  multiple  logical  flows. The OpenFlow flow for a
                conjunctive match includes a match on conj_id.

                Some logical flows may not  be  represented  in  the  OpenFlow
                tables  on  a  given  hypervisor, if they could not be used on
                that hypervisor. For example, if no VIF in  a  logical  switch
                resides  on  a given hypervisor, and the logical switch is not
                otherwise reachable on that hypervisor (e.g. over a series  of
                hops  through logical switches and routers starting from a VIF
                on the hypervisor), then the logical flow may  not  be  repre‐
                sented there.

                Most  OVN actions have fairly obvious implementations in Open‐
                Flow (with OVS  extensions),  e.g.  next;  is  implemented  as
                resubmit,  field  =  constant;  as  set_field. A few are worth
                describing in more detail:

                output:
                       Implemented by resubmitting the packet to table 32.  If
                       the pipeline executes more than one output action, then
                       each one is separately resubmitted to  table  32.  This
                       can  be  used  to send multiple copies of the packet to
                       multiple ports. (If the packet was not modified between
                       the output actions, and some of the copies are destined
                       to the same hypervisor, then using a logical  multicast
                       output port would save bandwidth between hypervisors.)

                get_arp(P, A);
                get_nd(P, A);
                     Implemented  by  storing  arguments into OpenFlow fields,
                     then resubmitting to table 66, which ovn-controller popu‐
                     lates  with flows generated from the MAC_Binding table in
                     the OVN Southbound database. If there is a match in table
                     66,  then its actions store the bound MAC in the Ethernet
                     destination address field.

                     (The OpenFlow  actions  save  and  restore  the  OpenFlow
                     fields used for the arguments, so that the OVN actions do
                     not have to be aware of this temporary use.)

                put_arp(P, A, E);
                put_nd(P, A, E);
                     Implemented  by  storing  the  arguments  into   OpenFlow
                     fields, then outputting a packet to ovn-controller, which
                     updates the MAC_Binding table.

                     (The OpenFlow  actions  save  and  restore  the  OpenFlow
                     fields used for the arguments, so that the OVN actions do
                     not have to be aware of this temporary use.)

              3.
                OpenFlow tables 32 through 47 implement the output  action  in
                the  logical  ingress pipeline. Specifically, table 32 handles
                packets to remote hypervisors, table 33 handles packets to the
                local  hypervisor,  and  table 34 checks whether packets whose
                logical ingress and egress port are the same  should  be  dis‐
                carded.

                Logical patch ports are a special case. Logical patch ports do
                not have a physical location and effectively reside  on  every
                hypervisor.  Thus,  flow  table 33, for output to ports on the
                local hypervisor, naturally implements output to unicast logi‐
                cal  patch  ports  too.  However, applying the same logic to a
                logical patch port that is part of a logical  multicast  group
                yields  packet  duplication, because each hypervisor that con‐
                tains a logical port in the multicast group will  also  output
                the  packet  to the logical patch port. Thus, multicast groups
                implement output to logical patch ports in table 32.

                Each flow in table 32 matches on a  logical  output  port  for
                unicast or multicast logical ports that include a logical port
                on a remote hypervisor. Each flow’s actions implement  sending
                a  packet  to  the port it matches. For unicast logical output
                ports on remote hypervisors, the actions set the tunnel key to
                the  correct value, then send the packet on the tunnel port to
                the correct hypervisor. (When the remote  hypervisor  receives
                the  packet,  table  0  there  will recognize it as a tunneled
                packet and pass it along to table 33.) For  multicast  logical
                output  ports, the actions send one copy of the packet to each
                remote hypervisor, in the same way  as  for  unicast  destina‐
                tions.  If  a multicast group includes a logical port or ports
                on the local hypervisor, then its actions also resubmit to ta‐
                ble  33. Table 32 also includes a fallback flow that resubmits
                 to table 33 if there is no other match. Table 32 also
                 contains a higher priority rule that matches packets received
                 from VXLAN tunnels, based on the MLF_RCV_FROM_VXLAN flag, and
                 resubmits these packets to table 33 for local delivery. Packets
                 received from VXLAN tunnels reach here because the VXLAN tunnel
                 key lacks a logical output port field, so these packets had to
                 be resubmitted to table 16 to determine the output port.

                Flows in table 33 resemble those in table 32 but  for  logical
                ports  that  reside  locally rather than remotely. For unicast
                logical output ports on the local hypervisor, the actions just
                resubmit  to table 34. For multicast output ports that include
                one or more logical ports on the local  hypervisor,  for  each
                such  logical  port  P,  the actions change the logical output
                port to P, then resubmit to table 34.

                 A special case arises when a localnet port exists on the
                 datapath: a remote port is then reached by switching through
                 the localnet port rather than by tunneling. In this case,
                 instead of adding a flow in table 32 to reach the remote port,
                 a flow is added in table 33 to change the logical output port
                 to the localnet port and resubmit to table 33, as if the packet
                 were unicast to a logical port on the local hypervisor.

                Table 34 matches and drops packets for which the logical input
                and  output ports are the same and the MLF_ALLOW_LOOPBACK flag
                is not set. It resubmits other packets to table 48.

              4.
                OpenFlow tables 48 through 63 execute the logical egress pipe‐
                line  from  the Logical_Flow table in the OVN Southbound data‐
                base. The egress pipeline can perform a final stage of valida‐
                tion  before  packet  delivery.  Eventually, it may execute an
                output action, which ovn-controller implements by resubmitting
                to  table  64.  A packet for which the pipeline never executes
                output is effectively  dropped  (although  it  may  have  been
                transmitted through a tunnel across a physical network).

                The  egress  pipeline cannot change the logical output port or
                cause further tunneling.

              5.
                 Table 64 bypasses OpenFlow loopback when MLF_ALLOW_LOOPBACK
                 is set. Logical loopback was handled in table 34, but OpenFlow
                 by default also prevents loopback to the OpenFlow ingress port.
                 Thus, when MLF_ALLOW_LOOPBACK is set, OpenFlow table 64 saves
                 the OpenFlow ingress port, sets it to zero, resubmits to table
                 65 for logical-to-physical translation, and then restores the
                 OpenFlow ingress port, effectively disabling OpenFlow loopback
                 prevention. When MLF_ALLOW_LOOPBACK is unset, the table 64 flow
                 simply resubmits to table 65.

              6.
                OpenFlow table 65  performs  logical-to-physical  translation,
                the  opposite  of  table  0.  It  matches the packet’s logical
                egress port.  Its  actions  output  the  packet  to  the  port
                attached  to  the  OVN integration bridge that represents that
                 logical port. If the logical egress port is a container
                 nested within a VM, then before sending the packet the actions
                 push on a VLAN header with an appropriate VLAN ID. (The example
                 after these steps shows how to inspect these OpenFlow tables on
                 a chassis.)
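
        The pipeline described above can be observed on a running chassis. A
        few illustrative commands (the table layout may differ between OVN
        versions):

              # Logical flows, as stored in the southbound database.
              ovn-sbctl lflow-list

              # Physical flows for selected stages on the local chassis.
              ovs-ofctl -O OpenFlow13 dump-flows br-int table=0   # phys-to-logical
              ovs-ofctl -O OpenFlow13 dump-flows br-int table=16  # logical ingress
              ovs-ofctl -O OpenFlow13 dump-flows br-int table=32  # output, remote
              ovs-ofctl -O OpenFlow13 dump-flows br-int table=65  # logical-to-phys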

   Logical Routers and Logical Patch Ports
       Typically logical routers and logical patch ports do not have a  physi‐
       cal  location  and  effectively reside on every hypervisor. This is the
       case for logical  patch  ports  between  logical  routers  and  logical
       switches behind those logical routers, to which VMs (and VIFs) attach.
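
        As an ovn-nbctl sketch (all names hypothetical), a logical router lr0
        connecting two logical switches sw0 and sw1 consists of one logical
        router port plus one router-type logical switch port per switch; OVN
        pairs each such pair up as logical patch ports:

              ovn-nbctl lr-add lr0

              ovn-nbctl lrp-add lr0 lrp-sw0 00:00:00:ff:00:01 10.0.0.1/24
              ovn-nbctl lsp-add sw0 sw0-lr0
              ovn-nbctl lsp-set-type sw0-lr0 router
              ovn-nbctl lsp-set-addresses sw0-lr0 router
              ovn-nbctl lsp-set-options sw0-lr0 router-port=lrp-sw0

              ovn-nbctl lrp-add lr0 lrp-sw1 00:00:00:ff:00:02 10.0.1.1/24
              ovn-nbctl lsp-add sw1 sw1-lr0
              ovn-nbctl lsp-set-type sw1-lr0 router
              ovn-nbctl lsp-set-addresses sw1-lr0 router
              ovn-nbctl lsp-set-options sw1-lr0 router-port=lrp-sw1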

       Consider a packet sent from one virtual machine or container to another
       VM or container that resides on a different  subnet.  The  packet  will
        traverse tables 0 to 65 as described in the previous section
        Architectural Physical Life Cycle of a Packet, using the logical datapath rep‐
       resenting  the  logical switch that the sender is attached to. At table
       32, the packet will use the fallback flow that resubmits locally to ta‐
       ble 33 on the same hypervisor. In this case, all of the processing from
       table 0 to table 65 occurs on the hypervisor where the sender resides.

       When the packet reaches table 65, the logical egress port is a  logical
       patch port. The implementation in table 65 differs depending on the OVS
       version, although the observed behavior is meant to be the same:

              ·      In OVS versions 2.6 and earlier, table 65 outputs  to  an
                     OVS  patch  port  that represents the logical patch port.
                     The packet re-enters the OpenFlow flow table from the OVS
                     patch  port’s peer in table 0, which identifies the logi‐
                     cal datapath and logical input  port  based  on  the  OVS
                     patch port’s OpenFlow port number.

              ·      In  OVS  versions 2.7 and later, the packet is cloned and
                     resubmitted directly to OpenFlow flow table  16,  setting
                     the  logical ingress port to the peer logical patch port,
                     and using the peer logical patch port’s logical  datapath
                     (that represents the logical router).

       The  packet  re-enters the ingress pipeline in order to traverse tables
       16 to 65 again, this time using the logical datapath  representing  the
       logical  router.  The processing continues as described in the previous
        section Architectural Physical Life Cycle of a Packet. When the packet
        reaches table 65, the logical egress port will once again be a logical
       patch port. In the same manner as described above, this  logical  patch
       port  will  cause the packet to be resubmitted to OpenFlow tables 16 to
       65, this time using  the  logical  datapath  representing  the  logical
       switch that the destination VM or container is attached to.

       The  packet  traverses  tables  16 to 65 a third and final time. If the
       destination VM or container resides on a remote hypervisor, then  table
       32  will  send the packet on a tunnel port from the sender’s hypervisor
       to the remote hypervisor. Finally  table  65  will  output  the  packet
       directly to the destination VM or container.

       The  following  sections describe two exceptions, where logical routers
       and/or logical patch ports are associated with a physical location.

     Gateway Routers

       A gateway router is a logical router that is bound to a physical  loca‐
       tion.  This  includes  all  of  the  logical patch ports of the logical
       router, as well as all of the  peer  logical  patch  ports  on  logical
       switches.  In the OVN Southbound database, the Port_Binding entries for
       these logical patch ports use the type l3gateway rather than patch,  in
       order  to  distinguish  that  these  logical patch ports are bound to a
       chassis.

       When a hypervisor processes a packet on a logical datapath representing
        a logical switch, and the logical egress port is an l3gateway port rep‐
       resenting connectivity to a gateway router, the  packet  will  match  a
       flow  in table 32 that sends the packet on a tunnel port to the chassis
       where the gateway router resides. This processing in table 32  is  done
       in the same manner as for VIFs.

       Gateway  routers  are  typically  used  in  between distributed logical
       routers and physical networks. The distributed logical router  and  the
       logical  switches behind it, to which VMs and containers attach, effec‐
       tively reside on each hypervisor. The distributed router and the  gate‐
       way  router are connected by another logical switch, sometimes referred
       to as a join logical switch. On the other side, the gateway router con‐
       nects  to another logical switch that has a localnet port connecting to
       the physical network.

       When using gateway routers, DNAT and SNAT rules are associated with the
       gateway  router, which provides a central location that can handle one-
       to-many SNAT (aka IP masquerading).
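
        A sketch of binding a gateway router to a chassis and giving it a
        one-to-many SNAT rule (the chassis name and addresses are
        hypothetical; the chassis name must match a Chassis record in the
        southbound database):

              ovn-nbctl lr-add lr-gw
              ovn-nbctl set Logical_Router lr-gw options:chassis=chassis-1
              ovn-nbctl lr-nat-add lr-gw snat 192.0.2.1 10.0.0.0/24

        (In ovn-nbctl versions without lr-nat-add, the northbound NAT table
        can be populated directly with the generic database commands.)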

     Distributed Gateway Ports

       Distributed gateway ports are logical router patch ports that  directly
       connect  distributed  logical routers to logical switches with localnet
       ports.

       The primary design goal of distributed gateway ports  is  to  allow  as
       much  traffic as possible to be handled locally on the hypervisor where
       a VM or container resides. Whenever possible, packets from  the  VM  or
       container  to  the outside world should be processed completely on that
       VM’s or container’s hypervisor, eventually traversing a  localnet  port
       instance on that hypervisor to the physical network. Whenever possible,
       packets from the outside world to a VM or container should be  directed
       through the physical network directly to the VM’s or container’s hyper‐
       visor, where the packet will enter the  integration  bridge  through  a
       localnet port.

       In  order  to allow for the distributed processing of packets described
       in the paragraph above, distributed gateway ports need  to  be  logical
       patch  ports  that  effectively reside on every hypervisor, rather than
       l3gateway ports that are bound to a particular  chassis.  However,  the
       flows  associated with distributed gateway ports often need to be asso‐
       ciated with physical locations, for the following reasons:

              ·      The physical network that the localnet port  is  attached
                     to  typically uses L2 learning. Any Ethernet address used
                     over the distributed gateway port must be restricted to a
                     single  physical location so that upstream L2 learning is
                     not confused. Traffic sent out  the  distributed  gateway
                     port  towards  the localnet port with a specific Ethernet
                     address must be sent out one  specific  instance  of  the
                     distributed gateway port on one specific chassis. Traffic
                     received from the localnet port (or from  a  VIF  on  the
                     same logical switch as the localnet port) with a specific
                     Ethernet address must be directed to the logical switch’s
                     patch port instance on that specific chassis.

                     Due  to  the  implications  of  L2 learning, the Ethernet
                     address and IP address of the  distributed  gateway  port
                     need  to be restricted to a single physical location. For
                     this reason, the user must specify one chassis associated
                     with  the  distributed  gateway  port.  Note that traffic
                     traversing the distributed gateway port using other  Eth‐
                     ernet addresses and IP addresses (e.g. one-to-one NAT) is
                     not restricted to this chassis.

                     Replies to ARP and ND requests must be  restricted  to  a
                     single  physical  location, where the Ethernet address in
                     the reply resides. This includes ARP and ND  replies  for
                     the IP address of the distributed gateway port, which are
                     restricted to the chassis that the user  associated  with
                     the distributed gateway port.

              ·      In  order  to support one-to-many SNAT (aka IP masquerad‐
                     ing), where multiple logical IP addresses  spread  across
                     multiple  chassis  are  mapped  to  a  single external IP
                     address, it will be necessary to handle some of the logi‐
                     cal router processing on a specific chassis in a central‐
                     ized manner. Since the SNAT external IP address is  typi‐
                     cally  the  distributed  gateway port IP address, and for
                     simplicity, the same chassis associated with the distrib‐
                     uted gateway port is used.

       The  details  of flow restrictions to specific chassis are described in
       the ovn-northd documentation.
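
       One way to express the chassis association mentioned above is
       through an option on the distributed gateway port’s
       Logical_Router_Port row; see ovn-nb(5) for the authoritative
       mechanism in a given release. A hypothetical sketch, reusing the
       lr0-public port above and assuming a chassis named hv1 and a
       redirect-chassis option:

              # Assumed example: pin the distributed gateway port
              # lr0-public to chassis hv1. Check ovn-nb(5) for the option
              # your release actually uses.
              ovn-nbctl set Logical_Router_Port lr0-public \
                  options:redirect-chassis=hv1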

       While most of the physical location dependent  aspects  of  distributed
       gateway  ports  can  be  handled  by restricting some flows to specific
       chassis, one additional mechanism is required. When a packet leaves the
       ingress pipeline and the logical egress port is the distributed gateway
       port, one of two different sets of actions is required at table 32:

              ·      If the packet can be  handled  locally  on  the  sender’s
                     hypervisor (e.g. one-to-one NAT traffic), then the packet
                     should just be resubmitted locally to table  33,  in  the
                     normal manner for distributed logical patch ports.

              ·      However, if the packet needs to be handled on the chassis
                     associated with the distributed gateway port  (e.g.  one-
                     to-many  SNAT  traffic or non-NAT traffic), then table 32
                     must send the packet on a tunnel port to that chassis.

       In order to trigger the second set of actions, the chassisredirect
       type of southbound Port_Binding has been added. Setting the logical
       egress port to a logical port of type chassisredirect is simply a
       way to indicate that although the packet is destined for the
       distributed gateway port, it needs to be redirected to a different
       chassis. At table 32,
       packets  with  this logical egress port are sent to a specific chassis,
       in the same way that table 32 directs packets whose logical egress port
       is a VIF or a type l3gateway port to different chassis. Once the packet
       arrives at that chassis, table 33 resets the logical egress port to the
       value  representing  the distributed gateway port. For each distributed
       gateway port, there is one type chassisredirect port,  in  addition  to
       the distributed logical patch port representing the distributed gateway
       port.
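
       The chassisredirect bindings that ovn-northd creates can be listed
       from the southbound database with ovn-sbctl’s generic find command,
       for example (output not shown):

              # List southbound port bindings of type chassisredirect.
              ovn-sbctl find Port_Binding type=chassisredirect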

   Life Cycle of a VTEP gateway
       A gateway is a chassis that forwards traffic  between  the  OVN-managed
       part of a logical network and a physical VLAN, extending a tunnel-based
       logical network into a physical network.

       The steps below refer often to details of the  OVN  and  VTEP  database
       schemas. Please see ovn-sb(5), ovn-nb(5) and vtep(5), respectively, for
       the full story on these databases.

               1.  A VTEP gateway’s life cycle begins with the
                   administrator registering the VTEP gateway as a
                   Physical_Switch table entry in the VTEP database. The
                   ovn-controller-vtep connected to this VTEP database
                   will recognize the new VTEP gateway and create a new
                   Chassis table entry for it in the OVN_Southbound
                   database.

               2.  The administrator can then create a new Logical_Switch
                   table entry in the VTEP database, and bind a particular
                   VLAN on a VTEP gateway’s port to any VTEP logical
                   switch. Once a VTEP logical switch is bound to a VTEP
                   gateway, the ovn-controller-vtep will detect it and add
                   its name to the vtep_logical_switches column of the
                   Chassis table in the OVN_Southbound database. Note that
                   the tunnel_key column of the VTEP logical switch is not
                   filled at creation. The ovn-controller-vtep will set
                   the column when the corresponding VTEP logical switch
                   is bound to an OVN logical network.

               3.  Now, the administrator can use the CMS to add a VTEP
                   logical switch to the OVN logical network. To do that,
                   the CMS must first create a new Logical_Switch_Port
                   table entry in the OVN_Northbound database. Then, the
                   type column of this entry must be set to "vtep". Next,
                   the vtep-logical-switch and vtep-physical-switch keys
                   in the options column must also be specified, since
                   multiple VTEP gateways can attach to the same VTEP
                   logical switch. (A command sketch covering steps 1
                   through 3 appears after this list.)

               4.  The newly created logical port in the OVN_Northbound
                   database and its configuration will be passed down to
                   the OVN_Southbound database as a new Port_Binding table
                   entry. The ovn-controller-vtep will recognize the
                   change and bind the logical port to the corresponding
                   VTEP gateway chassis. Binding the same VTEP logical
                   switch to different OVN logical networks is not
                   allowed; a warning will be generated in the log.

               5.  Besides binding to the VTEP gateway chassis, the
                   ovn-controller-vtep will update the tunnel_key column
                   of the VTEP logical switch to the corresponding
                   Datapath_Binding table entry’s tunnel_key for the bound
                   OVN logical network.

               6.  Next, the ovn-controller-vtep will keep reacting to
                   configuration changes to the Port_Binding table in the
                   OVN_Southbound database, and updating the
                   Ucast_Macs_Remote table in the VTEP database. This
                   allows the VTEP gateway to understand where to forward
                   the unicast traffic coming from the extended external
                   network.

               7.  Eventually, the VTEP gateway’s life cycle ends when the
                   administrator unregisters the VTEP gateway from the
                   VTEP database. The ovn-controller-vtep will recognize
                   the event and remove all related configurations
                   (Chassis table entry and port bindings) in the
                   OVN_Southbound database.

               8.  When the ovn-controller-vtep is terminated, all related
                   configurations in the OVN_Southbound database and the
                   VTEP database will be cleaned up, including Chassis
                   table entries for all registered VTEP gateways and
                   their port bindings, and all Ucast_Macs_Remote table
                   entries and the Logical_Switch tunnel keys.
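
       The following commands sketch steps 1 through 3 above. All of the
       names used (br-vtep, p0, VLAN 100, vtep_ls0, ls0, lsp-vtep) are
       invented for the example; see vtep-ctl(8) and ovn-nbctl(8) for the
       details of each command:

              # Step 1: register the VTEP gateway as a Physical_Switch in
              # the VTEP database.
              vtep-ctl add-ps br-vtep

              # Step 2: create a VTEP logical switch and bind VLAN 100 on
              # gateway port p0 to it.
              vtep-ctl add-port br-vtep p0
              vtep-ctl add-ls vtep_ls0
              vtep-ctl bind-ls br-vtep p0 100 vtep_ls0

              # Step 3: attach the VTEP logical switch to OVN logical
              # switch ls0 through a logical switch port of type "vtep".
              ovn-nbctl lsp-add ls0 lsp-vtep
              ovn-nbctl lsp-set-type lsp-vtep vtep
              ovn-nbctl lsp-set-options lsp-vtep \
                  vtep-physical-switch=br-vtep \
                  vtep-logical-switch=vtep_ls0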

DESIGN DECISIONS
   Tunnel Encapsulations
       OVN annotates logical network packets that it sends from one hypervisor
       to  another  with  the  following  three  pieces of metadata, which are
       encoded in an encapsulation-specific fashion:

              ·      24-bit logical datapath identifier, from  the  tunnel_key
                     column in the OVN Southbound Datapath_Binding table.

               ·      15-bit logical ingress port identifier. ID 0 is
                      reserved for internal use within OVN. IDs 1 through
                      32767, inclusive, may be assigned to logical ports
                      (see the tunnel_key column in the OVN Southbound
                      Port_Binding table).

              ·      16-bit logical egress  port  identifier.  IDs  0  through
                     32767 have the same meaning as for logical ingress ports.
                     IDs 32768 through 65535, inclusive, may  be  assigned  to
                     logical  multicast  groups  (see the tunnel_key column in
                     the OVN Southbound Multicast_Group table).
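
       These identifiers can be inspected directly in the southbound
       database, for example (illustrative queries, output not shown):

              # 24-bit logical datapath identifiers.
              ovn-sbctl --columns=tunnel_key,external_ids list \
                  Datapath_Binding

              # Logical port and multicast group identifiers.
              ovn-sbctl --columns=logical_port,tunnel_key list Port_Binding
              ovn-sbctl --columns=name,tunnel_key list Multicast_Group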

       For hypervisor-to-hypervisor traffic, OVN supports only Geneve and  STT
       encapsulations, for the following reasons:

              ·      Only STT and Geneve support the large amounts of metadata
                     (over 32 bits per packet) that  OVN  uses  (as  described
                     above).

               ·      STT and Geneve use randomized UDP or TCP source
                      ports, which allows efficient distribution among
                      multiple paths in environments that use ECMP in
                      their underlay.

              ·      NICs  are  available to offload STT and Geneve encapsula‐
                     tion and decapsulation.

       Due to its flexibility, the preferred encapsulation between hypervisors
       is Geneve. For Geneve encapsulation, OVN transmits the logical datapath
       identifier in the Geneve VNI. OVN transmits  the  logical  ingress  and
       logical  egress  ports  in  a  TLV  with class 0x0102, type 0x80, and a
       32-bit value encoded as follows, from MSB to LSB:

         1       15          16
       +---+------------+-----------+
       |rsv|ingress port|egress port|
       +---+------------+-----------+
         0
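
       For example, with this encoding a logical ingress port whose
       tunnel_key is 5 and a logical egress port whose tunnel_key is 9
       would be carried as the 32-bit value (5 << 16) | 9, that is,
       0x00050009.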


       Environments whose NICs lack Geneve offload may prefer  STT  encapsula‐
       tion  for  performance  reasons. For STT encapsulation, OVN encodes all
       three pieces of logical metadata in the STT 64-bit tunnel  ID  as  fol‐
       lows, from MSB to LSB:

           9          15          16         24
       +--------+------------+-----------+--------+
       |reserved|ingress port|egress port|datapath|
       +--------+------------+-----------+--------+
           0
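
       Which encapsulations a given chassis offers is configured on that
       hypervisor through keys that ovn-controller reads from the local
       Open_vSwitch database, as described in ovn-controller(8). For
       example (the tunnel endpoint address is invented):

              # Offer both Geneve and STT tunnels from this chassis, using
              # 10.0.0.11 as the local tunnel endpoint.
              ovs-vsctl set Open_vSwitch . \
                  external-ids:ovn-encap-type=geneve,stt \
                  external-ids:ovn-encap-ip=10.0.0.11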


       For connecting to gateways, in addition to Geneve and STT, OVN supports
       VXLAN, because only  VXLAN  support  is  common  on  top-of-rack  (ToR)
       switches. Currently, gateways have a feature set that matches the capa‐
       bilities as defined by the VTEP schema, so fewer bits of  metadata  are
       necessary.  In  the future, gateways that do not support encapsulations
       with large amounts of metadata may continue to have a  reduced  feature
       set.



Open vSwitch 2.7.90            OVN Architecture            ovn-architecture(7)