[ovs-dev] [PATCH] datapath: Increase maximum number of actions per flow.
jesse at nicira.com
Thu Sep 16 14:26:17 PDT 2010
On Thu, Sep 16, 2010 at 2:01 PM, Ben Pfaff <blp at nicira.com> wrote:
> On Thu, Sep 16, 2010 at 01:15:34PM -0700, Jesse Gross wrote:
>> On Thu, Sep 16, 2010 at 11:28 AM, Ben Pfaff <blp at nicira.com> wrote:
>> > On Thu, Sep 16, 2010 at 12:42:24AM +0000, Jesse Gross wrote:
>> >> On Thu, Sep 16, 2010 at 12:18 AM, Ben Pfaff <blp at nicira.com> wrote:
>> >> > Outputting to 1024 ports is going to be slow and expensive no matter
>> >> > what we do. Despite that, I have a few ideas to make a bad situation a
>> >> > little bit better:
>> >> >
>> >> > * Hash action sets and merge duplicates, to reduce duplication. (This
>> >> > is a good idea in userspace too.)
>> >> This seems like a cool idea but I don't know how often it will
>> >> actually help in practice.
>> > It'll help, I think, if there's a lot of multicast or broadcast traffic
>> > with lots of ports, or whenever the MAC learning table gets flushed.
>> Normally the set of actions for broadcasts won't be the same across
>> different ports because the actions won't include the input port. If
>> you just naively hash them then there won't be any overlap in the
>> normal case. You could do something more sophisticated but at that
>> point it seems to be nearly identical to port groups. Unless I'm
>> misunderstanding what you are suggesting here?
> Yes, it's clear that it won't make any difference if they have different
> source ports. But most of our broadcast traffic, I guess, comes in from
> a physical port and goes to all of the virtual ports, and it would help
> in that case and in any other case where there was more than one such
> flow originating from a single port.
Yes, I suppose that is true.
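
To make the two cases concrete, here is a toy model in Python rather than
datapath code. Everything in it is made up for illustration: the
ActionSetTable class, the flood_actions() helper, and the port layout
(port 0 physical, ports 1-7 virtual). It only shows when hashing action sets
finds duplicates and when it doesn't.

```python
# Toy model only: action sets are tuples of ("output", port) pairs.
# ActionSetTable and flood_actions are hypothetical names, not OVS code.

class ActionSetTable:
    """Interns action sets so that identical ones share one stored copy."""

    def __init__(self):
        self._by_content = {}  # action tuple -> the single shared copy

    def intern(self, actions):
        key = tuple(actions)
        # The dict hashes the tuple and compares contents on a hit, so
        # equal action sets always map to the same stored object.
        return self._by_content.setdefault(key, key)

    def __len__(self):
        return len(self._by_content)


def flood_actions(in_port, ports):
    """Output to every port except the one the packet arrived on."""
    return [("output", p) for p in ports if p != in_port]


PHYS = 0
ports = list(range(8))  # assumed layout: port 0 physical, 1-7 virtual

# Case 1: one broadcast flow per input port. Each set omits a different
# port, so no two sets are equal and nothing is shared.
per_port = ActionSetTable()
for in_port in ports:
    per_port.intern(flood_actions(in_port, ports))

# Case 2: several broadcast flows that all arrive on the physical port.
# They produce the same action set, so one copy is stored.
from_phys = ActionSetTable()
shared = [from_phys.intern(flood_actions(PHYS, ports)) for _ in range(5)]

print(len(per_port))   # 8 distinct action sets
print(len(from_phys))  # 1 shared action set
```
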
>> >> > * Allow multiple pages of actions to be chained together in a linked
>> >> > list or array, so that we don't require contiguous pages.
>> >> Why not just use vmalloc() for large allocations?
>> > vmalloc() is slow, hard on the TLB, and it is a limited resource. For
>> > example, my desktop box here has only 113 MB of virtual address space
>> > available for vmalloc(). Documentation/flexible-arrays.txt has some
>> > more details:
>> That is true but I don't think flexible arrays are much better.
> I'm not suggesting flexible arrays. That file just documents the
> problems; I'm not saying that it documents the solution.
> The solution that comes to mind would be to create a new action that
> chains to a new set of actions in a new page. Userspace wouldn't be able
> to use the action; it would be inserted by the kernel when it breaks
> apart sets of actions. This shouldn't have a penalty for flows whose
> actions are smaller than a page.
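
To check my understanding of the chaining idea, here is a rough model in
Python, not kernel code. The names (PAGE_SLOTS, CHAIN, split_into_pages,
execute) and the tiny page size are assumptions for the demo. A list that
fits in one page is stored untouched; a longer one is split, and the last
slot of each non-final page holds an internal chain action pointing at the
next page.

```python
# Rough model only. PAGE_SLOTS, CHAIN, split_into_pages and execute are
# hypothetical names; a real page would hold far more than 4 actions.
PAGE_SLOTS = 4
CHAIN = "chain"  # internal-only action, never accepted from userspace


def split_into_pages(actions, page_slots=PAGE_SLOTS):
    """Split an action list into pages linked by chain actions."""
    if page_slots < 2:
        raise ValueError("need room for one action plus a chain slot")
    if any(kind == CHAIN for kind, _ in actions):
        raise ValueError("chain actions may only be inserted internally")
    if len(actions) <= page_slots:
        return [list(actions)]  # common case: one page, no chain action

    pages = []
    rest = list(actions)
    while len(rest) > page_slots:
        # Reserve the last slot of this page for the chain action.
        pages.append(rest[:page_slots - 1])
        rest = rest[page_slots - 1:]
    pages.append(rest)

    # Link every page except the last to its successor.
    for i in range(len(pages) - 1):
        pages[i].append((CHAIN, i + 1))
    return pages


def execute(pages):
    """Walk the chained pages and return the output ports in order."""
    out = []
    page = 0
    while page is not None:
        next_page = None
        for kind, arg in pages[page]:
            if kind == CHAIN:
                next_page = arg
            else:
                out.append(arg)
        page = next_page
    return out


actions = [("output", p) for p in range(10)]
pages = split_into_pages(actions)
print(len(pages))      # 3 pages for 10 actions with 4 slots each
print(execute(pages))  # ports 0 through 9, in the original order
```

The single-page path never sees a chain action, which matches the claim
that short action lists pay nothing for this.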
>> Having multiple independent page allocations will be just as hard on
>> the TLB.
> The bulk of kernel memory can be, and as I read the code (seems hard to
> verify on a running system) usually is, mapped as 4-MB pages, but vmalloc
> is always mapped as 4-kB pages. According to info I found with Google a
> Core i5 TLB has 32 4-MB entries covering 128 MB, which is used almost
> exclusively by the kernel, and 64 4-kB entries covering just 256 kB.
> Since the latter is essentially shared between user and kernel we can
> expect that every access to a vmalloc page causes a TLB fault.
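
For reference, the reach arithmetic behind those numbers, taking the entry
counts quoted above at face value (I haven't verified them against the CPU
documentation). tlb_reach() is just a helper name for this note.

```python
# TLB reach = number of entries * page size covered by each entry.
# Entry counts are the ones quoted in this thread, not verified figures.
KB = 1024
MB = 1024 * KB


def tlb_reach(entries, page_size):
    """Bytes of address space the TLB can map without a miss."""
    return entries * page_size


large_reach = tlb_reach(32, 4 * MB)  # 4-MB entries
small_reach = tlb_reach(64, 4 * KB)  # 4-kB entries, used by vmalloc mappings

print(large_reach // MB, "MB")  # 128 MB
print(small_reach // KB, "kB")  # 256 kB
print(large_reach // small_reach)  # large-page reach is 512 times bigger
```
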
Yeah, I was thinking that each page of the array would be
independently mapped or would need to be physically contiguous to use
superpages. However, in reality since the chunks would be allocated
with kmalloc() the pages could be interleaved with busy pages and the
whole thing still mapped as a superpage.
> There's still the cost of IPIs setting up vmalloc entries, too.
The cost of allocating this seems fairly negligible compared to the
cost of using it repeatedly.
>> Fragmented arrays may be faster to allocate but are slower on access.
>> About the only benefit that I see is not requiring virtually
>> contiguous addresses but that doesn't worry me all that much.
> I don't think that, if properly implemented, this would make the normal
> case of actions that fit in a page slower to access. Actions that span
> pages would not be measurably slower; outputting to thousands of ports
> will always be slow.
That's probably true.