With HTTP the primary protocol for fast-growing IoT applications – everything from static web content delivery networks (CDNs) to web cache servers and HTTP security – developers need a cost-effective way to handle lots of sessions with low-bandwidth traffic.


From a total-cost-of-ownership point of view, it’s preferable to handle tens of millions of these low-bandwidth sessions on a single server rather than on a team of servers each running hundreds of thousands of low-bandwidth sessions. But how can you do that?


The answer comes in three parts. The first part is 6WINDGate* packet-processing software. This software from 6WIND, an Affiliate member of the Intel® Internet of Things Solutions Alliance, implements a “fast path” stack that delivers industry-leading TCP termination performance on standard servers.


The second part involves deploying 6WINDGate on a server using the latest Intel® Xeon® processors. With up to 24 cores in an embedded dual-socket configuration and up to 36 cores in an enterprise-class configuration, this new processor family provides up to a 28-percent improvement over the previous generation on the TPC-H* benchmark @ 1000 GB (Figure 1), the best two-socket result for decision support. In addition, it delivers industry-leading energy efficiency.



Figure 1. The TPC-H* @ 1000 GB benchmark measures the composite query-per-hour performance metric (QphH@Size), including query throughput when queries are submitted by concurrent users. The test machine using the Intel® Xeon® processor E5-2699 v3 was an HP ProLiant DL380 Gen9* platform.

The third part of the answer is using the Data Plane Development Kit (DPDK), a set of open source BSD-licensed software libraries that can improve packet processing performance on Intel® processors by 25 times over that of a conventional Linux stack.


Let’s take a deeper look at each of these.


Accelerating Network Performance with 6WINDGate

Imagine a server that can respond to 6.5 million HTTP requests per second and sustain 107 million open connections. That could support an immense amount of IoT traffic. This performance was achieved by 6WIND in a demo – before the release of the latest Intel Xeon processors – run on a four-socket HP DL580 server. The server used the Intel® Xeon® processor E7-4870 v2, running 60 cores at 2.3GHz with 256GB RAM and 7x40 Gbps Ethernet ports.


In the demo, 6WINDGate was able to create 5 million TCP sockets per second and deliver 242 Gbps of outbound HTTP application throughput. Performance like this improves capacity per dollar and per square foot in the data center by enabling a powerful scale-up strategy based on a single server.
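
To put the demo figures in perspective, here is a rough back-of-the-envelope sketch using only the numbers quoted above. It assumes the request rate and throughput were measured in the same run and that 256GB means 256 GiB, neither of which the demo description confirms, so treat the derived values as estimates rather than published results.

```python
# Figures quoted from the 6WIND demo description.
REQUESTS_PER_SEC = 6.5e6      # HTTP requests per second
OPEN_CONNECTIONS = 107e6      # sustained open connections
THROUGHPUT_GBPS = 242         # outbound HTTP application throughput
PORTS, PORT_SPEED_GBPS = 7, 40
RAM_BYTES = 256 * 2**30       # assumption: 256GB read as 256 GiB

# Aggregate Ethernet line rate and how much of it the demo filled.
line_rate_gbps = PORTS * PORT_SPEED_GBPS             # 280 Gbps
link_utilization = THROUGHPUT_GBPS / line_rate_gbps  # about 0.86

# Average outbound bytes per request, if both figures come from one run.
bytes_per_request = THROUGHPUT_GBPS * 1e9 / 8 / REQUESTS_PER_SEC

# Upper bound on memory per open connection: all RAM divided by connections.
# The real per-connection footprint must be smaller, since RAM is shared.
ram_bytes_per_connection = RAM_BYTES / OPEN_CONNECTIONS

if __name__ == "__main__":
    print(f"line rate: {line_rate_gbps} Gbps, utilization: {link_utilization:.1%}")
    print(f"bytes per request: {bytes_per_request:.0f}")
    print(f"RAM per connection (upper bound): {ram_bytes_per_connection:.0f} bytes")
```

The takeaway is that the responses are small (a few kilobytes each) and the per-connection memory budget is tight, which is exactly the many-sessions, low-bandwidth profile described at the start of this article.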


The 6WINDGate fast path TCP stack enables packets to bypass the Linux networking stack for low-level packet processing directly on dedicated cores (Figure 2). In the example above, 28 cores were configured to run in the fast path.
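
The idea of dedicating cores to the fast path can be illustrated with the standard Linux CPU-affinity interface. The sketch below is not how 6WINDGate itself is configured (the article does not show that); the helper names `split_cores` and `pin_current_process` are hypothetical, and the choice to give the highest-numbered cores to the fast path is an arbitrary assumption.

```python
import os

def split_cores(available, fast_path_count):
    """Partition CPU ids into (fast_path, slow_path) sets."""
    ordered = sorted(available)
    if not 0 <= fast_path_count <= len(ordered):
        raise ValueError("fast_path_count exceeds the number of available cores")
    # Assumption: leave the low-numbered cores to the OS and Linux stack.
    split = len(ordered) - fast_path_count
    return set(ordered[split:]), set(ordered[:split])

def pin_current_process(cores):
    """Restrict the calling process to the given cores (Linux only)."""
    os.sched_setaffinity(0, cores)
    return os.sched_getaffinity(0)

# The demo configuration: 28 of 60 cores run the fast path.
fast, slow = split_cores(range(60), 28)

if __name__ == "__main__":
    print(f"{len(fast)} fast path cores, {len(slow)} cores left for Linux")
```

A packet-processing worker would call `pin_current_process` with one core from the fast path set, so that the scheduler never migrates it and its cache contents stay warm.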



Figure 2. A look at the inner workings of 6WINDGate packet-processing software.


By implementing such a fast path TCP stack, 6WINDGate allows developers to create dynamic applications that the company claims are 10x to 100x more powerful than the highest-performing applications available today. Much of the credit goes to the 6WINDGate TCP Termination module, which helps accelerate networking applications by terminating TCP connections in the fast path instead of in the Linux kernel.


6WINDGate delivers a socket programming model that takes advantage of multi-core environments, enabling fully parallel operations for both session creation and data path. As a result, linear performance scalability can be achieved as more processor cores are allocated to the application. TCP applications create a socket in the 6WINDGate fast path to receive TCP packets for a specific IP address and port. The number of concurrent active TCP sockets is limited only by the memory available on the system.
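
The 6WINDGate socket API itself is not shown here, so the following is only an analogy built on the standard Linux socket interface: each worker opens its own listening socket on the same IP address and port using `SO_REUSEPORT`, which lets session creation proceed in parallel instead of funneling through one shared socket. The function name `open_listeners` is hypothetical.

```python
import socket

def open_listeners(host, port, workers):
    """Open one listening TCP socket per worker, all on the same address."""
    listeners = []
    for _ in range(workers):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # SO_REUSEPORT (Linux 3.9+) allows several sockets to bind the same
        # address; the kernel then spreads new connections across them.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind((host, port))
        s.listen(128)
        if port == 0:
            # The first bind picked a free port; reuse it for the rest.
            port = s.getsockname()[1]
        listeners.append(s)
    return listeners

if __name__ == "__main__":
    socks = open_listeners("127.0.0.1", 0, 4)
    print("workers share port", socks[0].getsockname()[1])
    for s in socks:
        s.close()
```

In a real deployment each listener would be handed to a worker pinned to its own core, which then accepts and serves connections without sharing socket state with the others.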


A New Platform for High Performance HTTP Networking  

Not everyone needs the record-setting HTTP networking prowess of 6WINDGate running on a high-performance four-socket server. For most applications, a two-socket server is preferable and more cost effective. That’s where the new Intel® Xeon® processor E5-2600 v3 product family comes in. With up to 48 threads, a machine like the HP ProLiant DL380 equipped with two of these processors can spare quite a few cores for fast path operations and deliver exceptional performance.


Arguments for a server refresh with these new processors are strong. Based on Intel’s Haswell microarchitecture, these processors meet modern and future needs of compute, storage, and networking across a broad set of data center workloads. Optimizations include:

  • New Intel® Advanced Vector Extensions 2 (Intel® AVX2) instructions that deliver up to a 6x boost in performance over four-year old Intel Xeon processors.
  • Support for DDR4 memory technology to provide up to 3x memory bandwidth while consuming as little as half the power.
  • Ability to support up to three times more VMs than servers four years ago, enabling greater consolidation on fewer servers using less power.
  • Built-in intelligent power management capabilities that improve energy efficiency and frequency optimization.
  • Advanced measurement and telemetry features that help maximize operational efficiency through virtualization and data center orchestration.


Particularly interesting from a packet-processing point of view is all Intel has done to keep the cores fed with data. The Intel Xeon processor E5-2600 v3 product family includes a massive L3 cache (up to 45MB) as well as branch prediction improvements and enlarged translation lookaside buffers (TLBs). The on-die bus has been updated to include two fully buffered rings, a necessity to support the larger core counts (Figure 3). It's somewhat analogous to how Ethernet switches divide a network into segments. Each ring can act independently, and as a result the effective bandwidth increases. A corresponding QPI interface frequency increase improves multi-socket coherence performance, and Last Level Cache (LLC) changes reduce latency and increase bandwidth.



Figure 3. On-die interconnect enhancements over the previous generation include two fully buffered independent rings designed to help boost bandwidth and reduce latency.

A number of cache enhancements further improve performance. Intel® Data Direct I/O enables PCI Express* devices to target a processor’s last level cache as a primary destination to increase throughput and reduce latency. Now, to meet the demands of high-performance packet-processing applications, the Intel Xeon processor E5-2600 v3 product family supports placement of data originating from I/O devices in up to 12 ways (depending on SKU). In addition, a new cache monitoring technology allows an operating system or VMM to determine last level cache usage on a per-application or per-thread basis, so it can make more accurate scheduling decisions. What’s more, a cache allocation technology enables partitioning the last level cache to protect key applications or virtual machines from noisy neighbors. This feature is available on five SKUs on the Communications Infrastructure Division roadmap.


For even greater performance, pairing the Intel Xeon processor E5-2600 v3 product family with the Intel® Communications Chipset 89xx Series featuring Intel® QuickAssist Technology offers hardware-assisted acceleration for workload optimization. Applications that use Intel QuickAssist Technology increase workload efficiency by relieving the processor cores of compute-intensive security, compression, and packet operations.


For rapidly moving the packets that need processing, Intel Integrated I/O provides up to 80 PCIe* lanes per two-socket server, and supports the PCIe 3.0 specification with atomic operations support for improved peer-to-peer (P2P) bandwidth.
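
As a rough sense of scale, the raw bandwidth of those 80 lanes can be estimated from the PCIe 3.0 signaling rate of 8 GT/s per lane and its 128b/130b line encoding. This sketch ignores packet-header and flow-control overhead, so usable throughput will be lower than the figures it prints.

```python
PCIE3_GT_PER_SEC = 8.0        # PCIe 3.0 raw transfer rate per lane (GT/s)
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding
LANES = 80                    # lanes per two-socket server, as quoted above

# Per-lane data rate in each direction, after line encoding.
per_lane_gbps = PCIE3_GT_PER_SEC * ENCODING_EFFICIENCY

# Aggregate across all lanes, in Gbps and GB/s, per direction.
total_gbps = per_lane_gbps * LANES
total_gbytes_per_sec = total_gbps / 8

if __name__ == "__main__":
    print(f"per lane: {per_lane_gbps:.2f} Gbps")
    print(f"{LANES} lanes: {total_gbps:.1f} Gbps ({total_gbytes_per_sec:.1f} GB/s)")
```

Even before overhead, that is well above the combined line rate of several 40 Gbps Ethernet ports, which is why the I/O subsystem is not the first bottleneck a fast path design has to worry about.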

Enhanced Version of Data Plane Development Kit

The third part of this solution is DPDK. This development kit significantly reduces the overhead of a standard Linux OS through performance-increasing techniques such as:

  • Core affinity
  • Disabling interrupts generated by packet I/O
  • Lockless implementation
  • Cache alignment
  • Using huge pages to reduce translation lookaside buffer (TLB) misses
  • Prefetching
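
Two of these ideas, polling instead of interrupts and lockless data structures, can be sketched with a single-producer/single-consumer ring. The code below is a conceptual illustration only and not DPDK code: the names `SpscRing` and `poll_once` are hypothetical, and a production ring would be written in C with explicit memory ordering and cache-line-aligned indices. What it does show is the index arithmetic that lets one producer and one consumer share a queue without a lock, and a poll loop that drains packets in bursts.

```python
class SpscRing:
    """Fixed-size single-producer/single-consumer ring (illustrative only)."""

    def __init__(self, size):
        if size < 2 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self._slots = [None] * size
        self._mask = size - 1   # power-of-two size makes wrap-around a cheap AND
        self._head = 0          # advanced only by the producer
        self._tail = 0          # advanced only by the consumer

    def enqueue(self, item):
        """Producer side: returns False instead of blocking when full."""
        if self._head - self._tail == len(self._slots):
            return False
        self._slots[self._head & self._mask] = item
        self._head += 1
        return True

    def dequeue_burst(self, max_items):
        """Consumer side: take up to max_items in one pass."""
        count = min(max_items, self._head - self._tail)
        burst = [self._slots[(self._tail + i) & self._mask] for i in range(count)]
        self._tail += count
        return burst

def poll_once(ring, handler, burst_size=32):
    """One iteration of a poll loop: no interrupts, no sleeping."""
    burst = ring.dequeue_burst(burst_size)
    for packet in burst:
        handler(packet)
    return len(burst)

if __name__ == "__main__":
    ring = SpscRing(8)
    for n in range(5):
        ring.enqueue(f"pkt-{n}")
    handled = poll_once(ring, print)
    print("handled", handled, "packets in one burst")
```

Because each index has exactly one writer, neither side ever needs to take a lock, and processing in bursts amortizes per-packet overhead across many packets.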


DPDK facilitates network application programming in user space, which reduces the cost of software development and maintenance, as well as the potential for single points of failure in the system. All the key benefits of DPDK are available for purpose-built as well as virtualized implementations.


6WIND provides an enhanced version of DPDK integrated within 6WINDGate. 6WIND’s module provides a set of advanced data plane libraries, optimized multi-vendor NIC drivers, accelerated crypto support, and commercial support.


Take HTTP Networking to the Next Level

Developers interested in advancing their HTTP networking solutions will find 6WINDGate easy to add to their solution stack. 6WINDGate is compatible with commercial and open source Linux distributions and does not require any modification of the Linux kernel. In addition to industry-standard servers offered by companies like HP, many Alliance members offer motherboards featuring embedded versions of the Intel Xeon processor E5-2600 v3 product family.



6WIND is an Affiliate member of the Intel® Internet of Things Solutions Alliance.


Mark Scantlebury

Roving Reporter (Intel Contractor), Intel® Internet of Things Solutions Alliance

Associate Editor, Embedded Innovator magazine