Performance and Footprint Tuning

Performance

Many configuration changes can affect performance: for example, the number and size of buffers, or how checksum calculations are implemented.

The CYGDBG_LWIP_STATS option can be enabled to gather a variety of statistics counts during execution. The individual options all share the CYGDBG_LWIP_STATS_ prefix, followed by a sub-system specific suffix.

These statistics can help with tuning the lwIP configuration during development, since monitoring the minimum and maximum usage counts of resources, along with the error counts, can indicate resource starvation issues. Note: Some error counts indicate a temporary inability to claim a resource, and are not necessarily a fatal error for the stack, just a potential slowdown.

To determine how many resources are used in practice, it is recommended that testing during development is performed under the maximum load the application is expected to handle. To get useful information, temporarily configure lwIP with more resources than are expected to be needed, memory permitting. Then run the application under the expected network load and inspect the statistics afterwards, paying attention to the "max" fields, which show the maximum number of each resource used in that scenario. These figures can then inform how far the resource allocations can be reduced in the final product configuration without unduly compromising performance.

If CYGDBG_LWIP_STATS is enabled then the function:

#include <lwip/stats.h>

void stats_display(void);

can be used to dump all of the statistics gathered via the output routine defined by the LWIP_PLATFORM_DIAG function wrapper (currently defined to use diag_printf() in the eCos specific arch/cc.h header file).

See the Memory Footprint section for more information about tuning the lwIP memory footprint.

TCP

If the CYGPKG_LWIP_TCP option is configured then various TCP specific options are available for tuning the performance. The main options are covered in the subsections below.

Receive Window

The CYGNUM_LWIP_TCP_WND option defines the maximum TCP receive window size. This size is advertised to remote peers to indicate how much data they can send. While larger values are faster, you should not advertise more than you can receive, which means you must have sufficient capacity in the pbuf pool used for received data for all your connections.
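As a rough sanity check of the rule above, the worst-case receive window across all connections can be compared with the capacity of the pbuf pool. The following is only a sketch: the helper name and its parameters are hypothetical (they are not part of lwIP or eCos), and it ignores other users of the pool such as UDP traffic and IP reassembly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of lwIP.
 *
 * tcp_wnd          - advertised receive window in bytes
 * num_conns        - number of TCP connections expected to be active
 * pool_size        - number of pbufs in the pool
 * payload_per_pbuf - TCP data bytes carried by each pool pbuf (for
 *                    example the MSS, if one segment fits one pbuf)
 *
 * Returns true if the pool could hold a full window for every
 * connection at the same time. */
static bool rx_window_fits_pool(size_t tcp_wnd, size_t num_conns,
                                size_t pool_size, size_t payload_per_pbuf)
{
    assert(payload_per_pbuf > 0);

    /* Round up: a partly filled pbuf is still a whole pbuf. */
    size_t pbufs_per_conn =
        (tcp_wnd + payload_per_pbuf - 1) / payload_per_pbuf;

    return pbufs_per_conn * num_conns <= pool_size;
}
```

For instance, with an illustrative window of 2920 bytes and 1460 bytes of data per pbuf, four connections need eight pbufs, so a pool of eight passes the check while a fifth connection would not.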

Maximum Segment Size

The CYGNUM_LWIP_TCP_MSS option defines the Maximum Segment Size (MSS) advertised to peers to constrain the amount of TCP data they send in each packet. This is recommended not to be more than the interface MTU less 40 bytes. The 40 bytes are the sum of a TCP header and IP header, neither with any options. If any options are used regularly, this value should be reduced further.

If the MSS is set too large, IP fragmentation results, with consequent inefficient network operation. If the MSS is too large and IP fragmentation has been disabled (CYGFUN_LWIP_IP_FRAG), incorrect stack operation is likely, including oversize packets never being sent, or even a failure in the ethernet driver.

The most common MTU is 1500 bytes (giving a recommended MSS of up to 1460 bytes), but this is certainly not universal: some routers, and especially VPNs, have lower MTUs and will in turn fragment packets, lowering efficiency.

For best resource utilisation by lwIP, the MSS should be set so that incoming packets fit into a whole number of pbufs from the packet buffer pool. For this reason the default MSS is the pbuf pool packet buffer size (CYGNUM_LWIP_PBUF_POOL_BUFSIZE) less 40 bytes, which leaves room for TCP and IP headers without options.
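The arithmetic described above can be sketched as follows. The helper names are hypothetical and the 40-byte figure assumes IPv4 and TCP headers without options, as stated in the text; treat the result as a starting point rather than a definitive setting.

```c
#include <assert.h>

/* 20-byte IPv4 header plus 20-byte TCP header, neither with options. */
#define TCPIP_HDR_LEN 40u

/* Hypothetical helper: the largest MSS that avoids IP fragmentation
 * on a link with the given MTU. */
static unsigned mss_limit_for_mtu(unsigned mtu)
{
    assert(mtu > TCPIP_HDR_LEN);
    return mtu - TCPIP_HDR_LEN;
}

/* Hypothetical helper: pick the smaller of the MTU-derived limit and
 * the value derived from the pool packet buffer size, so that a
 * segment neither fragments nor spills out of a single pbuf. */
static unsigned suggested_mss(unsigned mtu, unsigned pbuf_bufsize)
{
    assert(pbuf_bufsize > TCPIP_HDR_LEN);

    unsigned by_mtu  = mss_limit_for_mtu(mtu);
    unsigned by_pbuf = pbuf_bufsize - TCPIP_HDR_LEN;

    return by_mtu < by_pbuf ? by_mtu : by_pbuf;
}
```

With a 1500-byte MTU the first helper yields 1460. If the path MTU is known to be lower (for example through a VPN), passing that lower figure reduces the suggestion accordingly.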

Sending Data

The CYGNUM_LWIP_TCP_SND_BUF option defines the amount of buffer space in bytes allowed for outstanding (unacked) sent data for each TCP connection. This option is complementary to CYGNUM_LWIP_TCP_SND_QUEUELEN which defines the number of packet buffers allowed for outstanding (unacked) sent data for each TCP connection. The TCP layer will refuse to queue a buffer to be sent if either the total quantity of data in bytes waiting to be sent would then exceed CYGNUM_LWIP_TCP_SND_BUF, or there are already at least CYGNUM_LWIP_TCP_SND_QUEUELEN buffers in the queue waiting to be sent.
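The two limits act together, which the following simplified model tries to illustrate. The structure and function are hypothetical and do not mirror lwIP's internal data structures; they only restate the rule given above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical model of one connection's send state. */
struct snd_state {
    size_t queued_bytes;  /* bytes currently waiting to be sent/acked   */
    size_t queued_bufs;   /* packet buffers currently in the queue      */
    size_t snd_buf;       /* stands in for CYGNUM_LWIP_TCP_SND_BUF      */
    size_t snd_queuelen;  /* stands in for CYGNUM_LWIP_TCP_SND_QUEUELEN */
};

/* Returns true if a further buffer of len bytes would be accepted:
 * it is refused when the byte total would exceed snd_buf, or when the
 * queue already holds at least snd_queuelen buffers. */
static bool can_queue(const struct snd_state *s, size_t len)
{
    assert(s != NULL);

    if (s->queued_bytes + len > s->snd_buf)
        return false;
    if (s->queued_bufs >= s->snd_queuelen)
        return false;
    return true;
}
```

One consequence is that an application making many small writes can hit the buffer-count limit long before the byte limit, so the two options are usually best tuned together.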

Optimizations

The following sections detail some optimization hints that could be useful on certain target platforms to maximise lwIP data throughput.

Checksums

A major performance bottleneck for lwIP is the software checksum code, since it is executed frequently. If the underlying ethernet device driver provides hardware checksum support then the appropriate CHECKSUM_GEN_* and CHECKSUM_CHECK_* options can be disabled. However, if software checksums are needed then you may want to override the standard checksum implementation. This can be achieved by adding a LWIP_CHKSUM definition to a header file included by lwIP, e.g. adding the following to lwipopts.h:

#define LWIP_CHKSUM your_checksum_routine
       
The standard lwip_standard_chksum() implementations in src/core/inet_chksum.c provide some C examples, though you might want to craft an assembly function for your specific target. RFC 1071 is a good introduction to this subject. A highly optimized assembler routine will provide the greatest improvement in overall lwIP performance for systems relying on software checksums.
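For orientation, a plain C version of the ones'-complement sum described in RFC 1071 is sketched below. The function names are hypothetical. Whether such a routine can be used directly as LWIP_CHKSUM depends on the argument types, byte order and (un-complemented) return convention expected by the lwIP version in use, so compare against lwip_standard_chksum() before wiring it in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones'-complement sum of a buffer, folded to 16 bits (RFC 1071).
 * Bytes are combined as big-endian 16-bit words, so the result does
 * not depend on host endianness. Returns the sum, NOT its complement.
 * The 32-bit accumulator is adequate for packet-sized buffers (it
 * would only overflow beyond roughly 128 KiB of input). */
static uint16_t rfc1071_sum(const void *data, size_t len)
{
    assert(data != NULL || len == 0);

    const uint8_t *p = data;
    uint32_t acc = 0;

    while (len > 1) {
        acc += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len == 1)                     /* odd trailing byte, zero padded */
        acc += (uint32_t)p[0] << 8;

    while (acc >> 16)                 /* fold carries back into 16 bits */
        acc = (acc & 0xffffu) + (acc >> 16);

    return (uint16_t)acc;
}

/* The checksum placed in a header is the complement of the sum. */
static uint16_t rfc1071_checksum(const void *data, size_t len)
{
    return (uint16_t)~rfc1071_sum(data, len);
}
```

An optimized replacement would typically process 32 bits (or more) per iteration and defer the fold to the end, which is where hand-written assembly tends to gain the most.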

If the CYGIMP_LWIP_CHECKSUM_ON_COPY functionality is enabled then checksums are calculated when data is copied into the stack (from application buffers into packet buffers). This can result in fewer checksum calculations if a packet buffer is going to be used multiple times, or if pre-calculated checksums are available for pre-built packets.

The memcpy()-alike function:

u16_t lwip_chksum_copy(void *dest, const void *src, u16_t len);

can be used to copy data, and return the checksum of the data copied. The extra TCP TF_SEG_DATA_CHECKSUMMED flag is used internally by the lwIP TCP support to track whether a checksum has been set on the payload data.

Network-vs-Host

Since network byte order is big-endian, other significant improvements can be made by supplying assembly or inline replacements for htons() and htonl() if you're using a little-endian architecture.

#define LWIP_PLATFORM_BYTESWAP 1
#define LWIP_PLATFORM_HTONS(x) your_htons(x)
#define LWIP_PLATFORM_HTONL(x) your_htonl(x)
       

If the lwIP CYGIMP_LWIP_HAL_BYTESWAP configuration option is enabled then lwIP will use the HAL supplied support. The CYGIMP_LWIP_HAL_BYTESWAP option is enabled by default if the architecture indicates that optimised byte-swap implementations are available, otherwise the option is disabled by default and for little-endian architectures lwIP will provide byte-swap functions.
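Where no HAL support is available, portable inline replacements along the following lines are one possibility. The function names are hypothetical. They swap unconditionally, so they are only suitable as htons()/htonl() replacements on a little-endian target.

```c
#include <stdint.h>

/* Hypothetical inline byte-swap helpers for a little-endian target.
 * They could be hooked in with, for example:
 *   #define LWIP_PLATFORM_HTONS(x) swap16(x)
 *   #define LWIP_PLATFORM_HTONL(x) swap32(x)
 */
static inline uint16_t swap16(uint16_t x)
{
    return (uint16_t)((x << 8) | (x >> 8));
}

static inline uint32_t swap32(uint32_t x)
{
    return ((x & 0x000000ffu) << 24) |
           ((x & 0x0000ff00u) << 8)  |
           ((x & 0x00ff0000u) >> 8)  |
           ((x & 0xff000000u) >> 24);
}
```

On architectures with a dedicated byte-reverse instruction, an assembly or compiler-intrinsic version may be faster still. It is worth inspecting the generated code before deciding whether that extra step is needed.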

Device Driver

The ethernet MAC device driver should ideally use interrupts and DMA to avoid busy loops wherever possible. Hardware support for scatter-gather DMA should be used if available, since multiple packet buffers can then be used to hold the different sections of a frame, allowing for zero-copy of payload data.

Release Builds

For a production release it is highly recommended to disable CYGDBG_LWIP_STATS.

Memory Footprint

The setting of the CYGNUM_LWIP_THREAD_STACK_SIZE configuration option and the memory configuration options described in the Performance section will all affect the overall RAM footprint required by lwIP.

However, as long as the option to use the standard run-time allocator (CYGFUN_LWIP_MEM_LIB_MALLOC) is NOT enabled, the memory footprint of lwIP is deterministic and fixed by the selected configuration.

The major memory configuration options are listed below. Setting these configuration values is usually a compromise between the amount of physical RAM available on the target platform, and the lwIP throughput (performance) requirements.

Heap size (CYGNUM_LWIP_MEM_SIZE)

This option defines the size of the heap that lwIP maintains separately from the system heap, so that the resource requirements of one do not affect the other. It is primarily (although not exclusively) used as the memory pool from which packet buffers for transmission are allocated when the data to be sent needs to be copied (type PBUF_RAM). It is also used to allocate space for dynamically created message boxes and semaphores. This option can be increased to improve performance when sending large amounts of data.

Packet buffer size (CYGNUM_LWIP_PBUF_POOL_BUFSIZE)

This option specifies the maximum size of data which a single packet buffer (pbuf) allocated from the packet buffer pool for incoming packets can contain. The overall memory footprint of each packet buffer is slightly larger to account for metadata. Incoming packets larger than this size are chained together, using additional packet buffers. If only short packets are usually received, memory efficiency may be improved by reducing the packet buffer size, even if this is accompanied by an increase in the number of packets in the pool using the CYGNUM_LWIP_PBUF_POOL_SIZE option. If larger packets tend to be received, the converse is true.

Note: Some network drivers set constraints on the value of this option, in order to better integrate with hardware properties.
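The trade-off between buffer size and buffer count can be explored with simple arithmetic, as sketched below. The helper names and the sizes used in the discussion are hypothetical, and the per-pbuf metadata overhead is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of pool pbufs chained together to hold
 * an incoming frame of frame_len bytes. */
static size_t pbufs_for_frame(size_t frame_len, size_t bufsize)
{
    assert(bufsize > 0);
    return (frame_len + bufsize - 1) / bufsize;   /* round up */
}

/* Hypothetical helper: payload bytes left unused in the chain. */
static size_t unused_bytes(size_t frame_len, size_t bufsize)
{
    return pbufs_for_frame(frame_len, bufsize) * bufsize - frame_len;
}
```

For example, a 64-byte frame leaves 1472 bytes unused in a 1536-byte buffer but only 192 bytes unused in a 256-byte buffer, whereas a 1514-byte frame fits one 1536-byte buffer but needs a chain of six 256-byte buffers.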

Incoming packet messages (CYGNUM_LWIP_MEMP_NUM_TCPIP_MSG), API messages (CYGNUM_LWIP_MEMP_NUM_API_MSG)

When using the sequential API, these options define how many packet input messages and API messages, respectively, may be in use simultaneously. These messages are used for communicating between external threads and the core lwIP network stack.

Netbufs (CYGNUM_LWIP_MEMP_NUM_NETBUF)

This option defines the maximum number of netbuf structures which may be in use simultaneously with the sequential API (which in turn are used by the BSD sockets API). Each netbuf structure corresponds to a chain of packet buffers to be used for sending or receiving data. This option may be set to 0 if the application will only be using the raw API.

Netconns (CYGNUM_LWIP_MEMP_NUM_NETCONNS)

This option defines the maximum number of netconn structures which may be in use simultaneously with the sequential API. Each netconn structure corresponds to a connection, whether active or inactive. This option may be set to 0 if the application will only be using the raw API.

Packet buffer pool size (CYGNUM_LWIP_PBUF_POOL_SIZE)

This option specifies the number of packet buffers (pbufs) present in the packet buffer pool. This pool is used to provide space for incoming data packets, and so this option limits the number of incoming data packets being processed, or pending (including those not yet read out from the stack by the application). It is also used to hold packet fragments if the option CYGFUN_LWIP_IP_REASS is enabled, and so must be large enough to cover the CYGNUM_LWIP_IP_REASS_MAX_PBUFS requirement. Note that additional buffers are used in a chain when incoming packets are received which exceed the maximum size of each packet buffer. This option may be adjusted depending on the anticipated peak network traffic. Incoming packets are dropped when the pool is depleted.

Number of memp packet buffers (CYGNUM_LWIP_MEMP_NUM_PBUF)

The lwIP API allows packets to be transmitted which only contain a reference to the data being sent, instead of copying the data into a separate buffer. This can be useful when sending a lot of data out of ROM (or other static memory). This option specifies the number of such packets that can be used simultaneously. You may wish to increase the value of this option if the application sends a lot of such data, or reduce it if the application sends none. These buffers are also used when IP fragmentation support is enabled but a static buffer is not used (CYGIMP_LWIP_IP_FRAG_USES_STATIC_BUF disabled), so the value may also need increasing if fragmentation is common.

RAW protocol control blocks (CYGNUM_LWIP_MEMP_NUM_RAW_PCB)

This option defines the number of RAW protocol control blocks that may be used simultaneously. One is required for each active RAW “connection”.

UDP control blocks (CYGNUM_LWIP_MEMP_NUM_UDP_PCB)

This option defines the number of UDP protocol control blocks that may be used simultaneously. One is required for each active UDP “connection”.

TCP control blocks (CYGNUM_LWIP_MEMP_NUM_TCP_PCB)

This option defines the number of TCP protocol control blocks that may be used simultaneously. One is required for each TCP connection. Hence this option defines the maximum number of TCP connections that may be open simultaneously. Increase the value of this option if more simultaneous TCP connections are required.

Listening TCP control blocks (CYGNUM_LWIP_MEMP_NUM_TCP_PCB_LISTEN)

This option defines the number of protocol control blocks dedicated to listening for incoming TCP connection requests. This corresponds to the maximum number of TCP ports which may be simultaneously listened on.

Queued TCP segments (CYGNUM_LWIP_MEMP_NUM_TCP_SEG)

This option defines the maximum number of TCP segments which may be simultaneously queued. This option may need to be adjusted if the stack reports memory failure errors when attempting to send large quantities of data through TCP connections simultaneously, or when individual TCP writes are so large that the number of MSS-sized segments exceeds the value of this option. If the option to allow out-of-order incoming packets (CYGIMP_LWIP_TCP_QUEUE_OOSEQ) is enabled, then such segments may also be dropped if the maximum number of TCP segments specified in this option has been reached.

Queued packets for ARP resolve (CYGNUM_LWIP_MEMP_NUM_ARP_QUEUE)

This option defines the number of outgoing packet buffers that may be queued simultaneously while waiting for an ARP request to resolve their destination address.

Queued IP reassembly packets (CYGNUM_LWIP_MEMP_NUM_REASSDATA), Simultaneous IP fragments (CYGNUM_LWIP_MEMP_NUM_FRAG_PBUF)

These options provide respectively the number of packets that can simultaneously be queued for reassembly, and the number of fragments (not packets) that can be simultaneously queued for sending.

System timeouts (CYGNUM_LWIP_MEMP_NUM_INTERNAL_TIMEOUTS), User timeouts (CYGNUM_LWIP_MEMP_NUM_USER_TIMEOUTS)

The INTERNAL value is the number of timeout objects required to support the configured lwIP features. The USER value defines the maximum number of user timeouts that may be pending simultaneously. The value of this option may need to be increased if there are more threads using the raw API, or if there are more threads calling the select() BSD compatibility function.

Multicast group members (CYGNUM_LWIP_MEMP_NUM_IGMP_GROUP)

This option defines the number of multicast groups of which network interfaces can be members at the same time. This value must be at least twice the number of network interfaces active in the configuration.

Leaf nodes (CYGNUM_LWIP_MEMP_NUM_SNMP_NODE), Root Node branches (CYGNUM_LWIP_MEMP_NUM_SNMP_ROOTNODE), Variable bindings (CYGNUM_LWIP_MEMP_NUM_SNMP_VARBIND), OIDs (CYGNUM_LWIP_MEMP_NUM_SNMP_VALUE)

These options control the size and number of the SNMP agent related memory allocations.

Active lwip_addrinfo() calls (CYGNUM_LWIP_MEMP_NUM_NETDB), Local host list entries (CYGNUM_LWIP_MEMP_NUM_LOCALHOSTLIST)

If DNS support is enabled then these options respectively control the number of concurrent lwip_addrinfo() calls supported, and the number of host entries in the dynamic local host list.

Simultaneous PPP connections (CYGNUM_LWIP_MEMP_NUM_PPP_PCB), Concurrent PPPoE interfaces (CYGNUM_LWIP_MEMP_NUM_PPPOE_INTERFACES)

These options respectively control the number of simultaneously active PPP connections, and the number of concurrently active PPPoE connections.

lwIP Footprint

The following size information was gathered from a CortexM3 targeted configuration using the eCosCentric GNU tools (version 4.4.5c) with gcc -O2 optimization selected. The byte sizes are provided to give an example overview of the lwIP footprint that can be expected, and are purely for informational purposes.

In the following builds “Basic” refers to a sequential API configuration with UDP and TCP support, but with most options disabled (no fragmentation or reassembly support, static address, no SNMP agent, no IGMP, etc.). The builds marked “Reassembly” add fragmented packet reassembly code to the “Basic” builds. The “Full” entry is a configuration with all the lwIP ethernet features enabled (excluding SNMP, SLIP and PPP) to give an idea of the upper footprint for a fully-featured ethernet build.

The values given are for the complete lwIP “library” package, so specific application linkage (due to the eCos use of -ffunction-sections) means that not all of the code and data measured in the sizes given below may actually be included in the final executable. The footprint can be made even smaller by explicit use of the raw API.

Note: The bss values below do NOT include the stack requirement for the sequential API thread, nor the main configurable lwIP “heap” space. This is because the aim is to present an example of the base lwIP requirement, independent of the configured heap and stack space required for a particular application or target environment.

CortexM3 (STM32F2xx)        text + rodata    data     bss
Basic IPv4 static                   40224      16     516
Basic IPv4 AutoIP                   41660      16     516
Basic IPv4 DHCP                     46712      16     520
Basic IPv4 & IPv6                   58680      24     613
Reassembly IPv4 static              41928      16     526
Reassembly IPv4 & IPv6              60488      24     627
Full IPv4 & IPv6                    80512      24    1843

Note: Configurations built with the options CYGDBG_LWIP_DEBUG, CYGDBG_LWIP_ASSERTS or CYGDBG_LWIP_STATS enabled will have a significantly larger code footprint. Similarly configurations built with the CYGPKG_INFRA_DEBUG option or the compiler -O0 optimisation flag will also have a significant effect on the footprint.

Example "small" footprint

The example described in this section targets the STM3220G-EVAL platform, but similar figures have also been obtained for other platforms (e.g. AT91SAM7XEK).

With careful tuning it is possible to implement a simple raw API webserver using the httpd2 test example in ~32K of ROM and ~10K of RAM. This is for the complete application, thread stacks, network buffers, etc.

Even though httpd2 is a simple application, it provides a useful real-world data point for a minimal footprint system. Note: For this example build the httpd2.c source was modified to use the minimal STACK_SIZE definition.

The small_rom_stm3220g_httpd2.ecm example template used is provided in the lwIP package doc directory. The steps needed to build the minimal example binary are:

$ mkdir small_httpd2
$ cd small_httpd2
$ ecosconfig new stm3220g_eval
[ … ecosconfig output elided … ]
$ ecosconfig import $ECOS_REPOSITORY/net/lwip_tcpip/VERSION/doc/small_rom_stm3220g_httpd2.ecm
$ ecosconfig resolve
$ ecosconfig tree
$ make tests
[ … make output elided … ]
$ arm-eabi-objcopy -O binary install/tests/net/lwip_tcpip/VERSION/tests/httpd2 httpd2.bin
        

The produced httpd2.bin binary can then be loaded into the flash of the STM3220G-EVAL at address 0x08000000.

2017-02-09
Documentation license for this page: eCosPro License