OSTI.GOV U.S. Department of Energy
Office of Scientific and Technical Information

Title: An analysis of 10-gigabit ethernet protocol stacks in multicore environments.

Abstract

This paper analyzes the interactions between the protocol stack (TCP/IP or iWARP over 10-Gigabit Ethernet) and its multicore environment. Specifically, for host-based protocols such as TCP/IP, we notice that a significant amount of processing is statically assigned to a single core, resulting in an imbalance of load on the different cores of the system and adversely impacting the performance of many applications. For host-offloaded protocols such as iWARP, on the other hand, the portions of the communication stack that are performed on the host, such as buffering of messages and memory copies, are closely tied to the associated process, and hence do not create such load imbalances. Thus, in this paper, we demonstrate that by intelligently mapping different processes of an application to specific cores, the imbalance created by the TCP/IP protocol stack can be largely countered and application performance significantly improved. At the same time, since the load is better balanced in host-offloaded protocols such as iWARP, such mapping does not adversely affect their performance, thus keeping the mapping generic enough to be used with multiple protocol stacks.
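The central idea above is an explicit mapping of application processes onto specific cores, so that processes stay clear of the core on which the TCP/IP stack's processing is statically concentrated. As an illustration only (a minimal sketch, not code from the paper), such a mapping can be expressed on Linux with the standard sched_setaffinity(2) call; the core number and command-line handling here are assumptions made for the example:

/*
 * Illustrative sketch only (not from the paper): pin the calling process
 * to one core using the Linux sched_setaffinity(2) interface.  The aim is
 * to keep application processes away from the core on which the TCP/IP
 * stack's protocol processing is statically assigned; the core number
 * taken from the command line is purely an example.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int core = (argc > 1) ? atoi(argv[1]) : 0;   /* target core (assumed) */

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    printf("pid %d pinned to core %d\n", (int)getpid(), core);
    /* ... communication and computation would run here ... */
    return 0;
}

In an MPI-style application, each process would select a distinct core in this way, leaving the core that services network interrupts and protocol processing as unloaded as possible; host-offloaded stacks such as iWARP are, per the abstract, largely insensitive to the particular choice of mapping.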

Authors:
Narayanaswamy, G.; Balaji, P.; Feng, W.; Virginia Tech
Publication Date:
2007
Research Org.:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Org.:
USDOE Office of Science (SC); National Science Foundation (NSF); Virginia Tech
OSTI Identifier:
971470
Report Number(s):
ANL/MCS/CP-59628
TRN: US201004%%26
DOE Contract Number:
DE-AC02-06CH11357
Resource Type:
Conference
Resource Relation:
Conference: 15th Annual IEEE Symposium on High-Performance Interconnects; Aug. 22, 2007 - Aug. 24, 2007; Palo Alto, CA
Country of Publication:
United States
Language:
ENGLISH
Subject:
97 MATHEMATICAL METHODS AND COMPUTING; 99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; COMPUTER NETWORKS; DATA TRANSMISSION; PERFORMANCE; PARALLEL PROCESSING; MEMORY MANAGEMENT

Citation Formats

Narayanaswamy, G., Balaji, P., Feng, W., and Virginia Tech. An analysis of 10-gigabit ethernet protocol stacks in multicore environments. United States: N. p., 2007. Web. doi:10.1109/HOTI.2007.14.
Narayanaswamy, G., Balaji, P., Feng, W., & Virginia Tech. An analysis of 10-gigabit ethernet protocol stacks in multicore environments. United States. doi:10.1109/HOTI.2007.14.
Narayanaswamy, G., Balaji, P., Feng, W., and Virginia Tech. 2007. "An analysis of 10-gigabit ethernet protocol stacks in multicore environments". United States. doi:10.1109/HOTI.2007.14.
@inproceedings{osti_971470,
title = {An analysis of 10-gigabit ethernet protocol stacks in multicore environments},
author = {Narayanaswamy, G. and Balaji, P. and Feng, W. and {Virginia Tech}},
abstractNote = {This paper analyzes the interactions between the protocol stack (TCP/IP or iWARP over 10-Gigabit Ethernet) and its multicore environment. Specifically, for host-based protocols such as TCP/IP, we notice that a significant amount of processing is statically assigned to a single core, resulting in an imbalance of load on the different cores of the system and adversely impacting the performance of many applications. For host-offloaded protocols such as iWARP, on the other hand, the portions of the communication stack that are performed on the host, such as buffering of messages and memory copies, are closely tied to the associated process, and hence do not create such load imbalances. Thus, in this paper, we demonstrate that by intelligently mapping different processes of an application to specific cores, the imbalance created by the TCP/IP protocol stack can be largely countered and application performance significantly improved. At the same time, since the load is better balanced in host-offloaded protocols such as iWARP, such mapping does not adversely affect their performance, thus keeping the mapping generic enough to be used with multiple protocol stacks.},
booktitle = {15th Annual IEEE Symposium on High-Performance Interconnects},
doi = {10.1109/HOTI.2007.14},
place = {United States},
year = {2007}
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.

Similar Records:
  • A Link Source Card (LSC) has been developed which employs Gigabit Ethernet as the physical medium. The LSC is implemented as a mezzanine card compliant with the S-Link specifications, and is intended for use in development of the Region of Interest Builder (ROIB) in the Level 2 Trigger of ATLAS. The LSC will be used to bring Region of Interest Fragments from Level 1 Trigger elements to the ROIB, and to transfer compiled Region of Interest Records to Supervisor Processors. The card uses the LSI 8101/8104 Media Access Controller (MAC) [1] and the Agilent HDMP-1636 Transceiver. An Altera 10K50A FPGA [2] is configured to provide several state machines which perform all the tasks on the card, such as formulating the Ethernet header, reading/writing registers in the MAC, etc. An on-card static RAM provides storage for 512K S-Link words, and a FIFO provides 4K buffering of input S-Link words. The LSC has been tested in a setup where it transfers data to a NIC in the PCI bus of a PC.
  • Multiple copper-based commodity Gigabit Ethernet (GigE) interconnects (adapters) on a single host can lead to Linux clusters with mesh/torus connections without using expensive switches and high speed network interconnects (NICs). However, traditional message passing systems based on TCP for GigE will not perform well for this type of cluster because of the overhead of TCP for multiple GigE links. In this paper, we present two OS-bypass message passing systems that are based on a modified M-VIA (an implementation of the VIA specification) for two production GigE mesh clusters: one is constructed as a 4 x 8 x 8 (256 nodes) torus and has been in production use for a year; the other is constructed as a 6 x 8 x 8 (384 nodes) torus and was deployed recently. One of the message passing systems targets a specific application domain and is called QMP, and the other is an implementation of MPI specification 1.1. The GigE mesh clusters using these two message passing systems achieve about 18.5 µs half-way round trip latency and 400 MB/s total bandwidth, which compare reasonably well to systems using specialized high speed adapters in a switched architecture at much lower costs.
  • Recent progress in performance coupled with a decline in price for copper-based gigabit Ethernet (GigE) interconnects makes them an attractive alternative to expensive high speed network interconnects (NICs) when constructing Linux clusters. However traditional message passing systems based on TCP for GigE interconnects cannot fully utilize the raw performance of today's GigE interconnects due to the overhead of kernel involvement and multiple memory copies during sending and receiving messages. The overhead is more evident in the case of mesh connected Linux clusters using multiple GigE interconnects in a single host. We present a general message passing system called QMP-MVIA (QCD Message Passing over M-VIA) for Linux clusters with mesh connections using GigE interconnects. In particular, we evaluate and compare the performance characteristics of TCP and M-VIA (an implementation of the VIA specification) software for a mesh communication architecture to demonstrate the feasibility of using M-VIA as a point-to-point communication software, on which QMP-MVIA is based. Furthermore, we illustrate the design and implementation of QMP-MVIA for mesh connected Linux clusters with emphasis on both point-to-point and collective communications, and demonstrate that the QMP-MVIA message passing system using GigE interconnects achieves bandwidth and latency that are not only better than systems based on TCP but also compare favorably to systems using some of the specialized high speed interconnects in a switched architecture at much lower cost.
  • No abstract prepared.
  • The CMS Data Acquisition system is designed to build and filter events originating from approximately 500 data sources from the detector at a maximum Level 1 trigger rate of 100 kHz and with an aggregate throughput of 100 GByte/s. For this purpose different architectures and switch technologies have been evaluated. Events will be built in two stages: the first stage, the FED Builder, will be based on Myrinet technology and will pre-assemble groups of about 8 data sources. The next stage, the Readout Builder, will perform the building of full events. The requirement of one Readout Builder is to build events at 12.5 kHz with average size of 16 kBytes from 64 sources. In this paper we present the prospects of a Readout Builder based on TCP/IP over Gigabit Ethernet. Various Readout Builder architectures that we are considering are discussed. The results of throughput measurements and scaling performance are outlined as well as the preliminary estimates of the final performance. All these studies have been carried out at our test-bed farms that are made up of a total of 130 dual Xeon PCs interconnected with Myrinet and Gigabit Ethernet networking and switching technologies.
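The related records above quote half-round-trip latency and bandwidth figures (for example, about 18.5 µs and 400 MB/s for the GigE mesh clusters). Numbers of that kind are conventionally obtained with a ping-pong microbenchmark between two processes. The following minimal MPI sketch illustrates that measurement pattern only; it is not code from any of the cited systems, which expose their own APIs (QMP, QMP-MVIA, M-VIA):

/*
 * Illustrative MPI ping-pong sketch (not code from the cited systems).
 * Rank 0 sends a message to rank 1 and waits for it to come back; half of
 * the averaged round-trip time is the half-round-trip latency, and the
 * message size divided by that time approximates the bandwidth.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ITERS 1000

int main(int argc, char **argv)
{
    int rank, nprocs;
    int bytes = (argc > 1) ? atoi(argv[1]) : 8;   /* message size (example default) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    char *buf = malloc(bytes);
    memset(buf, 0, bytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0) {
        double half_rtt = (t1 - t0) / ITERS / 2.0;   /* seconds */
        printf("%d bytes: %.2f us half round trip, %.2f MB/s\n",
               bytes, half_rtt * 1e6, bytes / half_rtt / 1e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

Run with two ranks (for example, mpirun -np 2 ./pingpong 8), small message sizes approximate the latency and large message sizes approximate the sustainable bandwidth.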