OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: PetaScale calculations of the electronic structures of nanostructures with hundreds of thousands of processors

Abstract

Density functional theory (DFT) is the most widely used ab initio method in materials simulations; it accounts for 75% of the NERSC allocation time in the materials science category. DFT can be used to calculate the electronic structure, charge density, total energy, and atomic forces of a material system. With advances in HPC power and new algorithms, DFT can now be used to study thousand-atom systems in some limited ways (e.g., a single self-consistent calculation without atomic relaxation). But many problems require either much larger systems (e.g., >100,000 atoms) or many total-energy calculation steps (e.g., for molecular dynamics or atomic relaxation). Examples include grain-boundary and dislocation energies and atomic structures, impurity transport and clustering in semiconductors, nanostructure growth, and the electronic structures of nanostructures and their internal electric fields. Due to the O(N³) scaling of conventional DFT algorithms (as implemented in codes like Qbox, Paratec, and Petots), these problems are beyond reach even for petascale computers. As the proposed petascale computers might have millions of processors, new computational paradigms and algorithms are needed to solve such large-scale problems; in particular, O(N)-scaling algorithms that can be parallelized over up to millions of processors are needed. For a large materials science problem, a natural way to achieve this goal is a divide-and-conquer method: spatially divide the system into many small pieces and solve each piece with a small local group of processors. This addresses the O(N) scaling and the parallelization problem at the same time. The challenge of this approach is how to divide the system into small pieces and how to patch them back together without leaving any trace of the spatial division. Here we present a linear-scaling three-dimensional fragment (LS3DF) method that uses a novel division-patching scheme to cancel the artificial boundary effects of the spatial division. As a result, the LS3DF results are essentially the same as the original full-system DFT results (with differences smaller than chemical accuracy and smaller than other numerical uncertainties, e.g., those due to numerical grids), while requiring thousands of times fewer floating-point operations and thousands of times less computational time than the conventional DFT method. For example, using a few thousand processors, LS3DF can calculate a >10,000-atom system within an hour, while the conventional method might take more than a month to finish. The LS3DF method is applicable to insulator and semiconductor systems, and it covers a current gap in DOE's materials science code portfolio for ab initio ultrascale simulation. We will use it here to solve the internal electric field problem for composite nanostructures.
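A small counting exercise makes the division-patching idea concrete. In the LS3DF combination, the periodic supercell is divided into a grid of cells, and for every cell one generates fragments spanning one or two cells along each axis, combined with sign (-1) raised to the number of unit-length axes, so that every artificial fragment surface appears equally often with plus and minus signs. The Python sketch below is an illustration only (arbitrary grid size, no DFT solver, not the authors' code); it checks the identity that the signed fragment sum counts each cell exactly once:

from itertools import product

def fragment_sizes():
    # All 8 fragment shapes: 1 or 2 cells along each of the 3 axes.
    return list(product((1, 2), repeat=3))

def fragment_sign(size):
    # +1 for 2x2x2, 2x1x1, 1x2x1, 1x1x2; -1 for 2x2x1, 2x1x2, 1x2x2, 1x1x1.
    return (-1) ** sum(1 for s in size if s == 1)

def cells_in_fragment(anchor, size, grid):
    # Cells covered by a fragment anchored at `anchor`, with periodic wrap-around.
    return [tuple((a + d) % m for a, d, m in zip(anchor, offset, grid))
            for offset in product(*(range(s) for s in size))]

def signed_coverage(grid):
    # Signed number of fragments covering each cell of the periodic grid.
    coverage = {cell: 0 for cell in product(*(range(m) for m in grid))}
    for anchor in coverage:
        for size in fragment_sizes():
            sgn = fragment_sign(size)
            for cell in cells_in_fragment(anchor, size, grid):
                coverage[cell] += sgn
    return coverage

if __name__ == "__main__":
    grid = (4, 4, 4)  # illustrative 4 x 4 x 4 division of the supercell
    coverage = signed_coverage(grid)
    assert all(count == 1 for count in coverage.values())
    print("signed coverage of every cell:", set(coverage.values()))  # prints {1}

Because the boundary contributions cancel in this signed sum, each fragment can be computed independently by a small group of processors, with the fragments coupled only through a global potential, which is what gives the method its O(N) operation count and its essentially unlimited parallelism.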

Authors:
Wang, Lin-Wang; Zhao, Zhengji; Meza, Juan
Publication Date:
2006-04-01
Research Org.:
Ernest Orlando Lawrence Berkeley National Laboratory, Berkeley, CA (US)
Sponsoring Org.:
USDOE Director, Office of Science, Advanced Scientific Computing Research
OSTI Identifier:
929688
Report Number(s):
LBNL-63793
R&D Project: KX1310; BnR: KJ0102000; TRN: US0806640
DOE Contract Number:
DE-AC02-05CH11231
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English
Subject:
75; ALGORITHMS; CHARGE DENSITY; ELECTRIC FIELDS; ELECTRONIC STRUCTURE; NANOSTRUCTURES; SCALING

Citation Formats

Wang, Lin-Wang, Zhao, Zhengji, and Meza, Juan. PetaScale calculations of the electronic structures of nanostructures with hundreds of thousands of processors. United States: N. p., 2006. Web. doi:10.2172/929688.
Wang, Lin-Wang, Zhao, Zhengji, & Meza, Juan. PetaScale calculations of the electronic structures of nanostructures with hundreds of thousands of processors. United States. doi:10.2172/929688.
Wang, Lin-Wang, Zhao, Zhengji, and Meza, Juan. 2006. "PetaScale calculations of the electronic structures of nanostructures with hundreds of thousands of processors". United States. doi:10.2172/929688. https://www.osti.gov/servlets/purl/929688.
@article{osti_929688,
title = {PetaScale calculations of the electronic structures of nanostructures with hundreds of thousands of processors},
author = {Wang, Lin-Wang and Zhao, Zhengji and Meza, Juan},
doi = {10.2172/929688},
place = {United States},
year = {2006},
month = {apr}
}

Similar Records:
  • In this report we summarize research into new parallel algebraic multigrid (AMG) methods. We first provide an introduction to parallel AMG. We then discuss our research in parallel AMG algorithms for very large scale platforms. We detail significant improvements in the AMG setup phase to a matrix-matrix multiplication kernel. We present a smoothed aggregation AMG algorithm with fewer communication synchronization points, and discuss its links to domain decomposition methods. Finally, we discuss a multigrid smoothing technique that utilizes two message passing layers for use on multicore processors.
  • The newest generation of computers uses parallelism to enhance processing rates. This new technology makes possible speed improvements of a factor of 10 to 50 over even the fastest of the serial computers. However, these high processing rates are achieved only when the full capacity of the computer to perform operations in parallel is utilized. EPRI research project RP-670 includes several tasks that investigate the degree to which the computations of dynamic stability analysis can be performed in parallel. Both transient stability and small signal stability are considered. The transient stability problem is studied in significantly greater detail than the small signal stability problem. After a preliminary feasibility study produced positive results, the most computationally intensive part of a (simplified) transient stability program was coded and tested on the Floating Point Systems AP-120B. The code ran 82 times faster than a similar code running on an IBM 370/168.
  • Array processors can add substantially to the computation speed of low-cost supermini computers. In tests using a Bonneville Power Administration code, the devices showed promise in the solution portion of system power flow calculations but also displayed operating characteristics that might offset that advantage.
  • Petascale platforms with O(10⁵) and O(10⁶) processing cores are driving advancements in a wide range of scientific disciplines. These large systems create unprecedented application development challenges. Scalable correctness tools are critical to shorten the time-to-solution on these systems. Currently, many DOE application developers use primitive manual debugging based on printf or traditional debuggers such as TotalView or DDT. This paradigm breaks down beyond a few thousand cores, yet bugs often arise above that scale. Programmers must reproduce problems in smaller runs to analyze them with traditional tools, or else perform repeated runs at scale using only primitive techniques. Even when traditional tools run at scale, the approach wastes substantial effort and computation cycles. Continued scientific progress demands new paradigms for debugging large-scale applications. The Correctness on Petascale Systems (CoPS) project is developing a revolutionary debugging scheme that will reduce the debugging problem to a scale that human developers can comprehend. The scheme can provide precise diagnoses of the root causes of failure, including suggestions of the location and the type of errors down to the level of code regions or even a single execution point. Our fundamentally new strategy combines and expands three relatively new complementary debugging approaches. The Stack Trace Analysis Tool (STAT), a 2011 R&D 100 Award winner, identifies behavior equivalence classes in MPI jobs and highlights cases where elements of a class demonstrate divergent behavior, often the first indicator of an error. The Cooperative Bug Isolation (CBI) project has developed statistical techniques for isolating programming errors in widely deployed code that we will adapt to large-scale parallel applications. Finally, we are developing a new approach to parallelizing expensive correctness analyses, such as the analysis of memory usage in the Memgrind tool. In the first two years of the project, we have successfully extended STAT to determine the relative progress of different MPI processes. We have shown that STAT, which is now included in the debugging tools distributed by Cray with their large-scale systems, substantially reduces the scale at which traditional debugging techniques are applied. We have extended CBI to large-scale systems and developed new compiler-based analyses that reduce its instrumentation overhead. Our results demonstrate that CBI can identify the source of errors in large-scale applications. Finally, we have developed MPIecho, a new technique that will reduce the time required to perform key correctness analyses, such as the detection of writes to unallocated memory. Overall, our research results are the foundations for new debugging paradigms that will improve application scientist productivity by reducing the time to determine which package or module contains the root cause of a problem that arises at any scale of our high-end systems. While we have made substantial progress in the first two years of CoPS research, significant work remains. While STAT provides scalable debugging assistance for incorrect application runs, we could apply its techniques to assertions in order to observe deviations from expected behavior. Further, we must continue to refine STAT's techniques to represent behavioral equivalence classes efficiently, as we expect systems with millions of threads in the next year. We are exploring new CBI techniques that can assess the likelihood that execution deviations from past behavior are the source of erroneous execution. Finally, we must develop usable correctness analyses that apply the MPIecho parallelization strategy in order to locate coding errors. We expect to make substantial progress on these directions in the next year, but anticipate that significant work will remain to provide usable, scalable debugging paradigms.
  • This project investigated novel techniques for debugging scientific applications on petascale architectures. In particular, we developed lightweight tools that narrow the problem space when bugs are encountered. We also developed techniques that either limit the number of tasks and the code regions to which a developer must apply a traditional debugger, or that apply statistical techniques to provide direct suggestions of the location and type of error. We extended previous work on the Stack Trace Analysis Tool (STAT), which has already demonstrated scalability to over one hundred thousand MPI tasks. We also extended to large-scale parallel applications the statistical techniques developed in the Cooperative Bug Isolation (CBI) project to isolate programming errors in widely used sequential or threaded applications. Overall, our research substantially improved productivity on petascale platforms through a tool set for debugging that complements existing commercial tools. Previously, Office of Science application developers relied either on primitive manual debugging techniques based on printf or on tools, such as TotalView, that do not scale beyond a few thousand processors. However, bugs often arise at scale, and substantial effort and computation cycles are wasted either in reproducing the problem in a smaller run that can be analyzed with the traditional tools or in repeated runs at scale that use the primitive techniques. New techniques that work at scale and automate the process of identifying the root cause of errors were needed. These techniques significantly reduced the time spent debugging petascale applications, thus leaving a greater overall amount of time for application scientists to pursue the scientific objectives for which the systems are purchased. We developed a new paradigm for debugging at scale: techniques that reduce the debugging scenario to a scale suitable for traditional debuggers, e.g., by narrowing the search for the root-cause analysis to a small set of nodes or by identifying equivalence classes of nodes and sampling our debug targets from them. We implemented these techniques as lightweight tools that work efficiently at the full scale of the target machine. We explored four lightweight debugging refinements: generic classification parameters, such as stack traces; application-specific classification parameters, such as global variables; statistical data acquisition techniques; and machine-learning-based approaches to perform root cause analysis. Work done under this project can be divided into two categories: new algorithms and techniques for scalable debugging, and foundational infrastructure work on our MRNet multicast-reduction framework for scalability and on the Dyninst binary analysis and instrumentation toolkits.
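The last two summaries both lean on grouping processes into behavioral equivalence classes so that only a few representatives need a traditional debugger. The toy Python sketch below uses made-up stack traces and a hypothetical helper name, not STAT's actual interface, just to show that bookkeeping in its simplest form.

from collections import defaultdict

def group_by_stack_trace(traces):
    # Map each distinct stack trace to the sorted list of ranks showing it.
    classes = defaultdict(list)
    for rank, trace in traces.items():
        classes[trace].append(rank)
    return {trace: sorted(ranks) for trace, ranks in classes.items()}

if __name__ == "__main__":
    # rank -> collapsed stack trace (illustrative strings only)
    traces = {
        0: "main > solve > MPI_Allreduce",
        1: "main > solve > MPI_Allreduce",
        2: "main > io_write > MPI_File_write",   # the odd one out
        3: "main > solve > MPI_Allreduce",
    }
    for trace, ranks in group_by_stack_trace(traces).items():
        rep = ranks[0]  # sample one debug target per equivalence class
        print(f"{len(ranks)} rank(s) at [{trace}]; attach a debugger to rank {rep}")

Sampling one representative rank per class is what reduces a job with hundreds of thousands of tasks to a handful of debug targets that a conventional debugger can handle.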