skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information
  1. DeepPDF: A Deep Learning Approach to Extracting Text from PDFs

    Scientific publications contain a plethora of important information, not only for researchers but also for their managers and institutions. Many researchers try to collect and extract this information in large enough quantities that it requires machine automation, but because publications were historically intended for print and not machine consumption, the digital document formats used today (primarily PDF) have created many hurdles for text extraction. Primarily, tools have relied on trying to convert PDF's to plain text documents for machine processing by reverse engineering the PDF standard. A complex process because once a PDF is created it is more closely relatedmore » to an image file than a document markup language. In this paper we explore the feasibility of treating these PDF documents as images as opposed to a proprietary markup language. We believe that by using deep learning and image analysis we can create more accurate PDF to text extraction tools than those that currently exist. \\ \newline \Keywords{deep learning, text extraction, information extraction, PDF extraction, scholarly publications.« less
  2. High-Throughput Computing on High-Performance Platforms: A Case Study

    The computing systems used by LHC experiments has historically consisted of the federation of hundreds to thousands of distributed resources, ranging from small to mid-size resource. In spite of the impressive scale of the existing distributed computing solutions, the federation of small to mid-size resources will be insufficient to meet projected future demands. This paper is a case study of how the ATLAS experiment has embraced Titan -- a DOE leadership facility in conjunction with traditional distributed high- throughput computing to reach sustained production scales of approximately 52M core-hours a years. The three main contributions of this paper are: (i)more » a critical evaluation of design and operational considerations to support the sustained, scalable and production usage of Titan; (ii) a preliminary characterization of a next generation executor for PanDA to support new workloads and advanced execution modes; and (iii) early lessons for how current and future experimental and observational systems can be integrated with production supercomputers and other platforms in a general and extensible manner.« less
  3. Integration of Titan supercomputer at OLCF with ATLAS Production System

    The PanDA (Production and Distributed Analysis) workload management system was developed to meet the scale and complexity of distributed computing for the ATLAS experiment. PanDA managed resources are distributed worldwide, on hundreds of computing sites, with thousands of physicists accessing hundreds of Petabytes of data and the rate of data processing already exceeds Exabyte per year. While PanDA currently uses more than 200,000 cores at well over 100 Grid sites, future LHC data taking runs will require more resources than Grid computing can possibly provide. Additional computing and storage resources are required. Therefore ATLAS is engaged in an ambitious programmore » to expand the current computing model to include additional resources such as the opportunistic use of supercomputers. In this paper we will describe a project aimed at integration of ATLAS Production System with Titan supercomputer at Oak Ridge Leadership Computing Facility (OLCF). Current approach utilizes modified PanDA Pilot framework for job submission to Titan's batch queues and local data management, with lightweight MPI wrappers to run single node workloads in parallel on Titan's multi-core worker nodes. It provides for running of standard ATLAS production jobs on unused resources (backfill) on Titan. The system already allowed ATLAS to collect on Titan millions of core-hours per month, execute hundreds of thousands jobs, while simultaneously improving Titans utilization efficiency. We will discuss the details of the implementation, current experience with running the system, as well as future plans aimed at improvements in scalability and efficiency.« less
  4. Announcing Supercomputer Summit

    Summit is the next leap in leadership-class computing systems for open science. With Summit we will be able to address, with greater complexity and higher fidelity, questions concerning who we are, our place on earth, and in our universe. Summit will deliver more than five times the computational performance of Titan’s 18,688 nodes, using only approximately 3,400 nodes when it arrives in 2017. Like Titan, Summit will have a hybrid architecture, and each node will contain multiple IBM POWER9 CPUs and NVIDIA Volta GPUs all connected together with NVIDIA’s high-speed NVLink. Each node will have over half a terabyte ofmore » coherent memory (high bandwidth memory + DDR4) addressable by all CPUs and GPUs plus 800GB of non-volatile RAM that can be used as a burst buffer or as extended memory. To provide a high rate of I/O throughput, the nodes will be connected in a non-blocking fat-tree using a dual-rail Mellanox EDR InfiniBand interconnect. Upon completion, Summit will allow researchers in all fields of science unprecedented access to solving some of the world’s most pressing challenges.« less
  5. Crosscut report: Exascale Requirements Reviews, March 9–10, 2017 – Tysons Corner, Virginia. An Office of Science review sponsored by: Advanced Scientific Computing Research, Basic Energy Sciences, Biological and Environmental Research, Fusion Energy Sciences, High Energy Physics, Nuclear Physics

    The mission of the U.S. Department of Energy Office of Science (DOE SC) is the delivery of scientific discoveries and major scientific tools to transform our understanding of nature and to advance the energy, economic, and national security missions of the United States. To achieve these goals in today’s world requires investments in not only the traditional scientific endeavors of theory and experiment, but also in computational science and the facilities that support large-scale simulation and data analysis. The Advanced Scientific Computing Research (ASCR) program addresses these challenges in the Office of Science. ASCR’s mission is to discover, develop, andmore » deploy computational and networking capabilities to analyze, model, simulate, and predict complex phenomena important to DOE. ASCR supports research in computational science, three high-performance computing (HPC) facilities — the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory and Leadership Computing Facilities at Argonne (ALCF) and Oak Ridge (OLCF) National Laboratories — and the Energy Sciences Network (ESnet) at Berkeley Lab. ASCR is guided by science needs as it develops research programs, computers, and networks at the leading edge of technologies. As we approach the era of exascale computing, technology changes are creating challenges for science programs in SC for those who need to use high performance computing and data systems effectively. Numerous significant modifications to today’s tools and techniques will be needed to realize the full potential of emerging computing systems and other novel computing architectures. To assess these needs and challenges, ASCR held a series of Exascale Requirements Reviews in 2015–2017, one with each of the six SC program offices,1 and a subsequent Crosscut Review that sought to integrate the findings from each. Participants at the reviews were drawn from the communities of leading domain scientists, experts in computer science and applied mathematics, ASCR facility staff, and DOE program managers in ASCR and the respective program offices. The purpose of these reviews was to identify mission-critical scientific problems within the DOE Office of Science (including experimental facilities) and determine the requirements for the exascale ecosystem that would be needed to address those challenges. The exascale ecosystem includes exascale computing systems, high-end data capabilities, efficient software at scale, libraries, tools, and other capabilities. This effort will contribute to the development of a strategic roadmap for ASCR compute and data facility investments and will help the ASCR Facility Division establish partnerships with Office of Science stakeholders. It will also inform the Office of Science research needs and agenda. The results of the six reviews have been published in reports available on the web at http://exascaleage.org/. This report presents a summary of the individual reports and of common and crosscutting findings, and it identifies opportunities for productive collaborations among the DOE SC program offices.« less
  6. Advanced Scientific Computing Research Exascale Requirements Review. An Office of Science review sponsored by Advanced Scientific Computing Research, September 27-29, 2016, Rockville, Maryland

    The widespread use of computing in the American economy would not be possible without a thoughtful, exploratory research and development (R&D) community pushing the performance edge of operating systems, computer languages, and software libraries. These are the tools and building blocks — the hammers, chisels, bricks, and mortar — of the smartphone, the cloud, and the computing services on which we rely. Engineers and scientists need ever-more specialized computing tools to discover new material properties for manufacturing, make energy generation safer and more efficient, and provide insight into the fundamentals of the universe, for example. The research division of themore » U.S. Department of Energy’s (DOE’s) Office of Advanced Scientific Computing and Research (ASCR Research) ensures that these tools and building blocks are being developed and honed to meet the extreme needs of modern science. See also http://exascaleage.org/ascr/ for additional information.« less
  7. High Performance Computing Facility Operational Assessment 2016 - Oak Ridge Leadership Computing Facility

    Oak Ridge National Laboratory’s (ORNL’s) Leadership Computing Facility (OLCF) continues to surpass its operational target goals: supporting users; delivering fast, reliable systems; creating innovative solutions for high-performance computing (HPC) needs; and managing risks, safety, and security associated with operating one of the most powerful computers in the world. The results can be seen in the cutting-edge science delivered by users and the praise from the research community. Calendar year (CY) 2016 was filled with outstanding operational results and accomplishments: a very high rating from users on overall satisfaction for the third year in a row; the greatest number of coremore » hours delivered to research projects; the largest number of research publications for the life of Titan; and success in delivering on the allocation of 60%, 30%, and 10% of core hours offered for the Innovative and Novel Computational Impact on Theory and Experiment (INCITE), Advanced Scientific Computing Research Leadership Computing Challenge (ALCC), and Director’s Discretionary (DD) programs, respectively. These accomplishments, coupled with the extremely high utilization rate, represent the fulfillment of the promise of Titan: maximum use by maximum-size simulations.« less
  8. Integration Of PanDA Workload Management System With Supercomputers for ATLAS and Data Intensive Science

    The Large Hadron Collider (LHC), operating at the international CERN Laboratory in Geneva, Switzerland, is leading Big Data driven scientific explorations. Experiments at the LHC explore the fundamental nature of matter and the basic forces that shape our universe, and were recently credited for the discovery of a Higgs boson. ATLAS, one of the largest collaborations ever assembled in the sciences, is at the forefront of research at the LHC. To address an unprecedented multi-petabyte data processing challenge, the ATLAS experiment is relying on a heterogeneous distributed computational infrastructure. The ATLAS experiment uses PanDA (Production and Data Analysis) Workload Managementmore » System for managing the workflow for all data processing on over 150 data centers. Through PanDA, ATLAS physicists see a single computing facility that enables rapid scientific breakthroughs for the experiment, even though the data centers are physically scattered all over the world. While PanDA currently uses more than 250,000 cores with a peak performance of 0.3 petaFLOPS, LHC data taking runs require more resources than Grid computing can possibly provide. To alleviate these challenges, LHC experiments are engaged in an ambitious program to expand the current computing model to include additional resources such as the opportunistic use of supercomputers. We will describe a project aimed at integration of PanDA WMS with supercomputers in United States, Europe and Russia (in particular with Titan supercomputer at Oak Ridge Leadership Computing Facility (OLCF), MIRA supercomputer at Argonne Leadership Computing Facilities (ALCF), Supercomputer at the National Research Center Kurchatov Institute , IT4 in Ostrava and others). Current approach utilizes modified PanDA pilot framework for job submission to the supercomputers batch queues and local data management, with light-weight MPI wrappers to run single threaded workloads in parallel on LCFs multi-core worker nodes. This implementation was tested with a variety of Monte-Carlo workloads on several supercomputing platforms for ALICE and ATLAS experiments and it is in full production for the ATLAS experiment since September 2015. We will present our current accomplishments with running PanDA WMS at supercomputers and demonstrate our ability to use PanDA as a portal independent of the computing facilities infrastructure for High Energy and Nuclear Physics as well as other data-intensive science applications, such as bioinformatics and astro-particle physics.« less
  9. Measuring Scientific Impact Beyond Citation Counts

  10. INTEGRATION OF PANDA WORKLOAD MANAGEMENT SYSTEM WITH SUPERCOMPUTERS

    Abstract The Large Hadron Collider (LHC), operating at the international CERN Laboratory in Geneva, Switzerland, is leading Big Data driven scientific explorations. Experiments at the LHC explore the funda- mental nature of matter and the basic forces that shape our universe, and were recently credited for the dis- covery of a Higgs boson. ATLAS, one of the largest collaborations ever assembled in the sciences, is at the forefront of research at the LHC. To address an unprecedented multi-petabyte data processing challenge, the ATLAS experiment is relying on a heterogeneous distributed computational infrastructure. The ATLAS experiment uses PanDA (Production and Datamore » Analysis) Workload Management System for managing the workflow for all data processing on over 140 data centers. Through PanDA, ATLAS physicists see a single computing facility that enables rapid scientific breakthroughs for the experiment, even though the data cen- ters are physically scattered all over the world. While PanDA currently uses more than 250000 cores with a peak performance of 0.3+ petaFLOPS, next LHC data taking runs will require more resources than Grid computing can possibly provide. To alleviate these challenges, LHC experiments are engaged in an ambitious program to expand the current computing model to include additional resources such as the opportunistic use of supercomputers. We will describe a project aimed at integration of PanDA WMS with supercomputers in United States, Europe and Russia (in particular with Titan supercomputer at Oak Ridge Leadership Com- puting Facility (OLCF), Supercomputer at the National Research Center Kurchatov Institute , IT4 in Ostrava, and others). The current approach utilizes a modified PanDA pilot framework for job submission to the supercomputers batch queues and local data management, with light-weight MPI wrappers to run single- threaded workloads in parallel on Titan s multi-core worker nodes. This implementation was tested with a variety of Monte-Carlo workloads on several supercomputing platforms. We will present our current accom- plishments in running PanDA WMS at supercomputers and demonstrate our ability to use PanDA as a portal independent of the computing facility s infrastructure for High Energy and Nuclear Physics, as well as other data-intensive science applications, such as bioinformatics and astro-particle physics.« less
...

Search for:
All Records
Creator / Author
"Wells, Jack"

Refine by:
Resource Type
Availability
Publication Date
Creator / Author
Research Organization