Search Results

Searched:  Inventor(s) Must Contain (Gara, Alan)
Sorted By:  Relevance, Descending
Results:  1–25 of exactly 38 matches.
 
  Patent Title Inventor(s) Issue Date Patent Number Full Text
A hybrid counter array device for counting events. The hybrid counter array includes a first counter portion comprising N counter devices, each counter device for receiving signals representing occurrences of events from an event source and providing a first count value corresponding to the lower order bits of the hybrid counter array. The hybrid counter array includes a second counter portion comprising a memory array device having N addressable memory locations in correspondence with the N counter devices, each addressable memory location for storing a second count value representing higher order bits of the hybrid counter array. A control device monitors each of the N counter devices of the first counter portion and initiates updating a value of a corresponding second count value stored at the corresponding addressable memory location in the second counter portion. Thus, a combination of the first and second count values provides an instantaneous measure of the number of events received.
Space and power efficient hybrid counters array
Gara, Alan G. , Salapura, Valentina 05/12/2009 7,532,700
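The split between small fast counters and a backing memory array described in this abstract can be illustrated with a minimal sketch. The class name, the 4-bit low-order width, and the choice to update the memory word immediately on wrap-around are assumptions made for illustration; the abstract only says that a control device monitors the counters and initiates the update.

```python
class HybridCounterArray:
    """N small counters (low-order bits) backed by a memory array (high-order bits)."""

    def __init__(self, n, low_bits=4):
        self.low_bits = low_bits   # assumed width of each fast counter
        self.low = [0] * n         # first counter portion: one small counter per event source
        self.high = [0] * n        # second counter portion: one memory word per counter

    def count_event(self, i):
        self.low[i] += 1
        if self.low[i] == 1 << self.low_bits:   # low-order counter wrapped around
            self.low[i] = 0
            self.high[i] += 1                   # control logic updates the memory word

    def read(self, i):
        # Concatenate the high-order and low-order bits into the full count.
        return (self.high[i] << self.low_bits) | self.low[i]


counters = HybridCounterArray(n=2, low_bits=4)
for _ in range(37):
    counters.count_event(0)
print(counters.read(0), counters.high[0], counters.low[0])  # 37 2 5
```

Only the low-order counters need to run at event rate; the memory array is touched once per wrap-around, which is where the space and power saving in the title would come from.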
A hybrid counter array device for counting events with interrupt indication includes a first counter portion comprising N counter devices, each for counting signals representing event occurrences and providing a first count value representing lower order bits. An overflow bit device associated with each respective counter device is additionally set in response to an overflow condition. The hybrid counter array includes a second counter portion comprising a memory array device having N addressable memory locations in correspondence with the N counter devices, each addressable memory location for storing a second count value representing higher order bits. An operatively coupled control device monitors each associated overflow bit device and initiates incrementing a second count value stored at a corresponding memory location in response to a respective overflow bit being set. The incremented second count value is compared to an interrupt threshold value stored in a threshold register, and, when the second counter value is equal to the interrupt threshold value, a corresponding "interrupt arm" bit is set to enable a fast interrupt indication. On a subsequent roll-over of the lower bits of that counter, the interrupt will be fired.
Low latency counter event indication
Gara, Alan G. , Salapura, Valentina 08/24/2010 7,782,995
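The arm-then-fire sequence in this abstract can be traced with a small sketch. The class and method names, the 2-bit low-order counter, and the threshold of 2 are hypothetical; the point is only that the threshold comparison happens when the high-order value is incremented, and the interrupt fires on the next low-order roll-over without another memory access.

```python
class ArmedCounter:
    def __init__(self, low_bits, threshold):
        self.low_bits = low_bits
        self.threshold = threshold   # value held in the threshold register
        self.low = 0                 # fast low-order counter
        self.high = 0                # high-order bits kept in the memory array
        self.overflow = False        # overflow bit for this counter
        self.armed = False           # "interrupt arm" bit
        self.fired = False

    def count_event(self):
        self.low = (self.low + 1) % (1 << self.low_bits)
        if self.low == 0:            # roll-over of the low-order bits
            if self.armed:
                self.fired = True    # fast path: the interrupt fires immediately
            self.overflow = True

    def service_overflow(self):
        # Control device: increment the high-order value, then compare to the threshold.
        if self.overflow:
            self.overflow = False
            self.high += 1
            if self.high == self.threshold:
                self.armed = True


counter = ArmedCounter(low_bits=2, threshold=2)
fired_at = None
for event in range(1, 21):
    counter.count_event()
    counter.service_overflow()
    if counter.fired and fired_at is None:
        fired_at = event
print(fired_at)  # 12
```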
A hybrid counter array device for counting events. The hybrid counter array includes a first counter portion comprising N counter devices, each counter device for receiving signals representing occurrences of events from an event source and providing a first count value corresponding to the lower order bits of the hybrid counter array. The hybrid counter array includes a second counter portion comprising a memory array device having N addressable memory locations in correspondence with the N counter devices, each addressable memory location for storing a second count value representing higher order bits of the hybrid counter array. A control device monitors each of the N counter devices of the first counter portion and initiates updating a value of a corresponding second count value stored at the corresponding addressable memory location in the second counter portion. Thus, a combination of the first and second count values provides an instantaneous measure of the number of events received.
Space and power efficient hybrid counters array
Gara, Alan G. , Salapura, Valentina 03/30/2010 7,688,931
A hybrid counter array device for counting events with interrupt indication includes a first counter portion comprising N counter devices, each for counting signals representing event occurrences and providing a first count value representing lower order bits. An overflow bit device associated with each respective counter device is additionally set in response to an overflow condition. The hybrid counter array includes a second counter portion comprising a memory array device having N addressable memory locations in correspondence with the N counter devices, each addressable memory location for storing a second count value representing higher order bits. An operatively coupled control device monitors each associated overflow bit device and initiates incrementing a second count value stored at a corresponding memory location in response to a respective overflow bit being set. The incremented second count value is compared to an interrupt threshold value stored in a threshold register, and, when the second counter value is equal to the interrupt threshold value, a corresponding "interrupt arm" bit is set to enable a fast interrupt indication. On a subsequent roll-over of the lower bits of that counter, the interrupt will be fired.
Low latency counter event indication
Gara, Alan G. , Salapura, Valentina 09/16/2008 7,426,253
An apparatus and method for providing a data eye monitor. The data eye monitor apparatus utilizes an inverter/latch string circuit and a set of latches to save the data eye for providing an infinite persistent data eye. In operation, incoming read data signals are adjusted in the first stage individually and latched to provide the read data to the requesting unit. The data is also simultaneously fed into a balanced XOR tree to combine the transitions of all incoming read data signals into a single signal. This signal is passed along a delay chain and tapped at constant intervals. The tap points are fed into latches, capturing the transitions at a delay element interval resolution. Using XORs, differences between adjacent taps and therefore transitions are detected. The eye is defined by segments that show no transitions over a series of samples. The eye size and position can be used to readjust the delay of incoming signals and/or to control environment parameters like voltage, clock speed and temperature.
Data eye monitor method and apparatus
Gara, Alan G. , Marcella, James A. , Ohmacht, Martin 01/31/2012 8,108,738
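The tap-and-XOR step in this abstract can be sketched in software. The function name and the sample data are hypothetical, and a real implementation would be latches and gates rather than Python lists; the sketch only shows how XORing adjacent taps marks transitions and how the eye emerges as the longest stretch that never showed one.

```python
def find_eye(samples):
    """samples: list of tap snapshots, each a list of 0/1 values along the delay chain."""
    n_gaps = len(samples[0]) - 1
    seen = [False] * n_gaps                   # persistent record of transitions per tap gap
    for taps in samples:
        for i in range(n_gaps):
            if taps[i] ^ taps[i + 1]:         # XOR of adjacent taps flags a transition
                seen[i] = True
    # The eye is the longest run of gaps that never showed a transition.
    best_start, best_len, start = 0, 0, None
    for i, hit in enumerate(seen + [True]):   # the sentinel closes a trailing run
        if not hit and start is None:
            start = i
        elif hit and start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    return best_start, best_len


snapshots = [
    [0, 0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 1],
]
eye = find_eye(snapshots)
print(eye)  # (2, 3)
```

The returned start and length are in units of one delay element, which matches the resolution the abstract attributes to the tap spacing.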
The present invention is directed to a checkpointing filesystem of a distributed-memory parallel supercomputer comprising a node that accesses user data on the filesystem, the filesystem comprising an interface that is associated with a disk for storing the user data. The checkpointing filesystem provides for taking a checkpoint of the filesystem and rolling back to a previously taken checkpoint, as well as for writing user data to and deleting user data from the checkpointing filesystem. The checkpointing filesystem provides a recently written file allocation table (WFAT) for maintaining information regarding the user data written since a previously taken checkpoint and a recently deleted file allocation table (DFAT) for maintaining information regarding user data deleted since the previously taken checkpoint, both of which are utilized by the checkpointing filesystem to take a checkpoint of the filesystem and roll back the filesystem to a previously taken checkpoint, as well as to write and delete user data from the checkpointing filesystem.
Checkpointing filesystem
Gara, Alan G. , Giampapa, Mark E. , Steinmacher-Burow, Burkhard D. 05/17/2005 6,895,416
A fault isolation technique for checking the accuracy of data packets transmitted between nodes of a parallel processor. An independent CRC is kept of all data sent from one processor to another, and of all data received by one processor from another. At the end of each checkpoint, the CRCs are compared. If they do not match, there was an error. The CRCs may be cleared and restarted at each checkpoint. In the preferred embodiment, the basic functionality is to calculate a CRC of all packet data that has been successfully transmitted across a given link. This CRC is done on both ends of the link, thereby allowing an independent check on all data believed to have been correctly transmitted. Preferably, all links have this CRC coverage, and the CRC used in this link level check is different from that used in the packet transfer protocol. This independent check, if successfully passed, virtually eliminates the possibility that any data errors were missed during the previous transfer period.
Fault isolation through no-overhead link level CRC
Chen, Dong , Coteus, Paul W. , Gara, Alan G. 04/24/2007 7,210,088
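The per-link running CRC compared at checkpoints can be sketched as follows. The `LinkEnd` class and the packet contents are hypothetical, and CRC-32 from Python's `zlib` is used only as a convenient stand-in; the abstract does not say which polynomial the link-level check uses.

```python
import zlib


class LinkEnd:
    """One end of a link, keeping a running CRC over all packet data it has handled."""

    def __init__(self):
        self.crc = 0

    def account(self, packet: bytes):
        # zlib.crc32 accepts the previous value, so the CRC runs across packets.
        self.crc = zlib.crc32(packet, self.crc)

    def checkpoint(self):
        # Report the accumulated CRC and restart it for the next period.
        value, self.crc = self.crc, 0
        return value


sender, receiver = LinkEnd(), LinkEnd()
for packet in [b"alpha", b"beta", b"gamma"]:
    sender.account(packet)
    receiver.account(packet)
clean = sender.checkpoint() == receiver.checkpoint()

sender.account(b"delta")
receiver.account(b"Delta")   # hypothetical corruption that the packet protocol missed
corrupted = sender.checkpoint() != receiver.checkpoint()
print(clean, corrupted)  # True True
```

Because the comparison happens only at checkpoints, the running CRC adds no per-packet traffic to the link, which is consistent with the "no-overhead" wording of the title.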
A multiprocessor, parallel computer is made tolerant to hardware failures by providing extra groups of redundant standby processors and by designing the system so that these extra groups of processors can be swapped with any group which experiences a hardware failure. This swapping can be under software control, thereby permitting the entire computer to sustain a hardware failure but, after swapping in the standby processors, to still appear to software as a pristine, fully functioning system.
Fault tolerance in a supercomputer through dynamic repartitioning
Chen, Dong , Coteus, Paul W. , Gara, Alan G. , Takken, Todd E. 02/27/2007 7,185,226
A memory system and method for providing atomic memory-based counter operations to operating systems and applications that make most efficient use of counter-backing memory and virtual and physical address space, while simplifying operating system memory management, and enabling the counter-backing memory to be used for purposes other than counter-backing storage when desired. The encoding and address decoding enabled by the invention provides all this functionality through a combination of software and hardware.
Configurable memory system and method for providing atomic counting operations in a memory device
Bellofatto, Ralph E. , Gara, Alan G. , Giampapa, Mark E. , Ohmacht, Martin 09/14/2010 7,797,503
Method and apparatus of prefetching streams of varying prefetch depth dynamically changes the depth of prefetching so that the number of multiple streams as well as the hit rate of a single stream are optimized. The method and apparatus in one aspect monitor a plurality of load requests from a processing unit for data in a prefetch buffer, determine an access pattern associated with the plurality of load requests and adjust a prefetch depth according to the access pattern.
Method and apparatus of prefetching streams of varying prefetch depth
Gara, Alan , Ohmacht, Martin , Salapura, Valentina , Sugavanam, Krishnan , Hoenicke, Dirk 01/24/2012 8,103,832
The present invention is directed to a method, system and program storage device for efficiently implementing a multidimensional Fast Fourier Transform (FFT) of a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, comprising: distributing the plurality of elements of the array in a first dimension across the plurality of nodes of the computer system over the network to facilitate a first one-dimensional FFT; performing the first one-dimensional FFT on the elements of the array distributed at each node in the first dimension; re-distributing the one-dimensional FFT-transformed elements at each node in a second dimension via "all-to-all" distribution in random order across other nodes of the computer system over the network; and performing a second one-dimensional FFT on elements of the array re-distributed at each node in the second dimension, wherein the random order facilitates efficient utilization of the network thereby efficiently implementing the multidimensional FFT. The "all-to-all" re-distribution of array elements is further efficiently implemented in applications other than the multidimensional FFT on the distributed-memory parallel supercomputer.
Efficient implementation of a multidimensional fast fourier transform on a distributed-memory parallel multi-node computer
Bhanot, Gyan V. , Chen, Dong , Gara, Alan G. , Giampapa, Mark E. , Heidelberger, Philip , Steinmacher-Burow, Burkhard D. , Vranas, Pavlos M. 01/01/2008 7,315,877
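The three steps in this abstract (transform along one dimension, redistribute, transform along the other) can be checked on a single machine with a small sketch. Everything here is an illustrative assumption: a naive O(n^2) DFT stands in for a real FFT, rows and columns stand in for nodes, and the shuffled transfer list only mimics the random ordering of the all-to-all exchange, which affects network traffic rather than the numerical result.

```python
import cmath
import random


def dft(xs):
    """Naive one-dimensional discrete Fourier transform."""
    n = len(xs)
    return [sum(x * cmath.exp(-2j * cmath.pi * j * k / n) for j, x in enumerate(xs))
            for k in range(n)]


def dft2_by_redistribution(grid, seed=0):
    rows, cols = len(grid), len(grid[0])
    # Step 1: each "node" owns one row and transforms it locally.
    stage1 = [dft(row) for row in grid]
    # Step 2: all-to-all exchange so each node owns one column; transfers go in random order.
    transfers = [(r, c) for r in range(rows) for c in range(cols)]
    random.Random(seed).shuffle(transfers)
    by_column = [[0j] * rows for _ in range(cols)]
    for r, c in transfers:
        by_column[c][r] = stage1[r][c]
    # Step 3: second one-dimensional transform along the other dimension.
    stage2 = [dft(col) for col in by_column]
    return [[stage2[c][r] for c in range(cols)] for r in range(rows)]


result = dft2_by_redistribution([[1, 2, 3], [4, 5, 6]])
print(round(result[0][0].real, 9))  # 21.0 (the sum of all elements)
```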
A method for maintaining full performance of a file system in the presence of a failure is provided. The file system having N storage devices, where N is an integer greater than zero and N primary file servers where each file server is operatively connected to a corresponding storage device for accessing files therein. The file system further having a secondary file server operatively connected to at least one of the N storage devices. The method including: switching the connection of one of the N storage devices to the secondary file server upon a failure of one of the N primary file servers; and switching the connections of one or more of the remaining storage devices to a primary file server other than the failed file server as necessary so as to prevent a loss in performance and to provide each storage device with an operating file server.
Twin-tailed fail-over for fileservers maintaining full performance in the presence of a failure
Coteus, Paul W. , Gara, Alan G. , Giampapa, Mark E. , Heidelberger, Philip , Steinmacher-Burow, Burkhard D. 02/12/2008 7,330,996
A parallel computer system is constructed as a network of interconnected compute nodes to operate a global message-passing application for performing communications across the network. Each of the compute nodes includes one or more individual processors with memories which run local instances of the global message-passing application operating at each compute node to carry out local processing operations independent of processing operations carried out at other compute nodes. Each compute node also includes a DMA engine constructed to interact with the application via Injection FIFO Metadata describing multiple Injection FIFOs where each Injection FIFO may contain an arbitrary number of message descriptors in order to process messages with a fixed processing overhead irrespective of the number of message descriptors included in the Injection FIFO.
DMA engine for repeating communication patterns
Chen, Dong , Gara, Alan G. , Giampapa, Mark E. , Heidelberger, Philip , Steinmacher-Burow, Burkhard , Vranas, Pavlos 09/21/2010 7,802,025
A programmable memory system and method for enabling one or more processor devices to access shared memory in a computing environment, the shared memory including one or more memory storage structures having addressable locations for storing data. The system comprises: one or more first logic devices associated with a respective one or more processor devices, each first logic device for receiving physical memory address signals and programmable for generating a respective memory storage structure select signal upon receipt of pre-determined address bit values at selected physical memory address bit locations; and, a second logic device responsive to each of the respective select signals for generating an address signal used for selecting a memory storage structure for processor access. The system thus enables each processor device of a computing environment to access memory storage distributed across the one or more memory storage structures.
System and method for programmable bank selection for banked memory subsystems
Blumrich, Matthias A. , Chen, Dong , Gara, Alan G. , Giampapa, Mark E. , Hoenicke, Dirk , Ohmacht, Martin , Salapura, Valentina , Sugavanam, Krishnan 09/07/2010 7,793,038
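Selecting a bank from programmable physical address bits can be sketched briefly. The function name and the choice of bits 7 and 8 (which would interleave 128-byte lines across four banks) are assumptions for illustration and are not taken from the patent.

```python
def make_bank_selector(bit_positions):
    """Return a function that builds a bank index from the chosen address bits."""
    def select(address):
        bank = 0
        for i, pos in enumerate(bit_positions):
            bank |= ((address >> pos) & 1) << i   # copy address bit `pos` into index bit i
        return bank
    return select


select = make_bank_selector([7, 8])   # hypothetical programming of the selection logic
banks = [select(a) for a in (0x000, 0x080, 0x100, 0x180, 0x200)]
print(banks)  # [0, 1, 2, 3, 0]
```

Reprogramming the bit positions changes how consecutive addresses spread across banks without changing the addresses that software uses.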
The present invention is directed to a method, system and program storage device for efficiently implementing a multidimensional Fast Fourier Transform (FFT) of a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, comprising: distributing the plurality of elements of the array in a first dimension across the plurality of nodes of the computer system over the network to facilitate a first one-dimensional FFT; performing the first one-dimensional FFT on the elements of the array distributed at each node in the first dimension; re-distributing the one-dimensional FFT-transformed elements at each node in a second dimension via "all-to-all" distribution in random order across other nodes of the computer system over the network; and performing a second one-dimensional FFT on elements of the array re-distributed at each node in the second dimension, wherein the random order facilitates efficient utilization of the network thereby efficiently implementing the multidimensional FFT. The "all-to-all" re-distribution of array elements is further efficiently implemented in applications other than the multidimensional FFT on the distributed-memory parallel supercomputer.
Efficient implementation of multidimensional fast fourier transform on a distributed-memory parallel multi-node computer
Bhanot, Gyan V. , Chen, Dong , Gara, Alan G. , Giampapa, Mark E. , Heidelberger, Philip , Steinmacher-Burow, Burkhard D. , Vranas, Pavlos M. 01/10/2012 8,095,585
A list prefetch engine improves a performance of a parallel computing system. The list prefetch engine receives a current cache miss address. The list prefetch engine evaluates whether the current cache miss address is valid. If the current cache miss address is valid, the list prefetch engine compares the current cache miss address and a list address. A list address represents an address in a list. A list describes an arbitrary sequence of prior cache miss addresses. The prefetch engine prefetches data according to the list, if there is a match between the current cache miss address and the list address.
List based prefetch
Boyle, Peter , Christ, Norman , Gara, Alan , Kim, Changhoan , Mawhinney, Robert , Ohmacht, Martin , Sugavanam, Krishnan 08/28/2012 8,255,633
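The match-then-prefetch behaviour in this abstract can be sketched as follows. The class name, the forward-only search from the last matched position, the use of `None` for an invalid miss address, and the prefetch depth of 2 are all assumptions; the abstract specifies only that a valid miss address is compared against the list and that a match triggers prefetching according to the list.

```python
class ListPrefetcher:
    def __init__(self, recorded_misses, depth=2):
        self.list = list(recorded_misses)   # arbitrary sequence of prior miss addresses
        self.depth = depth                  # assumed number of entries prefetched per match
        self.pos = 0                        # position in the list reached so far

    def on_miss(self, address):
        if address is None:                 # stand-in for an invalid miss address
            return []
        try:
            hit = self.list.index(address, self.pos)
        except ValueError:
            return []                       # no match: nothing is prefetched
        self.pos = hit + 1
        return self.list[self.pos:self.pos + self.depth]


prefetcher = ListPrefetcher([0x10, 0x80, 0x24, 0x58], depth=2)
first = prefetcher.on_miss(0x10)
unmatched = prefetcher.on_miss(0x99)
second = prefetcher.on_miss(0x24)
print(first, unmatched, second)  # [128, 36] [] [88]
```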
In a massively parallel computing system having a plurality of nodes configured in m multi-dimensions, each node including a computing device, a method for routing packets towards their destination nodes is provided which includes generating at least one of a 2m plurality of compact bit vectors containing information derived from downstream nodes. A multilevel arbitration process in which downstream information stored in the compact vectors, such as link status information and fullness of downstream buffers, is used to determine a preferred direction and virtual channel for packet transmission. Preferred direction ranges are encoded and virtual channels are selected by examining the plurality of compact bit vectors. This dynamic routing method eliminates the necessity of routing tables, thus enhancing scalability of the switch.
Optimized scalable network switch
Blumrich, Matthias A. , Chen, Dong , Coteus, Paul W. , Gara, Alan G. , Giampapa, Mark E. , Heidelberger, Philip , Steinmacher-Burow, Burkhard D. , Takken, Todd E. , Vranas, Pavlos M. 12/04/2007 7,305,487
Methods and systems for performing arithmetic functions. In accordance with a first aspect of the invention, methods and apparatus are provided, working in conjunction with software algorithms and hardware implementation of class network routing, to achieve a very significant reduction in the time required for global arithmetic operation on the torus. Therefore, it leads to greater scalability of applications running on large parallel machines. The invention involves three steps in improving the efficiency and accuracy of global operations: (1) Ensuring, when necessary, that all the nodes do the global operation on the data in the same order and so obtain a unique answer, independent of roundoff error; (2) Using the topology of the torus to minimize the number of hops and the bidirectional capabilities of the network to reduce the number of time steps in the data transfer operation to an absolute minimum; and (3) Using class function routing to reduce latency in the data transfer. With the method of this invention, every single element is injected into the network only once and it will be stored and forwarded without any further software overhead. In accordance with a second aspect of the invention, methods and systems are provided to efficiently implement global arithmetic operations on a network that supports the global combining operations. The latency of doing such global operations is greatly reduced by using these methods.
Arithmetic functions in torus and tree networks
Bhanot, Gyan , Blumrich, Matthias A. , Chen, Dong , Gara, Alan G. , Giampapa, Mark E. , Heidelberger, Philip , Steinmacher-Burow, Burkhard D. , Vranas, Pavlos M. 12/25/2007 7,313,582
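Step (1) of this abstract, having every node combine the data in the same order, matters because floating-point addition is not associative. The sketch below shows two nodes receiving the same contributions in different arrival orders; ordering by source node id before summing is an assumed convention, chosen here only to make the fix concrete.

```python
def naive_sum(arrivals):
    """Add contributions in whatever order they arrived."""
    total = 0.0
    for _node_id, value in arrivals:
        total += value
    return total


def ordered_sum(arrivals):
    """Add contributions in a fixed order (by source node id) on every node."""
    total = 0.0
    for _node_id, value in sorted(arrivals):
        total += value
    return total


# Same three contributions, different arrival order at two nodes.
arrivals_at_a = [(0, 1e16), (1, 1.0), (2, -1e16)]
arrivals_at_b = [(0, 1e16), (2, -1e16), (1, 1.0)]

print(naive_sum(arrivals_at_a), naive_sum(arrivals_at_b))      # 0.0 1.0
print(ordered_sum(arrivals_at_a), ordered_sum(arrivals_at_b))  # 0.0 0.0
```

The ordered result is not more accurate than the naive one; it is only identical on every node, which is the property the abstract asks for.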
A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processor only performs a read operation and the hardware locking device performs a subsequent write operation rather than the processor. A simple prefetching for non-contiguous data structures is also disclosed. A memory line is redefined so that in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers are used to determine which memory line to prefetch rather than some other predictive algorithm. This enables hardware to effectively prefetch memory access patterns that are non-contiguous, but repetitive.
Low latency memory access and synchronization
Blumrich, Matthias A. , Chen, Dong , Coteus, Paul W. , Gara, Alan G. , Giampapa, Mark E. , Heidelberger, Philip , Hoenicke, Dirk , Ohmacht, Martin , Steinmacher-Burow, Burkhard D. , Takken, Todd E. , Vranas, Pavlos M. 02/06/2007 7,174,434
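The single-load lock acquisition described in this abstract can be modelled in a few lines. The `LockDevice` class is hypothetical, and the `release` method is an assumption added so the example is complete; the abstract describes only the acquisition path, where the processor issues one read and the device itself performs the write that marks the lock as held.

```python
class LockDevice:
    """Software model of a hardware locking device with one lock per shared resource."""

    def __init__(self, n_locks):
        self.held = [False] * n_locks

    def load(self, lock_id):
        """A single read: returns True if the reading processor now owns the lock."""
        if self.held[lock_id]:
            return False
        self.held[lock_id] = True   # the write is done by the device, not the processor
        return True

    def release(self, lock_id):
        # Assumed release path; not described in the abstract.
        self.held[lock_id] = False


device = LockDevice(n_locks=4)
first_try = device.load(2)     # processor A reads and acquires the lock
second_try = device.load(2)    # processor B reads and is refused
device.release(2)
third_try = device.load(2)     # lock is free again
print(first_try, second_try, third_try)  # True False True
```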
A data capture technique for high speed signaling to allow for optimal sampling of an asynchronous data stream. This technique allows for extremely high data rates and does not require that a clock be sent with the data as is done in source synchronous systems. The present invention also provides a hardware mechanism for automatically adjusting transmission delays for optimal two-bit simultaneous bi-directional (SiBiDi) signaling.
Data Capture Technique for High Speed Signaling
Barrett, Wayne Melvin , Chen, Dong , Coteus, Paul William , Gara, Alan Gene , Jackson, Rory , Kopcsay, Gerard Vincent , Nathanson, Ben Jesse , Vranas, Paylos Michael , Takken, Todd E. 08/26/2008 7,418,068
A system and method for generating global asynchronous signals in a computing structure. Particularly, a global interrupt and barrier network is implemented that implements logic for generating global interrupt and barrier signals for controlling global asynchronous operations performed by processing elements at selected processing nodes of a computing structure in accordance with a processing algorithm; and includes the physical interconnecting of the processing nodes for communicating the global interrupt and barrier signals to the elements via low-latency paths. The global asynchronous signals respectively initiate interrupt and barrier operations at the processing nodes at times selected for optimizing performance of the processing algorithms. In one embodiment, the global interrupt and barrier network is implemented in a scalable, massively parallel supercomputing device structure comprising a plurality of processing nodes interconnected by multiple independent networks, with each node including one or more processing elements for performing computation or communication activity as required when performing parallel algorithm operations. One of the multiple independent networks includes a global tree network for enabling high-speed global tree communications among global tree network nodes or sub-trees thereof. The global interrupt and barrier network may operate in parallel with the global tree network for providing global asynchronous sideband signals.
Global interrupt and barrier networks
Blumrich, Matthias A. , Chen, Dong , Coteus, Paul W. , Gara, Alan G. , Giampapa, Mark E , Heidelberger, Philip , Kopcsay, Gerard V. , Steinmacher-Burow, Burkhard D. , Takken, Todd E. 10/28/2008 7,444,385
Methods and apparatus perform fault isolation in multiple node computing systems using commutative error detection values (for example, checksums) to identify and to isolate faulty nodes. When information associated with a reproducible portion of a computer program is injected into a network by a node, a commutative error detection value is calculated. At intervals, node fault detection apparatus associated with the multiple node computer system retrieves commutative error detection values associated with the node and stores them in memory. When the computer program is executed again by the multiple node computer system, new commutative error detection values are created and stored in memory. The node fault detection apparatus identifies faulty nodes by comparing commutative error detection values associated with reproducible portions of the application program generated by a particular node from different runs of the application program. Differences in values indicate a possible faulty node.
Methods and apparatus using commutative error detection values for fault isolation in multiple node computers
Almasi, Gheorghe , Blumrich, Matthias Augustin , Chen, Dong , Coteus, Paul , Gara, Alan , Giampapa, Mark E. , Heidelberger, Philip , Hoenicke, Dirk I. , Singh, Sarabjeet , Steinmacher-Burow, Burkhard D. , Takken, Todd , Vranas, Pavlos 06/03/2008 7,383,490
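A commutative error detection value gives the same result regardless of the order in which packets were injected, so two runs of a reproducible program can be compared even if packet ordering differs. The sketch below uses a sum modulo 2^32 as one possible commutative checksum; the function name, the modulus, and the packet contents are assumptions for illustration.

```python
def commutative_checksum(packets, modulus=2**32):
    """Order-independent checksum: sum of packet values modulo `modulus`."""
    total = 0
    for packet in packets:
        total = (total + int.from_bytes(packet, "big")) % modulus
    return total


run_1 = [b"\x00\x01", b"\x00\x02", b"\x00\x03"]
run_2 = [b"\x00\x03", b"\x00\x01", b"\x00\x02"]    # same packets, different injection order
faulty = [b"\x00\x01", b"\x00\x02", b"\x00\x07"]   # hypothetical node producing a wrong value

same_node_ok = commutative_checksum(run_1) == commutative_checksum(run_2)
fault_flagged = commutative_checksum(run_1) != commutative_checksum(faulty)
print(same_node_ok, fault_flagged)  # True True
```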
A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processor only performs a read operation and the hardware locking device performs a subsequent write operation rather than the processor. A simple prefetching for non-contiguous data structures is also disclosed. A memory line is redefined so that in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers are used to determine which memory line to prefetch rather than some other predictive algorithm. This enables hardware to effectively prefetch memory access patterns that are non-contiguous, but repetitive.
Method for prefetching non-contiguous data structures
Blumrich, Matthias A. , Chen, Dong , Coteus, Paul W. , Gara, Alan G. , Giampapa, Mark E. , Heidelberger, Philip , Hoenicke, Dirk , Ohmacht, Martin , Steinmacher-Burow, Burkhard D. , Takken, Todd E. , Vranas, Pavlos M. 05/05/2009 7,529,895
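The pointer-per-line prefetch in this abstract can be modelled with a toy memory in which each line stores its data plus the address of the line to fetch next. The class name, the addresses, and the unbounded cache are hypothetical simplifications; the sketch only shows that a repetitive, non-contiguous walk misses once and then hits on every prefetched line.

```python
class PointerPrefetchMemory:
    def __init__(self, lines):
        self.lines = lines     # address -> (data, pointer to the line to prefetch, or None)
        self.cache = set()     # simplified cache: holds every line fetched so far
        self.misses = 0

    def read(self, address):
        if address not in self.cache:
            self.misses += 1
            self.cache.add(address)
        data, pointer = self.lines[address]
        if pointer is not None:
            self.cache.add(pointer)   # hardware follows the stored pointer and prefetches
        return data


memory = PointerPrefetchMemory({
    0x00: ("A", 0x40),
    0x40: ("B", 0x10),
    0x10: ("C", None),
})
visited = [memory.read(a) for a in (0x00, 0x40, 0x10)]
print(visited, memory.misses)  # ['A', 'B', 'C'] 1
```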
A method and apparatus for managing coherence between two processors of a two-processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activities required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
Managing coherence via put/get windows
Blumrich, Matthias A. , Chen, Dong , Coteus, Paul W. , Gara, Alan G. , Giampapa, Mark E. , Heidelberger, Philip , Hoenicke, Dirk , Ohmacht, Martin 01/11/2011 7,870,343