News | May 18, 1999

Shared-Memory Computing Architectures for Real-Time Simulation: Simplicity and Elegance

Source: Systran Corporation

Connecting multiple computers to solve a single real-time problem is a challenging engineering task. An aircraft simulator is one of the most demanding of these real-time applications. This paper addresses some of the unique requirements of distributed computing architectures for simulators and summarizes the two major approaches that have been used to date. A new approach -- shared-memory networking -- is then introduced and compared with the traditional approaches. Finally, the design and performance parameters of a new comprehensive shared-memory network implementation are presented, along with implementation considerations.

Computational systems for the real-time aircraft simulation domain have always presented some of the most demanding requirements of all real-time applications. Algorithm computational loads are extremely high due to the complexity of today's aircraft subsystems and due to the high speed avionics which must be simulated. Added to this are the ever increasing performance envelopes of modern aircraft platforms. This increasing performance compounds the computational tasks by increasing the iteration rates on many models. Furthermore, many of these high-rate models are the more complex ones and most difficult to compute.

Aircraft simulators can be loosely classified into two categories -- "simple" simulators and "complex" simulators. Simple simulators are those where only one computer is needed for the computational load. Here the "system architect's" job is rather straightforward. All the system inputs, outputs and all inter-task data reside in one computer, and there are no inter-task communication problems. Unfortunately, most aircraft simulators do not "fit" within this "simple" simulator category. Instead, most require multiple Central Processing Units (CPUs) to handle the massive algorithm loading and to take advantage of functionally specialized computers. To obtain the computational power needed, simulator system architects have turned to distributed computing systems composed of numerous CPUs -- all connected with various real-time communication devices. This approach has permitted the system architect to add computer power in "chunks" as needed, and to mix and match specialized computers that best suit the particular computational problem. For example, high-speed CPUs can be used for algorithmic modules and high-speed graphics engines for visuals.

However, these distributed computing systems have presented some very challenging problems in themselves. Probably the most demanding of these has been how to interconnect these distributed computers in a fashion such that inter-processor communications do not adversely affect the computers themselves. Various schemes have been used, and each of these has its requisite strengths and weaknesses.

Traditional Connection Approaches

Basically two approaches have been used -- physical shared-memory architectures and message-passing networks. The physical shared-memory architecture utilizes a high-speed parallel memory bus through which all computers can access a common-memory module (a variant of this structure utilizes a multiported memory in lieu of the common bus).

This architecture has offered the requisite data communications speed along with other benefits:

  • virtually no software overhead is needed for data communication,
  • fast system reconfiguration,
  • fast error recovery,
  • and "tight" control over system resources.

However, the physical shared-memory architecture has its limitations. Specifically:

  • memory bus contention,
  • limited physical separation distance between computers (e.g., 50 feet),
  • typically must use a single vendor's hardware, and
  • can only connect a few computers (e.g., 5 to 10).

The second traditional approach used has been the message-passing local area network (LAN). This architecture uses a serial communication link through which all computers pass message packages. This LAN approach has helped system architects overcome several of the limitations of the physical shared-memory approach. Specifically, LANs:

  • allow for the integration of different vendors' CPU hardware,
  • allow for a much larger number of CPUs (e.g., tens or hundreds), and
  • permit the physical separation of CPUs to distances of hundreds or thousands of feet.

These LAN advantages are of great importance in some simulation applications.

However, the use of LANs has given the system architect some severe system design constraints, and these are intolerable in many simulator applications. Some of the major LAN problems are:

  • data communications are several orders-of-magnitude slower than for physical shared-memory approaches,
  • large software overheads are encountered, thus "eating up" valuable CPU time,
  • interrupting other computers and maintaining "tight" control of the distributed computer system is extremely difficult over a message-passing LAN, and
  • error recovery and system reconfiguration are difficult and time-consuming.

To summarize, the physical shared-memory and message-passing LAN approaches each have their strengths and weaknesses for aircraft simulators. In an area where one approach is good, the other tends to be poor, and vice versa.

Replicated Shared-Memory Network Alternative

The two traditional communication approaches are complementary.

What is really needed is a third alternative for the real-time system architect to use -- one which possesses the combined strengths of the two traditional approaches. This alternative would possess the favorable attributes of physical shared-memory -- data communication speed within microseconds, tight system control by passing interrupts in microseconds, virtually zero software overhead to pass data or control, recovery from communication errors within microseconds -- and also the favorable attributes of the message-passing network -- ability to integrate different computer vendors' CPUs, connection of tens (or even hundreds) of computers into one coordinated simulation system, and the ability to physically separate the computers into separate areas or buildings which may be hundreds of feet (or even miles) apart.

A new communication technology has been recently developed which does precisely this. It offers the strengths of both the physical shared-memory architecture and the message-passing network, yet it retains none of the inherent disadvantages of either. This approach -- termed replicated shared-memory networking -- appears to the software designer as if he is using a physical shared-memory system, yet it is actually a serial-ring network using replicated memory modules at each node.

Recent advances in several technologies have made this new approach feasible. Primarily these are the use of serial fiber optic links operating at very high data rates, and the use of extremely fast and dense electronic parts which enable the compact designs needed.

A replicated shared-memory network is a rather simple concept, and much of its elegance and power derive from this simplicity. This design operates by placing a network card in each computer to be connected -- just like a message-passing network. The cards then communicate over a serial ring architecture. However, this is where the similarity to message-passing networks stops.

Replicated shared-memory networks do not pack multiple data words into messages, attach lengthy protocol information, or pass these complex messages with cumbersome hardware and software protocols. Nor do they use the seven-layer ISO (OSI) protocol stack. They thereby obviate the need for the cumbersome, time-consuming, and non-deterministic protocols found in message-passing networks.

In a replicated shared-memory network, a datum is communicated very quickly and directly -- the transmitting CPU simply makes a high-level language statement: A = B, where the variable A is located in the shared-memory window of the network card. Within microseconds all other CPUs on the network have the same value of the variable A in their shared-memory areas. The network card does this immediately and automatically by passing both the datum and the datum address over the network. The variable is written to the same relative shared-memory address in each computer's shared-memory area by the receiving network cards.
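The replication step described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the real work is done by network card hardware, and the `Node` class, the `write` function, and the address `0x0040` are hypothetical names chosen for illustration.

```python
class Node:
    """One host plus its network card, modeled as address -> word storage."""

    def __init__(self, name):
        self.name = name
        self.shared = {}  # this node's copy of the shared-memory window


def write(ring, origin, address, value):
    """Model a CPU write at ring[origin] and its single trip around the ring.

    The (address, value) pair visits every other node in ring order, and
    each node stores the value at the same relative address.
    Returns the number of nodes the message was delivered to.
    """
    ring[origin].shared[address] = value
    delivered = 0
    n = len(ring)
    for step in range(1, n):
        ring[(origin + step) % n].shared[address] = value
        delivered += 1
    return delivered


ring = [Node(f"cpu{i}") for i in range(10)]
A = 0x0040  # hypothetical offset of variable A inside the shared window
B = 1234
delivered = write(ring, origin=3, address=A, value=B)  # the "A = B" statement
```

After the call, every node in the model holds the same value at offset `A`, which is the property the application programmer relies on.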

Control -- another critical performance requirement of simulators -- can be passed over a replicated shared-memory network by the same mechanisms as data are. Interrupts can be transmitted to all (or selected) CPUs within microseconds. For example, on a ten node network all ten CPUs can be interrupted in less than nine microseconds. Likewise, a new datum value can be transmitted to all ten CPUs in the same maximum of nine microseconds. Moreover, the typical time for a data or interrupt message to traverse a ten node ring is around four microseconds.

This replicated shared-memory network approach also provides a very deterministic performance envelope for the system designer. Control and data communications take place in a few microseconds for an upper-bound, and there are no non-deterministic software routines which must be invoked to complicate this process. Also, using the ring network with short, fixed-length transmissions, there are no "unknowns" in the transmission link transport times.

In short, the replicated shared-memory network offers a new "tool" to the simulator architect. This tool is one where all the advantages of a physical shared-memory approach can be combined with the best features of traditional LANs. The result is a tool specifically optimized for the distributed real-time computing domain.

THE SCRAMNet NETWORK IMPLEMENTATION

Overview

The Shared Common Random Access Memory Network (SCRAMNet) was the first commercial implementation of a truly general-purpose replicated shared-memory LAN. The SCRAMNet family of products was developed following the implementation of "point-designs" for the specialized needs of several real-time projects. The outstanding overall system performance and ease of implementation of these specialized designs led SYSTRAN to develop the SCRAMNet Classic product line and introduce it to the real-time community in September 1989.

The SCRAMNet product line, containing a full range of real-time functionality and features, was developed to satisfy the needs of a broad segment of the real-time community. Never intended to replace the general-purpose message-passing LANs in the non-real-time market, SCRAMNet is indeed a general-purpose LAN, completely focused on and optimized for the real-time domain. Thus, its real-time performance is clearly unmatched by other approaches.

Today, with hundreds of successful applications and thousands of nodes in the field, SCRAMNet products have become the de facto standard in real-time shared-memory networking. Their performance, feature set, and reliability remain unmatched in the commercial marketplace. In addition to simulation, they have found wide application in many other types of real-time systems, including virtual reality, data acquisition, instrumentation and control, and telemetry.

SYSTRAN -- having been active in the implementation of replicated shared-memory networking technology for ten years, having implemented two very successful point designs, having developed the commercially successful SCRAMNet Classic product -- was uniquely positioned to further refine and advance the state of the art in replicated shared-memory networking.

SYSTRAN's new ASIC based, SCRAMNet+ Network design, developed from this historical perspective, was released in April 1995. SCRAMNet+ is compatible with SCRAMNet Classic, yet it is faster, smaller, less expensive, and expands on the feature set available to the real-time architect.

The rest of this paper highlights some of the many key real-time features and functions of SCRAMNet+, the newest member of SYSTRAN Corp.'s replicated shared-memory network line...the real-time solution of the 90's.

Current Design

Physical Layer: The SCRAMNet+ Network utilizes a ring topology which offers an excellent combination of networking characteristics that uniquely enhance its networking mission. The ring topology is easy to implement and add to, and it provides for a simple, efficient media access.

The SCRAMNet+ and SCRAMNet Classic networks use a register insertion technique for media access control. The register insertion media access protocol allows all network nodes to have messages outstanding on the ring at the same time. With this protocol, each node on the ring contains an insertion register of sufficient size to temporarily buffer circulating message packets while it is transmitting its own data. Once the node has completed its own data transmission, it empties the insertion register and begins passing the circulating message packets.
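A toy model may help make the register insertion idea concrete. It is a sketch of the general technique only, not of the SCRAMNet+ hardware: the `InsertionNode` class and its method names are hypothetical, and packets are represented by plain strings.

```python
from collections import deque


class InsertionNode:
    """Toy model of one ring node using register insertion."""

    def __init__(self):
        self.insertion_register = deque()  # buffers circulating packets
        self.transmitting = False

    def receive(self, packet):
        """Handle a circulating packet; return what goes downstream now."""
        if self.transmitting:
            self.insertion_register.append(packet)  # hold it for later
            return []
        return [packet]  # pass straight through

    def start_transmit(self, own_packet):
        """Begin sending this node's own data onto the ring."""
        self.transmitting = True
        return [own_packet]

    def finish_transmit(self):
        """Own data sent: empty the register in arrival order."""
        self.transmitting = False
        drained = list(self.insertion_register)
        self.insertion_register.clear()
        return drained


node = InsertionNode()
downstream = []
downstream += node.receive("p1")             # idle: forwarded at once
downstream += node.start_transmit("own")     # node inserts its own packet
downstream += node.receive("p2")             # buffered while transmitting
downstream += node.receive("p3")             # buffered while transmitting
downstream += node.finish_transmit()         # register emptied in order
```

The point of the scheme is visible in the result: no circulating packet is lost or reordered while the node sends its own data, so every node can have messages outstanding on the ring at once.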

The network protocols employed by SCRAMNet+ are BURST, BURST+, PLATINUM and PLATINUM+ modes of operation. BURST allows each node the opportunity to place multiple fixed length messages simultaneously onto the network in a send and forget mode while maintaining error correction at the receiving nodes. BURST+ utilizes the same protocol as BURST but allows for variable length packets. PLATINUM mode provides fully automatic re-transmission of packets in the case of bit-errors or ring timeout while allowing each node to place multiple packets on the ring simultaneously. PLATINUM+ works like PLATINUM except it provides for variable length packets. Table 4 shows a network protocol comparison.

SCRAMNet+ has the capability to increase its network data bandwidth through the use of variable length data packets. With SCRAMNet Classic, data packets had a fixed size of 4 bytes per transmission, thus providing a maximum throughput of 6.5 MBytes/sec. SCRAMNet+ allows packet length to vary from 4 bytes up to 1 KByte, allowing for a maximum throughput of 16.7 MBytes/sec.
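The gain from variable-length packets can be estimated with a back-of-envelope calculation. The sketch below uses the 82-bit message length that this paper quotes for a 4-byte packet, and it assumes (this is an assumption, not a statement from the source) that the 50 non-payload bits are a fixed per-packet overhead that does not grow with packet length. The function names are hypothetical.

```python
MESSAGE_BITS = 82            # message length quoted for a 4-byte packet
PAYLOAD_BITS_4B = 32         # one 32-bit data word
CLASSIC_MBYTES_PER_S = 6.5   # quoted SCRAMNet Classic maximum throughput

# Assumption: every bit that is not payload is fixed per-packet overhead.
OVERHEAD_BITS = MESSAGE_BITS - PAYLOAD_BITS_4B


def payload_efficiency(payload_bytes):
    """Fraction of transmitted bits that carry application data."""
    bits = 8 * payload_bytes
    return bits / (bits + OVERHEAD_BITS)


def estimated_throughput(payload_bytes):
    """Scale the quoted 4-byte figure by the change in payload efficiency."""
    scale = payload_efficiency(payload_bytes) / payload_efficiency(4)
    return CLASSIC_MBYTES_PER_S * scale


estimate_1k = estimated_throughput(1024)  # MBytes/sec for 1 KByte packets
```

Under these assumptions the estimate for 1 KByte packets comes out between 16 and 17 MBytes/sec, which is in line with the 16.7 MBytes/sec figure quoted above.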

The SCRAMNet+ Network design utilizes a fiber optic transmission approach as an integral part of the basic design. This design approach has permitted both the speed and simplicity of the system to be optimized to the point that would not have been possible otherwise. This was an important design philosophy and has significantly helped the network's performance.

The waveguide used is a 62.5/125 um specification, using a dual fiber link. Although the 62.5/125 um cable is specified for SCRAMNet+ Network use, it will operate with a range of cable specifications. This includes the full range of FDDI specifications, both "allowed" and "preferred" cables. The particular cable can be optimized for the specific application environment and customers' needs. Using the 62.5/125 um cable, a maximum node separation of 300 meters is permitted within the allowed SCRAMNet+ Network attenuation/bandwidth budgets. For applications requiring distances of greater than 300 meters (up to 3,500 meters), optional 1300 nm LEDs are available.

In addition to fiber optic cabling, SCRAMNet+ can also employ coaxial cable as a transmission medium. This medium is useful for connecting nodes that are a short distance from each other (within 30 meters) in areas where EM interference is not a problem.

It is important to note that the different transmission media available on SCRAMNet+ are linked to the network card via a media access card (MAC) which plugs onto the main board. This allows a SCRAMNet+ card to be upgraded to fiber optic or converted to coaxial cable by changing only the MAC.

For applications where not all computers are powered-up for system operation, an optional fiber optic bypass switch is available. This switch is normally closed and will guarantee ring integrity when the computer is powered-down or when it chooses to eliminate itself from ring traffic. A bit in the node Control & Status Register (CSR) is used to allow the software control program to include/exclude itself from ring participation by operating the switch. This switch is recommended for some applications, but not needed for others.

The ring transmission methodology is a very novel and efficient protocol specifically tailored to replicated shared-memory communications. By concentrating only on replicated shared-memory communications with the design, the network has high throughput and low data latency characteristics far beyond those otherwise possible.

Data Communications: Communicating data over the SCRAMNet+ ring is both simple and fast. The software system architect needs only to configure the system-wide common data as a contiguous dataset, just as he would a FORTRAN COMMON, JOVIAL COMPOOL, or Ada Package. Then, this common dataset is included in each application program compilation unit, and its starting address is linked to begin at the first address of the shared-memory area. Once this is accomplished, data communications are both fast and automatic.

Each time a common variable is updated (that is, one located in the shared-memory window), the following process takes place. The CPU "writes" a dataword (32 bits) to a shared-memory location (that is, a memory location on the SCRAMNet+ card) by simply executing a high-level language assignment statement, such as ALTITUDE = CURRENT_ALT. The SCRAMNet+ memory "looks" to the CPU like all other memory on this CPU memory bus. However, when the variable ALTITUDE has been linked to the shared-memory area, this memory "write" by the CPU results in automatic and "user-transparent" activity by the SCRAMNet+ Network electronics. These electronics continuously monitor all "writes" to this shared-memory region, and each such "write" results in a single and automatic ring transmission. This ring transmission consists primarily of the memory word address and word contents. Circling the network ring, this 82-bit message is received and re-transmitted by each node on the ring, and each node places the memory word contents in its SCRAMNet+ memory at the same relative physical address as the other nodes (at the relative memory address passed with the memory word contents).
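The idea of one common dataset overlaid on the start of the shared-memory area can be sketched in a few lines. This is an illustration only: a `bytearray` stands in for the card's memory window, and the `SimCommon` layout, its field names other than ALTITUDE, and the value of `CURRENT_ALT` are hypothetical.

```python
import ctypes


class SimCommon(ctypes.Structure):
    """Hypothetical system-wide common dataset of 32-bit words."""
    _fields_ = [
        ("ALTITUDE", ctypes.c_int32),
        ("AIRSPEED", ctypes.c_int32),
        ("HEADING", ctypes.c_int32),
    ]


# Stand-in for the shared-memory window on one host.
window = bytearray(ctypes.sizeof(SimCommon))

# Two application modules include the same layout, and both overlay it
# at the first address of the shared-memory area.
flight_model = SimCommon.from_buffer(window)
instruments = SimCommon.from_buffer(window)

CURRENT_ALT = 12000
flight_model.ALTITUDE = CURRENT_ALT  # an ordinary assignment statement

altitude_seen = instruments.ALTITUDE  # same word, read by another module
```

Because every module uses the same layout from the same starting address, each variable sits at the same relative offset everywhere, which is what lets the network identify a word by its address alone.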

Simplicity and speed are the key features of this technology. Using this replicated shared-memory scheme allows the application programmer to "view" the multi-computer system as one "virtual" multi-processor, where each CPU shares a common-memory module. This simplified software architecture is the source of many of the strengths offered by replicated shared-memory networking.

Error Correction: Under the PLATINUM protocols, SCRAMNet+ error detection and correction is exactly what real-time applications require for data integrity. Combining parity bit checking with a "source message packet removal and compare" error detection and correction scheme, SCRAMNet+ ensures that data and interrupts arrive error free and in the proper order. It is fast, efficient and user transparent.

In the PLATINUM modes, the source or originating node (as identified by the ID field) is responsible for removing returning packets and comparing them with the packet retained in the insertion buffer. If no errors are detected, then the packet is removed from the network and the insertion buffer. However, in the event of errors, all or a portion of the data packet is re-transmitted from the insertion buffer. If the packet does not return in a pre-determined time frame, a "watch dog timer" is triggered which also causes re-transmission of the entire packet.
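The remove-and-compare logic described above can be sketched in software, although on SCRAMNet+ it is carried out by the hardware. The function `send_with_compare`, the `ring_trip` callback, and the `max_attempts` limit are assumptions made for this illustration, and the sketch always re-transmits the whole packet rather than a portion of it.

```python
def send_with_compare(packet, ring_trip, max_attempts=5):
    """Send `packet` around the ring until the returning copy matches.

    `ring_trip(packet)` models one trip around the ring. It returns the
    packet as it arrives back at the source, or None if the packet never
    returns (the watchdog-timer case).
    Returns the number of transmissions that were needed.
    """
    retained = bytes(packet)  # copy retained at the source node
    for attempt in range(1, max_attempts + 1):
        returned = ring_trip(retained)
        if returned is None:
            continue          # watchdog expired: re-transmit
        if returned == retained:
            return attempt    # intact: packet is removed from the ring
        # mismatch detected: loop again and re-transmit
    raise RuntimeError("packet not delivered intact")


# A hypothetical ring that loses the first copy, corrupts the second,
# and returns the third intact.
outcomes = iter([None, b"\x00\x00\x07\xd1", b"\x00\x00\x07\xd0"])
attempts = send_with_compare(b"\x00\x00\x07\xd0", lambda p: next(outcomes))
```

In this run three transmissions are needed: one lost to the timeout, one rejected by the compare, and one accepted.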

Synchronization and Control Communications: In many real-time applications, "positive" real-time synchronization and control over the distributed computers is a major requirement. This is particularly important where asynchronous events and external stimuli dictate this type of system operation. It is also a key requirement where synchronization is needed -- e.g., in synchronizing models to clock frames.

Hardware interrupts are the key tool of the system architect when handling these system designs. Interrupts must be communicated among computers efficiently, fast, and "deterministically" (that is, with a small variability in the communication delay time).

To satisfy these requirements the SCRAMNet+ Network provides a very flexible and powerful method of controlling interrupts over the network link. Interrupts are passed as data words -- with the exact same speed and efficiency as data are.

To effect this transmission of an interrupt, the application programmer merely makes a high-level language statement in a manner analogous to communicating data -- e.g., MISSILE_LAUNCH = GO. If the SCRAMNet+ Auxiliary Control RAM (ACR) bit has been previously set for the shared-memory word where MISSILE_LAUNCH is located, this "write" to this SCRAMNet+ physical memory location will result in an "Interrupt Bit" (IB) being sent over the SCRAMNet+ Network along with the memory word contents (D) and address (A).

The memory word contents, address, and interrupt bit are passed to all nodes on the network. In turn, any receiving node which has set its ACR to receive interrupts at this particular memory word location will be interrupted. Note that this is a very selective process -- both the transmitting and receiving nodes must cooperate in this process. This provides the system designer with a very "positive" way to communicate interrupts and to control the distributed system operation.
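The two-sided nature of this interrupt mechanism can be shown with a small model. The names below (`Node`, `tx_interrupt`, `rx_interrupt`, the address `0x0100`) are hypothetical; the real ACR is a hardware structure on the card, and this sketch only mirrors the rule that both the sender and the receiver must have enabled the location.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.memory = {}
        self.tx_interrupt = set()  # addresses set to send an interrupt bit
        self.rx_interrupt = set()  # addresses set to interrupt on receipt
        self.interrupts = []       # addresses that interrupted this host


def write(ring, origin, address, value):
    """Model a write that carries data, address, and an interrupt bit."""
    sender = ring[origin]
    sender.memory[address] = value
    interrupt_bit = address in sender.tx_interrupt
    for node in ring:
        if node is sender:
            continue
        node.memory[address] = value  # data is always replicated
        if interrupt_bit and address in node.rx_interrupt:
            node.interrupts.append(address)  # both sides opted in


MISSILE_LAUNCH = 0x0100  # hypothetical shared-memory offset
GO = 1

ring = [Node("launcher"), Node("tracker"), Node("logger")]
ring[0].tx_interrupt.add(MISSILE_LAUNCH)  # sender enables the location
ring[1].rx_interrupt.add(MISSILE_LAUNCH)  # only the tracker listens
write(ring, 0, MISSILE_LAUNCH, GO)
```

Here the tracker is interrupted, while the logger receives the new value without being interrupted because it never enabled that location.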

As with data communications, speed and simplicity of communications are the key features of this replicated shared-memory interrupt scheme. Interrupts pass in exactly the same way data do, at the same rates, and with the same deterministic performance.

Integer Reformatting: With the diversity of computer vendors, hardware data format compatibility has become a problem for the real-time designer. One aspect of this format problem is found with integer data representations. For example, an Intel processor may store integer data in a byte-ordering format which is different from a Motorola processor. Thus, communicating data between these processors is difficult and time-consuming if these conversions must be handled in software.

There are two major integer data formats describing how a processor associates the address of a byte with its significance in the data type in which it is contained: (1) Little Endian, and (2) Big Endian. Little Endian is the integer format where the least significant byte of data is stored at the least significant address. In the Big Endian integer format, the most significant byte of data is stored at the least significant address. DEC and Intel both manufacture processors and mini/micro systems which use the Little Endian integer format, whereas Motorola manufactures processors and systems using the Big Endian format.

The SCRAMNet+ Network provides a hardware reformatting solution to this problem. It is implemented by the application programmer by setting a bit in the SCRAMNet+ control register. The SCRAMNet+ network has been defined to use a Big Endian integer format. If a SCRAMNet+ card is to be installed on a machine which uses the Little Endian format, a bit, which resides in the host interface portion of that card, can be set which will cause byte swapping to occur at that node.

The key points of this reformatting feature for the real-time architect are its speed and flexibility. It is programmable at each node. It is fast and completely transparent to the user, accomplishing the byte swapping within the normal memory cycle time.
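The two byte orderings, and the swap that converts between them, can be seen directly with Python's standard `struct` module. This only illustrates the data formats; on SCRAMNet+ the swap is done in hardware, and the helper name `swap32` is hypothetical.

```python
import struct

value = 0x12345678

# Big Endian: most significant byte at the lowest address.
big = struct.pack(">I", value)
# Little Endian: least significant byte at the lowest address.
little = struct.pack("<I", value)


def swap32(word_bytes):
    """Reverse the byte order of one 32-bit word."""
    return word_bytes[::-1]


# A Little Endian host's word, swapped into Big Endian network order.
on_network = swap32(little)
# Swapped back, the Little Endian host recovers the original integer.
recovered = struct.unpack("<I", swap32(on_network))[0]
```

Doing this conversion in software for every word exchanged is the per-word cost that the hardware byte swap removes.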

Data Filtering: In many cyclical real-time systems, algorithms and tasks execute at fixed, clock-driven frame rates. This is typical of many aircraft simulation designs, where each model is assigned a particular rate-group (for example, 100 Hz, 50 Hz, 25 Hz, etc.). This synchronous architecture scheme leads to a simple and efficient executive software structure, as well as a very deterministic system performance.

However, cyclical simulation systems such as these result in new model output data values being computed more frequently than needed in many instances. This stems from the basic nature of the fixed scheduler -- models are put into frame rate-groups for the maximum execution rates ever needed by the models. Thus, when models are operating in their predominant "quiescent" environmental states, many model outputs change slowly. There may be several model iterations in a row where the model outputs are actually unchanged.

Without some method of detecting these invariant output data, these data are needlessly transmitted among the computers. These redundant transmissions consume network bandwidth and result in poorer utilization of system hardware and software resources. To provide the real-time system architect with a solution to this problem, the SCRAMNet+ design incorporates the implementation of a programmable data filter.

The SCRAMNet+ Network is unique in this respect, because only those "writes" to the shared-memory area that produce a data value change are actually transmitted to the other nodes on the network. For example, if the shared-memory location containing CURRENT_ALTITUDE in a node's shared-memory contains the value "2000" and the CPU writes the value "2000" to that location, then no network traffic will be generated. However, if any other value is written to the CURRENT_ALTITUDE location or if the location is set to cause an interrupt in the ACR, then the new value will be passed around the network to update the other shared-memory copies. This data filtering technique is a powerful technique for cyclical, real-time systems where not all data change on each system timing frame.
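The filtering rule just described can be written as a short model. It is a sketch of the decision logic only, with hypothetical names (`FilteredWindow`, `interrupt_addresses`, the offset `0x20`); the actual filter is implemented in the card hardware.

```python
class FilteredWindow:
    """Toy model of one node's shared-memory window with a data filter."""

    def __init__(self, filter_enabled=True):
        self.memory = {}
        self.interrupt_addresses = set()  # locations set to interrupt
        self.filter_enabled = filter_enabled
        self.transmitted = []  # (address, value) pairs placed on the ring

    def write(self, address, value):
        """Store the value; return True if it was sent on the network."""
        unchanged = self.memory.get(address) == value
        self.memory[address] = value
        forced = address in self.interrupt_addresses
        if self.filter_enabled and unchanged and not forced:
            return False  # redundant write: no network traffic
        self.transmitted.append((address, value))
        return True


CURRENT_ALTITUDE = 0x20  # hypothetical shared-memory offset
window = FilteredWindow()
sent = [window.write(CURRENT_ALTITUDE, v) for v in (2000, 2000, 2000, 2010)]
```

Of the four writes in this run, only the first and the last change the stored value, so only those two generate ring traffic.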

This technique has been shown to significantly increase the effective throughput of the network. Actual measurements on an aircraft flight simulation application have shown that this technique filtered out approximately 75% of the network traffic. This alone increased the "effective" bandwidth of the network by a factor of about four (to roughly 400% of its unfiltered value) for this particular application.
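The arithmetic behind this figure can be written out. As a simple model (an assumption for illustration, not a formula from the measurements), suppose a fraction f of application writes is filtered out, so that the ring carries only the remaining 1 - f. Each transmitted message then stands for 1/(1 - f) application writes:

```latex
\[
  \frac{B_{\text{effective}}}{B_{\text{raw}}} = \frac{1}{1 - f},
  \qquad
  f = 0.75 \;\Rightarrow\; \frac{1}{1 - 0.75} = 4 .
\]
```

With 75% of the traffic filtered, the network therefore serves about four times as many application writes as its raw bandwidth alone would allow.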

The filter is completely controllable by the application programmer, and it can be turned on or off to suit the application at hand.

Timing Considerations

Real-time systems are unique in the following respect: Outputs must be generated on-time, every time. This is the essence of the real-time domain and what makes it such a demanding engineering assignment.

Unlike many other high speed computing environments where "most of the time" is good enough or where "on the average" is good enough, most real-time applications demand correct outputs within rather stringent upper-bounds on the permissible time delays -- and these must occur every time. For the inter-processor communications network, this demands both speed and "deterministic" performance. Usually deterministic means maintenance of very tight upper-bounds on data delays.

For an aircraft simulator, delays may be very critical for some parameters. Microseconds or a few milliseconds may make the difference.

The SCRAMNet+ Network protocol and architecture were specifically optimized for this deterministic situation, and their performance parameters reflect this design thrust. For example, if the value of the parameter CURRENT_ALTITUDE is updated in one computer of a ten node ring, this new value will be available in the memory of all other computers on the ring within 9.0 usec. This is the total application-to-application (A-T-A) communications time -- that is, for instance, the time from when a FORTRAN program computes the parameter value and stores it in the variable CURRENT_ALTITUDE on one computer to when the value is available in memory to the Ada application module on the farthest computer on the ring.

The particular ring design used is very deterministic, and no lengthy network access contention is involved. The following characteristics apply when a 4-byte data packet is employed. Messages have a fixed length of 82 bits. The ring access time is between 100 nsec and 800 nsec. The node latency time is between 250 nsec and 800 nsec per node, depending on whether the node is "active" at the time of data receipt. Nominal delays are around 250 nsec.
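These figures can be combined into a rough delay bound. The sketch below assumes (this is a simplification, not a formula from the source) that the total delay is the ring access time plus the per-node latency multiplied by the number of nodes traversed, and it ignores propagation time in the fiber. The function name is hypothetical.

```python
RING_ACCESS_NS = (100, 800)    # quoted minimum and maximum ring access time
NODE_LATENCY_NS = (250, 800)   # quoted minimum and maximum latency per node


def delivery_bounds_us(nodes_traversed):
    """Return (best, worst) delay in microseconds under the simple model."""
    low = RING_ACCESS_NS[0] + nodes_traversed * NODE_LATENCY_NS[0]
    high = RING_ACCESS_NS[1] + nodes_traversed * NODE_LATENCY_NS[1]
    return low / 1000, high / 1000


# Ten-node ring, conservatively counting a latency for all ten nodes.
best, worst = delivery_bounds_us(10)
```

Under this model the worst case for a ten-node ring is 8.8 usec, which is consistent with the 9.0 usec upper bound quoted above.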

Interrupts are communicated over the ring with the same speed and simplicity as data. For the ten node network example above, a master CPU could interrupt all nine other nodes within the same 9.0 usec elapsed time.

Much of the speed advantage of the replicated shared-memory network comes from the fact that it is a "softwareless" communications link. Once the network node is set up, no additional software is required to make it operate as a real-time link. No real-time drivers are needed. Unlike a traditional message-passing protocol, there are no time-consuming software routines needed to pack, queue, transmit, de-queue, and unpack messages.

Combining and Isolating Networks: Often a user has a need to isolate or combine SCRAMNet nodes or entire network rings. The SCRAMNet Quad Switch was designed to allow local clusters of up to four SCRAMNet nodes to be switched in or out of a primary SCRAMNet ring independently and dynamically. It is equally useful when a critical real-time resource must be shared and easily re-allocated between many independent real-time systems. The Quad Switch acts as an extremely fast (less than 1 microsecond) optical bypass switch; as a repeater to extend the length of a media connection; and as a media converter to allow different types of media (coaxial, standard fiber, or long link fiber) to be interconnected.

SUMMARY

Replicated shared-memory networking, a "refreshing" new technology, offers the simulator system architect a powerful and efficient tool for interconnecting computer systems. This technology has been made feasible by advances in fiber optics, high speed electronics and electronic packaging technologies.

With a replicated shared-memory network design, the system integrator accrues many benefits:

  • Communications of data and control can take place at memory speeds -- an order-of-magnitude improvement over message-passing networks.
  • The overall system design can be conceptually "simple", easy to understand by all project personnel, easy to modify, and simple to test.
  • Large savings can be made in both fiscal and schedule budgets, since system software and communications programming is greatly simplified.
  • Deterministic performance can easily be achieved, due to the speed of communications and small variations in delay times.
  • Real-time reconfiguration, real-time error recovery, fault tolerance and graceful degradation -- all are easily implemented by-products of a shared-memory architecture.
  • Computer hardware costs can be reduced by eliminating the CPU time needed for time-consuming communications software which is typical of message-passing systems.
  • Systems can be integrated, expanded and modified quickly -- thus saving both time and money for the system integrator.

Deciding whether a replicated shared-memory network is the right solution for a project is a straightforward matter. Simply ask the question: Would this problem be best solved by a collection of computers all connected to a single physical shared-memory module? Pictorially this is shown in Figure 12. If this figure depicts the best "logical" configuration to solve the problem at hand, then a replicated shared-memory network is the best "physical" approach. A replicated shared-memory network makes a distributed computer system "look" to the system architect and software engineers as though it were configured that way.