Annex D: Portability guide
D.1 Introduction
A primary goal of Forth 94 was to enable a programmer to write Forth programs that work on a wide variety of machines; Forth-2012 continues this practice. This goal is accomplished by allowing some key Forth terms to be implementation defined (e.g., cell size) and by providing Forth operators (words) that conceal the implementation. This allows the implementor to produce the Forth system that most effectively uses the native hardware. The machine-independent operators, together with some programmer discipline, support program portability.
It can be difficult for someone familiar with only one machine architecture to imagine the problems caused by transporting programs between dissimilar machines. This Annex provides guidelines for writing portable Forth programs. The first section describes ways to make a program hardware independent.
The second section describes assumptions about Forth implementations that many programmers make but that cannot be relied upon in a portable program.
D.2 Hardware peculiarities
D.2.1 Data/memory abstraction
This standard gives definitions for data and memory that apply to a wide variety of computers. These definitions give us a way to talk about the common elements of data and memory while ignoring the details of specific hardware. Similarly, Forth programs that use data and memory in ways that conform to these definitions can also ignore hardware details. The following sections discuss the definitions and describe how to write programs that are independent of the data and memory peculiarities of different computers.
D.2.2 Definitions
Three terms defined by this standard are address unit, cell, and character.
The address space of a Forth system is divided into an array of address units; an address unit is the smallest collection of bits that can be addressed. In other words, an address unit is the number of bits spanned by the addresses addr and addr+1. The most prevalent machines use 8-bit address units, but other address unit sizes exist.
In this standard, the size of a cell is an implementation-defined number of address units. Forth implemented on a 16-bit microprocessor could use a 16-bit cell and an implementation on a 32-bit machine could use a 32-bit cell. Machines with less common word sizes (e.g., 18 or 36 bits) could implement Forth systems with their native cell sizes. In all of these systems, Forth words such as DUP and ! do the same things (duplicate the top cell on the stack and store the second cell into the address given by the first cell, respectively).
Similarly, the definition of a character has been generalized to be an implementation-defined number of address units (but at least eight bits). This removes the need for a Forth implementor to provide 8-bit characters on processors where it is inappropriate. For example, on an 18-bit machine with a 9-bit address unit, a 9-bit character would be most convenient. Since, by definition, you can't address anything smaller than an address unit, a character must be at least as big as an address unit. This will result in big characters on machines with large address units. An example is a cell-addressed machine with 16-bit cells, where a 16-bit character makes the most sense.
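As a rough illustration of these definitions, the sketch below reports the sizes that a particular system has chosen. The word name .SIZES is hypothetical; CELLS, CHARS, and the ADDRESS-UNIT-BITS environment query are standard, although a system is free to answer the query with a false flag.

```forth
\ .SIZES is a hypothetical name used only for this illustration.
: .SIZES ( -- )
   ." address units per cell: "      1 CELLS . CR
   ." address units per character: " 1 CHARS . CR
   ." bits per address unit: "
   S" ADDRESS-UNIT-BITS" ENVIRONMENT? IF . ELSE ." unknown" THEN CR ;

.SIZES
```

A portable program would normally not branch on these values; it would use CELLS, CHARS, and related words so that the values never need to be known.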
D.2.3 Addressing memory
One of the most common portability problems is the addressing of successive cells in memory. Given the memory address of a cell, how do you find the address of the next cell? On a byte-addressed machine with 32-bit cells the code to find the next cell would be 4 +. The code would be 1+ on a cell-addressed processor and 16 + on a bit-addressed processor with 16-bit cells. This standard provides a next-cell operator named CELL+ that can be used in all of these cases. Given an address, CELL+ adjusts the address by the size of a cell (measured in address units).
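A minimal sketch of stepping through successive cells with CELL+ follows. The names TRIPLE, SECOND, and THIRD and the stored values are assumptions made for illustration only.

```forth
\ Hypothetical three-cell record; name and contents are illustrative.
CREATE TRIPLE  10 , 20 , 30 ,

: SECOND ( a-addr -- x )  CELL+ @ ;          \ fetch the cell after a-addr
: THIRD  ( a-addr -- x )  CELL+ CELL+ @ ;    \ step forward two cells

TRIPLE SECOND .   \ displays 20
TRIPLE THIRD .    \ displays 30
```

The same source works whether a cell spans one, two, four, or sixteen address units, because the increment is never written as a literal.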
A related problem is that of addressing an array of cells in an arbitrary order. This standard provides a portable scaling operator named CELLS. Given a number n, CELLS returns the number of address units needed to hold n cells. Using CELLS, we can make a portable definition of an ARRAY defining word:
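One way such a defining word might be written is sketched here. The usage lines and the name SCORES are assumptions added for illustration, and no bounds checking is attempted.

```forth
: ARRAY ( u "name" -- )        \ define an array of u cells
   CREATE CELLS ALLOT
   DOES> ( u -- a-addr )       \ at run time: index u gives element address
      SWAP CELLS + ;

10 ARRAY SCORES               \ SCORES is a hypothetical name
42 3 SCORES !                 \ store 42 into element 3
3 SCORES @ .                  \ displays 42
```

Both the allocation and the indexing scale by CELLS, so neither depends on the size of a cell.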
There are also portability problems with addressing arrays of characters. In a byte-addressed machine, the size of a character equals the size of an address unit. Addresses of successive characters in memory can be found using 1+ and scaling indices into a character array is a no-op (i.e., 1 *). However, there could be implementations where a character is larger than an address unit. The CHAR+ and CHARS operators, analogous to CELL+ and CELLS, are available to allow maximum portability.
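The character operators can be used in the same way as their cell counterparts. The following is a sketch only; NAME-BUF and NTH-CHAR are hypothetical names.

```forth
CREATE NAME-BUF  8 CHARS ALLOT      \ hypothetical 8-character buffer

\ Address of character number u (counting from 0) in a character array.
: NTH-CHAR ( u c-addr -- c-addr' )  SWAP CHARS + ;

CHAR A  NAME-BUF C!                 \ first character
CHAR B  NAME-BUF CHAR+ C!           \ second character, reached with CHAR+
CHAR C  2 NAME-BUF NTH-CHAR C!      \ third character, reached with CHARS

NAME-BUF 3 TYPE                     \ displays ABC
```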
This standard generalizes the definition of some Forth words that operate on regions of memory to use address units. One example is ALLOT. By prefixing ALLOT with the appropriate scaling operator (CELLS, CHARS, etc.), space for any desired data structure can be allocated (see definition of array above). For example:
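A minimal sketch of scaled allocation, assuming the hypothetical names CELL-TABLE and LINE-BUFFER:

```forth
CREATE CELL-TABLE   16 CELLS ALLOT   \ room for 16 cells
CREATE LINE-BUFFER  80 CHARS ALLOT   \ room for 80 characters

7 CELL-TABLE 15 CELLS + !            \ the last cell is within the allocation
CELL-TABLE 15 CELLS + @ .            \ displays 7
```

Writing 16 ALLOT or 80 ALLOT instead would reserve address units, which matches the intent only on systems where the item size happens to be one address unit.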
D.2.4 Alignment problems
Some processors have restrictions on the addresses that can be used by memory access instructions. This standard does not require an implementor of a Forth to make alignment transparent; on the contrary, it requires (in Section 3.3.3.1 Address alignment) that a standard Forth program assume that character and cell alignment may be required. One pitfall caused by alignment restrictions is in creating tables containing both characters and cells. When , (comma) or C, is used to initialize a table, data are stored at the data-space pointer. Consequently, it must be suitably aligned. For example, a non-portable table definition would be:
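A sketch of such a definition follows. X and Y stand for arbitrary cell-sized values; the constants below are placeholders assumed for illustration, and the table name matches the ATABLE referred to later in this section.

```forth
1000 CONSTANT X   2000 CONSTANT Y    \ placeholder values, assumed here

\ Non-portable: each , may store a cell at an unaligned address.
CREATE ATABLE  1 C, X ,  2 C, Y ,
```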
On a machine that restricts memory fetches to aligned addresses, CREATE would leave the data space pointer at an aligned address. However, the first C, would leave the data space pointer at an unaligned address, and the subsequent , (comma) would violate the alignment restriction by storing X at an unaligned address.
A portable way to create the table is:
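A sketch of the aligned version, again assuming placeholder constants for X and Y:

```forth
1000 CONSTANT X   2000 CONSTANT Y    \ placeholder values, assumed here

\ Portable: ALIGN precedes every , that follows a C, .
CREATE ATABLE  1 C, ALIGN X ,  2 C, ALIGN Y ,
```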
ALIGN adjusts the data space pointer to the first aligned address greater than or equal to its current address. An aligned address is suitable for storing or fetching characters, cells, cell pairs, or double-cell numbers.
After initializing the table, we would also like to read values from the table. For example, assume we want to fetch the first cell, X, from the table. ATABLE CHAR+ gives the address of the first thing after the character. However this may not be the address of X since we aligned the dictionary pointer between the C, and the ,. The portable way to get the address of X is:
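The access can be sketched as below. The table is rebuilt here so that the fragment stands alone; X and Y are placeholder constants assumed for illustration.

```forth
1000 CONSTANT X   2000 CONSTANT Y    \ placeholder values, assumed here
CREATE ATABLE  1 C, ALIGN X ,  2 C, ALIGN Y ,

ATABLE CHAR+ ALIGNED          \ address of X: skip the character, then align
@ .                           \ displays 1000
```

The read path applies CHAR+ and ALIGNED in the same order in which the table was built with C, and ALIGN.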
ALIGNED adjusts the address on top of the stack to the first aligned address greater than or equal to its current value.
D.3 Number representation
D.3.1 Big endian vs. little endian
The constituent bits of a number in memory are kept in different orders on different machines. Some machines place the most-significant part of a number at an address in memory with less-significant parts following it at higher addresses; this is known as big-endian ordering. Other machines do the opposite; the least-significant part is stored at the lowest address (little-endian ordering).
For example, the following code for a 16-bit little endian Forth would produce the answer 1:
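A minimal sketch of the kind of code meant here, assuming a variable named FOO (the name is illustrative):

```forth
VARIABLE FOO
1 FOO !        \ store the cell value 1
FOO C@ .       \ fetch only the character at the lowest address and display it
```

The displayed value depends on which part of the cell lives at the lowest address, which is exactly what a portable program must not rely on.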
The same code on a 16-bit big-endian Forth would produce the answer 0. A portable program cannot exploit the representation of a number in memory.
A related issue is the representation of cell pairs and double-cell numbers in memory. When a cell pair is moved from the stack to memory with 2!, the cell that was on top of the stack is placed at the lower memory address. It is useful and reasonable to manipulate the individual cells when they are in memory.
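The layout of a cell pair in memory can be sketched as follows. PAIR is a hypothetical name, and the storage is reserved with CREATE and ALLOT so that only Core words are needed.

```forth
CREATE PAIR  2 CELLS ALLOT    \ hypothetical storage for one cell pair

111 222 PAIR 2!               \ 222 is on top of the stack
PAIR @ .                      \ displays 222 (stored at the lower address)
PAIR CELL+ @ .                \ displays 111 (stored in the next cell)
```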
D.3.2 ALU organization
Different computers use different bit patterns to represent integers. Possibilities include binary representations (two's complement, one's complement, sign magnitude, etc.) and decimal representations (BCD, etc.). Each of these formats creates advantages and disadvantages in the design of a computer's arithmetic logic unit (ALU). The most commonly used representation, two's complement, is popular because of the simplicity of its addition and subtraction operations.
Programmers who have grown up on two's complement machines tend to become intimate with their representation of numbers and take some properties of that representation for granted. For example, a trick to find the remainder of a number divided by a power of two is to mask off some bits with AND. A common application of this trick is to test a number for oddness using 1 AND. However, this will not work on a one's complement machine if the number is negative (a portable technique is 2 MOD).
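The portable oddness test might be packaged as shown below. ODD? is a hypothetical name. The remainder of a negative odd number may be 1 or -1 depending on whether the system uses floored or symmetric division, so the sketch only tests it against zero.

```forth
: ODD? ( n -- flag )  2 MOD 0= 0= ;   \ true if n is odd, whatever its sign

7 ODD? .     \ displays a true flag
-7 ODD? .    \ displays a true flag
10 ODD? .    \ displays 0 (false)
```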
The remainder of this section is a (non-exhaustive) list of things to watch for when portability between machines with binary representations other than two's complement is desired.
To convert a single-cell number to a double-cell number, Forth 94 provided the operator S>D. To convert a double-cell number to single-cell, Forth programmers have traditionally used DROP. However, this trick doesn't work on sign-magnitude machines. For portability a D>S operator is available. Converting an unsigned single-cell number to a double-cell number can be done portably by pushing a zero on the stack.
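These conversions can be sketched as follows. U>D is a hypothetical helper name; S>D is a Core word, while D>S belongs to the optional Double-Number word set and is assumed to be present.

```forth
: U>D ( u -- ud )  0 ;        \ unsigned single to double: push a zero

-5 S>D D>S .                  \ displays -5
1234 U>D D>S .                \ displays 1234
```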
D.4 Forth system implementation
During Forth's history, an amazing variety of implementation techniques have been developed. The ANS Forth Standard encourages this diversity and consequently restricts the assumptions a user can make about the underlying implementation of an ANS Forth system. Users of a particular Forth implementation frequently become accustomed to aspects of the implementation and assume they are common to all Forths. This section points out many of these incorrect assumptions.
D.4.1 Definitions
Traditionally, Forth definitions have consisted of the name of the Forth word, a dictionary search link, data describing how to execute the definition, and parameters describing the definition itself. These components have historically been referred to as the name, link, code, and parameter fields. No method for accessing these fields has been found that works across all of the Forth implementations currently in use. Therefore, a portable Forth program may not use the name, link, or code field in any way. Use of the parameter field (renamed to data field for clarity) is limited to the operations described below.
Only words defined with CREATE or with other defining words that call CREATE have data fields. The other defining words in the standard (VARIABLE, CONSTANT, :, etc.) might not be implemented with CREATE. Consequently, a Standard Program must assume that words defined by VARIABLE, CONSTANT, :, etc., may have no data fields. There is no portable way for a Standard Program to modify the value of a constant or to "patch" a colon definition at run time. The DOES> part of a defining word operates on a data field, so DOES> may only be used on words ultimately defined by CREATE.
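When a value that behaves like a constant must nevertheless be changed at run time, one portable approach is to build it on CREATE so that it has a data field. The sketch below assumes the hypothetical names SETTING and WIDTH; it is an illustration, not a construct taken from the standard.

```forth
\ A constant-like defining word built on CREATE, so its children have data fields.
: SETTING ( x "name" -- )  CREATE ,  DOES> ( -- x ) @ ;

640 SETTING WIDTH
WIDTH .                       \ displays 640
800 ' WIDTH >BODY !           \ allowed: WIDTH was ultimately defined by CREATE
WIDTH .                       \ displays 800
```

Applying >BODY to a word defined by CONSTANT instead would not be portable, since such a word may have no data field.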
In standard Forth, FIND, ['] and ' (tick) return an unspecified entity called an execution token. There are only a few things that may be done with an execution token. The token may be passed to EXECUTE to execute the word ticked or compiled into the current definition with COMPILE,. The token can also be stored in a variable or other data structure and used later. Finally, if the word ticked was defined via CREATE, >BODY converts the execution token into the word's data-field address.
An execution token cannot be assumed to be an address and may not be used as one.
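A short sketch of the permitted uses of an execution token follows. SQUARE and ACTION are hypothetical names.

```forth
: SQUARE ( n -- n*n )  DUP * ;

VARIABLE ACTION               \ hypothetical holder for an execution token
' SQUARE ACTION !             \ store the token; it is never used as an address
7 ACTION @ EXECUTE .          \ displays 49
```

The token is only stored, fetched, and passed to EXECUTE; nothing is fetched from it or added to it.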
D.4.2 Stacks
In some Forth implementations, it is possible to find the address of a stack in memory and manipulate the stack as an array of cells. This technique is not portable. On some systems, especially Forth-in-hardware systems, the stacks might be in memory that can't be addressed by the program or might not be in memory at all. Forth's parameter and return stacks must be treated as stacks.
A Standard Program may use the return stack directly only for temporarily storing values. Every value examined or removed from the return stack using R@, R>, or 2R> must have been put on the stack explicitly using >R or 2>R. Even this must be done carefully because the system may use the return stack to hold return addresses and loop-control parameters. Section 3.2.3.3 Return stack of the standard has a list of restrictions.
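A minimal sketch of disciplined return-stack use, assuming the hypothetical word name ADD-UNDER: the value is pushed with >R and removed with R> inside the same definition, outside any loop.

```forth
\ Add n2 to the item lying under x, using the return stack as scratch space.
: ADD-UNDER ( n1 x n2 -- n1+n2 x )
   SWAP >R      \ move x out of the way
   +            \ n1+n2
   R> ;         \ bring x back before the definition exits

10 99 5 ADD-UNDER . .        \ displays 99 15
```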