3 Usage requirements

A system shall provide all of the words defined in 6.1 Core words. It may also provide any words defined in the optional word sets and extensions word sets. No standard word provided by a system shall alter the system state in a way that changes the effect of execution of any other standard word except as provided in this standard. A system may contain non-standard extensions, provided that they are consistent with the requirements of this standard.

The implementation of a system may use words and techniques outside the scope of this standard.

A system need not provide all words in executable form. The implementation may provide definitions, including definitions of words in the Core word set, in source form only. If so, the mechanism for adding the definitions to the dictionary is implementation defined.

A program that requires a system to provide words or techniques not defined in this standard has an environmental dependency.

3.1 Data types

A data type identifies the set of permissible values for a data object. It is not a property of a particular storage location or position on a stack. Moving a data object shall not affect its type.

No data-type checking is required of a system. An ambiguous condition exists if an incorrectly typed data object is encountered.

Table 3.1 summarizes the data types used throughout this standard. Multiple instances of the same type in the description of a definition are suffixed with a sequence digit subscript to distinguish them.

Table 3.1: Data types

Symbol                   Data type                         Size on stack

flag                     flag                              1 cell
true                     true flag                         1 cell
false                    false flag                        1 cell
char                     character                         1 cell
n                        signed number                     1 cell
+n                       non-negative number               1 cell
u                        unsigned number                   1 cell
u | n[1]                 number                            1 cell
x                        unspecified cell                  1 cell
xt                       execution token                   1 cell
addr                     address                           1 cell
a-addr                   aligned address                   1 cell
c-addr                   character-aligned address         1 cell
ior                      error result                      1 cell
d                        double-cell signed number         2 cells
+d                       double-cell non-negative number   2 cells
ud                       double-cell unsigned number       2 cells
d | ud[2]                double-cell number                2 cells
xd                       unspecified cell pair             2 cells
colon-sys                definition compilation            implementation dependent
do-sys                   do-loop structures                implementation dependent
case-sys                 CASE structures                   implementation dependent
of-sys                   OF structures                     implementation dependent
orig                     control-flow origins              implementation dependent
dest                     control-flow destinations         implementation dependent
loop-sys                 loop-control parameters           implementation dependent
nest-sys                 definition cells                  implementation dependent
i * x, j * x, k * x[3]   any data type                     0 or more cells

[1] May be either a signed number or an unsigned number depending on context.
[2] May be either a double-cell signed number or a double-cell unsigned number depending on context.
[3] May be an undetermined number of stack entries of unspecified type. For examples of use, see 6.1.1370 EXECUTE, 6.1.2050 QUIT.

3.1.1 Data-type relationships

Some of the data types are subtypes of other data types. A data type i is a subtype of type j if and only if the members of i are a subset of the members of j. The following list represents the subtype relationships using the phrase "i ⇒ j" to denote "i is a subtype of j". The subtype relationship is transitive; if i ⇒ j and j ⇒ k then i ⇒ k:

+n ⇒ u ⇒ x;
+n ⇒ n ⇒ x;
char ⇒ +n;
a-addr ⇒ c-addr ⇒ addr ⇒ u;
flag ⇒ x;
xt ⇒ x;
ior ⇒ n ⇒ x;
+d ⇒ d ⇒ xd;
+d ⇒ ud ⇒ xd.

Any Forth definition that accepts an argument of type i shall also accept an argument that is a subtype of i.

3.1.2 Character types

Characters shall have the following properties:

  • be at least one address unit wide;
  • contain at least eight bits;
  • be of fixed width;
  • have a size less than or equal to cell size;
  • be unsigned.

The characters provided by a system shall include the graphic characters {32 ... 126}, which represent graphic forms as shown in table 3.2.

3.1.2.1 Graphic characters

A graphic character is one that is normally displayed (e.g., A, #, &, 6). These values and graphics, shown in table 3.2, are taken directly from ANS X3.4-1974 (ASCII) and ISO 646-1983, International Reference Version (IRV). The graphic forms of characters outside the hex range {20 ... 7E} are implementation defined. Programs that use the graphic hex 24 (the currency sign) have an environmental dependency.

The graphic representation of characters is not restricted to particular type fonts or styles. The graphics here are examples.

Table 3.2: Standard graphic characters

Hex  IRV  ASCII   Hex  IRV  ASCII   Hex  IRV  ASCII   Hex  IRV  ASCII   Hex  IRV  ASCII   Hex  IRV  ASCII

20                30   0    0       40   @    @       50   P    P       60   `    `       70   p    p
21   !    !       31   1    1       41   A    A       51   Q    Q       61   a    a       71   q    q
22   "    "       32   2    2       42   B    B       52   R    R       62   b    b       72   r    r
23   #    #       33   3    3       43   C    C       53   S    S       63   c    c       73   s    s
24   ¤    $       34   4    4       44   D    D       54   T    T       64   d    d       74   t    t
25   %    %       35   5    5       45   E    E       55   U    U       65   e    e       75   u    u
26   &    &       36   6    6       46   F    F       56   V    V       66   f    f       76   v    v
27   '    '       37   7    7       47   G    G       57   W    W       67   g    g       77   w    w
28   (    (       38   8    8       48   H    H       58   X    X       68   h    h       78   x    x
29   )    )       39   9    9       49   I    I       59   Y    Y       69   i    i       79   y    y
2A   *    *       3A   :    :       4A   J    J       5A   Z    Z       6A   j    j       7A   z    z
2B   +    +       3B   ;    ;       4B   K    K       5B   [    [       6B   k    k       7B   {    {
2C   ,    ,       3C   <    <       4C   L    L       5C   \    \       6C   l    l       7C   |    |
2D   -    -       3D   =    =       4D   M    M       5D   ]    ]       6D   m    m       7D   }    }
2E   .    .       3E   >    >       4E   N    N       5E   ^    ^       6E   n    n       7E   ~    ~
2F   /    /       3F   ?    ?       4F   O    O       5F   _    _       6F   o    o

3.1.2.2 Control characters

All non-graphic characters included in the implementation-defined character set are defined in this standard as control characters. In particular, the characters {0 ... 31}, which could be included in the implementation-defined character set, are control characters.

Programs that require the ability to send or receive control characters have an environmental dependency.

3.1.2.3 Primitive Character

A primitive character (pchar) is a character with no restrictions on its contents. Unless otherwise stated, a "character" refers to a primitive character.

3.1.3 Single-cell types

The implementation-defined fixed size of a cell is specified in address units and the corresponding number of bits. See D.2 Hardware peculiarities.

Cells shall be at least one address unit wide and contain at least sixteen bits. The size of a cell shall be an integral multiple of the size of a character. Data-stack elements, return-stack elements, addresses, execution tokens, flags, and integers are one cell wide.

3.1.3.1 Flags

Flags may have one of two logical states, true or false. Programs that use flags as arithmetic operands have an environmental dependency. A true flag returned by a standard word shall be a single-cell value with all bits set. A false flag returned by a standard word shall be a single-cell value with all bits clear.
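The flag rules above can be illustrated with a short sketch. The word names in-range? and >flag are hypothetical and are not part of this standard; the sketch relies only on the fact that standard comparison words return well-formed flags.

```forth
\ Hypothetical helper: true if lo <= n <= hi.
: in-range? ( n lo hi -- flag )
   >R OVER > 0=        \ flag1: lo is not greater than n
   SWAP R> > 0=        \ flag2: n is not greater than hi
   AND ;               \ AND of two well-formed flags is a well-formed flag

\ Hypothetical helper: turn any cell into a well-formed flag
\ (all bits set if x is non-zero, all bits clear otherwise).
: >flag ( x -- flag )  0= 0= ;
```

Because both operands of AND are well-formed flags, the result is usable with IF and can also be compared against TRUE or FALSE without relying on arithmetic on flags.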

3.1.3.2 Integers

The implementation-defined range of signed integers shall include {-32767 ... +32767}. The implementation-defined range of non-negative integers shall include {0 ... 32767}. The implementation-defined range of unsigned integers shall include {0 ... 65535}.

3.1.3.3 Addresses

An address identifies a location in data space with a size of one address unit, which a program may fetch from or store into except for the restrictions established in this standard. The size of an address unit is specified in bits. Each distinct address value identifies exactly one such storage element. See 3.3.3 Data space.

The set of character-aligned addresses, addresses at which a character can be accessed, is an implementation-defined subset of all addresses. Adding the size of a character to a character-aligned address shall produce another character-aligned address.

The set of aligned addresses is an implementation-defined subset of character-aligned addresses. Adding the size of a cell to an aligned address shall produce another aligned address.

3.1.3.4 Counted strings

A counted string in memory is identified by the address (c-addr) of its length character.

The length character of a counted string shall contain a binary representation of the number of data characters, between zero and the implementation-defined maximum length for a counted string. The maximum length of a counted string shall be at least 255.
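As an illustration of this layout, the following sketch builds a counted string by hand and converts it to a character string with COUNT. The names greeting and .greeting are hypothetical.

```forth
\ One length character followed by that many data characters.
CREATE greeting  5 C,  CHAR h C,  CHAR e C,  CHAR l C,  CHAR l C,  CHAR o C,

\ COUNT ( c-addr1 -- c-addr2 u ) fetches the length character and
\ returns the address of the first data character plus the length.
: .greeting ( -- )  greeting COUNT TYPE ;
```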

3.1.3.5 Execution tokens

Different definitions may have the same execution token if the definitions are equivalent.

3.1.3.6 Error results

A value of zero indicates that the operation completed successfully; other values are in the range {-4095 ... -1} and represent a valid THROW code.

The meanings of values in the range {-255 ... -1} are defined by table 9.1 THROW code assignments. Values in the range {-4095 ... -256} and their meanings are implementation defined.

A word that returns an ior will not THROW that ior as an exception, but indicates the exception through the ior. This allows a program to take appropriate actions, which may include throwing the exception.
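A minimal sketch of this convention, assuming the optional File-Access and Exception word sets are present. The names open-input and try-open are hypothetical.

```forth
\ OPEN-FILE reports failure through its ior; it does not THROW.
: open-input ( c-addr u -- fileid )
   R/O OPEN-FILE       \ -- fileid ior
   THROW ;             \ the program decides to turn a non-zero ior into an exception

\ Convert the exception back into a result the caller can test.
: try-open ( c-addr u -- fileid 0 | ior )
   ['] open-input CATCH    \ after a THROW: two unspecified cells and the throw code
   DUP IF NIP NIP THEN ;   \ on failure keep only the throw code
```

THROW does nothing when the ior is zero, so the same phrase works for both the success and the failure path.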

3.1.4 Cell-pair types

A cell pair in memory consists of a sequence of two contiguous cells. The cell at the lower address is the first cell, and its address is used to identify the cell pair. Unless otherwise specified, a cell pair on a stack consists of the first cell immediately above the second cell.

3.1.4.1 Double-cell integers

On the stack, the cell containing the most significant part of a double-cell integer shall be above the cell containing the least significant part.

The implementation-defined range of double-cell signed integers shall include {-2147483647 ... +2147483647}.

The implementation-defined range of double-cell non-negative integers shall include {0 ... 2147483647}.

The implementation-defined range of double-cell unsigned integers shall include {0 ... 4294967295}. Placing the single-cell integer zero on the stack above a single-cell unsigned integer produces a double-cell unsigned integer with the same value. See 3.2.1.1 Internal number representation.
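The zero-extension rule can be sketched as follows. The names u>ud and .u are hypothetical; S>D is the standard word for the corresponding signed conversion.

```forth
\ Placing zero above an unsigned single-cell value yields a ud
\ with the same value (the zero becomes the most significant cell).
: u>ud ( u -- ud )  0 ;

\ The pictured numeric output words operate on a ud, so a single-cell
\ unsigned number is widened before conversion.
: .u ( u -- )  u>ud <# #S #> TYPE SPACE ;
```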

3.1.4.2 Character strings

A string is specified by a cell pair (c-addr u) representing its starting address and length in characters.

3.1.5 System types

The system data types specify permitted word combinations during compilation and execution.

3.1.5.1 System-compilation types

These data types denote zero or more items on the control-flow stack (see 3.2.3.2). The possible presence of such items on the data stack means that any items already there shall be unavailable to a program until the control-flow-stack items are consumed.

The implementation-dependent data generated upon beginning to compile a definition and consumed at its close is represented by the symbol colon-sys throughout this standard.

The implementation-dependent data generated upon beginning to compile a do-loop structure such as DO ... LOOP and consumed at its close is represented by the symbol do-sys throughout this standard.

The implementation-dependent data generated upon beginning to compile a CASE ... ENDCASE structure and consumed at its close is represented by the symbol case-sys throughout this standard.

The implementation-dependent data generated upon beginning to compile an OF ... ENDOF structure and consumed at its close is represented by the symbol of-sys throughout this standard.

The implementation-dependent data generated and consumed by executing the other standard control-flow words is represented by the symbols orig and dest throughout this standard.

3.1.5.2 System-execution types

These data types denote zero or more items on the return stack. Their possible presence means that any items already on the return stack shall be unavailable to a program until the system-execution items are consumed.

The implementation-dependent data generated upon beginning to execute a definition and consumed upon exiting it is represented by the symbol nest-sys throughout this standard.

The implementation-dependent loop-control parameters used to control the execution of do-loops are represented by the symbol loop-sys throughout this standard. Loop-control parameters shall be available inside the do-loop for words that use or change these parameters, words such as I, J, LEAVE and UNLOOP.
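The following sketch shows one way the loop-control parameters are handled when a definition is left from inside a do-loop. The word has-zero? is hypothetical.

```forth
\ True if any of the n cells starting at a-addr contains zero.
: has-zero? ( a-addr n -- flag )
   0 ?DO
      DUP I CELLS + @ 0= IF
         DROP UNLOOP TRUE EXIT   \ discard loop-sys before EXIT uses nest-sys
      THEN
   LOOP
   DROP FALSE ;
```

UNLOOP is needed here because the loop-control parameters would otherwise still be in place when EXIT is executed.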

3.2 The implementation environment

3.2.1 Numbers

3.2.1.1 Internal number representation

This standard allows one's complement, two's complement, or sign-magnitude number representations and arithmetic. Arithmetic zero is represented as the value of a single cell with all bits clear.

The representation of a number as a compiled literal or in memory is implementation dependent.

3.2.1.2 Digit conversion

Numbers shall be represented externally by using characters from the standard character set. Conversion between the internal and external forms of a digit shall behave as follows:

The value in BASE is the radix for number conversion. A digit has a value ranging from zero to one less than the contents of BASE. The digit with the value zero corresponds to the character "0". This representation of digits proceeds through the character set to the decimal value nine corresponding to the character "9". For digits beginning with the decimal value ten, the graphic characters beginning with the character "A" are used. This correspondence continues up to and including the digit with the decimal value thirty-five, which is represented by the character "Z". The characters "a" through "z" should be treated the same as "A" through "Z", with "a" having the value ten and "z" the value thirty-five. The conversion of digits outside this range is implementation defined.
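The digit-to-character mapping described above can be sketched as a single definition. The name digit>char is hypothetical, the argument is assumed to lie in {0 ... 35}, and the source is assumed to be interpreted with BASE set to decimal.

```forth
DECIMAL

\ Map a digit value to its external character: 0..9 to "0".."9",
\ 10..35 to "A".."Z".
: digit>char ( u -- char )
   DUP 10 < IF
      [CHAR] 0 +
   ELSE
      10 - [CHAR] A +
   THEN ;
```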

3.2.1.3 Free-field number display

Free-field number display uses the characters described in digit conversion, without leading zeros, in a field the exact size of the converted string plus a trailing space. If a number is zero, the least significant digit is not considered a leading zero. If the number is negative, a leading minus sign is displayed.

Number display may use the pictured numeric output string buffer to hold partially converted strings (see 3.3.3.6 Other transient regions).

3.2.2 Arithmetic

3.2.2.1 Integer division

Division produces a quotient q and a remainder r by dividing operand a by operand b. Division operations return q, r, or both. The identity b × q + r = a shall hold for all a and b.

When unsigned integers are divided and the remainder is not zero, q is the largest integer less than the true quotient.

When signed integers are divided, the remainder is not zero, and a and b have the same sign, q is the largest integer less than the true quotient. If only one operand is negative, whether q is rounded toward negative infinity (floored division) or rounded towards zero (symmetric division) is implementation defined.

Floored division is integer division in which the remainder carries the sign of the divisor or is zero, and the quotient is rounded to its arithmetic floor. Symmetric division is integer division in which the remainder carries the sign of the dividend or is zero and the quotient is the mathematical quotient "rounded towards zero" or "truncated". Examples of each are shown in tables 3.3 and 3.4.

In cases where the operands differ in sign and the rounding direction matters, a program shall either include code generating the desired form of division, not relying on the implementation-defined default result, or have an environmental dependency on the desired rounding direction.

Table 3.3: Floored Division Example


Dividend Divisor Remainder Quotient

10 7 3 1
-10 7 4 -2
10 -7 -4 -2
-10 -7 -3 1

Table 3.4: Symmetric Division Example

Dividend Divisor Remainder Quotient

10 7 3 1
-10 7 -3 -1
10 -7 3 -1
-10 -7 -3 1
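A program that needs a particular rounding direction can request it explicitly with FM/MOD or SM/REM instead of relying on the default of / and MOD. The sketch below uses hypothetical word names; the results noted in the comments correspond to the floored and symmetric examples above.

```forth
DECIMAL

\ Floored quotient and remainder, independent of the system default.
: /floored   ( n1 n2 -- quot )  >R S>D R> FM/MOD NIP ;
: mod-floored ( n1 n2 -- rem )  >R S>D R> FM/MOD DROP ;

\ Symmetric (truncating) quotient and remainder.
: /symmetric   ( n1 n2 -- quot )  >R S>D R> SM/REM NIP ;
: mod-symmetric ( n1 n2 -- rem )  >R S>D R> SM/REM DROP ;

\ -10 7 /floored      gives -2      -10 7 mod-floored    gives 4
\ -10 7 /symmetric    gives -1      -10 7 mod-symmetric  gives -3
```

S>D widens the dividend to a double-cell number because both FM/MOD and SM/REM take a double-cell dividend and a single-cell divisor.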


3.2.2.2 Other integer operations

In all integer arithmetic operations, both overflow and underflow shall be ignored. The value returned when either overflow or underflow occurs is implementation defined.

3.2.3 Stacks

3.2.3.1 Data stack

Objects on the data stack shall be one cell wide.

3.2.3.2 Control-flow stack

The control-flow stack is a last-in, first-out list whose elements define the permissible matchings of control-flow words and the restrictions imposed on data-stack usage during the compilation of control structures.

The elements of the control-flow stack are system-compilation data types.

The control-flow stack may, but need not, physically exist in an implementation. If it does exist, it may be, but need not be, implemented using the data stack. The format of the control-flow stack is implementation defined.

3.2.3.3 Return stack

Items on the return stack shall consist of one or more cells. A system may use the return stack in an implementation-dependent manner during the compilation of definitions, during the execution of do-loops, and for storing run-time nesting information.

A program may use the return stack for temporary storage during the execution of a definition subject to the following restrictions:

  • A program shall not access values on the return stack (using R@, R>, 2R@, 2R> or NR>) that it did not place there using >R, 2>R or N>R;

  • A program shall not access from within a do-loop values placed on the return stack before the loop was entered;

  • All values placed on the return stack within a do-loop shall be removed before I, J, LOOP, +LOOP, UNLOOP, or LEAVE is executed;

  • All values placed on the return stack within a definition shall be removed before the definition is terminated or before EXIT is executed.
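A small sketch of return-stack use that respects these restrictions: the value is placed with >R and removed with R> within the same definition, outside any do-loop. The word name third is hypothetical.

```forth
\ Copy the third stack item to the top.
: third ( x1 x2 x3 -- x1 x2 x3 x1 )
   >R          \ x1 x2        R: x3
   OVER        \ x1 x2 x1     R: x3
   R>          \ x1 x2 x1 x3
   SWAP ;      \ x1 x2 x3 x1
```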

3.2.4 Operator terminal

See 1.2.2 Exclusions.

3.2.4.1 User input device

The method of selecting the user input device is implementation defined.

The method of indicating the end of an input line of text is implementation defined.

3.2.4.2 User output device

The method of selecting the user output device is implementation defined.

3.2.5 Mass storage

A system need not provide any standard words for accessing mass storage.

3.2.6 Environmental queries

The name spaces for ENVIRONMENT? and definitions are disjoint. Names of definitions that are the same as ENVIRONMENT? strings shall not impair the operation of ENVIRONMENT?. Table 3.5 contains the valid input strings and corresponding returned value for inquiring about the programming environment with ENVIRONMENT?.

Table 3.5: Environmental Query Strings

String              Value data type  Constant?  Meaning

/COUNTED-STRING     n                yes        maximum size of a counted string, in characters
/HOLD               n                yes        size of the pictured numeric output string buffer, in characters
/PAD                n                yes        size of the scratch area pointed to by PAD, in characters
ADDRESS-UNIT-BITS   n                yes        size of one address unit, in bits
FLOORED             flag             yes        true if floored division is the default
MAX-CHAR            u                yes        maximum value of any character in the implementation-defined character set
MAX-D               d                yes        largest usable signed double number
MAX-N               n                yes        largest usable signed integer
MAX-U               u                yes        largest usable unsigned integer
MAX-UD              ud               yes        largest usable unsigned double number
RETURN-STACK-CELLS  n                yes        maximum size of the return stack, in cells
STACK-CELLS         n                yes        maximum size of the data stack, in cells

If an environmental query (using ENVIRONMENT?) returns false (i.e., unknown) in response to a string, subsequent queries using the same string may return true. If a query returns true (i.e., known) in response to a string, subsequent queries with the same string shall also return true. If a query designated as constant in the above table returns true and a value in response to a string, subsequent queries with the same string shall return true and the same value.
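A sketch of how a program might issue such queries, handling both the known and the unknown case. The word names .max-n and floored? are hypothetical.

```forth
\ Display MAX-N if the system knows it.
: .max-n ( -- )
   S" MAX-N" ENVIRONMENT? IF
      ." MAX-N = " .
   ELSE
      ." MAX-N is unknown"
   THEN ;

\ True only if the system reports floored division as its default;
\ an unknown attribute is treated as false here.
: floored? ( -- flag )
   S" FLOORED" ENVIRONMENT? 0= IF FALSE THEN ;
```

Treating an unknown FLOORED attribute as false is a design choice of this sketch; a program for which the rounding direction matters may prefer to use FM/MOD or SM/REM directly.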

3.2.7 Obsolescent Environmental Queries

This standard designates the practice of using ENVIRONMENT? to inquire whether a given word set is present as obsolescent. If such a query, as listed in table 3.6, returns true, the word set is present in the form defined by Forth 94. As these queries will be withdrawn from future revisions of the standard, their use in new programs is discouraged.

See A.3.2.7 Obsolescent Environmental Queries.

Table 3.6: Obsolescent Environmental Query Strings

String              Value data type  Constant?  Meaning

CORE                flag             no         true if complete core word set of Forth 94 is present (i.e., not a subset as defined in 5.1.1)
CORE-EXT            flag             no         true if the core extensions word set of Forth 94 is present
BLOCK               flag             no         Forth 94 block word set present.
BLOCK-EXT           flag             no         Forth 94 block extensions word set present.
DOUBLE              flag             no         Forth 94 double number word set present.
DOUBLE-EXT          flag             no         Forth 94 double number extensions word set present.
EXCEPTION           flag             no         Forth 94 exception word set present.
EXCEPTION-EXT       flag             no         Forth 94 exception extensions word set present.
FACILITY            flag             no         Forth 94 facility word set present.
FACILITY-EXT        flag             no         Forth 94 facility extensions word set present.
FILE                flag             no         Forth 94 file word set present.
FILE-EXT            flag             no         Forth 94 file extensions word set present.
FLOATING            flag             no         Forth 94 floating-point word set present.
FLOATING-EXT        flag             no         Forth 94 floating-point extensions word set present.
LOCALS              flag             no         Forth 94 locals word set present.
LOCALS-EXT          flag             no         Forth 94 locals extensions word set present.
MEMORY-ALLOC        flag             no         Forth 94 memory-allocation word set present.
MEMORY-ALLOC-EXT    flag             no         Forth 94 memory-allocation extensions word set present.
TOOLS               flag             no         Forth 94 programming-tools word set present.
TOOLS-EXT           flag             no         Forth 94 programming-tools extensions word set present.
SEARCH-ORDER        flag             no         Forth 94 search-order word set present.
SEARCH-ORDER-EXT    flag             no         Forth 94 search-order extensions word set present.
STRING              flag             no         Forth 94 string word set present.
STRING-EXT          flag             no         Forth 94 string extensions word set present.

3.3 The Forth dictionary

Forth words are organized into a structure called the dictionary. While the form of this structure is not specified by the standard, it can be described as consisting of three logical parts: a name space, a code space, and a data space. The logical separation of these parts does not require their physical separation.

A program shall not fetch from or store into locations outside data space. An ambiguous condition exists if a program addresses name space or code space.

3.3.1 Name space

The relationship between name space and data space is implementation dependent.

3.3.1.1 Word lists

The structure of a word list is implementation dependent. When duplicate names exist in a word list, the latest-defined duplicate shall be the one found during a search for the name.
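The effect of duplicate names can be seen in the following sketch, in which all names are hypothetical. A system may display a notice when the second definition of answer is compiled.

```forth
: answer  ( -- n )  1 ;
: use-old ( -- n )  answer ;   \ compiled while only the first ANSWER existed
: answer  ( -- n )  2 ;        \ duplicate name in the same word list
: use-new ( -- n )  answer ;   \ the search finds the latest-defined duplicate
```

Here use-new returns 2, while use-old continues to return 1 because its reference to answer was resolved when use-old was compiled.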

3.3.1.2 Definition names

Definition names shall contain {1 ... 31} characters. A system may allow or prohibit the creation of definition names containing non-standard characters. A system may allow the creation of definition names longer than 31 characters. Programs with definition names longer than 31 characters have an environmental dependency.

Programs that use lower case for standard definition names or depend on the case-sensitivity properties of a system have an environmental dependency.

A program shall not create definition names containing non-graphic characters.

3.3.2 Code space

The relationship between code space and data space is implementation dependent.

3.3.3 Data space

Data space is the only logical area of the dictionary for which standard words are provided to allocate and access regions of memory. These regions are: contiguous regions, variables, text-literal regions, input buffers, and other transient regions, each of which is described in the following sections. A program may read from or write into these regions unless otherwise specified.

3.3.3.1 Address alignment

Most addresses are cell aligned (indicated by a-addr) or character aligned (c-addr). ALIGNED, CHAR+, and arithmetic operations can alter the alignment state of an address on the stack. CHAR+ applied to an aligned address returns a character-aligned address that can only be used to access characters. Applying CHAR+ to a character-aligned address produces the succeeding character-aligned address. Adding or subtracting an arbitrary number to an address can produce an unaligned address that shall not be used to fetch or store anything. The only way to find the next aligned address is with ALIGNED. An ambiguous condition exists when memory is accessed using an address that is not aligned according to the requirements for the accessed type.

The definitions of 6.1.1000 CREATE and 6.1.2410 VARIABLE require that the definitions created by them return aligned addresses.

After definitions are compiled or the word ALIGN is executed the data-space pointer is guaranteed to be aligned.
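A short sketch of mixing character and cell data in one region while keeping every access correctly aligned. The names rec, rec-tag and rec-value are hypothetical.

```forth
DECIMAL

CREATE rec          \ the address returned by REC is aligned
   CHAR A C,        \ one character; the data-space pointer may now be unaligned
   ALIGN            \ advance the data-space pointer to the next aligned address
   1234 ,           \ store a cell at that aligned address

: rec-tag   ( -- char )  rec C@ ;
: rec-value ( -- x )     rec CHAR+ ALIGNED @ ;   \ recompute the cell's aligned address
```

ALIGNED applied to rec CHAR+ yields the same address that ALIGN produced at compile time, so the fetch with @ uses an aligned address.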

3.3.3.2 Contiguous regions

A system guarantees that a region of data space allocated using ALLOT, , (comma), C, (c-comma), and ALIGN shall be contiguous with the last region allocated with one of the above words, unless the restrictions in the following paragraphs apply. The data-space pointer HERE always identifies the beginning of the next data-space region to be allocated. As successive allocations are made, the data-space pointer increases. A program may perform address arithmetic within contiguously allocated regions. The last region of data space allocated using the above operators may be released by allocating a corresponding negatively-sized region using ALLOT, subject to the restrictions of the following paragraphs.

CREATE establishes the beginning of a contiguous region of data space, whose starting address is returned by the CREATEd definition. This region is terminated by compiling the next definition.

Since an implementation is free to allocate data space for use by code, the above operators need not produce contiguous regions of data space if definitions are added to or removed from the dictionary between allocations. An ambiguous condition exists if deallocated memory contains definitions.
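As a sketch of address arithmetic within a contiguous region, the following allocates five cells immediately after CREATE, with no definitions added between the allocations. The names squares and square@ are hypothetical.

```forth
DECIMAL

CREATE squares  0 , 1 , 4 , 9 , 16 ,     \ five contiguous cells

\ Fetch entry n; n is assumed to be in {0 ... 4}.
: square@ ( n -- x )  CELLS squares + @ ;
```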

3.3.3.3 Variables

The region allocated for a variable may be non-contiguous with regions subsequently allocated with , (comma) or ALLOT. For example, in:

    VARIABLE X 1 CELLS ALLOT

the region X and the region ALLOTted could be non-contiguous.

Some system-provided variables, such as STATE, are restricted to read-only access.

3.3.3.4 Text-literal regions

The text-literal regions, specified by strings compiled with S", S\" and C", may be read-only.

A program shall not store into the text-literal regions created by S", S\" and C" nor into any read-only system variable or read-only transient regions.
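A program that needs to modify literal text can first copy it into data space that it has allocated itself. The sketch below uses hypothetical names and assumes the literal fits in the 16-character buffer.

```forth
DECIMAL

CREATE name-buf  16 CHARS ALLOT       \ writable region owned by the program

\ Copy the literal into NAME-BUF and return the writable copy.
: default-name ( -- c-addr u )
   S" forth"                          \ text-literal region: treated as read-only
   name-buf SWAP                      \ c-addr1 name-buf u
   DUP >R CHARS MOVE                  \ MOVE counts address units, hence CHARS
   name-buf R> ;

\ Store char into the first character of a non-empty string.
: set-first ( char c-addr u -- )
   IF C! ELSE 2DROP THEN ;
```

A phrase such as CHAR F default-name set-first then changes only the copy in name-buf, never the compiled literal.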

3.3.3.5 Input buffers

The address, length, and content of the input buffer may be transient. A program shall not write into the input buffer. In the absence of any optional word sets providing alternative input sources, the input buffer is either the terminal-input buffer, used by QUIT to hold one line from the user input device, or a buffer specified by EVALUATE. In all cases, SOURCE returns the beginning address and length in characters of the current input buffer.

The minimum size of the terminal-input buffer shall be 80 characters.

The address and length returned by SOURCE, the string returned by PARSE, and directly computed input-buffer addresses are valid only until the text interpreter does I/O to refill the input buffer or the input source is changed.

A program may modify the size of the parse area by changing the contents of >IN within the limits imposed by this standard. For example, if the contents of >IN are saved before a parsing operation and restored afterwards, the text that was parsed will be available again for subsequent parsing operations. The extent of permissible repositioning using this method depends on the input source (see 7.3.2 Block buffer regions and 11.3.3 Input source).
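The save-and-restore technique described above can be sketched as follows. The word name-lengths is hypothetical; PARSE-NAME is taken from the Core extension word set.

```forth
\ Parse the same name twice and return its length both times.
: name-lengths ( "name" -- u1 u2 )
   >IN @              \ save the parse position
   PARSE-NAME NIP     \ first parse; keep only the length
   SWAP >IN !         \ restore >IN: the text becomes available again
   PARSE-NAME NIP ;   \ second parse sees the same name
```

Used as name-lengths hello on a single input line, both results are the length of the same name.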

A program may directly examine the input buffer using its address and length as returned by SOURCE; the beginning of the parse area within the input buffer is indexed by the number in >IN. The values are valid for a limited time. An ambiguous condition exists if a program modifies the contents of the input buffer.

3.3.3.6 Other transient regions

The data space regions identified by PAD, WORD, and #> (the pictured numeric output string buffer) may be transient. Their addresses and contents may become invalid after:

  • a definition is created via a defining word;
  • definitions are compiled with : or :NONAME;
  • data space is allocated using ALLOT, , (comma), C, (c-comma), or ALIGN.

The previous contents of the regions identified by WORD and #> may be invalid after each use of these words. Further, the regions returned by WORD and #> may overlap in memory. Consequently, use of one of these words can corrupt a region returned earlier by a different word. The other words that construct pictured numeric output strings (<#, #, #S, HOLD, HOLDS, XHOLD) may also modify the contents of these regions. Words that display numbers may be implemented using pictured numeric output words. Consequently, . (dot), .R, .S, ?, D., D.R, U., U.R could also corrupt the regions.

The size of the scratch area whose address is returned by PAD shall be at least 84 characters. The contents of the region addressed by PAD are intended to be under the complete control of the user: no words defined in this standard place anything in the region, although changing data-space allocations as described in 3.3.3.2 Contiguous regions may change the address returned by PAD. Non-standard words provided by an implementation may use PAD, but such use shall be documented.

The size of the region identified by WORD shall be at least 33 characters.

The size of the pictured numeric output string buffer shall be at least (2 × n) + 2 characters, where n is the number of bits in a cell. Programs that consider it a fixed area with unchanging access parameters have an environmental dependency.
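Because the pictured numeric output string buffer is transient, a program that needs the converted string later can copy it out immediately. In this sketch the names num-buf and save-num are hypothetical, and the 80-character buffer is an assumption that is sufficient for a single-cell number on systems with cells of up to 64 bits.

```forth
DECIMAL

CREATE num-buf  80 CHARS ALLOT        \ program-owned storage for the result

\ Convert u in the current BASE and keep a stable copy of the string.
: save-num ( u -- c-addr u2 )
   0 <# #S #>                         \ string in the transient buffer
   DUP >R num-buf SWAP CHARS MOVE     \ copy it before anything else runs
   num-buf R> ;
```

Later uses of . (dot) or of the pictured numeric output words can then no longer affect the saved text.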

3.4 The Forth text interpreter

Upon start-up, a system shall be able to interpret, as described by 6.1.2050 QUIT, Forth source code received interactively from a user input device.

Such interactive systems usually furnish a "prompt" indicating that they have accepted a user request and acted on it. The implementation-defined Forth prompt should contain the word "OK" in some combination of upper or lower case.

Text interpretation (see 6.1.1360 EVALUATE and 6.1.2050 QUIT) shall repeat the following steps until either the parse area is empty or an ambiguous condition exists:

  a) Skip leading spaces and parse a name (see 3.4.1);

  b) Search the dictionary name space (see 3.4.2). If a definition name matching the string is found:

    1) if interpreting, perform the interpretation semantics of the definition (see 3.4.3.2), and continue at a);

    2) if compiling, perform the compilation semantics of the definition (see 3.4.3.3), and continue at a);

  c) If a definition name matching the string is not found, attempt to convert the string to a number (see 3.4.1.3). If successful:

    1) if interpreting, place the number on the data stack, and continue at a);

    2) if compiling, compile code that when executed will place the number on the stack (see 6.1.1780 LITERAL), and continue at a);

  d) If unsuccessful, an ambiguous condition exists (see 3.4.4).
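The steps above can be observed with EVALUATE, which makes a string the input source and interprets it. The word thirty is hypothetical; the sketch assumes that it is executed in interpretation state and that the system defines no words named 10 or 20.

```forth
DECIMAL

: thirty ( -- n )
   S" 10 20 +" EVALUATE ;
\ "10" and "20": no matching definition name, so each is converted
\                to a number and placed on the data stack.
\ "+":           found in the dictionary; its interpretation
\                semantics are performed.
\ The parse area is then empty, so text interpretation ends.
```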

3.4.1 Parsing

Unless otherwise noted, the number of characters parsed may be from zero to the implementation-defined maximum length of a counted string.

If the parse area is empty, i.e., when the number in >IN is equal to the length of the input buffer, or contains no characters other than delimiters, the selected string is empty. Otherwise, the selected string begins with the next character in the parse area, which is the character indexed by the contents of >IN. An ambiguous condition exists if the number in >IN is greater than the size of the input buffer.

If delimiter characters are present in the parse area after the beginning of the selected string, the string continues up to and including the character just before the first such delimiter, and the number in >IN is changed to index immediately past that delimiter, thus removing the parsed characters and the delimiter from the parse area. Otherwise, the string continues up to and including the last character in the parse area, and the number in >IN is changed to the length of the input buffer, thus emptying the parse area.

Parsing may change the contents of >IN, but shall not affect the contents of the input buffer. Specifically, if the value in >IN is saved before starting the parse, resetting >IN to that value immediately after the parse shall restore the parse area without loss of data.

3.4.1.1 Delimiters

If the delimiter is the space character, hex 20 (BL), control characters may be treated as delimiters. The set of conditions, if any, under which a "space" delimiter matches control characters is implementation defined.

To skip leading delimiters is to pass by zero or more contiguous delimiters in the parse area before parsing.

3.4.1.2 Syntax

Forth has a simple, operator-ordered syntax. The phrase A B C returns values as if A were executed first, then B and finally C. Words that cause deviations from this linear flow of control are called control-flow words. Combinations of control-flow words whose stack effects are compatible form control-flow structures. Examples of typical use are given for each control-flow word in Annex A.

Forth syntax is extensible; for example, new control-flow words can be defined in terms of existing ones. This standard does not require a syntax or program-construct checker.

3.4.1.3 Text interpreter input number conversion

When converting input numbers, the text interpreter shall recognize integer numbers in the form <anynum>.

<anynum> := { <BASEnum> | <decnum> | <hexnum> | <binnum> | <cnum> }
<BASEnum> := [-]<bdigit><bdigit>*
<decnum> := #[-]<decdigit><decdigit>*
<hexnum> := $[-]<hexdigit><hexdigit>*
<binnum> := %[-]<bindigit><bindigit>*
<cnum> := '<char>'
<bindigit> := { 0 | 1 }
<decdigit> := { 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 }
<hexdigit> := { <decdigit> | a | b | c | d | e | f | A | B | C | D | E | F }

<bdigit> represents a digit according to the value of BASE (see 3.2.1.2 Digit conversion). For <hexdigit>, the digits a...f have the values 10...15. <char> represents any printable character.

The radix used for number conversion is:

<BASEnum> the value in BASE
<decnum> 10
<hexnum> 16
<binnum> 2
<cnum> the number is the value of <char>

See 2.2.5 BNF notation.
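A few inputs that follow the grammar above, as a hedged sketch. The character value shown assumes an ASCII-compatible character set.

```forth
decimal          \ make sure BASE is ten so the displayed values are decimal
#-42  .          \ <decnum>: decimal regardless of BASE, displays -42
$ff   .          \ <hexnum>: displays 255
$-FF  .          \ the sign follows the prefix: displays -255
%1010 .          \ <binnum>: displays 10
'A'   .          \ <cnum>: the value of the character, 65 in ASCII

hex  #10 decimal .   \ #10 is converted as decimal even while BASE is sixteen
```

Note that the optional minus sign comes after the radix prefix; a form such as -$FF is not covered by the grammar.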

3.4.2 Finding definition names

A string matches a definition name if each character in the string matches the corresponding character in the string used as the definition name when the definition was created. The case sensitivity (whether or not the upper-case letters match the lower-case letters) is implementation defined. A system may be either case sensitive, treating upper- and lower-case letters as different and not matching, or case insensitive, ignoring differences in case while searching.

The matching of upper- and lower-case letters with alphabetic characters in character set extensions such as accented international characters is implementation defined.

A system shall be capable of finding the definition names defined by this standard when they are spelled with upper-case letters.

3.4.3 Semantics

The semantics of a Forth definition are implemented by machine code or a sequence of execution tokens or other representations. They are largely specified by the stack notation in the glossary entries, which shows what values shall be consumed and produced. The prose in each glossary entry further specifies the definition's behavior.

Each Forth definition may have several behaviors, described in the following sections. The terms "initiation semantics" and "run-time semantics" refer to definition fragments, and have meaning only within the individual glossary entries where they appear.

3.4.3.1 Execution semantics

The execution semantics of each Forth definition are specified in an "Execution:" section of its glossary entry. When a definition has only one specified behavior, the label is omitted.

Execution may occur implicitly, when the definition into which it has been compiled is executed, or explicitly, when its execution token is passed to EXECUTE. The execution semantics of a syntactically correct definition under conditions other than those specified in this standard are implementation dependent.
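The difference between implicit and explicit execution can be sketched like this. The names double and quadruple are hypothetical.

```forth
: double ( n -- 2n ) 2* ;

\ Implicit execution: DOUBLE runs because QUADRUPLE, into which it was
\ compiled, is executed.
: quadruple ( n -- 4n ) double double ;
3 quadruple .          \ displays 12

\ Explicit execution: ' (tick) obtains the execution token of DOUBLE,
\ and EXECUTE performs the semantics that the token identifies.
5 ' double execute .   \ displays 10
```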

Glossary entries for defining words include the execution semantics for the new definition in a "name Execution:" section.

3.4.3.2 Interpretation semantics

Unless otherwise specified in an "Interpretation:" section of the glossary entry, the interpretation semantics of a Forth definition are its execution semantics.

A system shall be capable of executing, in interpretation state, all of the definitions from the Core word set and any definitions included from the optional word sets or word set extensions whose interpretation semantics are defined by this standard.

A system shall be capable of executing, in interpretation state, any new definitions created in accordance with 3 Usage requirements.

3.4.3.3 Compilation semantics

Unless otherwise specified in a "Compilation:" section of the glossary entry, the compilation semantics of a Forth definition shall be to append its execution semantics to the execution semantics of the current definition.
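A minimal sketch of the default interpretation and compilation semantics, using the hypothetical words square and fourth-power:

```forth
: square ( n -- n*n ) dup * ;

\ Interpretation state: the interpretation semantics of SQUARE are its
\ execution semantics, so it runs at once.
7 square .                 \ displays 49

\ Compilation state: the default compilation semantics append the execution
\ semantics of SQUARE to the definition being compiled; nothing runs yet.
: fourth-power ( n -- n^4 ) square square ;
2 fourth-power .           \ displays 16
```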

3.4.4 Possible actions on an ambiguous condition

When an ambiguous condition exists, a system may take one or more of the following actions:

  • ignore and continue;
  • display a message;
  • execute a particular word;
  • set interpretation state and begin text interpretation;
  • take other implementation-defined actions;
  • take implementation-dependent actions.

The response to a particular ambiguous condition need not be the same under all circumstances.

3.4.5 Compilation

A program shall not attempt to nest compilation of definitions.

During the compilation of the current definition, a program shall not execute any defining word, :NONAME, or any definition that allocates dictionary data space. The compilation of the current definition may be suspended using [ (left-bracket) and resumed using ] (right-bracket). While the compilation of the current definition is suspended, a program shall not execute any defining word, :NONAME, or any definition that allocates dictionary data space.
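As an illustration of suspending compilation within these limits, the following sketch computes a value while compilation is suspended and compiles it as a literal. The word name minutes-per-day is hypothetical. The code between [ and ] only performs arithmetic, so it neither executes a defining word nor allocates data space.

```forth
\ Suspend compilation with [ to compute a constant, resume with ] and
\ compile the result into the definition with LITERAL.
: minutes-per-day ( -- n ) [ 24 60 * ] literal ;
minutes-per-day .    \ displays 1440
```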

Contributions

BerndPaysan [16] 3.3.3.5 Input buffer: "A program shall not write into the input buffer." (Comment, 2016-03-21 02:27:27)

There are a number of reasons why this might have undesirable side-effects or not work.

In EVALUATE, the input buffer is the actual string, likely in the dictionary. Writing to it may modify that memory permanently, so the second call will use the modified version. If the input string is in flash, writing might not be possible, or result in unexpected behavior (writing to flash only flips the bits in one direction, usually from 1 to 0, erasing is only possible in larger blocks).

On blocks, the input buffer is the actual block buffer; modifying it can result in the changes being written back to disk (if you UPDATE the block later, or if the block is memory-mapped and no UPDATE is necessary).

On files, the input buffer may also be mapped into memory, as loading the whole file and then only scanning for newlines may be faster than reading the file line-by-line. Here, the file may be mapped read-only or copy-on-write, which either results in a memory access exception when writing, or in no permanent change on disk.

For terminal input, which is obviously writable use-once memory, it is likely that the change is predictable.

Therefore, it is generally not a good idea to write into the input buffer: the possible reactions are non-portable, and the side effects are unlikely to be desirable.

AntonErtl

If a file is mapped copy-on-write (and writable), then writing there is relatively harmless. One harmful (if unintended, otherwise just unportable) case is if the file is mapped shared (and writable); then the write will eventually change the source file. Alternatively, the file could be mapped read-only; then the write would cause an exception, which is probably not intended.

However, if we wanted to tighten the standard in this respect, it would be easy to require and implement that the input buffers from files are private and writable; but the other reasons still exist. In particular, the Forth-94 TC discussed this at length in RFI 6, which lists some more reasons, among them:

Storing into 'input buffers' is disallowed because we permit input sources to nest indefinitely and it is not practical for systems that conserve resources to guarantee unique concurrent addressability of all nested input sources, nor is it practical to create separate save areas for all current input buffers just in case someone stored into one of them. The TC specifically intends that, when input is coming from refreshable sources, implementations may refresh their buffers on un-nesting to conserve resources, and that when logically possible implementations may use transient, shared buffers (as is common practice with LOAD on multiprogrammed systems.)

[...]

The TC expects all Systems to process buffers provided by EVALUATE in place. This is logically necessary, in our view, since there are no upper limits on the lengths of these buffers. Since it is semantically permissible to describe more than half of addressable memory in an EVALUATE string it is not in general possible to copy such a string elsewhere and address it consistently with the definition of SOURCE .

The Forth-94 committee also discusses a possible tightening wrt EVALUATE:

Given these conditions, it is deterministic for an application to store (with great care) into EVALUATE buffers that it knows to be active, although such methods pertain exclusively to EVALUATE and certainly not to any other input stream source.

AntonErtl [33] Bug in 3.4.1.3 (Example, 2017-10-25 11:16:43)

As pointed out by Harold hzrabbie@gmail.com, "printable character" is not defined in the standard, and there are definitions (Wikipedia) that include BL among the printable characters. However, in the '<char>' syntax, <char> was not intended to be BL; moreover, ' ' produces the xt of ' on all tested systems (as intended). So we should fix this bug by replacing

<char> represents any printable character.

by

<char> represents any non-blank non-control character.

ruv [73] Case sensitivity (Proposal, 2018-11-03 13:15:53)

This contribution has been moved to the proposal section.

AntonErtl [114] Case insensitivity (Proposal, 2019-09-06 18:27:48)

This contribution has been moved to the proposal section.

ruv [166] Data type for strings (Comment, 2020-11-27 07:13:33)

3.1.4.2 Character strings

A string is specified by a cell pair (c-addr u) representing its starting address and length in characters.

Table 3.1: Data types

xd — unspecified cell pair — 2 cells

Don't you think it's convenient to have a symbol for a cell pair (c-addr u) representing a string?

For example, sd or just s ?

As a result, the stack notation for words that handle strings can be more concise. So, we will have an option to write ( sd1 ) instead of ( c-addr1 u1 ).

MitchBradley

In my own code, I use $ in stack diagrams to denote an "adr len" string.

veltas

I think it would be convenient, and any shortening like this might help reading and interpretation of the standard.

People might not be familiar with that notation and it would add a bit more confusion/work to Forthers that decide to check out the standard occasionally.

It could not be used everywhere because some Forth words do not use c-addr u pairwise, for example CMOVE has ( c-addr1 c-addr2 u -- ), so would probably be left as-is. Then at a glance it might look like CMOVE operates on different things entirely to COMPARE, so I think in that regard it might not improve readability.

I think that if we add this shortening, then counted strings deserve a different stack notation too; it would make stack notation much more useful. For example, COUNT could have ( sx -- sd ), where sx is a hypothetical counted-string address, instead of the current ( c-addr -- c-addr u ). There it is perhaps obvious that the left c-addr is a counted string, but that kind of interpretation is not consistent, e.g., with the CMOVE stack notation, where c-addr1 appears on its own.

I think I lean towards adding this change because it is factoring, and I think the benefits of factoring usually outweigh the perceived concerns.

AntonErtl

It seems to me that in the standard most uses of c-addr u then refer individually to c-addr and u in the description, so one would have to also specify how to refer to the components of a string descriptor; e.g., if you have a string str1, the components might be str1a and str1u. In the standard document, I think that the costs of such a change are bigger than the benefits.

In the larger Forth world, though, such a convention could be well worth it.

ruv

I use d-txt for that. However, I'm not happy with the d part; xd-txt would be more correct. And a better variant should be something shorter.

$ is not an alphabetical character, but all other data type symbols are alphabetical. So I would prefer an alphabetical symbol for strings too.

ruv

"In the standard document, I think that the costs of such a change are bigger than the benefits."

Yes, I also don't think it's worth changing the glossary entries.

It is worth finding and specifying an alternative form for strings in Table 3.1: Data types, and adding a corresponding note to 3.1.4.2 Character strings.

StephenPelc

The str notation was rejected in 1999 because of the potential confusion between caddr/len strings and counted strings.

Caddr/u for strings implies that the length is only bounded by the cell size, so the caddr/len notation was introduced to indicate that string lengths may be bounded by the Forth system. As yet, caddr/len has not been formally adopted.

AntonErtl

This comment has been answered. Closing.

Closed

ruv [168] Data object notion usage (Comment, 2020-12-01 10:58:24)

"Data object" is a primitive notion in the Standard.

But its usage looks inconsistent.

A data type identifies the set of permissible values for a data object. (1)

It seems to imply that the same data object may have different values. Then it means that a data object also identifies the set (namely, the set of its values). And then the set identified by a data type is a subset of the set identified by a data object.

Moving a data object shall not affect its type. (2)

But this makes an impression that a data object is an element of the set identified by a data type.

E.g., are the single-cell data objects 123 and 456 the same data object with different values, or are they simply different data objects?

I'm inclined to the latter interpretation. Involving also a set of values seems to be unnecessary.

Then a better wording for (1) would be: A data type identifies the set of data objects.

What do you think?

AidanPitt-Brooke

"A data type identifies the set of permissible values for a data object. (1)"

"It seems to imply that the same data object may have different values. Then it means that a data object also identifies the set (namely, the set of its values). And then the set identified by a data type is a subset of the set identified by a data object."

I'm not sure what you meant by the last sentence here, but I would interpret (1) as, "A data type identifies the set of permissible values for an object of that type"; that is, a data object of a given type must have a value within that type's set/range of permissible values. For Standard Forth, in general, a type is not really a property of an object/value, but rather of how that value is handled. A double-cell integer, for example, is treated as "an object" despite consisting of two separate stack items.

"Moving a data object shall not affect its type. (2)"

"But this makes an impression that a data object is an element of the set identified by a data type."

A data type "identifies" two distinct sets. On the one hand, its definition in the Standard or in a concrete implementation declares its set of permissible values. On the other, its existence implies the set of data objects which are its members. You can't conflate the two (even though there may be a one-to-one correspondence between them) because the same value can be represented in two incompatible ways, and thus belong to two disjoint types. (By "disjoint" I mean that they have no members/instances in common.)

Take double-cell integers again. Assuming that a system provides the Double-Number word set, 127 and 127. are two different, incompatible objects, despite having the same numerical value.

Granted, I also find (2) confusing. In my case, it's because it implies that objects somehow "know" their type. It further implies that an object has/belongs to exactly one type. Both of those might be true for the system types that may or may not have a tangible representation, and they probably are true for floating-point numbers if a system provides the Floating-Point word set, but they're both explicitly false for ordinary one-cell and two-cell data types.

"Then a better wording for (1) would be: A data type identifies the set of data objects."

This proposed wording is definitely missing information that the original version conveys; it also uses the wrong kind of article for "set", resulting in the absurd proposition that any data type identifies the set of all data objects. I'm sure you meant "a data type identifies a set of data objects", but that's kind of self-evident.

If I had to propose a new wording for (1), I'd go with: A data type defines the set of permissible values for objects of that type.

I think some of the confusion actually comes from the term "data object" itself. I understand why it's used; it's important to be able to draw a distinction between abstract values and the data structures that represent those values. But for those of us who began programming with high-level object-oriented languages, seeing cells and cell-pairs being called "objects" is pretty alien.

ruv

@AidanPitt-Brooke, thank you for your effort in this convoluted topic.

As I understand, "data object" is a primitive notion in the standard (as well as a similar notion "set of information"). It's not an object in OOP sense. A data object is a sequence of bits, from which you can infer neither its length nor its data type in the general case (i.e., it does not know its type).

The standard considers only data objects that can be placed on some stack (among the data stack, control-flow stack, floating-point stack, return stack, exception stack). It means that any data object (which is available for a program) can be represented as a tuple of the stack items, on which this data object is placed. NB: in some cases the length of a tuple is unknown for a program, and this length can be even zero (e.g. for nest-sys on the return stack).

Of course, actual implementations might operate on compound data objects in memory, but these data objects are usually unavailable for a program, and formally (in the standard) they are represented by other data objects — opaque identifiers (which are tuples). For example, an identifier of execution semantics, word list identifier, file identifier, result of save-input ( i*x u.i ). NB: obviously, some data objects don't have a numeric interpretation at all — formally, they are not numbers (in algebraic sense).

"An ambiguous condition exists if an incorrectly typed data object is encountered": whatever a word consumes from or produces on a stack during execution is simply a data object.

Then, what does the term "value" mean in this context? As far as I can see, it is either a synonym for "data object" or an interpretation of a data object as a number (a numeric value).

In any case, it is unclear what "a set of permissible values for a data object" means. What different values can a single concrete data object have?

ruv [183] Size of implementation dependent data types (Request for clarification, 2021-03-30 16:05:59)

May an implementation dependent data type be variable in size?

For example, may the size of colon-sys from : be unequal to the size of colon-sys from :NONAME?

I think it may, but this is not obvious from the text of the standard. A practical argument is that in many implementations of ENDOF ( C: case-sys1 of-sys -- case-sys2 ) the size of case-sys1 is not equal to the size of case-sys2.

ruv

Of course, when I say "the size of a data type", I mean the size of a data object of this data type.

StephenPelc

I see nothing that indicates that such sizes need to be consistent. However, you need to be sure that you have covered all use cases and have included the relevant version data (size) in the returned data.

AntonErtl

On behalf of the committee:

None of the committee members thinks that a given system-dependent type must always have the same size.

However, it's then up to the system to deal with the consequences.

Closed

AtH [237] 3.4.5 conflicts with [: … ;] (Request for clarification, 2022-05-11 12:45:05)

Quotations are nested definitions. They should be explicitly allowed by 3.4.5, which in its current wording rules out all nested definitions.

AidanPitt-Brooke

By my understanding, the resolution to this apparent conflict is that quotations aren't part of the Forth-2012 standard, which this document represents.

ruv [238] Same execution token (Request for clarification, 2022-06-13 22:40:38)

3.1.3.5 Execution tokens says:

Different definitions may have the same execution token if the definitions are equivalent.

Are the following definitions equivalent?

: foo 123 . ;
: bar 123 . ; immediate

It looks like the definitions for foo and bar are not equivalent. Should they be allowed to have the same execution token?

Probably, it can be stated that no two noname definitions can have the same execution token. And no two CREATE definitions, or two VARIABLE definitions, or two VALUE definitions can have the same execution token. Is it correct?

AntonErtl

My take is that foo and bar are equivalent as far as the execution token is concerned. Once you have the execution token, you cannot determine their name or immediacy with standard means. There are systems that allow access to the name and/or immediacy with non-standard means, and these systems will choose to produce different xts for foo and bar.

Actually I don't know any system that would produce the same xt for foo and bar. The only standard case where I expect that a system may produce the same xt is for a synonym.

But this statement is older than synonyms, so what is it about? A program might want to check for a certain condition by comparing with an xt, and on a system with code deduplication that produces the same xt for definitions that result in the same code, this might not work as intended.

Given the common practice of not having code deduplication, do we want to restrict the statement to synonyms, in order to let programs use the technique outlined above?

ruv

"My take is that foo and bar are equivalent as far as the execution token is concerned."

So you want to say that the execution semantics of foo and the execution semantics of bar are equivalent (the same), and then these execution semantics may be identified by the same execution token.

Actually, this is obvious from the term definitions: an execution token identifies execution semantics, so the same execution token identifies the same execution semantics. It is less obvious that the same execution semantics may be identified by different execution tokens.

Concerning 3.1.3.5: due to its wording ("to have execution token"), it speaks not about equivalent execution semantics, but about equivalent Forth words (named Forth definitions) and their associated execution tokens (when these tokens can be retrieved by a standard program).

In my example the words foo and bar (which are named Forth definitions) are not equivalent, so their definitions are not equivalent, and 3.1.3.5 is therefore not applicable to this case.

If we want 3.1.3.5 to be applicable to this case, we should reword it, I think.

For example, we can replace 3.1.3.5 by the following list (1):

  • Different Forth definitions may have the same execution token if these definitions define the same execution semantics.
  • Different execution tokens may identify the same execution semantics.
  • When an execution token can be obtained by a program before the corresponding execution semantics are defined, this token shall be unique among all the available execution tokens at the moment.

"But this statement is older than synonyms, so what is it about?"

Probably some system-specific alternatives were already known, e.g. alias or the like. Also, for example, the same execution token can easily be implemented for constants having the same value, for plain synonyms like : foo2 foo1 ;, for no-ops like : foo3 ; and for other simple cases where the generated code is identical.

"A program might want to check for a certain condition by comparing with an xt, and on a system with code deduplication that produces the same xt for definitions that result in the same code, this might not work as intended."

Yes, this bothers me too.

"Given the common practice of not having code deduplication, do we want to restrict the statement to synonyms, in order to let programs use the technique outlined above?"

In the last item of list (1), I suggest certain cases in which the standard requires an xt to be unique. At the moment, it covers only noname definitions. But I'm going to propose a word germ ( -- xt|0 ) that returns the xt of the current definition (a definition being compiled, including a quotation). After that, all colon-like definitions (colon, noname, quotation) will have unique xts, which matches common practice.

"Once you have the execution token, you cannot determine their name or immediacy with standard means."

Yes, since neither name, nor immediacy are properties of an execution token.

But you can find some words this execution token is associated with.

StephenPelc

Saying that words are equivalent if their execution semantics are identical is a ghastly mistake. For example IF is equivalent to THEN.

Children of CREATE, VALUE and VARIABLE are distinct definitions. At the very least, think of the disambiguation issues.

ruv

"Saying that words are equivalent if their execution semantics are identical is a ghastly mistake."

Of course, that's obvious. And it would be too easily disproved by a counterexample. So I don't think Anton meant that.

Usually, in such contexts "equivalence" means operational equivalence (observational equivalence). Formally, two Forth definitions are equivalent if (and only if) no standard program exists that always halts when one definition is used and never halts when the other definition is used. In practice, it's usually enough to demonstrate different behavior in some test case.
