Proposal: [67] Multi-Tasking Proposal

Informal

This proposal has been moved into this section. Its former address was: /standard/facility/MS

This page is dedicated to discussing this specific proposal

ContributeContributions

AndrewHaley [67] Multi-Tasking ProposalProposal2018-09-06 17:19:38

Multi-tasking

Multiprogramming is a common facility in Forth systems. The wordset which supports this is known as the multi-tasking wordset, and the entity executing a program sequence is known as a task.

Rationale

The earliest Forth systems developed by Charles Moore supported multiprogramming, in that the computer could execute muliple concurrent program sequences. Many of today's Forths support multiprogramming, but it has never been part of any standard. Forth has been multi-tasking for almost 50 years. It's time to standardize it.

Interactions

The most complex part of this proposal is not the wordset itself but the way in which it interacts with the rest of the system. It is necessary to go through every line of the standard to determine which parts will be affected and then to decide what to do.

Round-robin versus pre-emptive task scheduling

Originally, Forth used a co-operative (sometimes called "round robin") task scheduler to share time between tasks on a single processor. Such a scheduler only switches tasks voluntarily, i.e. when a task relinquishes control. This is done either by executing PAUSE or an I/O word such as TYPE. More recent systems sometimes use a "pre-emptive" scheduler which may switch tasks at any time, and some systems have multiple processors sharing the same memory. This wordset doesn't address such differences except to note that you may have to execute PAUSE in a long-running computation to allow other tasks to run. From th epoint of view of this standard it does not matter: as long as it follows the rules of the Memory Model, a Standard Forth program will execute correctly on either kind of system.

The Standard Forth Memory Model

Multiprocessor systems vary in the degree to which processors operate asynchronously. In order to allow systems to operate as quickly as possible while minimizing power consumption, memory operations performed by one task may appear to other tasks in a different order, stores may appear to be delayed, and in some cases incomplete updates to cells in memory may be visible. On some systems values may be read that have never been written by any task.

A Data Race occurs when two or more tasks access the same memory location concurrently, and at least one of the accesses is some kind of non-atomic store, and the tasks are not using any MUTEXes to guarantee exclusive access to that memory location. This standard does not specify what values result from data races, either when those values are read from memory or when they are written.

ATOMIC accesses and MUTEX operations are sequentially consistent. Sequential consistency means that there appears to be a total order of all atomic operations, i.e. every task observes atomic operations in the same order. More formally, he result of any execution is the same as if

the operations of all the tasks had been executed in some
sequential order, and

the operations of each processor appear in this total ordering in
the same order as was specified in its program.

Therefore, a well-defined program may store values into a buffer using ordinary (non-atomic) stores then publish the address of that buffer to other tasks using either a region protected by a MUTEX or by using an atomic store. Another task may use a MUTEX or an atomic fetch to read the address of the buffer, then use ordinary (non-atomic) fetches to load the values from it.

[ As long as a program has no data races, the entire program is sequentially consistent, even though it uses plain fetches and stores for most of its operations. This is an important property that greatly aids intuitive understanding of programs. ]

Task creation

TASK ( "<spaces>name” -- )

Skip leading space delimiters. Parse name delimited by a
space. Create a definition for name with the execution semantics
defined below. Reserve /TASK bytes of data space at a suitably
aligned address.

name is referred to as a "task".

name Execution: ( –– a-addr )

a-addr is the address of the task. The task may not be used until
it has been CONSTRUCTed.

/TASK ( -- n) "per task"

n is the size of a task.

[This word allows arrays of tasks to be created without having
to name each one.]

CONSTRUCT ( a-addr )

Instantiate the task at a-addr. a-addr must be a memory region
of at least /TASK bytes in size. CONSTRUCT shall be executed at
most once on a task. Once CONSTRUCT has been executed, the
task's USER variables may be accessed.

START-TASK ( xt a-addr -- )

Start the task at addr asynchronously executing the word whose
execution token is xt. If the word terminates by executing EXIT,
reaching the end of the word, or throwing an execption, the task
terminates.  The task may be restarted after this by executing
START-TASK again.  The task shall not be restarted until it has
terminated.

User variables

+USER ( n "<spaces>name” -- )

Skip leading space delimiters. Parse name delimited by a
space. Create a definition for name with the execution semantics
defined below. Reserve n bytes of data space, suitably aligned, in
every task.

name is referred to as a "user variable".

[ I can't find any common practice for defining user variables. Traditional forth defines n USER to have a fixed offset in the uder area, but that is of no use unless once can discover the highest already-dedefined user variable. Suggestions welcome. ]

name Execution: ( –– a-addr )

a-addr is the address of the variable in the currently-running
task.

[ I think we need to say that a child of MARKER deallocates all user variables defined after it. ]

HIS ( addr1 addr2 -- addr3 )

Given a task at address addr1 and a user varible address in the
current task, return the address of the corresponding user
variable in that task.

[ Usage: \<task-name\> \<user-variable-name\> HIS
e.g.  TYPIST BASE HIS @ ]

Starting, stopping, and pausing tasks

STOP ( -- )

Block the currently-running task unless or until AWAKEN has been
issued. Calls to AWAKEN are not counted, so multiple AWAKENs
before a STOP can unblock only a single STOP.

AWAKEN ( a-addr -- )

a-addr is the address of a task. If that task is blocked, unblock
it so that it resumes execution. If that task is not blocked, the
next time it executes STOP it will continue rather than blocking.

PAUSE ( -- )

May cause the execuiting task temporarily to relinquish the CPU.

[ In a Forth system which uses round-robin scheduling this can be used to allow other tasks to run However, this isn't usually needed because any I/O causes a task to block. All words which perform I/O (TYPE, ACCEPT, READFILE, etc.) should (unless they are extremely high priority?) execute PAUSE . ]

Atomic memory words

ATOMIC@ ( a-addr - x )

x is the value stored at a-addr.  ATOMIC@ is sequentially consistent.

See: Sequential consistency

ATOMIC! ( x a-addr –– ) “atomic-store”

Store x at a-addr.  ATOMIC! is sequentially consistent.

See: Sequential consistency

ATOMICXCHG ( x1 a-addr – x2) "atomic-exchange"

Atomically replace the value at a-addr with x1. x2 is the value
previously at a-addr. This is an atomic read-modify-write
operation. ATOMIC-XCHG has the memory effect of an ATOMIC@
followed by an ATOMIC! .

See: Sequential consistency

ATOMICCAS ( x-expected x-desired a-addr – x-prev)

Atomically compare the value at a-addr with expected, and if
equal, replace the value at a-addr with desired. prev is the value
at a-addr immediately before this operation. This is an atomic
read-modify-write operation.  ATOMIC-CAS has the memory effect of
an ATOMIC@ followed by (if the store happens) an ATOMIC! .

See: Sequential consistency

Mutual Exclusion

/MUTEX ( –- n) "per-mutex"

n is the number of bytes in a mutex.

MUTEXINIT ( a-addr -- )

Initialize a mutex. Set its state to released.

GET ( a-addr --)

Obtain control of the mutex at a-addr. If the mutex is owned by
another task, the task executing GET will block until the mutex is
available.  This operation is part of the total order, and happens
after another task releases the mutex.

[ In a round-robin scheduler, this word executes PAUSE before attempting to acquire the mutex. ]

RELEASE ( a-addr –- )

Release the mutex at addr.

This operation is part of the total order, and happens before
another task can GET the mutex.

AndrewHaleyNew Version: [67] Multi-Tasking Proposal [r186] 2018-09-12 10:04:26

Show differences

Multi-tasking

Multiprogramming is a common facility in Forth systems. The wordset which supports this is known as the multi-tasking wordset, and the entity executing a program sequence is known as a task.

Rationale

Interactions

Round-robin versus pre-emptive task scheduling

The Standard Forth Memory Model

the operations of all the tasks had been executed in some sequential order, and

the operations of each processor appear in this total ordering in the same order as was specified in its program.

Therefore, a well-defined program may store values into a buffer by using ordinary (non-atomic) stores then publish the address of that buffer to other tasks using either a region protected by a MUTEX or by using an atomic store. Another task may use a MUTEX or an atomic fetch to read the address of the buffer, then use ordinary (non-atomic) fetches to load the values from it.

Task creation

TASK ( "<spaces>name” -- )

Skip leading space delimiters. Parse name delimited by a space. Create a definition for name with the execution semantics defined below. Reserve /TASK bytes of data space at a suitably aligned address.

name is referred to as a "task".

name Execution: ( –– a-addr )

a-addr is the address of the task. The task may not be used until it has been CONSTRUCTed.

/TASK ( -- n) "per task"

n is the size of a task.

[This word allows arrays of tasks to be created without having to name each one.]

CONSTRUCT ( a-addr )

Instantiate the task at a-addr. a-addr must be a memory region of at least /TASK bytes in size. CONSTRUCT shall be executed at most once on a task. Once CONSTRUCT has been executed, the task's USER variables may be accessed.

START–TASK ( xt a-addr -- )

Start the task at addr asynchronously executing the word whose execution token is xt. If the word terminates by executing EXIT, reaching the end of the word, or throwing an execption, the task terminates. The task may be restarted after this by executing START-TASK again. The task shall not be restarted until it has terminated.

User variables

+USER ( n "<spaces>name” -- )

Skip leading space delimiters. Parse name delimited by a space. Create a definition for name with the execution semantics defined below. Reserve n bytes of data space, suitably aligned, in every task.

name is referred to as a "user variable".

name Execution: ( –– a-addr )

a-addr is the address of the variable in the currently-running task.

[ I think we need to say that a child of MARKER deallocates all user variables defined after it. ]

HIS ( addr1 addr2 -- addr3 )

Given a task at address addr1 and a user varible address in the current task, return the address of the corresponding user variable in that task.

[ Usage: <task-name> <user-variable-name> HIS e.g. TYPIST BASE HIS @ ]

Starting, stopping, and pausing tasks

STOP ( -- )

Block the currently-running task unless or until AWAKEN has been issued. Calls to AWAKEN are not counted, so multiple AWAKENs before a STOP can unblock only a single STOP.

AWAKEN ( a-addr -- )

a-addr is the address of a task. If that task is blocked, unblock it so that it resumes execution. If that task is not blocked, the next time it executes STOP it will continue rather than blocking.

PAUSE ( -- )

May cause the execuiting task temporarily to relinquish the CPU.

Atomic memory words

ATOMIC@ ( a-addr - x )

x is the value stored at a-addr. ATOMIC@ is sequentially consistent.

See: Sequential consistency

ATOMIC! ( x a-addr –– ) “atomic-store”

Store x at a-addr. ATOMIC! is sequentially consistent.

See: Sequential consistency

ATOMIC–XCHG ( x1 a-addr – x2) "atomic-exchange"

Atomically replace the value at a-addr with x1. x2 is the value previously at a-addr. This is an atomic read-modify-write operation. ATOMIC-XCHG has the memory effect of an ATOMIC@ followed by an ATOMIC! .

See: Sequential consistency

ATOMIC–CAS ( x-expected x-desired a-addr -– x-prev)

Atomically compare the value at a-addr with expected, and if equal, replace the value at a-addr with desired. prev is the value at a-addr immediately before this operation. This is an atomic read-modify-write operation. ATOMIC–CAS has the memory effect of an ATOMIC@ followed by (if the store happens) an ATOMIC! .

See: Sequential consistency

[ Do we need ATOMIC+! ? It can be written by using ATOMIC–CAS but it's rather clumsy to do so. ]

Mutual Exclusion

/MUTEX ( –- n) "per-mutex"

n is the number of bytes in a mutex.

MUTEX–INIT ( a-addr -- )

Initialize a mutex. Set its state to released.

GET ( a-addr --)

Obtain control of the mutex at a-addr. If the mutex is owned by another task, the task executing GET will block until the mutex is available. This operation is part of the total order, and happens after another task releases the mutex.

[ In a round-robin scheduler, this word executes PAUSE before attempting to acquire the mutex. ]

RELEASE ( a-addr –- )

Release the mutex at addr.

This operation is part of the total order, and happens before another task can GET the mutex.

ruv [r187] 2018-09-12 23:27:55

SP-Forth/4 supports multitasking from the box.

USER ( "name" -- ) — it knows the highest already-defined user variable. There are also USER-HERE ( -- offset ) and USER-ALLOT ( n -- ) factors.

INIT-TASK ( addr -- ) is better than CONSTRUCT ( addr -- )

START-TASK ( x xt addr -- ) — it should pass to xt a one parameter at least.

SUSPEND-TASK ( addr -- ) is better name than STOP ( -- ), it also can suspend themself or other task.

RESUME-TASK ( addr -- ) is better name than AWAKEN ( addr -- ), i.e. it better pairs with SUSPEND name.

GET (that produces nothing!) and RELEASE are bad names since they are too general. What if you will need to also wait on semaphore or critical section?

RELEASE-MUTEX ( h -- )

WAIT-MUTEX ( h -- )

WAIT-MUTEX-FOR ( h time-ms -- flag ) — false means waiting timeout.

PAUSE ( time-ms -- ) \ use 0 to just pass the scheduler loop

ruv [r188] 2018-09-12 23:43:35

Possible variants:

INIT-TASK ( xt addr -- )

START-TASK ( x addr -- ) — it already knows xt

STOP-TASK ( addr -- ) — stop (kill) a given task. This name pairs well with START name.

GeraldWodni [r451] 2020-09-01 14:18:46

An implementation by Andrew Haley can be found here: https://sourceforge.net/projects/concurrentforth/

Reply New Version

MitraArdron [167] Query re status; and minor commentsComment2020-11-29 06:32:26

I'm wondering whether this proposal is going somewhere, or was just Andrew's idea ? I don't see serious problems with it, but I'd prefer to only implement one version of multitasking in webForth, but I'm also not seeing any movement for two years?

Minor issues with the proposal ....

I think "INIT-TASK" makes more sense than "CONSTRUCT" for consistency.

ATOMIC! - I'm not quite sure what the point of this is? I'm presuming you mean that the Store can't be interrupted between starting to write and ending (for example if its writing a cell to multiple bytes). But in that case, I would presume that every ! in the core forth words should be an ATOMIC! . If that is not the meaning, then maybe it needs some explanation. (ATOMIC@ has the ame issue; ATOMIC-XCHG and ATOMIC-CAS seem fine to me)

MUTEX - I'd suggest MUTEXGET MUTEXRELEASE because GET is too common a word not to cause clashes?

USER variables seem to be a problem for me - I'm starting with eForth so there are about 40 already, Of those roughly 6 are directly needed to switch tasks (0..3 and SP0, RP0). Almost all the rest are are only relevant if the task has access to I/O or is reading and interpreting/compiling text (?KEY 'EMIT 'EXPECT 'TAP 'ECHO 'PROMPT temp SPAN >IN #TIB 'EVAL 'NUMBER HLD CONTEXT+ 8 CURRENT CP NP LAST BASE) So there is a lot of memory overhead 27 * cellsize bytes per task. If we are going to standardise, wouldn't it be good to be able to do lightweight (no I/O) tasks as well?

UlrichHoffmann [r754] 2021-09-09 14:40:18

On behalf of the committee we invite you to pick up the proposal and develop it further.

As you seem quite interested in the standard you are welcome to attend the next meeting.

Closed

Reply New Version

kc5tja [242] Round-robin vs PreemptiveExample2022-06-25 15:26:37

Round-robin is correctly identified as a scheduling algorithm. However, the choice between cooperative and preemptive multitasking has nothing to do with scheduling. For example, the Commodore-Amiga operating system Kickstart uses round-robin scheduling with tasks of equal priority, even though it is also preemptive. The Tripos operating system uses a pure priority-driven scheduling algorithm, with no round-robin behavior at all, despite being cooperative. Thus, it is best to use the term "cooperative" instead of round-robin, and avoid the whole "sometimes called" stuff. It is distracting and factually wrong. Commentary about how people frequently mis-use the term "round-robin," and/or a more general discussion of the different scheduling algorithms that exist, is best left for an appendix entry. I don't think it has a proper place in the body of a standard.

Here's is how I would clean up that section. Very little needs to change, but it is, in my estimation, a huge improvement.

Cooperative Versus Preemptive Task Scheduling

Originally, Forth used a cooperative task scheduler to share time between tasks on a single processor. Such a scheduler only switches tasks voluntarily, i.e. when a task relinquishes control. This is done either by executing PAUSE or an I/O word such as TYPE. More recent systems sometimes use a "preemptive" scheduler which may switch tasks at any time, and some systems have multiple processors sharing the same memory. This wordset doesn't address such differences except to note that you may have to execute PAUSE in a long-running computation to allow other tasks to run. From the point of view of this standard, it does not matter: as long as it follows the rules of the Memory Model, a Standard Forth program will execute correctly on either kind of system.

ruv [r837] 2022-07-16 10:43:59

Yes, I agree — in the most places of this proposal the term "cooperative" should be used instead of the term "round-robin".

Reply New Version

kc5tja [243] Stack Sizes?Example2022-06-25 15:38:51

The proposal looks good, at least for a system which utilizes a MMU to detect stack overflows to expand the stacks automatically. However, not all Forth systems will be equipped with such an MMU. On such systems, how does one ensure a data or return stack large enough to allow the task to run?

ruv [r836] 2022-07-16 10:25:11

how does one ensure a data or return stack large enough to allow the task to run?

This problem already exists for a standard program: how does one ensure the data and return stacks are large enough to allow the program to run? There is no a standard API for that at the moment.

The only way for a system is to document the initial size for the stacks, and a system defined way to configure it (if any). And for a program — to document the requirements for the stacks, and use a system-defined method to configure these sizes (if any).

And the same is for thread stack size.

ruv [r838] 2022-07-16 10:46:00

And of course, an API to configure stacks size can be proposed.

kc5tja [r841] 2022-07-31 01:46:14

<blockquote>This problem already exists for a standard program: how does one ensure the data and return stacks are large enough to allow the program to run? There is no a standard API for that at the moment.

Yes, of course; this is patent. The question was, <b>how, using the proposed API, do we specify these extents? Explicit stack sizes are required for MMU-less systems, but the API proposed above lacks any means of specifying how big a data or return stack is to be. SwiftForth, for example, uses a version of TASK which accepts these parameters on the data stack, allowing a programmer to specify them explicitly.

ruv [r844] 2022-08-11 19:10:15

The question was, how, using the proposed API, do we specify these extents?

Obviously, the proposed API doesn't provide any way for that.

A standard system may have many stacks. Among them: the data stack, return stack, floating-point stack, control-flow stack, exception stack, local variables stack. All the stacks should be local for a thread.

But a standard program (that is portable) doesn't know what stacks are employed in the system (as separate stacks), and how many cells of each stack are used in the system routines. Hence the program cannot reliably set the absolute size for the stacks at all.

What a program can do is inform the system what size for a stack is required for the program's needs — so, the system can free unneeded memory of the stack, or reserve smaller space for the stack, or reserve a bigger space for the stack, or throw an exception (or return a throw code) if the required space cannot be allocated (taking into account the system's own needs) — but it's up to the system whether the stack will be actually reduced or not.

I think, this capability is most wanted for the data stack and return stack, so we might not consider other stacks for a while.

The next question: what thread is the subject for stack changing?

the current thread (then the API will be capable for single-thread systems too);
the new thread, before creating;
any other thread.

In the first option, the API can be like the following:

need-data-stack ( u -- ior ) \ inform the system that the program needs u cells of the data stack for the program's needs
need-return-stack ( u -- ior )  \ inform the system that the program needs u cells of the return stack for the program's needs

ior should be a special nonzero value if the system cannot provide at least u cells of the stack for the program's needs. An ambiguous condition exists if the operation was success and the program will use more than u cells of the stack after that.

ruv [r845] 2022-08-11 19:26:40

This solution it not well adequate when a program uses third party libraries, since it could be unaware what stack size is required for the library's routines.

ruv [r994] 2023-02-26 18:44:30

In the general case a stack cannot be replaced by a new one (of more size), since addresses in the stack can be used under the hood.

What is possible (for each stack, or at least the data stack and return stack) are:

to ask the system for a number of available cells in the stack (similar to unused for data space);
to execute a definition with guaranteed amount of the stack space (so a new stack can be allocated before execution, and freed after; in the case of the data stack the parameters should be passed to the new stack and the results — to the old one);
to specify the stack size for a new thread (before creating/starting);
to specify the stack size for the Forth system before starting;

See also a StackOverflow question: Set stack size programmatically on Windows.

Reply New Version

kc5tja [244] Interactions with MARKER and KILL-TASKExample2022-06-25 15:54:21

While the standard allows MARKER to focus only on the state of the dictionary, and thus an implementation doesn't have to extend marker words to do anything related to tasks, I'm wondering what are the general thoughts about a marker word being extended to kill currently running threads defined after a marker word? Without such an extension, you would need to write code like this:

MARKER end-program

TASK fooer
  fooer CONSTRUCT

: end-program
  fooer KILL-TASK  ( STOP takes no parameters, so we'll need to introduce this new word. )
  end-program ;

Which reminds me about KILL-TASK. Exact name doesn't matter to me, but I think we'd want this. I imagine creating an interactive program, launching one background task per asynchronously-managed I/O device. When the program is terminated by the user, it would be necessary to bring those tasks down cleanly (even if rudely) before reverting HERE to somewhere ahead of their task records.

ruv [r835] 2022-07-16 10:13:31

I'm wondering what are the general thoughts about a marker word being extended to kill currently running threads defined after a marker word?

It doesn't matter where a thread is defined (i.e., where a definition created by task is located in the dictionary). What does matter is where the code fragments and data fragments that can be used in the thread are located in the dictionary. But this problem is undecidable in the general case.

So, terminating a thread when marker is executed is a very partial solution. This solution is harmful, since users would rely on this solution as if it is a reliable solution, and probably will not think whether a marker will remove some data or code that is used by some running thread, if the corresponding task is defined before the marker.

kc5tja [r840] 2022-07-31 01:40:24

<blockquote>This solution is harmful, since users would rely on this solution as if it is a reliable solution, and probably will not think whether a marker will remove some data or code that is used by some running thread, if the corresponding task is defined before the marker.

This paragraph does not make any sense to me. Can you rephrase this to be more clear?

ruv [r842] 2022-07-31 14:41:20

it would be necessary to bring those tasks down cleanly (even if rudely) before reverting HERE to somewhere ahead of their task records.

Since rewriting the task record of a running task (when you load something into the dictionary after HERE was reverted) can lead to crash, can't it?

But if HERE is reverted ahead of any data or code that is used by some other running task, it can lead to crash as well.

So, when you take into account only task records, you make false impression of reliability. But it's impossible to take into account all data or code.

An example:

task t2
defer t2.core
:  t2.proc ( -- ) begin t2.core 1000 pause-ms again ;


marker end-program

task t1

:noname 123 . cr ; is t2.core

:  t1.proc ( -- ) ['] t2.proc t2 start-task  ;

' t1.proc t1 start-task

end-program

Even if end-program kills t1, t2 is still running. And it doesn't matter in what way t2 was started, the only important thing that the code which is used in t2 was removed by end-program (and this code will be overwritten when you load something into the dictionary).

So, the marker words (i.e., the definitions created by marker) should not take into account the running tasks/threads. But the corresponding ambiguous condition should be probably explicitly mentioned for marker (see my comment).

kc5tja [r843] 2022-08-01 05:14:30

<blockquote>Even if end-program kills t1, t2 is still running.

Ahh! That's a good catch! I didn't think about that case. Thank you for pointing it out.

Reply New Version

ruv [247] Better API for multitaskingComment2022-07-18 00:03:00

Task object problem

The proposed interface (partially) is:

TASK ( "name” -- ), name ( -- a-addr.task )
/TASK ( -- u ) — This word allows arrays of tasks to be created without having to name each one.
CONSTRUCT ( a-addr.task -- ) — Instantiate the task at a-addr.task.
START-TASK ( xt a-addr.task -- )

The task shall not be restarted [by means of START-TASK] until it has terminated.

It's a significant condition that I oversight before.

How it can be known in the parent thread whether a child thread has terminated or not? How to correctly free the memory allocated for a "task"?

Actually, it's a problem — since the user should manage memory for "tasks" on a very low level, and there is no enough means for that. But even having such means, this resource management should be hidden under the hood.

Requirements for multitasking

I implemented a variant of cooperative round-robin multitasking in Forth for MS-DOS many years ago, and I use multi-threading in SP-Forth/4 on Windows (and Linux) a lot. In these both implementations a data object that is required to start a new thread was superfluous (and it could be eliminated).

In practice, I need a convenient way to:

launch an xt execution in a new separate thread (i.e., this xt is executed asynchronously, in parallel, using separate stacks),
pass at least one argument to the new thread for this xt: xt ( x -- ),
premature terminate execution of the own thread (it's like exit for a word, or bye for a process),
forcefully terminate execution of another thread,
wait for a thread termination,
get a throw code of a terminated thread.

I'm not sure concerning the following:

suspend execution of own or another thread (if it was not terminated),
resume execution of another thread (if it was suspended, otherwise do nothing),

I don't include synchronization primitives in the list since they are a slightly different topic.

Another API variant

According to requirements, I can suggest the following API to manage tasks (threads):

launch-task ( x xt -- tid ior )
close-task-id ( tid -- ior )
clone-task-id ( tid1 -- tid2 ior ) — a method to obtain another independent identifier.
task-id-own ( -- tid.own )
terminate ( -- )
kill-task ( tid -- ior )
wait-task ( tid -- ior ) — wait task termination
wait-task-for ( u.timeout tid -- flag ior ) — wait termination for given timeout
task-status ( tid -- n.status-code ior )

Optionally:

launch-task-suspended ( x xt -- tid ior )
suspend-task ( tid -- ior )
resume-task ( tid -- ior )

The method to suspend a task is not quite useful, since it cannot be used for synchronization.

Any tid is valid until it's closed (freed). The own tid is closed automatically when the thread terminates, so it should not be passed to another tread. When all tids are closed, the resources are freed automatically. NB: these identifier are not unique, since many identifiers may identify the same task/thread. Maybe the term "handle" is suited better here than "identifier".

Shared resources considerations

The API should specify how to work with dynamic memory: may a memory block allocated in one thread be freed in another thread, or not. For example, in SP-Forth/4 memory is allocated/freed in the per thread heaps, and all the allocated memory blocks in the thread are freed when the thread is terminated; and the global heap is also available.

Also, the API should specify does here return different values in different threads, or the same. If here can be local for threads, may these threads perform evaluate and/or includedindependently in parallel, or not. If yes, should the compilation word list be different in different threads, or not.

High level inter-thread communication

A very convenient way for communications are channels (that is a message queue under the hood). So, each running task has an associated channel for inbound messages. Usually a message is cell-sized number. Maybe be strings should be also considered.

A sketch of methods:

send-message ( c-addr u tid -- ior ) — sends a copy of given string.
wait-message ( -- c-addr u ) — the returned string is valid until the next call of wait-message-*.
wait-message-for ( u.timeout -- c-addr u flag )

The meaning of the content of string ( c-addr u ) is defined by the program (it may be binary data as well as textual).

Reply New Version