11.6.1.2090 READ-LINE FILE

( c-addr u₁ fileid -- u₂ flag ior )

Read the next line from the file specified by fileid into memory at the address c-addr. At most u₁ characters are read. Up to two implementation-defined line-terminating characters may be read into memory at the end of the line, but are not included in the count u₂. The line buffer provided by c-addr should be at least u₁+2 characters long.

If the operation succeeded, flag is true and ior is zero. If a line terminator was received before u₁ characters were read, then u₂ is the number of characters, not including the line terminator, actually read (0 <= u₂ <= u₁). When u₁ = u₂ the line terminator has yet to be reached.

If the operation is initiated when the value returned by FILE-POSITION is equal to the value returned by FILE-SIZE for the file identified by fileid, flag is false, ior is zero, and u₂ is zero. If ior is non-zero, an exception occurred during the operation and ior is the implementation-defined I/O result code.

An ambiguous condition exists if the operation is initiated when the value returned by FILE-POSITION is greater than the value returned by FILE-SIZE for the file identified by fileid, or if the requested operation attempts to read portions of the file not written.

At the conclusion of the operation, FILE-POSITION returns the next file position after the last character read.

See:

A.11.6.1.2090 READ-LINE.

Rationale:

Implementations are allowed to store the line terminator in the memory buffer in order to allow the use of line reading functions provided by host operating systems, some of which store the terminator. Without this provision, a transient buffer might be needed. The two-character limitation is sufficient for the vast majority of existing operating systems. Implementations on host operating systems whose line terminator sequence is longer than two characters may have to take special action to prevent the storage of more than two terminator characters.

Standard Programs may not depend on the presence of any such terminator sequence in the buffer.

A typical line-oriented sequential file-processing algorithm might look like:

BEGIN                        ( )
    ... READ-LINE THROW      ( length not-eof-flag )
WHILE                        ( length )
    ...                      ( )
REPEAT DROP                  ( )

READ-LINE needs a separate end-of-file flag because empty (zero-length) lines are a routine occurrence, so a zero-length line cannot be used to signify end-of-file.
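
As a minimal sketch of the loop above filled in, the following counts the lines of an already opened file. The names max-line, line-buf and count-lines, and the 80-character request size, are illustrative assumptions and not part of the standard. A line longer than max-line characters is returned in several pieces and would be counted once per piece.

```forth
\ Sketch only: count the lines of a text file with READ-LINE.
\ max-line, line-buf and count-lines are hypothetical names.

80 CONSTANT max-line                          \ u1: characters requested per call
CREATE line-buf max-line 2 + CHARS ALLOT      \ u1+2: room for a stored terminator

: count-lines ( fileid -- n )
    >R 0                                      ( n ) ( R: fileid )
    BEGIN
        line-buf max-line R@ READ-LINE THROW  ( n u2 not-eof-flag )
    WHILE                                     ( n u2 )
        DROP 1+                               ( n' )  \ empty lines count too
    REPEAT                                    ( n u2 )
    DROP R> DROP ;                            ( n )
```

A possible use, assuming a readable file named example.txt exists:

```forth
S" example.txt" R/O OPEN-FILE THROW   ( fileid )
DUP count-lines .
CLOSE-FILE THROW
```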

Testing:

200 CONSTANT bsize
CREATE buf bsize ALLOT
VARIABLE #chars

T{ fn1 R/O OPEN-FILE SWAP fid1 ! -> 0 }T
T{ fid1 @ FILE-POSITION -> 0. 0 }T
T{ buf 100 fid1 @ READ-LINE ROT DUP #chars ! ->
<TRUE> 0 line1 SWAP DROP }T
T{ buf #chars @ line1 COMPARE -> 0 }T
T{ fid1 @ CLOSE-FILE -> 0 }T

Contributions

AntonErtl [14] Dealing with newlines Comment 2016-02-02 15:47:01

Up to Gforth 0.4, we used the C approach to text files: let the C library translate between OS-dependent newlines in the file and one newline character (typically LF) in memory on input and on output. That approach turned out to cause problems when dealing with CRLF-containing files in combination with READ-FILE and REPOSITION-FILE (among other cases), because READ-FILE referred to the in-memory length, while REPOSITION-FILE referred to the in-file length.

So, in Gforth 0.5 we switched to opening all files as binary files (whether BIN is used in fam or not); READ-LINE recognizes all three kinds of newlines (LF, CR, and CRLF), and CR and WRITE-LINE output the standard newline of the platform (LF on Unix, CRLF on Windows). If the user reads text files with READ-FILE or writes them with WRITE-FILE, they have to worry about that themselves.

The experience with this new (well, by now, 16-year-old) approach is positive; no problems have been reported, and the problems we had with the previous approach were solved.

This approach works so well because Forth has tended to avoid dealing with newlines as characters or strings: we have CR and WRITE-LINE for outputting a newline, and READ-LINE and ACCEPT for inputting lines. In all these places the actual value of the newline is abstracted away. The C approach, OTOH, is due to the fact that in the Unix roots of C a newline was visible as a single character, and they wanted to make programs written for that model run on OSs that have CRLF newlines.



AntonErtl [216] Some clarifications Comment 2021-11-01 19:19:13

The "0 <= u2 <= u1" is misleading. As becomes clear from the rest, 0 <= u2 < u1 is also guaranteed if the line terminator starts before u1 chars are received.

READ-LINE reads at most u1 characters that are not part of the line terminator. Line terminators of up to 2 chars can occur (i.e., CRLF). However, even with such a line terminator, it's enough to read u1+1 chars: if the line terminator does not start at the last char in the buffer, READ-LINE does not need to know if a line terminator follows right afterwards: It just returns u2=u1, no need to know about line terminators.

At least in Linux it is significantly faster to use input buffering than to always read u1+1 characters using a system call and reposition the file with another system call.

Acknowledgments: Discussions with ruv and dxforth resulted in this comment.

KonradSchwarz [r931] 2022-11-25 16:24:35

u2 should always be set to the number of characters read, excluding line terminators, not just when the line has fewer than u1 characters. (A strict reading of the current wording leaves u2 undefined when the line has u1 or more characters, excluding terminators.)

If a line, excluding line terminators, is exactly u1 characters long, u2 will be set to u1 and it won't be clear if a complete line has been read or not: longer lines will also return u2 := u1.

To reliably distinguish between fully read lines and too-long lines, u1 should be selected to be one greater than the largest line length one is prepared to handle. The different cases can then be disambiguated as follows:

u2 < u1 ... a line was read completely
u2 == u1 ... the first u1 characters of the next line were read
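
The rule above might be sketched as follows, assuming a program that accepts lines of at most 64 characters. The names max-len, req-len, lbuf and read-checked are hypothetical; only READ-LINE and THROW come from the standard.

```forth
\ Sketch only: request one character more than the longest acceptable line,
\ so that u2 = u1 signals a line that is too long.

64 CONSTANT max-len                        \ longest line accepted (assumed value)
max-len 1+ CONSTANT req-len                \ u1 = max-len + 1
CREATE lbuf req-len 2 + CHARS ALLOT        \ u1+2 characters, as the glossary advises

: read-checked ( fileid -- u2 too-long? not-eof? )
    lbuf req-len ROT READ-LINE THROW       ( u2 not-eof? )
    OVER req-len = SWAP ;                  ( u2 too-long? not-eof? )
```

When too-long? is true, the rest of the over-long line is still unread and will be returned by subsequent READ-LINE calls, so a caller would have to skip or reject it explicitly.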

AntonErtl [r932] 2022-11-25 18:39:38

The whole point of the complicated specification of READ-LINE is to support reading arbitrarily long lines using buffers that may be shorter than the line. When reading a line with u1 non-terminator characters (followed by a line terminator), READ-LINE has to return u2=u1, and the next READ-LINE has to return u2=0. That's not entirely clear from the specification (so maybe we should rewrite it for more clarity), but it's the only interpretation that allows knowing that the line actually has u1 characters rather than being the start of a longer line.
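
A minimal sketch of that usage, under the interpretation described above (u2=u1 means the line may continue, and a later call with u2 below u1 ends it): the word below measures the length of the next line through a deliberately small buffer. The names chunk, cbuf and line-length are hypothetical, and a real program would process each piece in cbuf instead of discarding it.

```forth
\ Sketch only: total length of the next line, read in pieces of at most
\ chunk characters. Assumes u2 = u1 means "line may continue".

16 CONSTANT chunk                          \ deliberately small u1 (assumed value)
CREATE cbuf chunk 2 + CHARS ALLOT          \ u1+2 characters

: line-length ( fileid -- len not-eof? )
    >R 0                                   ( len ) ( R: fileid )
    BEGIN
        cbuf chunk R@ READ-LINE THROW      ( len u2 flag )
        0= IF                              \ end of file reached
            DROP R> DROP  DUP 0<> EXIT     \ pieces already read still form a line
        THEN
        TUCK + SWAP                        ( len' u2 )
        chunk <                            ( len' done? )  \ short piece ends the line
    UNTIL
    R> DROP TRUE ;                         ( len true )
```

The end-of-file branch covers a final line that has no terminator and whose length is a multiple of chunk: the pieces read so far are reported as a line, and only the following call reports end of file.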
