6.2.2008 PARSE CORE EXT

( char "ccc<char>" -- c-addr u )

Parse ccc delimited by the delimiter char.

c-addr is the address (within the input buffer) and u is the length of the parsed string. If the parse area was empty, the resulting string has a zero length.

See:

Rationale:

Typical use: char PARSE ccc<char>

The traditional Forth word for parsing is WORD. PARSE solves the following problems with WORD:

  1. WORD always skips leading delimiters. This behavior is appropriate for use by the text interpreter, which looks for sequences of non-blank characters, but is inappropriate for use by words like ( , .(, and .". Consider the following (flawed) definition of .(:

       : .( [CHAR] ) WORD COUNT TYPE ; IMMEDIATE

    This works fine when used in a line like:

       .( HELLO)    5 .

    but consider what happens if the user enters an empty string:

       .( )    5 .

    The definition of .( shown above would treat the ) as a leading delimiter, skip it, and continue consuming characters until it located another ) that followed a non-) character, or until the parse area was empty. In the example shown, the 5 . would be treated as part of the string to be printed.

    With PARSE, we could write a correct definition of .(:

       : .( [CHAR] ) PARSE TYPE ; IMMEDIATE

    This definition avoids the "empty string" anomaly.

  2. WORD returns its result as a counted string. This has four bad effects:

    1. The characters accepted by WORD must be copied from the input buffer into a transient buffer, in order to make room for the count character that must be at the beginning of the counted string. The copy step is inefficient, compared to PARSE, which leaves the string in the input buffer and doesn't need to copy it anywhere.

    2. WORD must be careful not to store too many characters into the transient buffer, thus overwriting something beyond the end of the buffer. This adds to the overhead of the copy step. (WORD may have to scan a lot of characters before finding the trailing delimiter.)

    3. The count character limits the length of the string returned by WORD to 255 characters (longer strings can easily be stored in blocks!). This limitation does not exist for PARSE.

    4. The transient buffer is typically overwritten by the next use of WORD.

    The need for WORD has largely been eliminated by PARSE and PARSE-NAME. WORD is retained for backward compatibility.
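The "empty string" anomaly from item 1 can be observed directly. The following is a minimal sketch, not part of the standard text: W(, P(, TRY-W and TRY-P are hypothetical names, the two parsing words return the string instead of typing it so the result can be inspected, and EVALUATE is used only to supply a known parse area.

```forth
\ Hypothetical helper names; both leave the string on the stack.
: W( ( "ccc<paren>" -- c-addr u )  [CHAR] ) WORD COUNT ;  \ WORD skips a leading )
: P( ( "ccc<paren>" -- c-addr u )  [CHAR] ) PARSE ;       \ PARSE accepts an empty string

\ In both cases the parse area after the word name is:  ) 5
: TRY-P ( -- c-addr u n )  S" P( ) 5" EVALUATE ;   \ empty string, then 5 is interpreted
: TRY-W ( -- c-addr u )    S" W( ) 5" EVALUATE ;   \ the 5 is swallowed into the string

TRY-P . . DROP    \ displays 5 and then 0 (the string length)
TRY-W . DROP      \ displays 2 (the string is a space followed by 5)
```

With PARSE the text after the closing parenthesis is still available to the text interpreter; with WORD it has become part of the string.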

Contributions

JamesNorris [144] What happens when parse reaches the end of the parse area and the parse delimiter was not found? (Request for clarification, 2020-08-01 13:42:29)

The text says if the parse area is empty a string of length zero is returned. This suggests that not finding the delimiter for something other than length zero also ends the parse. The reason I'm asking is there was something in the old standard about refilling the parse area when parsing from a file... I'm not sure if it's in this standard. If that's true then it conflicts with this. My Forth currently loads the entire file into a buffer and treats the end of the buffer as the end of the parse area. And treats one line from the terminal also as one buffer. I figured it would be better to not have the interpreter get stuck in a "looking for an end delimiter" state over multiple lines if the user forgot to put one in... But when parsing from a file it does allow carriage returns to be in the string returned from PARSE. Is treating reaching the end of the parse area the same as finding the end delimiter the intent of the standard for PARSE? Also... I was wondering if loading an entire file into a buffer and doing EVALUATE on the whole file buffer in one go is in keeping with the spirit of the standard? (It really is a lot faster and simpler than doing multiple reads from a file.)

JennyBrien

"The text says if the parse area is empty a string of length zero is returned. This suggests that not finding the delimiter for something other than length zero also ends the parse. The reason I'm asking is there was something in the old standard about refilling the parse area when parsing from a file... I'm not sure if it's in this standard. If that's true then it conflicts with this."

If the delimiter is not found, PARSE will normally return the same string as it would if the parse area was one char longer and the final char was the one sought. It's possible to check for this condition if you need to by comparing the length returned with the length of the remaining parse area before PARSE was called.
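A minimal sketch of that check, assuming the hypothetical names PARSE? and SHOW-PAREN (neither is a standard word): the length of the parse area is recorded before PARSE runs, and the delimiter was found exactly when the returned string is shorter than that length.

```forth
\ PARSE? is a hypothetical name, not a standard word.
: PARSE? ( char "ccc<char>" -- c-addr u flag )
  SOURCE NIP >IN @ - >R   \ length of the parse area before parsing
  PARSE                   ( c-addr u )
  DUP R> < ;              \ true if u is shorter, i.e. a delimiter was consumed

\ Example use: report whether the closing parenthesis was present.
: SHOW-PAREN ( "ccc<paren>" -- )
  [CHAR] ) PARSE? IF ." found: " ELSE ." missing: " THEN TYPE ;
```

When the parse area is already empty the flag is false, which matches the "delimiter not found" reading.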

REFILL allows the same definition to be used for potentially multi-line parses from any source. The general form would be something like:

begin incomplete-parse while ... refill 0= until then
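One way to make that general form concrete is sketched below, under stated assumptions: SKIP-PAST-PAREN is a hypothetical name, the "incomplete parse" test is the length comparison described above, and the loop simply gives up when REFILL returns false (as it always does while the input source is a string from EVALUATE).

```forth
\ SKIP-PAST-PAREN is a hypothetical name used for illustration.
: SKIP-PAST-PAREN ( "ccc<paren>" -- )
  BEGIN
    SOURCE NIP >IN @ - >R     \ parse-area length before this attempt
    [CHAR] ) PARSE NIP        ( u )
    R> < 0=                   \ true: u equals that length, so no ) was found
  WHILE
    REFILL 0= IF EXIT THEN    \ no further input: give up
  REPEAT ;
```

On a system with the File-Access word set this skips a parenthesised span that runs over several lines of an included file, because each REFILL makes the next line the input buffer.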

"Is treating reaching the end of the parse area the same as finding the end delimiter the intent of the standard for PARSE?"

Yes.

"Also... I was wondering if loading an entire file into a buffer and doing EVALUATE on the whole file buffer in one go is in keeping with the spirit of the standard? (It really is a lot faster and simpler than doing multiple reads from a file.)"

That's almost possible (I've done it in the past) if you convert all line delimiters to blank space when you load the file. But then you are treating the entire file as one line and you will run into problems if the code contains words like \ that are looking for line delimiters.

Alternatively, your system could have SOURCE return the length to the next line delimiter and REFILL skip over the line delimiter.

JamesNorris

"That's almost possible (I've done it in the past) if you convert all line delimiters to blank space when you load the load. But then you are treating the entire file as one line and you will run into problems if the code contains words like \ that are looking for line delimiters."

It works. I don't convert line delimiters to blank spaces. I use the loaded file buffer as is.

For parsing words I parse for a set of delimiters instead of just a space. This is in the current standard although it says control characters. The set of delimiters I use are: { space, line feed, tab, vertical tab, back space, carriage return, and form feed }

For line comment I use a modified PARSE. This is technically not in the standard. The modified wording used for if you include the block word set would make this follow the standard, but I'm not sure if this wording has been added for parsing from files yet. I'm kinda hoping it would be. Instead of parsing for just one character, I parse for a set, which is: { line feed, vertical tab, back space, carriage return, and form feed }

MitchBradley

I led the effort to specify the input stream in a way that works across files, blocks, keyboard, and string input. Before the standard, I had used, for many years, a stream-oriented approach in my own systems, so I fully understand why that is appealing to you. But such an approach was just not feasible when considering the existing practice around >IN and the variables that are now hidden within SOURCE . As a result, the standard specifies an input model that is intended to be strictly line-oriented - the input buffer contains exactly one line. PARSE works entirely within a line - so "0 PARSE" will return the remainder of the line. To get the next line you must do an explicit REFILL. If you want a parsing operation to work across multiple lines, it must be explicitly programmed to do REFILL as necessary. It is possible that the text does not make this clear, or that it is just wrong, but that behavior was what I intended, and I believe that the committee understood that to be the case. Your approach of reading the entire file into one big buffer is certainly feasible - I do that myself in some scenarios - but to be compliant with the standard as I understand it, REFILL needs to look for the next line delimiter and arrange for the input buffer to contain just the one line.

JamesNorris

"Extend the execution semantics of 6.2.2125 REFILL with the following:

When the input source is a text file, attempt to read the next line from the text-input file. If successful, make the result the current input buffer, set >IN to zero, and return true. Otherwise return false."

It technically doesn't say you have to read only one line on the first read. But I see your point. Why is the standard dictating implementation instead of behavior?

But it also says this:

"When the input source is a string from EVALUATE, return false and perform no other action."

And I'm using EVALUATE to do the buffers. I use EVALUATE for everything... So... is this compliant or non compliant?

" But such an approach was just not feasible when considering the existing practice around >IN and the variables that are now hidden within SOURCE"

The way I do it is: each buffer, including the terminal input buffer, has its own >IN offset. And then I use a current input buffer variable to determine what the current >IN and SOURCE are. This also allows for including one file within another file. When the included file finishes, the first file continues from where it left off. Basically the include does this: save the current input buffer to a local variable, load the file to a buffer, set the current input buffer to the new buffer, evaluate the buffer, put the current input buffer back the way it was.

The main reasons for doing an entire file in one go are:

  1. There is a lot of operating system overhead involved with doing a read from a file. Doing one big read is a lot more efficient and faster than doing a lot of smaller reads.
  2. It means I need a lot less words to manipulate files. I only need two, one to load the entire file to a buffer, and one to save a buffer to a file. (The saved file overwrites any existing file... so the whole buffer is the file after the save.)
  3. It really simplifies the parsing code. I just do an EVALUATE on the whole buffer. And parse for a set of delimiters instead of just a space.

But I am wondering what specific standard behaviors will break by doing it this way? Won't all the words be parsed in exactly the same way? If not, is it possible to get the standard changed to allow this? And technically, the way it's worded now doesn't prevent someone implementing things this way...

JamesNorris

I forgot to mention... include frees the buffer after the evaluate... kind of important to not leak memory :-)

JamesNorris

"PARSE works entirely within a line - so "0 PARSE" will return the remainder of the line"

I just looked through PARSE and I couldn't find where it says where it only works within a line. I checked my parse and it ends when the delimiter is found or it reaches the end of the buffer. Changing my parse to only go to the end of a line is an easy fix, I just add the set of line terminators to the list of delimiters for a parse. But, is this really the intent of the standard?

That means you can't use ( to do multi line comments in a file. It also means you can't pass strings to EVALUATE with line terminators in them. It's also extra rules for special cases... Also no multi line ." C" " S" or .( in a file. I find multi line ( useful.

If this is the intent of the standard, can PARSE and all the words using it be changed to say they only go to the end of the line if the terminator is not found? I would rather things go the other way though, where these words can operate over multiple lines if the buffer is from a file.

In other words, this is the definition of parse area in the standard: "parse area: The portion of the input buffer that has not yet been parsed, and is thus available to the system for subsequent processing by the text interpreter and other parsing operations."

I interpreted this as I could change the input buffer to be whatever I wanted, including a buffer I loaded from a file. The thing that is missing from the definition is that the parse area can not contain line terminator characters, and what to do if they are found. In reality, you can leave this definition alone, and all the words that use parse alone, and just put something in PARSE that says what to do if the end of the parse area is reached without finding the delimiter. And change PARSE-NAME to include line terminators as additional white space delimiters when parsing from a buffer from a file or EVALUATE. And change \ to use line terminators as delimiters when parsing from a buffer from EVALUATE or a file. The wording of SOURCE does not need to change when parsing from a file. Neither does the wording of >IN need to change. They just return values relative to the current input buffer when loading from a file... which is what they already say.

So in reality, to have parsing from a buffer loaded from a file be compliant with the standard all that needs to happen is a slight rewording of \ and PARSE-NAME. \ would have to parse to a line terminator or the end of the parse area. and PARSE-NAME would have to use white space as delimiters which would include the space character and line terminator characters. This would also fix the ambiguous condition of what happens when someone passes a string to EVALUATE which contains line terminators.

PARSE needs to be reworded anyways because it does not specifically say the parse ends when the end of the parse area is reached... (I got confused by the wording.)

AntonErtl

I would love to point out chapter and verse for every issue, but at the volume of questions you are asking, this would be a full-time job. So below, I will just point out my understanding of the issues, and leave it to you to read the appropriate sections for details. Relevant sections are "3.4.1 Parsing", "3.3.3.5 Input buffers" (and the related block and file sections mentioned there), "11.6.1.1718 INCLUDED" (and similar words), "6.1.1360 EVALUATE" (and its blocks version) and "7.6.1.1790 LOAD".

  1. Normal parsing does not REFILL. You need to REFILL explicitly; words that REFILL internally explicitly mention it (e.g., "11.6.1.0080 (")

  2. Yes, reaching the end of the parse area (i.e., line in case of INCLUDED) ends parsing unless otherwise mentioned.

  3. EVALUATE has specific effects (such as setting SOURCE-ID to -1) that are inappropriate during, e.g., INCLUDED. But you can certainly have a common factor between EVALUATE and INCLUDED for processing an input buffer.

  4. Loading the whole file into RAM is probably in the spirit (and if the wording does not allow it, it probably should), but you then have to treat each line as a separate input buffer.

  5. "Why is the standard dictating implementation instead of behavior?" It describes how a standard system behaves. If a standard program cannot detect the difference, you can implement it differently. In the present case, could a standard program write to the file during INCLUDED? That would be detectable, but as mentioned in 4, I think it should not be.

  6. The way you describe your implementation, I expect it is not compliant. But one would have to look at the implementation, and then consider whether one can write a standard program that detects the difference. As a first step, you could run Gerry Jackson's test suite.

  7. "But I am wondering what specific standard behaviors will break by doing it this way?" Every program that uses "0 parse" to get the rest of the line will not behave as intended. Every program that uses REFILL to switch to the next line will not behave as intended. Possibly more breakage.

  8. "If not, is it possible to get the standard changed to allow this?" Very unlikely, because breaking existing programs (and such programs exist) is a no-no.

  9. "But, is this really the intent of the standard?" Yes. Mitch Bradley made it clear. And if you read the references mentioned above, it is not just the intent, but also the wording of the standard.

  10. Multi-line "(" is standard. See "11.6.1.0080 (". Multi-line ." S" and C" are not.

  11. "can PARSE and all the words using it be changed to say they only go to the end of the line if the terminator is not found?" The standard already says that in 3.4.1. But if you want, you can propose changing the wording of the standard. I don't think it would help, though; it would just inflate the standard, making it harder to find the relevant text for other issues. We have this site for clarifying issues.

JamesNorris

  1. Normal parsing does not REFILL. You need to REFILL explicitly; words that REFILL internally explicitly mention it (e.g., "11.6.1.0080 (")

I'll have to unfix " then... I made it only one line.

  2. Yes, reaching the end of the parse area (i.e., line in case of INCLUDED) ends parsing unless otherwise mentioned.

The definition of PARSE above needs to say what is returned when the end of the parse area is reached and the terminator was not found.

  3. EVALUATE has specific effects (such as setting SOURCE-ID to -1) that are inappropriate during, e.g., INCLUDED. But you can certainly have a common factor between EVALUATE and INCLUDED for processing an input buffer.

It was my understanding from reading the standard that implementing SOURCE-ID was optional.

  4. Loading the whole file into RAM is probably in the spirit (and if the wording does not allow it, it probably should), but you then have to treat each line as a separate input buffer.

It does not. Why would I have to treat each line as a separate input buffer when I can instead consider line terminators to be white space when parsing words, and treat line terminators as delimiters when parsing for the end of the line in line comment? This way I avoid unnecessary copying.

  5. "Why is the standard dictating implementation instead of behavior?" It describes how a standard system behaves. If a standard program cannot detect the difference, you can implement it differently. In the present case, could a standard program write to the file during INCLUDED? That would be detectable, but as mentioned in 4, I think it should not be.

My reply to number 4 illustrates my question. There is more than one way to do a line comment, or parse words and achieve the same result.

  6. The way you describe your implementation, I expect it is not compliant. But one would have to look at the implementation, and then consider whether one can write a standard program that detects the difference. As a first step, you could run Gerry Jackson's test suite.

I would like to use the test suite, but the repository on github is missing a license as far as I can tell which means it is copyrighted and not available for the public. https://docs.github.com/en/github/creating-cloning-and-archiving-repositories/licensing-a-repository

  7. "But I am wondering what specific standard behaviors will break by doing it this way?" Every program that uses "0 parse" to get the rest of the line will not behave as intended. Every program that uses REFILL to switch to the next line will not behave as intended. Possibly more breakage.

I fixed PARSE to stop at the end of the line by treating finding a line terminator as finding the end of the parse area for PARSE. I'm still loading the entire file into a buffer and then parsing the whole buffer in one go for PARSE-NAME. I just have to treat line terminators as white space delimiters for PARSE-NAME.

  8. "If not, is it possible to get the standard changed to allow this?" Very unlikely, because breaking existing programs (and such programs exist) is a no-no.

Even if no existing programs or behaviors are broken? Such as doing a line comment in a different way that still has the interpreter skipping to the end of the line?

  10. Multi-line "(" is standard. See "11.6.1.0080 (". Multi-line ." S" and C" are not.

I'll have to unfix that one too...

  11. "can PARSE and all the words using it be changed to say they only go to the end of the line if the terminator is not found?" The standard already says that in 3.4.1. But if you want, you can propose changing the wording of the standard. I don't think it would help, though; it would just inflate the standard, making it harder to find the relevant text for other issues. We have this site for clarifying issues.

I missed this on first reading it. I'm guessing you are referring to this phrase:

3.4.1 Unless otherwise noted, the number of characters parsed may be from zero to the implementation-defined maximum length of a counted string.

It doesn't exactly say it's one line.... And PARSE doesn't return a counted string so... not sure why this is in there. Had to look up 'counted string' in the definition of terms which says:

counted string: A data structure consisting of one character containing a length followed by zero or more contiguous data characters. Normally, counted strings contain text.

And 'one character' is implementation defined. It's usually a byte but some use something larger to hold characters. So technically you are correct but I'm not sure of the reason for this rule.

JamesNorris

"The definition of PARSE above needs to say what is returned when the end of the parse area is reached and the terminator was not found."

I found this in the extended description of parsing:

"Otherwise, the string continues up to and including the last character in the parse area, and the number in >IN is changed to the length of the input buffer, thus emptying the parse area."

So it does say what to do if the end of the parse area is reached and the delimiter was not found. But it wouldn't hurt to have it up there with the short description of PARSE. Something like "If no end delimiter remains in the parse area, the parsed string equals the entire parse area." I'm not going to go so far as to make a proposal since it is already in the extended description.
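The quoted rule can be checked with a short sketch. REST-OF-LINE is a hypothetical name, and the example assumes the line contains no character with code 0, so the delimiter is never found and the parse runs to the end of the parse area.

```forth
\ REST-OF-LINE is a hypothetical name used for illustration.
: REST-OF-LINE ( "ccc<eol>" -- c-addr u flag )
  0 PARSE                 \ delimiter 0 is not present, so parsing stops at the end
  >IN @ SOURCE NIP = ;    \ true if >IN now equals the input buffer length
```

After `REST-OF-LINE hello world` the returned length is 11 and the flag is true, because >IN has been set to the length of the input buffer and the parse area is empty.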

AntonErtl

  3. Yes, SOURCE-ID is optional. Another issue is that REFILL during EVALUATE returns false, while REFILL during INCLUDED returns true and changes the input buffer to the next line if there is one.

  4. Words that parse must not parse beyond the end of line in a standard system (unless they also refill). And REFILL has to work as described in the standard. As for avoiding copying, I think it is possible to read (or mmap) the whole file into a buffer, and then treat each line in that file as input buffer. REFILL then changes the address that SOURCE returns. No copying of buffer contents happening.

  6. Looking at some individual test files, they mention distribution terms at the start of the file (typically public domain). If you want a LICENSE file, maybe you could contribute one.

  8. If no existing programs are broken because your implementations of standard words satisfy the requirements of the standard, then there is no need to change the standard. If no existing programs are broken, because no existing programs exercise the areas where your implementations of the standard words deviate from the requirements, then you could make a proposal for changing the standard, and (to get it accepted) would have to convince people that there is really no program around that exercises these areas.

  11. I meant

"Otherwise, the string continues up to and including the last character in the parse area, and the number in >IN is changed to the length of the input buffer, thus emptying the parse area."

Concerning the parse area being a line when including a file, look at INCLUDED.

The sentence about counted strings probably refers to restrictions that systems have that primarily use WORD to parse (in most of the words that parse). PARSE and PARSE-NAME probably should "note otherwise". Of course, it is good practice for systems to use PARSE and PARSE-NAME instead of WORD, but systems that implement, say .", by calling WORD can still be standard.

JamesNorris

"6) Looking at some individual test files, they mention distribution terms at the start of the file (typically public domain). If you want a LICENSE file, maybe you could contribute one."

This is from GitHub:

"Public repositories on GitHub are often used to share open source software. For your repository to truly be open source, you'll need to license it so that others are free to use, change, and distribute the software."

In other words, it is illegal to copy and use software unless the author specifically gives permission. The standard way to do that is with a license file, and it has to come from the author. If I wrote one and posted it, that would be identity theft and forgery, which is a felony. I know some people don't care about this, but some day down the line it could become a problem for you if you don't take the time to make sure you really have the author's permission to copy and use their work.

"4) Words that parse must not parse beyond the end of line in a standard system (unless they also refill). And REFILL has to work as described in the standard. As for avoiding copying, I think it is possible to read (or mmap) the whole file into a buffer, and then treat each line in that file as input buffer. REFILL then changes the address that SOURCE returns. No copying of buffer contents happening."

I'm still not understanding what difference it makes if you pass the entire file to PARSE and treat line delimiters as white space or if you pass lines and do REFILLS when the strings returned are exactly the same either way. I honestly can't think of a test case that would be able to test the difference that matters in practical use. Again, the standard does not say you have to do it this way, and why are you dictating implementation instead of end behavior? The stuff in section 3.4.1 is a limitation on what parsing returns, not on how much stuff you pass to PARSE.

JamesNorris

I mean treat line terminators as additional end delimiters, not white space. (I make mistakes :-)

AntonErtl

  4. SOURCE will produce different results. >IN @ and >IN ! will produce different results. REFILL will produce different results.

Of course, if your implementation makes sure that they do not (maybe I misunderstood your description of it), it may be standard-compliant.

Concerning "prescribing the implementation", that's not what is happening: An implementation that reads individual lines into memory on REFILL can be standard; an implementation that mmaps everything and then copies each line can be standard; an implementation that reads everything into a buffer and then lets SOURCE point into that buffer, in a different place after every REFILL can be standard. The implementation is not prescribed.

  6. I think you are mistaken that you would perform "identity theft and forgery" by adding a LICENSE file, even if it was not an accurate summary of the licenses of the individual source files.

But in any case, why demand of Gerry Jackson what you think you are not allowed to do yourself? He is not the author of all files in his collection, so by your reasoning he is not allowed to write a LICENSE file, either.

JamesNorris

"4) SOURCE will produce different results. >IN @ and >IN ! will produce different results. REFILL will produce different results."

My source does not support REFILL since it is based on the Forth 94 draft standard and there it's an optional word. If I did add REFILL it would always return false since someone is only supposed to call REFILL when the >IN value is equal to the length returned from SOURCE and in my Forth that means >IN is at the end of the file.

When loading from a file, my Forth's SOURCE returns the start address of the buffer the file was loaded into and the length of the entire file. If you use PARSE or PARSE-NAME in my FORTH using the values returned from SOURCE you will get an address pointing to the start of the named string and same length as any FORTH following the standard. >IN will have the offset of the correct next character following any PARSE or PARSE-NAME call (once I upload the next version with the fixes to make PARSE single line only that is). The only difference is my Forth handles the case where there are line terminators in the strings passed to EVALUATE, PARSE, and PARSE-NAME. In the standard I'm guessing this is an ambiguous condition?

Also, in my Forth, if PARSE ends on a line terminator, >IN will have the offset of the line terminator. In other standard Forths, that would technically be the length of the current input buffer and not be pointing to a valid character.

One of the ways above mentioned loading the entire file into a buffer, like mine, and having SOURCE return the starting addresses of each line and >IN being the offset in the current line. REFILL in this situation would then technically consume the line terminator? That's looking at the buffer twice, mine only looks once but looks harder.

Hmm yes technically that would be a difference, >IN is the offset in the current line in the standard, and SOURCE the start address of each line. Is there anything in the standard that says this? And is there anything that depends on this behavior? PARSE, PARSE-NAME, and EVALUATE do not depend on this behavior in my Forth. Line comment in my forth also works correctly. If someone were to write their own line comment and wanted to compare the >IN offset with the length returned from SOURCE then yes it would be a problem. If they did 0 PARSE to skip to the end of the line, it would work fine.

My suggestion for the standard is that it not specify that SOURCE be the start of each line and that >IN be the offset in the line. The only change I'm suggesting is that the definition of PARSE above say it goes to the end of the line if a delimiter is not found before then. That wasn't clear to me when I read the standard. If someone has a reason for why having SOURCE and >IN work this way is important then I'll probably change my mind, but really, I kind of like how my implementation only needs to look at stuff once, and how REFILL is not necessary.

On the copyright issue, I did not write the files. I can't tell the public that the author has given permission to copy and use the files without actually asking the author. Legally I would need something from each author of each file in writing to do something like that (an email from the author would work too). That's how US copyright law works. I know most people don't care about this and I've had problems on some of my jobs because they wanted me to copy or use copyrighted stuff and I said I'd need to contact the authors, or said no when it was a commercial application they wanted me to pirate. I've taken the time to contact authors in some of these situations and they usually say yes and are happy you want to use their stuff, especially if it's for a non profit use. I'm not interested in going through all that to use these test suites though.

ruv

"When loading from a file, my Forth's SOURCE returns the start address of the buffer the file was loaded into and the length of the entire file."

If you want to be standard compliant, just use another name for your flavor, and provide SOURCE with standard behavior. Ditto for other standard words. The Forth text interpreter is not obligated to use all these standard words, they can be provided just for programs.

You can even implement SOURCE in a lazily evaluated manner (with memoization): that is, it calculates the length on the first call after each pass of the line terminator. Your REFILL can just adjust your pointer in the entire file.
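A minimal sketch of that idea follows. All names are hypothetical, it assumes LF (character code 10) as the only line terminator and one address unit per character, and it computes the line length eagerly rather than lazily. It only shows the pointer arithmetic: a real REFILL would also set >IN to zero, and a real SOURCE would return these values only while the input source is such a buffer.

```forth
\ All names below are hypothetical; LF-terminated lines are assumed.
VARIABLE buf-addr   VARIABLE buf-len    \ the whole file, already in memory
VARIABLE line-addr  VARIABLE line-len   \ current line: what SOURCE would report

: buf-end ( -- addr )  buf-addr @ buf-len @ + ;

: line-length ( addr -- u )   \ characters from addr up to an LF or the buffer end
  DUP
  BEGIN
    DUP buf-end U< IF DUP C@ 10 <> ELSE FALSE THEN
  WHILE
    CHAR+
  REPEAT
  SWAP - ;

: set-line ( addr -- )  DUP line-addr !  line-length line-len ! ;

: first-line ( c-addr u -- )  buf-len !  DUP buf-addr !  set-line ;

: next-line ( -- flag )       \ REFILL-like step; false once the buffer is exhausted
  line-addr @ line-len @ +    \ address of the terminator, or the buffer end
  DUP buf-end U< 0= IF DROP FALSE EXIT THEN
  CHAR+ set-line TRUE ;       \ step over the LF; no characters are copied

: line-source ( -- c-addr u )  line-addr @ line-len @ ;
```

Because each line is just an address and a length inside the file buffer, a parsing word that stops at the end of the reported length never sees the line terminator.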

>IN is harder for virtualization (it would be better to have a setter and getter). When >IN was used, PARSE, PARSE-NAME and WORD should check changes of >IN value.

OTOH, the corresponding overhead will take place in some programs only, and not in the system itself.

"Also, in my Forth, if PARSE ends on a line terminator, >IN will have the offset of the line terminator. In other standard Forths, that would technically be the length of the current input buffer and not be pointing to a valid character."

A standard program cannot read a character beyond the input buffer. So it doesn't matter whether it is a valid character or not.

JamesNorris

I did some more reading and found that the interpreter section refers you to QUIT, and if you look up all the terms, a line is defined as a sequence of characters followed by a line terminator or implied line terminator. My forth doesn't implement the file extension word set so I didn't read that until today. I do something different.

I don't see why this is important to do... or if I do the work arounds above, why they are needed. The only thing I can think of is older Forths use preallocated memory regions to hold input. My Forth has dynamic memory allocation so I am not restricted by that. When I did my Forth I was thinking someone someday may be working on a terminal device and have something more complicated where they can put line terminators and other characters into the input stream. If you are using an operating system function to do ACCEPT like I am, I thought it would be a good idea to not assume that operating system would always follow the rules. I suppose you could pre-parse the 'line' returned from the operating system as if it were a file to pull out each line, but why?

Why is it so important that SOURCE refer to a line and >IN be the offset from the beginning of a line?

Who needs this behavior? Isn't it enough that SOURCE and >IN refer to the current parsing position?

In reality this is a restriction that line terminators can't be in the input buffer.

In any case, I don't think my Forth will be following this part of the standard. It's just kind of a bummer. I thought I did a good job of reading the standard and following everything, but it turns out I missed something.

JohanKotlinski

This thread is incredibly illuminating! I wish I saw it sooner, as I have struggled with the same problems.

I think I understand and agree now, that the ideal way to load a text file from a RAM buffer is to add a custom REFILL behaviour, which is able to get lines from a RAM buffer. The real shortcoming of the Forth standard then, is that regular users are not able to do this. The system-provided REFILL is set in stone by the forthwright, and cannot easily be altered.

This could be fixed by adding a standard way for users to add custom REFILL behaviors. The simplest idea I can think of (just to get an idea):

SET-REFILL ( source-id xt -- ) \ Set a custom REFILL for the given source id

JohanKotlinski

OK, maybe SET-REFILL above is a bit too simplistic. It might be better solved by a word like:

INCLUDE-XT ( source-id xt -- ) \ Include a text file using a custom SOURCE-ID and REFILL execution token.

It is fair to point out, such a word is probably already compatible with the standard, and the standard does not need to explicitly provide it. I will continue to experiment in this direction, and report if the results seem generally useful.
