Post by grimblefritz on Nov 19, 2020 18:26:00 GMT -5
inputcsv is somewhat broken. I've already posted about the need to be able to define what counts as a quote character, a delimiter, and an end-of-record marker, and also the need to parse strings, not just files.
However, the current inputcsv doesn't seem to respect the conventions of a proper CSV file format!
'contents of test.txt
'
'1,11,22,33
'2,44,55,66,zz
'3,77,88,99
'4,00
'5,bb,cc,dd
'6
open "test.txt" for input as #test
while not(eof(#test))
    inputcsv #test, a$, b$, c$, d$
    print "a=";a$;" b=";b$;" c=";c$;" d=";d$
wend
close #test
'resulting output
'
'a=1 b=11 c=22 d=33
'a=2 b=44 c=55 d=66
'a=zz b=3 c=77 d=88
'a=99 b=4 c=00 d=5
'a=bb b=cc c=dd d=6
Note three issues.
Issue 1
As evidenced by the third row of output, inputcsv merrily ignored the EOR marker (i.e., end of line) of the second input line. Instead, it read the last field of the second line ("zz") and used it as the first field of the third inputcsv, then continued with the first field of the third line as its next input.
Issue 2
The above also illustrates that inputcsv does neither of the common things done with extra fields beyond those named in the inputcsv statement. Most systems ignore (drop) any extra fields; some (rare) systems hand the remainder of the input record to the last field.
Issue 3
The fourth line of output illustrates that inputcsv also doesn't know what to do when a record contains fewer fields than are specified to inputcsv. As in the example above, it simply continues reading from the next line.
All of the above indicates to me that LB is not doing a line input and then parsing that line into fields (which is what a proper CSV parser should do). Instead, it gives the impression of parsing the file directly, beginning to end, looking for the comma delimiter (respecting quotes along the way) and simply ignoring EOR.
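For comparison, here's the line-then-parse behavior most CSV libraries use, sketched in Python against the same test data (Python is just for illustration here, obviously not LB):

```python
import csv
import io

# The same six test records as test.txt above.
data = "1,11,22,33\n2,44,55,66,zz\n3,77,88,99\n4,00\n5,bb,cc,dd\n6\n"

# csv.reader does a line input first, then splits that line into
# fields (respecting quotes), so every record ends at the EOR marker.
for row in csv.reader(io.StringIO(data)):
    print(row)
# Row 2 comes back with five fields, row 4 with two, row 6 with one --
# no field ever bleeds across a line boundary.
```

Note that the reader reports the actual field count per record; it's then up to the caller (or, in my proposal, inputcsv itself) to decide what to do with extras or shortfalls.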
This all combines to make inputcsv fragile, imo, and not useful for a lot of real-world cases.
I propose, as a non-breaking fix, to enhance the existing inputcsv with a command to set a strict (or proper) csv mode:
strictcsv true|false (default to false)
If set to true, then inputcsv would behave as follows:
* Respect EOR markers (ie, do a line input and then parse that)
* If input record contains more fields than specified to inputcsv, ignore the remaining data
* If input record contains fewer fields than specified to inputcsv, set the remaining vars to null
In the second case above (more fields than vars), an option might be to have the last inputcsv var simply consume the rest of the record. This could be controlled via something like:
greedycsv true|false (default to false)
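To pin down how the two flags would interact, here's a sketch of the proposed field-to-var mapping for one record (again in Python purely to illustrate the semantics; the function and flag names are mine, not LB's):

```python
import csv
import io

def read_record(line, nvars, greedy=False):
    """Map one CSV line onto exactly nvars fields, per the
    proposed strictcsv rules (hypothetical sketch)."""
    fields = next(csv.reader(io.StringIO(line)))
    if len(fields) > nvars:
        if greedy:
            # greedycsv true: last var consumes the rest of the record
            # (rejoining the tail fields approximates the raw remainder)
            fields = fields[:nvars - 1] + [",".join(fields[nvars - 1:])]
        else:
            # strictcsv default: ignore (drop) the extra fields
            fields = fields[:nvars]
    else:
        # fewer fields than vars: set the remaining vars to null
        fields += [""] * (nvars - len(fields))
    return fields

print(read_record("2,44,55,66,zz", 4))        # extras dropped
print(read_record("2,44,55,66,zz", 4, True))  # last var is greedy
print(read_record("4,00", 4))                 # remaining vars null
```

Either way, every inputcsv consumes exactly one record, so a short or long line can never shift the following records out of alignment.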
(An aside, it would be nice to see LB4 Windows continue development in parallel with LB5 AgnOStic - my term for multiplatform stuff - at least until LB5 is fully capable to replace LB4.)