Post by Carl Gundel on Nov 23, 2019 23:32:16 GMT -5
Okay, I've implemented the encoder parameter for OPEN. Here is an example that extends the example that Chris shared when the matter first came to our attention.
temp$ = "€"
print "Write to disk using UTF-8"
print "Length: ";len(temp$)
print "Character: ";temp$
print "ASCII value: ";asc(temp$)
open "test.txt" for output as #file encoder = utf_8
print #file, temp$;
close #file
print
print "----------------------------------------"
print
print "Read from disk using UTF-8"
open "test.txt" for input as #file encoder = utf_8
temp2$ = input$(#file, 1)
close #file
print temp$, "vs", temp2$
if temp$ = temp2$ then print "<EQUAL>"
if temp$ <> temp2$ then print "<NOT EQUAL>"
print
print "Length: ";len(temp2$)
print "Character: ";temp2$
print "ASCII value: ";asc(temp2$)
print
print "----------------------------------------"
print
print "Read again but use ISO8859-1"
open "test.txt" for input as #file    'by default encoder = iso8859_1
temp2$ = input$(#file, lof(#file))    'line input #file, temp2$
close #file
print temp$, "vs", temp2$
if temp$ = temp2$ then print "<EQUAL>"
if temp$ <> temp2$ then print "<NOT EQUAL>"
print "Length: ";len(temp2$)
print "Characters: "
for x = 1 to len(temp2$)
    ch$ = mid$(temp2$, x, 1)
    print " "; ch$, asc(ch$)
next x
Here is the output of the code when run.
Write to disk using UTF-8
Length: 1
Character: €
ASCII value: 8364
----------------------------------------
Read from disk using UTF-8
€             vs            €
<EQUAL>
Length: 1
Character: €
ASCII value: 8364
----------------------------------------
Read again but use ISO8859-1
€             vs            ⬠
<NOT EQUAL>
Length: 3
Characters:
 â            226
              130
 ¬            172
The block character is actually invisible when LB prints it.
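The three byte values in that last run, 226, 130 and 172, are exactly the UTF-8 encoding of the Euro sign. Since that is a property of the encodings themselves and not of LB, here is a quick check sketched in Python for illustration:

```python
# The Euro sign (code point 8364) encodes to three bytes in UTF-8.
euro = "\u20ac"                      # the Euro sign
utf8_bytes = euro.encode("utf-8")
print(list(utf8_bytes))              # [226, 130, 172]

# Decoding those same three bytes as ISO-8859-1 gives three separate
# one-byte characters: a-circumflex (226), an invisible C1 control
# character (130), and the not sign (172) -- matching the output above.
misread = utf8_bytes.decode("iso-8859-1")
print(len(misread), [ord(c) for c in misread])   # 3 [226, 130, 172]
```

The invisible "block" character is the C1 control at code 130, which explains why nothing shows between the â and the ¬.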
I hope this makes sense.
Any feedback welcome as always.
Post by Chris Iverson on Nov 24, 2019 3:20:41 GMT -5
Interesting. I think this confirms that LB uses UTF-8 internally: the bytes it writes out for the Euro symbol when you tell it to use ISO8859-1 (which doesn't support the Euro symbol) are the three bytes that encode the symbol in UTF-8.
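For comparison (this is Python's behavior, not LB's): a strict codec typically refuses to encode a character the target charset lacks, rather than silently falling back to the raw UTF-8 bytes the way LB appears to:

```python
# ISO-8859-1 has no code for the Euro sign, so a strict encoder
# raises an error instead of writing fallback bytes to the file.
try:
    "\u20ac".encode("iso-8859-1")
except UnicodeEncodeError as err:
    print("cannot encode:", err.reason)
```

Writing the raw UTF-8 bytes instead, as LB does here, is a deliberate fallback choice rather than the only possible behavior.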
If it's told to write out a character that the chosen encoding doesn't support, it writes that character's raw bytes out instead.
Post by Rod on Nov 24, 2019 6:03:32 GMT -5
Does this not just show that you must use the same encoder for output as input?
If it is natively UTF-8, and it seems easy to use and understand as such, what is the case for ISO8859-1? I think you said it offers single-byte encoding and more usable characters. Is single-byte encoding really necessary?
Using UTF-8 we appear to get len(), mid$() and all the other string handling working correctly at character level despite the hidden additional bytes. So we just have a bigger range of characters. At string level it's character by character.
Using ISO8859-1 we appear to break that, and could get three characters back for the Euro symbol. That would spoil string handling, would it not?
When we step outside strings we are back to bytes in a file. If I build a file I don't think I am going to try to save the Euro symbol as a byte; I would just use the 0-255 ASC range to define my byte values or numbers. If I am saving a string I will either have terminators or allow space for variation in the size of the string.
So, is this single-byte encoding really needed? I kind of think it is like the old chr$(0)-chr$(31) range: you would never think to put these in strings unless specifically required. But playing with binary or bytes on file, you think nothing of using chr$(0)-chr$(31) through to chr$(255). I would now never think of using chr$(8364) other than in a string for storage or display.
Is there a hint of swimming against the tide in using an encoder that is not the native choice?
I don't know enough about it and I shouldn't be commenting. A test rig would help everyone, 351?
Post by Carl Gundel on Nov 24, 2019 14:50:02 GMT -5
Rod wrote:
Does this not just show that you must use the same encoder for output as input? If it is natively UTF-8, and it seems easy to use and understand as such, what is the case for ISO8859-1? I think you said it offers single-byte encoding and more usable characters. Is single-byte encoding really necessary? Using UTF-8 we appear to get len(), mid$() and all the other string handling working correctly at character level despite the hidden additional bytes. So we just have a bigger range of characters. At string level it's character by character. Using ISO8859-1 we appear to break that, and could get three characters back for the Euro symbol. That would spoil string handling, would it not? When we step outside strings we are back to bytes in a file. If I build a file I don't think I am going to try to save the Euro symbol as a byte; I would just use the 0-255 ASC range to define my byte values or numbers. If I am saving a string I will either have terminators or allow space for variation in the size of the string. So, is this single-byte encoding really needed? I kind of think it is like the old chr$(0)-chr$(31) range: you would never think to put these in strings unless specifically required. But playing with binary or bytes on file, you think nothing of using chr$(0)-chr$(31) through to chr$(255). I would now never think of using chr$(8364) other than in a string for storage or display. Is there a hint of swimming against the tide in using an encoder that is not the native choice? I don't know enough about it and I shouldn't be commenting. A test rig would help everyone, 351?

The single-byte encoding is worlds easier when you start doing things in binary, or if you have fixed-length random access fields, or if you want to have a dynamically accessed file structure. In addition, all earlier versions of LB use the single-byte model, so there is a good argument for this to be the default mode IMHO.
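The fixed-length-field point can be made concrete. The sketch below uses Python for illustration, but the arithmetic is the same in any language: with a single-byte encoding, n characters always occupy n bytes, so record offsets can be computed directly; with UTF-8 they cannot.

```python
# A 4-character field occupies exactly 4 bytes in ISO-8859-1...
field = "na\u00efv"                        # "naïv", 4 characters
print(len(field.encode("iso-8859-1")))     # 4
# ...but 5 bytes in UTF-8, because the i-diaeresis takes two bytes.
print(len(field.encode("utf-8")))          # 5
# So seeking to record n at byte offset n * record_size only works
# when every character is guaranteed to be exactly one byte.
```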
Post by Rod on Nov 24, 2019 16:25:18 GMT -5
As I read it, if you simply use the ASC 0-255 range you get single bytes in UTF-8, so compatible? If you step outside ASC 0-255 you get a vast range of characters that will be multi-byte. But you would not do that inadvertently; you would do that intentionally. Not for RAF, not for binary, not for dynamic access that seeks or skips defined byte boundaries. These "rogue" characters are only relevant to fancy text displays. It's just like chr$(10) and chr$(13) in text. You currently expect to have unseen control characters in text. I don't see an extended multi-byte character embedded in a text file being that much different.
Post by Rod on Nov 24, 2019 16:35:55 GMT -5
Soon we will be discussing sorting.
Post by Carl Gundel on Nov 24, 2019 17:35:24 GMT -5
Rod wrote:
As I read it, if you simply use the ASC 0-255 range you get single bytes in UTF-8, so compatible? If you step outside ASC 0-255 you get a vast range of characters that will be multi-byte. But you would not do that inadvertently; you would do that intentionally. Not for RAF, not for binary, not for dynamic access that seeks or skips defined byte boundaries. These "rogue" characters are only relevant to fancy text displays. It's just like chr$(10) and chr$(13) in text. You currently expect to have unseen control characters in text. I don't see an extended multi-byte character embedded in a text file being that much different.

No, only the values 0-127 are single byte in UTF-8, so once the high bit is set a character becomes 2 or more bytes.
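That boundary is easy to verify. Byte counts are a property of UTF-8 itself, not of any one language; a quick sketch in Python:

```python
# UTF-8 byte length per code point: 0-127 take one byte,
# 128-2047 take two bytes, 2048-65535 take three bytes.
# So chr$(128)-chr$(255) are already two bytes each, not one.
for code in (65, 127, 128, 255, 8364):
    print(code, len(chr(code).encode("utf-8")))
# 65 1 / 127 1 / 128 2 / 255 2 / 8364 3
```

This is why a program that treats characters 128-255 as single bytes will miscount offsets the moment its file is written with the UTF-8 encoder.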