Post by Carl Gundel on Nov 23, 2019 23:32:16 GMT -5
Okay, I've implemented the encoder parameter for OPEN. Here is an example that extends the example that Chris shared when the matter first came to our attention.
temp$ = "€"
print "Write to disk using UTF-8"
print "Length: ";len(temp$)
print "Character: ";temp$
print "ASCII value: ";asc(temp$)
open "test.txt" for output as #file encoder = utf_8
print #file, temp$;
close #file
print
print "----------------------------------------"
print
print "Read from disk using UTF-8"
open "test.txt" for input as #file encoder = utf_8
temp2$ = input$(#file, 1)
close #file
print temp$, "vs", temp2$
if temp$ = temp2$ then print "<EQUAL>"
if temp$ <> temp2$ then print "<NOT EQUAL>"
print
print "Length: ";len(temp2$)
print "Character: ";temp2$
print "ASCII value: ";asc(temp2$)
print
print "----------------------------------------"
print
print "Read again but use ISO8859-1"
open "test.txt" for input as #file    'by default encoder = iso8859_1
temp2$ = input$(#file, lof(#file))    'line input #file, temp2$
close #file
print temp$, "vs", temp2$
if temp$ = temp2$ then print "<EQUAL>"
if temp$ <> temp2$ then print "<NOT EQUAL>"
print "Length: ";len(temp2$)
print "Characters: "
for x = 1 to len(temp2$)
    ch$ = mid$(temp2$, x, 1)
    print " "; ch$, asc(ch$)
next x
Here is the output of the code when run.
Write to disk using UTF-8
Length: 1
Character: €
ASCII value: 8364
----------------------------------------
Read from disk using UTF-8
€             vs            €
<EQUAL>
Length: 1
Character: €
ASCII value: 8364
----------------------------------------
Read again but use ISO8859-1
€             vs            ⬠
<NOT EQUAL>
Length: 3
Characters:
 â            226
              130
 ¬            172
The block character is actually invisible when LB prints it.
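The three byte values in that last run, 226, 130 and 172, are exactly the UTF-8 encoding of the Euro sign. Since that is a property of the encodings themselves and not of LB, here is a quick check sketched in Python for illustration:

```python
# The Euro sign (code point 8364) encodes to three bytes in UTF-8.
euro = "\u20ac"                      # the Euro sign
utf8_bytes = euro.encode("utf-8")
print(list(utf8_bytes))              # [226, 130, 172]

# Decoding those same three bytes as ISO-8859-1 gives three separate
# one-byte characters: a-circumflex (226), an invisible C1 control
# character (130), and the not sign (172) -- matching the output above.
misread = utf8_bytes.decode("iso-8859-1")
print(len(misread), [ord(c) for c in misread])   # 3 [226, 130, 172]
```

The invisible "block" character is the C1 control at code 130, which explains why nothing shows between the â and the ¬.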
I hope this makes sense.
Any feedback welcome as always.
Post by Chris Iverson on Nov 24, 2019 3:20:41 GMT -5
Interesting. I think this confirms that LB uses UTF-8 internally: the bytes it writes out for the Euro symbol when you tell it to use ISO8859-1 (which doesn't support the Euro symbol) are the three bytes that encode the symbol in UTF-8.
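For comparison (this is Python's behavior, not LB's): a strict codec typically refuses to encode a character the target charset lacks, rather than silently falling back to the raw UTF-8 bytes the way LB appears to:

```python
# ISO-8859-1 has no code for the Euro sign, so a strict encoder
# raises an error instead of writing fallback bytes to the file.
try:
    "\u20ac".encode("iso-8859-1")
except UnicodeEncodeError as err:
    print("cannot encode:", err.reason)
```

Writing the raw UTF-8 bytes instead, as LB does here, is a deliberate fallback choice rather than the only possible behavior.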
If it's told to write out a character that the chosen encoding doesn't support, it writes that character's raw bytes out instead.
Post by Rod on Nov 24, 2019 6:03:32 GMT -5
Does this not just show that you must use the same encoder for output as input?
If it is natively UTF-8, and it seems easy to use and understand as such, what is the case for ISO8859-1? I think you said it offers single-byte encoding and more usable characters. Is single-byte encoding really necessary?
Using UTF-8 we appear to get len(), mid$() and all the other string handling working correctly at character level despite the hidden additional bytes. So we just have a bigger range of characters. At string level it's character by character.
Using ISO8859-1 we appear to break that, and could get three characters back for the Euro symbol. That would spoil string handling, would it not?
When we step outside strings we are back to bytes in a file. If I build a file I don't think I am going to try to save the Euro symbol as a byte; I would just use the 0-255 ASC range to define my byte values or numbers. If I am saving a string I will either have terminators or allow space for variation in the size of the string.
So, is this single-byte encoding really needed? I kind of think it is like the old chr$(0)-chr$(31) range: you would never think to put these in strings unless specifically required. But playing with binary or bytes on file, you think nothing of using chr$(0)-chr$(31) through to chr$(255). I would now never think of using chr$(8364) other than in a string for storage or display.
Is there a hint of swimming against the tide in using an encoder that is not the native choice?
I don't know enough about it and I shouldn't be commenting. A test rig would help everyone, 351?
Post by Carl Gundel on Nov 24, 2019 14:50:02 GMT -5
Rod wrote:
Does this not just show that you must use the same encoder for output as input? If it is natively UTF-8, and it seems easy to use and understand as such, what is the case for ISO8859-1? I think you said it offers single-byte encoding and more usable characters. Is single-byte encoding really necessary? Using UTF-8 we appear to get len(), mid$() and all the other string handling working correctly at character level despite the hidden additional bytes. So we just have a bigger range of characters. At string level it's character by character. Using ISO8859-1 we appear to break that, and could get three characters back for the Euro symbol. That would spoil string handling, would it not? When we step outside strings we are back to bytes in a file. If I build a file I don't think I am going to try to save the Euro symbol as a byte; I would just use the 0-255 ASC range to define my byte values or numbers. If I am saving a string I will either have terminators or allow space for variation in the size of the string. So, is this single-byte encoding really needed? I kind of think it is like the old chr$(0)-chr$(31) range: you would never think to put these in strings unless specifically required. But playing with binary or bytes on file, you think nothing of using chr$(0)-chr$(31) through to chr$(255). I would now never think of using chr$(8364) other than in a string for storage or display. Is there a hint of swimming against the tide in using an encoder that is not the native choice? I don't know enough about it and I shouldn't be commenting. A test rig would help everyone, 351?

The single-byte encoding is worlds easier when you start doing things in binary, or if you have fixed-length random access fields, or if you want to have a dynamically accessed file structure. In addition, all earlier versions of LB use the single-byte model, so there is a good argument for this to be the default mode IMHO.
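The fixed-length-field point can be made concrete. The sketch below uses Python for illustration, but the arithmetic is the same in any language: with a single-byte encoding, n characters always occupy n bytes, so record offsets can be computed directly; with UTF-8 they cannot.

```python
# A 4-character field occupies exactly 4 bytes in ISO-8859-1...
field = "na\u00efv"                        # "naïv", 4 characters
print(len(field.encode("iso-8859-1")))     # 4
# ...but 5 bytes in UTF-8, because the i-diaeresis takes two bytes.
print(len(field.encode("utf-8")))          # 5
# So seeking to record n at byte offset n * record_size only works
# when every character is guaranteed to be exactly one byte.
```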
Post by Rod on Nov 24, 2019 16:25:18 GMT -5
As I read it, if you simply use the ASC 0-255 range you get single bytes in UTF-8, so compatible? If you step outside ASC 0-255 you get a vast range of characters that will be multi-byte. But you would not do that inadvertently; you would do that intentionally. Not for RAF, not for binary, not for dynamic access that seeks or skips defined byte boundaries. These "rogue" characters are only relevant to fancy text displays. It's just like chr$(10) and chr$(13) in text. You currently expect to have unseen control characters in text. I don't see an extended multi-byte character embedded in a text file being that much different.
Post by Rod on Nov 24, 2019 16:35:55 GMT -5
Soon we will be discussing sorting.
Post by Carl Gundel on Nov 24, 2019 17:35:24 GMT -5
Rod wrote:
As I read it, if you simply use the ASC 0-255 range you get single bytes in UTF-8, so compatible? If you step outside ASC 0-255 you get a vast range of characters that will be multi-byte. But you would not do that inadvertently; you would do that intentionally. Not for RAF, not for binary, not for dynamic access that seeks or skips defined byte boundaries. These "rogue" characters are only relevant to fancy text displays. It's just like chr$(10) and chr$(13) in text. You currently expect to have unseen control characters in text. I don't see an extended multi-byte character embedded in a text file being that much different.

No, only the values 0-127 are single byte in UTF-8, so once the high bit is set a character becomes 2 or more bytes.
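That boundary is easy to verify. Byte counts are a property of UTF-8 itself, not of any one language; a quick sketch in Python:

```python
# UTF-8 byte length per code point: 0-127 take one byte,
# 128-2047 take two bytes, 2048-65535 take three bytes.
# So chr$(128)-chr$(255) are already two bytes each, not one.
for code in (65, 127, 128, 255, 8364):
    print(code, len(chr(code).encode("utf-8")))
# 65 1 / 127 1 / 128 2 / 255 2 / 8364 3
```

This is why a program that treats characters 128-255 as single bytes will miscount offsets the moment its file is written with the UTF-8 encoder.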