Post by Carl Gundel on Nov 19, 2019 15:48:58 GMT -5
CarlGundelLast Sunday at 8:38 PM
I've been playing with file encoding. For some reason I was having bad luck with the Windows-1252 encoder (the default encoding in the Smalltalk I'm using), which is supposed to be a single-byte-per-character extended ASCII encoding similar to ISO-8859-1. I finally realized that the default encoder is reading two bytes at a time, which seems like a bug. When I switched to ISO-8859-1 encoding it was reading one byte per character properly.
I'd prefer to use the Windows-1252 encoder because it has more actual characters and is a superset of ISO-8859-1 in that respect.
So the good news is that I have the source code for the misbehaving encoder, so perhaps I will be able to fix it.
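For reference, the relationship between the two encodings is easy to check with Python's built-in codecs; a minimal sketch, with cp1252 and iso-8859-1 standing in for the Smalltalk encoders:

    # ISO-8859-1 maps every byte 0x00-0xFF straight to the same Unicode code point.
    assert bytes(range(256)).decode("iso-8859-1") == "".join(chr(i) for i in range(256))

    # Windows-1252 agrees with ISO-8859-1 everywhere except 0x80-0x9F,
    # where it defines printable characters instead of C1 control codes.
    same = [i for i in range(256) if not 0x80 <= i <= 0x9F]
    assert all(bytes([i]).decode("cp1252") == bytes([i]).decode("iso-8859-1") for i in same)

    print(b"\x80".decode("cp1252"))      # '€' - a real character in 1252
    print(b"\x80".decode("iso-8859-1"))  # '\x80' - an invisible C1 control code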
Chris IversonLast Sunday at 8:41 PM
that is weird
hopefully you're able to pinpoint it
cundoLast Sunday at 8:44 PM
I think ANSI 1252 is what I use with subtitles in movies
CarlGundelLast Sunday at 8:45 PM
Yeah, the thing is that build 150 uses the Windows-1252 encoder, and that bug is why it doesn't work correctly.
My thinking is that a single-byte extended ASCII encoding will be the default for LB5, and if you want UTF-8 or some other encoder you will need to open the file and then apply the encoder in a separate statement.
cundoLast Sunday at 8:51 PM
Optional in the file dialog?
I use that in AkelPad. It says open with encoding, and a combobox lets you select one
CarlGundelLast Sunday at 9:01 PM
No, I mean open using the OPEN statement.
cundoLast Sunday at 9:01 PM
Ah ok
When coding
CarlGundelLast Sunday at 9:02 PM
The file dialog is for picking a file. You can decide how to handle the encoding in your BASIC program.
I'm open to suggestions of course.
OPEN "filepath" for output as #filehandle
ENCODER #filehandle, UTF-8

Or something like that.

Or maybe on one line.
cundoLast Sunday at 9:10 PM
What would happen if no encoder is chosen?
CarlGundelLast Sunday at 9:10 PM
OPEN "filepath" for output as #filehandle encoder UTF-8
Chris IversonLast Sunday at 9:10 PM
uses the default encoding
CarlGundelLast Sunday at 9:11 PM
Yup.
Chris IversonLast Sunday at 9:11 PM
I think the second option would work best, a modifier for the OPEN statement
simply because it falls in line with the LEN modifier for random access files that already exists
cundoLast Sunday at 9:12 PM
OPEN "filepath" for output as #filehandle with encoder utf-8
Chris IversonLast Sunday at 9:12 PM
OPEN "filepath" FOR OUTPUT AS #filehandle ENCODING="UTF-8" or whatever
CarlGundelLast Sunday at 9:12 PM
Yeah something like that.
Chris IversonLast Sunday at 9:12 PM
or anything specified above, just throwing out a rationale for making it a statement modifier
although
hmm
CarlGundelLast Sunday at 9:13 PM
I don't feel completely comfortable adding syntax to the OPEN statement, but maybe it is the right thing.
Chris IversonLast Sunday at 9:13 PM
I was going to mention a benefit of having a separate ENCODING command
in that you could then change the encoding of a file on the fly
but I don't see that having much use
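For comparison, Python's standard library happens to support both of the shapes being debated here, including changing the encoding on the fly. A sketch by analogy only, not the proposed LB5 implementation; the file names are hypothetical:

    import io

    # Shape 1: encoding as a modifier on the open call,
    # like OPEN ... FOR OUTPUT AS #h ENCODING="UTF-8"
    with open("out1.txt", "w", encoding="utf-8") as f:
        f.write("héllo\n")

    # Shape 2: open the raw byte stream first, then attach an encoder
    # as a separate step, like OPEN followed by a separate ENCODER statement
    raw = open("out2.txt", "wb")
    f = io.TextIOWrapper(raw, encoding="utf-8")
    f.write("héllo\n")
    f.reconfigure(encoding="latin-1")  # changing the encoding mid-stream (Python 3.7+)
    f.write("wörld\n")
    f.close()  # also closes the underlying raw stream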
CarlGundelLast Sunday at 9:14 PM
UTF-8 random access files will be kind of a messy affair because it isn't possible to know exactly how many bytes will be required by any particular string.
Chris IversonLast Sunday at 9:14 PM
and I DO see that being potentially buggy
depending on how well Smalltalk supports having all that be changed
CarlGundelLast Sunday at 9:14 PM
So for UTF-8 it makes sense to say that the FIELD statement specifies each item size in bytes, not characters?
Chris IversonLast Sunday at 9:15 PM
I think that would be the best option.
really, the only option. If the field size can vary based on the characters used, that pretty much messes everything up
CarlGundelLast Sunday at 9:15 PM
For single-byte-per-character strings this is already true.
Yup, and this also makes it hard to set the file stream position precisely.
So when a byte is also a character, setting the position is trivial, but not so with variable-length characters.
Fun right?
It's not anybody's fault really.
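The variable width is easy to see in Python; the sample strings here are hypothetical:

    s1 = "ABCDEF"   # 6 characters, all ASCII
    s2 = "ÀBÇDÉF"   # 6 characters, three of them non-ASCII

    print(len(s1), len(s1.encode("utf-8")))  # 6 6
    print(len(s2), len(s2.encode("utf-8")))  # 6 9

    # Same character count, different byte count - so a field size or a
    # stream position given in characters has no fixed byte offset.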
Chris IversonLast Sunday at 9:18 PM
Yeah, if you wanted to do character positioning in a UTF8 file, you'd have to constantly be reading small chunks to check for continuation bytes
UTF8 at least has a small advantage over UTF16 in that you can always tell from a byte's high bits whether it starts a character
CarlGundelLast Sunday at 9:18 PM
Or read from the beginning of the file to count characters. Ugh.
cundoLast Sunday at 9:18 PM
:open_mouth:
Chris IversonLast Sunday at 9:19 PM
so moving forward and backward by characters in a string is at least doable
whereas UTF16 surrogate pairs can be nearly anything
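That lead-byte property is what makes scanning possible. A small Python sketch; char_starts is a hypothetical helper, not an LB5 or Smalltalk function:

    def char_starts(data: bytes) -> list:
        # Continuation bytes in UTF-8 always look like 0b10xxxxxx, so any
        # byte where (b & 0xC0) != 0x80 begins a new character.
        return [i for i, b in enumerate(data) if b & 0xC0 != 0x80]

    data = "a€b".encode("utf-8")  # b'a\xe2\x82\xacb'
    print(char_starts(data))      # [0, 1, 4]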
CarlGundelLast Sunday at 9:19 PM
For many kinds of applications this will not be an issue.
But if you do clever jumping around in files then it gets tricky. So, probably the best thing is to structure files into equal-length byte segments and then you store your information in these segments. When you want to access some particular data you have known boundary positions by byte.
Kinda wasteful of disk space, but with terabyte hard drives maybe it doesn't matter anymore. I know, sacrilege. :wink:
Of course with random access files you will just need to make the field sizes larger to compensate for the variable lengths of the strings.
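A sketch of the equal-length-segment idea in Python; SEG, pack, and read_segment are hypothetical names and the sizes are illustrative:

    SEG = 32  # fixed segment size in bytes

    def pack(s: str) -> bytes:
        b = s.encode("utf-8")
        if len(b) > SEG:
            raise ValueError("string does not fit in one segment")
        return b.ljust(SEG, b" ")  # pad with spaces to the full segment width

    def read_segment(f, index: int) -> str:
        f.seek(index * SEG)        # segment boundaries are known byte offsets
        return f.read(SEG).decode("utf-8").rstrip()

    with open("records.dat", "wb") as f:
        f.write(pack("résumé") + pack("plain"))
    with open("records.dat", "rb") as f:
        print(read_segment(f, 1))  # 'plain'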
CarlGundelToday at 1:22 PM
Okay, the weird behavior of the 1252 encoder is not a bug, but a feature.
cundoToday at 1:23 PM
What
Chris IversonToday at 1:23 PM
^
CarlGundelToday at 1:23 PM
The issue stems from the fact that CP1252 was originally designed as a single byte encoding, and a lot of information on the Internet defines it as such. Microsoft and IBM claimed that it was a standard, but it was never actually ANSI approved in spite of their claims.
So in practice CP1252 became a not-quite-single-byte encoding: many of the characters between $80 and $9F map to Unicode code points that take two bytes in memory.
So in the example that you posted, Chris, the value 128 comes back different when read in because it was converted to the Unicode euro sign, which is two bytes. The same encoder should write it back out as 128 to the file. I know, it's weird.
I'm gonna try that round trip and report back.
So, that leaves me a bit stumped.
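That round trip is easy to try in Python, whose cp1252 codec behaves the way described above; a sketch only, the Smalltalk encoder may differ:

    raw = b"\x80"                       # one byte on disk
    s = raw.decode("cp1252")            # '€', Unicode code point U+20AC
    print(s, hex(ord(s)))               # € 0x20ac
    assert s.encode("cp1252") == raw    # writes back out as the single byte 0x80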
Chris IversonToday at 1:29 PM
Well, as long as it's consistent, that's fine, but that still leaves us needing some way of accessing files without going through an encoder
CarlGundelToday at 1:29 PM
I think that LB users will expect single byte ASCII compatibility by default. Perhaps ISO-8859-1 will be a better default for LB5.
Chris IversonToday at 1:29 PM
Just raw data
Plus that's a good point about single-byte sets being expected
CarlGundelToday at 1:31 PM
It's not really consistent in the sense that you write 128 and it comes back different. On the flip side, if your keyboard has the euro sign and you type it in and save it to a file, it will be saved as 128, and when you read it back in it will convert to the two-byte version. Reading the file in binary mode will not give you what you expect.
And on top of that, the programmer might have a mind-bending experience when using chr$() and asc(). Clearly asc() is now a misnomer.
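Python's ord() and chr() show the same surprise that asc() and chr$() would; how LB5 will actually behave here is still an open design question:

    s = b"\x80".decode("cp1252")  # read byte 128 through the 1252 encoder
    print(s)                      # '€'
    print(ord(s))                 # 8364 - an asc()-style call no longer returns 128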
Chris IversonToday at 1:34 PM
actually, that's exactly the behavior I was expecting currently.
the program is Unicode internally.
so of course the encoder converts it to Unicode when importing it
and converts it to CP1252 when exporting it
but I also expect there to be some sort of null/raw encoder
CarlGundelToday at 1:35 PM
On the other hand if I write a file using CP1252 and read it back in using ISO-8859-1 the result will be some lost characters, but you can read the file in and almost all the characters will be correct.
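A Python sketch of that cross-decoding, with a hypothetical sample string:

    data = "Price: €5".encode("cp1252")  # the euro is stored as byte 0x80
    print(data.decode("cp1252"))         # 'Price: €5' - round-trips correctly
    print(repr(data.decode("iso-8859-1")))
    # 'Price: \x805' - every byte survives and most characters are right,
    # but the euro turns into the invisible C1 control character U+0080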
Chris IversonToday at 1:35 PM
where no transformation is done on the input/output
CarlGundelToday at 1:35 PM
This is all pretty messy.
Chris IversonToday at 1:36 PM
I think you'd have the same issue in ISO-8859-1, actually
CarlGundelToday at 1:37 PM
Not exactly the same.
If you're using a US keyboard and you never use the extended characters defined in CP1252, then you are always mapping characters to bytes and back one-to-one.
ISO-8859-1 is really a single byte extended ASCII.
CP1252 was intended to be this also but it didn't stay that way.
This was a serious mistake. They shouldn't have changed the spec without renaming it.
Chris IversonToday at 1:39 PM
I don't actually think CP1252 is multibyte.
the "anything on the US keyboard" holds true in CP1252 as it does in ISO-8859
CarlGundelToday at 1:40 PM
And unfortunately CP1252 is the standard for Windows for good or ill.
Chris IversonToday at 1:40 PM
it's anything above byte 128 that starts causing problems
and I think that'd still be true in ISO-8859
CarlGundelToday at 1:42 PM
It is single byte on disk. In practice, when you read it into memory you have to turn the 32 characters between $80 and $9F into double-byte values. This makes it schizophrenic.
Chris IversonToday at 1:42 PM
that's not because of the standard.
that's because it's being converted to unicode.
CarlGundelToday at 1:42 PM
Only the characters between $80 and $9F are a problem in ISO-8859-1.
Yeah, I understand that CP1252 is supposed to be single byte. In practice it is more complicated because the people who created Unicode didn't consider CP1252 when they designed their character values. That is why the encoding poses this surprising mismatch.
Even though Microsoft claimed that CP1252 is an ANSI standard and they even called it ANSI 1252 back in the day, it was never actually approved. If it had been approved perhaps the Unicode people would have given it more consideration.
Chris IversonToday at 1:46 PM
hmm
CarlGundelToday at 1:46 PM
Messy, right?
Chris IversonToday at 1:46 PM
I see, the code points defined in ISO-8859-1 still hold in Unicode
À is 0x00C0 (lowercase à is 0x00E0)
although À is 0xC0 in 1252 as well
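That spot check, verified with Python's codecs:

    print(hex(ord("À")), hex(ord("à")))       # 0xc0 0xe0 - the Unicode code points
    assert b"\xc0".decode("iso-8859-1") == "À"
    assert b"\xc0".decode("cp1252") == "À"    # 0xC0 is À in 1252 as well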
CarlGundelToday at 1:47 PM
Probably the most important thing for LB5 going forward is that we write a really good explanation of this stuff in the docs.
Chris IversonToday at 1:47 PM
however, there is that range defined in CP1252 that's not in ISO-8859
CarlGundelToday at 1:47 PM
Yeah, all the characters in ISO-8859-1 are compatible with Unicode out of the box.
It just doesn't have as many characters in the set.
Chris IversonToday at 1:47 PM
does that actually encode as single bytes in LB5 if you're using ISO-8859?
CarlGundelToday at 1:48 PM
Yes.
Chris IversonToday at 1:48 PM
interesting.
CarlGundelToday at 1:49 PM
There is also an ASCII encoder in this version of Smalltalk. Not sure if there's any advantage to using that over ISO-8859-1.
Chris IversonToday at 1:49 PM
yeah, I wonder what that would actually cover
CarlGundelToday at 1:49 PM
Only the values from 0 to 127.
Chris IversonToday at 1:49 PM
ah.
well, that does actually make sense.
that's all ASCII really defined.
CarlGundelToday at 1:50 PM
But it also would preserve values from 128 to 255 I believe.
Yeah ASCII is only the 7 least significant bits.
Chris IversonToday at 1:51 PM
IF the ASCII encoder actually passes 128+ bytes through without mangling them
it might be the best option for backwards compat
CarlGundelToday at 1:51 PM
That's why ASCII is UTF-8 compatible. Once you set the high bit you are encoding with two or more bytes.
ISO-8859-1 also preserves the values and allows for more languages, so I think it wins over ASCII. However, I can make it possible for the programmer to choose the encoding.
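A final Python sketch of the ASCII/UTF-8 relationship described above:

    # Pure ASCII text is already valid UTF-8, byte for byte.
    assert "hello".encode("ascii") == "hello".encode("utf-8")

    # Setting the high bit is what triggers multi-byte sequences in UTF-8:
    print("é".encode("latin-1"))  # b'\xe9'     - one byte, high bit set
    print("é".encode("utf-8"))    # b'\xc3\xa9' - two bytes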