Post by Carl Gundel on Nov 19, 2019 15:48:58 GMT -5
CarlGundelLast Sunday at 8:38 PM
I've been playing with file encoding. For some reason I was having bad luck with the Windows-1252 encoder (the default encoding in the Smalltalk I'm using), which is supposed to be a single-byte-per-character extended ASCII encoding similar to ISO-8859-1. I finally realized that the default encoder is reading two bytes at a time, which seems like a bug. When I switched to ISO-8859-1 encoding it was reading one byte per character properly.
I'd prefer to use the Windows-1252 encoder because it has more actual characters and is a superset of ISO-8859-1 in that respect.
So the good news is that I have the source code for the misbehaving encoder, so perhaps I will be able to fix it.
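For reference, the relationship between the two encodings is easy to check with Python's built-in codecs; a minimal sketch, with cp1252 and iso-8859-1 standing in for the Smalltalk encoders:

    # ISO-8859-1 maps every byte 0x00-0xFF straight to the same Unicode code point.
    assert bytes(range(256)).decode("iso-8859-1") == "".join(chr(i) for i in range(256))

    # Windows-1252 agrees with ISO-8859-1 everywhere except 0x80-0x9F,
    # where it defines printable characters instead of C1 control codes.
    same = [i for i in range(256) if not 0x80 <= i <= 0x9F]
    assert all(bytes([i]).decode("cp1252") == bytes([i]).decode("iso-8859-1") for i in same)

    print(b"\x80".decode("cp1252"))      # '€' - a real character in 1252
    print(b"\x80".decode("iso-8859-1"))  # '\x80' - an invisible C1 control code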
Chris IversonLast Sunday at 8:41 PM
that is weird
hopefully you're able to pinpoint it
cundoLast Sunday at 8:44 PM
I think ANSI 1252 is what I use with subtitles in movies
CarlGundelLast Sunday at 8:45 PM
Yeah, the thing is that build 150 uses the Windows-1252 encoder, and that bug is why it doesn't work correctly.
My thinking is that a single-byte extended ASCII encoding will be the default for LB5, and if you want UTF-8 or some other encoder you will need to open the file and then apply the encoder in a separate statement.
cundoLast Sunday at 8:51 PM
Optional in the file dialog?
I use that in AkelPad. It says open with encoding, and a combobox lets you select one
CarlGundelLast Sunday at 9:01 PM
No, I mean open using the OPEN statement.
cundoLast Sunday at 9:01 PM
Ah ok
When coding
CarlGundelLast Sunday at 9:02 PM
The file dialog is for picking a file. You can decide how to handle the encoding in your BASIC program.
I'm open to suggestions of course.
OPEN "filepath" for output as #filehandle
ENCODER #filehandle, UTF-8

Or something like that.

Or maybe on one line.
cundoLast Sunday at 9:10 PM
What would happen if no encoder is chosen?
CarlGundelLast Sunday at 9:10 PM
OPEN "filepath" for output as #filehandle encoder UTF-8
Chris IversonLast Sunday at 9:10 PM
uses the default encoding
CarlGundelLast Sunday at 9:11 PM
Yup.
Chris IversonLast Sunday at 9:11 PM
I think the second option would work best, a modifier for the OPEN statement
simply because it falls in line with the LEN modifier for random access files that already exists
cundoLast Sunday at 9:12 PM
OPEN "filepath" for output as #filehandle with encoder utf-8
Chris IversonLast Sunday at 9:12 PM
OPEN "filepath" FOR OUTPUT AS #filehandle ENCODING="UTF-8" or whatever
CarlGundelLast Sunday at 9:12 PM
Yeah something like that.
Chris IversonLast Sunday at 9:12 PM
or anything specified above, just throwing out a rationale for making it a statement modifier
although
hmm
CarlGundelLast Sunday at 9:13 PM
I don't feel completely comfortable adding syntax to the OPEN statement, but maybe it is the right thing.
Chris IversonLast Sunday at 9:13 PM
I was going to mention a benefit of having a separate ENCODING command
in that you could then change the encoding of a file on the fly
but I don't see that having much use
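For comparison, Python's standard library happens to support both of the shapes being debated here, including changing the encoding on the fly. A sketch by analogy only, not the proposed LB5 implementation; the file names are hypothetical:

    import io

    # Shape 1: encoding as a modifier on the open call,
    # like OPEN ... FOR OUTPUT AS #h ENCODING="UTF-8"
    with open("out1.txt", "w", encoding="utf-8") as f:
        f.write("héllo\n")

    # Shape 2: open the raw byte stream first, then attach an encoder
    # as a separate step, like OPEN followed by a separate ENCODER statement
    raw = open("out2.txt", "wb")
    f = io.TextIOWrapper(raw, encoding="utf-8")
    f.write("héllo\n")
    f.reconfigure(encoding="latin-1")  # changing the encoding mid-stream (Python 3.7+)
    f.write("wörld\n")
    f.close()  # also closes the underlying raw stream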
CarlGundelLast Sunday at 9:14 PM
UTF-8 random access files will be kind of a messy affair because it isn't possible to know exactly how many bytes will be required by any particular string.
Chris IversonLast Sunday at 9:14 PM
and I DO see that being potentially buggy
depending on how well Smalltalk supports having all that be changed
CarlGundelLast Sunday at 9:14 PM
So for UTF-8 it makes sense to say that the FIELD statement specifies each item size in bytes, not characters?
Chris IversonLast Sunday at 9:15 PM
I think that would be the best option.
really, the only option. If the field size can vary based on the characters used, that pretty much messes everything up
CarlGundelLast Sunday at 9:15 PM
For single-byte-per-character strings this is already true.
Yup, and this also makes it hard to set the file stream position precisely.
So when a byte is also a character, setting the position is trivial, but not so with variable-length characters.
Fun right?
It's not anybody's fault really.
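The variable width is easy to see in Python; the sample strings here are hypothetical:

    s1 = "ABCDEF"   # 6 characters, all ASCII
    s2 = "ÀBÇDÉF"   # 6 characters, three of them non-ASCII

    print(len(s1), len(s1.encode("utf-8")))  # 6 6
    print(len(s2), len(s2.encode("utf-8")))  # 6 9

    # Same character count, different byte count - so a field size or a
    # stream position given in characters has no fixed byte offset.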
Chris IversonLast Sunday at 9:18 PM
Yeah, if you wanted to do character positioning in a UTF8 file, you'd have to constantly be reading small chunks to check for continuation bytes
UTF8 at least has a small advantage over UTF16 in that you can always tell from a byte's high bits whether it starts a character
CarlGundelLast Sunday at 9:18 PM
Or read from the beginning of the file to count characters. Ugh.
cundoLast Sunday at 9:18 PM
:open_mouth:
Chris IversonLast Sunday at 9:19 PM
so moving forward and backward by characters in a string is at least doable
whereas UTF16 surrogate pairs can be nearly anything
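That lead-byte property is what makes scanning possible. A small Python sketch; char_starts is a hypothetical helper, not an LB5 or Smalltalk function:

    def char_starts(data: bytes) -> list:
        # Continuation bytes in UTF-8 always look like 0b10xxxxxx, so any
        # byte where (b & 0xC0) != 0x80 begins a new character.
        return [i for i, b in enumerate(data) if b & 0xC0 != 0x80]

    data = "a€b".encode("utf-8")  # b'a\xe2\x82\xacb'
    print(char_starts(data))      # [0, 1, 4]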
CarlGundelLast Sunday at 9:19 PM
For many kinds of applications this will not be an issue.
But if you do clever jumping around in files then it gets tricky. So, probably the best thing is to structure files into equal-length byte segments and then you store your information in these segments. When you want to access some particular data you have known boundary positions by byte.
Kinda wasteful of disk space, but with terabyte hard drives maybe it doesn't matter anymore. I know, sacrilege. :wink:
Of course with random access files you will just need to make the field sizes larger to compensate for the variable lengths of the strings.
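A sketch of the equal-length-segment idea in Python; SEG, pack, and read_segment are hypothetical names and the sizes are illustrative:

    SEG = 32  # fixed segment size in bytes

    def pack(s: str) -> bytes:
        b = s.encode("utf-8")
        if len(b) > SEG:
            raise ValueError("string does not fit in one segment")
        return b.ljust(SEG, b" ")  # pad with spaces to the full segment width

    def read_segment(f, index: int) -> str:
        f.seek(index * SEG)        # segment boundaries are known byte offsets
        return f.read(SEG).decode("utf-8").rstrip()

    with open("records.dat", "wb") as f:
        f.write(pack("résumé") + pack("plain"))
    with open("records.dat", "rb") as f:
        print(read_segment(f, 1))  # 'plain'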
CarlGundelToday at 1:22 PM
Okay, the weird behavior of the 1252 encoder is not a bug, but a feature.
cundoToday at 1:23 PM
What
Chris IversonToday at 1:23 PM
^
CarlGundelToday at 1:23 PM
The issue stems from the fact that CP1252 was originally designed as a single byte encoding, and a lot of information on the Internet defines it as such. Microsoft and IBM claimed that it was a standard, but it was never actually ANSI approved in spite of their claims.
So in practice CP1252 became a not-quite-single-byte encoding: many of the characters between $80 and $9F map to Unicode code points that take two bytes in memory.
So in the example that you posted, Chris, the value 128 comes back different when read in because it was converted to the Unicode euro sign, which is two bytes. The same encoder should write it back out as 128 to the file. I know, it's weird.
I'm gonna try that round trip and report back.
So, that leaves me a bit stumped.
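That round trip is easy to try in Python, whose cp1252 codec behaves the way described above; a sketch only, the Smalltalk encoder may differ:

    raw = b"\x80"                       # one byte on disk
    s = raw.decode("cp1252")            # '€', Unicode code point U+20AC
    print(s, hex(ord(s)))               # € 0x20ac
    assert s.encode("cp1252") == raw    # writes back out as the single byte 0x80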
Chris IversonToday at 1:29 PM
Well, as long as it's consistent, that's fine, but that still leaves us needing some way of accessing files without going through an encoder
CarlGundelToday at 1:29 PM
I think that LB users will expect single byte ASCII compatibility by default. Perhaps ISO-8859-1 will be a better default for LB5.
Chris IversonToday at 1:29 PM
Just raw data
Plus that's a good point about single-byte sets being expected
CarlGundelToday at 1:31 PM
It's not really consistent in the sense that you write 128 and it comes back different. On the flip side, if your keyboard has the euro sign and you type it in and save it to a file, it will be saved as 128, and when you read it back in it will convert to the two-byte version. Reading the file in binary mode will not give you what you expect.
And on top of that, the programmer might have a mind-bending experience when using chr$() and asc(). Clearly asc() is now a misnomer.
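Python's ord() and chr() show the same surprise that asc() and chr$() would; how LB5 will actually behave here is still an open design question:

    s = b"\x80".decode("cp1252")  # read byte 128 through the 1252 encoder
    print(s)                      # '€'
    print(ord(s))                 # 8364 - an asc()-style call no longer returns 128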
Chris IversonToday at 1:34 PM
actually, that's exactly the behavior I was expecting currently.
the program is Unicode internally.
so of course the encoder converts it to Unicode when importing it
and converts it to CP1252 when exporting it
but I also expect there to be some sort of null/raw encoder
CarlGundelToday at 1:35 PM
On the other hand if I write a file using CP1252 and read it back in using ISO-8859-1 the result will be some lost characters, but you can read the file in and almost all the characters will be correct.
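A Python sketch of that cross-decoding, with a hypothetical sample string:

    data = "Price: €5".encode("cp1252")  # the euro is stored as byte 0x80
    print(data.decode("cp1252"))         # 'Price: €5' - round-trips correctly
    print(repr(data.decode("iso-8859-1")))
    # 'Price: \x805' - every byte survives and most characters are right,
    # but the euro turns into the invisible C1 control character U+0080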
Chris IversonToday at 1:35 PM
where no transformation is done on the input/output
CarlGundelToday at 1:35 PM
This is all pretty messy.
Chris IversonToday at 1:36 PM
I think you'd have the same issue in ISO-8859-1, actually
CarlGundelToday at 1:37 PM
Not exactly the same.
If you're using a US keyboard and you never use the extended characters defined in CP1252, then you are always mapping characters to bytes and back one-to-one.
ISO-8859-1 is really a single byte extended ASCII.
CP1252 was intended to be this also but it didn't stay that way.
This was a serious mistake. They shouldn't have changed the spec without renaming it.
Chris IversonToday at 1:39 PM
I don't actually think CP1252 is multibyte.
the "anything on the US keyboard" holds true in CP1252 as it does in ISO-8859
CarlGundelToday at 1:40 PM
And unfortunately CP1252 is the standard for Windows for good or ill.
Chris IversonToday at 1:40 PM
it's anything above byte 128 that starts causing problems
and I think that'd still be true in ISO-8859
CarlGundelToday at 1:42 PM
It is single byte on disk. In practice, when you read it into memory you have to turn the 32 characters between $80 and $9F into double-byte values. This makes it schizophrenic.
Chris IversonToday at 1:42 PM
that's not because of the standard.
that's because it's being converted to unicode.
CarlGundelToday at 1:42 PM
Only the characters between $80 and $9F are a problem in ISO-8859-1.
Yeah, I understand that CP1252 is supposed to be single byte. In practice it is more complicated because the people who created Unicode didn't consider CP1252 when they designed their character values. That is why the encoding poses this surprising mismatch.
Even though Microsoft claimed that CP1252 is an ANSI standard and they even called it ANSI 1252 back in the day, it was never actually approved. If it had been approved perhaps the Unicode people would have given it more consideration.
Chris IversonToday at 1:46 PM
hmm
CarlGundelToday at 1:46 PM
Messy, right?
Chris IversonToday at 1:46 PM
I see, the code points defined in ISO-8859-1 still hold in Unicode
À is 0x00C0 (lowercase à is 0x00E0)
although À is 0xC0 in 1252 as well
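That spot check, verified with Python's codecs:

    print(hex(ord("À")), hex(ord("à")))       # 0xc0 0xe0 - the Unicode code points
    assert b"\xc0".decode("iso-8859-1") == "À"
    assert b"\xc0".decode("cp1252") == "À"    # 0xC0 is À in 1252 as well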
CarlGundelToday at 1:47 PM
Probably the most important thing for LB5 going forward is that we write a really good explanation of this stuff in the docs.
Chris IversonToday at 1:47 PM
however, there is that range defined in CP1252 that's not in ISO-8859
CarlGundelToday at 1:47 PM
Yeah, all the characters in ISO-8859-1 are compatible with Unicode out of the box.
It just doesn't have as many characters in the set.
Chris IversonToday at 1:47 PM
does that actually encode as single bytes in LB5 if you're using ISO-8859?
CarlGundelToday at 1:48 PM
Yes.
Chris IversonToday at 1:48 PM
interesting.
CarlGundelToday at 1:49 PM
There is also an ASCII encoder in this version of Smalltalk. Not sure if there's any advantage to using that over ISO-8859-1.
Chris IversonToday at 1:49 PM
yeah, I wonder what that would actually cover
CarlGundelToday at 1:49 PM
Only the values from 0 to 127.
Chris IversonToday at 1:49 PM
ah.
well, that does actually make sense.
that's all ASCII really defined.
CarlGundelToday at 1:50 PM
But it also would preserve values from 128 to 255 I believe.
Yeah ASCII is only the 7 least significant bits.
Chris IversonToday at 1:51 PM
IF the ASCII encoder actually passes 128+ bytes through without mangling them
it might be the best option for backwards compat
CarlGundelToday at 1:51 PM
That's why ASCII is UTF-8 compatible. Once you set the high bit you are encoding with two or more bytes.
ISO-8859-1 also preserves the values and allows for more languages, so I think it wins over ASCII. However, I can make it possible for the programmer to choose the encoding.
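A final Python sketch of the ASCII/UTF-8 relationship described above:

    # Pure ASCII text is already valid UTF-8, byte for byte.
    assert "hello".encode("ascii") == "hello".encode("utf-8")

    # Setting the high bit is what triggers multi-byte sequences in UTF-8:
    print("é".encode("latin-1"))  # b'\xe9'     - one byte, high bit set
    print("é".encode("utf-8"))    # b'\xc3\xa9' - two bytes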