Post by Carl Gundel on Oct 22, 2019 20:54:17 GMT -5
Copied from the Discord chat room.
CarlGundel Yesterday at 11:45 AM
Right now I'm working on support for BYREF in LB5. This is implemented in LB4 but it's kind of a cheat because it only pretends to do assignment by reference in that version. LB5 does the real thing, and it has been tricky to implement. Almost ready.
I'm also figuring out how to support multibyte characters in files, along with ASCII.
Tricky.
Certainly open to people's thoughts about it.
Chris Iverson Yesterday at 11:47 AM
I am definitely in support of proper unicode support, but if it's easier, I'm fine with at least getting ASCII working for now
to at least achieve parity with LB4 on file manipulation
CarlGundel Yesterday at 11:47 AM
The default could be support for ISO-8859-1. That's a single byte encoding.
Chris Iverson Yesterday at 11:48 AM
Is that something that would be defined internally in LB5 as the default, or something it gets from the operating environment (Windows, etc.)?
CarlGundel Yesterday at 11:48 AM
Then if we want UTF-8 we could have a modifier in the OPEN statement.
UTF-8 or other.
Chris Iverson Yesterday at 11:49 AM
if you're supporting unicode anything, UTF8 would probably be the best to go for, despite the fact that Windows uses UTF16 internally
UTF8 is used everywhere on the web, and also is the default on Linux and (I believe) Mac
plus you have the bonus that standard text files (anything below byte 128) are identical in ASCII/UTF8
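A minimal sketch of the point above, in Python rather than Liberty BASIC (the sample strings are made up for illustration): text that uses only code points below 128 produces the same bytes in ASCII, ISO-8859-1, and UTF-8, and the encodings only diverge once a character above 127 appears.

```python
# Text made only of code points below 128 encodes to identical bytes
# in ASCII, ISO-8859-1, and UTF-8.
plain = "PRINT 42"
ascii_bytes = plain.encode("ascii")
latin1_bytes = plain.encode("iso-8859-1")
utf8_bytes = plain.encode("utf-8")
same_for_ascii = ascii_bytes == latin1_bytes == utf8_bytes

# Above 127 the encodings diverge: e-acute (U+00E9) is one byte
# in ISO-8859-1 but two bytes in UTF-8.
accented = "caf\u00e9"
latin1_len = len(accented.encode("iso-8859-1"))
utf8_len = len(accented.encode("utf-8"))

print(same_for_ascii, latin1_len, utf8_len)
```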
CarlGundel Yesterday at 11:51 AM
Yeah, Windows is one thing, but then there is Linux and MacOS.
I think I may be able to support UTF16 also.
Chris Iverson Yesterday at 11:51 AM
Even Windows devs have said that they'd have focused on UTF8 if it existed when they were first implementing Unicode support
CarlGundel Yesterday at 11:52 AM
The trouble with variable size character sets is that positioning in file streams is a mess.
Chris Iverson Yesterday at 11:52 AM
too late to change it now, though, considering the thousands and thousands of applications that now exist that use and expect UTF16 on Windows
Yeah, although it's actually a little easier to do in UTF8 compared to UTF16, as it's fairly easy to pick out surrogate sets in UTF8
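As a rough illustration of why resynchronizing is tractable in UTF-8: every byte that is not the first byte of a character has the bit pattern 10xxxxxx (a continuation byte; strictly speaking, "surrogates" are the UTF-16 equivalent). The sketch below is Python, and `char_start` is a hypothetical helper name, not anything from LB5. It backs up from an arbitrary byte offset to the start of the character that contains it, assuming the data is valid UTF-8.

```python
def char_start(data: bytes, pos: int) -> int:
    """Back up from byte offset pos to the first byte of the UTF-8
    character that contains it. Assumes data is valid UTF-8."""
    if pos >= len(data):
        return len(data)
    # Continuation bytes look like 10xxxxxx, i.e. (byte & 0xC0) == 0x80.
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


# "a", euro sign (3 bytes: E2 82 AC), "b"
data = "a\u20acb".encode("utf-8")
print([char_start(data, i) for i in range(len(data))])
```

Offsets 1, 2, and 3 all fall inside the euro sign, so each resolves to offset 1.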
CarlGundel Yesterday at 11:53 AM
If you have a random access file and the record size is 1000, you have to realize that this is 1000 bytes, not 1000 characters unless you're using ASCII or ISO-8859.
Chris Iverson Yesterday at 11:54 AM
Yup, file access and text manipulation have to be considered separate steps.
CarlGundel Yesterday at 11:54 AM
UTF8 doesn't solve the problem.
Of the record size I mean.
Chris Iverson Yesterday at 11:54 AM
oh, no, it doesn't. If you wanted every character to be the same size, you'd have to use UTF32, and that's just a gigantic waste of space.
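To make the record-size issue concrete, here is a small Python sketch (not LB5 behavior; the name and the 3-byte record size are invented for illustration). It compares how many bytes the same 3-character string occupies in each encoding, then shows what happens when a UTF-8 string is cut at a fixed byte length.

```python
name = "Zo\u00eb"  # 3 characters; the last one is e-diaeresis (U+00EB)

# Byte counts for the same text. The "-le" variants are used so that
# Python does not prepend a byte order mark.
sizes = {
    enc: len(name.encode(enc))
    for enc in ("iso-8859-1", "utf-8", "utf-16-le", "utf-32-le")
}
print(len(name), sizes)

# A fixed-size field is measured in bytes. Cutting UTF-8 at a byte
# boundary can split a character and leave undecodable data behind.
RECORD_SIZE = 3  # hypothetical field width in bytes
raw = name.encode("utf-8")[:RECORD_SIZE]
try:
    raw.decode("utf-8")
    split_character = False
except UnicodeDecodeError:
    split_character = True
print(raw, split_character)
```

Only the single-byte encoding keeps bytes and characters in step; UTF-32 does too in the sense that every character is 4 bytes, which is the space cost mentioned above.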
CarlGundel Yesterday at 11:55 AM
So, obviously if I wanted a seamless experience for LB users who have used random access files or binary files this is tricky, and there is no perfect solution.
An important guideline in software design is "the principle of least astonishment" and this is hard to realize in this particular case. I know it caused me distress when I first encountered it and I'd like to spare LB users.
Rod Today at 5:16 AM
Ok, not something I am familiar with. A few random thoughts from an ascii user.
UTF is designed to widen the character set. It is only useful if we can display and edit that character set within a Liberty control and read and write and copy and paste to files or other controls.
So is it all or nothing, all ascii or UTF?
If it isn’t all or nothing we could have a command to write or read a UTF string which in turn could be filed, displayed or pasted as UTF. a$$ v a$
How do we know a string is in UTF or ascii format?
But we don’t have a Liberty control capable of displaying UTF, or do we? If our text controls and graphic displays recognised UTF, old ascii users would not know or care.
New UTF users would use new tools. Perhaps len() needs to return characters not bytes. At least for a$$
Do we need a new data type UTF?
CarlGundel Today at 9:49 PM
The widgets themselves know how to handle extended character sets, so that's not a problem. Right now I am only concerned with files. You should be able to read and write files without things getting all screwed up. For the most part Liberty BASIC users are western language speakers, and so it would be great if I can figure out how to cater to those users by default. ISO-8859-1 should be writable and readable to disk without doing anything special. Then if you want extended character sets I will try to add UTF-8 as an easy to use optional encoding for files when you need it.
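For comparison only, this is roughly how a per-file encoding choice looks in Python, where the encoding is an argument to `open`. It is an analogy for the idea of a default single-byte encoding plus an optional UTF-8 modifier, not a proposal for the LB5 OPEN syntax; `round_trip` is a hypothetical helper.

```python
import os
import tempfile


def round_trip(text, encoding):
    """Write text to a temporary file using the given encoding, then
    return (text read back, file size in bytes)."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # newline="" turns off newline translation so sizes match on every OS.
        with open(path, "w", encoding=encoding, newline="") as f:
            f.write(text)
        size = os.path.getsize(path)
        with open(path, "r", encoding=encoding, newline="") as f:
            return f.read(), size
    finally:
        os.remove(path)


text = "Zo\u00eb"
latin1_result = round_trip(text, "iso-8859-1")
utf8_result = round_trip(text, "utf-8")
print(latin1_result, utf8_result)
```

The text survives both round trips; only the size on disk differs (3 bytes versus 4).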