|
Post by Chris Iverson on Aug 29, 2019 11:33:13 GMT -5
This is one of the bugs I ran into while creating my SSD1306 OLED display code.

temp$ = chr$(hexdec("80"))
print "Length: ";len(temp$)
print "Character: ";temp$
print "ASCII value: ";asc(temp$)

open "test.txt" for output as #file
print #file, temp$;
close #file

open "test.txt" for binary as #file
temp2$ = input$(#file, 1)
close #file

print
if temp$ <> temp2$ then print "not equal"
print

print "Length: ";len(temp2$)
print "Character: ";temp2$
print "ASCII value: ";asc(temp2$)
print

open "test.txt" for input as #file
line input #file, temp2$
close #file

if temp$ <> temp2$ then print "not equal"

Results:

Length: 1
Character: 
ASCII value: 128

not equal

Length: 1
Character: €
ASCII value: 8364

not equal

As you can see, writing a character with a byte value larger than 127 to a file, and then reading it back, results in a completely different value. Some sort of Unicode conversion appears to be taking place automatically: in the extended-ASCII code page on my computer, byte 128 is the Euro sign, and the value read back is the Unicode codepoint of the Euro sign. This only happens when reading data from a file. Playing with ASC() and CHR$() conversions purely in memory always seems to produce the right value, in my tests so far:

value = 128
temp1$ = chr$(value)
print asc(temp1$)

Additionally, if you check the contents of the file, using LB4 or a hex editor, you'll see that the data written out in that first test is indeed a single byte with the value 0x80 (128). It's whatever process reads the file that does the converting.

Now, while behavior like this hints at interesting possible future features, like full Unicode support, it has the downside of making it flat-out impossible to directly manipulate binary files, even when they're opened for binary.
|
|
|
Post by Carl Gundel on Aug 29, 2019 11:42:19 GMT -5
Sorry, what version of LB? What platform?
|
|
|
Post by Chris Iverson on Aug 29, 2019 11:47:42 GMT -5
Sorry, this is on LB5-350, confirmed on Windows.
I first found this issue on the Pi, but I'm not sure it behaves in exactly the same way there, as I haven't run the example code on the Pi or Linux yet.
|
|
|
Post by Carl Gundel on Aug 29, 2019 13:25:22 GMT -5
Thanks. This is very useful feedback.
|
|
|
Post by tenochtitlanuk on Aug 29, 2019 17:11:55 GMT -5
Confirm the problem. My Spanish dictionary program uses 'enya', n with a tilde, in extended ASCII, not UTF-8, within the program for a button title, and LB5 can't load that line. Just when my Spanish<-->English dictionary was working so well!
|
|
|
Post by Carl Gundel on Aug 29, 2019 21:06:32 GMT -5
The new Smalltalk operates under the assumption that UTF-8 or another multibyte character set is in use, and I have my work cut out for me trying to make something sane happen. Wish me luck, pray, throw salt over your shoulder, cross your fingers, etc.
|
|
|
Post by tenochtitlanuk on Aug 30, 2019 9:38:41 GMT -5
Glad it's you who's trying to tackle this! I spent hours trying to edit the UTF-8 two-byte representations of accented characters in my dictionary text (UTF-8 with LF separators), replacing them with the equivalent ASCII characters between 128 and 255, and I never knew whether the software I used was showing two bytes as one character or whether it really WAS ASCII. Lots of recourse to a hex editor!
If only UTF8 headers were reliably added...
Thanks as ever for all your work- it keeps giving my ageing brain lots of fun!
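The one-byte-versus-two-byte ambiguity described above is easy to check directly; a Python sketch (using ñ, the 'enya' character, U+00F1, as the example):

```python
ch = "\u00f1"  # ñ, the 'enya' character

utf8 = ch.encode("utf-8")      # two bytes in UTF-8
latin1 = ch.encode("latin-1")  # one byte in extended ASCII (ISO-8859-1)

print(list(utf8))    # [195, 177] -- 0xC3 0xB1, two bytes
print(list(latin1))  # [241]     -- 0xF1, a single byte between 128 and 255
```

A hex editor shows the bytes unambiguously, which is why it keeps coming up in this thread: a text editor may render either form as one ñ on screen.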
|
|
|
Post by Carl Gundel on Aug 30, 2019 11:53:57 GMT -5
Yeah, UTF-8 may be a brilliant piece of work, but it does make extended ASCII problematic.
|
|
|
Post by Chris Iverson on Aug 30, 2019 15:48:25 GMT -5
Not gonna lie, I would be extremely happy if LB5 natively supported UTF-8, considering it's become the dominant, default encoding of the web and many modern systems (and for good reason).
Extended ASCII is always going to be a problem, because it assumes the person running your program uses the same character set you do, and that's becoming less and less likely. Not to mention cross-platform support, where some of those "code pages" may not even exist on other platforms. If users aren't on the same code page, your program will either not look correct or not function properly, and there's nothing you as the developer can do to fix it. This only becomes more of an issue as time goes on and we become more interconnected, because you're more likely to reach users who aren't in your native country or language.
All that said, that only applies to data that is actually supposed to be human-readable text. LB still requires methods of raw data manipulation for the times when you're either working with binary data that's not supposed to be interpreted by a text encoder, or you're dealing with explicitly-encoded text that you don't want converted.
|
|
|
Post by Carl Gundel on Aug 30, 2019 16:10:40 GMT -5
Agreed. This is a sticky problem. I think that LB5 should default to UTF-8, but it should be very easy to use 8-bit data when doing binary I/O and also to be able to force ASCII and some form of extended ASCII.
|
|