Post by donnybowers on Jun 14, 2020 1:48:31 GMT -5
I can write a text file and then read it back using normal sequential file commands, and it displays fine. But when I open the file in a text editor, it isn't right. In Leafpad (a text editor) it only shows the first line of text. In Geany (a programmer's text editor) it looks like it's all written in Chinese. It does pretty much the same thing whether I use the default encoder or any of the others.
Wine's Notepad clone also reads it differently. It shows two lines, and you can't even cursor through the second line to see all of the characters. Very strange behavior.
I also find it odd that the ASCII encoder will only give you characters 0-127, and that the other encoders give different characters than you get from LB4's file encoding or from previous LB5 alphas, which I believe used the same encoding as LB4.
Here's the code I used to write the file, and then to read it back:
'OUTPUT FILE
open "ascii.txt" for output as #1 encoder = ASCII
'for i=0 to 255 or
for i=0 to 127
    print #1, str$(i);" - ";chr$(i)
next i
close #1
'INPUT FILE
open "ascii.txt" for input as #1
while eof(#1)=0
    line input #1, ascii$
    print ascii$
wend
close #1
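A quick way to see what actually landed in the file is a hex dump. Here's a rough sketch, assuming INPUT$ and DECHEX$ carry over from LB4 and that INPUT$ returns decoded characters:
'dump each character's code in hex; the stray 0 from chr$(0) should stand out
open "ascii.txt" for input as #1
while eof(#1)=0
    c$ = input$(#1, 1)              'read one character
    print dechex$(asc(c$)); " ";    'print its code in hex
wend
close #1
print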
Post by Rod on Jun 14, 2020 3:36:12 GMT -5
Don't have Linux to test. But if you specify the encoder for output, shouldn't you also specify it for input? Also, you write a bunch of control characters (0-31) at the start of the file, so text file handlers will get confused by those. I'm not sure whether the files need a header to be identified properly by text handlers. This is a new subject to me; I'm used to plain old extended ASCII.
'the default encoder has a range of special characters above 127
'note that I ignore the control characters and spc below 33
open "test.txt" for output as #1    'default encoder
'for i=0 to 255 or
for i=33 to 255
    #1 chr$(i)
next i
close #1

open "test.txt" for input as #1    'default encoder
while eof(#1)=0
    line input #1, c$
    print c$;
wend
close #1
print

'using the ascii encoder limits you to ascii 0-127
open "test.txt" for output as #1 encoder = ASCII
for i=33 to 127
    #1 chr$(i)
next i
close #1

open "test.txt" for input as #1 encoder = ASCII
while eof(#1)=0
    line input #1, c$
    print c$;
wend
close #1
print

'now try utf8
open "test.txt" for output as #1 encoder = UTF_8
for i=33 to 255
    #1 chr$(i)
next i
close #1

open "test.txt" for input as #1 encoder = UTF_8
while eof(#1)=0
    line input #1, c$
    print c$;
wend
close #1
Post by Chris Iverson on Jun 14, 2020 3:45:40 GMT -5
Chinese characters are usually the result of misinterpreting Latin UTF-8 or ASCII text as UTF-16/UCS-2. I bet your editor sees a normal byte followed by a null byte (which happens early in the file, since chr$(0) is printed out), and assumes the file is UTF-16 (in which ASCII/Latin characters are each followed by null bytes).
In fact, I bet all of your weird editor issues are caused by the null byte in the file. Leafpad sees normal text, then the null byte, and assumes it has hit the end of the string. Geany assumes it's a UTF-16 file when it isn't.
I bet that 1) if you change the code to NOT output the null byte, you'll have no more issues, and 2) if you ran the code under LB4, you'd see the same results in those editors.
Heck, you can trick regular Windows Notepad (and Notepad++) into doing this by outputting the hex bytes FEFF or FFFE at the beginning of the file. This will cause Notepad to assume the file is UTF-16, with the order of those two bytes specifying the order in which the rest of the code units are stored (big or little endian).
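For example, here's a little sketch of that trick (a hypothetical example; it assumes the default single-byte encoder, so each chr$() comes out as exactly one byte, and "bomtest.txt" is just a made-up name):
'write a UTF-16 LE byte order mark, then ordinary single-byte text
open "bomtest.txt" for output as #1
print #1, chr$(255); chr$(254);   'FF FE = UTF-16 little-endian mark
print #1, "Hello"                 'an editor may now misread this as UTF-16
close #1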
As for why the ASCII encoder only gives you codes 0-127: the ASCII standard only defines those 128 characters. Anything beyond that IS NOT ASCII. It's a localization feature providing further characters; Windows refers to these as "code pages".
Post by Carl Gundel on Jun 19, 2020 20:23:24 GMT -5
It is confusing. I expect that it's actually working the way it should. I chose a single byte encoding as default because I expect most BASIC programmers to not have their brains explode. If I had made UTF-8 the default I would have a support nightmare on my hands.
Post by Carl Gundel on Jun 20, 2020 8:31:44 GMT -5
Let me make an attempt to clarify what I mean.

"It is confusing." By this I mean that compared to the way Liberty BASIC v4.5.1 works with files, the new way of encoding characters into a file is complicated, and you have to go digging around on the internet to find explanations of things like Unicode, encodings, etc. It seems like a succinct one-pager in the help files could help, but I wonder if I'm smart enough to write it.

"I expect that it's actually working the way it should." Here I was referring to Donny's original posted code.

"I chose a single byte encoding as default because I expect most BASIC programmers to not have their brains explode." By this I meant that by choosing iso8859-1 as the default encoding, you get a popular and simple single-byte encoding, similar to ASCII but better for many things, and it doesn't do anything unexpected for people who have been programming in LB or even most other BASICs.

"If I had made UTF-8 the default I would have a support nightmare on my hands." I say this because the first time I encountered UTF-8 I was bewildered as to why things didn't work in a sane way (sanity as I defined it), and I ran off to the Smalltalk support forum and gave them a support headache. So I expect as much in this forum over UTF-8.

Support for different encodings is a good thing, and I'm sure some people here have been eagerly waiting for it. People who just want to write stuff to a file and read it back like they always did will probably try to ignore it, but that's getting harder to do.
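To put numbers on that difference, here's a rough sketch, assuming the ENCODER syntax shown earlier in the thread and that LOF() works on open files the way it does in LB4 (the filenames are made up). The same character takes one byte under iso8859-1 but two under UTF-8:
'write one e-acute character under each encoding
open "latin.txt" for output as #1               'default iso8859-1
print #1, chr$(233);                            'stored as the single byte E9
close #1
open "utf8.txt" for output as #1 encoder = UTF_8
print #1, chr$(233);                            'stored as the two bytes C3 A9
close #1
'compare the file sizes
open "latin.txt" for input as #1
print "iso8859-1 file length: "; lof(#1)        'expect 1
close #1
open "utf8.txt" for input as #1
print "UTF-8 file length: "; lof(#1)            'expect 2
close #1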
Post by donnybowers on Jun 25, 2020 22:21:54 GMT -5
Okay. This works for me:
open "ascii.txt" for output as #1 for i=33 to 255 print #1, str$(i);" - ";chr$(i) next i close #1
open "ascii.txt" for input as #1 for i=33 to 255 line input #1, ascii$ print ascii$ next i
I still don't understand why we lost all those beautiful framing characters we had back in the DOS days. I understand that they're no longer necessary for making "windows" now that window frames are built into most modern operating systems. But I personally could sometimes use framing characters like that, for ASCII art if nothing else.
Anyway, similarly, I don't really understand the need for most of the extended characters we have now, or had with earlier versions of Liberty BASIC. Most of them look pretty useless to me. I do somewhat understand the value of UTF-8, in that it's supposed to make handling text in other languages easier.
One of the reasons I even tested this is that I have some programs where I use the extended characters as a simple way to encrypt my files. I mainly use them for storing website passwords.
I may have to figure out a way to translate all my encrypted ASCII files from my LB4 programs to my LB5 programs once I start using LB5 for everything. Not a real big deal. I guess it's to be expected in such a major upgrade. I'm sure there are reasons for this change that are beyond my pay grade.
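If anyone needs to do that conversion in bulk, a sketch along these lines might be a starting point (the filenames are hypothetical, and it assumes the default iso8859-1 encoder reads LB4's bytes back one character per byte):
'read an LB4-era file with the single-byte default encoder,
'then write the same text back out as UTF-8
open "old_lb4.txt" for input as #old            'default iso8859-1
open "converted.txt" for output as #new encoder = UTF_8
while eof(#old)=0
    line input #old, a$
    print #new, a$
wend
close #old
close #new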
I also usually keep a file handy with these characters in it, in case I need to look up an ASCII value for some reason. I'll probably always just use the default encoding to keep things simple. I usually try to keep things as simple as possible. I can't wait for the AI version of BASIC: "Hey BASIC, make me a program that emulates the old DOS ASCII with the little framing characters." Of course, by the time the AI version comes out I'm sure BASIC will be much more powerful. LOL
I'll just have to remember not to use the null character if I print out an ASCII table. Not a big deal. 0-127 are the standard ASCII characters anyway, so the three characters I occasionally use below 33 (9, 10, and 13) will still work the same as always. Chr$(0) prints as nothing, just like an empty string (""), if I remember correctly. I don't think I've ever used chr$(0) for anything, since empty quotes do the same thing. I don't know why I used to include it when printing out an ASCII table anyway. It's a wasted byte in my file. Pure bloat. LOL
Still, it kind of sucks that when I want to load my ASCII table into my standard text editor, it gives me an "invalid byte sequence" error, and I have to specify iso8859-1 every time or it won't read the file. If that's the case every time I want to use a standard Unix text editor to read any ASCII file (and so many of my programs rely on them), many of my programs will be rendered pretty much useless unless I make my own text editor using LB's TEXTEDITOR control and use only that. The problem there is that LB (in versions prior to LB5) gives a "can't find TKN file" error if you call an LB-created text editor program from a different directory. But perhaps there's a workaround for that that I'm not aware of. That's probably a question for a whole different thread.
Now I get it. Who needs the null character in an ASCII table anyway? Thanks Chris.
Post by donnybowers on Jun 26, 2020 0:27:38 GMT -5
I just created a file with LB5 using standard keyboard characters (no extended characters) and opened it with my default text editor, and it loaded without any problems. So for most of my uses this issue isn't a major concern. The only time I can see it being a pain in the behind is for someone on a Unix system who needs to quickly open ASCII files containing extended characters in a text editor. The Geany text editor loads the extended ASCII file just fine, so that's definitely a solution as long as Geany is available, and I'm sure there are other text editors out there that would work too. I'm actually new to the Mousepad editor. I upgraded to the latest version of Linux Lite OS the other day and they've switched to it as the default editor. For some reason the package manager no longer has Leafpad, which has always been my favorite lightweight text editor. I wanted to test it, but it wasn't there.
I hope others will do some testing for their own purposes and report back if they find any issues with the new format. I'm pretty sure I can work around the changes, but I think it would be good if others who sometimes use extended characters in certain ASCII files did some tests and reported any issues they see. Converting files that use extended ASCII to the new encoding, especially for Unix users (because we're converting from Windows to Unix files), will probably be the biggest pain in the arse. But I'm sure it's worth it for the value these new formats have in the wider community. A little extra work is to be expected with any new version of BASIC, or with a major upgrade like this to any computer application.
I have no way of testing this in Windows because I don't have a compatible version. I don't expect too much trouble converting from Windows files to Unix whenever I get around to making the switch from LB4.x to LB5.x. For my encrypted programs it will probably just be a matter of compiling them in LB5 and re-entering whatever data I have in my current versions. Regular keyboard text shouldn't be an issue at all, except for removing the chr$(13) character.
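For the chr$(13) cleanup, a rough sketch like this should do it (the filenames are just placeholders):
'strip the carriage return left over from Windows CRLF line endings
open "windows.txt" for input as #in
open "unix.txt" for output as #out
while eof(#in)=0
    line input #in, a$
    if right$(a$, 1) = chr$(13) then a$ = left$(a$, len(a$)-1)
    print #out, a$
wend
close #in
close #out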
Actually, I can't wait to get to that stage, because it will mean I'm licensed for a stable version of LB5.x for Linux and for the Raspberry Pi. I can't wait for the Android version LOL!!!