Today I noticed that one of my text files has the last 229 lines double-spaced when using "os9 copy" to copy a Windows text file to an OS-9 disk image. Those lines have CR,LF converted to CR,CR instead of only CR.
I have never noticed this happening before, and can't see what might have triggered it.
The Windows file has 2480 lines and 72089 bytes in it, but only the last 229 lines are affected.
The command line doing the copy:
os9 copy -l -r KRNBOOT\dskboot.asm NOS9DEV.DSK,KRNBOOT/dskboot.asm
I am using ToolShed v2.2 on a Windows 7 system.
Has anyone else seen this behaviour? Any ideas as to what might be causing it?
Dave W
After a quick look at the code I can imagine this happens because the file is processed in buffer-size chunks, and each chunk is checked with DetermineEOLType(). If the file is split so that one chunk ends with CR and the next starts with LF, the latter chunk will be detected as an EOL_UNIX type and a simple LF->CR conversion will be done on it.
Maybe the simplest fix would be for DetermineEOLType() to go through the whole chunk in search of CR,LF and not just be happy with the first LF it finds.
Obviously the better fix is to determine the line encoding for the file once and for all (hopefully the first chunk should be enough to determine it) and then stay with this encoding for all remaining chunks.
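To make the suspected mechanism concrete, here is a rough model of it in Python (not ToolShed's C code). The names detect_eol, to_os9, convert_per_chunk and convert_detect_once are made up for illustration, and the "first terminator wins" detection rule is only an assumption based on the description above. In this model, detecting the encoding once is not quite enough on its own: the LF half of a pair split across two chunks also has to be dropped, otherwise a stray LF is left at the boundary.

```python
CR, LF = 0x0D, 0x0A

def detect_eol(chunk: bytes) -> str:
    """Guess the line ending from the first terminator seen (assumed rule)."""
    for i, b in enumerate(chunk):
        if b == CR:
            if i + 1 < len(chunk) and chunk[i + 1] == LF:
                return "dos"
            return "cr"
        if b == LF:
            return "unix"
    return "cr"  # no terminator in this chunk, nothing to convert

def to_os9(chunk: bytes, eol: str) -> bytes:
    """Convert one chunk to CR-only line endings for the given source type."""
    if eol == "dos":
        return chunk.replace(b"\r\n", b"\r")
    if eol == "unix":
        return chunk.replace(b"\n", b"\r")
    return chunk

def convert_per_chunk(data: bytes, buf_size: int) -> bytes:
    """Model of the suspected bug: the EOL type is re-detected for every chunk."""
    out = bytearray()
    for off in range(0, len(data), buf_size):
        chunk = data[off:off + buf_size]
        out += to_os9(chunk, detect_eol(chunk))
    return bytes(out)

def convert_detect_once(data: bytes, buf_size: int) -> bytes:
    """Model of the proposed fix: detect on the first chunk, then stick with it."""
    out = bytearray()
    eol = None
    prev_ended_with_cr = False
    for off in range(0, len(data), buf_size):
        chunk = data[off:off + buf_size]
        if eol is None:
            eol = detect_eol(chunk)
        if eol == "dos" and prev_ended_with_cr and chunk[:1] == b"\n":
            chunk = chunk[1:]  # LF half of a CR,LF pair split across chunks
        prev_ended_with_cr = chunk.endswith(b"\r")
        out += to_os9(chunk, eol)
    return bytes(out)

sample = b"ab\r\ncd\r\nef\r\n"
# A buffer size of 7 splits the second CR,LF pair: b"ab\r\ncd\r" + b"\nef\r\n"
print(convert_per_chunk(sample, 7))    # b'ab\rcd\r\ref\r\r'
print(convert_detect_once(sample, 7))  # b'ab\rcd\ref\r'
```

In the per-chunk model everything from the split pair onwards ends in CR,CR, which matches the double-spaced lines reported above.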
Maybe you can confirm that the issue disappears (or you get other results) if you specify a buffer size with the -b option different than the default 32768 bytes.
I guess this bug hasn't popped up much because people rarely deal with files larger than 32K on their NitrOS-9 systems.
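If the theory is right, a file is only affected when a CR,LF pair happens to sit exactly on a multiple of the buffer size. A small Python check along these lines could tell whether a given file and buffer size are at risk. The function name split_crlf_boundaries is hypothetical, and the 32768 default is simply the default buffer size mentioned above.

```python
def split_crlf_boundaries(data: bytes, buf_size: int = 32768) -> list[int]:
    """Return the chunk boundaries where CR is the last byte of one chunk
    and LF is the first byte of the next."""
    return [
        off
        for off in range(buf_size, len(data), buf_size)
        if data[off - 1] == 0x0D and data[off] == 0x0A
    ]

# Tiny demonstration with a 16-byte buffer-sized stand-in:
# bytes 0-5 are text, 6 is CR, 7 is LF, 8-13 are text, 14 is CR, 15 is LF.
demo = b"x" * 6 + b"\r\n" + b"y" * 6 + b"\r\n"
print(split_crlf_boundaries(demo, 7))  # [7]
print(split_crlf_boundaries(demo, 8))  # []
```

For a real file, the data would be read in binary mode (open(path, "rb")) so that the line endings are not translated before the check.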
It is also interesting to note that NativeToCoco() uses DetermineEOLType() to "sniff" the encoding of the file, while CoCoToNative() selects the target file encoding based on the platform it is compiled for. This is not mentioned in the ToolShed documentation.
2023-07-13
I made some minor changes to the file, and the problem went away. Unfortunately I didn't save a copy that had the problem. But the issue did seem to start around the 64KB mark, so your suspicion as to the cause is probably correct.
What is the largest buffer size that may be specified for "os9 copy"?
My source code files have lots of comments in them, because internal documentation can't be misplaced like external documentation, so that makes them bigger than many other people's files.
Dave W
I don't think there is any practical limit on the buffer size to worry about. The code deals with it as an integer, so if you are running this on a 32-bit computer a 2 GB size could work. It will allocate at least one buffer of that size though, so enough RAM must be available for the process.
-b2111000K seems to work fine here :)
Last edit: Tormod Volden 2023-07-13
2023-07-13
I created a test file that had CR,LF split at the 32KB boundary, and used the default 32KB buffer size, and the lines after that point became double-spaced with the v2.2 program, confirming your analysis of the program.
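For anyone wanting to repeat this, a test file of that kind could be generated with a few lines of Python. This is only a sketch of one way to build it, not necessarily how the file above was made. The names make_split_crlf_text and write_test_file, the filler text and the default line content are all made up for illustration.

```python
def make_split_crlf_text(boundary: int = 32768,
                         line: bytes = b"test line",
                         trailing_lines: int = 5) -> bytes:
    """Build CR,LF text with a CR at offset boundary-1 and its LF at offset boundary."""
    unit = line + b"\r\n"
    # Fill the space before the split pair with whole lines, then pad the
    # last line with '#' so that its CR lands exactly on boundary-1.
    n_full, pad = divmod(boundary - 1, len(unit))
    return unit * n_full + b"#" * pad + b"\r\n" + unit * trailing_lines

def write_test_file(path: str, boundary: int = 32768) -> None:
    """Write the generated text in binary mode so the line endings stay as-is."""
    with open(path, "wb") as f:
        f.write(make_split_crlf_text(boundary))

data = make_split_crlf_text()
print(data[32767], data[32768])  # 13 10
```

The lines after the boundary are the ones to look at once the file has been copied to the disk image.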
I'm not currently set up to patch the ToolShed source code and run make, so for now will just specify a buffer size larger than the largest text file I expect to ever process with "os9 copy".
Thanks for your prompt help!!
Can you share the file? Or a part of it which reproduces the bug?
You can test this patch. I will also make new snapshot builds for Windows once I have committed this.
Steps to reproduce and verify:
Last edit: Tormod Volden 2023-07-13
Note this bug is tracked in https://sourceforge.net/p/toolshed/bugs/51/
BTW, I uploaded a new Windows snapshot at https://toolshed.sourceforge.net/snapshots/