
#396 Try to auto-detect UTF-16 without BOM

Milestone: v4.3
Status: open
Labels: Encodings (2)
Priority: 5
Updated: 2013-12-09
Created: 2011-07-19
Creator: Yaron
Private: No

In my case the large majority of files I work with are either UTF-8 with no BOM, or UTF-16LE with no BOM. As I saw mentioned on other bugs and feature requests, these sorts of files aren't exactly rare.

As it is now, jEdit can't auto-detect UTF-16LE without a BOM. That means I can't conveniently use jEdit on all of these files, since I have to configure it in a way that will consistently load one of the two kinds incorrectly and require a manual reload.
Worse, it also means that I can't do a search in directory/files that will cover both of these file types correctly.

I do recognize that UTF-16 without BOM can't be recognized with 100% reliability. But in my case, and I expect in the large majority of cases, most of the characters in them will be essentially from the ASCII subset. So a heuristic to look at the start of the file, and look for the pattern of alternating \x00 and not-\x00 bytes, would have a very large success rate at correctly guessing if a file is UTF-16LE or UTF-16BE without BOM.

This is also a pattern that is unlikely to be anything else, for the majority of cases where jEdit isn't consistently used for binary files. So the risks of wrong behavior are minimal. (Of course it's also possible to add this heuristic as a configurable option, but my point is that this additional complexity can probably be avoided).
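
The heuristic described above could be sketched roughly as follows. This is only an illustration, not code from jEdit or from the attachments on this request: the class name `Utf16Heuristic` and the thresholds `MIN_PAIRS` and `MIN_RATIO` are assumptions picked for the example.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of jEdit: guesses BOM-less UTF-16 from a byte sample. */
final class Utf16Heuristic {
    // Assumed thresholds, chosen for illustration only.
    static final int MIN_PAIRS = 8;       // require at least this many 2-byte units
    static final double MIN_RATIO = 0.9;  // fraction of units that must fit the pattern

    /** Returns "UTF-16LE", "UTF-16BE", or null if the sample looks like neither. */
    static String guess(byte[] sample) {
        int pairs = sample.length / 2;
        if (pairs < MIN_PAIRS) {
            return null; // too little data to say anything
        }
        int le = 0; // units shaped (non-zero, zero): ASCII stored little-endian
        int be = 0; // units shaped (zero, non-zero): ASCII stored big-endian
        for (int i = 0; i < pairs; i++) {
            byte first = sample[2 * i];
            byte second = sample[2 * i + 1];
            if (first != 0 && second == 0) {
                le++;
            } else if (first == 0 && second != 0) {
                be++;
            }
        }
        if (le >= pairs * MIN_RATIO) {
            return "UTF-16LE";
        }
        if (be >= pairs * MIN_RATIO) {
            return "UTF-16BE";
        }
        return null;
    }

    public static void main(String[] args) {
        String text = "plain ASCII text\r\n";
        System.out.println(guess(text.getBytes(StandardCharsets.UTF_16LE)));
        System.out.println(guess(text.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println(guess(text.getBytes(StandardCharsets.UTF_8)));
    }
}
```

A ratio below 1.0 leaves room for the occasional non-ASCII character, whose high byte is not zero. Binary files with long runs of zero bytes fail both patterns, since each unit must contain exactly one zero byte.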

Discussion

  • Kazutoshi Satoda

    • assigned_to: nobody --> k_satoda
    • status: open --> pending-works-for-me
     
  • Kazutoshi Satoda

    Did you know about [Global Options] > [Encodings] > [List of fallback encodings]?
    I think that putting UTF-16LE in that option will likely solve your problem.
    http://www.jedit.org/users-guide/global-opts.html#encodings-pane

     
  • Yaron

    Yaron - 2011-07-19

    No, I noticed and tried this already.

    If the default is UTF-8, then having UTF-16LE in the fallback encodings doesn't work: jEdit opens the UTF-16LE files as UTF-8, with the customary boxes at every other character.

    If I set the default to UTF-16LE, without any UTF-8 option, then UTF-8 files are opened as UTF-16LE, as a long line of boxes and the occasional odd character.
    If the default is UTF-16LE, and I add UTF-8 as a fallback, I still get the UTF-8 files opened as UTF-16LE.

    I tried playing with every combination I could think of as default, autodetector, and fallback, but none seems to open all files correctly.

     
  • Kazutoshi Satoda

    Sorry, I tried the fallback encodings with some Japanese text files,
    which reliably give decoding errors for wrong encodings.

    Now I recognize that your problem is with ASCII text files.
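
A small sketch of why the difference matters, assuming the fallback list is consulted only when decoding reports an error (as the comment above implies). The class and method names are hypothetical. ASCII text stored as UTF-16LE consists of bytes below 0x80, including the NUL bytes, so it is also valid UTF-8 and no error is ever raised; the Japanese sample contains a byte sequence that is malformed in UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class, not part of jEdit. */
final class StrictDecodeDemo {
    /** True if the bytes decode in the given charset with no malformed or unmappable input. */
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] asciiAsUtf16 = "hello".getBytes(StandardCharsets.UTF_16LE);
        // "\u65E5\u672C\u8A9E" is a three-character Japanese sample, escaped to keep the source ASCII.
        byte[] japaneseAsUtf16 = "\u65E5\u672C\u8A9E".getBytes(StandardCharsets.UTF_16LE);
        byte[] asciiAsUtf8 = "hello!".getBytes(StandardCharsets.UTF_8);

        // ASCII in UTF-16LE read as UTF-8: no error, so no fallback would be tried.
        System.out.println(decodesCleanly(asciiAsUtf16, StandardCharsets.UTF_8));
        // Japanese in UTF-16LE read as UTF-8: malformed, so a fallback would be tried.
        System.out.println(decodesCleanly(japaneseAsUtf16, StandardCharsets.UTF_8));
        // ASCII in UTF-8 (even length) read as UTF-16LE: also no error.
        System.out.println(decodesCleanly(asciiAsUtf8, StandardCharsets.UTF_16LE));
    }
}
```

The last case corresponds to UTF-8 files still opening as UTF-16LE when UTF-16LE is the default: pairs of ASCII bytes form valid UTF-16 code units, so nothing fails there either.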

    Now, I want to note that jEdit core already allows custom
    encoding detectors to be added as plugins.
    http://www.jedit.org/api/org/gjt/sp/jedit/io/EncodingDetector.html

    Would you try to write your heuristic in a programming language, so
    that you (or I, or someone else) can craft it as a concrete
    EncodingDetector service plugin for you ?
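
For orientation, a detector of this kind might have the following shape. The `SampleDetector` interface below is a local stand-in that is only assumed to mirror the single-method shape of `org.gjt.sp.jedit.io.EncodingDetector` (check the API page linked above for the real declaration). `HeuristicDetector`, its `sampleSize` parameter, and the pluggable `heuristic` function are hypothetical names for this sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.function.Function;

/** Stand-in assumed to resemble jEdit's EncodingDetector; not the real interface. */
interface SampleDetector {
    String detectEncoding(InputStream sample) throws IOException;
}

/** Hypothetical adapter: reads a bounded sample and hands it to a byte-level heuristic. */
final class HeuristicDetector implements SampleDetector {
    private final int sampleSize;
    private final Function<byte[], String> heuristic;

    HeuristicDetector(int sampleSize, Function<byte[], String> heuristic) {
        this.sampleSize = sampleSize;
        this.heuristic = heuristic;
    }

    @Override
    public String detectEncoding(InputStream sample) throws IOException {
        byte[] buf = new byte[sampleSize];
        int filled = 0;
        // read() may return fewer bytes than requested, so loop until full or end of stream.
        while (filled < buf.length) {
            int n = sample.read(buf, filled, buf.length - filled);
            if (n < 0) {
                break;
            }
            filled += n;
        }
        // A null result means "no opinion", leaving the decision to other detectors.
        return heuristic.apply(Arrays.copyOf(buf, filled));
    }

    public static void main(String[] args) throws IOException {
        // Deliberately trivial heuristic: only looks at the first two bytes.
        SampleDetector detector = new HeuristicDetector(4,
            bytes -> bytes.length >= 2 && bytes[0] != 0 && bytes[1] == 0 ? "UTF-16LE" : null);
        byte[] data = {'a', 0, 'b', 0, 'c', 0};
        System.out.println(detector.detectEncoding(new ByteArrayInputStream(data)));
    }
}
```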

     
  • Kazutoshi Satoda

    • status: pending-works-for-me --> open
     
  • Yaron

    Yaron - 2011-07-19

    Thanks, I didn't notice that these were plugins and not integral to the core.

    I haven't really worked in Java for years now, and quite little back then, but the logic here is fairly simple. I looked at the ".java" file for the other implementers of EncodingDetector and I think I should be able to write one easily enough. Hopefully I should have some time tomorrow to do this.

    I don't, however, currently have any tools (besides, well, editors) for working with Java.
    Will a complete java file (or detectEncoding method) that possibly includes minor typos/syntax issues be good enough for you/someone-else to go ahead with to complete this? Or will it be better to send working code in something else (C++ with stl should be easy for me and relatively close to java) without "translating" it to java myself first?

     
  • Yaron

    Yaron - 2011-07-20

    This may work as a detector

     
  • Yaron

    Yaron - 2011-07-20

    I've tried making something.

    This compiles, and I think it should work. Not quite sure how to integrate with my running jEdit to check, though.

    Since it's a heuristic approach, I made the amounts adjustable parameters.
    I'm not sure if it would be a good idea to do some sanity checking on them in the code, in case someone later changes a parameter to something invalid, or in case it ever becomes configurable. For now I assume it's not really an issue, though.

    Can you please try and see if this can work?

     
  • Kazutoshi Satoda

    Thank you very much for trying by yourself.

    Unfortunately, there are some non-essential things needed to turn an
    EncodingDetector class into a jEdit plugin.
    http://www.jedit.org/users-guide/writing-plugins-part.html

    For example, see the attachment in the following post.
    http://old.nabble.com/Re%3A-jEdit---file-encodings-td16482966.html
    (jEdit-plugin-Universalchardet-preview.tar.bz2)

    ... oh, the above Universalchardet plugin might just work as the
    solution for you. I'm sorry for not remembering this earlier.

     
  • Yaron

    Yaron - 2011-07-22

    First, I want to say that I really do think something like this should be a part of the core just like the other detectors are. It's less clear-cut, and the default parameters could probably be better, but beyond that it's not an unusual need, and it is something that quite a few other editors do at some level without requiring the user to install plugins/extensions.

    But naturally you don't have to agree with me on that, so a plug-in it is.

    Since I modified it to work as a plugin anyway, I also added an options pane so the parameters could be modified. I didn't really have the time or inclination to study the Swing UI too much, so the validation is mostly done not while editing, but by trying to either discard or modify bad parameters afterwards. So not really very robust, but usable.
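
The "discard or modify afterwards" style of validation could look something like the sketch below. It is not taken from the attached plugin: the class `DetectorOptions`, the method `parseOrDefault`, and the bounds used in `main` are all hypothetical.

```java
/** Hypothetical option helper; names and ranges are assumptions for illustration. */
final class DetectorOptions {
    /**
     * Parses a stored option value. Unparseable input is discarded in favour of
     * the fallback; out-of-range input is modified by clamping it into [min, max].
     */
    static int parseOrDefault(String raw, int min, int max, int fallback) {
        if (raw == null) {
            return fallback;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return Math.max(min, Math.min(max, value));
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrDefault("abc", 1, 100, 16));
        System.out.println(parseOrDefault("500", 1, 100, 16));
        System.out.println(parseOrDefault(" 32 ", 1, 100, 16));
    }
}
```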

    I'm going to attach:
    1. Source java files, prop file, and services.xml file
    2. Jar file for the plugin, that I made by manually compiling the jars, and manually zipping the lot.

    Feel free to make whatever further modifications you think are needed so it could be used by someone other than me. I did modify the copyright messages from some of the sample jedit source files I looked at to be copyrighted to me, but I'm happy to relinquish copyright on this to any of the other devs.

     
  • Yaron

    Yaron - 2011-07-22
     
  • Yaron

    Yaron - 2011-07-22
     
  • Alan Ezust

    Alan Ezust - 2013-12-09
    • labels: core --> Encodings
    • Group: --> v4.3
     
