subtitleeditor - Bugs: bug #20473, BOM in the beginning of a UTF-8...


bug #20473: BOM in the beginning of a UTF-8 encoding file is not interpreted correctly on plain text import

Submitted by:  Tomáš Hnyk <sup>
Submitted on:  Fri Feb 1 16:31:08 2013  
Category: None
Severity: 2 - Minor
Priority: 5 - Normal
Status: Confirmed
Privacy: Public
Assigned to: None
Open/Closed: Open

Mon Jan 5 23:32:14 2015, comment #1:

Related to https://gna.org/bugs/index.php?20169

Tomáš Hnyk <sup>
Project Member
Fri Feb 1 16:31:08 2013, original submission:

LibreOffice saves documents in UTF-8 by default (at least on my system), with no
way to configure this. It also includes a BOM ( http://en.wikipedia.org/wiki/Byte_order_mark ) at the start of the file.

Note that this does not violate the Unicode specification (from http://unicode.org/faq/utf_bom.html#BOM ): "Can
a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can
I still assume the remaining UTF-8 bytes are in big-endian order?

A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the
endianness of the byte stream. UTF-8 always has the same byte order. An initial
BOM is only used as a signature — an indication that an otherwise unmarked text
file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect
a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a
BOM will interfere with any protocol or file format that expects specific ASCII
characters at the beginning, such as the use of "#!" at the beginning of
Unix shell scripts."

Subtitleeditor opens such a file just fine, but interprets the BOM as "ZERO WIDTH
NO-BREAK SPACE (ZWNBSP)"*, U+FEFF, which seems to be correct when such a character is
in the middle of the file. However, it treats it this way even when it is at
the beginning of the file, which seems to be a bug. This shows up especially
when one needs to convert the file to a different encoding. Saving then fails
with "Save Document Failed.

Could not convert the text to the character coding 'WINDOWS-1250'" (which is not very helpful: like gedit, it could tell me that there are bad characters, or even better list them, but I digress).
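
To illustrate why the conversion fails, and what a more helpful error report could list, here is a minimal Python sketch. It is not subtitleeditor code (the project is written in C++); the function name `unencodable_chars` and the sample text are made up for this example, and Python's `cp1250` codec stands in for 'WINDOWS-1250'.

```python
def unencodable_chars(text, encoding):
    """Return (index, character) pairs that cannot be encoded in `encoding`."""
    bad = []
    for i, ch in enumerate(text):
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append((i, ch))
    return bad


# Bytes of a BOM-prefixed UTF-8 file; the Czech word after the BOM is an
# arbitrary sample, not taken from the original report.
raw = b"\xef\xbb\xbfP\xc5\x99\xc3\xadli\xc5\xa1"

# The plain "utf-8" codec keeps the BOM as an ordinary U+FEFF character.
text = raw.decode("utf-8")

# U+FEFF has no representation in Windows-1250, so it is the only offender;
# the accented Czech letters themselves convert fine.
print(unencodable_chars(text, "cp1250"))
```

Only the leading U+FEFF is reported, so dropping it before conversion would let the save succeed for this sample.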

When saving in the SRT format, it also moves the BOM character into the middle of the file, because the output usually starts with something else (such as the subtitle number in .srt files).

I think that when importing Unicode text, the first character should be checked to see whether it is a BOM and, if so, it should not be treated as part of the text.
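
The suggested check can be sketched as follows. This is only an illustration in Python under the assumption that the importer has the raw bytes of the file available; `decode_imported_text` is a hypothetical name, not a subtitleeditor function, and the actual fix would have to be made in the program's C++ import code.

```python
import codecs


def decode_imported_text(raw: bytes) -> str:
    """Decode UTF-8 bytes, dropping a single leading BOM if one is present.

    A U+FEFF anywhere else in the data is left alone, since there it is an
    ordinary ZERO WIDTH NO-BREAK SPACE rather than a signature.
    """
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.decode("utf-8")


# With a leading BOM the signature is dropped; an interior U+FEFF is kept.
with_bom = codecs.BOM_UTF8 + "1\nHello\ufeffworld\n".encode("utf-8")
print(repr(decode_imported_text(with_bom)))
```

Python's built-in `utf-8-sig` codec behaves the same way on decoding, so `raw.decode("utf-8-sig")` would be an equivalent one-liner there; the explicit version is shown only to make the check visible.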

  * In practice, when the cursor is to the right of such a character, pressing the left arrow
seemingly does nothing, but it actually moves the cursor to the left of the character.
Backspace also seems to do nothing, but it deletes the character. The character is also counted in CPS, the number of characters and so on, which does not make much sense, but this is a very narrow corner case.

Tomáš Hnyk <sup>
Project Member



    Follow 4 latest changes.

    Date                      Changed By  Updated Field  Previous Value => Replaced By
    Mon Apr 13 13:14:48 2015  sup         Severity       3 - Normal => 2 - Minor
                                          Dependencies   - => Depends on bugs #20169
    Mon Apr 13 13:12:21 2015  sup         Dependencies   - => bugs #20169 is dependent