 
 
bug #20473: BOM in the beginning of a UTF-8 encoding file is not interpreted correctly on plain text import

Submitted by:  Tomáš Hnyk <sup>
Submitted on:  Fri 01 Feb 2013 04:31:08 PM UTC  
 
Category: None
Severity: 3 - Normal
Priority: 5 - Normal
Status: None
Privacy: Public
Assigned to: None
Open/Closed: Open

Fri 01 Feb 2013 04:31:08 PM UTC, original submission:

LibreOffice saves documents in UTF-8 by default (at least on my system), with no
way to configure this. It also puts a BOM (http://en.wikipedia.org/wiki/Byte_order_mark) at the start of the file.

Note that this does not violate the Unicode specification (from http://unicode.org/faq/utf_bom.html#BOM): "Can
a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can
I still assume the remaining UTF-8 bytes are in big-endian order?

A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the
endianness of the byte stream. UTF-8 always has the same byte order. An initial
BOM is only used as a signature — an indication that an otherwise unmarked text
file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect
a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a
BOM will interfere with any protocol or file format that expects specific ASCII
characters at the beginning, such as the use of "#!" at the beginning of
Unix shell scripts."

Subtitleeditor opens such a file just fine, but interprets the BOM as "ZERO WIDTH
NON-BREAKING SPACE (ZWNBSP)"*, which seems correct when the character appears in
the middle of the file. However, it treats the character the same way when it is at
the beginning of the file, which seems to be a bug. This shows up especially
when the file needs to be converted to a different encoding. Saving then fails
with "Save Document Failed.

Could not convert the text to the character coding 'WINDOWS-1250'" (which is not very helpful: like gedit, it could tell me that there are bad characters, or even better list them, but I digress).
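
To illustrate the kind of error report I have in mind, here is a minimal Python sketch that lists the characters that cannot be converted to the target encoding. It is only an illustration, not Subtitleeditor's code: the function name `unencodable_chars` is made up, and `cp1250` is Python's name for WINDOWS-1250.

```python
def unencodable_chars(text: str, encoding: str = "cp1250"):
    """Return (index, character, code point) for each character that
    cannot be represented in the target encoding.

    Hypothetical helper for illustration only.
    """
    problems = []
    for index, ch in enumerate(text):
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # Record where the character is and what it is, so the user
            # can find and remove it.
            problems.append((index, ch, f"U+{ord(ch):04X}"))
    return problems


if __name__ == "__main__":
    # A leading BOM (U+FEFF) followed by ordinary Czech text.
    for index, ch, code_point in unencodable_chars("\ufeffAhoj, světe"):
        print(f"character {code_point} at position {index} cannot be saved")
```

With a report like this, a stray U+FEFF at position 0 would be easy to spot as the cause of the failed save.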

When saving in the .srt format, it also moves the BOM character into the middle of the file, since the file usually starts with something else (such as the subtitle number in .srt).

I think that when importing Unicode text, the first character should be checked to see whether it is a BOM.
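
A minimal sketch of such a check, written in Python for brevity (Subtitleeditor's actual import code is not shown here, and the function name `decode_utf8_import` is hypothetical): drop a single leading BOM before decoding, and leave any U+FEFF further inside the text alone.

```python
import codecs


def decode_utf8_import(raw: bytes) -> str:
    """Decode UTF-8 bytes, dropping one leading BOM if present.

    Hypothetical helper for illustration only.
    """
    # codecs.BOM_UTF8 is the byte sequence EF BB BF.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    # A U+FEFF anywhere else is kept, since there it is a real
    # ZERO WIDTH NON-BREAKING SPACE and not a signature.
    return raw.decode("utf-8")


if __name__ == "__main__":
    print(repr(decode_utf8_import(b"\xef\xbb\xbfHello")))
```

Python's built-in `utf-8-sig` codec behaves the same way when decoding, so `raw.decode("utf-8-sig")` would be an equivalent one-liner.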

* In practice, when the cursor is to the right of such a character, pressing the left arrow
seemingly does nothing, but it actually moves the cursor to the left of the character.
Backspace also seems to do nothing, but it deletes the character. The character is also counted in CPS, the number of characters and so on, which does not make much sense, but this is a very narrow corner case.
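
For the counting issue, a rough Python sketch of what I would expect, assuming CPS is simply the number of characters divided by the subtitle duration in seconds (I have not checked how Subtitleeditor actually computes it; `chars_per_second` is a made-up name):

```python
def chars_per_second(text: str, duration_seconds: float) -> float:
    """Characters per second, ignoring the invisible U+FEFF character.

    Hypothetical helper; the real CPS formula may differ.
    """
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    # U+FEFF has zero width, so it should not count as a character
    # the viewer has to read.
    visible = text.replace("\ufeff", "")
    return len(visible) / duration_seconds


if __name__ == "__main__":
    # Counting the BOM gives 6 characters, ignoring it gives 5.
    print(len("\ufeffHello"), chars_per_second("\ufeffHello", 2.0))
```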

Tomáš Hnyk <sup>

 

No files currently attached

 

Depends on the following items: None found

Items that depend on this one: None found

 

Carbon-Copy List
  • -unavailable- added by sup (Submitted the item)