Sun 10 Apr 2005 09:20:46 AM UTC, comment #47:
"Oops, sorry. In any case, the gettext routines manage the necessary conversions themselves, which was originally my point."
No problem (except that my Opera crashes after reloading this page :)). As I understand it, you originally assumed that when no codepage is specified in the current locale, gettext uses the codepage of the current message catalog.
WBR, Profic
|
Sat 09 Apr 2005 05:09:29 PM UTC, comment #46:
"The last statement is completely wrong. When no codepage is specified in the current locale, the LOCALE'S default codepage is used, which in the case of ru_RU is ISO-8859-5."
Oops, sorry. In any case, the gettext routines manage the necessary conversions themselves, which was originally my point.
The problem you mention (output is in ISO-8859-5; webserver sends it as ISO-8859-15) is well known and we are working on it. Expect it to be solved within the next few weeks.
The problem with two different encodings on one page (ISO-8859-5 and KOI8-R) is known, too. It has nothing to do with the above assumption, but with the call to another gettext function, involving plural forms. It'll vanish as soon as the output from the webserver is UTF-8.
|
Fri 08 Apr 2005 07:47:35 AM UTC, comment #45:
I can't create a new bug without registering, so I'll post a comment here.
"The on-the-fly conversion is done by the gettext routines, which will check the encoding of the .mo file (e.g. KOI8-R) and the encoding of the output locale (e.g. ru_RU.UTF-8). In the example case, the msgstr will be converted from KOI8-R to UTF-8. If the output locale is ru_RU, the implied encoding is KOI8-R, so no conversion would be done."
The last statement is completely wrong. When no codepage is specified in the current locale, the LOCALE'S default codepage is used, which in the case of ru_RU is ISO-8859-5.
Thus, currently, if I set Russian as the preferred language in my browser (Opera/7.52.3834), the whole page is unreadable and appears to be in ISO-8859-5. Opera's "info page" shows that the server tells it to use ISO-8859-15. However, I can read the content after forcing ISO-8859-5.
Also, on some pages there is text in KOI8-R (seems related to bug #1986). For example, "%d matching items" on the search result page.
The server must send the browser the real codepage, which in this case is ISO-8859-5, and must not mix different encodings in a single page. I suppose this occurs due to the invalid assumption in the quoted text.
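To illustrate the symptom, here is a minimal Python sketch (the sample word is an assumption, not text taken from the site): the same bytes are decoded once with the charset announced by the server and once with the charset they were actually written in.

```python
# Illustrative only: the sample word is an assumption, not text from the site.
text = "Привет"

# Bytes as the server emits them when the locale's codeset is ISO-8859-5.
raw = text.encode("iso8859_5")

# What a browser shows if it trusts a "charset=ISO-8859-15" header.
misread = raw.decode("iso8859_15")

# What it shows once the user forces ISO-8859-5.
forced = raw.decode("iso8859_5")

print(misread != text)  # True: the page looks like garbage
print(forced == text)   # True: the same bytes are readable again
```

Nothing is lost on the wire; only the label is wrong, which is why forcing the charset in the browser helps.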
WBR, Profic
|
Sat 26 Mar 2005 10:21:12 AM UTC, comment #44:
(1.0.6 is already released, see task #1088)
I think everybody made his point very clear.
I support the proposal of converting po files on the fly before mo files as described by Tobias. Continuing to argue about that would probably cost way more time than maintaining such conversion will ever do.
So we'll implement that; and translators that do not use utf8 will not notice anyway.
|
Sat 26 Mar 2005 10:18:00 AM UTC, comment #43:
OT: Found 1.0.6 from 26-Mar-2005 10:13. I'll try to play with it this weekend.
|
Sat 26 Mar 2005 10:10:00 AM UTC, comment #42:
MR: "... what cost us nothing ..." -- Not so sure. It will certainly cost the compatibility with non-UTF-8 people (including myself) or development effort to maintain this compatibility.
Look, there is no such thing as a pure gain. There are always trade-offs: speed for storage, time for money, something for something else.
- UTF-8 version: presumably faster, takes more storage (no pure gain...), requires either UTF-8 encoding of the po-files or on-the-fly conversion, requires UTF-8 <-> 8-bit conversion on 8-bit systems (from the developer/translator point of view)
- 8-bit version: less storage, but slower (worth noting, though, that gettext uses hash tables internally, so the lookups and the conversions are done only once), compatible with 8-bit systems (from the developer/translator point of view).
MR: "... the failure would mean having what you propose to have anyway ..." -- If only this were true... The failure would mean having what I propose minus the time that could have been spent improving (and/or maintaining) something else.
P.S. Mathieu, could you give me a sign, when 1.0.6 is ready? I would like to try to install it on a clean machine (and to try some unusual configuration).
|
Sat 26 Mar 2005 07:50:14 AM UTC, comment #41:
"In other words, I do not think that this difference will ever make it to the
differences that make the difference; I do not think this is worth the effort,
and it looks just silly."
I do not care what it looks like. I take some pride in trying to do my best in whatever I do. That means doing a bunch of insignificant little things. I tend to think that this bunch of insignificant little things makes a difference in the long run.
If there's something that can easily be done and that makes the whole thing cleaner (like not making an extra conversion each time a string is printed on the screen), what looks silly to me is to avoid it just because at some point we haven't noticed the difference on our brand new home workstation with a test installation that gets 2 hits per year.
I'm not sure that the time we would have to spend to evaluate which parts of the code consume CPU time on www servers hosting a Savane installation is worth anything. But I'm sure that if we save extra computation each time we can, it will not harm the overall performance.
"If the system in general does not require this kind of optimisation, we should
not impose it (but this kind of optimisation should not be forbidden to the
administrator, if his system requires it)."
How many free software programs are just plainly unusable because they are not optimized at all? Probably because some developers never thought that many users do not buy a new computer each year (the first version of Nautilus by Eazel was very interesting in this regard).
There's no point in being careless and not doing what costs us nothing, even if we do not know whether it makes a difference for real. Because it's 100% sure it makes a difference in theory, and that's enough. Good design is not a matter of pragmatism only; it's also a matter of taking care of theory.
"Tobias Toedter: I wouldn't do this. It is just one more point of failure, and
one more piece to maintain."
In the worst case, the failure would mean having what you propose to have anyway. Not really an issue.
Whatever we can save, we'll save.
|
Fri 25 Mar 2005 06:51:36 PM UTC, comment #40:
Mathieu Roy: "I'm not sure that I understood the "against" point of view." -- It is not really "against". It's just the question: why should we impose something on the translator when the system does not really need it?
In other words, I do not think that this difference will ever make it to the differences that make the difference; I do not think this is worth the effort, and it looks just silly.
Besides, if someone wants this difference in performance (after having found the computer on which this difference shows -- which is a difficult task in and of itself!), then he must be able to do it locally (and eventually report the success story).
If the system in general does not require this kind of optimisation, we should not impose it (but this kind of optimisation should not be forbidden to the administrator, if his system requires it).
Tobias Toedter: I wouldn't do this. It is just one more point of failure, and one more piece to maintain.
|
Fri 25 Mar 2005 10:50:00 AM UTC, comment #39:
(Please avoid posting while logged on the test install, it breaks the links) :)
Ok, but make sure you use only core perl functions, to avoid adding new deps.
|
Fri 25 Mar 2005 10:14:43 AM UTC, comment #38:
"is perl really needed? Cannot iconv simply do it?"
I would use iconv for the actual conversion, yes. But I need to change the msgstr header for the .po file as well, because the encoding is specified in there. That would still remain KOI8-R, for example, even though the ru.po file would already be in UTF-8. Obviously, this leads to terrible results or even failures.
Thus the little header snippet has to be replaced as well, that's where perl comes in handy.
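As a rough illustration of that header replacement (sketched here in Python rather than perl, and only an assumption about how the final script might look; po_to_utf8 and the sample entry are made-up names and data):

```python
import re


def po_to_utf8(po_bytes: bytes) -> bytes:
    """Recode a .po file to UTF-8 and update its charset declaration."""
    # The header line is ASCII-only, so the declared charset can be read
    # before the file is decoded.
    match = re.search(rb"charset=([A-Za-z0-9_.:-]+)", po_bytes)
    if match is None:
        raise ValueError("no charset declaration found in the .po header")
    source_charset = match.group(1).decode("ascii")

    text = po_bytes.decode(source_charset)
    # Replace only the first declaration (the one in the header entry).
    text = re.sub(r"charset=[A-Za-z0-9_.:-]+", "charset=UTF-8", text, count=1)
    return text.encode("utf-8")


# A tiny, made-up ru.po stored in KOI8-R.
sample = (
    'msgid ""\n'
    'msgstr ""\n'
    '"Content-Type: text/plain; charset=KOI8-R\\n"\n'
    "\n"
    'msgid "Search"\n'
    'msgstr "Поиск"\n'
).encode("koi8_r")

converted = po_to_utf8(sample)
print(converted.decode("utf-8"))
```

Doing the recoding and the header change in one place keeps the two from drifting apart, which is the failure case described above.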
|
Fri 25 Mar 2005 10:09:15 AM UTC, comment #37:
Well, I'm not sure that I understood the "against" point of view. What would be the problem? Lack of utf8 compliant text editor?
Whatever. Converting files on the fly before .mo files are generated would be fine (is perl really needed? Cannot iconv simply do it?).
|
Fri 25 Mar 2005 10:00:19 AM UTC, comment #36:
How about this:
Leave the .po files in the encoding the translator prefers and convert them on the fly into UTF-8, before the .mo files are generated. This could be done in the target "make %.mo". Afterwards, those changes are discarded and the .po files stay cleanly in their original encoding.
This would require a little bit of black magic in perl, but it shouldn't be too hard. :-) I'm willing to set this up for release 1.0.7.
I guess this suits both parties, right?
|
Fri 25 Mar 2005 09:45:38 AM UTC, comment #35:
When there are things we can do that cost us nothing and that could improve performance (even if only theoretically), we should do them.
I think we're in this case.
Negligible differences at some point make a difference, especially on old computers or heavily used ones.
|
Thu 24 Mar 2005 05:51:37 PM UTC, comment #34:
You are right about the mo-files encoding, Tobias.
The question is: how many CPU cycles will be saved by encoding the .po-files in UTF-8? Is it worth it? It is hard to tell without profiling, but I /guess/ the difference is negligible.
|
Thu 24 Mar 2005 10:36:18 AM UTC, comment #33:
"If I remember correctly, po-files do not have to be in UTF-8; any 8-bit encoding is converted automatically, when you "compile" po-file into mo-file (that is why there is the line "Content-Type: text/plain; charset=KOI8-R\n")."
Right, but the resulting .mo file will use the exact same encoding as the .po file -- just check it out yourself and have a look at the savane.mo file for Russian -- it still uses KOI8-R internally.
The on-the-fly conversion is done by the gettext routines, which will check the encoding of the .mo file (e.g. KOI8-R) and the encoding of the output locale (e.g. ru_RU.UTF-8). In the example case, the msgstr will be converted from KOI8-R to UTF-8. If the output locale is ru_RU, the implied encoding is KOI8-R, so no conversion would be done.
Summary: .po files don't have to be in UTF-8, the necessary conversion is automatically done. But if Mathieu wants to reduce the usage of CPU cycles for gettext, .po files should be in UTF-8.
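The rule described above can be sketched as follows. This is only a Python model of the behaviour, not the gettext implementation; convert_msgstr is a hypothetical helper, and which charset a bare locale name such as ru_RU implies is deliberately left to the caller.

```python
def convert_msgstr(raw: bytes, catalog_charset: str, output_charset: str) -> bytes:
    """Return the msgstr bytes in the output locale's charset."""
    if catalog_charset.lower() == output_charset.lower():
        return raw  # same encoding on both sides: nothing to do
    return raw.decode(catalog_charset).encode(output_charset)


stored = "Поиск".encode("koi8_r")  # bytes as kept in a KOI8-R .mo file

# Output locale ru_RU.UTF-8: the string is recoded.
as_utf8 = convert_msgstr(stored, "KOI8-R", "UTF-8")

# Output charset equal to the catalog charset: the bytes pass through untouched.
unchanged = convert_msgstr(stored, "KOI8-R", "KOI8-R")
```

The CPU cost under discussion is the decode/encode pair in the last line of the function; it disappears when both charsets match.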
|
Thu 24 Mar 2005 10:23:03 AM UTC, comment #32:
Tobias Toedter <toddy>: "it's certainly possible to require the .po files to use UTF-8".
-- If I remember correctly, po-files do not have to be in UTF-8; any 8-bit encoding is converted automatically, when you "compile" po-file into mo-file (that is why there is the line "Content-Type: text/plain; charset=KOI8-R\n").
|
Thu 24 Mar 2005 10:18:31 AM UTC, comment #31:
Fully agreed.
|
Thu 24 Mar 2005 10:12:13 AM UTC, comment #30:
"I don't know what you mean with this. If you refer to the changes I made WRT ngettext, I can assure you that there are no UTF-8 issues mixed in the current code. Are there any other problem areas? "
Ok. My mistake.
"But I'm afraid that there might be some small fraction of database content which is not pure iso-8859-1. This content will most probably not display correctly after the conversion."
We will have to come up with an update script that checks the content and converts it appropriately. We cannot propose an upgrade that could cause serious data loss.
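A minimal sketch of what such a check could look like, in Python and under the assumption that anything that is not valid UTF-8 was stored as ISO-8859-1 (to_utf8 is a hypothetical helper, not part of Savane):

```python
def to_utf8(value: bytes, fallback_charset: str = "iso8859_1") -> bytes:
    """Return value as UTF-8, recoding it only if it is not valid UTF-8 already."""
    try:
        value.decode("utf-8")
        return value  # already valid UTF-8 (this includes plain ASCII)
    except UnicodeDecodeError:
        # Assumption: whatever is not UTF-8 was written as ISO-8859-1.
        return value.decode(fallback_charset).encode("utf-8")


latin1_field = "é".encode("iso8859_1")
utf8_field = "é".encode("utf-8")

print(to_utf8(latin1_field) == utf8_field)  # True: recoded
print(to_utf8(utf8_field) == utf8_field)    # True: left alone
```

This is a heuristic only: it cannot tell ISO-8859-1 from other 8-bit charsets such as KOI8-R, so content in those encodings would still need a manual look.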
Apart from that: there's no problem in publishing a new release just for gettext stuff.
But since it's a serious change, this must be done on a branch in any case, not on the trunk.
So let's say we publish 1.0.6 by the end of the week if no serious problem shows up. After the release, you create a branch and open a related task in the task manager, and 1.0.7 will depend only on this task (or possibly on important bug fixes).
|
Thu 24 Mar 2005 09:42:17 AM UTC, comment #29:
"DIG proposal is interesting. Unfortunately, since the whole utf8 stuff already happened on the trunk, it do not know what we should do exactly."
I don't know what you mean with this. If you refer to the changes I made WRT ngettext, I can assure you that there are no UTF-8 issues mixed in the current code. Are there any other problem areas?
"- Are we sure have only iso-8859-1 content?"
Well, not 100% sure. Most of the users will have used iso-8859-1 (or iso-8859-15) for the browsing. I guess there's only a very little fraction (if at all) which used another encoding. The reason simply is that savane did not work properly with other charsets than iso-8859-1(5). It didn't even display the affected languages like Russian, Japanese and Korean.
But I'm afraid that there might be some small fraction of database content which is not pure iso-8859-1. This content will most probably not display correctly after the conversion.
"- Are we sure only utf-8 will be put in the database afterwards?"
I think so, yes. The reason is that the webpages will be in true UTF-8 encoding, so the browser will send back any forms in UTF-8, too, if the program is not totally brain-dead. So the data savane gets from the user is already in UTF-8 encoding.
"But Savane will not convert stuff to utf8 each time it loads a page, will it? If so, isn't it resources (cpu) waste -- since we'll want utf8 anyway, converting everything once and for all would save resources?"
Well, yes and no ...
The calls to gettext() are using this conversion since the beginning of savane, it's happening in the gettext-library compiled into PHP. The conversion happens if the encoding of the output locale is different from the encoding of the corresponding .mo file.
This overhead is probably rather small WRT the cpu, but this is guessing and I might be wrong. As an example, currently this conversion has to be done for German: the .mo file is in UTF-8, but the webpages are in iso-8859-1.
To minimize this need for conversion, it's certainly possible to require the .po files to use UTF-8. This is a matter of internal policy of savane and of stepping on the translators' toes ;-)
"If this item can be closed by the end of the week, that should not delay the release. Would it be possible?"
Oops, sorry. Obviously, it was too late already yesterday. Somehow, I read "the end of the week" as "the end of April". If you're talking about two or three days left, I very much favor DIG's proposal. I don't have a good feeling of forcing this conversion in the next few hours, especially because it involves a huge amount of valuable data. After all, the database is what savane is all about, the frontend is just for accessing it ... (might be a little bit overemphasized).
So, in conclusion, I think DIG has had a very good idea. Mathieu, do you think that we can release 1.0.7 really shortly after 1.0.6? Like, say, four weeks? And this release should really just include the transition to UTF-8, nothing else.
|
Thu 24 Mar 2005 08:16:57 AM UTC, comment #28:
DIG's proposal is interesting. Unfortunately, since the whole utf8 stuff already happened on the trunk, I do not know what we should do exactly.
"3. Convert the file from ISO-8859-1 to UTF-8, using iconv"
- Are we sure we have only iso-8859-1 content?
- Are we sure only utf-8 will be put in the database afterwards?
"Any necessary conversion is done on the fly."
But Savane will not convert stuff to utf8 each time it loads a page, will it? If so, isn't it a waste of resources (cpu) -- since we'll want utf8 anyway, converting everything once and for all would save resources?
"Would it be possible to let the release task for 1.0.6 depend on this bug as well?"
If this item can be closed by the end of the week, that should not delay the release. Would it be possible?
|
Thu 24 Mar 2005 02:13:52 AM UTC, comment #27:
Or release 1.0.7 could be done just after 1.0.6, as the "switch to UTF-8" release, just so that it is not mixed with other features.
|
Wed 23 Mar 2005 11:15:46 PM UTC, comment #26:
I agree with you, Silvain. I think that enabling of UTF-8 should be possible without too much trouble. I further think that we should do this before we release 1.0.6, as it would be a great opportunity to get a smooth transition.
WRT the database, it is indeed not as trivial as I first thought. Unfortunately, the CONVERT() function is only available in MySQL 4.1 or greater and is therefore clearly ruled out. To do the conversion nonetheless, we have to do it "manually".
But here's the good news: I thought a lot about how to do it, and I think I discovered a really easy way which should have no side-effects:
1. Get a dump of the entire savane database
2. Store this dump in a temporary file
3. Convert the file from ISO-8859-1 to UTF-8, using iconv
4. Dump the converted file back into the MySQL database server
Some remarks about step 3: All the "metainformation" for MySQL, like table structures and INSERT commands etc. are plain ASCII, so those are not affected by the conversion. I cannot think of any error which could leave the database in an unusable state, but since this is obviously the most critical task, comments are very welcome.
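Step 3 could look roughly like this in Python instead of iconv. This is a sketch under the assumption that every non-ASCII byte in the dump is ISO-8859-1 text; convert_dump, the file names and the sample INSERT statement are made up for the illustration.

```python
import os
import tempfile


def convert_dump(src_path: str, dst_path: str) -> None:
    """Rewrite an ISO-8859-1 SQL dump as UTF-8, line by line."""
    # newline="" keeps the line endings exactly as they are in the dump.
    with open(src_path, "r", encoding="iso8859_1", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "dump-latin1.sql")
    dst = os.path.join(tmp, "dump-utf8.sql")

    # Hypothetical table and value; only the 'é' is outside ASCII.
    statement = "INSERT INTO groups (name) VALUES ('Café');\n"
    with open(src, "wb") as f:
        f.write(statement.encode("iso8859_1"))

    convert_dump(src, dst)

    with open(dst, "rb") as f:
        result = f.read()

print(result == statement.encode("utf-8"))  # True
```

The SQL keywords come out byte for byte identical because ASCII is encoded the same way in both charsets; only the data inside the quotes changes.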
OK, on we go: The site-specific files can easily be converted using iconv. No problem here.
About gettext/ngettext: The .po files can be in whatever encoding, it doesn't matter anymore. Any necessary conversion is done on the fly. Moreover, the adaptions needed for savane to put out real UTF-8 are trivial and really minor. I've tested them already, those fixes work smoothly.
I would like to point out that I'd really appreciate this bug to be solved before 1.0.6, and I don't think that the changes needed are too intrusive to be performed at this (relatively) late point in the process.
Mathieu, what do you think? Would it be possible to let the release task for 1.0.6 depend on this bug as well?
|
Wed 23 Mar 2005 12:03:35 AM UTC, comment #25:
I think we could activate UTF-8 in the webui and in the site-specific files.
About the database, I'm not sure we can easily CONVERT all the data, especially with an old version of MySQL (limited SQL structures).
Tobias, can you specify what needs to be kept in mind wrt .po/.mo files? I seem to understand that ngettext() needs the .mo to be pre-encoded in UTF-8?
|
Fri 11 Mar 2005 10:04:28 PM UTC, comment #24:
I would just like to point out that Savane is a separate activity from Gna!, hence not everybody has access to the Gna! test install.
Aside from that, regenerating the .po/.mo files should not be necessary, as gettext calls iconv at run time to translate the messages in the appropriate charset on the fly.
Looks like we all have to carefully study CONVERT() now ;)
|
Fri 11 Mar 2005 09:07:36 PM UTC, comment #23:
How do you proceed to do that?
|
Fri 11 Mar 2005 08:21:37 PM UTC, comment #22:
I confirm that GNA! test install now works well in Russian (before there were some problems with double encodings).
As a side note: I would prefer to keep .po-file (at least, ru.po) in 8-bit encoding (KOI8-R), and to convert it into UTF-8 encoding only when .mo-file is being generated.
|
Fri 11 Mar 2005 03:34:20 PM UTC, comment #21:
Hi,
while fixing a bug which prevented the Russian, Japanese and Korean translations from being shown, I played around a bit with the utf-8 issue.
It is not too hard to set up savane to use utf-8; however, there are some caveats.
I used the header() function to explicitly set the charset to utf-8 and added ".UTF-8" to all locales. Moreover, I converted the .po files to utf-8 and regenerated the .mo files.
You can see the result in the savane test install at gna.org/test. It is fully working for me.
Note that some texts are not managed through gettext, but are from the database. Those texts won't display correctly under utf-8, because they were stored in the DB using iso-8859-1 (or -15). As an example, see <https://gna.org/test/projects/nasgaia/>.
In conclusion:
- The switch to utf-8 in the php source code is rather trivial. A couple of lines in php/include/i18n.php have to be modified.
- The site-specific files can easily be converted, e.g. by using iconv.
All those steps are not really hard, but especially the last one involves a huge amount of data. I'd suggest to make a backup of the DB first ... ;-)
|
Sat 22 Jan 2005 10:03:01 AM UTC, comment #20:
"I never said there could be a misconfiguration at Gna!, sorry if I wasn't clear"
I never thought you said that. I was just saying that since the problem occurs on both installs, it is unlikely to be due to an incomplete chroot install.
"The encoding used during form submission must be the encoding of the page, unless I'm really missing something. If your page is UTF-8, you get submitted text in UTF-8."
That's my experience too.
"header('Content-type: text/html; charset=utf-8')"
Ok, that's necessary in case a configuration cannot be site-wide. For http://www.gna.org, it is not really necessary but there's no reason not to add that in Savane.
"site-specific/*.txt files "
Cannot we just convert these to UTF8?
|
Sat 22 Jan 2005 01:07:56 AM UTC, comment #19:
The encoding used during form submission must be the encoding of the page, unless I'm really missing something. If your page is UTF-8, you get submitted text in UTF-8.
I also believe there must not be any conversion on the way PHP <--> database, but here I'm not really sure. I guess some experiments should show it: if you can save non-trivial UTF-8 text in the database and retrieve it uncorrupted, that's basically all you want.
|
Fri 21 Jan 2005 11:03:09 PM UTC, comment #18:
Ok, so apparently:
- one can redefine the encoding used by adding ".UTF-8" at the end of the locale name. The gettext bind_textdomain_codeset function appeared in PHP 4.2, which is newer than what debian stable ships, so I didn't test it. When you do that, you also have to set your own content-type header via header('Content-type: text/html; charset=utf-8').
- include() doesn't seem to care about encoding. I tried to force PHP to use UTF-8 via iconv_set_encoding, but for reasons I could not figure out yet, this is messing up https. This means that if we switch to UTF-8, old site-specific/*.txt files will not be valid. Since the page was supposed to be in UTF-8 from the start, I don't think this is a big issue.
We could consider trying to guess the files' encoding and do the appropriate conversion, as Emacs and Jikes do, but that would be resource-consuming and kinda complicated, I guess.
To sum up: 1) $locale .= '.UTF-8'; 2) header(...).
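The two steps, sketched in Python for illustration only (the real change would live in Savane's PHP code; utf8_locale and content_type_header are hypothetical helpers):

```python
def utf8_locale(locale_name: str) -> str:
    """Append the UTF-8 codeset to a locale name that has none."""
    # A modifier such as "@euro" has to stay at the very end of the name.
    base, sep, modifier = locale_name.partition("@")
    if "." not in base:
        base += ".UTF-8"
    return base + sep + modifier


def content_type_header(charset: str = "utf-8") -> str:
    """Build the header line that announces the page charset."""
    return f"Content-type: text/html; charset={charset}"


print(utf8_locale("ru_RU"))    # ru_RU.UTF-8
print(content_type_header())   # Content-type: text/html; charset=utf-8
```

Both steps have to agree: the locale decides what bytes gettext produces, the header decides how the browser reads them.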
As for texts included from the database, somebody has to check:
- the encoding used during submission by html form
- the encoding used when adding to the database (it may be PHP's internal encoding)
|
Fri 21 Jan 2005 09:48:41 PM UTC, comment #17:
As far as Savannah is concerned, we needed to copy /usr/lib/gconv to the root (I never said there could be a misconfiguration at Gna!, sorry if I wasn't clear).
Regarding the Gna! configuration, I read the admin docs you once sent to Stéphane Urbanovski and I.
Configuring Apache or PHP's charset is a solution, but that is a site-wide solution that would break in the case of several installed PHP applications; I would rather set the charset in Savane as much as possible. I'll play with the gettext and include functions for a while now.
|
Thu 20 Jan 2005 10:27:05 PM UTC, comment #16:
Unassigning the item because I'd be glad if someone else was taking it over.
|
Thu 20 Jan 2005 10:25:42 PM UTC, comment #15:
"That's what I did. At Savannah, all locales were generated. Provided we copied them right in the apache root..
I'll try to make tests in a more normal install tomorrow."
I do not know if you are familiar with the gna server technical design. In gna case, we only have "normal installs" (though ultra minimalistic), our approach to secure our installation. So I do not think broken installation could explain it, because locales are cleanly installed on http://www.gna.org, I'm 100% positive about that.
About the others points you highlighted: these are indeed the questions we must address to resolve our problem. Unfortunately I cannot currently devote time to read documentation.
|
Thu 20 Jan 2005 10:14:09 PM UTC, comment #14:
"At Gna, setting the http header (via apache) default charset to utf-8 ended up in broken output (weird characters)."
This (latin1 interpreted as utf-8) is the opposite of my german
problem (utf-8 interpreted as latin1).
"A thing that could matter is the locales supported by the differents systems. http://www.gna.org support all locales shipped by debian
(both utf8 and iso for both de and fr locales)."
That's what I did. At Savannah, all locales were generated. Provided we copied them right in the apache root..
I'll try to make tests in a more normal install tomorrow.
"Doesn't it respect the original files charset (po files and php
files?). If so, how should we recode these files to make sure they are all utf8? I guess there should be documentation about it (I'm completely clueless about that)."
gettext: Maybe we could use the bind_textdomain_codeset PHP
function. I could not find "good practice" for PHP+charsets, that's a pity :/
files: source files should have stuck to us-ascii; your french
templates (.txt) at Savannah, though, are latin-1. 'iconv' or 'recode' is ok indeed, unless PHP is automatically recoding files during includes.
There is also the problem of database data: project description,
tracker comments... I don't know how that is managed.
|
Thu 20 Jan 2005 09:57:33 PM UTC, comment #13:
I was thinking about using "recode", I'll consider iconv. Thanks for the tip.
|
Thu 20 Jan 2005 09:53:59 PM UTC, comment #12:
Mathieu: just try `iconv --help' in your command line. `iconv' should be enough for converting between charsets.
|
Thu 20 Jan 2005 09:53:26 PM UTC, comment #11:
"It seems that the server at home.gna.org sends UTF-8 in HTTP headers, so my UTF-8 XHTML pages there are displayed just fine."
For some reason one of us (maybe me, who knows!) set "AddDefaultCharset off" on home.gna.org.
But at download.gna.org, the default apache setting remains. That said, while we definitely want to allow webpages to set the charset, it seems less important on the download server.
I think you are right, with HTML, the charset of the page was taken into account systematically, while with XHTML it seems that the server overrides it, if the value is conflicting between the http header and the actual webpage.
But for http://www.gna.org, the problem is quite different. We can set the output to be utf8 (or set "AddDefaultCharset off" that would give the same result), but currently it does not render well.
|
Thu 20 Jan 2005 09:43:25 PM UTC, comment #10:
This has forced me to convert my download page (http://download.gna.org/quarry/) to ISO-8859-1 (was in UTF-8.) Before it was HTML 4.01 Transitional, and the charset was picked up correctly by the browsers, but now it is XHTML 1.0 Strict and it seems that the browsers now prefer to use the charset from the HTTP headers.
It seems that the server at home.gna.org sends UTF-8 in HTTP headers, so my UTF-8 XHTML pages there are displayed just fine.
So, I guess the gna.org server should sort of adopt the configuration of the home.gna.org one.
|
Thu 20 Jan 2005 09:43:17 PM UTC, comment #9:
"what's this doing in the spam"
Please read <https://gna.org/forum/forum.php?forum_id=524>
"- PHP is not sending any charset information by default; it can be configured to set a site-wide charset that will be added in the Content-type HTTP header. "
It is quite simple to configure apache to use utf8, by setting in httpd.conf:
AddDefaultCharset utf-8
"Especially, my test install and Savannah have a strange behavior: when using the French locale, output charset is latin1, and when using the German locale, output charset is utf-8. However, at Gna!, output charset is latin1 for both locales. All those are running Debian stable. "
At Gna, setting the http header (via apache) default charset to utf-8 ended up in broken output (weird characters).
"Especially, my test install and Savannah have a strange behavior: when using the French locale, output charset is latin1, and when using the German locale, output charset is utf-8. However, at Gna!, output charset is latin1 for both locales. All those are running Debian stable. "
A thing that could matter is the locales supported by the different systems. http://www.gna.org supports all locales shipped by debian (both utf8 and iso for both de and fr locales).
"So the first step is to understand what charset is actually used by PHP (or by gettext?) for its output and how to set it. The second step is to choose how to describe it."
Doesn't it respect the original files charset (po files and php files?). If so, how should we recode these files to make sure they are all utf8? I guess there should be documentation about it (I'm completely clueless about that).
|
Thu 20 Jan 2005 09:09:25 PM UTC, comment #8:
Since I have some issues at Savannah (https://mail.gna.org/public/spam/2005-01/msg00428.html - hmm, what's this doing in the spam???), here is some information:
- PHP is not sending any charset information by default; it can be configured to set a site-wide charset that will be added in the Content-type HTTP header.
- If no charset information is set by PHP, Apache adds Latin1 in the Content-type header by default (the charset info is part of the Content-type header); that can be changed too. It should not be removed for twisted security reasons, unless the document is served as application/xml. Check the link referenced in the Apache default configuration file (search for 'charset').
- Savane's template includes an XML declaration that specifies that the charset used is UTF-8; this declaration has a lower precedence than the HTTP header. Also, this declaration is normally not taken into account if the XML (here XHTML) document is served as HTML (text/html - which is the default). It is taken into account if the document is served as application/xml.
At this point, solutions include making PHP (or Savane's template) set the content type: either it sets the content type to application/xml (and the encoding becomes what is specified in the <?xml?> processing instruction), or it sets the charset itself to, say, UTF-8.
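The precedence rules listed above can be modelled as a small Python function. This is a simplified sketch of the behaviour described here, not of any particular browser; effective_charset is a hypothetical name, and real clients apply more rules (meta tags, sniffing) that are ignored here.

```python
import re
from typing import Optional


def effective_charset(content_type: str, body_start: str) -> Optional[str]:
    """Pick the charset a client would use, following the rules listed above."""
    media_type, _, params = content_type.partition(";")

    # 1. A charset in the HTTP Content-type header has precedence.
    header = re.search(r"charset=([\w.:-]+)", params, re.IGNORECASE)
    if header:
        return header.group(1)

    # 2. The XML declaration only counts when served as application/xml.
    if media_type.strip().lower() == "application/xml":
        decl = re.search(r'<\?xml[^>]*encoding=["\']([\w.:-]+)["\']', body_start)
        if decl:
            return decl.group(1)

    return None  # left to the client's default


page = '<?xml version="1.0" encoding="UTF-8"?>'
print(effective_charset("text/html; charset=ISO-8859-1", page))  # ISO-8859-1
print(effective_charset("application/xml", page))                # UTF-8
print(effective_charset("text/html", page))                      # None
```

The first call corresponds to the current situation: the declaration in the template says UTF-8, but the header added by Apache wins.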
Now, it is not that simple, because we do not know what encoding PHP will use, regardless of the declaration we use.
Especially, my test install and Savannah have a strange behavior: when using the French locale, output charset is latin1, and when using the German locale, output charset is utf-8. However, at Gna!, output charset is latin1 for both locales. All those are running Debian stable.
Note that localized texts are supposed to be automatically converted by gettext to the charset specified by the user in his LANG/LC_ALL/... environment variable (at least it works for console applications).
So the first step is to understand what charset is actually used by PHP (or by gettext?) for its output and how to set it. The second step is to choose how to describe it.
|
Thu 20 Jan 2005 06:06:50 PM UTC, comment #7:
Anyone have experience about that?
|
Thu 20 Jan 2005 05:52:26 PM UTC, comment #6:
In fact, it's not so simple. Savane pretends the content is utf8 while it is not.
Normally, apache overrides this erroneous information. But if apache is changed in this regard, we end up with broken output.
|
Thu 20 Jan 2005 05:40:29 PM UTC, comment #5:
1. OK
2. Isn't charset configurable with Apache httpd? Basically, all you need is to make HTTP headers specify UTF-8, there is no need for Apache to understand what UTF-8 is.
|
Thu 20 Jan 2005 05:12:08 PM UTC, comment #4:
1. Deciding whether a bug should be closed or not is internal management of the project. Some bug trackers allow the submitter to close an item. But Savane has a solid user base against such an approach and no real user base strongly in favor of it.
(Savane has a specific role model; being able to close an item is a key point of it).
2. I see no exact rendering issue. It is true that in fact the utf8 charset will be ignored, due to the http daemon. That's more a Gna issue than a Savane one. Speaking for gna, it is unclear what we should do about it considering we run debian stable.
|
Thu 20 Jan 2005 04:19:54 PM UTC, comment #3:
I was too fast to judge, see
http://validator.w3.org/check?uri=https%3A%2F%2Fgna.org%2Fprojects%2Fquarry%2F
UTF-8 em-dash causes problems with the charset. I assume the validator is not happy because of the mentioned charset conflict?
|
Thu 20 Jan 2005 04:05:39 PM UTC, comment #2:
Sorry, I didn't check it and relied partly on my previous memory. Apparently, Gna! pages are now in UTF-8, so converting entities into characters is not a problem.
This bug should be closed as invalid.
BTW, why am I not able to close it? I'm the one who opened it; other bug-tracking systems allow the bug originator to close it.
|
Thu 20 Jan 2005 02:54:08 PM UTC, comment #1:
But what actual problem does it pose?
Functions that convert html entities would prevent you from actually being able to insert HTML in your description. But we do not want that.
|
Thu 20 Jan 2005 02:44:17 PM UTC, original submission:
In my project's description I have an em-dash as an HTML entity. When I edit this description, the dash is displayed as a character in the edit field, not as an entity. This is because Savane does not convert the "& mdash;" sequence to "& amp; mdash;" when generating the editing page source (extra spaces added only to prevent potential problems.)
I remember there was a PHP function for this, but don't remember the name of it.
There may be other places where this problem shows up.
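For illustration, this is the kind of escaping meant here, shown with Python's html.escape (the PHP function in question is possibly htmlspecialchars; the description string below is made up):

```python
import html

# Hypothetical project description as stored, with the dash as an entity.
description = "Quarry &mdash; a board game GUI"

# Escape it before placing it in the edit field, so that the browser
# shows the entity text itself instead of rendering it as a dash.
field_value = html.escape(description)

print(field_value)  # Quarry &amp;mdash; a board game GUI
```

With the ampersand escaped, the edit field shows the entity as typed, and saving the form stores it back unchanged.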
|