Mantis Bugtracker

ID: 0001961
Category: [Quercus]
Severity: minor
Reproducibility: always
Date Submitted: 08-22-07 03:29
Last Update: 09-04-07 12:10
Reporter: bago
View Status: public
Assigned To: nam
Priority: normal
Resolution: fixed
Status: closed
Product Version:
Summary 0001961: non US-ASCII chars inside comments results in a failure (BIS)
Description:
Sorry for the duplicate submission, but you closed my previous report before I had time to answer your comment.

You wrote:
Quercus by default reads scripts in UTF-8. If a character is not valid UTF-8, then it reports an error. To change the default encoding, set the following in your resin-web.xml:

<web-app xmlns="">
  <servlet-mapping url-pattern="*.php"

For 3.1.3, we will allow the option to set unicode.semantics to off. Quercus will assume the default charset is ISO-8859-1 in all cases.

Adding the script-encoding was the first thing I did when I got the first errors in drupal.

In the same drupal I have:
1) One file that does not have any unicode header, but contains php strings with unicode sequences.
2) At least one file (e.g: liquid.module) that contains iso-8859-15 encoded chars in *comments*

The official PHP interpreter has no problem with such a scenario.
With Quercus, however, if I omit script-encoding I get an error loading liquid.module, and if I set script-encoding I get a wrong string from the file with the unicode strings.
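
The dilemma can be reproduced outside Quercus. Below is a minimal Python sketch, assuming two made-up byte strings that stand in for the two Drupal files (their contents are hypothetical, not the real files). It only illustrates that no single strict decoding handles both; it says nothing about how Quercus or PHP is implemented.

```python
# Hypothetical stand-ins for the two Drupal files described above.
# File A: an ISO-8859-15 byte (0xE9, "e" with acute accent) inside a PHP comment.
latin_comment_file = b"<?php\n// caf\xe9 module\n$x = 1;\n"
# File B: a UTF-8 sequence (0xC3 0xA9, the same character) inside a PHP string.
utf8_string_file = b"<?php\n$s = \"caf\xc3\xa9\";\n"


def try_decode(data: bytes, encoding: str) -> str:
    """Decode strictly; return the text, or a short marker if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"ERROR at byte {exc.start}"


# Decode each file with each candidate encoding.
results = {
    (name, enc): try_decode(data, enc)
    for name, data in (("A", latin_comment_file), ("B", utf8_string_file))
    for enc in ("utf-8", "iso-8859-15")
}

for key, value in results.items():
    print(key, repr(value))
```

File A fails under strict UTF-8, and file B decodes without error under ISO-8859-15 but yields a garbled string, which matches the two symptoms described above.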

If you want to ignore such a difference between official PHP and Quercus, that's fine with me, but I think this deserves documentation, since at least people running Drupal with additional modules will hit similar problems.

I have many similar Unicode-related problems, and I'm trying to understand exactly how Quercus behaves differently from PHP (e.g. when I don't use script-encoding I get a lot of errors when posting non-US-ASCII content in forms that save content to MySQL).

- Notes
08-22-07 09:14

Which encoding do you intend in your *.php file? iso-8859-1? iso-8859-15?
08-22-07 09:22

It is not important I tried both and this does not work.

The fact is that most files in Drupal declare no special encoding.
Some core files contain UTF-8 sequences inside PHP strings (see
Some module files contain ISO-8859-1 chars in PHP *comments*.

I guess official PHP either reads them all as UTF-8 but is able to ignore the "wrong" ISO-8859-1 char in the comment, or automatically recognizes the encoding while reading the content. I don't know.
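
The first guess, reading everything as UTF-8 while tolerating invalid bytes, can be sketched in Python as follows. This is only an illustration of the idea, with a hypothetical module source; it is not a claim about what the official PHP interpreter actually does.

```python
# Hypothetical module source: plain ASCII code plus two ISO-8859-1 bytes
# (0xE9) inside a comment. The content is an assumption for illustration.
source = b"<?php\n// r\xe9sum\xe9 helper\nfunction f() { return 1; }\n"

# errors="replace" substitutes U+FFFD for each undecodable byte instead of
# raising UnicodeDecodeError, so the code after the comment stays readable.
text = source.decode("utf-8", errors="replace")

print(text)
```

Only the comment is damaged here; the function definition survives untouched, so a parser that skips comments would never notice the bad bytes.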
08-22-07 10:11

"It is not important I tried both and this does not work."

That comment makes no sense at all.

When you write a file, it is in a particular encoding. You can't "try both" unless you're rewriting the source file. Either the file is in one encoding (e.g. utf-8) or it is in another encoding (e.g. iso-8859-15).

If you're saying that parts of the .php file are in utf-8, but other parts are in iso-8859-15, then the .php file is fundamentally broken. Zend's PHP might allow that (and we might be forced to duplicate that hack), but it's really not doing developers any favor.
08-22-07 11:39

I think your comment is not correct. Anyway, I will try to be more precise:

ISO-8859-15 is very similar to ISO-8859-1, so unless a file uses one of a few specific chars (like the Euro sign) there is no way to know which of the two encodings it uses. There is no header in the text files to tell you what the encoding is.
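
To make the similarity concrete, this short Python sketch lists every byte value that the two encodings decode differently, using Python's built-in codecs. A file that contains none of these bytes reads identically under either encoding.

```python
# Find every byte value that ISO-8859-1 and ISO-8859-15 decode differently.
differing = [
    b for b in range(256)
    if bytes([b]).decode("iso-8859-1") != bytes([b]).decode("iso-8859-15")
]

# Show both interpretations side by side for each differing byte.
for b in differing:
    one = bytes([b]).decode("iso-8859-1")
    fifteen = bytes([b]).decode("iso-8859-15")
    print(f"0x{b:02X}: {one!r} vs {fifteen!r}")
```

Only eight byte values differ, among them 0xA4, which is the generic currency sign in ISO-8859-1 and the Euro sign in ISO-8859-15.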

The file has no headers. It is a sequence of mostly US-ASCII bytes and some other 8-bit bytes. Every 8-bit byte has a representation in the ISO-8859-1 table.

The file has no header either. But in this case it is a sequence of mostly US-ASCII bytes and 2 UTF-8 chars (2 bytes each) that are placed inside a PHP string (between double quotes).
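
These two properties can be checked with a few lines of Python. The sketch below is only an illustration using the built-in codecs: ISO-8859-1 accepts every byte value, a lone 8-bit byte is never valid UTF-8, and a 2-byte UTF-8 char read as ISO-8859-1 silently becomes two chars.

```python
all_bytes = bytes(range(256))

# ISO-8859-1 maps each of the 256 byte values to exactly one code point,
# so decoding with it can never fail.
latin1_text = all_bytes.decode("iso-8859-1")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Collect the lone bytes >= 0x80 that are valid UTF-8 on their own.
lone_high_valid = [b for b in range(0x80, 0x100) if is_valid_utf8(bytes([b]))]

# A 2-byte UTF-8 char read as ISO-8859-1 turns into two chars.
two_byte_char = "\u00e9".encode("utf-8")  # b"\xc3\xa9"
misread = two_byte_char.decode("iso-8859-1")

print(len(latin1_text), lone_high_valid, len(two_byte_char), repr(misread))
```

This is why reading the second file as ISO-8859-1 produces no error but a wrong string, while reading the first file as UTF-8 produces an error.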

If you want to take a look at the real files, just download drupal 5.2 ( and (liquid.module)
08-22-07 16:07

Furthermore: I'm talking about 2 different files. One contains ISO-8859-1 chars in a comment. The other contains UTF-8 bytes in a PHP string. That's why changing the script-encoding setting does not help: if I fix one of them I break the other.

As I said previously, I don't know why PHP works correctly: maybe it parses everything as UTF-8 and is able to ignore the bad 8-bit sequence inside a PHP comment for the second file, or maybe it is able to tell UTF-8 files from ISO-8859-1 files automatically.
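
The second possibility, automatic recognition per file, could look like the Python sketch below. It is a heuristic under stated assumptions: the function name `decode_guessing` and both file contents are hypothetical, and nothing here is known to be what PHP or Quercus does.

```python
from typing import Tuple


def decode_guessing(data: bytes) -> Tuple[str, str]:
    """Return (text, encoding_used): strict UTF-8 first, ISO-8859-1 otherwise."""
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 accepts every byte value, so this branch cannot fail.
        return data.decode("iso-8859-1"), "iso-8859-1"


# Hypothetical contents standing in for the two kinds of file.
comment_file = b"<?php\n/* na\xefve */\n"          # ISO-8859-1 byte in a comment
string_file = b"<?php\n$s = \"na\xc3\xafve\";\n"   # UTF-8 sequence in a string

for data in (comment_file, string_file):
    text, used = decode_guessing(data)
    print(used, repr(text))
```

The heuristic relies on the fact that non-ASCII ISO-8859-1 text rarely happens to form valid UTF-8 sequences. It is a guess, not a proof, and it cannot tell ISO-8859-1 from ISO-8859-15.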
09-04-07 12:10


- Issue History
Date Modified Username Field Change
08-22-07 03:29 bago New Issue
08-22-07 09:14 ferg Note Added: 0002216
08-22-07 09:22 bago Note Added: 0002217
08-22-07 10:11 ferg Note Added: 0002219
08-22-07 11:39 bago Note Added: 0002220
08-22-07 16:07 bago Note Added: 0002222
09-04-07 12:10 nam Status new => assigned
09-04-07 12:10 nam Assigned To  => nam
09-04-07 12:10 nam Status assigned => closed
09-04-07 12:10 nam Note Added: 0002260
09-04-07 12:10 nam Resolution open => fixed
09-04-07 12:10 nam Fixed in Version  => 3.1.3
