Mantis - Quercus
Viewing Issue Advanced Details
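ID: 1961
Severity: minor
Reproducibility: always
Date Submitted: 08-22-07 03:29
Last Update: 09-04-07 12:10
Reporter: bago
Assigned To: nam
Priority: normal
Status: closed
Product Version: 3.1.1
Resolution: fixed
Projection: none
ETA: none
Fixed in Version: 3.1.3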
0001961: non US-ASCII chars inside comments results in a failure (BIS)
Sorry for the duplicate submission, but you closed my previous report before I had time to answer your comment.

You wrote:
---------------------
Quercus by default reads scripts in UTF-8. If a character is not valid UTF-8, then it reports an error. To change the default encoding, set the following in your resin-web.xml:

<web-app xmlns="http://caucho.com/ns/resin">
  <servlet-mapping url-pattern="*.php"
                   servlet-class="com.caucho.quercus.servlet.QuercusServlet">
    <init>
      <script-encoding>ISO-8859-15</script-encoding>
    </init>
  </servlet-mapping>
</web-app>

For 3.1.3, we will allow the option to set unicode.semantics to off. Quercus will assume the default charset is ISO-8859-1 in all cases.
-------------------

Adding the script-encoding was the first thing I did when I got the first errors in drupal.

In the same drupal I have:
1) One file, unicode.inc, that does not have any Unicode header but contains PHP strings with UTF-8 sequences.
2) At least one file (e.g., liquid.module) that contains ISO-8859-15 encoded chars in *comments*.

The official PHP interpreter has no problem with such a scenario.
With Quercus it is different: without the script-encoding setting I get an error loading liquid.module, and with the script-encoding setting I get a wrong string from the unicode.inc file.
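
To make the conflict concrete, here is a minimal Python sketch (it is not Quercus code). The two byte strings are hypothetical stand-ins for liquid.module and unicode.inc, not the real file contents, and try_decode is an illustrative helper. It only shows that no single strict decoding handles both files.

```python
# Hypothetical stand-ins for the two Drupal files (not their real contents).
# liquid.module: one ISO-8859-15 byte (0xE9, e-acute) inside a comment.
liquid_module = b"<?php\n// caf\xe9\n$x = 1;\n"
# unicode.inc: one UTF-8 sequence (0xC3 0xA9, e-acute) inside a PHP string.
unicode_inc = b'<?php\n$s = "\xc3\xa9";\n'


def try_decode(data: bytes, encoding: str) -> str:
    """Decode strictly; return the text, or a short marker if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"ERROR at byte {exc.start}"


files = {"liquid.module": liquid_module, "unicode.inc": unicode_inc}
results = {
    (name, enc): try_decode(data, enc)
    for name, data in files.items()
    for enc in ("utf-8", "iso-8859-15")
}

for (name, enc), text in results.items():
    print(f"{name} as {enc}: {text!r}")
```

Read as UTF-8, the comment byte in the first file is rejected. Read as ISO-8859-15, the first file decodes, but the two UTF-8 bytes in the second file turn into two separate characters, which matches the "wrong string" symptom.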

If you want to ignore this difference between official PHP and Quercus, that is fine with me, but I think it deserves documentation, since at least people running Drupal with additional modules will hit similar problems.

I have many similar problems related to Unicode, and I'm trying to understand exactly how Quercus differs from PHP (e.g., when I don't use script-encoding I get a lot of errors when posting non US-ASCII content in forms that save content to MySQL).

Notes
(0002216)
ferg   
08-22-07 09:14   
Which encoding do you intend in your *.php file? iso-8859-1? iso-8859-15?
(0002217)
bago   
08-22-07 09:22   
It is not important I tried both and this does not work.

The fact is that most files in Drupal have no special encoding.
Some core files contain UTF-8 sequences inside PHP strings (see unicode.inc).
Some module files contain ISO-8859-1 chars in PHP *comments*.

I guess official PHP either reads them all as UTF-8 but is able to ignore the "wrong" ISO-8859-1 char in the comment, or automatically recognizes the encoding while reading the content. I don't know.
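
The first guess (ignore whatever bytes sit inside a comment) can be sketched in Python as follows. This is only an illustration of the idea, not how PHP or Quercus is actually implemented; strip_comments is a hypothetical helper, and it deliberately ignores heredocs and inline HTML outside <?php tags.

```python
def strip_comments(src: bytes) -> bytes:
    """Drop //, # and /* */ comments from PHP-like source, working on raw bytes."""
    out = bytearray()
    i, n = 0, len(src)
    while i < n:
        ch = src[i:i + 1]
        if ch in (b'"', b"'"):
            # Copy a quoted string verbatim, honouring backslash escapes.
            j = i + 1
            while j < n and src[j:j + 1] != ch:
                j += 2 if src[j:j + 1] == b"\\" else 1
            j = min(j + 1, n)
            out += src[i:j]
            i = j
        elif src.startswith(b"//", i) or ch == b"#":
            # Line comment: skip up to the newline, which is kept.
            j = src.find(b"\n", i)
            i = n if j == -1 else j
        elif src.startswith(b"/*", i):
            # Block comment: skip past the closing marker.
            j = src.find(b"*/", i + 2)
            i = n if j == -1 else j + 2
        else:
            out += ch
            i += 1
    return bytes(out)


# Hypothetical file mixing both cases: a Latin-1 byte in a comment
# and a UTF-8 sequence in a string.
mixed = b'<?php\n// caf\xe9\n$s = "\xc3\xa9 // not a comment";\n'
stripped = strip_comments(mixed)
text = stripped.decode("utf-8")  # succeeds once the comment bytes are gone
```

Because comments are removed before any decoding happens, the stray byte never reaches the decoder, while the string literal keeps its UTF-8 bytes untouched.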
(0002219)
ferg   
08-22-07 10:11   
"It is not important I tried both and this does not work."

That comment makes no sense at all.

When you write a file, it is in a particular encoding. You can't "try both" unless you're rewriting the source file. Either the file is in one encoding (e.g. utf-8) or it is in another encoding (e.g. iso-8859-15).

If you're saying that parts of the .php file are in utf-8, but other parts are in iso-8859-15, then the .php file is fundamentally broken. Zend's PHP might allow that (and we might be forced to duplicate that hack), but it's really not doing developers any favors.
(0002220)
bago   
08-22-07 11:39   
I think your comment is not correct, but I will try to be more precise:

ISO-8859-15 is very similar to ISO-8859-1, so unless a file uses one of the few characters that differ (like the Euro sign) there is no way to know which of the two encodings it uses. There is no header in a text file to tell you what the encoding is.

The liquid.module file has no header. It is a sequence of mostly US-ASCII bytes and a few other 8-bit bytes. Every 8-bit byte has a representation in the ISO-8859-1 table.
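
A short Python check of both claims, using only the standard codecs (the variable names are just for illustration): it lists the byte values where the two tables disagree and confirms that every byte value decodes under ISO-8859-1.

```python
# Decode every possible byte value under both encodings.
all_bytes = bytes(range(256))
latin1 = all_bytes.decode("iso-8859-1")
latin9 = all_bytes.decode("iso-8859-15")

# Byte values whose meaning differs between the two tables.
differing = [b for b in range(256) if latin1[b] != latin9[b]]

for b in differing:
    print(f"0x{b:02X}: ISO-8859-1 {latin1[b]!r}  ISO-8859-15 {latin9[b]!r}")
```

Only eight byte values differ (0xA4 is the Euro sign in ISO-8859-15), so a file that avoids those eight bytes reads identically under both encodings.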

The unicode.inc file has no header either. In this case it is a sequence of mostly US-ASCII bytes and two UTF-8 characters (2 bytes each) that are placed inside a PHP string (between double quotes).

If you want to take a look at the real files, just download drupal 5.2 (unicode.inc) and http://ftp.drupal.org/files/projects/liquid-5.x-1.x-dev.tar.gz (liquid.module)
(0002222)
bago   
08-22-07 16:07   
Furthermore: I'm speaking of 2 different files. One contains ISO-8859-1 chars in a comment. The other contains UTF-8 bytes in a PHP string. That's why changing the script-encoding setting does not help: if I fix one of them I break the other.

As I said previously, I don't know why PHP works correctly: maybe it parses everything as UTF-8 and is able to ignore the bad 8-bit sequence inside a PHP comment in the second file, or maybe it is able to tell UTF-8 files from ISO-8859-1 files automatically.
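
The second guess (telling UTF-8 from ISO-8859-1 per file) could look like the Python sketch below. It is a common heuristic, not something PHP or Quercus is known to do; guess_decode is a hypothetical helper and the sample byte strings are made up.

```python
from typing import Tuple


def guess_decode(data: bytes) -> Tuple[str, str]:
    """Try strict UTF-8 first; if that fails, fall back to ISO-8859-1.

    The fallback cannot fail, because every byte value is defined in
    ISO-8859-1. Returns the decoded text and the encoding that was used.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode("iso-8859-1"), "iso-8859-1"


# Made-up samples: a UTF-8 sequence in a string, a Latin-1 byte in a comment.
utf8_sample = guess_decode(b'$s = "\xc3\xa9";')
latin1_sample = guess_decode(b"// caf\xe9\n")
```

The heuristic works per file, so it handles the two-file case here, but it would still misread a single file that mixes both encodings.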
(0002260)
nam   
09-04-07 12:10   
php/0015-php/001a