Mantis Bugtracker
  

Viewing Issue Advanced Details
ID: 0001961
Category: [Quercus]
Severity: minor
Reproducibility: always
Date Submitted: 08-22-07 03:29
Last Update: 09-04-07 12:10
Reporter: bago
View Status: public
Assigned To: nam
Priority: normal
Resolution: fixed
Platform:
Status: closed
OS:
Projection: none
OS Version:
ETA: none
Fixed in Version: 3.1.3
Product Version:
Product Build: 3.1.1
Summary 0001961: non US-ASCII chars inside comments results in a failure (BIS)
Description Sorry for the duplicate submission, but you closed my previous report before I had time to answer your comment.

You wrote:
---------------------
Quercus by default reads scripts in UTF-8. If a character is not valid UTF-8, then it reports an error. To change the default encoding, set the following in your resin-web.xml:

<web-app xmlns="http://caucho.com/ns/resin">
  <servlet-mapping url-pattern="*.php"
                   servlet-class="com.caucho.quercus.servlet.QuercusServlet">
    <init>
      <script-encoding>ISO-8859-15</script-encoding>
    </init>
  </servlet-mapping>
</web-app>

For 3.1.3, we will allow the option to set unicode.semantics to off. Quercus will assume the default charset is ISO-8859-1 in all cases.
-------------------

Adding the script-encoding was the first thing I did when I got the first errors in drupal.

In the same drupal I have:
1) One file unicode.inc that does not have any unicode header, but contains php strings with unicode sequences.
2) At least one file (e.g: liquid.module) that contains iso-8859-15 encoded chars in *comments*

The official PHP interpreter has no problem with such a scenario.
With Quercus, if I leave out script-encoding I get an error loading liquid.module; if I set script-encoding I get a wrong string from the unicode.inc file.
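
The two failure modes described above can be reproduced outside Quercus. The following is a minimal Python sketch, not Quercus code; the sample byte strings are hypothetical stand-ins and are not the real contents of liquid.module or unicode.inc.

```python
# Hypothetical stand-ins for the two Drupal files (not their real contents).
# A comment containing "é" saved as ISO-8859-15 (single byte 0xE9).
comment_latin = "// caf\u00e9".encode("iso8859-15")
# A PHP string literal containing "é" saved as UTF-8 (bytes 0xC3 0xA9).
literal_utf8 = '$s = "\u00e9";'.encode("utf-8")


def try_decode(data: bytes, encoding: str) -> str:
    """Decode strictly; return a marker string instead of raising."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return "<decode error>"


# Reading everything as UTF-8: the Latin comment is rejected.
utf8_comment = try_decode(comment_latin, "utf-8")
utf8_literal = try_decode(literal_utf8, "utf-8")

# Reading everything as ISO-8859-15: nothing fails, but the UTF-8
# literal turns into two wrong characters.
latin_comment = try_decode(comment_latin, "iso8859-15")
latin_literal = try_decode(literal_utf8, "iso8859-15")

print(utf8_comment, utf8_literal, latin_comment, latin_literal, sep="\n")
```

Whichever single encoding is chosen for strict decoding, one of the two inputs comes out wrong, which matches the behavior reported here.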

If you want to ignore such a difference between official PHP and Quercus, then I'm fine with that, but I think this deserves documentation, since at least people running Drupal with additional modules will run into similar problems.

I have many similar problems related to unicode, and I'm trying to understand how exactly quercus works differently from PHP (e.g: when I don't use script-encoding I get a lot of errors when posting non US-ASCII content in forms that save content to mysql).
Steps To Reproduce
Additional Information
Attached Files

- Relationships

- Notes
(0002216)
ferg
08-22-07 09:14

Which encoding do you intend in your *.php file? iso-8859-1? iso-8859-15?
 
(0002217)
bago
08-22-07 09:22

It is not important I tried both and this does not work.

The fact is that most files in Drupal have no special encoding.
Some core files contain UTF-8 sequences inside PHP strings (see unicode.inc).
Some module files contain ISO-8859-1 chars in PHP *comments*.

I guess official PHP simply reads them all as UTF-8 but is able to ignore the "wrong" ISO-8859-1 char in the comment, or else it automatically recognizes the encoding while reading the content. I don't know.
 
(0002219)
ferg
08-22-07 10:11

"It is not important I tried both and this does not work."

That comment makes no sense at all.

When you write a file, it is in a particular encoding. You can't "try both" unless you're rewriting the source file. Either the file is in one encoding (e.g. utf-8) or it is in another encoding (e.g. iso-8859-15).

If you're saying that parts of the .php file are in utf-8, but other parts are in iso-8859-15, then the .php file is fundamentally broken. Zend's PHP might allow that (and we might be forced to duplicate that hack), but it's really not doing developers any favor.
 
(0002220)
bago
08-22-07 11:39

I think your comment is not correct. Anyway, I will try to be more precise:

ISO-8859-15 is very similar to ISO-8859-1, so unless a file uses one of a few specific chars (like the Euro sign) there is no way to know which of the two encodings it uses. There is no header in the text files to tell you what the encoding is.

The file has no header. It is a sequence of mostly US-ASCII bytes plus some 8-bit bytes. Every 8-bit byte has a representation in the ISO-8859-1 table.
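
As a rough illustration of how close the two charsets are, the Python sketch below compares how each of the 256 byte values decodes under ISO-8859-1 and ISO-8859-15. It relies only on Python's built-in codecs and is independent of Quercus and PHP.

```python
# Byte values whose character differs between the two charsets.
differing = [
    b for b in range(256)
    if bytes([b]).decode("iso8859-1") != bytes([b]).decode("iso8859-15")
]

# Every byte value maps to exactly one character under ISO-8859-1,
# so decoding a file as ISO-8859-1 can never fail.
all_decodable = all(
    len(bytes([b]).decode("iso8859-1")) == 1 for b in range(256)
)

print([hex(b) for b in differing])
print(all_decodable)
# Byte 0xA4 is the position ISO-8859-15 assigns to the Euro sign.
print(bytes([0xA4]).decode("iso8859-15"))
```

A file that avoids the few differing byte values decodes to the same text under both charsets, so its bytes alone cannot tell you which one was intended.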

The unicode.inc file has no header either. In this case it is a sequence of mostly US-ASCII bytes and 2 UTF-8 chars (2 bytes each) placed inside a PHP string (between double quotes).

If you want to take a look at the real files, just download drupal 5.2 (unicode.inc) and http://ftp.drupal.org/files/projects/liquid-5.x-1.x-dev.tar.gz (liquid.module)
 
(0002222)
bago
08-22-07 16:07

Furthermore: I'm speaking of 2 different files. One contains ISO-8859-1 chars in a comment. The other contains UTF-8 bytes in a PHP string. That's why changing the environment variable does not help: if I fix one of them I break the other.

As I said previously, I don't know why PHP works correctly: maybe it parses everything as UTF-8 and is able to ignore the bad 8-bit sequence inside a PHP comment in the second file, or maybe it is able to automatically tell UTF-8 files from ISO-8859-1 files.
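
One way to picture a reader that tolerates both files is one that never decodes the source as a whole and instead works on raw bytes. The Python sketch below is only an illustration of that idea and does not claim to describe the internals of PHP or Quercus. The sample source is hypothetical, and the toy comment rule handles only `//` comments without checking for `//` inside string literals.

```python
import re

# Hypothetical source mixing both encodings (not a real Drupal file):
# an ISO-8859-1 byte in a comment, UTF-8 bytes in a string literal.
source = b'// caf\xe9\n$s = "\xc3\xa9";\n'

# Toy rule: drop everything from "//" to end of line, on raw bytes.
# No decoding happens, so the 0xE9 byte in the comment is never examined.
code_only = re.sub(rb"//[^\n]*", b"", source)

# The literal's bytes pass through untouched and are still valid UTF-8.
literal = re.search(rb'"([^"]*)"', code_only).group(1)
print(literal.decode("utf-8"))
```

Under this byte-oriented model the comment's encoding is irrelevant, and the string literal keeps whatever bytes the file contained.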
 
(0002260)
nam
09-04-07 12:10

php/0015-php/001a
 

- Issue History
Date Modified Username Field Change
08-22-07 03:29 bago New Issue
08-22-07 09:14 ferg Note Added: 0002216
08-22-07 09:22 bago Note Added: 0002217
08-22-07 10:11 ferg Note Added: 0002219
08-22-07 11:39 bago Note Added: 0002220
08-22-07 16:07 bago Note Added: 0002222
09-04-07 12:10 nam Status new => assigned
09-04-07 12:10 nam Assigned To  => nam
09-04-07 12:10 nam Status assigned => closed
09-04-07 12:10 nam Note Added: 0002260
09-04-07 12:10 nam Resolution open => fixed
09-04-07 12:10 nam Fixed in Version  => 3.1.3

