(0002852)
|
ferg
|
03-14-08 13:10
|
|
=============================
Bug 1
=============================
This reluctant quantifier may or may not be a bug, but I added an additional check into the Regcomp.java class to do an additional check. If the current character is '*' and the next one is '?', then the code will instantiate a CharUngreedyLoop even if "isGreedy()" returns false. To fix this, I added the following switch statement lines into Regcomp.java line 181:
case '*':
if (tail == null)
throw error(L.l("'*' requires a preceeding regexp"));
//special case for reluctant quantifier
switch(pattern.peek()) {
case '?':
pattern.read();
tail = tail.createLoopUngreedy(this, 0, Integer.MAX_VALUE);
break;
default:
tail = createLoop(pattern, tail, 0, Integer.MAX_VALUE);
}
return parseRec(pattern, tail.getTail());
=========================
Bug 2 ------- BIG SUBTLE BUG
=========================
This one was ultra difficult to catch and I believe it may either be a JDK issue or a Linux issue. When resin is first started, quercus runs and parses code without any problems. The problem starts appearing after 4 times of refresh (removing the phpbb3 cached files and then refresh the same page in the browser). This was consistent under my hardware setup, which is running CentOS 5 with JDK 1.5.0_14. With remote debugging, I could never catch the problem in the act.
With recompilation and bunch of System.out statements, I was able to find out that somehow, Integer.MAX_VALUE comparison is causing weird errors. In RegexNode.java, there are lines such as these:
for (; i <= max; i++) { <================= CULPRIT
tail = next.match(string, offset + i, state);
if (tail >= 0)
return tail;
if (node.match(string, offset + i, state) < 0) {
return -1;
}
}
These are located in various classes in the source file. The weirdest thing is that after a few times of running the same code, the conditional statement "i <= max" is somehow returning false. This happens when i = 0 and max = Integer.MAX_VALUE. Techncially, Integer.MAX_VALUE should be considered a -1. There is probably some sort of subtle bug here either on the hardware, the OS, or the JDK. It could even be a JIT compilation issue.
I fixed this problem by doing a conditional check in each individual constructor:
if (_max == Integer.MAX_VALUE)
_max = Integer.MAX_VALUE - 1;
This actually worked. Things became stable again. This is obviously a hack since I couldn't figure out what the real problem may be. The other possibility is the fact that somehow JIT compiler compiled this particular type of for statement (a for statement that does not have the first part initialization code) into something that was being run improperly.
I didn't get a chance to test this by either disable JIT or by changing the for statement into a while statement.
============================================
Given that bug 0000001 and the bug you mentioned is not being parsed properly, I was able to do a workaround for the regexp. I changed this original phpbb regexp:
#<!-- ([^<].*?) (?:.*?)? ?-->#
to
#<!-- ([^<].*) (.*)? ?-->#U
In this way, the U stands for Non-greedy. This effectively makes the '.*?' unnecessary. I began running this with the MAX_VALUE hack and phpbb3 now renders all the smarty templates properly. Some other regular expressions that I tried:
#<!-- ([^<].*?) (.*)? ?-->#U == works
#<!-- ([^<].*?) (?:.*)? ?-->#U == fails to match properly with ?: added
#<!-- ([^<].*) (?:.*)? ?-->#U == fails to match properly as above statement
#<!-- ([^<].*?) (.*?)? ?--># == works seemingly with no problems (using my patch from above)
So it looks like "(?:.*?)?" is a very special case that requires some particular mmm... parsing and evaluation.
=========================================
For reference, I have attached to this email the two source files that I modified. For those who are interested in making this work, you are more than welcome to download Quercus 3.1.5 source code and replace the two files with these. |
|
(0002857)
|
ckchris
|
03-19-08 21:29
|
|
Copying my reply to the mailing list and attaching it to here.
I found the bug for 0000002. It's not as subtle as I thought it'd be.
I was checking why this particular bug with the MAX_VALUE comparison only occurred in CharUngreedyLoop and not in other RegexpNode classes. So I compared the for statements and this is what I found. In other node classes, the for loops look like this:
CharLoop.match():
for (; i < max; i++) {
if (node.match(string, offset + i, state) < 0) {
break;
}
}
But for CharUngreedyLoop.match():
for (; i <= max; i++) {
tail = next.match(string, offset + i, state);
if (tail >= 0)
return tail;
if (node.match(string, offset + i, state) < 0) {
return -1;
}
}
All comparisons with MAX_VALUE is a "<" except for CharUngreedyLoop, which uses "<=". I believe this is the culprit. I changed it to "<" like all the other for statements and removed the hacks (assigning MAX_VALUE -1 to max if the value is MAX_VALUE). phpbb continues to run fairly stable with no hiccups at the moment.
========================================
It also appears that my patch for the reluctant quantifier (X*?) appears to work properly. Given the above fix for the MAX_VALUE, the regex is functioning properly:
#<!-- ([^<].*?) (.*?)? ?--># = works
#<!-- [^<].*? (?:.*?)? ?--># = works
PHPBB3 now runs fine without any modifications to the phpbb3 code templates. Of course, I ran into a problem getting the CAPTCHA images to render consistently without problems, but I believe that's fixed in the latest SVN codebase based on what I saw in the svn log.
=========================================
On a separate note, I tried to install vbulletin and test it. However, I am not able to get it to work properly. It looks like vbulletin uses more advanced regular expressions that includes conditionals and lookaheads. Quercus' regex module currently doesn't support those.
I believe that having a complete regex module in quercus will solve many compatibility problems with php software. Also, I'd like to suggest that the regex module is tested with a complete set of fixture data to make sure that it is compliant. Of course, I can probably guarantee that even fixture data won't be able to catch that MAX_VALUE comparison bug from above. :) |
|