Mantis - Quercus
Viewing Issue Advanced Details
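ID: 1562
Severity: major
Reproducibility: always
Date Submitted: 01-17-07 03:40
Last Update: 06-25-07 12:45
Reporter: obaltz
Assigned To: sam
Priority: normal
Status: closed
Product Version: 3.1.0
Resolution: fixed
Projection: none
ETA: none
Fixed in Version: 3.1.2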
0001562: Problem with back references to subpatterns in preg_match_all
When using a back reference within the pattern, the behaviour of preg_match_all differs from the original php implementation. The PEAR template engine (class HTML_Template_IT) doesn't work due to this bug. See Additional info for a demo script. The pattern used in the script is the same as used in the PEAR class.

The demo script contains the same pattern twice, first as a single-quoted string and then as a double-quoted string. The original php implementation treats the two differently; Quercus does not, and always behaves as if the pattern were double-quoted.
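As a side illustration of why the quoting style matters in the original php implementation, here is a minimal sketch that only inspects the strings and does not run any regex. The variable names are made up for this example. In a double-quoted string PHP itself consumes \1 as an octal escape, so PCRE never sees a back reference, whereas \s is not a string escape and is passed through unchanged.

```php
<?php
// Single-quoted: backslash and digit remain two separate characters,
// so the regex engine receives a back reference to group 1.
$single = '\1';

// Double-quoted: PHP interprets \1 as an octal escape, giving the single byte 0x01.
$double = "\1";

var_dump( strlen( $single ) );    // int(2)
var_dump( strlen( $double ) );    // int(1)
var_dump( $double === chr( 1 ) ); // bool(true)

// \s is not a PHP string escape, so it survives in both quoting styles.
var_dump( '\s' === "\s" );        // bool(true)
?>
```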
Demo script:
<?php
$pattern = '@<!--\s+BEGIN\s+([0-9A-Za-z_-]+)\s+-->(.*)<!--\s+END\s+\1\s+-->@sm'; // this will work with original php interpreter ONLY
// $pattern = "@<!--\s+BEGIN\s+([0-9A-Za-z_-]+)\s+-->(.*)<!--\s+END\s+\1\s+-->@sm"; // this will never work
$string = "pre block <!-- BEGIN testblock --> inside block <!-- END testblock --> post block";
$regs = array();
$result = preg_match_all( $pattern, $string, $regs, PREG_SET_ORDER );
var_dump( $result );
var_dump( $regs );
?>

The original php interpreter outputs:

int(1)
array(1) {
  [0]=>
  array(3) {
    [0]=>
    string(60) "<!-- BEGIN testblock --> inside block <!-- END testblock -->"
    [1]=>
    string(9) "testblock"
    [2]=>
    string(14) " inside block "
  }
}

Quercus outputs:
int(0)
array(0) {
}
has duplicate 0001561 closed nam Problem with back references to subpatterns in preg_match_all
has duplicate 0001560 closed nam Problem with back references to subpatterns in preg_match_all

Notes
(0001723)
obaltz   
01-17-07 03:46   
I'm sorry, the file upload didn't work but the rest of the bug was saved. Forget about 1560 and 1561.
(0001789)
obaltz   
03-27-07 07:07   
Today I found out that the back reference actually works. A different problem causes zero results on Quercus in the example above: it is the whitespace \s+ right AFTER the back reference!

Try this pattern instead:
$pattern = '@<!--\s+BEGIN\s+([0-9A-Za-z_-]+)\s+-->(.*)<!--\s+END\s+\1 \s*-->@sm';

The output will be:
int(1)
array(1) {
  [0]=>
  array(3) {
    [0]=>
    string(60) "<!-- BEGIN testblock --> inside block <!-- END testblock -->"
    [1]=>
    string(9) "testblock"
    [2]=>
    string(14) " inside block "
  }
}

However, the original php engine works with \1\s+ just like it should.
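Note that the replacement pattern is not strictly equivalent to the original one: it requires a literal space directly after the back reference, while \s+ accepts any whitespace. The following sketch, run against the original php implementation, shows the difference. The variable names and the tab-separated test string are assumptions made up for this example.

```php
<?php
$original   = '@<!--\s+BEGIN\s+([0-9A-Za-z_-]+)\s+-->(.*)<!--\s+END\s+\1\s+-->@sm';
$workaround = '@<!--\s+BEGIN\s+([0-9A-Za-z_-]+)\s+-->(.*)<!--\s+END\s+\1 \s*-->@sm';

// Same block twice: once with a space and once with a tab before the closing "-->".
$withSpace = "<!-- BEGIN testblock --> inside block <!-- END testblock -->";
$withTab   = "<!-- BEGIN testblock --> inside block <!-- END testblock\t-->";

$originalSpace   = preg_match( $original,   $withSpace ); // 1
$workaroundSpace = preg_match( $workaround, $withSpace ); // 1
$originalTab     = preg_match( $original,   $withTab );   // 1
$workaroundTab   = preg_match( $workaround, $withTab );   // 0, no literal space after \1

var_dump( $originalSpace, $workaroundSpace, $originalTab, $workaroundTab );
?>
```

So the workaround is fine for templates that always put a plain space after the block name, but it may miss blocks that use a tab or a line break there.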
(0001834)
obaltz   
04-11-07 07:59   
Here are some simpler examples focusing more on the actual problem:

<?php
$pattern = '/F(O)\1\s+BAR/';
$result = preg_match( $pattern, "FOO BAR" );
var_dump( $result );
?>

original php output: int(1)
quercus output: int(0)

Those two patterns work fine:
$pattern = '/F(O)\1 \s*BAR/'; // back reference not followed by \s+
$pattern = '/FOO\s+BAR/'; // no back reference before \s+

Actually it does not matter whether \s+, \. or anything else comes after the back reference: as long as whatever follows starts with a backslash, the match fails on Quercus:

<?php
$pattern = '/F(O)\1\.BAR/';
$result = preg_match( $pattern, "FOO.BAR" );
var_dump( $result ); // outputs int(0)
?>

However, the first expression must be a back reference to reproduce the error; two "backslashed" expressions in a row do not trigger it:

<?php
$pattern = '/FOO\.\.BAR/';
$result = preg_match( $pattern, "FOO..BAR" );
var_dump( $result ); // outputs int(1)
?>
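For convenience, the cases from this note can be collected into one script and run on both engines. This is only a sketch; the $cases and $results names are made up here. With the original php implementation every line should end in 1, while according to the results above Quercus returned 0 for the two patterns in which the back reference is directly followed by a backslash.

```php
<?php
// Each entry: pattern, subject. Patterns are single-quoted so that the
// regex engine receives the backslashes unchanged.
$cases = array(
    array( '/F(O)\1\s+BAR/',  "FOO BAR"  ), // back reference followed by \s+
    array( '/F(O)\1 \s*BAR/', "FOO BAR"  ), // back reference followed by a literal space
    array( '/FOO\s+BAR/',     "FOO BAR"  ), // no back reference before \s+
    array( '/F(O)\1\.BAR/',   "FOO.BAR"  ), // back reference followed by \.
    array( '/FOO\.\.BAR/',    "FOO..BAR" ), // two escapes in a row, no back reference
);

$results = array();
foreach ( $cases as $case ) {
    list( $pattern, $subject ) = $case;
    $results[$pattern] = preg_match( $pattern, $subject );
    printf( "%-18s %-10s %d\n", $pattern, $subject, $results[$pattern] );
}
?>
```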
(0001836)
nam   
04-11-07 13:46   
Thanks for the additional information. It appears to be a very involved issue and we are still deciding how and when to tackle it.
(0002085)
sam   
06-25-07 12:45   
php/1530