The X-Server double-escapes some XML character entity references, which, left unresolved, will affect the display of certain letters and symbols in your results. This article describes a simple post-processing fix to solve this problem.
Character references in XML
In XML, the ampersand is a reserved character. Together with a semicolon, it is used to delimit character entity references: decimal or hexadecimal codes used to represent some accented characters, diacritics, and other symbols. A character like é (e-acute) will sometimes be represented as &#xe9; for example.
It is illegal in XML to have a “bare” ampersand, such as:
<title>Jack & Jill's Adventures</title>
In this case, we need to escape the ampersand by converting it to the special ampersand character reference:
<title>Jack &amp; Jill's Adventures</title>
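As a minimal sketch of that escaping step, PHP's built-in htmlspecialchars() can produce the ampersand reference. The title string is only an illustration, and nothing here is specific to the X-Server or Xerxes.

```php
<?php
// Illustrative title containing a bare ampersand.
$title = "Jack & Jill's Adventures";

// ENT_XML1 applies XML escaping rules; ENT_NOQUOTES leaves the apostrophe untouched.
$escaped = htmlspecialchars($title, ENT_NOQUOTES | ENT_XML1, 'UTF-8');

echo "<title>" . $escaped . "</title>\n";
// prints: <title>Jack &amp; Jill's Adventures</title>
```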
The problem
In order to prevent bare ampersands from getting included in the output, the X-Server converts them to the special ampersand reference. That’s a good thing. But some of the databases that Metalib searches also include these XML character references. The X-Server respects and preserves some of the basic entities, but escapes the leading ampersand of hexadecimal entity references. A reference such as &#xe9; will be converted to &amp;#xe9; for example.
This is what we call double-escaping. It is problematic because an XML parser won’t recognize these double-escaped character references. Rather, it sees them for what they have become: an escaped ampersand (&amp;) followed by some extra letters and numbers (#xe9;). What your users will see in their browser, then, is “San Jos&#xe9;”.
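The effect can be reproduced in a few lines of PHP. This is only a sketch: the str_replace() call stands in for the X-Server's escaping (an assumption made for illustration), and html_entity_decode() stands in for what a parser does when it resolves references in a single pass.

```php
<?php
// A value as it might arrive from a database, with a hexadecimal reference.
$fromDatabase = "San Jos&#xe9;";

// Stand-in for the X-Server escaping the leading ampersand.
$doubleEscaped = str_replace("&", "&amp;", $fromDatabase);

// Resolve references once, as a parser would.
$intended = html_entity_decode($fromDatabase, ENT_QUOTES | ENT_XML1, 'UTF-8');
$actual   = html_entity_decode($doubleEscaped, ENT_QUOTES | ENT_XML1, 'UTF-8');

echo $doubleEscaped . "\n"; // San Jos&amp;#xe9;
echo $intended . "\n";      // San José
echo $actual . "\n";        // San Jos&#xe9;
```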
To complicate matters, some databases (and especially those that are screen-scraped) have HTML character references in them, such as &eacute; for é. Although perfectly valid in HTML, they are illegal in XML without a supporting DTD definition. Oddly, it might be a good idea for the X-Server to double-escape these references.
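A short sketch of why that is, assuming PHP's SimpleXML extension is available: without a DTD, the parser rejects the HTML reference outright, while the double-escaped form parses cleanly and keeps the reference as literal text.

```php
<?php
// Collect parse errors quietly instead of emitting warnings.
libxml_use_internal_errors(true);

// An HTML-only reference with no DTD to define it: not well-formed XML.
$rejected = simplexml_load_string('<title>San Jos&eacute;</title>');
var_dump($rejected === false); // bool(true)

// The double-escaped form is well-formed; the reference survives as plain text.
$accepted = simplexml_load_string('<title>San Jos&amp;eacute;</title>');
echo (string) $accepted . "\n"; // San Jos&eacute;
```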
Confused? Don’t worry.
The solution
Using XSLT we can arrive at a very simple and convenient solution to this problem. All we need to do is convert the X-Server response (double-escaped characters and all) to HTML. Once we have the HTML as a string, we can do a quick find-and-replace to convert all ampersand references (&amp;) back to the regular ampersand (&). The browser then resolves the restored character references itself.
The Xerxes PHP code looks like:
// get the XML response from the X-Server
$xml = $metalib->retrieve($result_set, $start, $max);
// transform to HTML
$html = $page->transform($xml, "xsl/results.xsl");
// undo double-escaping: turn &amp; back into &
$html = str_replace("&amp;", "&", $html);
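The blanket replace also turns a legitimately escaped ampersand (as in "Jack &amp; Jill") back into a bare one, which browsers generally tolerate. If that is a concern, a narrower variant can restore only those ampersands that begin a character reference. The sketch below is an alternative, not the Xerxes code; the function name undoDoubleEscaping is hypothetical.

```php
<?php
// Hypothetical helper: un-escape only ampersands that start a character
// reference (hexadecimal, decimal, or named), leaving a lone &amp; intact.
function undoDoubleEscaping(string $html): string
{
    return preg_replace(
        '/&amp;(#x[0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);/',
        '&$1;',
        $html
    );
}

$html = "San Jos&amp;#xe9; and Jos&amp;eacute;, by Jack &amp; Jill";
echo undoDoubleEscaping($html) . "\n";
// prints: San Jos&#xe9; and Jos&eacute;, by Jack &amp; Jill
```

The trade-off is a slightly slower replace in exchange for output that stays valid wherever a plain ampersand was intended.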