Channel: David Walker

Dealing with double-escaping in the X-Server

The X-Server double-escapes some XML character entity references, which, left unresolved, will affect the display of certain letters and symbols in your results. This article describes a simple post-processing fix to solve this problem.

Character references in XML

In XML, the ampersand is a reserved character. Together with a semicolon, it is used to delimit character references: decimal or hexadecimal codes used to represent accented characters, diacritics, and other symbols. A character like é (e-acute) will sometimes be represented as &#xe9;, for example.

It is illegal in XML to have a “bare” ampersand, such as:

<title>Jack & Jill's Adventures</title>

In this case, we need to escape the ampersand by converting it to the special ampersand character reference:

<title>Jack &amp; Jill's Adventures</title>
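In PHP, one way to do this escaping is the built-in htmlspecialchars() function. The snippet below is only a minimal sketch, not code taken from Xerxes or the X-Server, and the variable names are made up for illustration.

```php
<?php
// Illustrative only: $title, $escaped and $xml are hypothetical names.
$title = "Jack & Jill's Adventures";

// ENT_XML1 selects XML 1.0 rules; ENT_NOQUOTES leaves the apostrophe alone,
// so only &, < and > are converted.
$escaped = htmlspecialchars($title, ENT_XML1 | ENT_NOQUOTES, 'UTF-8');

$xml = "<title>" . $escaped . "</title>";
echo $xml, "\n";
```

The ampersand comes out as &amp; and the rest of the title is untouched.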

The problem

To keep bare ampersands out of its output, the X-Server converts them to the special ampersand reference. That’s a good thing. But some of the databases that Metalib searches already include XML character references in their records. The X-Server respects and preserves some of the basic entities, but escapes the leading ampersand of hexadecimal character references. A reference such as &#xe9; will be converted to &amp;#xe9;, for example.

This is what we call double-escaping. This proves problematic since an XML parser won’t recognize these double-escaped character references. Rather, it sees them for what they have become: an ampersand (&amp;) followed by some extra letters and numbers (#xe9;). What your users will see in their browser, then, is “San Jos&#xe9;”.
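To see the difference for yourself, you can parse both forms with PHP's DOM extension and compare the text the parser hands back. This is just a sketch with a made-up <city> element, not an actual X-Server record.

```php
<?php
// Hypothetical one-element documents, used only to compare parser behaviour.
$good = new DOMDocument();
$good->loadXML('<city>San Jos&#xe9;</city>');

$bad = new DOMDocument();
$bad->loadXML('<city>San Jos&amp;#xe9;</city>');

// The parser resolves the character reference into a real e-acute (UTF-8).
$goodText = $good->documentElement->textContent;

// Here the parser only resolves &amp;, leaving the literal text "&#xe9;".
$badText = $bad->documentElement->textContent;

echo $goodText, "\n";
echo $badText, "\n";
```

In the second case the parser has done its job correctly; the reference was simply no longer a reference by the time it arrived.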

To complicate matters, some databases (especially those that are screen-scraped) have HTML character references in them, such as &eacute; for é. Although perfectly valid in HTML, they are illegal in XML without a supporting DTD definition. Oddly, it might be a good idea for the X-Server to double-escape these references, since left as they are they would make the response ill-formed.
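A quick way to confirm that an HTML reference is not acceptable in plain XML is to hand one to the DOM parser. Again, this is a sketch with a made-up <city> element and no DTD, not a real X-Server response.

```php
<?php
// Collect parser errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();

// &eacute; is defined by HTML, but nothing declares it in this XML document.
$ok = $doc->loadXML('<city>San Jos&eacute;</city>');

$errors = libxml_get_errors();
libxml_clear_errors();

var_dump($ok);
foreach ($errors as $error) {
    echo trim($error->message), "\n";
}
```

The load fails and the parser complains about the undefined entity, which is why an escaped &amp;eacute; is the safer thing to carry inside the XML.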

Confused? Don’t worry.

The solution

Using XSLT, we can arrive at a simple and convenient solution to this problem. All we need to do is convert the X-Server response (double-escaped characters and all) to HTML. Once we have the HTML as a string, we can do a quick find-and-replace to convert all ampersand references (&amp;) back to the regular ampersand (&).

The Xerxes PHP code looks like:

// get xml response from x-server
$xml = $metalib->retrieve($result_set, $start, $max);

// transform to html
$html = $page->transform($xml, "xsl/results.xsl");

// undo double-escaping
$html = str_replace("&amp;", "&", $html);

That will restore our hexadecimal references to what they should be. It will also leave some bare ampersands and the HTML character references in the output, but since we’re now in HTML instead of XML, it doesn’t matter. We can hand all those references to the browser, and it displays them just as you would expect.
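If you would rather not leave bare ampersands in the page, a narrower variation is to un-escape only those ampersands that begin something shaped like a character reference. This is an alternative sketch, not what Xerxes does; the function name undo_double_escaped_references is hypothetical.

```php
<?php
// Hypothetical helper: turns "&amp;#xe9;", "&amp;#233;" and "&amp;eacute;"
// back into "&#xe9;", "&#233;" and "&eacute;", and leaves any other "&amp;" alone.
function undo_double_escaped_references(string $html): string
{
    $pattern = '/&amp;(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);/';

    // preg_replace() returns null on failure, so fall back to the input.
    return preg_replace($pattern, '&$1;', $html) ?? $html;
}

echo undo_double_escaped_references('San Jos&amp;#xe9;'), "\n";
echo undo_double_escaped_references('Jack &amp; Jill'), "\n";
```

The pattern only checks the shape of a reference, not whether a name is one the browser actually knows, so the plain str_replace() remains the simpler option when that distinction doesn't matter to you.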
