Convert spaces between PRE tags, via DOM parser

Regex was my original idea for a solution, but it soon became apparent that a DOM parser would be more appropriate. I'd like to convert spaces to &nbsp; between PRE tags within a string of HTML text. For example:

adfa a adfadfaf>

dfa dfa


abc 123
abc 123

123 123

into (note the space in the span tag attribute is preserved):

adfa a adfadfaf>

dfa dfa


abc&nbsp;123
abc&nbsp;123

123&nbsp;123

The result needs to be serialised back into string format, for use elsewhere.
…………………………………….

This is somewhat tricky when you want to insert entities without DOM converting the ampersand to &amp; entities, because entities are nodes and spaces are just character data. Here is how to do it:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
// select every text node that has a pre element among its ancestors
foreach ($xp->query('//text()[ancestor::pre]') as $textNode) {
    $remaining = $textNode;
    // mb_strpos (mbstring extension) returns a character offset, which is what splitText() expects
    while (($nextSpace = mb_strpos($remaining->wholeText, ' ', 0, 'UTF-8')) !== false) {
        // split at the space; the new node starts with that space
        $remaining = $remaining->splitText($nextSpace);
        // drop the leading space from the new node
        $remaining->nodeValue = substr($remaining->nodeValue, 1);
        // put an nbsp entity reference where the space used to be
        $remaining->parentNode->insertBefore(
            $dom->createEntityReference('nbsp'),
            $remaining
        );
    }
}
Fetching all the pre elements and working with their nodeValues doesn't work here, because the nodeValue property contains the combined DOMText values of all the children, e.g. it would include the nodeValue of the span children. Setting the nodeValue on the pre element would delete those.
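To see why writing to the pre element's nodeValue is not an option, here is a minimal sketch. The sample markup and variable names are made up for illustration and are not taken from the question.

```php
<?php
// Hypothetical sample markup: a pre element with a nested span.
$html = '<pre>abc 123 <span class="x">abc 123</span></pre>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$pre = $dom->getElementsByTagName('pre')->item(0);

// nodeValue concatenates the text of all descendants, span included.
$combined = $pre->nodeValue;
$spansBefore = $pre->getElementsByTagName('span')->length;

// Assigning to nodeValue replaces every child with a single text node.
$pre->nodeValue = str_replace(' ', '_', $combined);
$spansAfter = $pre->getElementsByTagName('span')->length;

echo "combined text: $combined\n";
echo "spans before: $spansBefore, after: $spansAfter\n";
```

The span count drops from 1 to 0, which is the loss of markup described above.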
So instead of fetching the pre nodes, we fetch all the DOMText nodes that have a pre element as an ancestor:
DOMElement pre
    DOMText "abc 123" <-- picking this
    DOMElement span
        DOMText "abc 123" <-- and this one
    DOMElement
        DOMText "123 123" <-- and this one
We then go through each of those DOMText nodes and split them into separate DOMText nodes at each space. We remove the space and insert an nbsp entity reference node before the split node, so in the end you get a tree like
DOMElement pre
    DOMText "abc"
    DOMEntityReference nbsp
    DOMText "123"
    DOMElement span
        DOMText "abc"
        DOMEntityReference nbsp
        DOMText "123"
    DOMElement
        DOMText "123"
        DOMEntityReference nbsp
        DOMText "123"
Because we only worked with the DOMText nodes, any DOMElements are left untouched, so the span elements inside the pre element are preserved.
Caveat:
Your snippet is not valid because it doesn't have a root element. When using loadHTML, libxml will add any missing structure to the DOM, which means you will get your snippet back including a DOCTYPE, html and body tag.
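A quick way to observe this is sketched below, using a made-up single-root fragment. The second half assumes PHP >= 5.4 built against libxml >= 2.7.8, where the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD flags are available; it is shown only as a possible alternative, not as part of the original answer.

```php
<?php
$fragment = '<pre>abc 123</pre>'; // hypothetical single-root fragment

$dom = new DOMDocument;
$dom->loadHTML($fragment);
$wrapped = $dom->saveHTML();

// libxml has filled in the missing document structure
$hasDoctype = stripos($wrapped, '<!DOCTYPE') !== false;
$hasBody    = strpos($wrapped, '<body>') !== false;

// Assumption: PHP >= 5.4 with libxml >= 2.7.8, where these flags exist
$bare = new DOMDocument;
$bare->loadHTML($fragment, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$unwrapped = $bare->saveHTML();

var_dump($hasDoctype, $hasBody);
echo $unwrapped;
```

Fragments with several top-level elements, like the one in the question, are commonly reported to come back restructured when LIBXML_HTML_NOIMPLIED is used, so extracting the body's children by hand is likely the safer route there.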
If you want the original snippet back, you'd have to getElementsByTagName the body node and fetch all the children to get the innerHTML. Unfortunately, there is no innerHTML function or property in PHP's DOM implementation, so we have to do that manually:
$innerHtml = '';
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $tmp_doc = new DOMDocument();
    $tmp_doc->appendChild($tmp_doc->importNode($child, true));
    $innerHtml .= $tmp_doc->saveHTML();
}
echo $innerHtml;
Also see
innerHTML in PHP’s DomDocument?
Noob question about DOMDocument in php
http://stackoverflow.com/search?q=user%3A208809+dom

…………………………………….

I see the shortcoming of my previous answer. Here is a workaround to preserve tags inside the <pre> tag:
<?php
$test = file_get_contents('input.html');
$dom = new DOMDocument('1.0');
$dom->loadHTML($test);
$xpath = new DOMXPath($dom);
$pre = $xpath->query('//pre//text()');
// manipulate nodes of type XML_TEXT_NODE
foreach ($pre as $e) {
    $e->nodeValue = str_replace(' ', '__REPLACEMELATER__', $e->nodeValue);
    // when you attempt to write &nbsp; in a DOM node
    // the & will be converted to &amp;
}
$temp = $dom->saveHTML();
// strip the DOCTYPE, html and body wrappers that loadHTML() added
// (search patterns reconstructed from Note #1)
$temp = preg_replace('/<!DOCTYPE[^>]*>/i', '', $temp);
$temp = str_replace('<html>', '', $temp);
$temp = str_replace('</html>', '', $temp);
$temp = str_replace('<body>', '', $temp);
$temp = str_replace('</body>', '', $temp);
$temp = str_replace('__REPLACEMELATER__', '&nbsp;', $temp);
echo $temp;
?>
Input
paragraph 1 remains untouched

preformatted 1

 
preformatted 2


 
preformatted 3 span text preformatted 3


 
preformatted 4 span bold test text preformatted 3

Output
paragraph 1 remains untouched

preformatted&nbsp;1

 
preformatted&nbsp;2


 
preformatted&nbsp;3&nbsp;span&nbsp;text&nbsp;preformatted&nbsp;3


 
preformatted&nbsp;4&nbsp;span&nbsp;bold&nbsp;test&nbsp;text&nbsp;preformatted&nbsp;3

Note #1
The DOMDocument::saveHTML() method in PHP >= 5.3.6 allows you to specify the node to output. Otherwise you can use str_replace() or preg_replace() to eliminate the doctype, html and body tags.
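As a rough sketch of the first option, the children of the body element can be serialised one by one with saveHTML($node). The helper name bodyInnerHtml and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper; relies on saveHTML($node), available in PHP >= 5.3.6.
function bodyInnerHtml(DOMDocument $dom)
{
    $html = '';
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $html; // nothing was parsed into a body element
    }
    // serialise each child of body in document order
    foreach ($body->childNodes as $child) {
        $html .= $dom->saveHTML($child);
    }
    return $html;
}

$dom = new DOMDocument;
$dom->loadHTML('<p>one two</p><pre>abc <span>123</span></pre>');
$inner = bodyInnerHtml($dom);
echo $inner;
```

This avoids both the temporary document per child and the string replacements on the serialised output.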
Note #2
This trick seems to work and results in one less line of code but I am not sure if it is guaranteed to work:
$e->nodeValue = utf8_encode(str_replace(' ', "\xA0", $e->nodeValue));
// the DOM library will attempt to convert 0xA0 to &nbsp;
// nodeValue expects UTF-8 encoded data, but a lone 0xA0 byte is not valid in this encoding,
// hence the replaced string must be UTF-8 encoded
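A possible variation on this trick, sketched here as an assumption rather than a tested recommendation, is to insert the no-break space already encoded as UTF-8 ("\xC2\xA0"). That avoids utf8_encode(), which is deprecated as of PHP 8.2 and would also re-encode any non-ASCII characters already present in the text node. The sample markup is hypothetical.

```php
<?php
$dom = new DOMDocument;
$dom->loadHTML('<pre>abc 123 <span>abc 123</span></pre>'); // hypothetical markup
$xpath = new DOMXPath($dom);

foreach ($xpath->query('//pre//text()') as $text) {
    // "\xC2\xA0" is U+00A0 already encoded as UTF-8, so no utf8_encode() is needed
    $text->nodeValue = str_replace(' ', "\xC2\xA0", $text->nodeValue);
}

$pre = $dom->getElementsByTagName('pre')->item(0);
$nbspCount = substr_count($pre->textContent, "\xC2\xA0");
$spanCount = $pre->getElementsByTagName('span')->length;

echo "no-break spaces: $nbspCount, spans kept: $spanCount\n";
echo $dom->saveHTML($pre), "\n";
```

Whether the serialiser writes the character as &nbsp; or as a raw U+00A0 character may depend on the libxml version and the document encoding, so it is worth checking the output in your environment.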
