XHTML to HTML WordPress plugin

 

Important

At last there is a new version of this plugin. This version is no longer supported. Please download the new version.

 

This tiny plugin filters WordPress’s output to produce HTML instead of XHTML. It is simple to use and will ensure that your WordPress Blog stands at least a fighting chance of being properly standards compliant.

Background

The WordPress platform is based with the best intentions, from boots to brow, on XHTML and has been ever since I’ve been using it. Now this is a shame really because that single fact may be preventing your Blog from being properly standards compliant.

Put your hands up if you are running a blog that serves documents using a MIME type of “application/xhtml+xml”? What’s that deathly silence I hear – what, nobody is doing that? Then in that case nobody (regardless of what doctype you are using) is serving proper XHTML and, worse, no one is running a standards compliant website. Perhaps even worse still, you probably shouldn’t even try to use an XHTML MIME type on your website either. Now I bet you didn’t know that.

The vast, vast majority of people whose blogs are served as MIME type “text/html” should be using the HTML 4.01 doctype rather than XHTML. The issues surrounding this problem are, unfortunately, rather technical, so I will endeavour to write a lay person’s guide to the subject soon, covering the issues a bit more simply than they are covered in most places I know of.

In the meantime there is a very good article by WebDevout about this which is worth reading.

What The Plugin Does

XHTML to HTML is a simple output filter that translates XHTML documents into valid HTML 4.01.
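
As a rough illustration of what an output filter of this kind does, the sketch below buffers a page with PHP’s ob_start() and rewrites it with preg_replace(). The function name xth_sketch_filter and the shortened pattern list are assumptions made for the example; they are not the plugin’s actual source.

```php
<?php
// Illustrative output filter; the name and the pattern list are assumptions.
function xth_sketch_filter(string $buffer): string
{
    $patterns = [
        '@XHTML 1\.0 (?:Strict|Transitional)@',              // DOCTYPE public identifier
        '@xhtml1/DTD/xhtml1-(?:strict|transitional)\.dtd@',  // DTD location
        '@\s+xmlns="http://www\.w3\.org/1999/xhtml"@',       // XML namespace attribute
        '@\s*/>@',                                           // XHTML empty-element ending
    ];
    $replacements = ['HTML 4.01', 'html4/strict.dtd', '', '>'];

    // Each pattern is applied in turn to the result of the previous one.
    return preg_replace($patterns, $replacements, $buffer);
}

$page = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
      . '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">' . "\n"
      . '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Hi<br /></p></body></html>' . "\n";

ob_start('xth_sketch_filter'); // everything echoed from here on passes through the filter
echo $page;
ob_end_flush();                // runs the callback and sends the rewritten page
```

Because the filter sees only the finished page as one string, it cannot tell markup from anything else that happens to contain the same characters, which is the limitation discussed in the comments.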

Do I need to write in HTML now?

No, you don’t have to change a thing. All your carefully coded XHTML will translate into pristine HTML 4.01 seamlessly. Remember, WordPress is an XHTML platform – all headers, plugins, themes and filters assume you will be using XHTML as the output Doctype. Whilst this is, technically, wrong (unless you use the right MIME type) you can continue to write in XHTML if you want to. However, you could alternatively switch to writing in HTML 4.01 instead. You do not have to write sloppy, hard-to-read code: keep your tags lowercase if you like, close your tags à la XHTML, use a DOCTYPE and so on, because that’s all valid HTML too!

Installation

  1. Download the plugin using the link above and extract the ZIP archive onto your computer somewhere.
  2. Copy the folder “XHTML-to-HTML” to your /wp-content/plugins/ folder.
  3. Activate the plugin in WordPress.
  4. That’s it!

Development

The plugin really does very little, which is not a bad thing for an output filter. It has no bugs I’m aware of, but suggestions for improvements are always welcome.

1. Microsoft browsers do not support XML (XHTML is a type of XML). IE7 has limited support but lower versions have none whatsoever. So, unless you’d like to banish that audience from your blog you can’t even consider trying to use XHTML properly – that is, by using the correct MIME type.

36 Comments  (sorry, comments are closed now)

  1. Edward says:

    This plugin seemed to be a quick solution to the “XHTML vs HTML” issue, but I had a week of nightmares trying to get the built-in WordPress WYSIWYG editor (as well as TinyMCE Advanced) to work instead. None of the solutions from the WordPress support blog helped. The only cause of the problem was the “XHTML to HTML” plugin. I was able to get WYSIWYG to work as soon as this plugin was disabled.
    You can find the issue and several solutions to the WYSIWYG issue here.
    http://wordpress.org/support/topic/164990

    BTW, this plugin also changes the look of the “Ozh’ Admin Drop Down Menu” plugin. It is not as serious as the previous issue, but it can be a problem for somebody with a small screen resolution. The issue is that white space appears between menu entries, which enlarges the menus considerably.

    If you ask me how to fix those issues (and many others that I haven’t experienced, but somebody probably has), I would suggest making this plugin work only on the front end of WordPress and disabling it on the backend. That way the admin area will be left untouched, and this should minimize plugin incompatibilities.

    Thank you in advance. Waiting for your answer.

  2. John Kilroy says:

    Hi Edward,
    I’m sorry you’ve been having trouble with the plugin. This is the first problem that’s been reported, so I’d like to get to the bottom of it.

    Please let me know what version of WordPress you’re using. Also, please can you email me the source code of your dashboard page (or whichever page the plugin is messing up) so I can see what’s going on.

    I’ll get on to this as soon as possible and hopefully get it fixed for you.

    By the way, you make an interesting point about enabling the plugin solely for ‘frontend’ pages. I’ll look at that too.
    John.

  3. John Kilroy says:

    Dear Milbits,
    I’m sorry but my language skills don’t extend to Spanish (I assume you’re Spanish) – I’ll try to get your message translated. Anyway, thanks for the pingback.

  4. Edward says:

    Unfortunately, I don’t know your email to send you the source code of my dashboard, but you know mine. Could you send me an email so I’ll reply with attachments?

    Thank you for your swift reply.

  5. Jonathan says:

    John,

    Nice plugin. I’ve converted it to work with Textpattern, and I’ve added a couple of lines to get rid of the xml:lang=”foo” bits and some erroneous spaces at the end of tags in the header. Here’s what I ended up with on lines 13 and on:

    $xhtml[0] = '/XHTML 1.0 Transitional|XHTML 1.0 Strict|XHTML 1.0 Frameset|XHTML 1.1|XHTML Basic 1.0|XHTML Basic 1.1/';
    $xhtml[1] = '/xhtml1\/DTD\/xhtml1-transitional.dtd|xhtml1\/DTD\/xhtml1-strict.dtd|xhtml11\/DTD\/xhtml11.dtd|xhtml-basic\/xhtml-basic10.dtd|xhtml-basic\/xhtml-basic11.dtd/';
    $xhtml[2] = '/\/>/';
    $xhtml[3] = '/\/\s+>/';
    $xhtml[4] = '/\s+\/>/';
    $xhtml[5] = '/\s+xmlns="http:\/\/www.w3.org\/1999\/xhtml"/';
    $xhtml[6] = '/\s+xml:lang="(.*)"\s+lang="(.*)"/';
    $xhtml[7] = '/\s+>/';

    $html[0] = 'HTML 4.01';
    $html[1] = 'html4/strict.dtd';
    $html[2] = '>';
    $html[3] = '>';
    $html[4] = '>';
    $html[5] = '';
    $html[6] = '';
    $html[7] = '>';

  6. Jonathan says:

    Haha! The plugin works for comments, too, I guess.

  7. John Kilroy says:

    Good work Jonathan.

    I’ve yet to test your changes but, out of interest, I’m not sure what your $xhtml[7] pattern changes that isn’t already changed by $xhtml[3]. $xhtml[7] just strips out spaces before the closing angle bracket, which isn’t erroneous in either XHTML or HTML. Unless I’ve missed something there.

    Glad you found the plugin useful.
    John.

  8. John Kilroy says:

    Just thought I’d write a quick update regarding Edward’s query.

    Turns out the Ozh plugin wasn’t well-formed and was thus being mangled by XHTML-to-HTML (XTH). However, there is a bug in XTH and its interactions with certain TinyMCE functions. This affects the Admin interface only, not the frontend.

    In short, XTH needs to ignore [CDATA], in particular JavaScript regex, which obviously use the forward slash extensively. Unfortunately, at the moment XTH clobbers JS regex.

    I’ve simply (and with apologies) not had time to sit down and address this but plan to do so as soon as I can. In the meantime it’s still safe to use XTH with your blog’s frontend, though it might cause bugs with the dashboard if you use the Visual editor (I don’t use it, which is why I never noticed the backend bug).

    If anyone gets time to address this issue before I do, please post your solution here.
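
To make the clobbering concrete, here is a minimal sketch that applies only the “/>” replacement to a page containing an inline script. The sample page and the variable names are invented for the illustration.

```php
<?php
// Invented sample: an inline script whose JavaScript regex literal contains
// the same "/>" character sequence that the filter searches for.
$page = <<<'HTML'
<p>Hello<br/></p>
<script type="text/javascript">
var clean = markup.replace(/<br\/>/g, "");
</script>
HTML;

// The self-closing-tag replacement, applied to the whole page.
$filtered = preg_replace('/\/>/', '>', $page);

echo $filtered, "\n";
```

The <br/> in the paragraph is converted as intended, but the regex literal inside the script is rewritten as well (it now reads /<br\>/g), so the script no longer does what its author wrote.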

  9. Jonathan says:

    I noticed that there were some tags that still had a space before the bracket. Adding that line removed them. It seems that [3] was cleaning the slash, but not the space.

    Then again, [4] should do that for these particular tags. I added [4] after [7]. Some tags are formatted as , others as , and still others as . So [2], [3], and [4] may take care of it without the need of [7].

  10. Jonathan says:

    blah. That should have been , , and .

  11. Jonathan says:

    Let’s try it with HTML entities:

    <foo/>, <foo/ >, and <foo />

  12. John Kilroy says:

    Hi Jonathan,
    I’m not really a PHP programmer, my main language is Perl. But as far as I know PHP regex are modelled quite closely on Perl regex, and in Perl, $xhtml[3] would clear any forward slash followed by one or more spaces followed by an angle bracket.

    If a tag is closed a la HTML, without a slash, it won’t remove trailing space. This is by design, as trailing space is a matter of coding style/preference and is perfectly valid in HTML.

    The objective of XTH is to create valid HTML, not stylistically and subjectively pleasing HTML. Seeing as the effects of [4] and [7] are simply stylistic (they don’t affect validity) I’m ambivalent about including them though of course people are free to hack the plugin around to suit their preferred coding styles.

    If you’re sure [4] and [7] address validity rather than style, then I would be interested to see an example of an XHTML closing tag that is fixed by $xhtml[7] but not by $xhtml[3], as this might show an important difference between Perl and PHP regex engines.

    Cheers again,
    John.

  13. Jonathan says:

    Hey John,

    As you pointed out, [7] is simply for removing space at the end. I didn’t know it wasn’t erroneous. I thought HTML 4.01 Strict was a bit less lenient in this regard. Such as using lower-case tags. But again, that’s a matter of which DOCTYPE is chosen. So you’re right in saying something like that should be added by the end user.

    [4] isn’t stylistic. I’ve seen tags closed three ways: <foo/>, <foo />, and <foo/ >. [4] just takes care of the last one. But three lines could possibly be reduced to one with something like:

    $xhtml[2] = '/(\s+)?(\/)?(\s+)?>/';

    If that’s even possible in Perl/PHP regex style. That’s more like .htaccess style! I don’t know how it would be formed in this case.
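
The exchange above turns on how the patterns interact, so a small sketch may help. It runs patterns [2], [3], [4] and [7] from Jonathan’s listing over the three tag spellings, relying on the fact that preg_replace() with arrays applies each pattern to the result of the previous one. The single pattern at the end is a hypothetical alternative, not something proposed in the thread in this exact form.

```php
<?php
// Patterns [2], [3], [4] and [7] from the listing, kept in array order.
$xhtml = ['/\/>/', '/\/\s+>/', '/\s+\/>/', '/\s+>/'];
$html  = ['>', '>', '>', '>'];

$samples = ['<foo/>', '<foo />', '<foo/ >'];

foreach ($samples as $tag) {
    // Without [7]: only the three slash-removing patterns.
    $withoutLast = preg_replace(array_slice($xhtml, 0, 3), array_slice($html, 0, 3), $tag);
    // With [7]: the trailing-space pattern runs last.
    $withAll = preg_replace($xhtml, $html, $tag);
    printf("%-8s  [2]-[4]: %-7s  with [7]: %s\n", $tag, $withoutLast, $withAll);
}

// Hypothetical single pattern: optional space, a slash, optional space, then ">".
// Unlike the pattern suggested above it requires the slash, so plain ">" is left alone.
$single = '@\s*/\s*>@';
foreach ($samples as $tag) {
    printf("%-8s  single pattern: %s\n", $tag, preg_replace($single, '>', $tag));
}
```

With this ordering, [2] strips the “/>” from <foo /> before [4] gets a chance to match, which leaves <foo > and is why [7] appears to be needed. Whether the leftover space matters is the style-versus-validity question debated above.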

  14. John Kilroy says:

    Hi Jonathan,

    I’m trying politely to tell you that you are being seduced by a red herring. As I said, the bit of code you added [4] and [7], whilst you are free to add it, actually creates no improvement in validity over the original.

    Please read the HTML and XHTML specifications and try running code through the W3C validators and you’ll see that this is true. I understand you want your code to look a certain way – that’s fine, you can do that, however please understand that it IS a stylistic and not a functional improvement.

    For reference you should know that the ONLY valid XHTML self-closing tag is ‘/>’. Space around those two characters is utterly irrelevant and space between them is utterly invalid. XTH already fixes ALL valid XHTML self-closing tags.

    BTW, all versions of HTML are case insensitive. HTML Strict and Transitional differ mainly in their support for deprecated tags. If you want to learn about the differences there is some good info on the W3C website.

    happy reading,
    John.

  15. calliope says:

    This is a brilliant and very informative plugin. I am facing an RSS issue though, which occurs when I have your plugin activated: the feeds are not working for me no matter what.

    Would you have any suggestion about this?

  16. John Kilroy says:

    Hi Calliope,
    Unfortunately the plugin is still young and unruly and it interferes with things like Javascript CDATA which it shouldn’t. Also bearing in mind that your RSS feeds are XML I’m not utterly surprised you’re having a problem :)

    Can you give me more information about what the set up is (other plugins, how are you doing the rss) and any error messages you get to help me target the problem better.

    It’s all talk at the moment, but we’re about to start work on a new version of XTH which will only filter wordpress frontend pages and posts (so no backend or rss filtering). We’ll clean up the parser too. Hope to have this out before Easter.

    John.

  17. Nate says:

    Thank you for this! Saved me a lot of frustration when a quick Google search led me here.

  18. Chris says:

    Hi,
    sorry, I’m not that skilled in this topic, so maybe you can enlighten me.

    after despairing on my rss2 feed i found out your plugin causes some problems with rss2 validation.
    2 examples:
    - a proper enclosure tag should look like:
    -> with your plugin it looks like:

    - including the atom link should look like this:

    -> this one doesn’t work:

    I guess that because RSS is XML, it insists on being valid XML and not HTML? Either we need to remove the whitespace (I didn’t test it) or stop the plugin from filtering feeds.

    So, again, I don’t really have an idea what I’m talking about – any help would be appreciated.

    regards
    Chris

  19. Chris says:

    ok.. no code – anyway – this one: text /> ends up this way: text >

    • John Kilroy says:

      Hi Chris,

      RE your 2nd message.
      That behaviour is what XTH is supposed to do: remove xhtml end tags, leaving you with a valid html end tag.

      There is a problem with XTH and certain sorts of RSS feed and also with some javascript. I haven’t had the time to make it fully compatible with either yet – I was supposed to do this at easter but am so busy I haven’t had time.

      I’m afraid as it stands you might have to disable XTH if you decide that your RSS feed is more important. I will release a new version as soon as I can.

      best wishes,
      John.

  20. Chris says:

    Hi John – yes, I know that’s what it is supposed to do, but it seems this is not so good for the RSS pages. I extended your plugin (with my limited knowledge, so there is probably a more elegant way) to check the current URL and exclude all RSS URLs.

    check url: http://www.bradino.com/php/get-current-url/

  21. Paul Novitski says:

    Hi John, nice plugin. The reason I like your idea so much is that, while I’m seriously bending toward moving “back” to HTML, I’m addicted to the W3C validator’s precision in checking my markup which I’ll lose if I code in native HTML4, even /Strict. I love having my cake and eating it too.

    The criticism that your logic tramples CDATA is significant. You could lessen the likelihood that your plugin would interfere with non-HTML elements by being more specific about those empty element closures: Instead of replacing ‘/\/>/’ with ” you could replace ‘@(@’ with ‘$1>’.

    Or specify all of the empty XHTML elements:

    ‘@(@’

    You could also improve on .* by excluding ; and other characters that might occur in JavaScript and more to the point aren’t permitted in valid XHTML.

    To be rigorous, though, you should really smarten your code enough to exclude inline script, CDATA, and HTML comments. In PHP there’s a handy fourth argument ‘flags’ to preg_split() that lets you capture the delimiters that split a string into an array. Using this you can splinter a page of HTML with several patterns at once such as:

    ‘((<!--)|(<\/)|(<))’ to split on <!--, </, and <

    Then as you walk the output array it’s easy to see which delimiting strings precede each chunk of content, and perform operations on only the ones you want. Applying this technique to <script, <style, and XHTML’s funky CDATA escapes would be easy.

    Finally, a minor coding note: using slashes as your regexp delimiters means having to escape the slashes that occur within your patterns. Since those slashes are integral to the task at hand, why not simply use another delimiter (I chose @ above) to make your code easier for us hyoomins to read?

    Regards,
    Paul
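
A minimal sketch of the preg_split() idea follows. It is a simplified variant rather than the exact scheme described above: instead of splitting on tag openers and walking the pieces, it splits on whole protected regions (scripts, styles, comments and CDATA sections) and filters only the text in between. The function name and the protected-region pattern are assumptions made for the example.

```php
<?php
// Sketch: rewrite "/>" only outside scripts, styles, comments and CDATA.
// The function name and the pattern are illustrative assumptions.
function xth_filter_outside_protected(string $page): string
{
    $protected = '@(<script\b.*?</script>|<style\b.*?</style>|<!--.*?-->|<!\[CDATA\[.*?\]\]>)@is';

    // PREG_SPLIT_DELIM_CAPTURE keeps the matched regions in the result, so the
    // array alternates: even indexes are ordinary markup, odd indexes are
    // protected regions.
    $chunks = preg_split($protected, $page, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($chunks === false) {
        return $page; // on a regex error, leave the page untouched
    }

    foreach ($chunks as $i => $chunk) {
        if ($i % 2 === 0) {
            $chunks[$i] = preg_replace('@\s*/>@', '>', $chunk);
        }
    }
    return implode('', $chunks);
}

$page = '<p>One<br /></p><script>var r = /<br\/>/g;</script><!-- <hr /> --><img src="a.png" />';
echo xth_filter_outside_protected($page), "\n";
```

In this sample the regex literal in the script and the commented-out <hr /> come through unchanged, while the <br /> and <img /> tags lose their slashes. Real pages have edge cases the pattern does not cover, such as a string containing “</script>” inside a script.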

  22. JK says:

    Hey Paul,

    >> I’m seriously bending toward moving “back” to HTML

    I like to see it more as a preparatory move forwards to the next markup standard, HTML 5 :)

    Thanks for your comments on the XTH plugin. We’ve been aware for many months that it dorks on CDATA and some RSS feeds. It was never meant as more than a ready-roll 5-minute monkey wrench to get the job done for now. Trouble is ‘for now’ has lasted eight months and there’s no obvious sign when there’ll be enough time to come back to it. Though it’ll happen at some point.

    If we’d intended to be ‘rigorous’ there are a number of things we’d have done, including choosing a wholly different way of accomplishing this task :) But as far as your difficulty reading the delimiters is concerned: I’m sorry if you find it hard to read, my background is in Perl rather than PHP and I’ve noticed generally that Perl programmers seem to have fewer probs reading escapes than do people who use PHP.

    Basically, feel free to change the code as you like! If you get your ideas working please send us a copy.

    best wishes,
    John.

  23. Leanne says:

    To get the plugin to work without affecting admin or rss feeds, here’s what I did (I did incorporate Jonathan’s changes to remove the trailing space…I know it’s aesthetic only, but it satisfies my perfectionist streak)

    <?php
    /*
    Plugin Name: XHTML to HTML
    Plugin URI: http://www.kilroyjames.co.uk/2008/09/xhtml-to-html-wordpress-plugin
    Version: 1.0
    Description: This plugin filters XHTML output to create a document which is valid HTML 4.01 Strict.
    Author: John Kilroy
    Author URI: http://www.kilroyjames.co.uk
    */

    function HTMLify($buffer) {

    if(!((is_admin() || is_feed()) ){

    $xhtml[0] = ‘/HTML 4.01|HTML 4.01|HTML 4.01|HTML 4.01|HTML 4.01|HTML 4.01\/’;
    $xhtml[1] = ‘/xhtml1\/DTD\/xhtml1-transitional.dtd|xhtml1\/DTD\/xhtml1-strict.dtd|xhtml11\/DTD\/xhtml11.dtd|xhtml-basic\/xhtml-basic10.dtd|xhtml-basic\/xhtml-basic11.dtd/’;
    $xhtml[2] = ‘/\/>/’;
    $xhtml[3] = ‘/\/\s+>/’;
    $xhtml[4] = ‘/\s+\/>/’;
    $xhtml[5] = ‘/\s+xmlns=”http:\/\/www.w3.org\/1999\/xhtml”/’;
    $xhtml[6] = ‘/\s+xml:lang=”(.*)”\s+lang=”(.*)”/’;
    $xhtml[7] = ‘/\s+>/’;

    $html[0] = ‘HTML 4.01?’;
    $html[1] = ‘html4/strict.dtd’;
    $html[2] = ‘>’;
    $html[3] = ‘>’;
    $html[4] = ‘>’;
    $html[5] = ”;
    $html[6] = ”;
    $html[7] = ‘>’;

    return preg_replace($xhtml, $html, $buffer);
    }
    ob_start(“HTMLify”);

    }

    ?>

  24. Leanne says:

    P.S. Watch out for those pesky quotes/double quotes. I neglected to encode them before posting.

    P.P.S. I love this plugin — thank you for promoting forward movement in html standards. Do you accept donations?

  25. Tim Snadden says:

    Hey – I’m really grateful for the plugin, but it’s a bit disingenuous to say you are going to be writing about my google search terms at the top of the page.

    • John Kilroy says:

      I didn’t mention Google anywhere on this page !?! not sure what you mean…

  26. Tim Snadden says:

    I got around the RSS issue by searching for the xml declaration before filtering (line 25).

    if(strpos($buffer, ‘?’) == 1) return $buffer;
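
A more explicit form of this kind of check might look like the sketch below, which tests whether the buffer begins with an XML declaration. The function name is hypothetical. Note that an XHTML page that itself starts with an XML declaration would also be skipped by such a test.

```php
<?php
// Hypothetical helper: true when the buffer begins with an XML declaration.
function xth_starts_with_xml_declaration(string $buffer): bool
{
    // strncmp() returns 0 when the first five bytes of both strings are equal.
    return strncmp($buffer, '<?xml', 5) === 0;
}

$feed = '<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"></rss>';
$html = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"><p>What?</p>';

var_dump(xth_starts_with_xml_declaration($feed)); // the feed would be returned unfiltered
var_dump(xth_starts_with_xml_declaration($html)); // the page would go through the filter
```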

    • John Kilroy says:

      Tim I recommend you use Leanne’s solution right above your comment – she simply tests for a feed before applying the filter.

      • Ben says:

        I get a parse error when using Leanne’s version of the code, it says there is an unexpected “{” on line 13… was I supposed to append her code to yours or replace it? I simply replaced it and changed the ‘s and “s to the proper format. What am I missing?

        • Teo says:

          Ben, I was receiving the same parse error when trying Leanne’s version of the code, even after replacing the quotation marks to their proper format.

          I believe I found the solution (someone correct me if I’m wrong).

          There’s an extra ‘(‘ in the original if statement on line 13.

          if(!((is_admin() || is_feed()) ){

          should be changed to:

          if(!(is_admin() || is_feed()) ){

          After doing that, the plugin was able to be activated.

          I am running WP 2.8.2, and on a local host at that, just in case that changes things.

          -T

          • John Kilroy says:

            I don’t believe that is the problem. I removed the pointless/invalid brackets when Ben first told me about his problems with Leanne’s code, and though fixing the brackets means the plugin will activate it still doesn’t work properly.

            I haven’t had time to focus on this but I will when I can. If you want to look at it, I would guess that the real problem is that the plugin, regardless of the admin and feed checking, still parses javascript and other CDATA that it oughtn’t to be touching. And it may still interfere with RSS loaded via Ajax.

            XTH has got to deal with that problem – and focus solely on inline XHTML code – before it can really work generally and reliably.
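
For anyone experimenting with the admin and feed guard, the sketch below shows one way to arrange the control flow so that every path returns the buffer and the callback is registered outside the function body. It uses a reduced pattern set, and $skip stands in for WordPress’s is_admin() || is_feed() so that the example runs outside WordPress. The function name is hypothetical, and the sketch does nothing about the JavaScript and CDATA problem described above.

```php
<?php
// Sketch only. $skip stands in for WordPress's is_admin() || is_feed() so the
// example runs outside WordPress; the function name is hypothetical.
function htmlify_guarded(string $buffer, bool $skip): string
{
    if ($skip) {
        return $buffer; // admin screens and feeds pass through untouched
    }
    // Reduced pattern set for brevity: strip "/>", then any space left before ">".
    return preg_replace(['/\/>/', '/\s+>/'], ['>', '>'], $buffer);
}

echo htmlify_guarded('<link rel="self" href="feed.xml" />', true), "\n";
echo htmlify_guarded('<img src="a.png" alt="" />', false), "\n";

// Inside WordPress the callback would be registered at file level, outside the
// function body, along these lines:
// ob_start(function ($buffer) { return htmlify_guarded($buffer, is_admin() || is_feed()); });
```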

  27. Jeroen - XprsYrslf says:

    In the above post, I’ve added two lines to change target=”" into an onclick event.

  28. Street.Walker says:

    Thank you, nice plugin.
    Until now I have been changing my DOCTYPE manually in the WP theme’s header file.
    I hope it’ll help me :)

  29. Best WP Plugins says:

    You have some great plugins on your post. Your insight and expertise would be a welcome addition to our new community, i hope you will consider joining, and thanks for sharing!

  30. nuevayores blogs says:

    Thanks for this plug

Trackbacks/Pingbacks

  1. ¿Habla HTML? « Weblog Tools Collection - [...] looking around the web I also came upon this website which has a plugin for WordPress that does the …
  2. Virgulă To Sedilă | Best Plugins - wordpress – widgets – plugin 2012 - [...] Inspirat de plugin-ul lui John Kilroy, XHTML to HTML – http://www.kilroyjames.co.uk/2008/07/xhtml-to-html-wordpress-plugin/ [...]
  3. HTML to XHTML | Anand Verma - [...] Most people don’t realise that to use XHTML properly it must be served using the new MIME TYPE “application/xhtml+xml”. …