Convert WordPress Into Markdown Files With YAML Front Matter (PHP Script)

Now that we’re ready to start extracting our WordPress content into a format our static site generator can use, the big question becomes:

What format do we want to be able to extract our WordPress data into?

We have a plethora of choices available, but I like having YAML front matter at the top of my posts and pages to control how the content will be embedded, and many of the popular static site generators prefer it too. A YAML-like header on my posts and pages also gives me the flexibility to write my own custom header information.

I’d like the content itself to be Markdown based, even though the data we get out of WordPress will be a mix of HTML and Markdown (unless you’ve specifically written <p> tags for every paragraph break in your content). I’d prefer to continue writing in Markdown and then process my content into HTML for uploading to my static site.

Therefore, I have constructed my output to look a little like this:

---
title: This is the title of the page
date: 2012-10-02 14:30
author: Ryan
tags:
 - hello
 - world
 - blog
excerpt: This is a wonderful post about title tags on pages
template: post
redirects:
 - /90/this-is-the-title-of-the-page/index.html
---

The design of the YAML header in my output has been driven by a node package, YAML Front Matter to JSON, which transforms my entire page into JSON.
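
To make the header/body split concrete, here is a rough PHP sketch of the first step any front matter tool has to perform: separating the header from the content. The function name splitFrontMatter is hypothetical and this is not how the node package itself is implemented; it only assumes the header sits between an opening "---" on the first line and the next line containing only "---".

```php
<?php
// Hypothetical helper (not part of the export script): split a document
// into its front matter header and its body.
function splitFrontMatter( $text ) {
    // normalise Windows line endings so the pattern only deals with "\n"
    $text = str_replace( "\r\n", "\n", $text );
    // header = everything between the opening "---" line and the next "---" line
    if ( preg_match( '/\A---\n(.*?)\n---(?:\n(.*))?\z/s', $text, $m ) ) {
        return array(
            'header' => $m[1],
            'body'   => isset( $m[2] ) ? $m[2] : '',
        );
    }
    // no front matter found: treat everything as body
    return array( 'header' => '', 'body' => $text );
}

$doc   = "---\ntitle: Hello\ntemplate: post\n---\nThis is the content.\n";
$parts = splitFrontMatter( $doc );
echo $parts['header'], "\n";
echo $parts['body'];
```

The header string would still need to go through a real YAML parser; the sketch only shows where the header ends and the content begins.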

The remainder of the page would then contain the content, like:

---
YAML header (as above)
---
This is the content of the blog post. 

When we extract it from the WordPress database we'll notice that there will be 
<a href="/">anchor tags</a> and image tags and div tags (etc) in our document, 
but there will be no paragraph tags. If you don't like Markdown, you could change my PHP 
script below so that it changes line breaks into <br /> or <p></p> tags.
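
If you did want paragraph tags instead of Markdown, the change mentioned in the sample above could look something like the following. This is only a sketch: paragraphsToHtml is a hypothetical helper that is not part of my script, and it assumes paragraphs are separated by blank lines.

```php
<?php
// Hypothetical helper: wrap blank-line-separated blocks in <p> tags and
// turn single line breaks inside a block into <br />.
function paragraphsToHtml( $content ) {
    $content = str_replace( "\r\n", "\n", trim( $content ) );
    // split wherever one or more blank lines separate two blocks
    $blocks = preg_split( '/\n\s*\n/', $content, -1, PREG_SPLIT_NO_EMPTY );
    $html = array();
    foreach ( $blocks as $block ) {
        $html[] = "<p>" . nl2br( trim( $block ) ) . "</p>";
    }
    return implode( "\n\n", $html );
}

echo paragraphsToHtml( "First paragraph.\n\nSecond paragraph\nwith a line break." );
```

Be aware that this naive version wraps every block, so content that already starts with a div or image tag would end up inside a paragraph as well.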

So now that we have an idea of what we’d like to output from all the content we’ve written in our WordPress install, let’s run a PHP script from our server’s terminal. Yes, this assumes you have that type of access. You could possibly still run it if you only have FTP access; just be sure to edit the settings within the code before you run the file.

Anyway, here’s the script with detailed comments on how you can extract WordPress content and convert it to YAML front-matter with Markdown content:

<?php
// MAIN CONFIG AREA
$website = "http://www.webbyworks.com.au"; // enter your website here, see stripContent function for more details
$username = 'root'; // enter your database's username here
$password = 'password'; // enter your database username's password here
$hostname = 'localhost'; // enter the hostname of your database here
$dbname = 'dbname'; // enter the name of your database here
$output_extension = ".md"; // type of file to output
$out = "/tmp/"; // set where you want the output files to go
$author = "Your Name"; // insert your name as the author for each page and post
// END MAIN CONFIG AREA
/**
 * This function removes the slashes, and any absolute references in your posts & pages
 * It will also change any references to "/wp-content/uploads/..." to "/img/..."
 * So be wary of this when you download and then re-upload to your static site
 *
 * @param string $content
 * @param string $website
 * @return string
 */
function stripContent( $content, $website ) {
    $arr = array( $website, "/wp-content" );
    $str = str_replace( "\r\n", "\n", stripslashes( $content ) );
    $str = str_replace( $arr, "", $str );
    return str_replace( "/uploads", "/img", $str );
}
/**
 * This function takes the comma-separated string of tags for a post or page
 * and converts it into a YAML Front Matter friendly list, one tag per line.
 *
 * @param string $tags
 * @return string
 */
function outputTags( $tags ) {
    $arr = explode( ",", $tags );
    $result = "";
    foreach( $arr as $i ) {
        $result .= " - " . trim( $i ) . "\n";
    }
    return $result;
}
/**
 * This function checks whether you have an excerpt for your post and pages
 * and if not creates one from the opening paragraph of your content.
 *
 * @param string $excerpt
 * @param string $content
 * @return string
 */
function getExcerpt( $excerpt, $content ) {
    if ( strlen( trim( $excerpt ) ) > 0 ) {
        $result = $excerpt;
    } else {
        // no excerpt saved, so use the first line of the content instead
        $end = strpos( $content, "\n" );
        if ( $end === false ) $end = strlen( $content ); // content is a single line
        $result = strip_tags( substr( $content, 0, $end ) );
    }
    $result = str_replace( ":", ".", $result ); // YAML plugin doesn't like colons anywhere in header text
    return trim( $result );
}
// let's connect to the database and select the one we need
// (mysqli is used here because the old mysql_* functions were removed in PHP 7)
$dbhandle = mysqli_connect( $hostname, $username, $password, $dbname ) or die( "Unable to connect to MySQL" );
// set the output to UTF-8, if you need another format enter that here, otherwise leave.
mysqli_set_charset( $dbhandle, "utf8" );
// show that everything's all good.
echo( "Connected to db\n" );
// grab the data from the database
// (wp_term_relationships links to wp_terms through wp_term_taxonomy; this picks up
// every term attached to a post, so add a filter on tt.taxonomy if you only want tags)
$result = mysqli_query( $dbhandle, "SELECT p.ID as id, p.post_date as postdate, p.post_title as title, p.post_excerpt as excerpt, p.post_name as URI, p.post_type as posttype, p.post_content as content, group_concat( t.name separator ', ' ) as tags
from wp_posts p
left outer join wp_term_relationships r on (p.ID = r.object_id)
left outer join wp_term_taxonomy tt on (r.term_taxonomy_id = tt.term_taxonomy_id)
left outer join wp_terms t on (tt.term_id = t.term_id)
group by p.ID" )
or die( mysqli_error( $dbhandle ) );
// loop through the array of rows we now have and output to the file accordingly
while ( $row = mysqli_fetch_array( $result ) ) {
    // check that the row has a URI
    if ( strlen( $row['URI'] ) > 0 ) {
        // check that it's a post or page
        if ( $row['posttype'] == 'post' || $row['posttype'] == 'page' ) {
            // remove absolute links and amend links to /wp-content/uploads/... to /img/...
            $content = stripContent( $row['content'], $website );
            // prepare the file name for output
            $file = $out . $row['URI'] . $output_extension;
            $handle = fopen( $file, "w" ) or die( "could not open " . $file . "\n" );
            // no byte order mark is written: the rows already arrive as UTF-8, and a BOM
            // in front of the opening "---" can stop a parser from seeing the header
            // START YAML front matter
            $output = "---" . "\n";
            $output .= "title: " . $row['title'] . "\n";
            $output .= "author: " . $author . "\n";
            $output .= "date: " . $row['postdate'] . "\n";
            // check if the post or page has tags
            if ( $row['tags'] ) $output .= "tags: \n" . outputTags( $row['tags'] );
            $output .= "excerpt: " . getExcerpt( $row['excerpt'], $row['content'] ) . "\n";
            $output .= "template: " . $row['posttype'] . "\n";
            // As post and pages can have different output we'll need to create different redirects
            // for each. Generally 'posts' follow the permalink structure, whereas pages are just the URI
            if ( $row['posttype'] == 'post' ) $output .= "redirects: \n" . " - /" . $row['id'] . "/" . $row['URI'] . "/index.html" . "\n";
            if ( $row['posttype'] == 'page' ) $output .= "redirects: \n" . " - /" . $row['URI'] . "/index.html" . "\n";
            $output .= "---" . "\n";
            // END YAML header
            // START CONTENT AREA
            $output .= $content;
            // END CONTENT AREA
            // output the result to file and close
            fwrite( $handle, $output );
            fclose( $handle );
        }
    }
}
// Once we're done show it!
echo( "finished!" . "\n" );
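
One thing to watch: getExcerpt works around colons by swapping them for periods, while titles are written out untouched, so a title containing a colon could still trip up the YAML parser. An alternative is to wrap such values in double quotes. The helper below is a hypothetical sketch (yamlQuote is my own name for it, not something in the script above), and I'm assuming your front matter parser accepts standard YAML double-quoted strings.

```php
<?php
// Hypothetical helper: emit a value as a YAML double-quoted scalar so that
// colons, leading dashes and '#' characters cannot be misread by a parser.
function yamlQuote( $value ) {
    // collapse line breaks and tabs so the value stays on one header line
    $value = trim( preg_replace( '/[\r\n\t ]+/', ' ', $value ) );
    // inside YAML double quotes, backslash and double quote must be escaped
    $value = str_replace( array( "\\", "\"" ), array( "\\\\", "\\\"" ), $value );
    return "\"" . $value . "\"";
}

echo "title: " . yamlQuote( 'WordPress: the "easy" way' ) . "\n";
// prints: title: "WordPress: the \"easy\" way"
```

To try it, you would replace $row['title'] in the title line with yamlQuote( $row['title'] ), and could do the same for the excerpt instead of replacing its colons.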