Converting a Tumblr blog to a rstblog

Tue 21 June 2011

I have been writing a private (password-protected) blog for family and friends on Tumblr for almost half a year, and it suddenly freaked me out that Tumblr had all my carefully written blog posts. What if they lost it all? What if they went out of business? Shortly before I freaked out I had discovered rstblog, a static blog generator that generates a blog from reStructuredText-formatted text files. Could I somehow get all my posts out of Tumblr, convert them to reStructuredText and generate an rstblog? Turns out I could, though it did involve a fair bit of manual editing and writing a Python script.

Getting my posts out of Tumblr

Tumblr has a backup tool that was released as a beta on Dec 16th, 2009. It has not been updated since and never got bumped to release status. It works just fine though. However, the fact that it has not been updated in 1.5 years seems to suggest that they do not consider it a high priority for users to be able to back up their data. That means you will either have to trust Tumblr or get your data out while you still can.

Tumblr's backup tool does not work with password-protected blogs (nor does their search feature, but that's something entirely different). Hence I had to temporarily turn off the password protection, run the backup and re-enable the password protection. The backup tool writes all blog posts, images and other relevant data into the following directory structure:

$ ls -la
total 32
drwxr-xr-x    9 gkoller  staff   306 May 31 14:15 .
drwx------   15 gkoller  staff   510 Jun 17 22:16 ..
drwxr-xr-x    6 gkoller  staff   204 May 31 12:50 archive.noindex
-rw-r--r--    1 gkoller  staff  5616 May 31 12:50 avatar.png
drwxr-xr-x   17 gkoller  staff   578 May 31 12:50 images
-rw-r--r--    1 gkoller  staff  1138 May 31 12:50 index.html
drwxr-xr-x  216 gkoller  staff  7344 Jun  8 11:41 posts
-rw-r--r--    1 gkoller  staff   498 May 31 12:50 style.css
drwxr-xr-x    3 gkoller  staff   102 May 31 14:15 theme

The posts directory contains all the posts. The images directory contains all the images. Both directories are flat; they have no further hierarchy.

Converting HTML to reStructuredText

Each post in the posts directory is a simple HTML file. For rstblog to be able to generate a complete static blog these posts need to be converted to reStructuredText. Initially I thought I had to write a script to do this conversion for me, but as it turns out, there is a tool that already does this: Pandoc. Converting (most) HTML files to reStructuredText files was as simple as:

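# Let the shell expand the glob directly and quote the file names so that
# names containing spaces survive.
for f in *.html
do
    echo "processing ${f}"
    pandoc -f html -t rst "${f}" -o "${f%html}rst"
done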

Unfortunately Pandoc is not perfect. Converting HTML tables to their reStructuredText equivalents was a no-go; Pandoc simply hung. For those posts with tables I took an intermediate route via Markdown (also supported by Pandoc): HTML -> Markdown -> reStructuredText. The remaining posts generally converted reasonably well. I say reasonably, as posts containing images with alt text produced incorrect reStructuredText markup: the :alt: marker and the corresponding text ended up on separate lines instead of on the same line. This required a fair bit of manual editing to fix.
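
In hindsight, the :alt: fix could probably have been scripted as well. The following is only a sketch, not what I actually ran: it assumes the broken output has a bare :alt: line with the alt text on the next line, and the function name join_alt_lines is made up for this example.

```python
def join_alt_lines(rst_text):
    """Join a bare ':alt:' line with the text line that follows it.

    Assumption: the broken markup looks like
        .. image:: ../images/cat.png
           :alt:
           A cat
    """
    lines = rst_text.splitlines()
    fixed = []
    i = 0
    while i < len(lines):
        line = lines[i]
        has_next = i + 1 < len(lines) and lines[i + 1].strip() != ''
        if line.strip() == ':alt:' and has_next:
            # Keep the indentation of the ':alt:' line, append the text.
            fixed.append(line.rstrip() + ' ' + lines[i + 1].strip())
            i += 2
        else:
            fixed.append(line)
            i += 1
    # Preserve a trailing newline if the input had one.
    return '\n'.join(fixed) + ('\n' if rst_text.endswith('\n') else '')


broken = ".. image:: ../images/cat.png\n   :alt:\n   A cat\n\nSome text\n"
print(join_alt_lines(broken))
```

Alt text that spans more than one line would still need a manual look.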

rstblog Metadata and Directory Structure

rstblog requires more than just a bunch of reStructuredText files. First, each post needs to have some metadata in YAML at the top; this includes, among other things, tags. Furthermore, rstblog requires the posts to be in a specific directory hierarchy where year, month and day are each a directory. Last, all the images needed to be copied to a subdirectory of rstblog's static directory and the references to them in the reStructuredText files adjusted. Doing all of this manually for 100+ posts would have been a major pain. Hence I wrote a small Python 3 script to do exactly that.

It is run by executing:

python converter.py -tumblr_path ~/tmp/tumblr_backup -rstblog_path ~/tmp/rstblog

This assumes that ~/tmp/rstblog contains a rudimentary rstblog setup.
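
To make the target layout concrete, here is a small sketch of the header and directory a single converted post ends up with. The header fields (public, tags, day-order) are the ones the script writes; the helper names rstblog_header and rstblog_post_dir, the tags and the date are made up for illustration.

```python
import os
from datetime import date


def rstblog_header(tags, day_order):
    # YAML-style metadata block placed at the top of each post.
    return ('public: yes\n'
            'tags: [{}]\n'
            'day-order: {:d}\n'
            '\n').format(', '.join(tags), day_order)


def rstblog_post_dir(rstblog_root, post_date):
    # Year, month and day each become one directory level.
    parts = ['{:02d}'.format(n) for n in
             (post_date.year, post_date.month, post_date.day)]
    return os.path.join(rstblog_root, *parts)


print(rstblog_header(['family', 'holiday'], 1), end='')
print(rstblog_post_dir('rstblog', date(2011, 5, 13)))
```

The day-order field keeps several posts from the same day in a stable order.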

The source code for the script (mind the dependency on pytz):

import glob
import sys
import argparse
from xml.etree.ElementTree import fromstring
import locale
import os
import re
from datetime import datetime
import pytz
import shutil

parser = argparse.ArgumentParser(
    description='Convert Tumblr posts backup into rstblog posts.')
parser.add_argument('-tumblr_path', required=True,
                    help='The root directory of the Tumblr backup.')
parser.add_argument('-rstblog_path', required=True,
                    help='The root directory of the rstblog blog.')
args = parser.parse_args(sys.argv[1:])


def verify_path(path, path_argname, path_name, path_sig):
    format_params = {'path': path,
                     'path_argname': path_argname,
                     'path_name': path_name}
    if os.path.isdir(path):
        missing = path_sig - set(os.listdir(path))
        if missing:
            format_params['missing'] = ','.join(missing)
            return ("{path_argname} does not look like a {path_name} "
                    "directory. It's missing '{missing}'").format(
                **format_params)
    else:
        return '{path_argname} does not refer to a directory.'.format(
            **format_params)
    return None

# set of files or directories that are typical for the type of directories
# we're dealing with. This allows us to perform a few sanity checks before we
# start reading and writing files.
rstblog_sig = {'config.yml', 'static', '_templates'}
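tumblr_sig = {'posts', 'index.html', 'archive.noindex'}

# Abort with a message if either directory fails its sanity check.
for error in (
        verify_path(args.tumblr_path, '-tumblr_path', 'Tumblr Backup',
                    tumblr_sig),
        verify_path(args.rstblog_path, '-rstblog_path', 'rstblog',
                    rstblog_sig)):
    if error:
        sys.exit(error)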

# create image dir in rstblog_dir.

def get_local_timezone():
    locale.setlocale(locale.LC_ALL, "")
    # ('nl_NL', 'ISO8859-1') -> NL
    # works even with explicit char encoding present, eg nl_NL.UTF-8
    country = locale.getlocale()[0][3:5]
    timezone_names = pytz.country_timezones[country]
    if len(timezone_names) > 1:
        print(
            "Multiple timezones {} found for country '{}', picking the first one '{}'".format(
                timezone_names, country,
                timezone_names[0]))
    tz = pytz.timezone(timezone_names[0])
    return tz

local_tz = get_local_timezone()

tumblr_img_dir = os.path.join(args.tumblr_path, 'images')
tumblr_posts_glob = os.path.join(args.tumblr_path, 'posts/*.html')
for tumblr_post in sorted(glob.iglob(tumblr_posts_glob)):
    rst_post_src = os.path.splitext(tumblr_post)[0] + '.rst'
    with open(tumblr_post) as tp:
        contents = tp.read()
        tags = []
        # The Tumblr backup stores the tags in XML document contained in an
        # HTML comment. Before we can use an XML parser to extract the tags
        # we need to extract the comment from the HTML.
        m = re.search(r'<!--\s*BEGIN TUMBLR XML\s*(.*)\s*END TUMBLR XML\s*-->',
                      contents, re.DOTALL)
        if m and len(m.groups()) > 0:
            comment = m.group(1)
            post_elem = fromstring(comment)

            # strptime ignores timezones even if %Z is specified. This is
            # because it relies on time.strptime which does not handle
            # timezones. Hence we need to take care of it ourselves. We expect
            # the string dates to be in the following format:
            # 2011-05-13 07:32:00 GMT
            # BTW given the attribute name one could argue the date will
            # always be in GMT. However it is not that much work to actually
            # parse the date including timezone information. They violated the
            # SPOT rule, so there must be a reason for it (eg other timezones
            # possible?)
            date_str, tz_str = post_elem.get('date-gmt').rsplit(' ', 1)
            date = datetime.strptime(date_str, '%Y-%m-%d %H:%M:%S')
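            # pytz timezones should be attached with localize() rather than
            # replace(tzinfo=...), which can pick a wrong offset for zones
            # other than GMT/UTC.
            date = pytz.timezone(tz_str).localize(date)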
            local_date = local_tz.normalize(date.astimezone(local_tz))

            # clean up the tags. Apparently it is possible for Tumblr tags
            # to contain ','. rstblog can't deal with them.
            # An empty <tag/> element has text None, so guard against that.
            tags = [(tag.text or '').strip().replace(',', '') for tag in
                    post_elem.findall('tag') if (tag.text or '').strip() != '']
            print(tags)

            date_components = ['{:02d}'.format(i) for i in
                (local_date.year, local_date.month, local_date.day)]
            rst_post_dir = os.path.join(args.rstblog_path, *date_components)
            # Directories need the execute bit to be traversable, hence 0o755.
            os.makedirs(rst_post_dir, mode=0o755, exist_ok=True)

            day_order_glob = os.path.join(rst_post_dir, '*.rst')
            day_order = len(glob.glob(day_order_glob)) + 1
            shutil.copy(rst_post_src, rst_post_dir)
            rst_post_dst = os.path.join(rst_post_dir,
                                        os.path.basename(rst_post_src))

            p = re.compile(r'\.\./images/(\w+\.\w+)')
            with open(rst_post_dst, 'r+') as rpd:
                lines = rpd.readlines()
                header = []
                header.append('public: yes\n')
                header.append('tags: [{}]\n'.format(', '.join(tags)))
                header.append('day-order: {:d}\n'.format(day_order))
                header.append('\n')
                lines = header + lines[1:]

                for i, line in enumerate(lines):
                    m = p.search(line)
                    if m:
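                        image_fn = m.group(1)
                        # Copy into static/images so that the rewritten
                        # reference below actually resolves.
                        rst_img_dir = os.path.join(args.rstblog_path,
                                                   'static', 'images')
                        os.makedirs(rst_img_dir, exist_ok=True)
                        shutil.copy(os.path.join(tumblr_img_dir, image_fn),
                                    rst_img_dir)
                        lines[i] = p.sub(r'/static/images/\1', line)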

                rpd.seek(0, 0)
                rpd.truncate()
                rpd.writelines(lines)
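
For anyone who wants to try the metadata extraction in isolation, here is a stripped-down sketch of that step. The sample HTML is made up (only the BEGIN/END TUMBLR XML markers, the date-gmt attribute and the tag elements are taken from the script above), and to keep it dependency-free it uses datetime.timezone.utc instead of pytz, which assumes the date is always in GMT.

```python
import re
from datetime import datetime, timezone
from xml.etree.ElementTree import fromstring

# Hypothetical post, reduced to the parts the extraction relies on.
SAMPLE_POST = """<html><body><p>Hello</p>
<!-- BEGIN TUMBLR XML
<post id="1" date-gmt="2011-05-13 07:32:00 GMT">
  <tag>family</tag>
  <tag>trips, abroad</tag>
  <tag> </tag>
</post>
END TUMBLR XML -->
</body></html>"""


def extract_metadata(html):
    """Return (posted, tags) from the XML embedded in an HTML comment."""
    m = re.search(r'<!--\s*BEGIN TUMBLR XML\s*(.*)\s*END TUMBLR XML\s*-->',
                  html, re.DOTALL)
    if not m:
        return None, []
    post_elem = fromstring(m.group(1))
    # Drop the trailing 'GMT' and attach UTC explicitly (assumption: GMT).
    date_str, _tz_name = post_elem.get('date-gmt').rsplit(' ', 1)
    posted = datetime.strptime(date_str, '%Y-%m-%d %H:%M:%S')
    posted = posted.replace(tzinfo=timezone.utc)
    # Strip commas from tags and skip tags that are empty or whitespace.
    tags = [(t.text or '').strip().replace(',', '')
            for t in post_elem.findall('tag') if (t.text or '').strip()]
    return posted, tags


posted, tags = extract_metadata(SAMPLE_POST)
print(posted.isoformat(), tags)
```

Converting to the local timezone, as the full script does, would then be a matter of calling astimezone() on the result.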
