A Python solution to a secure backup of CouchDB via replication

This guest post was written by Reto. He works at Bluevalor, Nelmio’s very first client ever. Reto is an ace at analyzing economic data using MATLAB and Python, but he cannot help himself and sometimes enjoys working on our infrastructure as well.

When we started our project with Nelmio, Pierre proposed using CouchDB as a container for the high-dimensional data items we receive from our third parties. We have been happy with CouchDB so far. One of its nice features is its reliance on sequence IDs, which makes synchronisation between different CouchDB instances very easy. It is even possible to use these sequence IDs to set up synchronisation between, say, an SQL database and CouchDB, since there is a nice API to query the CouchDB server for changes.
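To make the sequence-ID idea a bit more concrete, here is a minimal sketch (Python 3, standard library only) of how a consumer could follow the changes API: it remembers the last sequence number it has seen and asks only for newer changes. The function names, the host and the sample payload are made up for illustration; the payload merely mimics the shape of a `_changes` response with integer sequence numbers.

```python
import json
from urllib.parse import urlencode


def changes_url(server, db, since=0):
    """Build the URL that asks CouchDB for all changes after `since`."""
    query = urlencode({"since": since})
    return "%s/%s/_changes?%s" % (server.rstrip("/"), db, query)


def apply_changes(feed, handle):
    """Call handle(doc_id, deleted) for each change, return the new checkpoint."""
    for row in feed["results"]:
        handle(row["id"], row.get("deleted", False))
    return feed["last_seq"]


# Invented response body, shaped like a _changes reply (assumption, not real data).
SAMPLE = json.loads("""
{"results": [
    {"seq": 4, "id": "item-a", "changes": [{"rev": "2-abc"}]},
    {"seq": 5, "id": "item-b", "changes": [{"rev": "3-def"}], "deleted": true}
 ],
 "last_seq": 5}
""")

seen = []
checkpoint = apply_changes(SAMPLE, lambda doc_id, deleted: seen.append((doc_id, deleted)))

# The next poll only needs to ask for changes after the stored checkpoint.
next_url = changes_url("http://localhost:5984/", "db1", since=checkpoint)
print(next_url)
```

In a real sync job the `handle` callback would write to the other store (for instance an SQL table), and the checkpoint would be persisted between runs.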

A very convenient way to set up a backup of the data is to configure a second CouchDB on another machine and replicate the data onto that machine. There is a feature called “continuous replication”, which seems to imply that you only have to set up the replication once. However, there is quite a big drawback as of CouchDB 1.2: if the server is restarted, the replications will not be re-initiated. Even worse, sometimes replications just break down for no apparent reason.

Update: Setting up the replication via the _replicator database fixes the restart issue.
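For reference, a persistent replication is described by a document stored in the `_replicator` database. The sketch below (Python 3) only builds such a document. The helper name, the document ID scheme and the URLs are placeholders, and actually saving the document (for example with couchdb-python or a plain HTTP PUT) is left out.

```python
import json


def replicator_doc(source_url, target_db, doc_id=None):
    """Return a dict describing a continuous replication for _replicator."""
    return {
        # Hypothetical naming scheme; any unique document ID works.
        "_id": doc_id or "backup_%s" % target_db,
        "source": source_url,
        "target": target_db,
        "continuous": True,
    }


doc = replicator_doc("http://admin:pwd@host:5984/db1", "db1")
print(json.dumps(doc, indent=2, sort_keys=True))
```

Whether replications set up this way are also immune to the random breakdowns is a separate question that this sketch does not answer.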

In short: CouchDB’s “continuous replication” is not reliable enough as a backup system.

I’ve written a small Python script that you can run as a cron job to check whether a replication exists for each database in a list of CouchDB databases. As a little bonus, I added email notification in case something is wrong, so you can sleep well knowing your CouchDB backup is still working. With this script, it should be viable to back up your CouchDB databases via replication. I’ve attached the code after my little fanboy praise of Python.

I’ve studied finance and basically taught myself programming for scientific purposes. I’m trying really hard to write good code, but sometimes I lack experience because I do not have a true programming background. If you’d like to point out things I could do better in terms of form, structure or function, please comment!

I’ve worked extensively with MATLAB so far. Recently, however, I stumbled upon Python as a language for scientific computing and I’m absolutely loving it, so I would like to take the opportunity to praise Python a little:

There are various reasons to use Python for scientific computing:

  1. high-level language (good productivity, easy to learn for people like me)
  2. general-purpose and object-oriented (can interface with everything, bigger projects possible)
  3. beautiful, easy-to-read syntax
  4. ability to interface with low-level languages if speed is the first priority
  5. very rich libraries that support scientific computing needs
  6. open source (the MATLAB commercial license is 10-20k CHF depending on toolboxes)

With the open source packages numpy, scipy, ipython and pandas, Python pretty much trumps every other scientific toolbox (R, MATLAB, Mathematica) while remaining super easy to use.

Especially pandas (an open source library that was developed at a hedge fund – true story!) improves data handling of time series ten-fold. I truly believe that if you need to do research with time series data, Python with pandas is the future.

So if you ever run into a problem where you need to do a lot of data cleaning and wrangling, look at pandas. There is a very good book called Python for Data Analysis, written by Wes McKinney, the main developer of pandas (albeit so far only available as an “early release”).

Now to the code. You will need the couchdb library to make it work, so either install couchdb-python into your Python installation or simply put it into the same folder as the script. I’ve only tested it with Python 2.7. Configure the “CONFIG” part of the script and you should be all set.

import couchdb
import datetime
import smtplib
from email.MIMEMultipart import MIMEMultipart
from email.MIMEText import MIMEText

#-----CONFIG------------------------------------------------------
# source & target addresses
SOURCE = 'http://admin:pwd@host:5984'
TARGET = 'http://admin:pwd@host:5984'
#list of dbs (must have equal length)
SOURCE_DBS = ['db1', 'db2']
TARGET_DBS = SOURCE_DBS
#email credentials
GMAIL_USER = "gmail_user"
GMAIL_PWD = "gmail_pwd"
TO = "your email"
# set to False if no email desired
SEND_MAIL = True
#-----------------------------------------------------------------


class CheckReplicator(object):
    """
    Checks if a replication for a list of dbs on the target exists between two
    CouchDB instances.

    Input a connection string for the source and the target of the
    replication and provide two lists with the names of the databases you want
    to have replicated. (source_dbs[0] will be replicated to target_dbs[0] etc)

    Note that this class prints to console, so if you want to log progress,
    print output to file in cronjob.

    Note that all the dbs need to be created. It is smart to initiate the
    first continuous sync via futon or http api!
    """

    def __init__(self, source, target, source_dbs, target_dbs):
        db_equality = len(source_dbs) == len(target_dbs)
        assert db_equality, "source length must equal target length"

        self.source = couchdb.client.Server(source)
        self.target = couchdb.client.Server(target)
        self.source_string = source
        self.target_string = target

        self.desired_reps = zip(source_dbs, target_dbs)
        self._check_connections()
        self.active_reps = self._get_active_reps_on_target()

    def check(self):

        if self._check_if_all_desired_reps_exist():
            print str(datetime.datetime.now())[:-7] + " ok"
        else:
            self._fix_replications()

    def _check_if_all_desired_reps_exist(self):
        res = True
        for d in self.desired_reps:
            if d not in self.active_reps:
                res = False
        return res

    def _fix_replications(self):
        for d in self.desired_reps:
            if d not in self.active_reps:
                source_str = self._build_source_string(d[0])
                self.target.replicate(source_str, d[1], continuous=True)

        self.active_reps = self._get_active_reps_on_target()
        if not self._check_if_all_desired_reps_exist():
            raise EmailError("could not replicate all targets. Please "
                             "check if the couch instances are running "
                             "and all the dbs are created!", SEND_MAIL)
        else:
            print str(datetime.datetime.now())[:-7] + " replicators created"

    def _build_source_string(self, db):
        string = self.source_string
        if string[-1] == '/':
            string = string + db
        else:
            string = string + '/' + db

        return string

    def _get_active_reps_on_target(self):

        tasks = self.target.tasks()

        #parse source and target of the task string
        #from the replication information
        active_reps = list()
        replications = [t['task'] for t in tasks if t['type'] == 'Replication']
        for r in replications:
            first_split = r.split('/ -> ')
            target = first_split[-1]
            second_split = first_split[-2].split('/')
            source = second_split[-1]
            active_reps.append((source, target))
        return active_reps

    def _check_connections(self):
        try:
            self.source.version()
        except Exception:
            raise EmailError('could not connect to source', SEND_MAIL)

        try:
            self.target.version()
        except Exception:
            raise EmailError('could not connect to target', SEND_MAIL)


class EmailError(Exception):

    def __init__(self, value, send_mail=False):
        self.value = value
        if send_mail:
            self._mail('Watchman Error', value)

    def __str__(self):
        return repr(self.value)

    def _mail(self, subject, text):
        msg = MIMEMultipart()

        msg['From'] = GMAIL_USER
        msg['To'] = TO
        msg['Subject'] = subject

        msg.attach(MIMEText(text))

        mailServer = smtplib.SMTP("smtp.gmail.com", 587)
        mailServer.ehlo()
        mailServer.starttls()
        mailServer.ehlo()
        mailServer.login(GMAIL_USER, GMAIL_PWD)
        mailServer.sendmail(GMAIL_USER, TO, msg.as_string())
        # Should be mailServer.quit(), but that crashes...
        mailServer.close()

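# run it!
if __name__ == '__main__':
    try:
        check_replicator = CheckReplicator(SOURCE, TARGET, SOURCE_DBS, TARGET_DBS)
        check_replicator.check()
    except EmailError:
        # the mail was already sent when the error was created; do not send a second one
        raise
    except Exception:
        raise EmailError('program code failed', SEND_MAIL)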
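
The heart of the script is the comparison between the replications you want and the ones the target reports as running. The stand-alone sketch below (Python 3, no CouchDB needed) isolates that logic so it can be tried on made-up input. The function names and the task strings are invented; the strings simply follow the `source/ -> target` shape that `_get_active_reps_on_target` splits on, and what your server actually reports may differ between CouchDB versions.

```python
def parse_task(task):
    """Turn 'http://host:5984/db1/ -> db1' into the pair ('db1', 'db1')."""
    left, target = task.split("/ -> ")[-2:]
    source = left.split("/")[-1]  # last path segment is the source db name
    return source, target


def missing_replications(desired, tasks):
    """Return the desired (source, target) pairs that no running task covers."""
    active = set(parse_task(t) for t in tasks)
    return [pair for pair in desired if pair not in active]


# Invented sample data: two desired replications, only one of them running.
desired = list(zip(["db1", "db2"], ["db1", "db2"]))
tasks = ["http://admin:pwd@host:5984/db1/ -> db1"]

print(missing_replications(desired, tasks))
```

Note the `list(zip(...))`: in Python 3, `zip` returns a one-shot iterator, so the pairs have to be materialized before they are checked more than once.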
October 1, 2012 by Nelmio in Development

3 Responses to A Python solution to a secure backup of CouchDB via replication

  1. Reto says:

    A little update: it is possible to set up a replication that survives a restart. Its documentation is quite hidden, but see https://gist.github.com/832610

    I do not know if the issue of the replication task getting killed randomly persists with this method…

  2. It’s also in the NEWS/CHANGES in the source, and listed on the wiki at http://wiki.apache.org/couchdb/Replication. You’re right, though: the persistence functionality is not blindingly obvious, and I’ll get that fixed. Did you assume that replication would be persistent by default? I think we will merge the _replicate and _replicator APIs in the next few releases.

    • Reto says:

      Hey Dave: I think it is a very wise decision to merge the _replicate and _replicator APIs, because having both confuses casual users. For example, http://guide.couchdb.org/editions/1/en/replication.html explicitly mentions that replication needs to be reinitiated at server restart, so I was well aware it wasn’t persistent, and it kind of stuck in my head. (And I could swear that some of the sources that document persistent replication weren’t there a couple of months ago.)

      Note, however, that the server restart issue was not my main reason for writing the script; it was the random crashing of the replication processes, which kind of scared me.

      I do not know, though, if these crashes also happen when you set up replication via the _replicator db.
