Why Several Database Systems?

It’s no secret that I do most of my daily work in a Python environment. In my previous blog entries, I already gave some insights into the variety of data processing tools it provides. But besides processing, the storage of data is a playing field with large potential for optimization.

We usually use so-called relational databases (most famous candidate: MySQL) in our data analysis tasks. These store data sets in tables with well-defined columns and relations among each other.
[Figure: document-oriented data model]

MongoDB, however, is a document-oriented database system: there are no tables, only collections of documents. The purpose of the former lies mostly in grouping documents and managing indices, while the latter represent the individual data sets as data objects that may contain various levels of nesting.

[Figure: relational data model]

Which database concept is superior depends largely on the intended purpose: in a relational database, you always know the type of a column’s value – or whether the column exists at all – which can be an enormous advantage for the efficient analysis of large quantities of data. The document-oriented database, on the other hand, offers for example:

  1. Each object can be extended by arbitrary attributes.
  2. Hierarchical structures can be mapped directly.
  3. It’s unnecessary to define a structure beforehand.

This implies that it can easily be used ad hoc – you don’t have to define a structure in the database itself, as it is already part of the saved object. Furthermore, many objects in Python are structured hierarchically, which enables conversion in both directions with minimal friction.
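As a small illustration of these points, here is a sketch of a nested Python dictionary (all field names and values are invented for this example) that could be stored directly as a single document, without defining a schema first:

employee = {
    'name': 'Bertram Gilfoyle',
    'role': 'systems architect',
    'skills': ['networking', 'security'],   # arrays are allowed as field values
    'contact': {                            # nested documents map hierarchies directly
        'email': 'gilfoyle@example.com',
        'slack': '@gilfoyle',
    },
}

Adding another attribute later, e.g. a 'projects' field, would not require any change to the database.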

MongoDB therefore has multiple areas of application where relational databases would produce enormous overhead. For some of them this article provides Python recipes, after an introduction to MongoDB’s Python interface:

  • Caching: Especially our webserver’s on-demand reports benefit from intelligent caching that stores the required objects in the database together with an expiry date.
  • Logging: We connect Python’s native logging module to our MongoDB in order to store, search and load arbitrary information.

The database system has a long list of other features that I won’t touch on in this article. This article’s main purpose is rather to help users with some Python experience take their first steps with MongoDB.

The Basics

MongoDB

There are several cloud services providing MongoDB access, but for first steps and many small applications in your intranet, a local installation is easily sufficient. It’s available for all relevant operating systems, and installing it is usually a matter of a couple of minutes.

The current standard configuration does not use authentication, and access is only allowed locally – so you can start using it right away. However, it must be highlighted that it is your responsibility to verify this before storing sensitive data or permanently exposing an interface to your network!
The steps to enforce authentication are best described in the official documentation. As a newbie it might help you to know that users only exist within individual databases, i.e. a user in the database “test” only exists there and can therefore only authenticate there – excluded from this rule are users on the “admin” database, who have global rights. It might also be confusing that MongoDB creates databases on demand, but does not save them permanently if they stay empty.
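Using the PyMongo interface introduced below, creating such a database user could look roughly like this – a sketch only, with placeholder credentials and roles; consult the official documentation for the details of your MongoDB version:

from pymongo import MongoClient

with MongoClient('localhost', 27017) as client:
    db = client.get_database('employees')
    # The createUser command creates a user that exists only in this database.
    db.command('createUser', 'gilfoyle',
               pwd='cat123',
               roles=[{'role': 'readWrite', 'db': 'employees'}])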

PyMongo

The interface for Python is named PyMongo and can be easily installed via pip on systems with a properly configured Python environment:

$ pip install pymongo


Connecting

The base class that manages all operations on the database system is MongoClient. To establish a connection, you only need the hostname (default: “localhost”) and the port (default: 27017):

from pymongo import MongoClient

client = MongoClient(host='localhost', port=27017)


Main Components

The most important method of the client object is get_database. For administration purposes you can, for example, get the names of all databases with database_names, but you need special rights for that, and usually only one database is needed in a single session. MongoClient can also be used as a context manager (a with statement in Python), which automatically closes the connection by calling the close method:

>>> with MongoClient() as client:
>>>     database = client.get_database('employees')
>>>     print(type(database))
<class 'pymongo.database.Database'>

The Database class represents an individual database. If it does not exist, it is simply created as needed. It is the instance against which you have to authenticate – if the server is configured that way:

>>> database.authenticate('gilfoyle', 'cat123')

This is the object that you will use most of the time. It is therefore quite handy to include it in a context manager:

from contextlib import contextmanager
from functools import partial

@contextmanager
def get_mongo_database(host='localhost', port=27017, 
                       database='admin', user=None, password=None):
    with MongoClient(host, port) as client:
        db = client.get_database(database)
        if user and password:
            db.authenticate(user, password)
        yield db

get_default_mongo_database = partial(
    get_mongo_database, database='employees', 
    user='gilfoyle', password='cat123')

Here, the function get_default_mongo_database uses authentication data that obviously needs to be set up beforehand for the respective database. It will be used in many of the following examples for convenience and may of course be implemented in many different ways.

The elements of a database are the collections, which group data sets by topic and serve additional purposes like managing indices. The collection_names method of a database returns a list of all existing collections. Just like databases, they are created on demand and only saved if they don’t stay empty. They are accessed by name, which is possible in multiple ways:

>>> with get_default_mongo_database() as db:
>>>     collection = db.beverages
>>>     print(collection)
>>>     print(db['beverages'] == collection)
Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'employees'), 'beverages')
True

As documents are converted directly to Python dictionaries in query responses, collections are already the lowest-level objects of the pymongo module that are relevant for this article. They are the starting point for all search, delete, and replacement operations on data sets.

Storage

Data sets within MongoDB are based on BSON, which, except for some details (e.g. datetime fields), corresponds to a binary JSON format. Therefore each data set is convertible to a Python dictionary with any level of nesting and unambiguous data types. Some peculiarities apply, however:

  • A key/value pair in a dictionary is called a field.
  • As in JSON, field names must be strings.
  • BSON knows datetime field values, which convert to Python datetime.datetime objects – and vice versa.
  • Field names must contain neither whitespace nor dots, nor start with a “$” sign.
  • Each entry contains a unique “_id” field, which is created automatically if not provided.
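A quick sketch of the datetime handling and the automatic “_id”, reusing the db handle from above but with a throwaway collection (called “bson_demo” only for this illustration) so that the examples below are not affected:

>>> import datetime
>>> demo = db.bson_demo
>>> demo.insert_one({'name': 'tea', 'brewed_at': datetime.datetime(2017, 8, 1, 9, 30)})
>>> stored = demo.find_one({'name': 'tea'})
>>> type(stored['brewed_at'])
<class 'datetime.datetime'>
>>> sorted(stored.keys())
['_id', 'brewed_at', 'name']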

Write

In order to insert documents there is a choice between single execution with insert_one and batch execution with insert_many:

>>> result_one = collection.insert_one({
>>>     'name': 'beer', 
>>>     'properties': {'temperature': 'cold'}})
>>> result_one.acknowledged
True
>>> result_one.inserted_id
ObjectId('5995578535edfc76e2a60499')
>>> result_many = collection.insert_many([
>>>     {'name': 'coffee',
>>>      'properties': {
>>>         'temperature': 'hot'
>>>     }},
>>>     {'name': 'moonshine',
>>>      'properties': {
>>>         'temperature': 'whatever',
>>>         'before_6_pm': False
>>>     }}
>>> ])
>>> result_many.acknowledged
True
>>> result_many.inserted_ids
[ObjectId('5995578535edfc76e2a6049a'), ObjectId('5995578535edfc76e2a6049b')]

The collection object used here is of course the previously created pymongo Collection instance. The insert_x methods return objects containing the results of the operation (as do all delete and replacement methods later on). The value of acknowledged indicates the success of the operation, and inserted_id(s) provides the _ids of the inserted objects within the database, for their unambiguous identification.

Read

Finding and reading documents is the task of the find_one and find (not find_many!) methods. The former returns exactly one document as a dictionary (if it finds one at all), while the latter returns a Cursor object, which allows iterating over the found documents or applying further operations, like counting the matched documents with count() or sorting them with sort(). Otherwise the handling of the two find methods is identical.

>>> collection.find_one()
{'_id': ObjectId('5995578535edfc76e2a60499'), 'properties': {'temperature': 'cold'}, 'name': 'beer'}
>>> collection.find().count()
3
>>> for doc in collection.find():
>>>     print(doc['_id'])
5995578535edfc76e2a60499
5995578535edfc76e2a6049a
5995578535edfc76e2a6049b

So find_one always returns the first matching document. Called without arguments, any document matches. The first and most important argument is the filter, which itself is a JSON-compatible dictionary, just like the second one – the projection. The projection is used to reduce the returned documents to certain fields. It is always a dictionary whose keys are field names and whose values are either True or False, depending on whether the field is desired or not. If a field is set to True, all unmentioned fields are automatically False – except for _id, which has to be set to False explicitly – turning the argument into a whitelist instead of a blacklist. I’ll be using this in the following for demonstration and to improve the readability of the results:

>>> def find_filtered_names(filter={}):
>>>     return [x for x in collection.find(filter=filter, projection={'name': True, '_id': False})]
>>> 
>>> find_filtered_names({'name':'fanta'})
[]
>>> find_filtered_names({'name':'beer'})
[{'name': 'beer'}]
>>> find_filtered_names({'properties.temperature':'whatever'})
[{'name': 'moonshine'}]
>>> find_filtered_names({'properties.temperature': {'$regex': 'old|ever'}})
[{'name': 'beer'}, {'name': 'moonshine'}]
>>> find_filtered_names({'properties.before_6_pm': {'$exists': True}})
[{'name': 'moonshine'}]
>>> find_filtered_names({'$or': [{'properties.before_6_pm': True}, {'properties.before_6_pm': {'$exists': False}}]})
[{'name': 'beer'}, {'name': 'coffee'}]

The first two examples are simple queries by exact field value. From the third example on you’ll see the syntax “FieldA.FieldB” regularly: it is used to step into nested documents and therefore filter by field values in deeper levels of the hierarchy. The fourth example introduces operators, identified by a leading “$” sign: first a string pattern is matched using “$regex”, then the existence of a particular field is tested via “$exists”, and finally “$or” is used to logically combine the check for a field’s existence with a check of its value. There are many more operators for you to explore.
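Since find returns a Cursor, the results can also be ordered and truncated before they are fetched – a short sketch with the documents from above:

>>> import pymongo
>>> cursor = collection.find(projection={'name': True, '_id': False})
>>> for doc in cursor.sort('name', pymongo.DESCENDING).limit(2):
>>>     print(doc)
{'name': 'moonshine'}
{'name': 'coffee'}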

Delete

Extending the pattern of the inserts, there are two methods for deletion: delete_one and delete_many. Instead of an object to insert, they take a filter as argument and delete the first or all matching documents, respectively. Their return value is a result object whose deleted_count property gives the number of deleted entries.
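A minimal sketch with the collection from above – the filter deliberately matches nothing here, so the documents used in the later examples stay intact:

>>> result = collection.delete_many({'name': 'fanta'})
>>> result.acknowledged
True
>>> result.deleted_count
0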

Replace and Update

For replacements there is only the method replace_one, which takes a filter and the replacement document as arguments and replaces the first match with it. Interesting here is the optional upsert argument, which causes the replacement document to be written even if the filter doesn’t match any document.
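A short sketch replacing the coffee document from above with a new version – the additional decaf field is simply invented for this illustration:

>>> result = collection.replace_one(
>>>     {'name': 'coffee'},
>>>     {'name': 'coffee', 'properties': {'temperature': 'hot', 'decaf': False}})
>>> result.matched_count, result.modified_count
(1, 1)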

Updates are more versatile: there are update_one and update_many, both taking a filter as argument and modifying the first or all matches, respectively. The second argument is the update operation: it can set field values directly, or modify them based on the current values, for example by incrementing a number or – in the case of an array – adding or removing elements.

>>> collection.find_one({'name': 'beer'}, {'_id': False})
{'properties': {'temperature': 'cold'}, 'name': 'beer'}
>>> collection.update_one({'name': 'beer'}, {'$set': {'properties.consumed': 1}})
>>> collection.find_one({'name': 'beer'}, {'_id': False})
{'properties': {'consumed': 1, 'temperature': 'cold'}, 'name': 'beer'}
>>> collection.update_one({'name': 'beer'}, {'$inc': {'properties.consumed': 4}})
>>> collection.find_one({'name': 'beer'}, {'_id': False})
{'properties': {'consumed': 5, 'temperature': 'cold'}, 'name': 'beer'}

The optional upsert argument exists here too, in order to guarantee that at least one matching document exists after the operation.

>>> assert collection.find_one({'name': 'fanta'}) is None
>>> collection.update_one({'name': 'fanta'}, {'$inc': {'properties.consumed': 2}}, upsert=True)
>>> collection.find_one({'name': 'fanta'}, {'_id': False})
{'properties': {'consumed': 2}, 'name': 'fanta'}

You can find more detailed information on read and write processes in the documentation of Collections.

Simple Methods

A simple but common scenario is the storing and loading of Python objects, where a possibly existing one should be overwritten, so you can for example store intermediate states of a complex process. For that purpose, one can store the object as a field value in a document of a certain collection and add a field for identification:

import pickle

def to_mongodb(data, collection, key, pickling=False):
    """ write object to default MongoDB database.

    :param data: BSON compatible object
    :param str collection: collection name
    :param str key: key to identify
    :param bool pickling: store data pickled
    :return: None
    """
    with get_default_mongo_database() as db:
        collection = db[collection]
        if pickling:
            data = pickle.dumps(data)
        collection.delete_many({'name': key})
        collection.insert_one({'name': key, 'data': data})

def from_mongodb(collection, key, pickling=False):
    """ retrieve object from MongoDB.

    :param str collection: collection name
    :param key: key to identify
    :param bool pickling: stored data is pickled
    :return: data
    """
    with get_default_mongo_database() as db:
        collection = db[collection]
        data = collection.find_one({'name': key})['data']
    if pickling:
        data = pickle.loads(data)
    return data

This way one only has to choose the name of a collection and a name for the entry, and the object will be written to the “data” field of a document – or loaded from it. This works directly with many Python objects, and for more complex ones (e.g. pandas.DataFrame) the pickling flag enforces a conversion to a byte string.
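A short usage sketch – the collection and key names are arbitrary choices for this example:

>>> to_mongodb({'step': 3, 'status': 'running'}, 'process_states', 'nightly_import')
>>> from_mongodb('process_states', 'nightly_import')
{'step': 3, 'status': 'running'}
>>> import pandas as pd
>>> to_mongodb(pd.DataFrame({'x': [1, 2]}), 'process_states', 'interim_frame', pickling=True)
>>> from_mongodb('process_states', 'interim_frame', pickling=True)
   x
0  1
1  2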

Caching

We can now expand this principle: many of my reports are generated on demand in a Flask webserver – as shown in my last article – which can consume substantial processing power. As they are requested multiple times a day by different people, there is a lot of overlap, e.g. in request parameters or underlying data sets. So why not simply store intermediate states of the data or even full web pages together with a timestamp in a MongoDB, and add a grace period to the requesting method? In total it looks like this:

import pickle
import datetime

def _mongodb_upsert_entry(name, attributes, data, collection):
    """ Update objects in the MongoDB, set a timestamp for caching purposes.

    :param str name: main identifier
    :param dict attributes: attributes for further identification
    :param object data: data to be stored
    :param PyCollection collection: pymongo collection object to operate on
    :return: None
    """
    doc_filter = attributes.copy()
    doc_filter['name'] = name
    new_doc = doc_filter.copy()
    new_doc['timestamp'] = datetime.datetime.now()
    new_doc['data'] = data
    collection.replace_one(doc_filter, new_doc, upsert=True)

def _mongodb_get_entry(name, attributes, collection, grace=None):
    """ Get an entry created by :fcn:`mongodb_upsert_entry` with an optional grace period
    to discard old entries.

    :param str name: main identifier
    :param dict attributes: attributes for further identification
    :param PyCollection collection: pymongo collection object to operate on
    :param datetime.timedelta grace: positive timedelta limiting the age of the existing entry
    :return: stored data
    :rtype: dict
    """
    doc_filter = attributes.copy()
    doc_filter['name'] = name
    doc = collection.find_one(doc_filter, {'timestamp': True})
    if doc is None:
        return None
    if grace is not None and datetime.datetime.now() - doc['timestamp'] > grace:
        return None
    return collection.find_one({'_id': doc['_id']})['data']

def get_cached_object(name, fcn, grace_hours=12, **attributes):
    """ helper function to retrieve or update cached objects.

    :param str name: key for storage in MongoDB
    :param callable fcn: function to retrieve final value
    :param int grace_hours: hours until cached content is invalidated
    :param attributes: keyword arguments passed to fcn, also used for identification in the cache
    """
    with get_default_mongo_database() as db:
        collection = db.website_cache
        obj = None
        data = None if grace_hours <= 0 else \
            _mongodb_get_entry(name, attributes, collection, 
                               grace=datetime.timedelta(hours=grace_hours))
        if data is not None:
            obj = pickle.loads(data)
            print('cache hit')
        else:
            obj = fcn(**attributes)
            data = pickle.dumps(obj)
            _mongodb_upsert_entry(name, attributes, data, collection)
            print('cache miss')
    return obj

The entry point is the get_cached_object function. In your procedure it replaces a function call with certain parameters that would return the desired object (DataFrame, HTML string etc.). The replaced function becomes the fcn argument, and its arguments are simply passed on as keyword arguments. This is completed by a name for identification and a grace period in hours. The collection “website_cache” is hardcoded here. If the grace period is zero or negative, the cache (= MongoDB) is not queried. Otherwise _mongodb_get_entry searches for an entry with the given name and the same arguments, verifies that the timestamp lies within the grace period and – if successful – returns the data object, which is then unpickled. If no data set could be retrieved from the cache, it is created by calling the function and then stored in the database via _mongodb_upsert_entry as a pickle string, along with all necessary information. The usage is demonstrated in the following:

>>> def test_fcn(x, y):
>>>     return '%i * %i = %i' % (x, y, x * y)

>>> get_cached_object('test_function', test_fcn, x=3, y=4)
cache miss
'3 * 4 = 12'
>>> get_cached_object('test_function', test_fcn, x=3, y=4)
cache hit
'3 * 4 = 12'
>>> get_cached_object('test_function', test_fcn, x=4, y=3)
cache miss
'4 * 3 = 12'
>>> get_cached_object('test_function', test_fcn, x=2, y=5)
cache miss
'2 * 5 = 10'
>>> get_cached_object('test_function', test_fcn, x=3, y=4, grace_hours=0)
cache miss
'3 * 4 = 12'

It is worth mentioning here that for rapidly growing collections it can quickly become beneficial to create indices using the create_index method.
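For the cache collection used here, such an index could look roughly like the following sketch – the choice of fields is just a plausible guess for this access pattern:

import pymongo

with get_default_mongo_database() as db:
    # Speed up lookups by entry name and keep recent entries cheap to find.
    db.website_cache.create_index(
        [('name', pymongo.ASCENDING), ('timestamp', pymongo.DESCENDING)])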

Logging

With the logging module, Python provides a very flexible logging interface that you can also connect to a MongoDB. This way, access to the logs is detached from the file system of the logging machine, and the logs can be browsed efficiently without command line voodoo. The core element of such a connection is a logging.Handler, which structures the information and writes it to the MongoDB:

import datetime
import traceback
import logging

class ToMongoHandler(logging.Handler):
    """ A very basic logger that commits a LogRecord to the MongoDb """

    def emit(self, record):
        trace = None
        exc = record.__dict__['exc_info']
        if exc:
            trace = traceback.format_exception(*exc)
        log = {'logger': record.__dict__['name'],
               'level': record.__dict__['levelname'],
               'trace': trace,
               'msg': record.__dict__['msg'],
               'timestamp': datetime.datetime.today()}
        collection_name = 'log_' + log['logger'].replace('.', '__')
        with get_default_mongo_database() as db:
            db.get_collection(collection_name).insert_one(log)

In case of an exception, the handler saves the available stack traceback. As the collection name it chooses “log_” plus the name of the logger – which by Python convention is individual for each module. The replacement of dots in the name is necessary, as they are not allowed in collection names. One way to attach this logger to a module is by adding

if ToMongoHandler not in [type(h) for h in logging.getLogger(__name__).handlers]:
    logging.getLogger(__name__).addHandler(ToMongoHandler())

to its head. One shortcoming here is that __name__, used as the logger name, can also be “__main__” if the module itself is executed, which leads to the probably unwanted collection “log___main__”.
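A minimal usage sketch, assuming a logger named “my_module” so that the entries end up in the collection “log_my_module” queried below:

logger = logging.getLogger('my_module')
logger.addHandler(ToMongoHandler())
logger.error('report generation failed')  # stored as a document in "log_my_module"
logger.info('dropped under the default WARNING level')  # not stored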

Now you just need access to the corresponding database in MongoDB to search for specific log entries:

>>> with get_default_mongo_database() as db:
>>>     for doc in db.log_my_module.find(
>>>         {'level': 'ERROR', 
>>>          'timestamp': {'$gt': datetime.datetime(2017, 4, 1)}}):
>>>         print(doc['timestamp'])
2017-04-04 07:34:49.623000
2017-04-04 07:34:49.706000
2017-04-05 07:33:11.187000
2017-04-05 07:33:11.271000

A simple interface for comfortably browsing the logs may be realized for example in a Jupyter Notebook using ipywidgets.

Conclusion

In larger-scale production systems there are often obvious arguments for and against certain database types, and a quick migration to another type is an improbable scenario. However, it takes only little effort and no powerful computing infrastructure to launch a MongoDB instance capable of supporting many small but useful tasks within a Python environment. Therefore – even in computing environments with otherwise contrary requirements concerning data storage – it might very well pay off to take this peek beyond the horizon. If you have questions, ideas or want to share your experiences, I’m looking forward to your comments!

Matthias Huschle

Matthias Huschle is a real data expert. As a former particle physicist he has been working at PAYMILL on data analyses and finance at a scientific level since 2015. Here he shares some of his experience.