Sunday 22 April 2018

Dynamically Typed Stacks Make Me Nervous

Ten years ago Ted Dziuba wrote Python Makes Me Nervous. I agreed with everything he wrote back then - I suppose I'm what Steve Yegge would call a Software Conservative. Ten years on, the static vs dynamic language debate is no closer to being over, and now what makes *me* really nervous is entire dynamically typed system stacks.

To be more accurate, what I mean by dynamically typed stacks is systems built with dynamically typed languages and composed of schema-less services, end-to-end. Let me explain ...

When I was a young programmer, if you wanted to create a web service you used XML-RPC or SOAP. I liked SOAP (yeah, I said it!): with a well-defined WSDL and some XSD you knew exactly what your client/server was going to send/receive. You generated client code and server-side stub classes with Apache Axis, and you got serialisation, de-serialisation, parsing, validation and error handling all for free.

Now everyone uses REST and JSON. Instead of well-defined XML services, RESTful web services have to try to shoehorn requests into an HTTP GET/POST/PUT/DELETE method along with some path parameters and/or query parameters and/or request/response headers. Serialisation and validation for RESTful web services are often made an implementation concern of the application, with custom serialisation/de-serialisation handlers and bespoke validation code.

I like relational databases (you heard me!). With a well-defined schema you know exactly what data you're going to store and retrieve. Database constraints enforce data correctness and referential integrity, and it all gets managed for free in one place.
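For example, a minimal sketch using SQLite from Python (any relational database makes the same point):

import sqlite3

connection = sqlite3.connect(':memory:')
connection.execute('PRAGMA foreign_keys = ON')

# the schema states exactly what a person is, in one place
connection.execute('''
    CREATE TABLE person (
        id         TEXT PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name  TEXT NOT NULL
    )''')

# referential integrity is declared once and enforced by the database
connection.execute('''
    CREATE TABLE phone_number (
        person_id TEXT NOT NULL REFERENCES person (id),
        number    TEXT NOT NULL
    )''')

# bad data is rejected by the database, not by bespoke application code
try:
    connection.execute("INSERT INTO person VALUES ('1', 'Adrian', NULL)")
except sqlite3.IntegrityError as error:
    print(error)  # NOT NULL constraint failed: person.last_name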

Now we have schema-less NoSQL databases. These types of data stores are supposedly popular because of their horizontal scalability and fault tolerance across network partitions, but in reality they are popular because they can be used as a data dumping ground with no need for data modelling, schema design, normalisation/de-normalisation, transaction handling, index design, query plan analysis, or learning a query language. Data consistency, typing, referential integrity, transactions etc. are all concerns pushed onto the application to implement.

Over the last ten years, knowing fuck all about the data your system operates on until run-time has become trendy.

Enough ranting. Let's look at some code; here's a (contrived) example. Let's say we have an existing Java code base, with a PersonController class for persisting a person's contact details, for use in a contacts list application or something. How do you use this API? Well, the class's method signatures and a good IDE tell you everything you need to know with a minimum of keystrokes.
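Something along these lines (a sketch - SaveException and the Person fields are illustrative):

import java.util.UUID;

class Person {
    public String firstName;
    public String lastName;
}

class SaveException extends Exception {
    public SaveException(String message) {
        super(message);
    }
}

public class PersonController {

    // persists a person's contact details, returning the generated id,
    // or throwing a checked exception if anything goes wrong
    public UUID save(Person person) throws SaveException {
        // implementation details the caller doesn't need to care about
        throw new UnsupportedOperationException("sketch only");
    }
}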

I know I need to pass a Person object to the save method. My IDE will tell me what properties I can set on the Person object. The method throws a checked exception if anything goes wrong, or returns a UUID if the entity is persisted correctly. Awesome, I've got everything I need to use this API in my application; I don't need to care about the implementation details.
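And calling it looks something like this, continuing the sketch above:

import java.util.UUID;

public class Example {

    public static void main(String[] args) {
        Person person = new Person();
        person.firstName = "Adrian";
        person.lastName = "Walker";

        try {
            UUID id = new PersonController().save(person);
            System.out.println(id);
        } catch (SaveException e) {
            // the compiler won't let me ignore the failure case
            e.printStackTrace();
        }
    }
}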

Now let's do the same thing with Python:
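class PersonController(object):

    def save(self, person):
        ...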

The save method takes one argument - that's all I know. I'd better go have a look at the code...

import json
import uuid

import requests


class ControllerSaveException(Exception):
    # raised when the web service rejects the save
    pass


class PersonController(object):
    URL = 'http://%s:%s/person'

    def __init__(self, host='localhost', port=8888):
        self.url = self.URL % (host, port)

    def save(self, person):
        # accept a dict or any object whose __dict__ can be serialised
        data = person if isinstance(person, dict) else person.__dict__
        response = requests.post(self.url, data=json.dumps(data))
        if response.status_code != 201:
            raise ControllerSaveException(response.status_code, response.json()['error'])

        return uuid.UUID(response.json()['id'])

... it makes a REST call. person can be anything that can be serialised to JSON and posted to the /person URL. I'd better go try to find the code for the web service...

import tornado.web


class Application(tornado.web.Application):

    def __init__(self):
        handlers = [
            (r'/person/?', Handler)
        ]
        tornado.web.Application.__init__(self, handlers)

    def listen(self, address='localhost', port=8888, **kwargs):
        super(Application, self).listen(port, address, **kwargs)

... it's a Tornado REST web service; let's go check the handler class...

import json

import tornado.web


class Handler(tornado.web.RequestHandler):

    def __init__(self, application, request, **kwargs):
        super(Handler, self).__init__(application, request, **kwargs)
        self.publisher = Publisher()

    def set_default_headers(self):
        self.set_header('Content-Type', 'application/json')

    def prepare(self):
        # merge the JSON request body into the request arguments
        try:
            self.request.arguments.update(json.loads(self.request.body))
        except ValueError:
            self.send_error(400, message='Error parsing JSON')

    def post(self):
        # hand the raw JSON body to the queue and relay the consumer's reply
        response = json.loads(self.publisher.publish(self.request.body.decode('utf-8')))
        self.set_status(response['status'])
        self.write(json.dumps(response))
        self.flush()

... this tells me nothing about what the person object's JSON representation should contain, and WTF is Publisher for? I'd better go find that code and take a look...

import uuid

import pika


class Publisher(object):

    def __init__(self, host='localhost', queue='person'):

        self.connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
        self.channel = self.connection.channel()
        result = self.channel.queue_declare(exclusive=True)
        self.callback_queue = result.method.queue
        self.channel.basic_consume(self.on_response, no_ack=True, queue=self.callback_queue)
        self.response = None
        self.correlation_id = None
        self.queue = queue

    def on_response(self, channel, method, properties, body):
        # keep the reply only if it correlates with the request we published
        if self.correlation_id == properties.correlation_id:
            self.response = body

    def publish(self, data):

        self.correlation_id = str(uuid.uuid4())
        self.channel.basic_publish(exchange='',
                                   routing_key=self.queue,
                                   properties=pika.BasicProperties(
                                       reply_to=self.callback_queue,
                                       correlation_id=self.correlation_id,
                                   ),
                                   body=data)
        # block, processing events, until the correlated reply arrives
        while self.response is None:
            self.connection.process_data_events()

        return self.response

... FFS, it publishes the JSON to a RabbitMQ message queue. I'd better go find the code for the possible consumers ...

import json
import uuid

import pika

import datastore  # the application's Riak data store wrapper


class Consumer(object):

    def __init__(self, host='localhost', queue='person', bucket='person'):

        self.connection = pika.BlockingConnection(pika.ConnectionParameters(host))
        self.channel = self.connection.channel()
        self.channel.queue_declare(queue=queue)
        self.channel.basic_qos(prefetch_count=1)
        self.channel.basic_consume(self.on_request, queue=queue)
        self.dataStore = datastore.DataStore(bucket)

    def on_request(self, channel, method, properties, body):
        # validate the request, persist it, and reply on the caller's queue

        request = json.loads(body)
        errors = self.validate(request)
        if errors:
            response = {
                'status': 400,
                'error': ', '.join(errors)
            }
        else:
            response = self.save(request)

        self.channel.basic_publish(exchange='',
                                   routing_key=properties.reply_to,
                                   properties=pika.BasicProperties(
                                       correlation_id=properties.correlation_id),
                                   body=json.dumps(response))
        self.channel.basic_ack(delivery_tag=method.delivery_tag)

    def consume(self):
        self.channel.start_consuming()

    def validate(self, request):

        errors = []

        if 'first_name' not in request or not request['first_name']:
            errors.append('Invalid or missing first name')

        if 'last_name' not in request or not request['last_name']:
            errors.append('Invalid or missing last name')

        return errors

    def save(self, request):

        id = str(uuid.uuid4())
        try:
            self.dataStore.save(id, request)
            response = {
                'id': id,
                'status': 201,
            }
        except Exception as e:
            response = {
                'status': 500,
                'error': str(e)
            }

        return response

... some bespoke validation code tells me I have to have first_name and last_name keys in my JSON object. Then the object gets saved to the person bucket in a Riak database. But what else should be in my object? Let's curl an existing record and have a look...

$ curl http://127.0.0.1:10018/riak/person/b8aa0197-89db-4550-9fba-2c0d4b132b67
{"first_name": "Adrian", "last_name": "Walker"}

... and I'm no closer to knowing exactly what should or shouldn't be in a person object.

What a waste of time.

Source Code

Sunday 1 April 2018

Riak - Building a Development Environment From Source

Building a Riak development environment, like anything involving Linux, is needlessly complicated. This method of building from source worked for me on a clean install of Lubuntu 17.10.1:

First, update your package index and install the dependencies and utilities you will need:

$ sudo apt-get update
$ sudo apt-get install build-essential autoconf libncurses5-dev libpam0g-dev openssl libssl-dev fop xsltproc unixodbc-dev git curl

Next, navigate to your home directory, download kerl and use it to build and install the Basho version of Erlang (WHY?!?!). These steps took a while to complete on my machine - bear with it:

$ cd ~
$ curl -O https://raw.githubusercontent.com/kerl/kerl/master/kerl
$ chmod a+x kerl
$ ./kerl build git git://github.com/basho/otp.git OTP_R16B02_basho10 R16B02-basho10
$ ./kerl install R16B02-basho10 ~/erlang/R16B02-basho10
$ . ~/erlang/R16B02-basho10/activate

With Erlang installed, clone the Riak source repository from GitHub and build:

$ git clone https://github.com/basho/riak.git
$ cd riak
$ make rel

Next, create 8 separate copies of Riak to use in a cluster:

$ make devrel

Start 3 (or more) Riak instances:

$ dev/dev1/bin/riak start
$ dev/dev2/bin/riak start
$ dev/dev3/bin/riak start

Then join instances 2 and 3 with instance 1 to form a cluster:

$ dev/dev2/bin/riak-admin cluster join dev1@127.0.0.1
$ dev/dev3/bin/riak-admin cluster join dev1@127.0.0.1

Check and commit the cluster plan:

$ dev/dev3/bin/riak-admin cluster plan
$ dev/dev3/bin/riak-admin cluster commit

Monitor the cluster status until all pending changes are complete:

$ dev/dev3/bin/riak-admin cluster status
---- Cluster Status ----
Ring ready: false

+--------------------+------+-------+-----+-------+
|        node        |status| avail |ring |pending|
+--------------------+------+-------+-----+-------+
| (C) dev1@127.0.0.1 |valid |  up   |100.0|  34.4 |
|     dev2@127.0.0.1 |valid |  up   |  0.0|  32.8 |
|     dev3@127.0.0.1 |valid |  up   |  0.0|  32.8 |
+--------------------+------+-------+-----+-------+

$ dev/dev3/bin/riak-admin cluster status
---- Cluster Status ----
Ring ready: true

+--------------------+------+-------+-----+-------+
|        node        |status| avail |ring |pending|
+--------------------+------+-------+-----+-------+
| (C) dev1@127.0.0.1 |valid |  up   | 34.4|  --   |
|     dev2@127.0.0.1 |valid |  up   | 32.8|  --   |
|     dev3@127.0.0.1 |valid |  up   | 32.8|  --   |
+--------------------+------+-------+-----+-------+

Check the cluster member status:

$ dev/dev3/bin/riak-admin member-status
================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid      34.4%      --      'dev1@127.0.0.1'
valid      32.8%      --      'dev2@127.0.0.1'
valid      32.8%      --      'dev3@127.0.0.1'
-------------------------------------------------------------------------------
Valid:3 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

Congratulations, you have a development Riak cluster. Test the cluster by writing some data to a node:

$ curl -XPUT http://127.0.0.1:10018/riak/test/helloworld -H "Content-type: application/json" --data-binary "Hello World!"

Use a browser to read the data from each node:
http://127.0.0.1:10018/riak/test/helloworld
http://127.0.0.1:10028/riak/test/helloworld
http://127.0.0.1:10038/riak/test/helloworld
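
The same round trip can also be scripted; here's a minimal sketch in Python using requests, assuming the default dev node HTTP ports used above:

import requests

# write a value to the first dev node (HTTP port 10018)
requests.put('http://127.0.0.1:10018/riak/test/helloworld',
             headers={'Content-Type': 'text/plain'},
             data='Hello World!')

# read the value back from each node in the cluster
for port in (10018, 10028, 10038):
    response = requests.get('http://127.0.0.1:%s/riak/test/helloworld' % port)
    print(response.status_code, response.text)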