Sunday, 1 April 2018

Riak - Building a Development Environment From Source

Building a Riak development environment, like anything involving Linux, is needlessly complicated. This method of building from source worked for me on a clean install of Lubuntu 17.10.1:

First, update your package index and install the dependencies and utilities you will need:

$ sudo apt-get update
$ sudo apt-get install build-essential autoconf libncurses5-dev libpam0g-dev openssl libssl-dev fop xsltproc unixodbc-dev git curl

Next, navigate to your home directory, download kerl, and use it to build and install the Basho fork of Erlang (WHY?!?!). These steps took a while on my machine, so bear with it:

$ cd ~
$ curl -O https://raw.githubusercontent.com/kerl/kerl/master/kerl
$ chmod a+x kerl
$ ./kerl build git git://github.com/basho/otp.git OTP_R16B02_basho10 R16B02-basho10
$ ./kerl install R16B02-basho10 ~/erlang/R16B02-basho10
$ . ~/erlang/R16B02-basho10/activate
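If you want to sanity-check the activation, kerl can report which installation is currently active:

$ ./kerl active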

With Erlang installed, clone the Riak source repository from GitHub and build:

$ git clone https://github.com/basho/riak.git
$ cd riak
$ make rel
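make rel builds a single, self-contained release; if one node is all you need, it can be started directly (assuming the standard release layout under rel/riak):

$ rel/riak/bin/riak start

For a development cluster, though, keep going.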

Next, create 8 separate copies of Riak to use in a cluster:

$ make devrel

Start 3 (or more) Riak instances:

$ dev/dev1/bin/riak start
$ dev/dev2/bin/riak start
$ dev/dev3/bin/riak start
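Each node should answer a ping once it has started:

$ dev/dev1/bin/riak ping
pong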

Then join instances 2 and 3 with instance 1 to form a cluster:

$ dev/dev2/bin/riak-admin cluster join dev1@127.0.0.1
$ dev/dev3/bin/riak-admin cluster join dev1@127.0.0.1

Check and commit the cluster plan:

$ dev/dev3/bin/riak-admin cluster plan
$ dev/dev3/bin/riak-admin cluster commit

Monitor the cluster status until all pending changes are complete:

$ dev/dev3/bin/riak-admin cluster status
---- Cluster Status ----
Ring ready: false

+--------------------+------+-------+-----+-------+
|        node        |status| avail |ring |pending|
+--------------------+------+-------+-----+-------+
| (C) dev1@127.0.0.1 |valid |  up   |100.0|  34.4 |
|     dev2@127.0.0.1 |valid |  up   |  0.0|  32.8 |
|     dev3@127.0.0.1 |valid |  up   |  0.0|  32.8 |
+--------------------+------+-------+-----+-------+

$ dev/dev3/bin/riak-admin cluster status
---- Cluster Status ----
Ring ready: true

+--------------------+------+-------+-----+-------+
|        node        |status| avail |ring |pending|
+--------------------+------+-------+-----+-------+
| (C) dev1@127.0.0.1 |valid |  up   | 34.4|  --   |
|     dev2@127.0.0.1 |valid |  up   | 32.8|  --   |
|     dev3@127.0.0.1 |valid |  up   | 32.8|  --   |
+--------------------+------+-------+-----+-------+

Check the cluster member status:

$ dev/dev3/bin/riak-admin member-status
================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid      34.4%      --      'dev1@127.0.0.1'
valid      32.8%      --      'dev2@127.0.0.1'
valid      32.8%      --      'dev3@127.0.0.1'
-------------------------------------------------------------------------------
Valid:3 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

Congratulations, you have a development Riak cluster. Test the cluster by writing some data to a node:

$ curl -XPUT http://127.0.0.1:10018/riak/test/helloworld -H "Content-type: text/plain" --data-binary "Hello World!"

Use a browser to read the data from each node:
http://127.0.0.1:10018/riak/test/helloworld
http://127.0.0.1:10028/riak/test/helloworld
http://127.0.0.1:10038/riak/test/helloworld
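Or, to script the round-trip check, here's a minimal sketch in Python using requests (the ports are the devrel HTTP ports from the URLs above):

import requests

# HTTP ports for dev1, dev2 and dev3 (the devrel defaults used above).
for port in (10018, 10028, 10038):
    url = "http://127.0.0.1:%d/riak/test/helloworld" % port
    response = requests.get(url)
    print("%d: %s %s" % (port, response.status_code, response.text))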

Thursday, 8 February 2018

Tell 'em Steve-Dave!

SoundCloud's web interface is rubbish for downloading podcasts, but their API is pretty good, so here's a handy Python script for downloading all of your favourite Tell 'em Steve-Dave! episodes:

import os.path

import requests

# SoundCloud API endpoints; tracks are listed a page at a time
# using the offset/limit query parameters.
API_URL = "http://api.soundcloud.com"
TRACKS_URL = API_URL + "/users/%(USER_ID)s/tracks" \
                       "?client_id=%(CLIENT_ID)s" \
                       "&offset=%(OFFSET)s" \
                       "&limit=%(LIMIT)s" \
                       "&format=json"
DOWNLOAD_URL = "https://api.soundcloud.com/tracks/%(TRACK_ID)s/download?client_id=%(CLIENT_ID)s"
CHUNK_SIZE = 16 * 1024
TESD_USER_ID = "79299245"
CLIENT_ID = "3b6b877942303cb49ff687b6facb0270"
LIMIT = 10
offset = 0

while True:

    # Fetch the next page of track metadata.
    url = TRACKS_URL % {
        "USER_ID": TESD_USER_ID,
        "CLIENT_ID": CLIENT_ID,
        "LIMIT": LIMIT,
        "OFFSET": offset
    }

    tracks = requests.get(url).json()

    # An empty page means every track has been seen.
    if not tracks:
        break

    tracks = [(track["id"], track["title"]) for track in tracks]

    for (track_id, title) in tracks:

        url = DOWNLOAD_URL % {"TRACK_ID": track_id, "CLIENT_ID": CLIENT_ID}

        filename = "%s.mp3" % title

        # Skip episodes that have already been downloaded.
        if os.path.exists(filename):
            continue

        print("downloading: %s" % filename)

        # Stream the episode to disk in chunks rather than holding
        # the whole file in memory.
        response = requests.get(url, stream=True)

        with open(filename, 'wb') as fd:
            for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
                fd.write(chunk)

    offset = offset + LIMIT
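Save it as, say, tesd.py and run python tesd.py from the directory you want the episodes to land in; already-downloaded episodes are skipped, so you can re-run it as new ones appear. One caveat: filenames come straight from the track titles, so a title containing a path separator would need sanitising first.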

4 colors 4 life

Saturday, 27 January 2018

Overengineering Shit

I’ve had enough of Flickr, for all the standard reasons.

So I set out to build a scalable, secure, distributed image-sharing platform of my own, using open-source components and tried-and-tested tech, with no bullshit.

It would be great, I thought: I could start small, just hosting my own photos; then I could open it up to friends and family, working out the bugs as I went, seamlessly scaling up the hardware as required. Then, who knows, I could open it up to the internet! It was going to be awesome!

I wanted nothing fancy for the implementation - only mature, battle-tested tech that wasn't going to become unsupported any time soon. And only the right tools for the job:
  1. Bulk uploading files over HTTP is bollocks; transferring files is a solved problem; the system should use FTP.
  2. No custom user database, no hand-rolled permissions pseudo-framework, no re-inventing the wheel; authentication and authorisation should be handled by LDAP.
  3. Image storage should be implemented on a distributed, scalable filesystem, so I can just add more nodes when disk starts running low.
  4. Image processing, such as thumbnail generation, should be asynchronous, with jobs taken from a scalable message queue, so I can add more message processors when I need to.
  5. Simple REST webservices should be used by a client to fetch images and image metadata from the server.
  6. And finally, the web UI should be simple, responsive and free of JavaScript framework bloat.

My chosen implementations to satisfy the above included Apache FTP Server for uploads, Apache Directory Server for LDAP, Apache Cassandra for image storage, Apache Kafka for the image processing queue, and Apache Tomcat behind the Apache Webserver for the REST services.

Also in the mix: imgscalr-lib for preview generation, Apache Avro and Apache commons-lang for serialization, Apache Tika for image format detection, SLF4J and Logback for logging, and finally Ansible for deployment.

The components logically hang together something like this: uploads come in over FTP and land in Cassandra; upload events go onto Kafka, where consumer processes generate the previews; and Tomcat, fronted by the Apache Webserver, serves the REST API to the web UI, with authentication and authorisation against the Directory Server throughout.

The code was looking good enough to get something up and running - it was time to investigate some hosting costs. I figured I would need a big-ish box to put Apache Cassandra and Apache Directory Server on, a small-ish box for Apache Tomcat and the Apache Webserver, another small-ish box for Apache FTP Server, and a medium-sized box for Apache Kafka and the consumer processes.

Time to check out some recommended production hardware requirements for Cassandra:

a minimal production server requires at least 2 cores, and at least 8GB of RAM. Typical production servers have 8 or more cores and at least 32GB of RAM

And for Kafka:

A machine with 64 GB of RAM is a decent choice, but 32 GB machines are not uncommon. Less than 32 GB tends to be counterproductive (you end up needing many, many small machines).

Are you fucking kidding me? When did minimum requirements for a database and a queue become 32GB of fucking RAM each?!

At DigitalOcean's current prices, an 8GB droplet is $40 a month and a 32GB droplet is $160 a month, with the smaller droplets running anywhere between $5 and $20 a month:

Memory   vCPUs     SSD Disk   Transfer   Price
1 GB     1 vCPU     25 GB      1 TB      $5/mo    ($0.007/hr)
2 GB     1 vCPU     50 GB      2 TB      $10/mo   ($0.015/hr)
4 GB     2 vCPUs    80 GB      4 TB      $20/mo   ($0.030/hr)
8 GB     4 vCPUs   160 GB      5 TB      $40/mo   ($0.060/hr)
16 GB    6 vCPUs   320 GB      6 TB      $80/mo   ($0.119/hr)
32 GB    8 vCPUs   640 GB      7 TB      $160/mo  ($0.238/hr)
...      ...       ...         ...       ...

I guess writing and running scalable systems requires a scalable bank balance - a 32GB droplet for Kafka, an 8GB droplet for Cassandra and the directory server, and a couple of small boxes on top already comes to around $220 a month, and mine does not scale to over $200 a month just to upload some photos. Time to scrap this idea.

What I’ve ended up with is a $10-a-month DigitalOcean droplet running nginx and SFTP, plus a half-arsed Python script that generates some static HTML and uses ImageMagick to create thumbnails - and you know what? It’s absolutely perfect for me.

Time to reflect on some almost-ten-year-old, but still relevant, wisdom from Ted Dziuba: I'm Going To Scale My Foot Up Your Ass

Python script to generate HTML and thumbnails:

import os
from subprocess import call

CONVERT_CMD = "convert"
PREVIEW_SIZE = "150x150"
PREVIEW_PREFIX = "preview_"
HTML_EXTENSION = ".html"
IMG_EXTENSION = ".jpg"
INDEX = "index.html"
ALBUM_IMG = "/album.png"

LIST_TEMPLATE = """
<!DOCTYPE html>
<html>
  <head>
    <title>taffnaid.photos</title>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <link rel="stylesheet" type="text/css" href="/taffnaidphotos.css">
  </head>
  <body>
    <div id="list" class="list">
      <div id="list-nav" class="nav">
        {0}
      </div>
      <div id="list-previews" class="previews">
        {1}
      </div>    
    </div>
  </body>
</html>
"""

LIST_NAV_TEMPLATE = """
<a href="{0}" class="parent">☷</a>
"""

PREVIEW_TEMPLATE = """
<div class="preview">
  <a href="{0}">
    <img src="{1}" alt=":-("/>
    <div class="name">{2}</div>
  </a>
</div>
"""

VIEW_TEMPLATE = """
<!DOCTYPE html>
<html>
  <head>
    <title>taffnaid.photos</title>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <link rel="stylesheet" type="text/css" href="/taffnaidphotos.css">
  </head>
  <body>
    <div id="view" class="view">
      <div id="view-nav" class="nav">
        {0}
      </div>
      <div id="view-image" class="image">
        <img src="{1}" alt=":-("/>
        <link rel="prefetch" href="{2}">
        <link rel="prefetch" href="{3}">
      </div>
    </div>
  </body>
</html>
"""

VIEW_NAV_TEMPLATE = """
<a href="{0}" class="previous">⟨</a>
<a href="{1}" class="parent">☷</a>
<a href="{2}" class="next">⟩</a>
"""

cwd = os.getcwd()

# Walk the directory tree, generating an index page for every
# directory and a view page plus thumbnail for every image.
for root, dirs, files in os.walk(cwd):

    dirs = sorted(dirs, reverse=True)

    # Only process JPEGs, skipping any previously generated previews.
    files = [f for f in files if f.lower().endswith(IMG_EXTENSION)]
    files = [f for f in files if not f.startswith(PREVIEW_PREFIX)]
    files = sorted(files)

    preview_html = ""

    # Prepend the sub-album links so they sit ahead of the image
    # previews (the reverse sort plus prepending nets out to
    # ascending order).
    for child in dirs:
        preview_html = PREVIEW_TEMPLATE.format(
            os.path.join(child, INDEX),
            ALBUM_IMG,
            child) + preview_html

    for i, name in enumerate(files):
        # Previous/next links wrap around, so navigation never
        # dead-ends at either side of an album.
        previous = files[i - 1]
        parent = os.path.join(root.replace(cwd, ""), INDEX)
        following = files[(i + 1) % len(files)]

        nav_html = VIEW_NAV_TEMPLATE.format(
            previous + HTML_EXTENSION,
            parent,
            following + HTML_EXTENSION
        )

        preview_html = preview_html + PREVIEW_TEMPLATE.format(
            name + HTML_EXTENSION,
            PREVIEW_PREFIX + name,
            name)

        view_html = VIEW_TEMPLATE.format(
            nav_html,
            name,
            previous,
            following
        )

        image = os.path.abspath(os.path.join(root, name))
        preview = os.path.abspath(os.path.join(os.path.dirname(image), PREVIEW_PREFIX + os.path.basename(image)))

        # Generate the thumbnail with ImageMagick, but only if it
        # doesn't already exist - this keeps re-runs cheap.
        if not os.path.exists(preview):

            cmd = [
                CONVERT_CMD,
                "-define", "jpeg:size=%s" % PREVIEW_SIZE,
                image,
                "-thumbnail", "%s^" % PREVIEW_SIZE,
                "-gravity", "center",
                "-extent", PREVIEW_SIZE,
                preview]

            call(cmd)

        view_path = image + HTML_EXTENSION

        with open(view_path, 'w') as view_file:
            view_file.write(view_html)

    # Finally, write the index page for this directory.
    parent = os.path.join(os.path.dirname(root.replace(cwd, "")), INDEX)
    nav_html = LIST_NAV_TEMPLATE.format(parent)

    list_html = LIST_TEMPLATE.format(
        nav_html,
        preview_html)
    index_path = os.path.join(root, INDEX)

    with open(index_path, 'w') as index_file:
        index_file.write(list_html)
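Save the script as, say, generate.py and run it from the web root after each upload:

$ cd /var/www/html && python generate.py

The HTML is regenerated on every run, but thumbnails are only created where they are missing, so re-runs are cheap.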

nginx config to resize and cache images:

# Cache for resized images, so each image only gets resized once
# per cache period.
proxy_cache_path /var/www/html/cache levels=1:2 keys_zone=resized;

server {
        listen 80 default_server;
        listen [::]:80 default_server;

        root /var/www/html;

        index index.html;

        server_name _;

        location / {
                try_files $uri $uri/ =404;
        }

        # JPEG requests are proxied to the internal resizing server
        # below; successful responses are cached for 10 days.
        location ~ ^/.*(\.jpg|\.JPG)$ {
                proxy_cache resized;
                proxy_cache_valid 200 10d;
                proxy_pass http://127.0.0.1:9001;
        }
}

# Internal-only server that resizes images on the fly.
server {
        listen 9001;
        allow 127.0.0.1;
        deny all;

        root /var/www/html;

        location ~ ^/.*(\.jpg|\.JPG)$ {
                image_filter_buffer 10M;
                image_filter resize 1920 1080;
        }
}

Source Code