Matt Andrews
Software engineer making apps – that aren’t apps – and more at the FT. 会说汉语.
Big Data Analytics Tokyo 2017 — day 1

I recently attended Big Data Analytics Tokyo.

Here are my honest and unfiltered notes from the talks I attended on day 1:-

Building Innovation Ecosystems: What Can Tokyo Learn from Cambridge?

Overview - Slides

  • I liked Tim’s assertion that to be real innovation, an idea must be used by society, at large scale.
  • Innovation happens when you bring together Money, Ideas and Talent (MIT).
  • Tim spent a long time emphasising the importance of co-location — people collaborate more with each other and the quality of the collaboration is higher when they work on the same floor. This was a slightly depressing message for me personally as I am working in a team that is split between London and Tokyo.
  • Boston (USA) seems like a good place for all this data stuff.

Behind-the-Scenes Peek of an Analytics Startup

Overview - Slides

Honestly I really struggled with this. The topic wasn’t really relevant to my current projects and I think quite a lot was lost in translation. It was the first talk I’d watched with only simultaneous translation available via a headset (the previous talk was presented in both English and Japanese) and the slides were only in Japanese. I can read Mandarin Chinese reasonably well, so I could understand some of the Japanese Kanji, which I think gave me an advantage over other non-Japanese speakers, but it was still pretty hard.

I think the core messages were:-

  • Analytics firms would be best to target their products at the marketing department first — to help them achieve business results by optimising something.
  • In Japanese firms people performing data or technology leadership roles are substantially less likely to report to the firm’s CEO. Is the implication that analytics has less importance/a lower status?

The New Vanguard for Business Connectivity, Design & The Internet of Things

Overview - Slides

Probably the most enjoyable talk of the two days.

  • I loved the term ‘enchanted objects’ as a category of ‘Internet of Things’.
  • Image recognition is really powerful now. We can literally buy the clothes our friends have been photographed wearing, or buy flights to the locations our friends have taken photos at, and more…
  • Will we have sarcastic dustbins that watch what we throw away, reorder our groceries or warn us about eating too many cookies?

Uncovering Team Performance Dynamics with Data & Analytics in Complex Engineering Projects

Overview - Slides

This was an introduction to some academic research into optimising large scale projects. For example, if a project’s members are split across multiple regions and time zones, what would be the most effective way to split up tasks between the various locations? The assertion was that a project’s costs could be dramatically reduced with the clever use of software:-

We will be able to predict and provide teams with real-time adaptive tools and thinking leading to great performance.

In my personal experience, large creative software engineering projects can be very unpredictable, and even assigning the same task to different engineers within the same team can lead to quite different results, so I felt somewhat sceptical about the conclusions on offer here.

The Dirty Little Secret of Enterprise Data

Overview - Slides

Although the talk was a little bit ‘a word from our sponsors’ there were some useful ideas I took away from it…

  • The secret is that data is silo’d into different systems and owned by different teams.
  • Cleansing and organising data is the biggest challenge for companies, which splits into 3 systems:- matching records, classifying items and mapping columns/attributes.
  • You could use machine learning to merge data sets together.

Big Data Analysis for Cyber Security

Overview - Slides

Key takeaways:

  • Cyber Security is a great use case for Big data and Machine Learning.
  • Recommendation algorithms can be used to detect infected devices (and similar devices).

The Investor’s View of Emerging Data Marketspace in Japan & the US

Overview - Slides

Some common sense suggestions for aspiring data entrepreneurs in Japan:-

  • focus on solutions that bring data ubiquity — making data accessible, shareable, social …
  • learn to dance with elephants — work with big companies, recognise data is very valuable, build trust, …
  • ‘multi-sided markets’ — I think the point here is to deliver value to multiple stakeholders at once (helping the client and helping the customers)

Things to worry about:-

  • Don’t miss the boat
  • But don’t leap before you look
  • And don’t get squashed (by the elephants that you’re dancing with…)

Artificial Intelligence Sparks the Fourth Industrial Revolution

Overview - Slides

Another ‘a word from our sponsors’ talk, and I didn’t really take anything away from it.

Data Science Initiatives at a FinTech Company

Overview - Slides

I really struggled with the language issues here again — the slides were almost all in Japanese. That said, it was fascinating to learn about a highly successful Japanese FinTech startup.

The speaker’s favourite algorithms were Support Vector Machine and State Space Model.

And finally.

  • The venue was breathtaking. Taking place on the 49th floor at the Roppongi Hills Academy, we were greeted by stunning views of Mount Fuji in the morning and glorious sunsets and nightscapes in the evenings.
  • If I ever do a conference talk here I should make sure the slides are bilingual.
  • To get the most out of my time in Japan I really need to learn more Japanese.
  • Many of the western speakers presented in Japanese. I would be curious to learn whether it was better from the Japanese-speaking audience’s perspective to hear talks presented in Japanese as a 2nd language or in native English with simultaneous translation.
  • I am inspired and want to learn more about AI, ML, NLP and Big Data.
Semver as a Service

I’ve been continuing to learn bits and pieces through mini projects… This time: Semver as a Service built with AWS Lambda, AWS API Gateway, AWS CloudFormation and Golang.

What is ‘Semver as a Service’?

https://github.com/matthew-andrews/semver-as-a-service/

Semver as a Service is a simple API that will look at any GitHub repository’s releases/tags, sort them and tell you the highest version or, if you specify a constraint, the highest version that meets it.
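As a rough illustration of the "sort the tags and pick the highest" idea, here is a minimal sketch in Go. It is not the service's actual code: the names parseTag, less and highest are made up for this example, pre-release tags are simply skipped, and the only "constraint" supported is a hypothetical "restrict to one major version" filter.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// version is a simplified semantic version: MAJOR.MINOR.PATCH only.
type version struct {
	major, minor, patch uint64
	tag                 string // the original tag, e.g. "v1.2.3"
}

// parseTag converts "v1.2.3" or "1.2.3" into a version.
// Anything else (including pre-release tags) is reported as not ok.
func parseTag(tag string) (version, bool) {
	parts := strings.Split(strings.TrimPrefix(tag, "v"), ".")
	if len(parts) != 3 {
		return version{}, false
	}
	var nums [3]uint64
	for i, p := range parts {
		n, err := strconv.ParseUint(p, 10, 32)
		if err != nil {
			return version{}, false
		}
		nums[i] = n
	}
	return version{major: nums[0], minor: nums[1], patch: nums[2], tag: tag}, true
}

// less reports whether a is a lower version than b (numeric, not lexical).
func less(a, b version) bool {
	if a.major != b.major {
		return a.major < b.major
	}
	if a.minor != b.minor {
		return a.minor < b.minor
	}
	return a.patch < b.patch
}

// highest returns the highest parseable tag. If major >= 0 only versions
// with that major number are considered (a stand-in for a real constraint).
func highest(tags []string, major int) (string, bool) {
	var best version
	found := false
	for _, t := range tags {
		v, ok := parseTag(t)
		if !ok || (major >= 0 && v.major != uint64(major)) {
			continue
		}
		if !found || less(best, v) {
			best, found = v, true
		}
	}
	return best.tag, found
}

func main() {
	tags := []string{"v1.2.0", "v1.10.1", "v2.0.0", "nightly", "v1.9.3"}
	latest, _ := highest(tags, -1)
	latestV1, _ := highest(tags, 1)
	fmt.Println(latest, latestV1)
}
```

The point of comparing numbers rather than strings is that v1.10.1 must rank above v1.9.3, which a plain string sort would get wrong.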

Try it out here:-

Why?

Well, the main purpose was to learn Go, AWS, etc, but it’s also handy for writing install scripts. For example, this could be a simple script to install the latest version of s3up on your Mac:-

curl -sf https://api.mattandre.ws/semver/github/matthew-andrews/s3up \
| xargs -I '{}' curl -sfL https://github.com/matthew-andrews/s3up/releases/download/{}/s3up_darwin_386 -o /usr/local/bin/s3up \
&& chmod +x /usr/local/bin/s3up
Catching All Errors in AWS Lambda and API Gateway

When building applications with AWS Lambda and API Gateway I’ve found error handling quite difficult to work with.

You first define what status codes your API method is able to serve (200, 404 and 500, for example). You are encouraged to choose 200 as the default. Then you can write regular expressions that match against ‘Lambda Errors’.

According to Amazon’s documentation:-

For Lambda error regex […] type a regular expression to specify which Lambda function error strings (for a Lambda function) […] map to this output mapping.

Note
The error patterns are matched against the errorMessage property in the Lambda response, which is populated by context.fail(errorMessage) in Node.js or by throw new MyException(errorMessage) in Java.
Be aware of the fact that the . pattern will not match any newline (\n).

This seems simple enough.

Lambda functions that have run successfully shouldn’t have errorMessages so I should be able to:-

  1. Set a Lambda Error Regex that looks for .*404 Not Found.* and maps that to 404 errors — this works fine
  2. and then I should be able to map all other errors to 500 with (\n|.)* (note the \n is there because I heeded the warning in the documentation above in case one of my errors has a new line).

Whilst the Lambda Error Regex does indeed now map all errors to 500 responses, unfortunately it also maps all the successful Lambda responses to 500s as well.

Lambda Error Regex

WARNING: THE LAMBDA ERROR REGEX WILL TRY TO MATCH AGAINST SUCCESSFUL RESPONSES FROM LAMBDA FUNCTIONS AS WELL AS FAILED ONES.

So, how do we fix it?

Easy. Although the Lambda Error Regex is also compared against successful Lambda responses, in that case errorMessage seems to be set to something like an empty string, which (\n|.)* matches because * allows zero characters.

Just set the Lambda Error Regex for your ‘catch all’ error response to (\n|.)+, which requires at least one character.
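The difference between the two patterns is easy to check outside API Gateway. The sketch below uses Go's regexp package, which is not the engine API Gateway uses, so treat it as an illustration only. It assumes the pattern has to match the whole errorMessage (which the .*404 Not Found.* pattern above suggests) and that a successful response looks like an empty string; fullMatch is a helper invented for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// fullMatch reports whether pattern matches the whole of s,
// by anchoring it to the start and end of the text.
func fullMatch(pattern, s string) bool {
	re := regexp.MustCompile(`\A(?:` + pattern + `)\z`)
	return re.MatchString(s)
}

func main() {
	messages := []string{
		"", // assumed stand-in for a successful response
		"404 Not Found: no such repository",
		"something broke\nwith a second line",
	}
	for _, p := range []string{`.*`, `(\n|.)*`, `(\n|.)+`} {
		for _, m := range messages {
			fmt.Printf("%-8s full-matches %q: %v\n", p, m, fullMatch(p, m))
		}
	}
}
```

Under those assumptions, (\n|.)* accepts the empty string (so successes get caught too), while (\n|.)+ still accepts every non-empty message, including multi-line ones.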


Thoughts

I’m really surprised that this is so difficult and that none of the documentation encourages (or helps) developers to write Lambda Error Regexes that match against all possible errors.

If I had to write regular expressions against all the errors I anticipated having to handle I would never feel 100% confident that I got them all and would have needlessly risked returning 200 responses containing errors to users.

Uploading static files & websites to Amazon S3 efficiently with s3up

I’ve been using Amazon S3 at work and at home a lot recently and have grown to really like its features. Versioning, lifecycle rules and event streams can be used in really cool ways to make rock solid and super performant websites.

When it comes to actually uploading files to S3 there are plenty of choices for command line tools but they all seemed to do a bit more than I wanted or not quite enough, and I’m learning Go at the moment, so…

Introducing s3up!

https://s3up.mattandre.ws

A new cross platform command line tool for uploading files to S3.

If you’d like to try it out or report bugs, installation instructions and more information are up on GitHub.

Features

  • Optimised for uploading static websites
  • Uploads multiple files concurrently (or can be set to upload one at a time — this can be controlled via the --concurrency option)
  • Only uploads files that are new or have changed
  • Automatically detects and sets an appropriate Content-Type for each file uploaded
  • Allows for easy configuration of ACLs and Cache-Control headers for files
  • Splits large files up and uploads them in smaller pieces
  • Written in Go and compiled for all platforms, which means it is fast, can be installed quickly, and is standalone — it does not rely on other dependencies (like Python or Node)
  • Allows manipulation of the path that files get uploaded to
  • Has a --dry-run so that the changes it will make to objects in S3 can be previewed
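To give a feel for the "only uploads files that are new or have changed" and Content-Type features, here is one possible approach, sketched in Go. It is an assumption, not a description of how s3up actually works: it compares a local MD5 with the ETag S3 reports, which only holds for simple single-part uploads (multipart and some encrypted uploads produce different ETags). The localFile type and the function names are hypothetical.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"mime"
	"path/filepath"
)

// localFile is a hypothetical in-memory stand-in for a file on disk.
type localFile struct {
	Path string
	Body []byte
}

// etag returns the hex MD5 of body, the value S3 uses as the ETag
// for simple single-part uploads (without the surrounding quotes).
func etag(body []byte) string {
	sum := md5.Sum(body)
	return hex.EncodeToString(sum[:])
}

// needsUpload reports whether f is new or differs from the bucket's copy.
// remote maps object keys to ETags, as a bucket listing might provide.
func needsUpload(f localFile, remote map[string]string) bool {
	old, exists := remote[f.Path]
	return !exists || old != etag(f.Body)
}

// contentType guesses a Content-Type from the file extension,
// falling back to a generic binary type when nothing is known.
func contentType(path string) string {
	if t := mime.TypeByExtension(filepath.Ext(path)); t != "" {
		return t
	}
	return "application/octet-stream"
}

func main() {
	remote := map[string]string{"index.html": etag([]byte("<h1>old</h1>"))}
	files := []localFile{
		{"index.html", []byte("<h1>new</h1>")},
		{"about.html", []byte("<h1>about</h1>")},
	}
	for _, f := range files {
		fmt.Println(f.Path, needsUpload(f, remote), contentType(f.Path))
	}
}
```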

Manipulating upload path

When deploying a static website to S3 it’s useful to be able to upload files from a different local directory than the one you’re working in or to a directory other than the root in the S3 bucket.

With s3up, files can be uploaded into subdirectories via the --prefix option, and leading components can be stripped off file names via the --strip option. For example, a generated index.html in a dist folder can be uploaded to the root of an S3 bucket like this: s3up --strip 1 dist/index.html --bucket s3up-test
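The mapping from a local path to an object key can be sketched in a few lines of Go. This is an illustration rather than s3up's real code: destKey is a made-up name, paths are assumed to use forward slashes, and I am assuming the prefix is applied after the leading components are stripped.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// destKey works out the S3 object key for a local file.
// strip removes that many leading path components, then prefix
// (if any) is prepended to what is left.
func destKey(local string, strip int, prefix string) (string, error) {
	parts := strings.Split(path.Clean(local), "/")
	if strip < 0 || strip >= len(parts) {
		return "", fmt.Errorf("cannot strip %d components from %q", strip, local)
	}
	return path.Join(prefix, path.Join(parts[strip:]...)), nil
}

func main() {
	root, _ := destKey("dist/index.html", 1, "")
	nested, _ := destKey("dist/css/site.css", 1, "v2")
	fmt.Println(root, nested)
}
```

Stripping as many components as the path has would leave nothing to upload, so the sketch returns an error in that case instead of producing an empty key.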

I hope you like it and find it useful. Please report bugs if you find them.

Building a half-a-second website

I’ve spent the past couple of weekends rebuilding my website. Previously it was a really old, slow, out-of-date WordPress site running on ridiculously expensive (for what it was) GoDaddy shared hosting. Converting it to a statically generated (Jekyll or similar) site had been on my to-do list for years…

This is it.

Tools and architecture

  • It’s built with Hexo.io (although I swapped out the Sass compilation for one we developed for the Financial Times and removed the client side JavaScript entirely).
  • It’s hosted on S3 (provisioned with CloudFormation).
  • Circle CI runs the builds and pushes to production on green (when linting passes and the pages build).
  • It’s behind a CDN (CloudFlare) who provide SSL for free (thank you CloudFlare <3). They also support HTTP2 and have a nice API that you can use to do some clever cache optimisations with…

Purge on deploy

Currently the CDN in front of https://mattandre.ws is configured to store everything for up to 1 month (and I’m talking to CloudFlare to see if I can increase this to a year) but to instruct users’ browsers to cache pages for only up to 30 minutes. Then, I have set things up to call the CloudFlare API to automatically purge the files that have changed, and only the files that have changed.

Now clearly since Circle CI is already running all my build steps for me and knows what files have changed it could easily coordinate purging of the CDN. Indeed, we use this pattern a lot at the FT. But that was nowhere near over-engineered enough to qualify for a weekend hack project.

Instead, I created a Lambda function that was connected to my website’s S3 bucket’s ObjectRemoved and ObjectCreated streams. Each change in the S3 bucket generates an event that then triggers a Lambda function (written in Go) that purges the CDN for the associated pages. See the code.
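The core of such a function is turning an S3 event into a list of URLs to purge. The sketch below is not the linked code; it is a minimal guess at the shape of it using only the standard library. The purgeURLs name is invented, only the event fields needed here are modelled, and it assumes index.html pages are also served at their directory URL. The resulting list is wrapped in the {"files": [...]} JSON body that CloudFlare's purge-by-URL endpoint accepts; actually sending the request (with a zone ID and credentials) is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strings"
)

// s3Event models only the parts of an S3 event notification used here.
type s3Event struct {
	Records []struct {
		S3 struct {
			Object struct {
				Key string `json:"key"`
			} `json:"object"`
		} `json:"s3"`
	} `json:"Records"`
}

// purgeURLs lists the public URLs whose cached copies are stale after
// the given event. site is the origin, e.g. "https://example.com".
func purgeURLs(event []byte, site string) ([]string, error) {
	var e s3Event
	if err := json.Unmarshal(event, &e); err != nil {
		return nil, err
	}
	var urls []string
	for _, r := range e.Records {
		// Object keys arrive URL-encoded in S3 notifications.
		key, err := url.QueryUnescape(r.S3.Object.Key)
		if err != nil {
			return nil, err
		}
		urls = append(urls, site+"/"+key)
		// Assumption: index.html is also reachable at its directory URL.
		if key == "index.html" || strings.HasSuffix(key, "/index.html") {
			urls = append(urls, site+"/"+strings.TrimSuffix(key, "index.html"))
		}
	}
	return urls, nil
}

func main() {
	event := []byte(`{"Records":[{"eventName":"ObjectCreated:Put",
		"s3":{"object":{"key":"blog/index.html"}}}]}`)
	urls, err := purgeURLs(event, "https://mattandre.ws")
	if err != nil {
		panic(err)
	}
	// Request body for a purge-by-URL call; sending it is not shown.
	body, _ := json.Marshal(map[string][]string{"files": urls})
	fmt.Println(string(body))
}
```

Both ObjectCreated and ObjectRemoved events need a purge, so the sketch does not look at the event name at all.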

Making this change caused the cache hit ratio to jump and even though the website was already fast before making this change, it’s now even faster still. Pages no longer need to travel all the way from Ireland (where my S3 bucket is) to reach every user. It is as if the site had servers in each of the cities around the world where the CDN has a presence.

HTTP2 + S3 + CDN make a very fast website

When you add together HTTP2, S3 and smart use of a CDN you get a very performant website.

The above image shows that, occasionally, pages take almost the same amount of time to load in production (right) as they do on my local machine (left). Production isn’t always this quick (a few very unscientific and statistically invalid spot checks of pages on https://mattandre.ws show that most of the site loads in about half a second, but is sometimes as slow as 800ms) but it does show that a crazy level of performance is possible.

And there’s so much more left to optimise.