Even though I left RNZ in August 2016, I still get people asking for website performance tips, and for detail on how I made www.radionz.co.nz load so fast.
Why would you want your site to load quickly? There are many, many reasons, but in a nutshell:
- A fast-loading website is perceived as more credible.
- The user experience is better.
- Engagement is higher.
- People consume more content.
This blog post is intended to unpack the culmination of five years of research, thinking, and actual practice.
In order to make a fast-loading website you need to consider the entire technology stack, from ISP right down to the database, so I’ll look at each of these in turn. There is nothing here that is proprietary; everything I’ll reveal is well known among technology folk, and when combined together and managed, can provide outstanding on-going performance.
On this I should note upfront: to get good performance from your site, and to keep it good, you have to manage the whole stack, and this includes an understanding of each layer, how they interact, and how your daily traffic profiles (including peaks) cause changes in that stack. You also need to understand what the business needs are.
Obviously much of this advice applies to what I learnt and did at RNZ. The important take-away is the overall thought process, and the need to manage performance as ‘a system‘.
ISP and Connectivity
In order to have a fast website you have to have a good internet connection, but what qualifies as good?
Not all ISPs are equal, and this was underlined when, about 8 years ago, I put RNZ’s website connectivity out to tender. The responses were all commercial in confidence, but I can talk about general trends.
There was a wide range of latency and propagation times, with the difference between the worst and best being an order of magnitude (10x). One ISP listed a near-perfect propagation time from Auckland to Wellington (an average of 22 ms), while another would only guarantee it to be never worse than 200 ms. Another had a minimum latency five times worse than the highest of the rest of the group. I am not sure why there was such huge variation, so you should test your connection to verify that it meets the published specs, including the speed.
Peering is when your ISP shares traffic directly with other ISPs, rather than relying on their upstream provider. For example, say you are in Christchurch, and your small local ISP gets their internet connectivity from Large ISP Ltd, a nation-wide company. Another local ISP has customers that want your content, and they also use Large ISP Ltd as their upstream provider. Their customer’s request for a page on your site will go upstream to Large ISP, downstream to your ISP, and to your server (and then back). If Large ISP happens to clear all their traffic through Auckland, and this does happen, this makes for a long round trip as well as extra latency (delays in the transfer of packets).
If the two local ISPs agreed to exchange traffic directly (and therefore locally), it would not have to make the round trip to Auckland, resulting in much faster response and load times for those visiting your site. Both local ISPs would also have lower costs, as that traffic would not have to go via Large ISP Ltd.
In New Zealand many ISPs do peer, the smaller ones prolifically, the larger ones not so much. It makes a big difference if (for example) your ISP sends all your traffic to Auckland before peering it off to other ISPs.
The trick is to find an ISP who peers widely. The more peering connections, the better.
At the time I chose FX Networks (now Vocus), as they had the most peering relationships, and the best bandwidth between these peering points.
Good peering and fast connectivity mean that any requests (and responses) take the shortest path, and that the path often has more bandwidth capacity.
I organised to have two connections – one in Auckland and one in Wellington – for a combination of diversity, and so that I could make content available at both locations.
This meant that consumers near Auckland get their traffic from Auckland, while those in Wellington and further south get it from Wellington.
What this means in practice is that a request for the RNZ website from anywhere in NZ probably takes the shortest path, and has the best response times (on average) of any local website. (It didn’t cost a fortune either.) For the consumer it means that the site is no more than about 22 ms away.
Front End Caching
The next step in the stack is the front end cache. Most large high-performing sites use a caching server of some sort. You should too. This is designed to quickly respond to requests and cushion the application servers from large peaks in traffic. You do not have to cache for long, and the perfect cache time is one that balances publishing cycle, traffic flow, and visitor expectations.
I highly recommend Varnish running in asynchronous delivery mode with a small grace period. This ensures that under normal traffic profiles no one ever has to wait for your application to deliver a page. The exact cache and grace times are critical, and you need a good understanding of your traffic flows and of how frequently the business needs to publish updates to content. Getting these settings wrong could frustrate publishers if new content does not show up quickly enough.
By way of example, the cache time for the RNZ home page was 15 seconds, with 5 seconds of grace. This allowed for fast updates of content, but also provided some cushioning during large events, with the chance that some users would get a page that was about 5 seconds old. Not the end of the world for a news site.
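As a sketch, here is roughly how those home-page numbers could be expressed in Varnish 4+ VCL. (This is a minimal illustration applying one TTL to every page; as noted below, the real control should come from the CMS via response headers.)

```vcl
sub vcl_backend_response {
    # Serve from cache for 15 seconds...
    set beresp.ttl = 15s;
    # ...then keep serving the stale copy for up to 5 more seconds
    # while Varnish fetches a fresh page in the background.
    set beresp.grace = 5s;
}
```

With grace enabled, a request arriving just after the TTL expires still gets an instant (slightly stale) response, rather than waiting on the application.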
It also meant that any flow-on traffic generated by a large event would also be cached and graced during that peak, ensuring good performance and protecting the main server. An example is breaking news. People hit the home page, then click through to the story. Or, more likely, they come in directly to the story via social media, and click on any of the related stories. This cluster of links is quickly cached, keeping things fast under the higher load.
Cache times for the site, and for individual pages should be controlled from the CMS to allow fine-grained control, not set globally in the caching layer. More on that below.
One tip I can give is to avoid falling into the trap of using the front-end cache to hide back-end problems. A problem in your app needs to be fixed; increasing cache times in the misguided belief that you’ve fixed the problem is just going to make the day of reckoning much worse.
Here is an example from my experience. Back in early 2016 some new content was added to the home page. This content required a new database query, and when it was deployed the load on the servers went up: on the application server because of the extra time to process the data, and on the database because of the extra query.
I could have reduced this load by increasing the cache time, but all that would do is reduce how often the slower code ran; it would not make it faster. I fixed it by optimising the front-end code and the database query. A side note: increasing the cache time does not proportionally decrease server load, and the impact diminishes as the cache time is increased.
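The distinction is easy to see with some back-of-envelope arithmetic. This is a simplified model (one cached URL, a fully effective cache), but it shows why a longer TTL changes how often slow code runs, not how slow it is:

```ruby
# With a front-end cache, the backend renders a given page roughly once
# per TTL, no matter how many visitors request it.
def backend_renders_per_hour(ttl_seconds)
  3600 / ttl_seconds
end

backend_renders_per_hour(15)   # => 240 renders per hour
backend_renders_per_hour(60)   # => 60 renders per hour
# Quadrupling the TTL quarters how often the slow code runs,
# but each run is exactly as slow as before.
```

And the returns diminish: going from 15 to 60 seconds saves 180 renders an hour, while going from 60 to 120 saves only 30 more.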
Had I not done this, but instead increased the cache time, then the next time some poorly written code was deployed I could again be tempted just to increase the cache times. What happens over time is that because long cache times are increasingly being used to run poor code less often, the chances increase that more of that code will have to run at the same time. And when it does, it’ll take down the whole system, possibly forcing a reboot. I learnt this lesson the hard way so you do not have to.
Extending the cache time on a page/site does not fix the problem of slow code. It just means that the code in question will run less frequently, and that you now don’t have as much control on when exactly that is. On a high-traffic site, that is a big problem. Fix the underlying code first.
You could throw more hardware at your application, and this is one approach, but that path is never going to win you awards for the speed of your website (or for running a lean server budget). And it’s just plain lazy.
I will talk briefly about Content Delivery Networks (CDNs), something I have used in the past but not in the current design (as at the time I left). The theory of a CDN is that it shares the load and moves content closer to end-users. All true, but it does not always lead to better performance. In the case of RNZ, 80% of the traffic was local, and at the time I designed things the various CDNs did not have diverse enough local connections and peering arrangements. This may have changed over the last few years, although I suspect the current system is still the fastest and most cost-effective.
If you do use a CDN, still set the cache times in the CMS, as recommended above.
The Web Server
nginx. That’s it, pretty much. If you are running a modern web application then chances are it’ll work with nginx. It is fast and efficient, and supports HTTP/2 and HTTPS, both of which you should be using for speed and security.
I did not get around to implementing HTTP/2 and HTTPS at RNZ as it required a proxy in front of Varnish. Varnish now has experimental support for HTTP/2, but still requires a proxy for HTTPS.
There is much that can be done to tune nginx. Some of the best tweaks involve sendfile, workers, and keepalive. Best left to your sysadmin.
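To give a flavour of what that tuning looks like, here is a fragment of an nginx configuration touching on those areas. The values are illustrative only, not RNZ’s production settings; the right numbers depend on your hardware and traffic:

```nginx
worker_processes auto;          # one worker per CPU core

http {
    sendfile on;                # kernel-level file transfer for static assets
    tcp_nopush on;              # send headers and file start in one packet
    keepalive_timeout 30s;      # reuse connections, but don't hold them forever
    keepalive_requests 100;     # requests allowed per keep-alive connection
}
```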
The CMS
Your CMS is a big part of the performance equation. In the case of RNZ we switched from an off-the-shelf PHP-based solution (MySource Matrix) to a solution built on Ruby on Rails. The primary reason was that Rails allowed us to create first-class business objects that aligned with our business processes (I might write a post on this), but also because it allowed fine-grained configuration and control of caching, and it was designed to support agile development processes. Also, MySource Matrix was not really designed for high-traffic sites with frequent updates, even on heavy-grade servers (this may have changed; trust but verify, always).
Of course with all the power and control Rails gives you, code changes have to be managed carefully, so you don’t introduce performance regressions. There is lots of tooling for Rails apps that allows you to check code for performance. I used to run rack-mini-profiler all the time, and would frequently re-write code based on what it revealed. Bullet is another.
The other thing about Rails is the Phusion Passenger Enterprise server extension, which has some major performance benefits over other types of server technology. It is only slightly harder to set up than a PHP-based system, and you do need to keep an eye on the tuning parameters over time, especially if your site is growing in traffic.
As I mentioned above, page cache times should be set in the application. Why? Because this gives you very fine-grained control, and you can set different times for different sections, different pages, and even different times of the day if you want.
In Rails this can be set via a method in the application’s main controller, and any other controller in the system can call it. On the RNZ site I used to set news pages to have short cache times, programmes pages to have a slightly longer cache time, and corporate information pages to have a longer cache time. (Fixed asset files like CSS, JS and images get a 1 year cache time, set via the webserver, and these files are fingerprinted automatically by Rails during the deployment process.)
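A minimal sketch of the idea in plain Ruby. The method name and times here are illustrative, not RNZ’s actual code; in a Rails controller you would typically use `expires_in`, which sets the same `Cache-Control` header, and Varnish by default derives its TTL from that header’s `max-age`/`s-maxage`:

```ruby
# Map each section of the site to a cache lifetime in seconds.
CACHE_TIMES = {
  news:      15,    # short, so breaking news appears quickly
  programme: 300,
  corporate: 3600,
}.freeze

# Build the Cache-Control header value for a given section.
def cache_control_for(section)
  "public, max-age=#{CACHE_TIMES.fetch(section)}"
end

cache_control_for(:news)       # => "public, max-age=15"
cache_control_for(:corporate)  # => "public, max-age=3600"
```

Because the lookup lives in application code, changing a section’s cache time is an ordinary, testable code change rather than an edit to the caching layer.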
The overall aim is to give the appearance that content is always up-to-date, and the feeling that content updates appear quickly, while not having to provision a system that actually does that. (It would cost a lot more.)
Recently someone asked me if I would use WordPress or Drupal for a media site. I would not consider either, due to the level of customisation required.
WordPress and Drupal are often recommended because service providers can quickly deliver a working site. But once you have deep customisations, it is very hard to update the core of the system without massive rewrites of those customisations. And the cost of adding the missing features is going to be at least the same as (say) Rails, plus you have the overhead of upgrades.
Rails has less functionality out of the box, and while it can be difficult updating to the latest version, doing so does not generally cause massive breakages of your current code. The other thing is that Rails has a full test framework built in, so it is much easier to test what you’ve done after an upgrade.
There are many things WordPress and Drupal are fantastic for, and many things I would use them for. I would just say, go in with your eyes wide open to ensure that the chosen system is operating in its ‘sweet spot’ for your use-case, and will continue to allow innovation as your business grows.
The Database
At the scale most websites run, I don’t think the actual database type makes much difference. Postgres (which I use for new projects), MySQL — pick what works with your CMS, and what you and your team are most familiar with.
At RNZ I used the Percona fork of MySQL in production, and this replaced an earlier Oracle (and older MySQL) version. Percona was chosen because its additional toolkit makes for much easier management and tuning.
The big performance tweak for MySQL is to try to have your entire database in memory. The other is to keep an eye on the slow-query log, and get your database administrator to come up with optimisations for any slow queries.
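In my.cnf terms, the relevant knobs look something like this. The sizes and paths are illustrative and need to be matched to your data set and available RAM:

```ini
[mysqld]
# Size the buffer pool so the working set (ideally the whole DB) fits in RAM.
innodb_buffer_pool_size = 8G

# Log queries slower than one second for the DBA to optimise.
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time     = 1
```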
The last thing to watch – and it’s often missed – is that you add the appropriate indexes (on associations, for example) when you add or make changes to the database schema. Rails makes doing this easy.
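For example, a hypothetical migration in a Rails app (the model and column names are made up for illustration) can create the index alongside the new column:

```ruby
# Adding a foreign-key column should almost always come with an index,
# or queries on the association will scan the whole table.
class AddAuthorToStories < ActiveRecord::Migration[5.2]
  def change
    # `index: true` creates the index in the same migration.
    add_reference :stories, :author, index: true
  end
end
```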
HTML, CSS & Images
Good HTML and CSS does have an impact on page rendering time, once the page arrives at the browser. This is particularly true for mobile, where only a small portion of the page is showing. At RNZ the site was designed to be mobile-friendly, but with no other special optimisations. Because of the speed the pages arrive (based on the work outlined above) the impact of other tweaks is pretty minimal.
Browser performance is significantly better than it was even 12 months ago, and I think that while optimising CSS and DOM performance is important, most sites are not going to be able to justify the costs for the relatively small returns.
You can get more bang for your buck by ensuring that image sizes and quality are optimised, and using the <picture> tag or some other technique to ensure the best image is served for each device.
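For example, a `<picture>` element lets the browser choose the most appropriate image for the device (the file names and breakpoint here are illustrative):

```html
<picture>
  <!-- Smaller crop for narrow screens. -->
  <source srcset="story-hero-narrow.jpg" media="(max-width: 600px)">
  <!-- Full-width image for everything else; also the fallback
       for browsers that don't support <picture>. -->
  <img src="story-hero.jpg" alt="Description of the story image">
</picture>
```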
If you want to have (and keep) a high-performing site you need an understanding of the technology, how the internet works, and your traffic, and you need to be able to balance these against the needs of your business on an on-going basis.
If you don’t know how, you should outsource this to someone who does, and whom you can trust to strike the right balance in consultation with you.