Ticket #588 (closed maintenance: fixed)
RSS feed caching
Reported by: | chris | Owned by: | chris |
---|---|---|---|
Priority: | minor | Milestone: | Maintenance |
Component: | Live server | Keywords: | |
Cc: | ed, jim | Estimated Number of Hours: | 0.0 |
Add Hours to Ticket: | 0 | Billable?: | yes |
Total Hours: | 2.69 |
Description
At the meeting on 5th September ticket:585 one thing we discussed was that:
The biggest single source of bandwidth by URL is Rob's RSS feed, 34.5GB of data, 150k hits. This represents half the total site data transfer, the RSS files is 0.4MB but will hopefully be mostly served gzipped at 0.1MB.
This ticket is to follow up on that -- are the RSS feeds served directly by Nginx?
Change History
comment:1 Changed 3 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.35
- Total Hours changed from 0.0 to 0.35
comment:2 Changed 3 years ago by jim
Hmm... odd. This could be deliberate - looking at issue BOA/Drupal issue queues. More in a mo...
comment:3 follow-up: ↓ 4 Changed 3 years ago by jim
- Add Hours to Ticket changed from 0.0 to 0.15
- Total Hours changed from 0.35 to 0.5
See https://drupal.org/node/1894804
So it appears SSL requests are cached by Drupal (in Redis), but only for 1s in the speedcache to avoid DoS attacks.
We should consider putting a rule in /data/conf/override.global.inc to force RSS requests to HTTP?
comment:4 in reply to: ↑ 3 Changed 3 years ago by chris
Replying to jim:
Hmm, we don't do any file based caching?
We should consider putting a rule in /data/conf/override.global.inc to force RSS requests to HTTP?
We shouldn't force any requests from HTTPS to HTTP as that would cause things to be unaccessible to users of HTTPS Everywhere, see ticket:574
comment:5 Changed 3 years ago by jim
The handful of HTTPS Everywhere, who then also try to read an RSS feed from the browser is a vanishingly small % of the users of the site. And this RSS feed accounts for a LOT of traffic.
It's about balancing needs and wants, priorities with ideals. My vote is to make any HTTPS *.rss feed request do a 301 Moved Permanently to its HTTP counterpart. If that breaks things for 2-5 people a month then that's a cost we can easily bear.
Ed's decision ultimately...
comment:6 Changed 3 years ago by jim
- Add Hours to Ticket changed from 0.0 to 0.15
- Total Hours changed from 0.5 to 0.65
So we need to improve cachin, which in in order of speed looks like:
- Feed in Nginx speed cache (file based cache for each request) -- near Varnish speed
- Redirects at nginx level HTTPS->HTTP (means 2 requests, second would hit speed cache) -- harder to implement since the nginx rules are controlled by Aegir, though I'm sure overrides are possible.
- Redirects in global.inc in php HTTPS->HTTP (still very fast, happens before Drupal starts, same as 2. and 1. above after that) - easy to implement in override.global.inc.
- Feed delivered in from Drupal's cache, in early bootstrap (drupal has part-loaded, but no modules have arrived yet, the feed/page is returned from Redis very quickly in around 100-250ms
- Feed generated by Drupal and returned (plus cached) -- slow, 0.3-3 seconds per page/feed.
Currently we're doing a 4. for the HTTPS RSS feeds including Rob's blog, and a 1 for HTTP versions. If we can get to the HTTPS calls to use 3., then we can expect an order of magnitude speed increase, and similar load decrease for all HTTPS RSS requests.
comment:7 Changed 3 years ago by ed
Hi both,
given what's going on we need to be in a utilitarian mindset. Please force RSS through http. We will get to the more strategic project goals and values when we're doing strategy stuff - for now we're digging in and getting the operations sorted.
comment:8 Changed 3 years ago by jim
- Add Hours to Ticket changed from 0.0 to 1.35
- Status changed from new to closed
- Resolution set to fixed
- Total Hours changed from 0.65 to 2.0
OK, so this is done... In a recent commit I added the following code to our /data/conf/global.override.inc:
/* ---------- PERFORMANCE ---------- */ /** * Detect requests for feed urls on HTTPS, redirect to HTTP. */ if ($conf['https'] && preg_match("/\/feed$|^\/blogs\/feed\/.*$/", $_SERVER['REQUEST_URI'])) { header("HTTP/1.1 301 Moved Permanently"); // Send user to same URL on HTTP. header("Location: " . str_replace('https://', 'http://', $base_url) . $_SERVER['REQUEST_URI']); }
Which, in English: If we're on a HTTPS page and the path requested ends in "*/feed" or starts with "/blogs/feed/*", then redirect the request back to the same path but on HTTP.
Seems to work well after a little snaffu with the comments (I put the path */ in which closed the comments and made a few 503s).
This will also match all built-in Drupal feeds (on taxonomies, forum etc), plus all blogs feeds. It's starting point for further tweaks later. And per my point in comment 5, I think the number of people reading a RSS feed with the HTTPS Everywhere plugin enabled will be tiny or 0 -- and this way the rest of the site is still accessible via HTTPS for users of the plugin... A good compromise I think.
Finally, because the Pathologic module is making requests and related resources the same protocol (i.e. all HTTPS requests get HTTPS asset links -- like images etc), these resources will also be returned over HTTP, which should maximise caching opportunities (some images are derrived/resized by the server and are only micro-cached over HTTPS). This will give a tiny by useful boost.
Marking done, please reopen if any issues found.
comment:9 Changed 3 years ago by jim
Note, this code could be moved to the 'local.settings.php file for the site if we need it only for TN.org -- presently it'll affect all sites hosted via Aegir.
comment:10 Changed 3 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.44
- Total Hours changed from 2.0 to 2.44
I'd like it noted that I fundamentally disagree with making resources such as RSS HTTP only and denying users the ability to download them using HTTPS. I think we should be aiming at making everything HTTPS only and looking at putting in place the infrastructure needed to support this.
To quantify the HTTP vs HTTPS performance question I have copied Rob's RSS feed to PenguinServer as a static file:
- http://penguin.transitionnetwork.org/rob-hopkins.rss
- https://penguin.transitionnetwork.org/rob-hopkins.rss
And these tests were run from ParrotServer, using ApacheBench, 10 requests for the RSS feed, gzipped, HTTP:
Server Software: nginx/1.4.1 Server Hostname: penguin.transitionnetwork.org Server Port: 80 Document Path: /rob-hopkins.rss Document Length: 425431 bytes Concurrency Level: 1 Time taken for tests: 0.053 seconds Complete requests: 10 Failed requests: 0 Write errors: 0 Total transferred: 4257050 bytes HTML transferred: 4254310 bytes Requests per second: 189.45 [#/sec] (mean) Time per request: 5.278 [ms] (mean) Time per request: 5.278 [ms] (mean, across all concurrent requests) Transfer rate: 78761.64 [Kbytes/sec] received Connection Times (ms) min mean[+/-sd] median max Connect: 0 0 0.1 0 0 Processing: 4 5 1.5 4 8 Waiting: 0 0 0.2 0 1 Total: 4 5 1.6 4 8 Percentage of the requests served within a certain time (ms) 50% 4 66% 4 75% 6 80% 7 90% 8 95% 8 98% 8 99% 8 100% 8 (longest request)
With HTTP it was 5 ms seconds per request, 0.05 seconds for 10 requests, with HTTPS:
ab -v 4 -n 10 -H "Accept-Encoding: gzip" https://penguin.transitionnetwork.org/rob-hopkins.rss Server Software: nginx/1.4.1 Server Hostname: penguin.transitionnetwork.org Server Port: 443 SSL/TLS Protocol: TLSv1/SSLv3,RC4-SHA,2048,128 Document Path: /rob-hopkins.rss Document Length: 425431 bytes Concurrency Level: 1 Time taken for tests: 25.875 seconds Complete requests: 10 Failed requests: 0 Write errors: 0 Total transferred: 4256820 bytes HTML transferred: 4254310 bytes Requests per second: 0.39 [#/sec] (mean) Time per request: 2587.505 [ms] (mean) Time per request: 2587.505 [ms] (mean, across all concurrent requests) Transfer rate: 160.66 [Kbytes/sec] received Connection Times (ms) min mean[+/-sd] median max Connect: 25 45 23.7 30 84 Processing: 1799 2542 274.7 2649 2701 Waiting: 43 89 34.4 86 134 Total: 1824 2587 283.0 2702 2729 Percentage of the requests served within a certain time (ms) 50% 2702 66% 2705 75% 2712 80% 2719 90% 2729 95% 2729 98% 2729 99% 2729 100% 2729 (longest request)
2.6 seconds per request, 26 seconds for 10 requests -- it is clearly true that HTTPS is slower than HTTP.
Repeating the test but this time with 10 concurrent requests and asking for 100 copies in total, HTTP:
ab -v 4 -n 100 -c 10 -H "Accept-Encoding: gzip" http://penguin.transitionnetwork.org/rob-hopkins.rss Server Software: nginx/1.4.1 Server Hostname: penguin.transitionnetwork.org Server Port: 80 Document Path: /rob-hopkins.rss Document Length: 425431 bytes Concurrency Level: 10 Time taken for tests: 0.272 seconds Complete requests: 100 Failed requests: 0 Write errors: 0 Total transferred: 42570500 bytes HTML transferred: 42543100 bytes Requests per second: 367.56 [#/sec] (mean) Time per request: 27.207 [ms] (mean) Time per request: 2.721 [ms] (mean, across all concurrent requests) Transfer rate: 152804.49 [Kbytes/sec] received Connection Times (ms) min mean[+/-sd] median max Connect: 0 1 0.3 1 2 Processing: 13 26 8.7 23 43 Waiting: 1 4 3.3 3 16 Total: 14 27 8.8 24 44 Percentage of the requests served within a certain time (ms) 50% 24 66% 26 75% 37 80% 39 90% 42 95% 43 98% 43 99% 44 100% 44 (longest request)
And HTTPS:
ab -v 4 -n 100 -c 10 -H "Accept-Encoding: gzip" http://penguin.transitionnetwork.org/rob-hopkins.rss Server Software: nginx/1.4.1 Server Hostname: penguin.transitionnetwork.org Server Port: 443 SSL/TLS Protocol: TLSv1/SSLv3,RC4-SHA,2048,128 Document Path: /rob-hopkins.rss Document Length: 425431 bytes Concurrency Level: 10 Time taken for tests: 282.074 seconds Complete requests: 100 Failed requests: 0 Write errors: 0 Total transferred: 42568200 bytes HTML transferred: 42543100 bytes Requests per second: 0.35 [#/sec] (mean) Time per request: 28207.423 [ms] (mean) Time per request: 2820.742 [ms] (mean, across all concurrent requests) Transfer rate: 147.37 [Kbytes/sec] received Connection Times (ms) min mean[+/-sd] median max Connect: 51 1147 613.7 1119 2528 Processing: 23610 26981 1325.9 26779 29786 Waiting: 30 540 319.1 446 1517 Total: 25381 28128 1458.4 28574 30722 Percentage of the requests served within a certain time (ms) 50% 28574 66% 29042 75% 29285 80% 29483 90% 30260 95% 30506 98% 30713 99% 30722 100% 30722 (longest request)
Time wise there is clearly no contest with nginx serving static files, of course with PuffinServer the situation is different, Ngnix does all HTTPS be reverse proxying to itself using HTTP and the RSS is dynamically generated and cached, but in terms of the time it takes to do HTTPS compared with HTTP it is very clear that HTTP is faster.
comment:11 Changed 3 years ago by chris
- Add Hours to Ticket changed from 0.0 to 0.1
- Total Hours changed from 2.44 to 2.54
For comparison with the stats for 100 requests, via 10 concurrent connections, these are the results run against PuffinServer (I ran this a few times, these are the best results):
ab -v 4 -n 100 -c 10 -H "Accept-Encoding: gzip" http://www.transitionnetwork.org/blogs/feed/rob-hopkins Server Software: nginx Server Hostname: www.transitionnetwork.org Server Port: 80 Document Path: /blogs/feed/rob-hopkins Document Length: 126308 bytes Concurrency Level: 10 Time taken for tests: 0.540 seconds Complete requests: 100 Failed requests: 0 Write errors: 0 Total transferred: 12700900 bytes HTML transferred: 12630800 bytes Requests per second: 185.29 [#/sec] (mean) Time per request: 53.968 [ms] (mean) Time per request: 5.397 [ms] (mean, across all concurrent requests) Transfer rate: 22982.38 [Kbytes/sec] received Connection Times (ms) min mean[+/-sd] median max Connect: 0 0 0.3 0 1 Processing: 25 50 49.9 33 301 Waiting: 6 28 49.6 10 274 Total: 25 50 50.1 33 302 Percentage of the requests served within a certain time (ms) 50% 33 66% 38 75% 48 80% 56 90% 90 95% 191 98% 269 99% 302 100% 302 (longest request)
I think it's very interesting to note that PenguinServer Nginx is twice as fast when serving a static file, without any caching set up, than PuffinServer is serving the same data, dynamically generated, via it's various levels of caching.
comment:12 follow-up: ↓ 13 Changed 3 years ago by jim
- Add Hours to Ticket changed from 0.0 to 0.15
- Total Hours changed from 2.54 to 2.69
Per #602 later comments, I've changed the way this works... I'm now forcing long cache times for RSS regardless of HTTP security status.
The code in the override.global.inc (replacing that in comment:5) is now:
/** * Detect requests for feed urls and force long Redis and Speedcache cache times. */ if (preg_match("/\/feed$|^\/blogs\/feed\/.*$/", $_SERVER['REQUEST_URI'])) { // Set all caches on, and to 3 hours expiry (belt and braces -- might be overkill!). $expire_in_seconds = '10800'; $use_cache = TRUE; $use_redis = TRUE; header('X-Force-Redis: YES'); header('X-Limit-Booster: ' . $expire_in_seconds); header('X-Accel-Expires: ' . $expire_in_seconds); }
This resolves the issue as there's no redirect, and covers Chris' concerns regarding HTTPS Everywhere users.
There's also an explanation of the cache speed issue for these feeds over on #602 too.
comment:13 in reply to: ↑ 12 Changed 3 years ago by chris
Replying to jim:
This resolves the issue as there's no redirect, and covers Chris' concerns regarding HTTPS Everywhere users.
Thanks :-)
comment:14 follow-up: ↓ 15 Changed 3 years ago by chris
Jim - have you managed to login to New Relic since it was added back? I still haven't.
I was going to remove it again tonight but I think it's worth taking one last look, especially since it was added back at 8:30pm on 3rd Oct and the load spikes, once again, dramatically reduced around the same time:
Of course it was the weekend and the load has picked up since and it does correspond with the Piwik stats -- I fiund it hard to believe there is a direct correlation:
comment:15 in reply to: ↑ 14 Changed 3 years ago by chris
Oops chris should have been posted to ticket:586#comment:31
Looking at the HTTP headers for Rob's RSS feed using lynx:
So, BOA must assume that this is bot request that isn't allowed, so, using ApacheBench:
And doing 10 requests:
The speed of the response indicates that it's clearly serving the results from a cache -- 10 copies in 1 second.
However using HTTPS (I don't how many requests for the feed use HTTPS) it appears that it isn't cached, 32 seconds for 10 copies:
These are the HTTPS headers:
It appears that a cached copy is being served from the X-Speed-Cache: HIT header, but the speed indicated that it probably isn't - half a second for a single files from another virtual server on the same network via HTTPS compared to 0.05 seconds using HTTP doesn't sound exactly right to me.
Jim, any thoughts on what to do next with this?
My only idea is to put a copy of the RSS file on penguin (also running Nginx) and test if from there using HTTP and HTTPS to see of that is what is adding the overhead.