New program: count-http-requests
Although I use the Pi-hole ad blocker and, in Firefox, the NoScript and uMatrix add-ons to cut down on the number of ads I get when browsing the web, from time to time I like to see just how much traffic these tools save me. On Wednesday 21 August I wrote a little awk program that counts HTTP requests and the hosts they connect to, and gives me a listing.
Here’s the program:
    #!/usr/bin/awk -f
    ##---------------------------------------------------------------------------##
    # Program: count-http-requests
    # Author: Brian <genius@groupbcl.ca> :)
    # Date: August 2019
    #
    # Reads an input file created from a Firefox Developer Tools Console
    # trace, counts URLs and their success or failure, and displays a
    # list of URLs, counts, and totals.
    #
    # Requires GNU awk (gawk): the three-argument match() is a gawk extension.
    #
    # As of August 2019, do the following to create an input file and run
    # this program:
    #   * Start Firefox
    #   * Press F12 to open the Developer Tools
    #   * In the Developer Tools pane or window, click "Console"
    #   * Leave only "Requests" turned on; turn off Errors, Warnings, Logs,
    #     Info, Debug, CSS, and XHR.
    #   * Navigate to the URL of interest
    #   * When the page finishes loading, right-click on the main body of the
    #     developer tools window and select "Export visible messages to
    #     clipboard"
    #   * Edit a file (for example, "/r/http-requests.A.text") and paste the
    #     clipboard contents into it.
    #   * Run this program as follows:
    #       count-http-requests FILENAME | sort | cut -f2 | less
    ##---------------------------------------------------------------------------##
    # BUUS: This script is part of Brian's Useful Utilities Set

    BEGIN { h_count=0; t_count_all=0; t_count_succeed=0; t_count_fail=0; max_h_len=0 }

    # On a GET or POST request, get the FQDN and increment its request count
    match($0, /^(GET|POST) *https?:\/\/([^\/]+)/, a) {
        t_count_all++
        # Reverse the host name (e.g. host.domain.tld --> tld.domain.host) to
        # group all hosts in a given domain together
        count = split(a[2], b, /\./)
        host = ""
        for (i=count; i>0; i--) {
            host = host (i==count ? "" : ".") b[i]
        }

        # Initialise counters if we haven't seen this host before
        if (!(host in h_name)) {
            h_count++
            h_name[host] = a[2]
            h_count_all[host] = 0
            h_count_succeed[host] = 0
            h_count_fail[host] = 0
        }
        h_count_all[host]++
        # Track the longest host name so the output columns line up
        if (length(a[2]) > max_h_len) max_h_len = length(a[2])
    }

    # On an HTTP response, count success or fail against the most recent host
    match($0, /\[HTTP\/... ([0-9][0-9][0-9])/, a) {
        if (a[1] < 400) {
            h_count_succeed[host]++
            t_count_succeed++
        } else {
            h_count_fail[host]++
            t_count_fail++
        }
    }

    # Display results; the first tab-separated field is the reversed host
    # name, used only as a sort key and removed afterwards by "cut -f2"
    END {
        for (host in h_name) {
            # Requests that never received an HTTP response
            i = h_count_all[host] - (h_count_succeed[host] + h_count_fail[host])
            x = ""
            if (h_count_succeed[host]) x = h_count_succeed[host] " succeeded"
            if (h_count_fail[host]) x = x (x ? ", " : "") h_count_fail[host] " failed"
            if (i) x = x (x ? ", " : "") i " never connected"
            printf("%s\t%-" (max_h_len + 1) "s total %i; %s\n",
                host, h_name[host] ":", h_count_all[host], x)
        }
        # The "z" key makes the totals line sort after every host line
        print "z\t Total " t_count_all " requests to " h_count " individual hosts: " \
            t_count_succeed " succeeded, " \
            t_count_fail " failed, " \
            (t_count_all - (t_count_succeed + t_count_fail)) " never connected"
    }
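The sort-key trick is easier to see in isolation. The sketch below is not part of count-http-requests; it is a minimal shell-and-awk illustration built around a hypothetical helper, `group_hosts`, that prefixes each host name with its reversed form, sorts on that hidden key, and then cuts the key off again, much as the `sort | cut -f2` pipeline in the header comment does. The `LC_ALL=C` setting is my own assumption to make the sort order predictable.

```sh
#!/bin/sh
# group_hosts: hypothetical helper, not part of count-http-requests.
# Reads one host name per line on stdin and prints them grouped by domain.
group_hosts() {
    awk '{
        # Build the reversed name, e.g. www.dailymail.co.uk -> uk.co.dailymail.www
        n = split($0, part, /\./)
        key = part[n]
        for (i = n - 1; i > 0; i--) key = key "." part[i]
        # Emit "key<TAB>original" so sort can order on the reversed key
        print key "\t" $0
    }' | LC_ALL=C sort | cut -f2
}

# Hosts from the same domain end up next to each other
printf '%s\n' www.dailymail.co.uk ib.adnxs.com i.dailymail.co.uk acdn.adnxs.com | group_hosts
```

Sorting on the reversed name is what keeps all the dailymail.co.uk hosts together in the listings, instead of scattering them alphabetically by their first label.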
The Daily Mail is possibly the worst site on the internet for advertising signal-to-noise ratio. With no ad blockers at all, its front page makes nearly 400 individual requests to nearly a hundred different hosts. Many of those requests are due to cascading JavaScript, where one script makes requests that connect to additional sites and fetch more JavaScript, which in turn does the same thing ...
    ad.360yield.com: total 2; 2 succeeded
    acdn.adnxs.com: total 2; 2 succeeded
    ib.adnxs.com: total 4; 4 succeeded
    ... (368) lines deleted ...
    creative.dailymail.co.uk: total 1; 1 succeeded
    crta.dailymail.co.uk: total 4; 4 succeeded
    dailymail.co.uk: total 1; 1 succeeded
    i.dailymail.co.uk: total 63; 63 succeeded
    scripts.dailymail.co.uk: total 1; 1 succeeded
    video.dailymail.co.uk: total 4; 4 succeeded
    www.dailymail.co.uk: total 23; 23 succeeded
    Total 378 requests to 94 individual hosts: 377 succeeded, 1 failed, 0 never connected
With Firefox’s uMatrix and NoScript add-ons disabled, leaving only Pi-hole to block advertising sites, I got the following:
    adservice.google.ca: total 2; 2 never connected
    ad.360yield.com: total 1; 1 never connected
    acdn.adnxs.com: total 2; 2 succeeded
    ... (183) lines deleted ...
    creative.dailymail.co.uk: total 1; 1 never connected
    crta.dailymail.co.uk: total 2; 2 succeeded
    dailymail.co.uk: total 1; 1 succeeded
    i.dailymail.co.uk: total 62; 10 succeeded, 52 never connected
    scripts.dailymail.co.uk: total 1; 1 never connected
    video.dailymail.co.uk: total 4; 4 succeeded
    www.dailymail.co.uk: total 24; 7 succeeded, 17 never connected
    Total 193 requests to 57 individual hosts: 43 succeeded, 1 failed, 149 never connected
With Pi-hole ad blocking, uMatrix, and NoScript all enabled, a visit to the site looks like this:
    d3tsytm1wtjqo2.cloudfront.net: total 6; 6 succeeded
    dailymail.co.uk: total 1; 1 succeeded
    i.dailymail.co.uk: total 56; 56 succeeded
    scripts.dailymail.co.uk: total 1; 1 succeeded
    video.dailymail.co.uk: total 2; 2 succeeded
    www.dailymail.co.uk: total 20; 20 succeeded
    Total 86 requests to 6 individual hosts: 86 succeeded, 0 failed, 0 never connected
Like I said, the Daily Mail is probably the worst offender on the web. Here are a couple of other sites:
Site | Full blocking (succeeded / failed / never connected) | No blocking (succeeded / failed / never connected) |
---|---|---|
cbc.ca/news | 65 requests (8 hosts): 65 / 0 / 0 | 190 requests (48 hosts): 189 / 0 / 1 |
universetoday.com | 46 requests (8 hosts): 45 / 0 / 1 | 135 requests (40 hosts): 134 / 0 / 1 |
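To put those totals in perspective, here is a small sketch of the arithmetic, using a hypothetical helper called `reduction` (my own name, not part of count-http-requests). It takes the request count with no blocking and the count with full blocking, and prints the percentage of requests avoided. The inputs are the totals reported above: 378 and 86 for the Daily Mail, and the two rows of the table.

```sh
#!/bin/sh
# reduction: hypothetical helper; prints the percentage of requests avoided.
# Usage: reduction REQUESTS_WITHOUT_BLOCKING REQUESTS_WITH_FULL_BLOCKING
reduction() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (1 - after / before) * 100 }'
}

reduction 378 86   # Daily Mail front page  -> 77.2
reduction 190 65   # cbc.ca/news            -> 65.8
reduction 135 46   # universetoday.com      -> 65.9
```

By this measure, full blocking removes roughly three quarters of the Daily Mail's requests and about two thirds of the requests on the other two sites.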