PoempelFox Blog

[..] [RSS Feed]
 

Sun, 21. Feb 2016


Download redirector for Debian CD and DVD images Created: 21.02.2016 10:50

The problem

If you ever downloaded Debian CD and/or DVD images from the internet via HTTP/FTP (and not via Torrent), you'll have come across the 'Debian CD/DVD images via HTTP/FTP' website. So did I on multiple occasions, and when I last visited it to download a Jessie image after that was released, I noticed that it still was as horrible as always: There are links to downloads for each architecture, but they all point to the main site for CD/DVD images in Sweden. There is also a list of all known mirrors hidden away at the bottom of the page, but even if you happen to find that, it is next to useless:
  • It does link to mirrors that have been down for weeks or months
  • It does link to mirrors that do not have the current version (yet)
  • The links point to the root of the mirror, not to the directory containing the image for the current version, meaning you'll have to find your way there, which sometimes isn't easy. And of course, if you actually selected a mirror that does not have the file that you wanted, you'll have to start over with another mirror... In some cases it may not even be obvious if the mirror really does not have the file that you wanted, or you just got lost in the maze of directories with no description.
As a result, nobody uses the mirrors, everyone just uses the main site in Sweden. I can actually confirm that, as I run a listed public mirror of debian-cd at work, and even during release-time it is one of our least traffic-generating mirrors with only a few gigabytes per day - while the site in Sweden pushes out 10+ GBit. While they don't seem to mind the traffic surge that generates, this is still less than ideal. Especially with huge files like 4 GB DVD ISO-images, it does make a huge difference if you download them from another continent, or at least ten times as fast from a mirror in the same country.

The solution

Ideally, the download-page would send you to a mirror that is local to you and has the file that you want automatically. This would better spread the load and also improve the user experience through faster downloads.
This is actually not rocket science. Software that does that is available, and in fact is already in use by Debian: httpredir.debian.org does exactly that for the archive, i.e. the repositories that apt uses. Unfortunately, the software used for httpredir can only handle package repositories, because it needs the index files in those repositories. It would be hard and messy to implement support for redirecting debian-cd into the software used for httpredir. There is however other existing software of this kind that would work very well for debian-cd. The two most popular solutions are Mirrorbrain and Mirrorbits. Both boast a very similiar featureset, as Mirrorbits was written by a VLC developer to replace Mirrorbrain because it allegedly became too slow. Even the commandline-interface of both is almost identical.
The way this works is that Mirrorbrain knows about all the public mirrors of debian-cd and their location. It will check if they are up every few minutes, and it will also scan their contents in regular intervals, so that it knows which mirrors have which files. When a client asks the Mirrorbrain-server for a file, the server will look up the client in his GeoIP and ASN databases. As a result of that, it will usually know in which country and on which ISP the client is. It will then try to find a good mirror for the client to use. It will automatically ignore mirrors that are down or do not have the file that was requested (because they are partial mirrors, or out of sync). If there is one (or multiple) mirrors on the same subnet or the same ASN (in simplified terms, the same ISP) it will select those. If there aren't, it will look for mirrors in the same country, and select a random one of those. If no mirrors are in the same country, the search is broadended to the same continent. Only if that search is also unsuccessful, or if the initial GeoIP-lookup failed to return a country for the client, a random server from all over the world will be selected. The client will then get a HTTP 302 response that redirects it to the selected server, which will send the actual file.

The catch

So why is this not live yet? Well, in a way it is, see below.
But for this to run as an official debian service on official debian infrastructure, there is one major prerequisite: The software needs to be in the standard Debian package repositories. And unfortunately, so far neither Mirrorbrain nor Mirrorbits are.
There actually are Debian packages available for Mirrorbrain, but not in the standard Debian repositories. They would probably need some work to make them compliant with Debian packaging policies. Mirrorbrain consists of multiple packages, among them three small Apache-modules.
For Mirrorbits there are no packages available, and I imagine packaging it will not be fun, because it's the typical "lets download 1000 random libraries in random, mostly beta-versions, because we have to use the absolutely latest features, from random sites around the internet" kind of software. I'll rant about how much I loathe that another time. On the plus side, Mirrorbits packs everything, from builtin webserver to command line utilities, into just one binary, so there will only be one mirrorbits-package. It is also the newer project and under active development.
So far, all attempts to find a Debian Developer to create and maintain packages have been unsuccessful. If you are a Debian Developer and willing to package Mirrorbrain or Mirrorbits - please do. It doesn't really matter which one, both will provide the required featureset, and both have their advantages and disadvantages.

Running instance

Originally because I wanted to toy around a bit with Mirrorbrain, I actually did set up a working mirrorbrain-instance for debian-cd. If you want to give it a try, head over to
debian-cd.debian.net.
This is a fully functional mirror redirector for debian-cd downloads. It knows all mirrors of debian-cd contained in the official mirrorlist, and scans those that it can scan. If this were running on official DSA infrastructure and not on a private server run by me, all download-links on the webpage could be pointed at this tomorrow, and thus be automatically redirected.
It could still use some fine-tuning with regards to mirror selection though: For example, it is possible to set priorities for the mirrors within a country depending on how much bandwidth they have available, and that will make Mirrorbrain redirect more or less clients there; Or it might sometimes be a good idea to override the selection of servers for countries that do not have any mirrors on their own, because Mirrorbrains automatism based on geographic distance sometimes makes less than ideal choices. That however can only be improved by lots of feedback from lots of users. Feel free to send me feedback to the address given on the website. It should be possible to import all tuning made on my demo-instance into a production-version on official Debian infrastructure, which will hopefully happen one day.
For any feedback, it is imperative to know what mirrorbrain thinks about you: Where you're coming from, which mirrors it considered, and why. Luckily, one can easily get that info: You just need to append ?mirrorlist to the download URL, and instead of redirecting you, Mirrorbrain will give you lots of info helpful for debugging. Here is an example output from this in a case where mirror selection worked perfectly:

As you can see, it recognized there is a mirror on the very same subnet (because the university happens to run a debian mirror), and since it is up and has the requested file, the client would be redirected there. If the mirror were down, Mirrorbrain would randomly select (with the "prio" values influencing the likeliness of a mirror getting selected) a mirror from the next group of "same AS", and so on if none was available there.
The Mirrorbrain-version running is mostly the current one from the Mirrorbrain and mod_asn Github repositories with a few minor fixes for IPv6 support applied. The latest released version does not support IPv6 yet, the git version does, but still had a few bugs in that - I've reported them upstream and sent pull requests for some of them. They'll hopefully be fixed soon. Apart from that, there really isn't anything special about this installation. Most of the work setting this up was feeding it the mirrorlist. That was because the official mirrors.masterlist contained A LOT of mirrors, many of them dead, with wrong paths, not answering to rsync or ftp even though the masterlist said they would, and so on. That meant that even though I had a script that would compare the list of mirrors in Mirrorbrain and the one in the masterlist and spit out the commands needed for bringing those into sync, I would have to manually check every second entry, because the masterlist was just wrong. It took me some hours to get all about 130 mirrors into Mirrorbrain. In some rare cases, mirrors are actually up, but cannot be used in Mirrorbrain. That happens because Mirrorbrain needs to scan what files a mirror has, and for that it needs a way to get a directory listing. If a mirror only offers HTTP and no FTP or RSYNC, and prints the directory listings in a format that misses vital information or is just too messed up for Mirrorbrain to parse, that mirror cannot be used, even though it might be perfectly fine for downloading. However that applies to less than 5% of all mirrors, so there aren't that many lost due to that.
If you want to see a list of mirrors currently known to the Mirrorbrain instance, have a look at http://debian-cd.debian.net/mirrorlist.html. It also shows a nice map of the mirror locations.
My biggest hope in running this demo instance is that seeing how good this works in practice will motivate someone to take on the packaging. Lets see if that works out. Until then, spread the word that this option exists. I plan to keep this running for the foreseeable future.
no comments yet
write a new comment:
name or nickname
eMail adress (optional)
Your comment:
calculate: (2 times 10) plus 3
 

EOPage - generated with blosxom