DHT Crawling

As part of the broad goal of replicating a computer security attack, our group investigated the BitTorrent protocol. BitTorrent is a peer-to-peer file sharing protocol in which users downloading a given file simultaneously “seed” the portions they have already downloaded to other users. This creates a robust distributed network of peers that allows for higher download speeds, greater resilience to outages, and far more simultaneous downloaders than a traditional single server can support. According to McAfee, almost 25% of upstream internet traffic is carried over the BitTorrent protocol. BitTorrent is therefore an important component of the modern internet and is worth understanding from a security standpoint.

Because BitTorrent is decentralized and resilient, it has become the protocol of choice for sharing files that would be difficult to share over traditional centralized networks. In some cases the obstacle is cost and bandwidth: the open-source community, for example, uses BitTorrent to distribute large Linux installation media. In other cases the obstacle is legal: many copyrighted works are shared over BitTorrent because of its perceived anonymity and because there is no centralized hosting server that copyright holders can take down with a court order.

Traditionally, BitTorrent files were searchable on centralized servers called “trackers”, which maintain a list of available files along with the IP addresses of the peers currently torrenting each file. This system retains the bandwidth benefits of the distributed BitTorrent network but lacks its long-term resilience: if the tracker server goes offline, new peers are effectively prevented from joining torrents. When BitTorrent trackers list copyrighted material, copyright holders or bounty hunters in their employ may report the tracker servers to federal law enforcement, resulting in injunctions that require domain registrars or server hosts to take down the tracker site.

In response to legal action and the risk of trackers being forcibly taken down, the BitTorrent protocol was extended to support distributed discovery of torrents using distributed hash tables (DHTs). This removes the centralized tracker server from the BitTorrent system, making the protocol as a whole more resistant to centralized failures. However, DHT torrenting relies on a network of cooperating nodes that together take the place of the tracker, so the identities of torrenting peers are no more anonymous than those of peers torrenting via traditional centralized trackers.
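To make the idea of tracker-free discovery more concrete, the sketch below shows the distance metric used by Kademlia, the design on which BitTorrent's DHT extension (BEP 5) is based. Nodes and torrents share one 160-bit identifier space, and the peers for a torrent are stored on the nodes whose IDs are closest to the torrent's info hash, where closeness is the XOR of the two IDs read as an integer. This is a minimal illustration only: the function names `xor_distance` and `closest_nodes` and the node IDs are our own assumptions, not part of any library.

```python
import hashlib

ID_LENGTH = 20  # 160-bit identifiers, the same size as a SHA-1 digest


def xor_distance(a: bytes, b: bytes) -> int:
    """Return the Kademlia distance between two 160-bit IDs."""
    if len(a) != ID_LENGTH or len(b) != ID_LENGTH:
        raise ValueError("IDs must be exactly 20 bytes")
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")


def closest_nodes(target: bytes, node_ids: list, k: int = 8) -> list:
    """Return the k known node IDs closest to the target (hypothetical helper)."""
    return sorted(node_ids, key=lambda n: xor_distance(n, target))[:k]


if __name__ == "__main__":
    # Made-up node IDs and a made-up info hash, for illustration only.
    nodes = [hashlib.sha1(f"node-{i}".encode()).digest() for i in range(50)]
    info_hash = hashlib.sha1(b"example torrent").digest()
    for node_id in closest_nodes(info_hash, nodes, k=3):
        print(node_id.hex())
```

Because every node can compute this distance locally, a lookup can proceed by repeatedly asking the closest known nodes for even closer ones, with no central server involved.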

Our goal is to understand the BitTorrent protocol and to build a prototype client that efficiently traverses the most popular distributed hash tables currently used for torrenting. This will allow us to collect metadata such as the IP addresses associated with specific torrents, determine which torrents are most popular, and draw other statistical inferences about the behavior of torrenters and the files commonly downloaded using the BitTorrent protocol.
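As a rough sketch of the kind of message such a client would send, the code below builds a `get_peers` query in the format described by BEP 5, where DHT messages are bencoded dictionaries sent over UDP. The helper names `bencode` and `get_peers_query` are our own, and the node ID, info hash, and transaction ID are placeholder values. This shows only the encoding step under those assumptions and is not a full crawler.

```python
def bencode(value) -> bytes:
    """Encode ints, byte strings, text strings, lists, and dicts as bencode."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Bencoded dictionary keys are byte strings and must appear in sorted order.
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        items.sort(key=lambda kv: kv[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def get_peers_query(node_id: bytes, info_hash: bytes, tid: bytes = b"aa") -> bytes:
    """Build a get_peers query asking which peers hold the given torrent."""
    if len(node_id) != 20 or len(info_hash) != 20:
        raise ValueError("node_id and info_hash must be 20 bytes each")
    return bencode({
        "t": tid,          # transaction ID, echoed back in the response
        "y": "q",          # message type: query
        "q": "get_peers",  # query name
        "a": {"id": node_id, "info_hash": info_hash},
    })


if __name__ == "__main__":
    # Placeholder identifiers; a real client would use its own random node ID.
    print(get_peers_query(b"abcdefghij0123456789", b"mnopqrstuvwxyz123456"))
```

A crawler would send the resulting bytes in a UDP datagram to a known DHT node and parse the bencoded reply, which contains either peer addresses for the torrent or a list of nodes closer to the info hash to query next.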