Master of Science (M.Sc.)
Web crawlers have been developed for several malicious purposes like downloading server data without permission from website administrator. Armored stealthy crawlers are evolving against new anti-crawler mechanisms in the arms race between the crawler developers and crawler defenders. In this paper, we develop a new anti-crawler mechanism called PathMarker to detect and constrain crawlers that crawl content of servers stealthy and persistently. The basic idea is to add a marker to each web page URL and then encrypt the URL and marker. By using the URL path and user information contained in the marker as the novel features of machine learning, we could accurately detect stealthy crawlers at the earliest stage. Besides effectively detecting crawlers, PathMarker can also dramatically suppress the efficiency of crawlers before they are detected by misleading the crawlers visiting same page's URL with different markers. We deploy our approach on a forum website to collect normal users' data. The evaluation results show that PathMarker can quickly capture all 12 open-source and in-house crawlers, plus two external crawlers (i.e., Googlebots and Yahoo Slurp).
© The Author
Wan, Shengye, "Protecting Web Contents Against Persistent Crawlers" (2016). Dissertations, Theses, and Masters Projects. William & Mary. Paper 1477068008.