Publication

Protecting Web Contents Against Persistent Crawlers

Wan, Shengye
Abstract
Web crawlers have been developed for several malicious purposes, such as downloading server data without the website administrator's permission. Armored stealthy crawlers evolve against new anti-crawler mechanisms in the arms race between crawler developers and crawler defenders. In this paper, we develop a new anti-crawler mechanism called PathMarker to detect and constrain crawlers that crawl server content stealthily and persistently. The basic idea is to add a marker to each web page URL and then encrypt the URL and marker together. By using the URL path and user information contained in the marker as novel machine-learning features, we can accurately detect stealthy crawlers at the earliest stage. Besides effectively detecting crawlers, PathMarker can also dramatically suppress the efficiency of crawlers before they are detected, by misleading them into visiting the same page through URLs carrying different markers. We deploy our approach on a forum website to collect normal users' data. The evaluation results show that PathMarker can quickly capture all 12 open-source and in-house crawlers, plus two external crawlers (i.e., Googlebots and Yahoo Slurp).
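The mechanism described in the abstract can be illustrated with a minimal sketch. The key name, nonce size, and the SHA-256-based keystream below are illustrative stand-ins, not the paper's actual cipher or marker format; the sketch only shows the idea that a marker (URL path plus user information) is encrypted into the URL, and that encrypting the same page twice yields different URLs, which misleads crawlers into revisiting pages.

```python
import base64
import hashlib
import os

SECRET_KEY = b"server-side-secret"  # hypothetical server-held key


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # Illustrative stream cipher: hash key||nonce||counter in counter mode.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def encrypt_url(path: str, user_id: str) -> str:
    # The marker carries the URL path and user information, per the abstract.
    marker = f"{path}|{user_id}".encode()
    nonce = os.urandom(8)  # fresh nonce: the same page maps to different URLs
    ks = _keystream(SECRET_KEY, nonce, len(marker))
    ciphertext = bytes(a ^ b for a, b in zip(marker, ks))
    return base64.urlsafe_b64encode(nonce + ciphertext).decode().rstrip("=")


def decrypt_url(token: str) -> tuple[str, str]:
    # Server-side: recover the path and user info for logging and ML features.
    raw = base64.urlsafe_b64decode(token + "=" * (-len(token) % 4))
    nonce, ciphertext = raw[:8], raw[8:]
    ks = _keystream(SECRET_KEY, nonce, len(ciphertext))
    path, user_id = bytes(a ^ b for a, b in zip(ciphertext, ks)).decode().split("|")
    return path, user_id
```

Two calls to `encrypt_url` on the same page produce distinct tokens, yet both decrypt to the same path and user, so legitimate navigation is unaffected while a crawler deduplicating by URL wastes requests.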
Date
2016-10-01
Department
Computer Science
DOI
http://doi.org/10.21220/S2301C