An Efficient Change Detection Algorithm for Tree-structured Data


The KIPS Transactions:PartC, Vol. 10, No. 6, pp. 683-694, Oct. 2003
10.3745/KIPSTC.2003.10.6.683,   PDF Download:

Abstract

We present X-tree Diff, a change detection algorithm for tree-structured data. Our work is motivated by the need to monitor massive volume of web documents and detect suspicious changes, called defacement attack on web sites. From this context, our algorithm should be very efficient in speed and use of memory space. X-tree Diff uses a special ordered labeled tree, X-tree, to represent XML/HTML documents. X-tree nodes have a special field, tMD, which stores a 128-bit hash value representing the structure and data of subtrees, so that it compares between subtrees efficiently. X-tree Diff, supporting four basic operations (insert, delete, update, and move), uses tMD fields to match identical subtrees from the old and new versions. During this process, X-tree Diff uses the Rule of Delaying Ambiguous Matchings, implying that it perform exact matchings where a node in the old version has one-to-one correspondence with the corresponding node in the new, by delaying all the others. It drastically reduces the possibility of wrong matchings. X-tree Diff propagates such exact matchings upwards in Step 2, and obtain more matchings downwards from roots in Step 3. In Step 4, nodes to be inserted or deleted are decided. We also show that X-tree Diff runs in O(n), where n is the number of nodes in X-trees, in worst case as well as in average case. This result is even better than that of BULD Diff algorithm, which is O(n log(n)) in worst case. We experimented X-tree Diff on real data, which are about 11,000 home pages from about 20 web sites, instead of synthetic documents manipulated for experimentation. Currently, X-tree Diff algorithm is being used in a commercial hacking detection system, called the WIDS (Web-Document Intrusion Detection System), which is to find changes occurred in registered websites, and report suspicious changes to users.


Statistics
Show / Hide Statistics

Statistics (Cumulative Counts from September 1st, 2017)
Multiple requests among the same browser session are counted as one view.
If you mouse over a chart, the values of data points will be shown.


Cite this article
[IEEE Style]
L. S. Gyun and K. D. A, "An Efficient Change Detection Algorithm for Tree-structured Data," The KIPS Transactions:PartC, vol. 10, no. 6, pp. 683-694, 2003. DOI: 10.3745/KIPSTC.2003.10.6.683.

[ACM Style]
Lee Seog Gyun and Kim Dong A. 2003. An Efficient Change Detection Algorithm for Tree-structured Data. The KIPS Transactions:PartC, 10, 6, (2003), 683-694. DOI: 10.3745/KIPSTC.2003.10.6.683.