URN: etd-1110112-003335
Statistics: This thesis has been viewed 2016 times and downloaded 2414 times.
Author: Tsao-Jen Hsueh
Author's Email Address: Not public
Department: Computer Science and Engineering
Year: 2012
Semester: 1
Degree: Master
Type of Document: Master's Thesis
Language: zh-TW.Big5 Chinese
Page Count: 77
Title: WEB DATA EXTRACTION SYSTEM – DATA TRANSFORMATION AND SCHEDULING
Keywords: Scheduling; Extraction; Web data extraction; Web mining; Wrapper generation; XML transformation

Abstract
The World Wide Web (WWW or Web) is like the largest library in the world: it holds abundant and varied knowledge that we can carry with us at any time, realizing the dream of “knowing everything without going out”. We often use a search engine such as Google to query information on the Internet by keyword. However, the search engine also returns much unrelated information, which requires further sorting and filtering before we get what we really want. How to find the information we need on the vast Internet and extract the useful data has therefore become a widely discussed research topic.
In this thesis we propose a Web data extraction system called “W2X”. It focuses on a few websites with similar themes, such as travel or shopping, automatically downloads the pages of interest, and extracts the needed data from these pages even though their structures are completely different. Finally, the extracted information is transformed into a uniform format and delivered to a portal website for querying. The portal website can thus provide query results across multiple websites, which saves end users the time of querying each website separately, and it presents an integrated result that makes the information easier to compare and analyze. In addition, to keep data updates from being interrupted when the structure of a target webpage changes, the system provides a graphical user interface for quickly changing and adjusting settings, so that the portal website can provide up-to-date query results in the shortest time.
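The transformation step described in the abstract, where records extracted from differently structured sites are converted to one uniform format, might look roughly like the following minimal Python sketch. It is not the W2X implementation: the site names, field names, mapping table, and XML element names are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site field mappings: site-specific key -> common field name.
# Neither the sites nor the fields come from the thesis.
FIELD_MAPS = {
    "site_a": {"name": "title", "cost": "price", "link": "url"},
    "site_b": {"product": "title", "price_twd": "price", "href": "url"},
}


def normalize(site, record):
    """Rename a site-specific record's keys to the common field names.

    Keys that have no entry in the site's mapping are dropped.
    """
    mapping = FIELD_MAPS[site]
    return {common: record[src] for src, common in mapping.items() if src in record}


def to_xml(items):
    """Serialize (site, record) pairs into a single XML document string."""
    root = ET.Element("items")
    for site, record in items:
        item = ET.SubElement(root, "item", source=site)
        # Sort the fields so the element order is the same for every site.
        for field, value in sorted(normalize(site, record).items()):
            ET.SubElement(item, field).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Made-up records standing in for the output of the extraction step.
extracted = [
    ("site_a", {"name": "Kyoto 3-day tour", "cost": 12000, "link": "http://a.example/1"}),
    ("site_b", {"product": "Kyoto weekend", "price_twd": 9800, "href": "http://b.example/7"}),
]

print(to_xml(extracted))
```

Keeping the per-site mapping in a data table rather than in code means that a change in a target site's structure only requires editing one entry, which is the kind of adjustment the abstract says the graphical user interface is meant to make quick.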
Advisory Committee:
Yue-Sun Kuo - advisor
Ching-Long Yeh - co-chair
Deron Liang - co-chair
Date of Defense: 2012-10-26
Date of Submission: 2012-11-10