Similar Documents
20 similar documents found (search time: 359 ms)
1.
Kleinberg's HITS algorithm (Kleinberg 1999), which was originally developed in a Web context, tries to infer the authoritativeness of a Web page in relation to a specific query, using the structure of a subgraph of the Web graph obtained by considering this specific query. Recent applications of this algorithm in contexts far removed from that of Web searching (Bacchin, Ferro and Melucci 2002; Ng et al. 2001) inspired us to study the algorithm in the abstract, independently of its particular applications, trying to mathematically illuminate its behaviour. In the present paper we detail this theoretical analysis. The original work starts from the definition of a revised and more general version of the algorithm, which includes the classic one as a particular case. We analyse the structure of two particular matrices that are essential to studying the behaviour of the algorithm, and we prove the convergence of the algorithm in the most general case, finding the analytic expression of the vectors to which it converges. We then study the symmetry of the algorithm and prove the equivalence between the existence of symmetry and independence from the order of execution of some basic operations on the initial vectors. Finally, we expound some interesting consequences of our theoretical results. Supported in part by a grant from the Italian National Research Council (CNR) research project Technologies and Services for Enhanced Content Delivery.
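For reference, the classic HITS iteration that the paper generalizes can be sketched as follows; this is a minimal sketch of the standard algorithm on an assumed toy graph, not the revised version studied in the paper:

```python
# Minimal sketch of Kleinberg's classic HITS iteration (power method).
# The 3-node adjacency matrix below is an illustrative assumption.
def hits(adj, iterations=50):
    n = len(adj)
    hubs = [1.0] * n
    auth = [1.0] * n
    for _ in range(iterations):
        # authority score: sum of hub scores of pages linking to the page
        auth = [sum(hubs[i] for i in range(n) if adj[i][j]) for j in range(n)]
        # hub score: sum of authority scores of pages the page links to
        hubs = [sum(auth[j] for j in range(n) if adj[i][j]) for i in range(n)]
        # normalise to unit length so the iteration converges
        for v in (auth, hubs):
            norm = sum(x * x for x in v) ** 0.5 or 1.0
            for k in range(n):
                v[k] /= norm
    return hubs, auth

# Toy graph: 0 -> 2 and 1 -> 2, so node 2 should emerge as the authority.
adj = [[0, 0, 1],
       [0, 0, 1],
       [0, 0, 0]]
hubs, auth = hits(adj)
```

The iteration converges to the principal eigenvectors of the matrices A·Aᵀ (hubs) and Aᵀ·A (authorities), which is the structure the paper's convergence analysis makes precise.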

2.
Abstract. The system for interactive, automatic timetabling was developed within the research work of the Planning Techniques and Declarative Programming department at Fraunhofer FIRST, as an extension of constraint-based programming. The system has been used for timetabling at the Medizinische Fakultät Charité since the summer semester of 1998, and has been continuously developed further since then. The successful deployment of the system showed that the chosen methods and techniques are well suited to handling problems of this kind. The advantages of combined interactive and automatic timetable generation were clearly demonstrated. CR Subject Classification: I.2.8, I.2.3, J.1, K.3.2, D.3.3, D.1.6. Received: 15 March 2003 / Accepted: 9 March 2004, Published online: 1 July 2004

3.
Text Categorization (TC) is the automated assignment of text documents to predefined categories based on document contents. TC has been an application domain for many learning approaches, which have proved effective. Nevertheless, TC poses many challenges to machine learning. In this paper, we suggest, for text categorization, the integration of external WordNet lexical information to supplement training data for a semi-supervised clustering algorithm which can learn from both training and test documents to classify new unseen documents. This algorithm is the Semi-Supervised Fuzzy c-Means (ssFCM). Our experiments use the Reuters-21578 collection and consist of binary classifications for categories selected from the 115 TOPICS classes of the Reuters collection. Using the Vector Space Model, each document is represented by its original feature vector augmented with an external feature vector generated using WordNet. We verify experimentally that the integration of WordNet helps ssFCM improve its performance, effectively addresses the classification of documents into categories with few training documents, and does not interfere with the use of training data.

4.
Text document clustering provides an effective and intuitive navigation mechanism to organize a large amount of retrieval results by grouping documents into a small number of meaningful classes. Many well-known methods of text clustering use a long list of words as the vector space, which is often unsatisfactory for a couple of reasons: first, it keeps the dimensionality of the data very high, and second, it ignores important relationships between terms like synonyms or antonyms. Our unsupervised method solves both problems by using ANNIE and WordNet lexical categories and the WordNet ontology in order to create a well-structured document vector space whose low dimensionality allows common clustering algorithms to perform well. For the clustering step we have chosen the bisecting k-means and the Multipole tree, a modified version of the Antipole tree data structure, for their accuracy and speed, respectively.

5.
Generalized Hamming Distance
Many problems in information retrieval and related fields depend on a reliable measure of the distance or similarity between objects that, most frequently, are represented as vectors. This paper considers vectors of bits. Such data structures implement entities as diverse as bitmaps that indicate the occurrences of terms and bitstrings indicating the presence of edges in images. For such applications, a popular distance measure is the Hamming distance. The value of the Hamming distance for information retrieval applications is limited by the fact that it counts only exact matches, whereas in information retrieval, corresponding bits that are close by can still be considered to be almost identical. We define a Generalized Hamming distance that extends the Hamming concept to give partial credit for near misses, and suggest a dynamic programming algorithm that permits it to be computed efficiently. We envision many uses for such a measure. In this paper we define and prove some basic properties of the Generalized Hamming distance, and illustrate its use in the area of object recognition. We evaluate our implementation in a series of experiments, using autonomous robots to test the measure's effectiveness in relating similar bitstrings.
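The idea of giving partial credit for near misses can be sketched with an edit-distance-style dynamic program over the positions of the 1-bits; the cost parameters below are illustrative assumptions, not the cost model defined in the paper:

```python
# Sketch of a Generalized Hamming Distance between two bit vectors.
# A 1-bit may be "shifted" to a nearby position for a cost proportional
# to the displacement, instead of counting two full mismatches as the
# plain Hamming distance does. shift_cost and indel_cost are assumed
# parameters, not the paper's definitions.
def generalized_hamming(x, y, shift_cost=0.5, indel_cost=1.0):
    a = [i for i, bit in enumerate(x) if bit]  # positions of 1-bits in x
    b = [i for i, bit in enumerate(y) if bit]  # positions of 1-bits in y
    n, m = len(a), len(b)
    # Edit-distance style DP over the two sorted position lists.
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * indel_cost
    for j in range(1, m + 1):
        D[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + indel_cost,  # unmatched 1-bit in x
                D[i][j - 1] + indel_cost,  # unmatched 1-bit in y
                D[i - 1][j - 1] + shift_cost * abs(a[i - 1] - b[j - 1]),
            )
    return D[n][m]

# A single 1-bit shifted by one position: plain Hamming distance is 2,
# but the near miss earns partial credit here.
d = generalized_hamming([1, 0, 0, 0], [0, 1, 0, 0])
```

The DP runs in O(n·m) over the numbers of 1-bits rather than the full vector length, which is where the efficiency the abstract mentions would come from.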

6.
The application of relevance feedback techniques has been shown to improve retrieval performance for a number of information retrieval tasks. This paper explores incremental relevance feedback for ad hoc Japanese text retrieval; examining, separately and in combination, the utility of term reweighting and query expansion using a probabilistic retrieval model. Retrieval performance is evaluated in terms of standard precision-recall measures, and also using number-to-view graphs. Experimental results, on the standard BMIR-J2 Japanese language retrieval collection, show that both term reweighting and query expansion improve retrieval performance. This is reflected in improvements in both precision and recall, but also a reduction in the average number of documents which must be viewed to find a selected number of relevant items. In particular, using a simple simulation of user searching, incremental application of relevance information is shown to lead to progressively improved retrieval performance and an overall reduction in the number of documents that a user must view to find relevant ones.
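Term reweighting from relevance judgements in a probabilistic model is commonly done with the Robertson/Sparck Jones relevance weight; the abstract does not specify the exact weighting used, so the following is an illustrative sketch of that standard formula:

```python
import math

def rsj_weight(r, R, n, N):
    """Robertson/Sparck Jones relevance weight with the usual 0.5
    smoothing. r: judged-relevant docs containing the term, R: judged
    relevant docs so far, n: docs containing the term, N: collection
    size. The toy numbers below are illustrative assumptions."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

# Incremental feedback: before any judgements the weight reduces to an
# IDF-like prior; as relevant documents containing the term accumulate,
# the weight grows.
w0 = rsj_weight(r=0, R=0, n=100, N=10000)  # no feedback yet
w1 = rsj_weight(r=3, R=5, n=100, N=10000)  # 3 of 5 judged-relevant docs contain it
```

Re-evaluating such weights after each batch of judgements is one natural way to realise the incremental application of relevance information that the paper simulates.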

7.
Information Retrieval systems typically sort results by document retrieval status values (RSVs). According to the Probability Ranking Principle, this ranking ensures optimum retrieval quality if the RSVs are monotonically increasing with the probabilities of relevance (as, e.g., for probabilistic IR models). However, advanced applications like filtering or distributed retrieval require estimates of the actual probability of relevance. The relationship between the RSV of a document and its probability of relevance can be described by a normalisation function which maps the retrieval status value onto the probability of relevance (a mapping function). In this paper, we explore the use of linear and logistic mapping functions for different retrieval methods. In a series of upper-bound experiments, we compare the approximation quality of the different mapping functions. We also investigate the effect on the resulting retrieval quality in distributed retrieval (merging only, without resource selection). These experiments show that good estimates of the actual probability of relevance can be achieved, and that the logistic model outperforms the linear one. Retrieval quality for distributed retrieval is only slightly improved by using the logistic function.
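A logistic mapping function of the kind compared here can be sketched as follows; the toy data, learning rate, and fitting-by-gradient-ascent procedure are illustrative assumptions, not the paper's estimation method:

```python
import math

def fit_logistic(rsvs, rels, lr=0.1, epochs=2000):
    """Fit P(relevant | rsv) = 1 / (1 + exp(-(a + b * rsv))) by plain
    gradient ascent on the log-likelihood. Hyper-parameters are
    illustrative assumptions."""
    a, b = 0.0, 0.0
    for _ in range(epochs):
        ga = gb = 0.0
        for x, y in zip(rsvs, rels):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            ga += y - p        # gradient w.r.t. the intercept
            gb += (y - p) * x  # gradient w.r.t. the slope
        a += lr * ga / len(rsvs)
        b += lr * gb / len(rsvs)
    return a, b

# Toy training data: higher RSVs are more often relevant.
rsvs = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
rels = [0,   0,   0,   1,   1,   1]
a, b = fit_logistic(rsvs, rels)
p_low = 1.0 / (1.0 + math.exp(-(a + b * 0.1)))
p_high = 1.0 / (1.0 + math.exp(-(a + b * 0.9)))
```

Once fitted, the same function turns RSVs from different servers into comparable probabilities, which is what makes result merging in distributed retrieval possible.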

8.
Abstract. As bibliographic records and publications are increasingly offered in electronic and networked form, the number and size of the databases offered by academic libraries have grown considerably. In the widespread meta-searches across multiple databases, searches with natural-language search terms are today the lowest common denominator. Because of the well-known shortcomings of Boolean retrieval, however, they frequently produce result sets that are either too narrow, or too long and too unspecific. The Faculty of Technology of Bielefeld University and Bielefeld University Library have developed a search assistant based on fuzzy search logic that decomposes the user's query into sub-queries sent to the external databases and accumulates the partial results in a list ranked by relevance. Search terms can be weighted and combined by fuzzy aggregation operators, which are represented on the user interface by natural-language fuzzy quantifiers such as "as many as possible" and "some". In the intuitively usable simple search, the search parameters are determined automatically by heuristic rules; in an advanced search they can also be set explicitly. The search facilities are complemented by searches for similar documents and suggestion lists for further search terms. We describe the initial situation, the theoretical approach and the user interface, and report on a usage evaluation and a comparative test of the efficiency of the retrieval method. CR Subject Classification: H.3.3, H.3.5. Received: 3 March 2004 / Accepted: 19 August 2004, Published online: 18 October 2004

9.
TIJAH: Embracing IR Methods in XML Databases
This paper discusses our participation in INEX (the Initiative for the Evaluation of XML Retrieval) using the TIJAH XML-IR system. TIJAH's system design follows a standard layered database architecture, carefully separating the conceptual, logical and physical levels. At the conceptual level, we classify the INEX XPath-based query expressions into three different query patterns. For each pattern, we present its mapping into a query execution strategy. The logical layer exploits score region algebra (SRA) as the basis for query processing. We discuss the region operators used to select and manipulate XML document components. The logical algebra expressions are mapped into efficient relational algebra expressions over a physical representation of the XML document collection using the pre-post numbering scheme. The paper concludes with an analysis of experiments performed with the INEX test collection.

10.
Variability is a central concept in software product family development. Variability empowers constructive reuse and facilitates the derivation of different, customer-specific products from the product family. If many customer-specific requirements can be realised by exploiting the product family variability, the reuse achieved is obviously high. If not, the reuse is low. It is thus important that the variability of the product family is adequately considered when eliciting requirements from the customer. In this paper we sketch the challenges of requirements engineering for product family applications. More precisely, we elaborate on the need to communicate the variability of the product family to the customer. We differentiate between variability aspects which are essential for the customer and aspects which are more related to the technical realisation and thus need not be communicated to the customer. Motivated by the successful use of use cases in single-product development, we propose use cases as the communication medium for product family variability. We discuss and illustrate which customer-relevant variability aspects can be represented with use cases, and for which aspects use cases are not suitable. Moreover, we propose extensions to use case diagrams to support an intuitive representation of customer-relevant variability aspects. Received: 14 October 2002, Accepted: 8 January 2003. This work was partially funded by the CAFÉ project From Concept to Application in System Family Engineering; Eureka ! 2023 Programme, ITEA Project ip00004 (BMBF, Förderkennzeichen 01 IS 002 C) and the state of North Rhine-Westphalia. This paper is a significant extension of the paper Modellierung der Variabilität einer Produktfamilie [15].

11.
The Web contains a tremendous amount of information. It is challenging to determine which Web documents are relevant to a user query, and even more challenging to rank them according to their degrees of relevance. In this paper, we propose a probabilistic retrieval model using logistic regression for recognizing multiple-record Web documents against an application ontology, a simple conceptual modeling approach. We notice that many Web documents contain a sequence of chunks of textual information, each of which constitutes a record; this type of document is referred to as a multiple-record document. In our categorization approach, a document is represented by a set of term frequencies of index terms, a density heuristic value, and a grouping heuristic value. We first apply logistic regression analysis to relevance probabilities using the (i) index terms, (ii) density value, and (iii) grouping value of each training document. The relevance probability of each test document is then interpolated from the fitted curves. Contrary to other probabilistic retrieval models, our model makes only a weak independence assumption and is capable of handling important dependent relationships among index terms. In addition, we use logistic regression, instead of linear regression analysis, because the relevance probabilities of training documents are discrete. Using a test set of car-ads and another of obituary Web documents, our probabilistic model achieves an average recall of 100%, precision of 83.3%, and accuracy of 92.5%.

12.
Detection As Multi-Topic Tracking
The topic tracking task from TDT is a variant of information filtering tasks that focuses on event-based topics in streams of broadcast news. In this study, we compare tracking to another TDT task, detection, which has the goal of partitioning all arriving news into topics, regardless of whether the topics are of interest to anyone, and even when a new topic appears that had not been previously anticipated. There are clear relationships between the two tasks (under some assumptions, a perfect tracking system could solve the detection problem), but they are evaluated quite differently. We describe the two tasks and discuss their similarities. We show how viewing detection as a form of multi-topic parallel tracking can illuminate the performance tradeoffs of detection over tracking.

13.
This paper presents an experimental evaluation of several text-based methods for detecting duplication in scanned document databases using uncorrected OCR output. This task is made challenging both by the wide range of degradations printed documents can suffer, and by conflicting interpretations of what it means to be a duplicate. We report results for four sets of experiments exploring various aspects of the problem space. While the techniques studied are generally robust in the face of most types of OCR errors, there are nonetheless important differences which we identify and discuss in detail.

14.
Collaborative Filtering (CF) systems have been studied extensively for more than a decade to confront the "information overload" problem. Nearest-neighbor CF is based either on similarities between users or between items, to form a neighborhood of users or items, respectively. Recent research has tried to combine the two aforementioned approaches to improve effectiveness. Traditional clustering approaches (k-means or hierarchical clustering) have also been used to speed up the recommendation process. In this paper, we use biclustering to disclose this duality between users and items, by grouping them in both dimensions simultaneously. We propose a novel nearest-biclusters algorithm, which uses a new similarity measure that achieves partial matching of users' preferences. We apply nearest-biclusters in combination with two different types of biclustering algorithms (Bimax and xMotif) for constant and coherent biclustering, respectively. Extensive performance evaluation results on three real-life data sets are provided, which show that the proposed method substantially improves the performance of the CF process.
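For context, the user-based nearest-neighbor baseline that biclustering approaches refine can be sketched as follows; the toy rating matrix and cosine measure are illustrative assumptions, not the paper's nearest-biclusters method:

```python
def cosine(u, v):
    """Cosine similarity between two sparse rating dicts (item -> rating)."""
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    du = sum(x * x for x in u.values()) ** 0.5
    dv = sum(x * x for x in v.values()) ** 0.5
    return num / (du * dv) if num else 0.0

def predict(ratings, user, item, k=2):
    """Predict `user`'s rating of `item` as the similarity-weighted mean
    over the k most similar users who rated the item."""
    neighbours = sorted(
        ((cosine(ratings[user], ratings[v]), v)
         for v in ratings if v != user and item in ratings[v]),
        reverse=True)[:k]
    wsum = sum(s for s, _ in neighbours)
    if wsum == 0:
        return None
    return sum(s * ratings[v][item] for s, v in neighbours) / wsum

# Assumed toy user-item rating matrix.
ratings = {"u1": {"a": 5, "b": 3},
           "u2": {"a": 5, "b": 3, "c": 4},
           "u3": {"a": 1, "b": 5, "c": 1}}
p = predict(ratings, "u1", "c")
```

This baseline compares users over all items at once; biclustering instead groups users and items simultaneously, so a neighbour only needs to agree on the subset of items inside a shared bicluster.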

15.
The Archival Bond
This paper presents the concept of the archival bond as formulated by archival science and used in a research project carried out at the University of British Columbia, entitled The Preservation of Electronic Records. Being one of the essential components of the record, the concept of the archival bond is discussed in the context of the traditional diplomatic and archival definitions of records, and its function in demonstrating the reliability and authenticity of records is shown. The most serious challenge with which we are confronted is to make explicit, and preserve intact over the long term, the archival bond between electronic and non-electronic records belonging to the same aggregations.

16.
This article examines the claim that, through its overt symbolic messaging, the Gatineau Preservation Centre, opened by the National Archives of Canada in 1997, embodies a perfect transparency between function and form, with the shape of the place being derived seamlessly from the needs of the archival work done there, and the proof being in the exposure of all the elements to view. It reveals the undercurrents of contending oppositions to this claim, both in the subversive, Mannerist, or impure architectural eccentricities designed into the structure, and in the embodiment of archival narratives whose symbolism is challenged by unacknowledged resistances. While the building is clearly inspired by Modernist and Enlightenment orientations, such as the ambition to preserve unchanged a universal, transcendent historical authenticity, these diverse resistances buried in it are manifested, for example, in the contest of male versus female structural elements, and in the authority of the monumental and exposed set against the seduction of the varied and secret. Most importantly, the absorption of the body both metaphorically and physically into the many disciplines of the place unconsciously calls into question the building's self-image as the epitome of a liberal-humanist and objective-scientific activity; it reflects instead the destabilizing plays and displays of power which are increasingly seen to form the indeterminate field of the archival pursuit.

17.
Locating and Recognizing Text in WWW Images
The explosive growth of the World Wide Web has resulted in a distributed database consisting of hundreds of millions of documents. While existing search engines index a page based on the text that is readily extracted from its HTML encoding, an increasing amount of the information on the Web is embedded in images. This situation presents a new and exciting challenge for the fields of document analysis and information retrieval, as WWW image text is typically rendered in color and at very low spatial resolutions. In this paper, we survey the results of several years of our work in the area. For the problem of locating text in Web images, we describe a procedure based on clustering in color space followed by a connected-components analysis that seems promising. For character recognition, we discuss techniques using polynomial surface fitting and fuzzy n-tuple classifiers. Also presented are the results of several experiments that demonstrate where our methods perform well and where more work needs to be done. We conclude with a discussion of topics for further research.

18.
New Mexico State University's Computing Research Lab has participated in research in all three phases of the US Government's Tipster program. Our work on information retrieval has focused on research and development of multilingual and cross-language approaches to automatic retrieval. The work on automatic systems has been supplemented by additional research into the role of the IR system user in interactive retrieval scenarios: monolingual, multilingual and cross-language. The combined efforts suggest that universal text retrieval, in which a user can find, access and use documents in the face of language differences and information overload, may be possible.

19.
With a central focus on the cultural contexts of Pacific island societies, this essay examines the entanglement of colonial power relations in local recordkeeping practices. These cultural contexts include the on-going exchange between oral and literate cultures, the aftermath of colonial disempowerment and reassertion of indigenous rights and identities, the difficulty of maintaining full archival systems in isolated, resource-poor micro-states, and the driving influence of development theory. The essay opens with a discussion of concepts of exploration and evangelism in cross-cultural analysis as metaphors for archival endeavour. It then explores the cultural exchanges between oral memory and written records, orality, and literacy, as means of keeping evidence and remembering. After discussing the relation of records to processes of political and economic disempowerment, and the reclaiming of rights and identities, it returns to the patterns of archival development in the Pacific region to consider how archives can better integrate into their cultural and political contexts, with the aim of becoming more valued parts of their communities.

20.
Exploiting the Similarity of Non-Matching Terms at Retrieval Time
In classic Information Retrieval systems a relevant document will not be retrieved in response to a query if the document and query representations do not share at least one term. This problem, known as term mismatch, has long been recognised by the Information Retrieval community, and a number of possible solutions have been proposed. Here I present a preliminary investigation into a new class of retrieval models that attempt to solve the term mismatch problem by exploiting complete or partial knowledge of term similarity in the term space. The use of term similarity makes it possible to enhance classic retrieval models by taking non-matching terms into account. The theoretical advantages and drawbacks of these models are presented and compared with those of other models tackling the same problem. A preliminary experimental investigation into the performance gain achieved by exploiting term similarity with the proposed models is presented and discussed.
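One simple way to take non-matching terms into account is to let each query term match its most similar document term; the scoring rule and similarity values below are illustrative assumptions, not the models proposed in the paper:

```python
def soft_match_score(query_terms, doc_terms, sim):
    """Score a document by letting each query term match its most
    similar document term, instead of requiring exact term overlap.
    `sim` maps unordered term pairs to a similarity in [0, 1]."""
    def s(t, u):
        if t == u:
            return 1.0
        return sim.get(frozenset((t, u)), 0.0)
    return sum(max((s(q, d) for d in doc_terms), default=0.0)
               for q in query_terms)

# Assumed term-similarity entries (e.g. from a thesaurus or
# co-occurrence statistics).
sim = {frozenset(("car", "automobile")): 0.9,
       frozenset(("car", "vehicle")): 0.6}

# The document shares no term with the query, yet is not scored zero.
score = soft_match_score(["car"], ["automobile", "engine"], sim)
```

Under exact matching this document would be unretrievable for the query; the soft score degrades gracefully to classic matching when the similarity knowledge is empty.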

