Similar literature
 20 similar documents found; search time 187 ms
1.
In this paper we propose and evaluate the Block Max WAND with Candidate Selection and Preserving Top-K Results algorithm, or BMW-CSP. It is an extension of BMW-CS, a method we proposed previously. Although very efficient, BMW-CS does not guarantee that the top-k results for a given query are preserved. Algorithms that do not preserve the top results may reduce the quality of ranking results in search systems. BMW-CSP extends BMW-CS to ensure that the top-k results have their rankings preserved. In the experiments we performed for computing the top-10 results, the final average time required for processing queries with BMW-CSP was lower than that required by the baselines adopted. For instance, when computing top-10 results, the average time achieved by MBMW, the best multi-tier baseline we found in the literature, was 36.29 ms per query, while the average time achieved by BMW-CSP was 19.64 ms per query. The price paid by BMW-CSP is the extra memory required to store partial scores of documents. As we show in the experiments, this price is not prohibitive and, in cases where it is acceptable, BMW-CSP may constitute an excellent alternative query processing method.

2.
While test collections provide the cornerstone for Cranfield-based evaluation of information retrieval (IR) systems, it has become practically infeasible to rely on traditional pooling techniques to construct test collections at the scale of today’s massive document collections (e.g., ClueWeb12’s 700M+ Webpages). This has motivated a flurry of studies proposing more cost-effective yet reliable IR evaluation methods. In this paper, we propose a new intelligent topic selection method which reduces the number of search topics (and thereby costly human relevance judgments) needed for reliable IR evaluation. To rigorously assess our method, we integrate previously disparate lines of research on intelligent topic selection and deep vs. shallow judging (i.e., whether it is more cost-effective to collect many relevance judgments for a few topics or a few judgments for many topics). While prior work on intelligent topic selection has never been evaluated against shallow judging baselines, prior work on deep vs. shallow judging has largely argued for shallow judging, but under the assumption of random topic selection. We argue that for evaluating any topic selection method, ultimately one must ask whether it is actually useful to select topics, or whether one should simply perform shallow judging over many topics. In seeking a rigorous answer to this over-arching question, we conduct a comprehensive investigation over a set of relevant factors never previously studied together: 1) method of topic selection; 2) the effect of topic familiarity on human judging speed; and 3) how different topic generation processes (requiring varying human effort) impact (i) budget utilization and (ii) the resultant quality of judgments.
Experiments on NIST TREC Robust 2003 and Robust 2004 test collections show that not only can we reliably evaluate IR systems with fewer topics, but also that: 1) when topics are intelligently selected, deep judging is often more cost-effective than shallow judging in evaluation reliability; and 2) topic familiarity and topic generation costs greatly impact the evaluation cost vs. reliability trade-off. Our findings challenge conventional wisdom in showing that deep judging is often preferable to shallow judging when topics are selected intelligently.

3.
A proposed particle swarm classifier has been integrated with the concept of intelligently controlling the search process of PSO to develop an efficient swarm intelligence based classifier, which is called the intelligent particle swarm classifier (IPS-classifier). This classifier is designed to find the decision hyperplanes that classify patterns of different classes in the feature space. An intelligent fuzzy controller is designed to improve the performance and efficiency of the proposed classifier by adapting three important parameters of PSO (inertia weight, cognitive parameter and social parameter). Three pattern recognition problems with different feature vector dimensions are used to demonstrate the effectiveness of the introduced classifier: Iris data classification, Wine data classification and radar target classification from backscattered signals. The experimental results show that the performance of the IPS-classifier is comparable to or better than the k-nearest neighbor (k-NN) and multi-layer perceptron (MLP) classifiers, which are two conventional classifiers.

4.
Most previous work on feature selection has emphasized only reducing the high dimensionality of the feature space. But in cases where many features are highly redundant with each other, we must utilize other means, for example, more complex dependence models such as Bayesian network classifiers. In this paper, we introduce a new information gain and divergence-based feature selection method for statistical machine learning-based text categorization without relying on more complex dependence models. Our feature selection method strives to reduce redundancy between features while maintaining information gain in selecting appropriate features for text categorization. Empirical results are given on a number of datasets, showing that our feature selection method is more effective than Koller and Sahami’s method [Koller, D., & Sahami, M. (1996). Toward optimal feature selection. In Proceedings of ICML-96, 13th international conference on machine learning], which is one of the greedy feature selection methods, and conventional information gain, which is commonly used in feature selection for text categorization. Moreover, our feature selection method sometimes improves conventional machine learning algorithms to the point that they outperform support vector machines, which are known to give the best classification accuracy.
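
For readers unfamiliar with the "conventional information gain" baseline mentioned in this abstract, the following is a minimal sketch of how the information gain of a single binary term feature is usually computed. It does not reproduce the authors' divergence-based redundancy reduction; the function names and the toy data are illustrative assumptions only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_present, labels):
    """Entropy reduction obtained by splitting documents on one binary feature.

    feature_present[i] is truthy when the term occurs in document i
    (hypothetical encoding chosen for this sketch).
    """
    with_f = [y for f, y in zip(feature_present, labels) if f]
    without_f = [y for f, y in zip(feature_present, labels) if not f]
    n = len(labels)
    # Weighted entropy of the two partitions; empty partitions contribute nothing.
    remainder = sum(len(part) / n * entropy(part)
                    for part in (with_f, without_f) if part)
    return entropy(labels) - remainder


# Toy example: a term that separates the classes perfectly vs. one that does not.
labels = ["sports", "sports", "politics", "politics"]
perfect_term = [1, 1, 0, 0]
useless_term = [1, 0, 1, 0]
print(information_gain(perfect_term, labels))
print(information_gain(useless_term, labels))
```

Ranking terms by this score alone ignores redundancy between terms, which is the limitation the abstract says its method addresses.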

5.
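To gain competitive advantage, firms with limited resources and capabilities must choose their innovation mode sensibly. Using evolutionary game theory, this paper examines firms' choice of innovation mode under innovation input subsidies and innovation product subsidies, and on that basis analyzes and compares how the two kinds of subsidy affect the choice. The results show that, for both innovation input subsidies and innovation product subsidies, raising the subsidy level makes firms more likely to choose the disruptive innovation mode and less likely to choose the incremental innovation mode, and vice versa; and that the significance of the two subsidies' effect on the choice of innovation mode differs depending on the gap between product price and marginal cost.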

6.
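To remove redundant and noisy features from network intrusion datasets, reduce the difficulty of data processing, and improve detection performance, an intrusion detection method based on feature selection and support vector machines is proposed. The method uses the proposed feature selection algorithm to select the optimal feature combination, builds a model with a support vector machine as the classifier, and applies it to an intrusion detection system. Simulation results show that the method not only reduces feature dimensionality and shortens training and testing time, but also improves the classification accuracy of intrusion detection.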

7.
Previous studies have repeatedly demonstrated that the relevance of a citing document is related to the number of times the source document is cited. Despite the ease with which electronic documents would permit the incorporation of this information into citation-based document search and retrieval systems, the possibilities of repeated citations remain untapped. Part of this under-utilization may be due to the fact that very little is known regarding the pattern of repeated citations in scholarly literature or how this pattern may vary as a function of journal, academic discipline or self-citation. The current research addresses these unanswered questions in order to facilitate the future incorporation of repeated citation information into document search and retrieval systems. Using data mining of electronic texts, the citation characteristics of nine different journals, covering three different academic fields (economics, computing, and medicine & biology), were characterized. It was found that the frequency (f) with which a reference is cited N or more times within a document is consistent across the sampled journals and academic fields. Self-citation causes an increase in frequency, and this effect becomes more pronounced for large N. The objectivity, automatability, and insensitivity of repeated citations to journal and discipline present powerful opportunities for improving citation-based document search.

8.
Recommender systems are techniques for making personalized recommendations of items to users. In e-commerce sites and online sharing communities, providing high-quality recommendations is an important issue that can help users make effective decisions when selecting a set of items. Collaborative filtering is an important type of recommender system that produces user-specific recommendations of items based on patterns of ratings or usage (e.g. purchases). However, the quality of predicted ratings and the selection of neighbors for users are important problems in recommender systems. Selecting a suitable neighbor set for each user improves the accuracy of rating prediction in the recommendation process. In this paper, a novel social recommendation method is proposed which is based on an adaptive neighbor selection mechanism. In the proposed method, an initial neighbor set for each user is first calculated using a clustering algorithm. In this step, a combination of historical ratings and social information between users is used to form the initial neighbor sets. Then, these neighbor sets are used to predict initial ratings of the unseen items. Moreover, the quality of the initial predicted ratings is evaluated using a reliability measure which is based on the historical ratings and social information between the users. Then, a confidence model is proposed to remove useless users from each user's initial neighbors and form a new adapted neighbor set. Finally, new ratings of the unseen items are predicted using the adapted neighbor sets, and the top-N items of interest are recommended to the active user. Experimental results on three real-world datasets show that the proposed method significantly outperforms several state-of-the-art recommendation methods.
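
The predict, prune, and re-predict loop described in this abstract can be illustrated with a minimal sketch. The abstract does not specify the clustering step, the reliability measure, or the confidence model, so the code below substitutes a plain similarity-weighted average and a simple threshold; every name and number in it is a hypothetical stand-in, not the authors' method.

```python
def predict_rating(neighbors, item):
    """Similarity-weighted average of the neighbors' ratings for `item`.

    `neighbors` is a list of (score, ratings_dict) pairs, where `score` is an
    assumed similarity/confidence value in (0, 1]. Returns None when no
    neighbor has rated the item.
    """
    num = den = 0.0
    for score, ratings in neighbors:
        if item in ratings and score > 0:
            num += score * ratings[item]
            den += score
    return num / den if den else None


def adapt_neighbors(neighbors, min_score):
    """Drop neighbors below a threshold (stand-in for the confidence model)."""
    return [(s, r) for s, r in neighbors if s >= min_score]


# Toy neighborhood for one active user (illustrative values).
initial = [(0.9, {"item1": 5}), (0.1, {"item1": 1})]
first_pass = predict_rating(initial, "item1")
adapted = adapt_neighbors(initial, min_score=0.5)
second_pass = predict_rating(adapted, "item1")
print(first_pass, second_pass)
```

In this toy case the low-scoring neighbor pulls the first prediction down, and removing it changes the second prediction, which is the effect the adaptive neighbor selection is meant to exploit.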

9.
There has been a large literature in the last two decades affirming adaptive DNA sequence evolution between species. The main lines of evidence are from (i) the McDonald-Kreitman (MK) test, which compares divergence and polymorphism data, and (ii) the phylogenetic analysis by maximum likelihood (PAML) test, which analyzes multispecies divergence data. Here, we apply these two tests concurrently to genomic data of Drosophila and Arabidopsis. To our surprise, the >100 genes identified by the two tests do not overlap beyond random expectation. Because the non-concordance could be due to low power leading to high false negatives, we merge every 20–30 genes into a ‘supergene’. At the supergene level, the power of detection is large but the calls still do not overlap. We rule out methodological reasons for the non-concordance. In particular, extensive simulations fail to find scenarios whereby positive selection can only be detected by either MK or PAML, but not both. Since molecular evolution is governed by positive and negative selection concurrently, a fundamental assumption for estimating one of these (say, positive selection) is that the other is constant. However, in a broad survey of primates, birds, Drosophila and Arabidopsis, we found that negative selection rarely stays constant for long in evolution. As a consequence, the variation in negative selection is often misconstrued as a signal of positive selection. In conclusion, MK, PAML and any method that examines genomic sequence evolution must explicitly address the variation in negative selection before estimating positive selection. In a companion study, we propose a possible path forward in two stages: first, by mapping out the changes in negative selection and then using this map to estimate positive selection. For now, the large literature on positive selection between species has to await reassessment.

10.
Recent developments have shown that entity-based models that rely on information from the knowledge graph can improve document retrieval performance. However, given the non-transitive nature of relatedness between entities on the knowledge graph, the use of semantic relatedness measures can lead to topic drift. To address this issue, we propose a relevance-based model for entity selection based on pseudo-relevance feedback, which is then used to systematically expand the input query, leading to improved retrieval performance. We perform our experiments on the widely used TREC Web corpora and empirically show that our proposed approach to entity selection significantly improves ad hoc document retrieval compared to strong baselines. More concretely, the contributions of this work are as follows: (1) We introduce a graphical probability model that captures dependencies between entities within the query and documents. (2) We propose an unsupervised entity selection method based on the graphical model for query entity expansion and then for ad hoc retrieval. (3) We thoroughly evaluate our method and compare it with the state-of-the-art keyword and entity based retrieval methods. We demonstrate that the proposed retrieval model shows improved performance over all the other baselines on ClueWeb09B and ClueWeb12B, two widely used Web corpora, on the [email protected], and [email protected] metrics. We also show that the proposed method is most effective on the difficult queries. In addition, we compare our proposed entity selection with a state-of-the-art entity selection technique within the context of ad hoc retrieval using a basic query expansion method and illustrate that it provides more effective retrieval for all expansion weights and different numbers of expansion entities.

11.
We consider the design of opportunistic spectrum access (OSA) strategies that allow secondary users to search for and exploit spectrum opportunities in unslotted primary systems. The traffic of the primary system in each channel is modeled by a continuous time Markov process. We formulate the joint design of OSA as a constrained discrete-time partially observable Markov decision process (POMDP). A separation principle for the joint design of OSA is established under certain conditions on the probability of false alarm of the spectrum sensor. This result extends the separation principle for OSA in slotted primary systems to unslotted primary systems. Furthermore, we show that the myopic sensing policy has a simple and robust structure under certain conditions on the probability of false alarm.

12.
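Nie Zhen, Wang Huaqiu. 《现代情报》 (Journal of Modern Information), 2012, 32(7): 112-116, 121
This paper takes three measures to improve clustering quality: because the feature attributes in each data dimension affect the clustering result differently, a statistics-based dimension-weighting method is used for feature selection; the pitch adjustment probability of the harmony search algorithm is improved, and the improved harmony search algorithm is combined with fuzzy clustering to find the optimal cluster centers quickly; and clustering quality is tested iteratively for different numbers of centers to obtain the best number of cluster centers. The algorithm is then applied to modeling library readers' interests, identifying the types of books each reader borrows during the library's daily operation; experiments show that it outperforms other algorithms. Such cluster analysis of reader interests can support book recommendation and thereby improve the library's operating efficiency.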

13.
14.
Modern web search engines are expected to return the top-k results efficiently. Although many dynamic index pruning strategies have been proposed for efficient top-k computation, most of them are prone to ignoring some especially important factors in ranking functions, such as term-proximity (the distance relationship between query terms in a document). In our recent work [Zhu, M., Shi, S., Li, M., & Wen, J. (2007). Effective top-k computation in retrieving structured documents with term-proximity support. In Proceedings of 16th CIKM conference (pp. 771–780)], we demonstrated that, when term-proximity is incorporated into ranking functions, most existing index structures and top-k strategies become quite inefficient. To solve this problem, we built the inverted index based on web page structure and proposed the query processing strategies accordingly. The experimental results indicate that the proposed index structures and query processing strategies significantly improve the top-k efficiency. In this paper, we study the possibility of adopting additional techniques to further improve top-k computation efficiency. We propose a Proximity-Probe Heuristic to make our top-k algorithms more efficient. We also test the efficiency of our approaches on various settings (linear or non-linear ranking functions, exact or approximate top-k processing, etc.).

15.
The automated classification of texts into predefined categories has witnessed a booming interest, due to the increased availability of documents in digital form and the ensuing need to organize them. An important problem for text classification is feature selection, whose goals are to improve classification effectiveness, computational efficiency, or both. Due to category imbalance and feature sparsity in social text collections, filter methods may work poorly. In this paper, we perform feature selection in the training process, automatically selecting the best feature subset by learning, from a set of preclassified documents, the characteristics of the categories. We propose a generative probabilistic model that describes categories by distributions and handles the feature selection problem by introducing a binary exclusion/inclusion latent vector, which is updated via an efficient Metropolis search. Real-life examples illustrate the effectiveness of the approach.

16.
Background: The effect of different oxygen transfer coefficients on l-erythrulose production from meso-erythritol by a newly isolated strain, Gluconobacter kondonii CGMCC8391, was investigated. In order to elucidate the effects of the volumetric mass transfer coefficient (kLa) on the fermentations, baffled and unbaffled flask cultures and fed-batch cultures were developed in the present work. Results: With the increase of the kLa value in the fed-batch culture, l-erythrulose concentration, productivity and yield were significantly improved, while cell growth was not the best at high kLa. Thus, a two-stage oxygen supply control strategy was proposed, aimed at achieving high concentration and high productivity of l-erythrulose. During the first 12 h, kLa was controlled at 40.28 h⁻¹ to obtain high cell growth; subsequently, kLa was controlled at 86.31 h⁻¹ to allow for high l-erythrulose accumulation. Conclusions: Under optimal conditions, the l-erythrulose concentration, productivity, yield and DCW reached 207.9 ± 7.78 g/L, 6.50 g/L/h, 0.94 g/g and 2.68 ± 0.17 g/L, respectively. At the end of fermentation, the l-erythrulose concentration and productivity were higher than those in previous similar reports.

17.
A new approach is formulated for the matching polynomial m(G) of a graph G. A matrix A(G) is associated with G. A certain function defined on A(G) yields the matching polynomial of G. This approach leads to a simple characterization of m(G). It also facilitates a technique for constructing graphs with a given matching polynomial.
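
As background for this abstract, the sketch below computes a matching polynomial by brute-force enumeration of matchings. It does not implement the matrix-based approach the abstract refers to, and it assumes one common convention, m(G) = Σ_k (−1)^k m_k x^(n−2k), where m_k is the number of matchings with k edges in a simple graph on n vertices; the paper may use a different definition.

```python
def matching_counts(n, edges):
    """Return [m_0, m_1, ...], where m_k counts matchings with exactly k edges.

    Assumes a simple graph on vertices 0..n-1 (no loops, no repeated edges).
    """
    counts = [0] * (n // 2 + 1)

    def extend(start, used, size):
        counts[size] += 1  # the current edge set is itself a matching
        for i in range(start, len(edges)):
            u, v = edges[i]
            if u not in used and v not in used:
                extend(i + 1, used | {u, v}, size + 1)

    extend(0, frozenset(), 0)
    return counts


def matching_polynomial(n, edges):
    """Coefficients of sum_k (-1)^k m_k x^(n-2k); index = power of x."""
    coeffs = [0] * (n + 1)
    for k, m_k in enumerate(matching_counts(n, edges)):
        coeffs[n - 2 * k] = (-1) ** k * m_k
    return coeffs


# Path on 4 vertices: x^4 - 3x^2 + 1
print(matching_polynomial(4, [(0, 1), (1, 2), (2, 3)]))
# Triangle: x^3 - 3x
print(matching_polynomial(3, [(0, 1), (1, 2), (0, 2)]))
```

The enumeration is exponential in the number of edges, so it is only suitable for checking small examples by hand.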

18.
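Research on a third-party technology source selection method for integrated innovation   Total citations: 2, self-citations: 0, citations by others: 2
Wang Jianjun, Zhang Mi'er. 《科学学研究》 (Studies in Science of Science), 2009, 27(11): 1736-1741
In the process of integrated innovation, firms face the problem of choosing among multiple technologies. Based on an analysis of the characteristics of the technology selection decision problem, and applying the principles of multi-attribute decision analysis, the paper combines the analytic hierarchy process (AHP) and the preference ranking organization method for enrichment evaluation (PROMETHEE) to propose an evaluation method for selecting third-party technology sources, builds an evaluation model, and performs sensitivity analysis on it using geometrical analysis for interactive aid (GAIA). Taking CNC machine tools as an example, the third-party technology source selection model is applied to the technology selection of a CNC system, illustrating that the method is sound and effective.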

19.
Numerous relatively simple physical systems give rise under appropriate circumstances to oscillations which obey the equation y″ + ?(1 + k cos t)y = 0 (Mathieu's equation). These oscillations may be either stable, periodic, or unstable, depending upon parameters of the physical system as expressed by the parameters ? and k in the basic equation. It has been customary to distinguish between the stable and unstable states by diagrams of the type of Fig. 1, from which it is possible to tell whether a given set of values of the parameters ?, k will yield a stable or unstable solution. In this paper are given curves which not only present this information, but in addition give for an important part of the stable state the values of the characteristic exponent μ. The solution of the equation y″ + ?(1 + k cos t)y = 0 depends to a large extent on this exponent, and the availability of values of μ should greatly facilitate the practical application of the equation.
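
The stable/unstable distinction this abstract describes can be checked numerically with standard Floquet theory. The sketch below is not the paper's method: it integrates the equation over one period 2π with a fixed-step Runge-Kutta scheme and inspects the trace of the resulting monodromy matrix, which indicates bounded solutions when its absolute value is below 2. The first parameter of the equation is called `a` here, an assumed name, and the step count is an arbitrary choice.

```python
import math


def monodromy_trace(a, k, steps=4000):
    """Trace of the monodromy matrix of y'' + a*(1 + k*cos t)*y = 0 over [0, 2*pi]."""
    h = 2 * math.pi / steps

    def deriv(t, y, v):
        # First-order system: y' = v, v' = -a*(1 + k*cos t)*y
        return v, -a * (1 + k * math.cos(t)) * y

    def integrate(y, v):
        # Classical fourth-order Runge-Kutta with fixed step h.
        t = 0.0
        for _ in range(steps):
            k1y, k1v = deriv(t, y, v)
            k2y, k2v = deriv(t + h / 2, y + h / 2 * k1y, v + h / 2 * k1v)
            k3y, k3v = deriv(t + h / 2, y + h / 2 * k2y, v + h / 2 * k2v)
            k4y, k4v = deriv(t + h, y + h * k3y, v + h * k3v)
            y += h / 6 * (k1y + 2 * k2y + 2 * k3y + k4y)
            v += h / 6 * (k1v + 2 * k2v + 2 * k3v + k4v)
            t += h
        return y, v

    y1, _ = integrate(1.0, 0.0)  # solution with y(0) = 1, y'(0) = 0
    _, v2 = integrate(0.0, 1.0)  # solution with y(0) = 0, y'(0) = 1
    return y1 + v2


def is_stable(a, k):
    """True when |trace| < 2, i.e. all solutions stay bounded."""
    return abs(monodromy_trace(a, k)) < 2.0


# With k = 0 the equation reduces to y'' + a*y = 0, which gives an easy check:
print(is_stable(0.0625, 0.0))  # harmonic oscillator
print(is_stable(-1.0, 0.0))    # exponentially growing solutions
```

Scanning `is_stable` over a grid of (a, k) values would reproduce a stability diagram of the kind the abstract attributes to its Fig. 1, under the assumptions stated above.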

20.
Breast cancer is one of the leading causes of death among women worldwide. Accurate and early detection of breast cancer can ensure long-term survival for patients. However, traditional classification algorithms usually aim only to maximize classification accuracy, failing to take into consideration the misclassification costs between different categories. Furthermore, the costs associated with missing a cancer case (false negative) are clearly much higher than those of mislabeling a benign one (false positive). To overcome this drawback and further improve the classification accuracy of breast cancer diagnosis, this work proposes a novel intelligent breast cancer diagnosis approach, which employs an information gain directed simulated annealing genetic algorithm wrapper (IGSAGAW) for feature selection. In this process, we rank the features according to the IG algorithm and extract the top m optimal features using the cost sensitive support vector machine (CSSVM) learning algorithm. Our proposed feature selection approach not only helps to reduce the complexity of the SAGASW algorithm and to extract the optimal feature subset effectively to a certain extent, but can also obtain the maximum classification accuracy and minimum misclassification cost. The efficacy of our proposed approach is tested on the Wisconsin Original Breast Cancer (WBC) and Wisconsin Diagnostic Breast Cancer (WDBC) data sets, and the results demonstrate that our proposed hybrid algorithm outperforms other comparison methods. The main objective of this study is to apply our research in real clinical diagnostic systems and thereby assist clinical physicians in making correct and effective decisions in the future. Moreover, our proposed method could also be applied to the diagnosis of other illnesses.
