 |
 |
A Distributed Incremental Nearest Neighbor Algorithm
Searching for non-text data (e.g., images) is mostly done by means of metadata annotations or by extracting the text close to the data. However, supporting real content-based audio-visual search, based on similarity search on features, is significantly more expensive than searching for text. Moreover, the search exhibits linear scalability with respect to the data set size. In this paper, we present a Distributed Incremental Nearest Neighbor algorithm (DINN) for finding the nearest neighbor in an incremental fashion, over data distributed between nodes that are able to perform a local Incremental Nearest Neighbor (local-INN). We prove that our algorithm is optimal with respect to both number of involved nodes and number of local-INN invocations. An implementation of our DINN algorithm, on a real P2P system called MCAN, was used for conducting an extensive experimental evaluation on a real-life dataset.
Back
Large-scale Parallel Web Search Engines (WSEs) needs to adopt a strategy for partitioning the inverted index among a set of parallel server nodes.
In this paper we are interested in devising an effective term-partitioning strategy, according to which the global vocabulary of terms and the associated inverted lists are split into disjoint subsets, and assigned to distinct servers. Due to the workload imbalance caused by the skewed distribution of terms in user queries, finding an effective partitioning strategy is considered a very complex task.
In this paper we first formally introduce Term Partitioning as a new optimization problem. Then we show how the knowledge mined from past WSE query logs can be profitably used to discover good solutions of this problem. Finally, we report many results to show that we are able to effectively reduce both the average number of servers activated per each query, along with the workload imbalance. Experiments are conducted on large query logs of real WSEs.
Back
The similarity search has become a fundamental computational task in many applications. One of the mathematical models of the similarity – the metric space – has drawn attention of many researchers resulting in several sophisticated metric-indexing techniques. An important part of a research in this area is typically a prototype implementation and subsequent experimental evaluation of the proposed data structure. This paper describes an implementation framework called MESSIF that eases the task of building such prototypes. It provides a number of modules from basic storage management to automatic collecting of performance statistics. Due to its open and modular design it is also easy to implement additional modules if necessary. The MESSIF also offers several ready-to-use generic clients that allow to control and test the index structures and also measure its performance.
Back
In this paper we tackle the issues of exploiting the concepts of social networking in processing similarity queries in the environment of a P2P network. The processed similarity queries are laying the base on which the relationships among peers are created. Consequently, the communities encompassing similar data emerge in the network. The architecture of the presented metric social network is formally defined using the acquaintance and friendship relations. Two version of the navigation algorithm are presented and thoroughly experimentally evaluated. Finally, learning ability of the metric social network is presented and discussed.
Back
The aim of this work is a proposal for a new approach concerning the management of the metadata related to the Digital Rights in centralized systems or networks with indexing capabilities for both text and similarity searches, providing the basic infrastructure enabling the private use and the commercial exploitation as well. We present an innovative approach that treats the right management metadata as metric objects, enabling similarity search on IPR attributes between digital items. Moreover we show how the content base similarity search can help both the user to deal with a huge amount of similar items with different licenses and the content providers to detect fake copies or illegal uses.
Back
This paper presents a novel approach to the retrieval of multimedia documents that considers Intellectual Property Rights (IPR) metadata as a multidimensional feature in a metric space.
The approach allows us to perform similarity searches on IPR attributes between digital items and to integrate these searches in a common query by example paradigm.
The aim of this work is the management of the metadata related to the IPR, both in centralized systems and in networks with indexing capabilities, for text and similarity searches, providing the basic infrastructure enabling the private use and the commercial exploitation as well.
Content based similarity search can help both the user to deal with a huge amount of similar items with different licenses and the content providers to detect fake copies or illegal uses, as discussed for the case of images and music.
Back
The aim of this work is a proposal for a new approach concerning the management of the metadata related to the Digital Rights in centralized systems or networks with indexing capabilities for both text and similarity searches, providing the basic infrastructure enabling the private use and the commercial exploitation as well. We present an innovative approach that treats the right management metadata as metric objects, enabling similarity search on IPR attributes between digital items. Moreover we show how the content base similarity search can help both the user to deal with a huge amount of similar items with different licenses and the content providers to detect fake copies or illegal uses.
Back
Peer-to-Peer (P2P) systems are widely used for sharing digital items without structured metadata and in absence of any kind of digital rights management applied to the distributed contents. In this paper we propose the implementation of a prototype application that makes use of a structured P2P system enabling the indexing of complex metadata, used to express digital rights. In this way the media contents are exchanged and played according to the expressed grants. The creation and the consumption of the shared contents can be performed through any MPEG-21 REL compliant software and the application allows indexing and search for both governed and ungoverned contents. The information about the license can be included in the queries and the P2P network can be used to share governed contents (both free and with fee) in a legitimate way. In particular the proposed approach represents a suitable solution for indexing and querying rights complex structures on DHT based networks.
Back
This paper introduces a suitable way for indexing multimedia metadata on a structured Peer-to-Peer overlay network, with special care to the management of rights metadata expressed by MPEG-21. We have selected a suitable subset of MPEG-21 Rights Expression Language elements to be indexed, in order to map governed contents into a ·at space and allow insertion and retrieval of digital contents. Furthermore, we present a distributed application built on a structured overlay network enabling the search of multimedia items using rights related information. Our solution is completely decentralized and can be exploited in any MPEG-21 compliant metadata representation.
Back
This paper presents a novel approach to the retrieval of multimedia documents that considers Intellectual Property Rights (IPR) metadata as a multidimensional feature in a metric space.
The approach allows us to perform similarity searches on IPR attributes between digital items and to integrate these searches in a common query by example paradigm.
The aim of this work is the management of the metadata related to the IPR, both in centralized systems and in networks with indexing capabilities, for text and similarity searches, providing the basic infrastructure enabling the private use and the commercial exploitation as well.
Content based similarity search can help both the user to deal with a huge amount of similar items with different licenses and the content providers to detect fake copies or illegal uses, as discussed for the case of images and music.
Back
Inspired by emerging multi-core computer architectures, in this paper we present MT_CLOSED, a multi-threaded algorithm for frequent closed itemset mining (FCIM). To the best of our knowledge, this is the first FCIM parallel algorithm proposed so far.
We studied how different duplicate checking techniques, typical of FCIM algorithms, my affect this parallelization. We showed that only one of them allows to decompose the global FCIM problem into independent tasks that can be executed in any order, and thus in parallel.
Finally we show how MT_CLOSED efficiently harness modern CPUs. We designed and tested several parallelization paradigms by investigating static/dynamic decomposition and scheduling of tasks, thus showing its scalability w.r.t. to the number of CPUs. We analyzed the cache friendliness of the algorithm. Finally, we provided additional speed-up by introducing SIMD extensions.
Back
In the effort to model theoretically a distributed and heterogeneous environment, as the one where SAPIR project is placed, a key issue is capturing the uncertain nature of the retrieval process and routing mechanisms that characterize a distributed context, like the P2P one. A probabilistic model meets this need. We study explicitly the event space for distributed information retrieval, proposing two different approaches, because the event space definition sets not only the model correctness, but also its capability of describing features of the considered context.
The approach has been applied to the retrieval of music documents. To this end, we proposed a novel methodology for the identification of music works from the recording of a performance, yet independently from the particular performance. The methodology fits with a distributed architecture because of its high interoperability and could be exploited perfectly also in the architecture proposed by the SAPIR project.
A prototype system has been designed and developed in order to test the architecture. Tests have been done figuring out a situation of a peer storing a part of the database composed of 200 recordings, all of them representing tonal Western music. Final results returned the 90% of the analyzed recordings ranked among top 3 positions. These results may show that the identification task becomes feasible also when larger collections of competing audio recordings are used.
Back
Below is a list of links to other sites that sell new and used books, and may also have further information about books you are looking for:
Back
The approach presented in the paper introduces the two-dimensional representation of documents which allows documents to be represented on a two-dimensional Cartesian plane which has proved to be a valid visualization tool for Automated Text Classification for understanding the relationships between categories of textual documents, and to help users to visually audit the classifier and identify suspicious training data. In order to obtain the two coordinates in the case of the Naive Bayes classifier, a reformulation of the equation for the decision of classification has to be written in such a way that each coordinate of a document is the sum of two addends: a variable component P(d | c_i), and a constant component P(c_i). When plotted on the Cartesian plane according to this formulation, the documents that are constantly shifted along the x-axis and the y-axis can be seen. This effect of shifting is more or less evident according to which NB model, Bernoulli or multinomial, is chosen. The same reformulation has been applied in the case of the Binary Independence Retrieval model for Information Retrieval with encouraging results.
Back
In this paper, we tackle the issues of analyzing the structural evolution of the metric social network. The metric social network operates in a P2P environment where peers maintain their own data and the relationships among them are formed on the basis of the processed similarity queries. The evolution is analyzed by traditional social networking tools -- the characteristic path length and the clustering coefficient. Nonetheless, due to the special structure of the metric social network, own designed gauges -- the average overlap and robustness of description coefficients -- are presented to analyze the structure of emerging communities encompassing similar data.
Back
Similarity search for content-based retrieval (where content can be any combination of text, image, audio/video, etc.) has gained importance in recent years, also because of the advantage of ranking the retrieved results according to their proximity to a query. However, to use similarity search in real world applications, we need to tackle the problem of huge volumes of such mixed multimedia data (e.g., coming from Web sites) and the problem of their distribution on multiple co-operating nodes. This is the situation of the Networked Peers for Business, where the distributed nodes (i.e., peers) represent aggregations of SME's with similar activities and the multimedia objects are descriptions/presentations of their products/services extracted from the companies' Web sites. In this paper we approach this problem by considering a scenario of a network of autonomous peers maintaining a local collection of metric objects (i.e., mixed mode multimedia content). This network forms a distributed Peer-to-Peer search engine for similarity search based on the paradigm of Routing Index. Each peer in the network thus maintains both an index of its local resources and a table for every neighbor, summarizing the objects that are reachable from it. The paper presents techniques that aim to make our P2P similarity-based search system viable, trading approximate results for scalable solutions. Results of simulations that use real collections of images are discussed.
Back
In this paper, a probabilistic approach to modeling distributed Information Retrieval centered around the notion of event space is illustrated. This notion is the underlying issue of all probabilistic models and its structure cannot be ignored when a probabilistic model is being constructed. The importance of de·ning the event space is not only related to the correctness of the model, but also to describing different architectures. Three different spaces are proposed in this paper for modeling distributed IR. Each space captures different aspects and dictates a distinct function for ranking by probability of relevance.
Back
The concept of peer-to-peer structures has recently been applied on the problem of large-scale similarity search. This resulted in systems where the computational load of the peers is of a high importance. Since no current load-balancing technique is designed for structures of this kind, we propose LOBS -- a general system for load-balancing in peer-to-peer structures with time-consuming searching. LOBS is based on the following principles: measuring the computational load, separation of the logical and the physical level of the system, and detailed analysis of the load source to exploit either data relocation or replication.
This work contains results of experiments we conducted using a prototype implementation of LOBS. In these trials, we used a real-life dataset and we varied the number of peers and the distribution of the query traffic in the system. The results show that LOBS is able to cope with any query-distribution and that it improves both the utilization of resources and the performance of the query processing. The costs of balancing are reasonable and are very small if there is time to adapt to a query-traffic. The behaviour of LOBS is independent of the network size.
Back
Large collections of multimedia documents need to be provided with new tools for easy access and retrieval. In this paper we present a prototype system, available through a Web interface, that can identify an unknown recording of classical music using a database of reference recordings. An important characteristic is that the identification can be carried out also when the query is the recording of a different performance of the work stored in the database, possibly played by different instruments and recorded with background noise.
Back
The advent of Peer-to-peer (P2P) networks on the scene of the search engines poses new challenges for Distributed Information Retrieval (IR) and Digital Library (DL) Systems. A software architecture called SPINA (Superimposed Peer Infrastructure for iNformation Access) is described in this paper.
Back
Current information systems are required to process complex digital objects, which are typically characterized by multiple descriptors. Since the values of many descriptors belong to non-sortable domains, they are effectively comparable only by a sort of similarity. Moreover, the scalability is very important in the current digital-explosion age. Therefore, we propose a distributed extension of the well-known threshold algorithm for peer-to-peer paradigm. The technique allows to answer similarity queries that combine multiple similarity measures and due to its peer-to-peer nature it is highly scalable. We also explore possibilities of approximate evaluation strategies, where some relevant results can be lost in favor of increasing the efficiency by order of magnitude. To reveal the strengths and weaknesses of our approach we have experimented with a 1.6 million image database from Flicker comparing the content of the images by five similarity measures from the MPEG-7 standard. To the best of our knowledge, the experience with such a huge real-life dataset is quite unique.
Back
Due to the exponential growth of digital data and its complexity, we need a technique which allows us to search such collections efficiently. A suitable solution seems to be based on the peer-to-peer (P2P) network paradigm and the metric-space model of similarity. During the building phase of the distributed structure, the peers often split as new peers join the network. During a peer split, the local data is halved and one half is migrated to the new peer. In this paper, we study the problem of efficient splits of metric data locally organized by an M-tree and we propose a novel algorithm for speeding the splits up. In particular, we focus on the metric-based structured P2P network called the M-Chord. In experimental evaluation, we compare the proposed algorithm with several straightforward solutions on a real network organizing 10 million images. Our algorithm provides a significant performance boost.
Back
Exploiting the concepts of social networking represents a novel approach to the approximate similarity query processing. We present an unstructured and dynamic P2P environment in which a metric social network is built. Social communities of peers giving similar results to specific queries are established and such ties are exploited for answering future queries. Based on the universal law of generalization, a new query forwarding algorithm is introduced and evaluated. The same principle is used to manage query histories of individual peers with the possibility to tune the tradeoff between the extent of the history and the level of the query-answer approximation. All proposed algorithms are tested on real data and medium-sized P2P networks consisting of tens of computers.
Back
The digital images became a commodity which is searched on the Web as ordinarily as the web pages. However, web-scale engines are using only the text-search technology on images' annotations while the true content-based similarity engines do not seem to be ready for such scales. In this paper, we show a way which opens doors towards a Web-scale image similarity search. We present a very flexible system based on the metric space model and on the peer-to-peer paradigm; it uses structures M-Chord and M-Tree as its fundamental components and measures the image similarity by a combination of MPEG-7 features. The system has been implemented including a graphical interface for online demonstrations and it currently indexes 10 million images downloaded from the Web. We present performance experiments, which focus on the approximate search and the results show that the system provides high quality answers in a fraction of the response times of the precise answers.
Retrieved from: http://sapir.tid.es/sapirwiki/index.php/Web-scale_System_for_Image_Similarity_Search:_When_the_Dreams_Are_Coming_True.
Back
The increasing amount of digital audio-visual content accessible today calls for scalable solutions for content based search. The state of the art solutions reveal linear scalability in respect to the collection size due to the large number of distance computations needed for comparing low level audio-visual features. As a result, search in large audio-visual collections is limited to associated metadata only done by text retrieval methods that are proven to be scalable.
Search in audio-visual content can be generalized to search in metric spaces by assuming some distance function on low-level features. In this paper we propose a framework for efficient indexing and retrieval of metric spaces by extending classical techniques taken from the textual IR methods such as lexicon, posting lists and boolean constraints- thus enable scalable search in any metric space and in particular in audio-visual content. We show the efficiency and effectiveness of our method by experiments on a large image collection.
Back
The state of the art of searching for non-text data (e.g., images) is to use extracted metadata annotations or text, which might be available as a related information. However, supporting real content-based audio-visual search, based on similarity search on features, is significantly more expensive than searching for text. Moreover, such search exhibits linear scalability with respect to the data set size, so parallel query execution is needed.
In this paper, we present a Distributed Incremental Nearest Neighbor algorithm (DINN) for finding closest objects in an incremental fashion over data distributed among computer nodes, each able to perform its local Incremental Nearest Neighbor local-INN) algorithm. We prove that our algorithm is optimum with respect to both the number of involved nodes and the number of local-INN invocations. An implementation of our DINN algorithm, on a real P2P system called MCAN, was used for conducting an extensive experimental evaluation on a real-life dataset.
The proposed algorithm is being used in two running projects: SAPIR and NeP4B.
Back
|
|
 |
| |
|