PARMETIS: Parallel Graph Partitioning and Sparse Matrix Ordering Library
Karypis, George; Schloegel, Kirk; Kumar, Vipin. (1997). PARMETIS: Parallel Graph Partitioning and Sparse Matrix Ordering Library. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/215345
Language and Library Support for Climate Data Applications
Associated research group: Minnesota Extensible Language Tools
Van Wyk, Eric; Kumar, Vipin; Steinbach, Michael; Boriah, Shyam; Choudhary, Alok. (2009). Language and Library Support for Climate Data Applications. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/217360
METIS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices
METIS is copyrighted by the Regents of the University of Minnesota. This work was supported by IST/BMDO through Army Research Office contract DA/DAAH04-93-G-0080, and by Army High Performance Computing Research Center under the auspices of the Department of the Army, Army Research Laboratory cooperative agreement number DAAH04-95-2-0003/contract number DAAH04-95-C-0008, the content of which does not necessarily reflect the position or the policy of the government, and no official endorsement should be inferred. Access to computing facilities was provided by Minnesota Supercomputer Institute, Cray Research Inc, and by the Pittsburgh Supercomputing Center.
Karypis, George; Kumar, Vipin. (1997). METIS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/215346
A Computationally Efficient and Statistically Powerful Framework for Searching High-order Epistasis with Systematic Pruning and Gene-set Constraints
This paper has not yet been submitted.
Fang, Gang; Haznadar, Majda; Wang, Wen; Steinbach, Michael; Van Ness, Brian; Kumar, Vipin. (2010). A Computationally Efficient and Statistically Powerful Framework for Searching High-order Epistasis with Systematic Pruning and Gene-set Constraints. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/215831
Similarity Measures for Categorical Data--A Comparative Study
Measuring similarity or distance between two entities is a key step for several data mining and knowledge discovery tasks. The notion of similarity for continuous data is relatively well understood, but for categorical data, the similarity computation is not straightforward. Several data-driven similarity measures have been proposed in the literature to compute the similarity between two categorical data instances, but their relative performance has not been evaluated. In this paper we study the performance of a variety of similarity measures in the context of a specific data mining task: outlier detection. Results on a variety of data sets show that while no one measure dominates the others for all types of problems, some measures perform consistently well.
Chandola, Varun; Boriah, Shyam; Kumar, Vipin. (2007). Similarity Measures for Categorical Data--A Comparative Study. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/215736
Design of Scalable Parallel Classification Algorithms for Mining Large Datasets
In this paper, we present ScalParC (Scalable Parallel Classifier), a new parallel formulation of a decision tree based classification process. Like other state-of-the-art decision tree classifiers such as SPRINT, ScalParC is suited for handling large datasets. We show that the existing parallel formulation of SPRINT is unscalable, whereas ScalParC is shown to be scalable in both runtime and memory requirements. We present the experimental results of classifying up to 6.4 million records on up to 128 processors of a Cray T3D, in order to demonstrate the scalable behavior of ScalParC. A key component of ScalParC is the parallel hash table. The proposed parallel hashing paradigm can be used to parallelize other algorithms that require many concurrent updates to a large hash table.
Joshi, Mahesh; Karypis, George; Kumar, Vipin. (1998). Design of Scalable Parallel Classification Algorithms for Mining Large Datasets. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/215372
Min-Apriori: An Algorithm for Finding Association Rules in Data with Continuous Attributes
This work was supported by NSF ASC-9634719, by Army Research Office contract DA/DAAH04-95-1-0538, by Army High Performance Computing Research Center cooperative agreement number DAAH04-95-2-0003/contract number DAAH04-95-C-0008, the content of which does not necessarily reflect the position or the policy of the government, and no official endorsement should be inferred. Additional support was provided by the IBM Partnership Award, and by the IBM SUR equipment grant. Access to computing facilities was provided by AHPCRC, Minnesota Supercomputer Institute.
Han, Eui-Hong; Karypis, George; Kumar, Vipin. (1997). Min-Apriori: An Algorithm for Finding Association Rules in Data with Continuous Attributes. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/215354
Summarization - Compressing Data into an Informative Representation Report
Summarization is an important problem in many domains involving large datasets. Summarization can essentially be viewed as the transformation of data into a concise yet meaningful representation that can be used for efficient storage or manual inspection. In this paper, we formulate the problem of summarizing a large dataset of transactions as an optimization problem involving two objective functions: compaction gain and information loss. We propose metrics to characterize the output of any summarization algorithm. We propose data mining techniques to obtain a summary for a given set of transactions while optimizing these two objective functions. We illustrate one application of summarization in the field of network data, where we show how our technique can be effectively used to summarize network traffic into a meaningful representation. We first present a direct application of a standard clustering scheme to generate summaries. We then show how this can be significantly improved by using a multi-step approach that generates candidate summaries for a dataset using association analysis and then chooses a subset of these candidates as the summary with the desired compaction and information content. We present results of experiments conducted on real and artificial datasets to demonstrate the effectiveness of our techniques.
Chandola, Varun; Kumar, Vipin. (2005). Summarization - Compressing Data into an Informative Representation Report. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/215665
Supplement for "Contextual Time Series Change Detection"
Time series data are common in a variety of fields ranging from economics to medicine and manufacturing. As a result, time series analysis and modeling has become an active research area in statistics and data mining. In this paper, we focus on a type of change we call contextual time series change (CTC) and propose a novel two-stage algorithm to address it. In contrast to traditional change detection methods, which consider each time series separately, CTC is defined as a change relative to the behavior of a group of related time series. As a result, our proposed method is able to identify novel types of changes not found by other algorithms. We demonstrate the unique capabilities of our approach with several case studies on real-world datasets from the financial and Earth science domains.
Chen, Xi; Steinhaeuser, Karsten; Boriah, Shyam; Chatterjee, Snigdhansu; Kumar, Vipin. (2013). Supplement for "Contextual Time Series Change Detection". Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/215905
Characterizing Pattern based Clustering
Recently, there has been considerable interest in using association patterns for clustering. Although several interesting algorithms have been developed, further investigation is needed to characterize (1) the benefits of using association patterns and (2) the most effective way of using them for clustering. To that end, we present a new clustering technique, bisecting K-means Clustering with pAttern Preservation (K-CAP), which exploits key properties of the hyperclique association pattern and bisecting k-means. Experimental results on document data show that, in terms of entropy, K-CAP can perform substantially better than the standard bisecting k-means algorithm when data sets contain clusters of widely different sizes--the typical situation. Furthermore, because hyperclique patterns can be found much more efficiently than other types of association patterns, K-CAP retains the appealing computational efficiency of bisecting k-means.
Xiong, Hui; Steinbach, Michael; Ruslim, Arifin; Kumar, Vipin. (2005). Characterizing Pattern based Clustering. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/215656
