quitefastmst/0000755000176200001440000000000015143132623013003 5ustar liggesusersquitefastmst/MD50000644000176200001440000000220415143132623013311 0ustar liggesusers6a0f461df7c44ba467916fd78502bdf2 *DESCRIPTION 36436aa214ec1b0671230f662d9fd4f4 *NAMESPACE ec7654cb494c55db7d5017b5de79adaf *NEWS 57f68178165c365a6ce0022655279d6e *R/RcppExports.R ea1e911ba199ee7b6627014842e3e90d *R/quitefastmst-package.R 1a54d5d28cb5d25c905c43080cb2f16e *man/knn_euclid.Rd 397c4c63147780eb0eacdb7a87117429 *man/mst_euclid.Rd 29a33a067365c4e0bcf5695a5acc2867 *man/omp.Rd 9199ddcfd7e1c91e3f9999dbc2b13a50 *man/quitefastmst-package.Rd 19ade22892ba8ed21ba249417e82b524 *src/Makevars 1174afd381a33fe61a5fbd0d960e59f8 *src/RcppExports.cpp 3c31ae4b231fb0fd6b116298a1248453 *src/RcppFastmst.cpp c6d414bb815350af78b1a7b98c7ab13f *src/c_common.h 2207976418a19c4d735f536cc4e4ea92 *src/c_disjoint_sets.h e0aa8bc5f3b63ca64dbdc8a3c2f69765 *src/c_fastmst.h a8870e728738e69a4f9a7bb9ccbb1181 *src/c_kdtree.h 2e5b3c9b5abd06e9f43418656af6ff21 *src/c_kdtree_boruvka.h a520f091ba579769e2a14352324129e0 *src/c_mst_triple.h 651b351b00cf6561379e93d7bcf3e7c9 *src/knn_euclid_brute.cpp 92d3e1f01670b55900d87cdfe547f785 *src/knn_euclid_kdtree.cpp 110685365ff6a7feb32cb98219fc226f *src/mst_euclid_brute.cpp d38a4823740adadece869851a17a4fa2 *src/mst_euclid_kdtree.cpp quitefastmst/R/0000755000176200001440000000000015143126457013214 5ustar liggesusersquitefastmst/R/RcppExports.R0000644000176200001440000003554215143126457015641 0ustar liggesusers# Generated by using Rcpp::compileAttributes() -> do not edit by hand # Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393 #' @title #' Get or Set the Number of Threads #' #' @description #' These functions get or set the maximal number of OpenMP threads that #' can be used by \code{\link{knn_euclid}} and \code{\link{mst_euclid}}, #' amongst others. 
#' #' @param n_threads maximal number of threads to use #' #' @return #' \code{omp_get_max_threads} returns the maximal number #' of threads that will be used during the next call to a parallelised #' function, not the maximal number of threads possibly available. #' If there is no built-in support for OpenMP, 1 is always returned. #' #' For \code{omp_set_num_threads}, the previous value of \code{max_threads} #' is returned. #' #' #' @rdname omp #' @encoding UTF-8 #' @export omp_set_num_threads <- function(n_threads) { .Call(`_quitefastmst_Romp_set_num_threads`, n_threads) } #' @rdname omp #' @export omp_get_max_threads <- function() { .Call(`_quitefastmst_Romp_get_max_threads`) } #' @title Euclidean Nearest Neighbours #' #' @description #' If \code{Y} is \code{NULL}, then the function determines the first \code{k} #' nearest neighbours of each point in \code{X} with respect #' to the Euclidean distance. It is assumed that each query point is #' not its own neighbour. #' #' Otherwise, for each point in \code{Y}, this function determines the \code{k} #' nearest points thereto from \code{X}. #' #' @details #' The implemented algorithms, see the \code{algorithm} parameter, assume #' that \eqn{k} is rather small. #' #' Our implementation of K-d trees (Bentley, 1975) has been quite optimised; #' amongst others, it has good locality of reference (at the cost of making #' a copy of the input dataset), features the sliding #' midpoint (midrange) rule suggested by Maneewongvatana and Mount (1999), #' node pruning strategies inspired by some ideas from (Sample et al., 2001), #' and a couple of further tuneups proposed by the current author. #' Still, it is well-known that K-d trees perform well only in spaces of low #' intrinsic dimensionality. Thus, due to the so-called curse of #' dimensionality, for high \code{d}, the brute-force algorithm is recommended. 
#' #' The number of threads is controlled via the \code{OMP_NUM_THREADS} #' environment variable or via the \code{\link{omp_set_num_threads}} function #' at runtime. For best speed, consider building the package #' from sources using, e.g., \code{-O3 -march=native} compiler flags. #' #' #' @references #' J.L. Bentley, Multidimensional binary search trees used for associative #' searching, \emph{Communications of the ACM} 18(9), 509–517, 1975, #' \doi{10.1145/361002.361007} #' #' S. Maneewongvatana, D.M. Mount, It's okay to be skinny, if your friends #' are fat, \emph{4th CGC Workshop on Computational Geometry}, 1999 #' #' N. Sample, M. Haines, M. Arnold, T. Purcell, Optimizing search #' strategies in K-d Trees, \emph{5th WSES/IEEE Conf. on Circuits, Systems, #' Communications & Computers} (CSCC'01), 2001 #' #' #' @param X the "database"; a matrix of shape \eqn{n\times d} #' @param k requested number of nearest neighbours #' @param Y the "query points"; \code{NULL} or a matrix of shape \eqn{m\times d}; #' note that setting \code{Y=X}, contrary to \code{NULL}, #' will include the query points themselves amongst their own neighbours #' @param algorithm \code{"auto"}, \code{"kd_tree"} or \code{"brute"}; #' K-d trees can be used for \code{d} between 2 and 20 only; #' \code{"auto"} selects \code{"kd_tree"} in low-dimensional spaces #' @param max_leaf_size maximal number of points in the K-d tree leaves; #' smaller leaves use more memory, yet are not necessarily faster; #' use \code{0} to select the default value, currently set to 32 #' @param squared whether the output \code{nn.dist} should be based on #' the squared Euclidean distance #' @param verbose whether to print diagnostic messages #' #' #' @return #' A list with two elements, \code{nn.index} and \code{nn.dist}, is returned. #' #' \code{nn.dist} and \code{nn.index} have shape \eqn{n\times k} #' or \eqn{m\times k}, depending whether \code{Y} is given. 
#' #' \code{nn.index[i,j]} is the index (between \eqn{1} and \eqn{n}) #' of the \eqn{j}-th nearest neighbour of \eqn{i}. #' #' \code{nn.dist[i,j]} gives the weight of the edge \code{{i, nn.index[i,j]}}, #' i.e., the distance between the \eqn{i}-th point and its \eqn{j}-th #' nearest neighbour, \eqn{j=1,\dots,k}. #' \code{nn.dist[i,]} is sorted nondecreasingly for all \eqn{i}. #' #' #' #' @examples #' library("datasets") #' data("iris") #' X <- jitter(as.matrix(iris[1:2])) # some data #' neighbours <- knn_euclid(X, 1) # 1-NNs of each point #' plot(X, asp=1, las=1) #' segments(X[,1], X[,2], X[neighbours$nn.index,1], X[neighbours$nn.index,2]) #' #' knn_euclid(X, 5, matrix(c(6, 4), nrow=1)) # five closest points to (6, 4) #' #' #' @seealso \code{\link{mst_euclid}} #' #' @rdname knn_euclid #' @encoding UTF-8 #' @export knn_euclid <- function(X, k = 1L, Y = NULL, algorithm = "auto", max_leaf_size = 0L, squared = FALSE, verbose = FALSE) { .Call(`_quitefastmst_knn_euclid`, X, k, Y, algorithm, max_leaf_size, squared, verbose) } #' @title Euclidean and Mutual Reachability Minimum Spanning Trees #' #' @description #' The function determines the/a(*) minimum spanning tree (MST) of a set #' of \eqn{n} points, i.e., an acyclic undirected connected graph whose #' vertices represent the points, and edges are weighted by the distances #' between point pairs and have minimal total weight. #' #' MSTs have many uses in, amongst others, topological data analysis #' (clustering, density estimation, dimensionality reduction, #' outlier detection, etc.). #' #' In clustering and density estimation, the parameter \code{M} plays the role #' of a smoothing factor; for discussion, see (Campello et al., 2015) #' and the references therein. #' #' For \eqn{M\leq 1}, we get a spanning tree that minimises the sum of #' Euclidean distances between the points, i.e., the classic Euclidean minimum #' spanning tree (EMST). 
If \eqn{M=1}, the function additionally returns #' the distance to each point's nearest neighbour. #' #' If \eqn{M>1}, the spanning tree is the smallest with respect to #' the degree-\eqn{M} mutual reachability distance (Campello et al., 2013) given by #' \eqn{d_M(i, j)=\max\{ c_M(i), c_M(j), d(i, j)\}}, where \eqn{d(i,j)} #' is the standard Euclidean distance between the \eqn{i}-th and the \eqn{j}-th point, #' and \eqn{c_M(i)} is the \eqn{i}-th \eqn{M}-core distance defined as the distance #' between the \eqn{i}-th point and its \eqn{M}-th nearest neighbour #' (not including the query point itself). #' #' Note that (Campello et al., 2013) defines the core distance as the #' distance to the \eqn{(M-1)}-th nearest neighbour (or the \eqn{M}-th one, #' but including self). #' #' #' @details #' (*) Note that if there are many pairs of equidistant points, #' there can be many minimum spanning trees. In particular, it is likely #' that there are point pairs with the same mutual reachability distances. #' #' To make the definition unambiguous, the \code{mutreach_ties} argument #' indicates the preference towards connecting to farther/closer points with #' respect to the original metric, or having smaller/larger core distances #' in cases of tied distances; see (Gagolewski, 2026). Empirically, #' \code{mutreach_ties="dcore_min"} and \code{mutreach_leaves="reconnect_dcore_min"} #' lead to MSTs with more leaves and hubs. #' #' The brute force method always resolves all ties, whilst, for efficiency, #' the K-d tree-based algorithms use this adjustment only for the first \eqn{M} #' nearest neighbours, so the resulting trees might be slightly different. #' #' The implemented algorithms, see the \code{algorithm} parameter, assume #' that \eqn{M} is rather small. 
#' #' Our implementation of K-d trees (Bentley, 1975) has been quite optimised; #' amongst others, it has good locality of reference (at the cost of making #' a copy of the input dataset), features the sliding #' midpoint (midrange) rule suggested by Maneewongvatana and Mount (1999), #' node pruning strategies inspired by some ideas from (Sample et al., 2001), #' and a couple of further tuneups proposed by the current author. #' #' The "single-tree" version of the Borůvka algorithm is parallelised: #' in every iteration, it seeks each point's nearest "alien", #' i.e., the nearest point thereto from another cluster. #' The "dual-tree" Borůvka version of the algorithm is, in principle, based #' on (March et al., 2010). As far as our implementation is concerned, #' the dual-tree approach is often only faster in 2- and 3-dimensional spaces, #' for \eqn{M\leq 1}, and in a single-threaded setting. For another #' (approximate) adaptation of the dual-tree algorithm to mutual #' reachability distances, see (McInnes and Healy, 2017). #' #' The "sesqui-tree" variant (by the current author) is a mixture of the two #' approaches: it compares leaves against the full tree and can be run #' in parallel. It is usually faster than the single- and dual-tree methods #' in very low dimensional spaces and usually not much slower than #' the single-tree variant otherwise. #' #' Nevertheless, it is well-known that K-d trees perform well only in spaces #' of low intrinsic dimensionality (the "curse"). For high \eqn{d}, #' the "brute-force" algorithm is recommended. Here, we provide a #' parallelised (see Olson, 1995) version of the Jarník (1930) (a.k.a. #' Prim, 1957) algorithm, where the distances are computed #' on the fly (only once for \eqn{M\leq 1}). #' #' The number of threads used is controlled via the \code{OMP_NUM_THREADS} #' environment variable or via the \code{\link{omp_set_num_threads}} function #' at runtime. 
For best speed, consider building the package #' from sources using, e.g., \code{-O3 -march=native} compiler flags. #' #' #' @references #' V. Jarník, O jistém problému minimálním, #' \emph{Práce Moravské Přírodovědecké Společnosti} 6, 1930, 57–63. #' #' C.F. Olson, Parallel algorithms for hierarchical clustering, #' \emph{Parallel Computing} 21(8), 1995, 1313–1325. #' #' R. Prim, Shortest connection networks and some generalizations, #' \emph{The Bell System Technical Journal} 36(6), 1957, 1389–1401. #' #' O. Borůvka, O jistém problému minimálním, \emph{Práce Moravské #' Přírodovědecké Společnosti} 3, 1926, 37–58. #' #' W.B. March, P. Ram, A.G. Gray, Fast Euclidean minimum spanning #' tree: Algorithm, analysis, and applications, \emph{Proc. 16th ACM SIGKDD #' Intl. Conf. Knowledge Discovery and Data Mining (KDD '10)}, 2010, 603–612. #' #' J.L. Bentley, Multidimensional binary search trees used for associative #' searching, \emph{Communications of the ACM} 18(9), 509–517, 1975, #' \doi{10.1145/361002.361007} #' #' S. Maneewongvatana, D.M. Mount, It's okay to be skinny, if your friends #' are fat, \emph{4th CGC Workshop on Computational Geometry}, 1999 #' #' N. Sample, M. Haines, M. Arnold, T. Purcell, Optimizing search #' strategies in K-d Trees, \emph{5th WSES/IEEE Conf. on Circuits, Systems, #' Communications & Computers} (CSCC'01), 2001 #' #' R.J.G.B. Campello, D. Moulavi, J. Sander, Density-based clustering based #' on hierarchical density estimates, \emph{Lecture Notes in Computer Science} #' 7819, 2013, 160–172. \doi{10.1007/978-3-642-37456-2_14} #' #' R.J.G.B. Campello, D. Moulavi, A. Zimek, J. Sander, Hierarchical #' density estimates for data clustering, visualization, and outlier detection, #' \emph{ACM Transactions on Knowledge Discovery from Data (TKDD)} 10(1), #' 2015, 1–51, \doi{10.1145/2733381} #' #' L. McInnes, J. Healy, Accelerated hierarchical density-based #' clustering, \emph{IEEE Intl. Conf. 
Data Mining Workshops (ICDMW)}, 2017, #' 33–42, \doi{10.1109/ICDMW.2017.12} #' #' M. Gagolewski, quitefastmst, in preparation, 2026, TODO #' #' #' @param X the "database"; a matrix of shape \eqn{n\times d} #' @param M the smoothing factor a.k.a. the degree of the mutual reachability #' distance; \eqn{M\leq 1} gives the ordinary Euclidean distance #' @param algorithm \code{"auto"}, \code{"single_kd_tree"}, #' \code{"sesqui_kd_tree"}, \code{"dual_kd_tree"}, or \code{"brute"}; #' K-d trees can be used for \eqn{d} between 2 and 20 only; #' \code{"auto"} selects \code{"sesqui_kd_tree"} for \eqn{d\leq 20}; #' \code{"brute"} is used otherwise #' @param max_leaf_size maximal number of points in the K-d tree leaves; #' smaller leaves use more memory, yet are not necessarily faster; #' use \code{0} to select the default value, currently set to 32 for the #' single-tree and sesqui-tree and 8 for the dual-tree Borůvka algorithm #' @param first_pass_max_brute_size minimal number of points in a node to #' treat it as a leaf (unless it actually is a leaf) in the first #' iteration of the algorithm; use \code{0} to select the default value, #' currently set to 32 #' @param mutreach_ties adjustment for mutual reachability distance ambiguity #' (for \eqn{M>1}); one of \code{"dcore_min"}, \code{"dist_max"}, #' \code{"dist_min"} (default), or \code{"dcore_max"} #' @param mutreach_leaves a way to postprocess the leaves of the computed tree; #' one of \code{"keep"} (default: do nothing), #' or \code{"reconnect_dcore_min"} (try reconnecting leaves to #' inner vertices which have them amongst their \eqn{M} nearest neighbours; #' prefer vertices of the smallest core distance) #' @param verbose whether to print diagnostic messages #' #' #' @return #' A list with two (\eqn{M=0}) or four (\eqn{M>0}) elements, \code{mst.index} and #' \code{mst.dist}, and additionally \code{nn.index} and \code{nn.dist}. 
#' #' \code{mst.index} is a matrix with \eqn{n-1} rows and \eqn{2} columns, #' whose rows define the tree edges. #' #' \code{mst.dist} is a vector of length #' \eqn{n-1} giving the weights of the corresponding edges. #' #' The tree edges are ordered with respect to weights nondecreasingly, and then by #' the indexes (lexicographic ordering of the \code{(weight, index1, index2)} #' triples). For each \code{i}, it holds \code{mst_ind[i,1] # # # # # # This program is free software: you can redistribute it and/or modify # # it under the terms of the GNU Affero General Public License # # Version 3, 19 November 2007, published by the Free Software Foundation. # # This program is distributed in the hope that it will be useful, # # but WITHOUT ANY WARRANTY; without even the implied warranty of # # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # # GNU Affero General Public License Version 3 for more details. # # You should have received a copy of the License along with this program. # # If this is not the case, refer to . # # # # ############################################################################ # #' @title Euclidean and Mutual Reachability Minimum Spanning Trees #' #' @description #' See \code{\link{mst_euclid}()} for more details. #' #' @details #' For best speed, consider building the package from sources #' using, e.g., \code{-O3 -march=native} compiler flags and with OpenMP #' support on. #' #' @useDynLib quitefastmst, .registration=TRUE #' @importFrom Rcpp evalCpp #' @encoding UTF-8 #' @keywords internal "_PACKAGE" quitefastmst/NEWS0000644000176200001440000000233215143120722013477 0ustar liggesusers# Changelog ## To Do * [HELP NEEDED] [Python] Set up OpenMP on macOS. * Parallelise the K-d tree building procedure. * In the Borůvka algorithm based on K-d trees, apply the correction for ambiguity of mutual reachability distances (`mutreach_adj`) also when considering non-M first neighbours. 
* Extend the online documentation: Tutorials, benchmarks, definitions. ## 0.9.1 (2026-02-11) * [NEW FEATURE] The `mutreach_leaves` argument to `mst_euclid` controls the postprocessing of tree leaves. * [BACKWARD INCOMPATIBILITY] The definition of the mutual reachability distance has changed (for notational prudence). Unlike in Campello et al.'s 2013 paper, now the core distance is the distance to the M-th nearest neighbour, not the (M-1)-th one (not including self). * [BACKWARD INCOMPATIBILITY] The `mutreach_adj` argument to `mst_euclid` was removed. Instead, the `mutreach_ties` argument is now available. It defaults to `"dist_min"` for (rough) compatibility with other packages. * [BUGFIX] #3: SIGSEGV on duplicated inputs in `mst_euclid` with `algorithm="brute"` was fixed. ## 0.9.0 (2025-07-22) * [R] Initial CRAN release. * [Python] Initial PyPI release. quitefastmst/src/0000755000176200001440000000000015143126457013602 5ustar liggesusersquitefastmst/src/RcppExports.cpp0000644000176200001440000000744215076665022016607 0ustar liggesusers// Generated by using Rcpp::compileAttributes() -> do not edit by hand // Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393 #include using namespace Rcpp; #ifdef RCPP_USE_GLOBAL_ROSTREAM Rcpp::Rostream& Rcpp::Rcout = Rcpp::Rcpp_cout_get(); Rcpp::Rostream& Rcpp::Rcerr = Rcpp::Rcpp_cerr_get(); #endif // Romp_set_num_threads int Romp_set_num_threads(int n_threads); RcppExport SEXP _quitefastmst_Romp_set_num_threads(SEXP n_threadsSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< int >::type n_threads(n_threadsSEXP); rcpp_result_gen = Rcpp::wrap(Romp_set_num_threads(n_threads)); return rcpp_result_gen; END_RCPP } // Romp_get_max_threads int Romp_get_max_threads(); RcppExport SEXP _quitefastmst_Romp_get_max_threads() { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; rcpp_result_gen = Rcpp::wrap(Romp_get_max_threads()); return rcpp_result_gen; 
END_RCPP } // knn_euclid List knn_euclid(SEXP X, int k, SEXP Y, Rcpp::String algorithm, int max_leaf_size, bool squared, bool verbose); RcppExport SEXP _quitefastmst_knn_euclid(SEXP XSEXP, SEXP kSEXP, SEXP YSEXP, SEXP algorithmSEXP, SEXP max_leaf_sizeSEXP, SEXP squaredSEXP, SEXP verboseSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< SEXP >::type X(XSEXP); Rcpp::traits::input_parameter< int >::type k(kSEXP); Rcpp::traits::input_parameter< SEXP >::type Y(YSEXP); Rcpp::traits::input_parameter< Rcpp::String >::type algorithm(algorithmSEXP); Rcpp::traits::input_parameter< int >::type max_leaf_size(max_leaf_sizeSEXP); Rcpp::traits::input_parameter< bool >::type squared(squaredSEXP); Rcpp::traits::input_parameter< bool >::type verbose(verboseSEXP); rcpp_result_gen = Rcpp::wrap(knn_euclid(X, k, Y, algorithm, max_leaf_size, squared, verbose)); return rcpp_result_gen; END_RCPP } // mst_euclid List mst_euclid(SEXP X, int M, Rcpp::String algorithm, int max_leaf_size, int first_pass_max_brute_size, Rcpp::String mutreach_ties, Rcpp::String mutreach_leaves, bool verbose); RcppExport SEXP _quitefastmst_mst_euclid(SEXP XSEXP, SEXP MSEXP, SEXP algorithmSEXP, SEXP max_leaf_sizeSEXP, SEXP first_pass_max_brute_sizeSEXP, SEXP mutreach_tiesSEXP, SEXP mutreach_leavesSEXP, SEXP verboseSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< SEXP >::type X(XSEXP); Rcpp::traits::input_parameter< int >::type M(MSEXP); Rcpp::traits::input_parameter< Rcpp::String >::type algorithm(algorithmSEXP); Rcpp::traits::input_parameter< int >::type max_leaf_size(max_leaf_sizeSEXP); Rcpp::traits::input_parameter< int >::type first_pass_max_brute_size(first_pass_max_brute_sizeSEXP); Rcpp::traits::input_parameter< Rcpp::String >::type mutreach_ties(mutreach_tiesSEXP); Rcpp::traits::input_parameter< Rcpp::String >::type mutreach_leaves(mutreach_leavesSEXP); 
Rcpp::traits::input_parameter< bool >::type verbose(verboseSEXP); rcpp_result_gen = Rcpp::wrap(mst_euclid(X, M, algorithm, max_leaf_size, first_pass_max_brute_size, mutreach_ties, mutreach_leaves, verbose)); return rcpp_result_gen; END_RCPP } static const R_CallMethodDef CallEntries[] = { {"_quitefastmst_Romp_set_num_threads", (DL_FUNC) &_quitefastmst_Romp_set_num_threads, 1}, {"_quitefastmst_Romp_get_max_threads", (DL_FUNC) &_quitefastmst_Romp_get_max_threads, 0}, {"_quitefastmst_knn_euclid", (DL_FUNC) &_quitefastmst_knn_euclid, 7}, {"_quitefastmst_mst_euclid", (DL_FUNC) &_quitefastmst_mst_euclid, 8}, {NULL, NULL, 0} }; RcppExport void R_init_quitefastmst(DllInfo *dll) { R_registerRoutines(dll, NULL, CallEntries, NULL, NULL); R_useDynamicSymbols(dll, FALSE); } quitefastmst/src/c_common.h0000644000176200001440000001055015132156655015547 0ustar liggesusers/* Common functions, macros, includes * * Copyleft (C) 2018-2026, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . 
*/ #ifndef __c_common_h #define __c_common_h #ifdef QUITEFASTMST_PYTHON #undef QUITEFASTMST_PYTHON #define QUITEFASTMST_PYTHON 1 #endif #ifdef QUITEFASTMST_R #undef QUITEFASTMST_R #define QUITEFASTMST_R 1 #endif #include #include #include #include #ifndef QUITEFASTMST_ASSERT #define __QUITEFASTMST_STR(x) #x #define QUITEFASTMST_STR(x) __QUITEFASTMST_STR(x) #define QUITEFASTMST_ASSERT(EXPR) { if (!(EXPR)) \ throw std::runtime_error( "[quitefastmst] Assertion " #EXPR " failed in "\ __FILE__ ":" QUITEFASTMST_STR(__LINE__) ); } #endif #if QUITEFASTMST_R #include #else #include "Python.h" #include #endif #if QUITEFASTMST_R #define QUITEFASTMST_PRINT(...) REprintf(__VA_ARGS__); #else #define QUITEFASTMST_PRINT(...) fprintf(stderr, __VA_ARGS__); #endif #ifdef QUITEFASTMST_PROFILER #include #define QUITEFASTMST_PROFILER_START \ _quitefastmst_profiler_t0 = std::chrono::high_resolution_clock::now(); #define QUITEFASTMST_PROFILER_GETDIFF \ _quitefastmst_profiler_td = std::chrono::duration(std::chrono::high_resolution_clock::now()-_quitefastmst_profiler_t0); #define QUITEFASTMST_PROFILER_USE \ auto QUITEFASTMST_PROFILER_START \ auto QUITEFASTMST_PROFILER_GETDIFF \ char _quitefastmst_profiler_strbuf[256]; #define QUITEFASTMST_PROFILER_STOP(...) \ QUITEFASTMST_PROFILER_GETDIFF; \ snprintf(_quitefastmst_profiler_strbuf, sizeof(_quitefastmst_profiler_strbuf), __VA_ARGS__); \ QUITEFASTMST_PRINT("%-64s: time=%12.3lf s\n", _quitefastmst_profiler_strbuf, _quitefastmst_profiler_td.count()/1000.0); /* use like: QUITEFASTMST_PROFILER_USE QUITEFASTMST_PROFILER_START QUITEFASTMST_PROFILER_STOP("message %d", 7) */ #else #define QUITEFASTMST_PROFILER_START ; /* no-op */ #define QUITEFASTMST_PROFILER_STOP(...) 
; /* no-op */ #define QUITEFASTMST_PROFILER_GETDIFF ; /* no-op */ #define QUITEFASTMST_PROFILER_USE ; /* no-op */ #endif #if QUITEFASTMST_R typedef ssize_t Py_ssize_t; #endif typedef double FLOAT_T; ///< float type we are working internally with // #ifndef INFTY // #define INFTY (std::numeric_limits::infinity()) // #endif template inline T square(T x) { return x*x; } template inline T min3(const T a, const T b, const T c) { T m = a; if (b < m) m = b; if (c < m) m = c; return m; } template inline T med3(const T a, const T b, const T c) { if ((b < a)^(c < a)) return a; // b < a && a <= c= || c < a && a <= b else if ((b < c)^(b < a)) return b; // c <= b && b < a || c > b && b >= a else return c; } template inline T max3(const T a, const T b, const T c) { T m = a; if (b > m) m = b; if (c > m) m = c; return m; } #define IS_PLUS_INFINITY(x) ((x) > 0.0 && !std::isfinite(x)) #define IS_MINUS_INFINITY(x) ((x) < 0.0 && !std::isfinite(x)) #ifdef OPENMP_DISABLED #define OPENMP_IS_ENABLED 0 #ifdef _OPENMP #undef _OPENMP #endif #else #ifdef _OPENMP #include #define OPENMP_IS_ENABLED 1 #else #define OPENMP_IS_ENABLED 0 #endif #endif inline int Comp_set_num_threads(int n_threads) { //QUITEFASTMST_PRINT("Comp_set_num_threads(%d), omp_get_max_threads()==%d\n", // n_threads, omp_get_max_threads()); if (n_threads <= 0) return n_threads; #if OPENMP_IS_ENABLED int oldval = omp_get_max_threads(); // confusing name... omp_set_num_threads(n_threads); return oldval; #else return 1; #endif } inline int Comp_get_max_threads() { #if OPENMP_IS_ENABLED return omp_get_max_threads(); #else return 1; #endif } #endif quitefastmst/src/c_disjoint_sets.h0000644000176200001440000000772415132156655017151 0ustar liggesusers/* class CDisjointSets * * Copyleft (C) 2025-2026, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. 
* This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __c_disjoint_sets_h #define __c_disjoint_sets_h #include "c_common.h" #include #include /*! Disjoint Sets (Union-Find) Data Structure * * A class to represent partitions of the set {0,1,...,n-1} for any n. * * Path compression for find() is implemented, * but the union() operation is naive (neither * it is union by rank nor by size), * see https://en.wikipedia.org/wiki/Disjoint-set_data_structure. * This is by design, as some other operations in the current * package rely on the assumption that the parent id of each * element is always <= than itself. */ class CDisjointSets { protected: Py_ssize_t n; //!< number of distinct elements Py_ssize_t k; //!< number of subsets std::vector par; /*!< par[i] is the id of the parent * of the i-th element */ public: /*! Starts with a "weak" partition { {0}, {1}, ..., {n-1} }, * i.e., n singletons. * * @param n number of elements, n>=0. */ CDisjointSets(Py_ssize_t n) : par(n) { // if (n < 0) throw std::domain_error("n < 0"); this->n = n; reset(); } void reset() { this->k = n; for (Py_ssize_t i=0; ipar[i] = i; } /*! A nullary constructor allows Cython to allocate * the instances on the stack. Do not use otherwise. */ CDisjointSets() : CDisjointSets(0) { } /*! Returns the current number of sets in the partition. */ inline Py_ssize_t get_k() const { return this->k; } /*! Returns the total cardinality of the set being partitioned. */ inline Py_ssize_t get_n() const { return this->n; } /*! Danger zone! 
Ensure find() was called upon each element */ inline Py_ssize_t get_parent(Py_ssize_t x) const { return this->par[x]; } inline const Py_ssize_t* get_parents() const { return this->par.data(); } /*! Finds the subset id for a given x. * * @param x a value in {0,...,n-1} */ Py_ssize_t find(Py_ssize_t x) { if (x < 0 || x >= this->n) throw std::domain_error("CDisjointSets: x not in [0,n)"); if (this->par[x] == x) return x; this->par[x] = this->find(this->par[x]); return this->par[x]; } /*! Merges the sets containing x and y. * * Let px be the parent id of x, and py be the parent id of y. * If px < py, then the new parent id of py will be set to px. * Otherwise, px will have py as its parent. * * If x and y are already members of the same subset, * an exception is thrown. * * @return the id of the parent of x or y, whichever is smaller. * * @param x a value in {0,...,n-1} * @param y a value in {0,...,n-1} */ virtual Py_ssize_t merge(Py_ssize_t x, Py_ssize_t y) // well, union is a reserved C++ keyword :) { x = this->find(x); // includes a range check for x y = this->find(y); // includes a range check for y if (x == y) throw std::invalid_argument("CDisjointSets: find(x) == find(y)"); if (y < x) std::swap(x, y); this->par[y] = x; this->k -= 1; return x; } }; #endif quitefastmst/src/c_kdtree.h0000644000176200001440000004070115073705530015532 0ustar liggesusers/* An implementation of K-d trees w.r.t. the squared Euclidean distance * * Supports finding k nearest neighbours of points within the same dataset; * fast for small k and dimensionality d. * * Features the sliding midpoint (midrange) rule suggested in "It's okay to be * skinny, if your friends are fat" by S. Maneewongvatana and D.M. Mount, 1999 * and some further enhancements (minding locality of reference, etc.). * This split criterion was the most efficient amongst those tested * (different quantiles, adjusted midrange, etc.), at least for the purpose * of building minimum spanning trees. 
* * * Copyleft (C) 2025, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __c_kdtree_h #define __c_kdtree_h #include "c_common.h" #include #include #include #include #include namespace quitefastkdtree { template struct kdtree_node_base { // some implementations store split_dim and split_val, but exact bounding // boxes (smallest) have better pruning capabilities std::array bbox_min; //< points' bounding box (min dims) std::array bbox_max; //< points' bounding box (max dims) // std::array centroid; Py_ssize_t idx_from; Py_ssize_t idx_to; }; template struct kdtree_node_knn : public kdtree_node_base { kdtree_node_knn* left; kdtree_node_knn* right; kdtree_node_knn() { left = nullptr; // right = nullptr; } inline bool is_leaf() const { return left == nullptr /*&& right == nullptr*/; // either both null or none } }; template class kdtree_distance_sqeuclid { public: static inline FLOAT point_point(const FLOAT* x, const FLOAT* y) { FLOAT dist = 0.0; for (Py_ssize_t u=0; u x[u]) // compare first, as FP subtract is slower dist += square(bbox_min[u] - x[u]); else if (x[u] > bbox_max[u]) dist += square(x[u] - bbox_max[u]); // else dist += 0.0; } return dist; } static inline FLOAT node_node( const FLOAT* bbox_min_a, const FLOAT* bbox_max_a, const FLOAT* bbox_min_b, const FLOAT* bbox_max_b ) { FLOAT dist = 0.0; for (Py_ssize_t u=0; u bbox_max_a[u]) dist += square(bbox_min_b[u] - bbox_max_a[u]); else if (bbox_min_a[u] > 
bbox_max_b[u]) dist += square(bbox_min_a[u] - bbox_max_b[u]); // else dist += 0.0; } return dist; } }; /** A class enabling searching for k nearest neighbours of a given point * (excluding self) within the same dataset; * it is thread-safe */ template < typename FLOAT, Py_ssize_t D, typename DISTANCE=kdtree_distance_sqeuclid, typename NODE=kdtree_node_knn > class kdtree_kneighbours { private: const Py_ssize_t which; ///< for which point are we getting the k-nns? const Py_ssize_t k; ///< how many nns? const FLOAT* x; ///< the point itself (shortcut) const FLOAT* data; ///< the dataset FLOAT* knn_dist; Py_ssize_t* knn_ind; const Py_ssize_t max_brute_size; // when to switch to the brute-force mode? 0 to honour the tree's max_leaf_size inline void point_vs_points(Py_ssize_t idx_from, Py_ssize_t idx_to) { const FLOAT* y = data+D*idx_from; for (Py_ssize_t i=idx_from; i= knn_dist[k-1]) continue; // insertion-sort like scheme (fast for small k) Py_ssize_t j = k-1; while (j > 0 && dist < knn_dist[j-1]) { knn_ind[j] = knn_ind[j-1]; knn_dist[j] = knn_dist[j-1]; j--; } knn_ind[j] = i; knn_dist[j] = dist; } } void find_knn(const NODE* root) { if (root->is_leaf() || root->idx_to-root->idx_from <= max_brute_size) { if (which < root->idx_from || which >= root->idx_to) point_vs_points(root->idx_from, root->idx_to); else { point_vs_points(root->idx_from, which); point_vs_points(which+1, root->idx_to); } return; } // closer node first (significant speedup) FLOAT left_dist = DISTANCE::point_node( x, root->left->bbox_min.data(), root->left->bbox_max.data() ); FLOAT right_dist = DISTANCE::point_node( x, root->right->bbox_min.data(), root->right->bbox_max.data() ); #define FIND_KNN_PROCESS(nearer_dist, farther_dist, nearer_node, farther_node) \ if (nearer_dist < knn_dist[k-1]) { \ find_knn(nearer_node); \ if (farther_dist < knn_dist[k-1]) \ find_knn(farther_node); \ } \ if (left_dist <= right_dist) { FIND_KNN_PROCESS(left_dist, right_dist, root->left, root->right); } else { 
FIND_KNN_PROCESS(right_dist, left_dist, root->right, root->left); } // slower: // if (DISTANCE::point_point(x, root->left->centroid.data()) <= DISTANCE::point_point(x, root->right->centroid.data())) // { // if (left_dist < knn_dist[k-1]) // find_knn(root->left); // if (right_dist < knn_dist[k-1]) // find_knn(root->right); // } // else { // if (right_dist < knn_dist[k-1]) // find_knn(root->right); // if (left_dist < knn_dist[k-1]) // find_knn(root->left); // } } public: kdtree_kneighbours( const FLOAT* data, const FLOAT* x, const Py_ssize_t which, FLOAT* knn_dist, Py_ssize_t* knn_ind, const Py_ssize_t k, const Py_ssize_t max_brute_size=0 ) : which(which), k(k), x(x), data(data), knn_dist(knn_dist), knn_ind(knn_ind), max_brute_size(max_brute_size) { if (x == nullptr) { QUITEFASTMST_ASSERT(which >= 0); this->x = data+D*which; } // // Pre-flight (no benefit) // for (Py_ssize_t i=0; i<=2*k; ++i) { // Py_ssize_t j = (Py_ssize_t)which-i-(Py_ssize_t)k; // if (j == (Py_ssize_t)which) continue; // else if (j < 0) j = (Py_ssize_t)n+j; // else if (j >= (Py_ssize_t)n) j = j - (Py_ssize_t)n; // // const FLOAT* y = data+j*D; // FLOAT dist = 0.0; // for (size_t u=0; u= knn_dist[k-1]) // continue; // // // insertion-sort like scheme (fast for small k) // // j = (Py_ssize_t)k-1; // while (j > 0 && dist < knn_dist[j-1]) { // knn_dist[j] = knn_dist[j-1]; // j--; // } // knn_dist[j] = dist; // } // // knn_dist[k-1] = std::nexttoward(knn_dist[k-1], INFINITY); // for (size_t i=0; i, typename NODE=kdtree_node_knn > class kdtree { protected: std::deque< NODE > nodes; // stores all nodes FLOAT* data; //< destroyable; a row-major n*D matrix (points are permuted, see perm[] - that's for better locality of reference) const Py_ssize_t n; //< number of points std::vector perm; //< original point indexes const Py_ssize_t max_leaf_size; //< unless in pathological cases Py_ssize_t nleaves; //< number of leaves in the tree inline void compute_bounding_box(NODE*& root) { const FLOAT* y = 
data+root->idx_from*D; for (Py_ssize_t u=0; ubbox_min[u] = *y; root->bbox_max[u] = *y; // root->centroid[u] = *y; ++y; } for (Py_ssize_t i=root->idx_from+1; iidx_to; ++i) { for (Py_ssize_t u=0; ubbox_min[u]) root->bbox_min[u] = *y; else if (*y > root->bbox_max[u]) root->bbox_max[u] = *y; // root->centroid[u] += *y; ++y; } } // for (Py_ssize_t u=0; ucentroid[u] /= (root->idx_to-root->idx_from); // } } void build_tree( NODE* root, Py_ssize_t idx_from, Py_ssize_t idx_to ) { QUITEFASTMST_ASSERT(idx_to - idx_from > 0); root->idx_from = idx_from; root->idx_to = idx_to; compute_bounding_box(root); if (idx_to - idx_from <= max_leaf_size) { // this will be a leaf node; nothing more to do ++nleaves; return; } // cut by the dim of the greatest range Py_ssize_t split_dim = 0; FLOAT dim_width = root->bbox_max[0] - root->bbox_min[0]; for (Py_ssize_t u=1; ubbox_max[u] - root->bbox_min[u]; if (cur_width > dim_width) { dim_width = cur_width; split_dim = u; } } // The sliding midpoint rule: FLOAT split_val = 0.5f*(root->bbox_min[split_dim] + root->bbox_max[split_dim]); // midrange // this doesn't improve: // size_t cnt = 0; // for (size_t i=idx_from; ibbox_min[split_dim]+0.75*(root->bbox_max[split_dim] - root->bbox_min[split_dim]); // else if ((idx_to-idx_from)-cnt <= max_leaf_size/4) // split_val = root->bbox_min[split_dim]+0.25*(root->bbox_max[split_dim] - root->bbox_min[split_dim]); if (dim_width == 0) { // a pathological case: this will be a "large" leaf (all points with the same coords) return; } QUITEFASTMST_ASSERT(root->bbox_min[split_dim] < split_val); QUITEFASTMST_ASSERT(split_val < root->bbox_max[split_dim]); // FLOAT split_left_max = root->bbox_min[split_dim]; // FLOAT split_right_min = root->bbox_max[split_dim]; // partition data[idx_from:idx_left, split_dim] <= split_val, data[idx_left:idx_to, split_dim] > split_val Py_ssize_t idx_left = idx_from; Py_ssize_t idx_right = idx_to-1; while (true) { while (data[idx_left*D+split_dim] <= split_val) { // split_val < 
curbox_max[split_dim] // if (data[idx_left*D+split_dim] > split_left_max) // split_left_max = data[idx_left*D+split_dim]; idx_left++; } while (data[idx_right*D+split_dim] > split_val) { // split_val > curbox_min[split_dim] // if (data[idx_right*D+split_dim] < split_right_min) // split_right_min = data[idx_right*D+split_dim]; idx_right--; } if (idx_left >= idx_right) break; std::swap(perm[idx_left], perm[idx_right]); for (Py_ssize_t u=0; u idx_from); QUITEFASTMST_ASSERT(idx_left < idx_to); QUITEFASTMST_ASSERT(data[idx_left*D+split_dim] > split_val); QUITEFASTMST_ASSERT(data[(idx_left-1)*D+split_dim] <= split_val); // QUITEFASTMST_ASSERT(split_left_max <= split_val); // QUITEFASTMST_ASSERT(split_right_min > split_val); // root->intnode_data.split_dim = split_dim; // root->intnode_data.split_left_max = split_left_max; // root->intnode_data.split_right_min = split_right_min; nodes.push_back(NODE()); root->left = &nodes[nodes.size()-1]; nodes.push_back(NODE()); root->right = &nodes[nodes.size()-1]; build_tree(root->left, idx_from, idx_left); build_tree(root->right, idx_left, idx_to); } public: kdtree() : data(nullptr), n(0), perm(0), max_leaf_size(1) { } kdtree(FLOAT* data, const Py_ssize_t n, const Py_ssize_t max_leaf_size=16) : data(data), n(n), perm(n), max_leaf_size(max_leaf_size) { QUITEFASTMST_ASSERT(max_leaf_size > 0); for (Py_ssize_t i=0; i knn(data, nullptr, which, knn_dist, knn_ind, k); knn.find(&nodes[0]); } void kneighbours(const FLOAT* x, FLOAT* knn_dist, Py_ssize_t* knn_ind, Py_ssize_t k) { kdtree_kneighbours knn(data, x, -1, knn_dist, knn_ind, k); knn.find(&nodes[0]); } }; /*! 
* k nearest neighbours of each point in X (in the tree); * each point is not its own neighbour * * see _knn_sqeuclid_kdtree * * @param tree a pre-built K-d tree containing n points * @param knn_dist [out] size n*k * @param knn_ind [out] size n*k * @param k number of neighbours */ template void kneighbours( TREE& tree, FLOAT* knn_dist, // size n*k Py_ssize_t* knn_ind, // size n*k Py_ssize_t k ) { Py_ssize_t n = tree.get_n(); const Py_ssize_t* perm = tree.get_perm(); #if OPENMP_IS_ENABLED #pragma omp parallel for schedule(static) #endif for (Py_ssize_t i=0; i void kneighbours( TREE& tree, const FLOAT* Y, Py_ssize_t m, FLOAT* knn_dist, // size m*k Py_ssize_t* knn_ind, // size m*k Py_ssize_t k ) { #if OPENMP_IS_ENABLED #pragma omp parallel for schedule(static) #endif for (Py_ssize_t i=0; i * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #include "c_fastmst.h" #include "c_common.h" #include #include "c_kdtree.h" /** * helper function called by Cknn2_euclid_kdtree below */ template void _knn_sqeuclid_kdtree( FLOAT* X, const size_t n, const FLOAT* Y, const Py_ssize_t m, const size_t k, FLOAT* nn_dist, Py_ssize_t* nn_ind, size_t max_leaf_size, bool /*verbose=false*/) { using DISTANCE=quitefastkdtree::kdtree_distance_sqeuclid; quitefastkdtree::kdtree tree(X, n, max_leaf_size); if (!Y) quitefastkdtree::kneighbours(tree, nn_dist, nn_ind, k); else quitefastkdtree::kneighbours(tree, Y, m, nn_dist, nn_ind, k); } /*! 
Get the k nearest neighbours of each point w.r.t. the Euclidean distance, * using a K-d tree to speed up the computations. * * The implemented algorithm assumes that `k` is rather small; say, `k <= 20`. * * Our implementation of K-d trees [1]_ has been quite optimised; amongst * others, it has good locality of reference, features the sliding midpoint * (midrange) rule suggested in [2]_, and a node pruning strategy inspired * by the discussion in [3]_. Still, it is well-known that K-d trees * perform well only in spaces of low intrinsic dimensionality. Thus, * due to the so-called curse of dimensionality, for high `d`, a brute-force * algorithm is recommended. * * [1] J.L. Bentley, Multidimensional binary search trees used for associative * searching, Communications of the ACM 18(9), 509–517, 1975, * https://doi.org/10.1145/361002.361007 * * [2] S. Maneewongvatana, D.M. Mount, It's okay to be skinny, if your friends * are fat, 4th CGC Workshop on Computational Geometry, 1999 * * [3] N. Sample, M. Haines, M. Arnold, T. Purcell, Optimizing search * strategies in K-d Trees, 5th WSES/IEEE Conf. on Circuits, Systems, * Communications & Computers (CSCC'01), 2001 * * * * @param X [destroyable] data: a C-contiguous data matrix * @param n number of rows in X * @param Y query points: a C-contiguous data matrix, or NULL to query the points in X themselves * @param m number of rows in Y * @param d number of columns in X and in Y * @param k number of nearest neighbours to look for * @param nn_dist [out] vector(matrix) of length n*k if Y is NULL or m*k otherwise; distances to NNs * @param nn_ind [out] vector(matrix) of the same length as nn_dist; indexes of NNs * @param max_leaf_size maximal number of points in the K-d tree's leaves * @param squared return the squared Euclidean distance? * @param verbose output diagnostic/progress messages?
*/ template <typename FLOAT> void Cknn2_euclid_kdtree( FLOAT* X, const Py_ssize_t n, const FLOAT* Y, const Py_ssize_t m, const Py_ssize_t d, const Py_ssize_t k, FLOAT* nn_dist, Py_ssize_t* nn_ind, Py_ssize_t max_leaf_size, bool squared, bool verbose ) { Py_ssize_t nknn; if (n <= 0) throw std::domain_error("n <= 0"); if (k <= 0) throw std::domain_error("k <= 0"); if (!Y) { if (k >= n) throw std::domain_error("k >= n"); nknn = n; } else { if (m <= 0) throw std::domain_error("m <= 0"); if (k > n) throw std::domain_error("k > n"); nknn = m; } if (max_leaf_size <= 0) throw std::domain_error("max_leaf_size <= 0"); if (verbose) QUITEFASTMST_PRINT("[quitefastmst] Determining the nearest neighbours... "); #define ARGS_knn_sqeuclid_kdtree X, n, Y, m, k, nn_dist, nn_ind, max_leaf_size, verbose /* LMAO; templates... */ /**/ if (d == 2) _knn_sqeuclid_kdtree<FLOAT, 2>(ARGS_knn_sqeuclid_kdtree); else if (d == 3) _knn_sqeuclid_kdtree<FLOAT, 3>(ARGS_knn_sqeuclid_kdtree); else if (d == 4) _knn_sqeuclid_kdtree<FLOAT, 4>(ARGS_knn_sqeuclid_kdtree); else if (d == 5) _knn_sqeuclid_kdtree<FLOAT, 5>(ARGS_knn_sqeuclid_kdtree); else if (d == 6) _knn_sqeuclid_kdtree<FLOAT, 6>(ARGS_knn_sqeuclid_kdtree); else if (d == 7) _knn_sqeuclid_kdtree<FLOAT, 7>(ARGS_knn_sqeuclid_kdtree); else if (d == 8) _knn_sqeuclid_kdtree<FLOAT, 8>(ARGS_knn_sqeuclid_kdtree); else if (d == 9) _knn_sqeuclid_kdtree<FLOAT, 9>(ARGS_knn_sqeuclid_kdtree); else if (d == 10) _knn_sqeuclid_kdtree<FLOAT, 10>(ARGS_knn_sqeuclid_kdtree); else if (d == 11) _knn_sqeuclid_kdtree<FLOAT, 11>(ARGS_knn_sqeuclid_kdtree); else if (d == 12) _knn_sqeuclid_kdtree<FLOAT, 12>(ARGS_knn_sqeuclid_kdtree); else if (d == 13) _knn_sqeuclid_kdtree<FLOAT, 13>(ARGS_knn_sqeuclid_kdtree); else if (d == 14) _knn_sqeuclid_kdtree<FLOAT, 14>(ARGS_knn_sqeuclid_kdtree); else if (d == 15) _knn_sqeuclid_kdtree<FLOAT, 15>(ARGS_knn_sqeuclid_kdtree); else if (d == 16) _knn_sqeuclid_kdtree<FLOAT, 16>(ARGS_knn_sqeuclid_kdtree); else if (d == 17) _knn_sqeuclid_kdtree<FLOAT, 17>(ARGS_knn_sqeuclid_kdtree); else if (d == 18) _knn_sqeuclid_kdtree<FLOAT, 18>(ARGS_knn_sqeuclid_kdtree); else if (d == 19) _knn_sqeuclid_kdtree<FLOAT, 19>(ARGS_knn_sqeuclid_kdtree); else if (d == 20)
_knn_sqeuclid_kdtree(ARGS_knn_sqeuclid_kdtree); else { throw std::runtime_error("d should be between 2 and 20"); } if (!squared) { for (Py_ssize_t i=0; i void Cknn1_euclid_kdtree( FLOAT* X, const Py_ssize_t n, const Py_ssize_t d, const Py_ssize_t k, FLOAT* nn_dist, Py_ssize_t* nn_ind, Py_ssize_t max_leaf_size, bool squared, bool verbose ) { Cknn2_euclid_kdtree( X, n, (const FLOAT*)nullptr, -1, d, k, nn_dist, nn_ind, max_leaf_size, squared, verbose ); } // instantiate: template void Cknn2_euclid_kdtree( float* X, const Py_ssize_t n, const float* Y, const Py_ssize_t m, const Py_ssize_t d, const Py_ssize_t k, float* nn_dist, Py_ssize_t* nn_ind, Py_ssize_t max_leaf_size, bool squared, bool verbose ); template void Cknn2_euclid_kdtree( double* X, const Py_ssize_t n, const double* Y, const Py_ssize_t m, const Py_ssize_t d, const Py_ssize_t k, double* nn_dist, Py_ssize_t* nn_ind, Py_ssize_t max_leaf_size, bool squared, bool verbose ); template void Cknn1_euclid_kdtree( float* X, const Py_ssize_t n, const Py_ssize_t d, const Py_ssize_t k, float* nn_dist, Py_ssize_t* nn_ind, Py_ssize_t max_leaf_size, bool squared, bool verbose ); template void Cknn1_euclid_kdtree( double* X, const Py_ssize_t n, const Py_ssize_t d, const Py_ssize_t k, double* nn_dist, Py_ssize_t* nn_ind, Py_ssize_t max_leaf_size, bool squared, bool verbose ); quitefastmst/src/RcppFastmst.cpp0000644000176200001440000006053015143122402016542 0ustar liggesusers/* Functions to compute k-nearest neighbours and minimum spanning trees * with respect to the Euclidean metric and the thereon-based mutual * reachability distances. The module gives access to a quite fast * implementation of K-d trees. * * For best speed, consider building the package from sources * using, e.g., `-O3 -march=native` compiler flags. 
* * Copyleft (C) 2025-2026, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #include "c_common.h" #include "c_fastmst.h" #include using namespace Rcpp; //' @title //' Get or Set the Number of Threads //' //' @description //' These functions get or set the maximal number of OpenMP threads that //' can be used by \code{\link{knn_euclid}} and \code{\link{mst_euclid}}, //' amongst others. //' //' @param n_threads maximal number of threads to use //' //' @return //' \code{omp_get_max_threads} returns the maximal number //' of threads that will be used during the next call to a parallelised //' function, not the maximal number of threads possibly available. //' If there is no built-in support for OpenMP, 1 is always returned. //' //' For \code{omp_set_num_threads}, the previous value of \code{max_threads} //' is returned. //' //' //' @rdname omp //' @encoding UTF-8 //' @export // [[Rcpp::export("omp_set_num_threads")]] int Romp_set_num_threads(int n_threads) { return Comp_set_num_threads(n_threads); } //' @rdname omp //' @export // [[Rcpp::export("omp_get_max_threads")]] int Romp_get_max_threads() { return Comp_get_max_threads(); } //' @title Euclidean Nearest Neighbours //' //' @description //' If \code{Y} is \code{NULL}, then the function determines the first \code{k} //' nearest neighbours of each point in \code{X} with respect //' to the Euclidean distance. It is assumed that each query point is //' not its own neighbour.
//' //' Otherwise, for each point in \code{Y}, this function determines the \code{k} //' nearest points thereto from \code{X}. //' //' @details //' The implemented algorithms, see the \code{algorithm} parameter, assume //' that \eqn{k} is rather small. //' //' Our implementation of K-d trees (Bentley, 1975) has been quite optimised; //' amongst others, it has good locality of reference (at the cost of making //' a copy of the input dataset), features the sliding //' midpoint (midrange) rule suggested by Maneewongvatana and Mount (1999), //' node pruning strategies inspired by some ideas from (Sample et al., 2001), //' and a couple of further tuneups proposed by the current author. //' Still, it is well-known that K-d trees perform well only in spaces of low //' intrinsic dimensionality. Thus, due to the so-called curse of //' dimensionality, for high \code{d}, the brute-force algorithm is recommended. //' //' The number of threads is controlled via the \code{OMP_NUM_THREADS} //' environment variable or via the \code{\link{omp_set_num_threads}} function //' at runtime. For best speed, consider building the package //' from sources using, e.g., \code{-O3 -march=native} compiler flags. //' //' //' @references //' J.L. Bentley, Multidimensional binary search trees used for associative //' searching, \emph{Communications of the ACM} 18(9), 509–517, 1975, //' \doi{10.1145/361002.361007} //' //' S. Maneewongvatana, D.M. Mount, It's okay to be skinny, if your friends //' are fat, \emph{4th CGC Workshop on Computational Geometry}, 1999 //' //' N. Sample, M. Haines, M. Arnold, T. Purcell, Optimizing search //' strategies in K-d Trees, \emph{5th WSES/IEEE Conf.
on Circuits, Systems, //' Communications & Computers} (CSCC'01), 2001 //' //' //' @param X the "database"; a matrix of shape \eqn{n\times d} //' @param k requested number of nearest neighbours //' @param Y the "query points"; \code{NULL} or a matrix of shape \eqn{m\times d}; //' note that setting \code{Y=X}, contrary to \code{NULL}, //' will include the query points themselves amongst their own neighbours //' @param algorithm \code{"auto"}, \code{"kd_tree"} or \code{"brute"}; //' K-d trees can be used for \code{d} between 2 and 20 only; //' \code{"auto"} selects \code{"kd_tree"} in low-dimensional spaces //' @param max_leaf_size maximal number of points in the K-d tree leaves; //' smaller leaves use more memory, yet are not necessarily faster; //' use \code{0} to select the default value, currently set to 32 //' @param squared whether the output \code{nn.dist} should be based on //' the squared Euclidean distance //' @param verbose whether to print diagnostic messages //' //' //' @return //' A list with two elements, \code{nn.index} and \code{nn.dist}, is returned. //' //' \code{nn.dist} and \code{nn.index} have shape \eqn{n\times k} //' or \eqn{m\times k}, depending on whether \code{Y} is given. //' //' \code{nn.index[i,j]} is the index (between \eqn{1} and \eqn{n}) //' of the \eqn{j}-th nearest neighbour of the \eqn{i}-th point. //' //' \code{nn.dist[i,j]} gives the weight of the edge \code{{i, nn.index[i,j]}}, //' i.e., the distance between the \eqn{i}-th point and its \eqn{j}-th //' nearest neighbour, \eqn{j=1,\dots,k}. //' \code{nn.dist[i,]} is sorted nondecreasingly for all \eqn{i}.
//' //' //' //' @examples //' library("datasets") //' data("iris") //' X <- jitter(as.matrix(iris[1:2])) # some data //' neighbours <- knn_euclid(X, 1) # 1-NNs of each point //' plot(X, asp=1, las=1) //' segments(X[,1], X[,2], X[neighbours$nn.index,1], X[neighbours$nn.index,2]) //' //' knn_euclid(X, 5, matrix(c(6, 4), nrow=1)) # five closest points to (6, 4) //' //' //' @seealso \code{\link{mst_euclid}} //' //' @rdname knn_euclid //' @encoding UTF-8 //' @export // [[Rcpp::export("knn_euclid")]] List knn_euclid( SEXP X, int k=1, SEXP Y=R_NilValue, Rcpp::String algorithm="auto", int max_leaf_size=0, bool squared=false, bool verbose=false ) { using FLOAT = double; // float is not faster.. Rcpp::NumericMatrix _X; if (!Rf_isMatrix(X)) _X = Rcpp::internal::convert_using_rfunction(X, "as.matrix"); else _X = X; Py_ssize_t n = (Py_ssize_t)_X.nrow(); Py_ssize_t d = (Py_ssize_t)_X.ncol(); Py_ssize_t m; bool use_kdtree; if (n < 1 || d <= 1) stop("X is ill-shaped"); if (k < 1) stop("`k` must be >= 1"); if (algorithm == "auto") { if (2 <= d && d <= 20) algorithm = "kd_tree"; else algorithm = "brute"; } if (algorithm == "kd_tree") { if (d < 2 || d > 20) stop("kd_tree can only be used for 2 <= d <= 20"); if (max_leaf_size == 0) max_leaf_size = 32; // the current default if (max_leaf_size <= 0) stop("max_leaf_size must be positive"); use_kdtree = true; } else if (algorithm == "brute") use_kdtree = false; else stop("invalid 'algorithm'"); std::vector XC(n*d); Py_ssize_t j = 0; for (Py_ssize_t i=0; i nn_dist; std::vector nn_ind; if (Rf_isNull(Y)) { if (k >= n) stop("too many neighbours requested"); m = n; nn_dist.resize(n*k); nn_ind.resize(n*k); if (use_kdtree) Cknn1_euclid_kdtree( XC.data(), n, d, k, nn_dist.data(), nn_ind.data(), max_leaf_size, squared, verbose ); else Cknn1_euclid_brute( XC.data(), n, d, k, nn_dist.data(), nn_ind.data(), squared, verbose ); } else { if (k > n) stop("too many neighbours requested"); Rcpp::NumericMatrix _Y; if (!Rf_isMatrix(Y)) _Y = 
Rcpp::internal::convert_using_rfunction(Y, "as.matrix"); else _Y = Y; m = (Py_ssize_t)_Y.nrow(); if (_Y.ncol() != d) stop("Y's dimensionality does not match that of X"); nn_dist.resize(m*k); nn_ind.resize(m*k); std::vector YC(m*d); Py_ssize_t j = 0; for (Py_ssize_t i=0; i1}, the spanning tree is the smallest with respect to //' the degree-\eqn{M} mutual reachability distance (Campello et al., 2013) given by //' \eqn{d_M(i, j)=\max\{ c_M(i), c_M(j), d(i, j)\}}, where \eqn{d(i,j)} //' is the standard Euclidean distance between the \eqn{i}-th and the \eqn{j}-th point, //' and \eqn{c_M(i)} is the \eqn{i}-th \eqn{M}-core distance defined as the distance //' between the \eqn{i}-th point and its \eqn{M}-th nearest neighbour //' (not including the query point itself). //' //' Note that (Campello et al., 2013) defines the core distance as the //' distance to the \eqn{(M-1)}-th nearest neighbour (or the \eqn{M}-th one, //' but including self). //' //' //' @details //' (*) Note that if there are many pairs of equidistant points, //' there can be many minimum spanning trees. In particular, it is likely //' that there are point pairs with the same mutual reachability distances. //' //' To make the definition unambiguous, the \code{mutreach_ties} argument //' indicates the preference towards connecting to farther/closer points with //' respect to the original metric, or having smaller/larger core distances //' in cases of tied distances; see (Gagolewski, 2026). Empirically, //' \code{mutreach_ties="dcore_min"} and \code{mutreach_leaves="reconnect_dcore_min"} //' leads to MSTs with more leaves and hubs. //' //' The brute force method always resolves all ties, whilst, for efficiency, //' the K-d tree-based algorithms use this adjustment only for the first \eqn{M} //' nearest neighbours, so the resulting trees might be slightly different. //' //' The implemented algorithms, see the \code{algorithm} parameter, assume //' that \eqn{M} is rather small. 
//' //' Our implementation of K-d trees (Bentley, 1975) has been quite optimised; //' amongst others, it has good locality of reference (at the cost of making //' a copy of the input dataset), features the sliding //' midpoint (midrange) rule suggested by Maneewongvatana and Mount (1999), //' node pruning strategies inspired by some ideas from (Sample et al., 2001), //' and a couple of further tuneups proposed by the current author. //' //' The "single-tree" version of the Borůvka algorithm is parallelised: //' in every iteration, it seeks each point's nearest "alien", //' i.e., the nearest point thereto from another cluster. //' The "dual-tree" Borůvka version of the algorithm is, in principle, based //' on (March et al., 2010). As far as our implementation is concerned, //' the dual-tree approach is often faster only in 2- and 3-dimensional spaces, //' for \eqn{M\leq 1}, and in a single-threaded setting. For another //' (approximate) adaptation of the dual-tree algorithm to mutual //' reachability distances, see (McInnes and Healy, 2017). //' //' The "sesqui-tree" variant (by the current author) is a mixture of the two //' approaches: it compares leaves against the full tree and can be run //' in parallel. It is usually faster than the single- and dual-tree methods //' in very low dimensional spaces and usually not much slower than //' the single-tree variant otherwise. //' //' Nevertheless, it is well-known that K-d trees perform well only in spaces //' of low intrinsic dimensionality (the "curse"). For high \eqn{d}, //' the "brute-force" algorithm is recommended. Here, we provide a //' parallelised (see Olson, 1995) version of the Jarník (1930) (a.k.a. //' Prim, 1957) algorithm, where the distances are computed //' on the fly (only once for \eqn{M\leq 1}). //' //' The number of threads used is controlled via the \code{OMP_NUM_THREADS} //' environment variable or via the \code{\link{omp_set_num_threads}} function //' at runtime.
For best speed, consider building the package //' from sources using, e.g., \code{-O3 -march=native} compiler flags. //' //' //' @references //' V. Jarník, O jistém problému minimálním, //' \emph{Práce Moravské Přírodovědecké Společnosti} 6, 1930, 57–63. //' //' C.F. Olson, Parallel algorithms for hierarchical clustering, //' \emph{Parallel Computing} 21(8), 1995, 1313–1325. //' //' R. Prim, Shortest connection networks and some generalizations, //' \emph{The Bell System Technical Journal} 36(6), 1957, 1389–1401. //' //' O. Borůvka, O jistém problému minimálním, \emph{Práce Moravské //' Přírodovědecké Společnosti} 3, 1926, 37–58. //' //' W.B. March, P. Ram, A.G. Gray, Fast Euclidean minimum spanning //' tree: Algorithm, analysis, and applications, \emph{Proc. 16th ACM SIGKDD //' Intl. Conf. Knowledge Discovery and Data Mining (KDD '10)}, 2010, 603–612. //' //' J.L. Bentley, Multidimensional binary search trees used for associative //' searching, \emph{Communications of the ACM} 18(9), 509–517, 1975, //' \doi{10.1145/361002.361007} //' //' S. Maneewongvatana, D.M. Mount, It's okay to be skinny, if your friends //' are fat, \emph{4th CGC Workshop on Computational Geometry}, 1999 //' //' N. Sample, M. Haines, M. Arnold, T. Purcell, Optimizing search //' strategies in K-d Trees, \emph{5th WSES/IEEE Conf. on Circuits, Systems, //' Communications & Computers} (CSCC'01), 2001 //' //' R.J.G.B. Campello, D. Moulavi, J. Sander, Density-based clustering based //' on hierarchical density estimates, \emph{Lecture Notes in Computer Science} //' 7819, 2013, 160–172. \doi{10.1007/978-3-642-37456-2_14} //' //' R.J.G.B. Campello, D. Moulavi, A. Zimek, J. Sander, Hierarchical //' density estimates for data clustering, visualization, and outlier detection, //' \emph{ACM Transactions on Knowledge Discovery from Data (TKDD)} 10(1), //' 2015, 1–51, \doi{10.1145/2733381} //' //' L. McInnes, J. Healy, Accelerated hierarchical density-based //' clustering, \emph{IEEE Intl. Conf.
Data Mining Workshops (ICDMW)}, 2017, //' 33–42, \doi{10.1109/ICDMW.2017.12} //' //' M. Gagolewski, quitefastmst, in preparation, 2026, TODO //' //' //' @param X the "database"; a matrix of shape \eqn{n\times d} //' @param M the smoothing factor a.k.a. the degree of the mutual reachability //' distance; \eqn{M\leq 1} gives the ordinary Euclidean distance //' @param algorithm \code{"auto"}, \code{"single_kd_tree"}, //' \code{"sesqui_kd_tree"}, \code{"dual_kd_tree"}, or \code{"brute"}; //' K-d trees can be used for \eqn{d} between 2 and 20 only; //' \code{"auto"} selects \code{"sesqui_kd_tree"} for \eqn{d\leq 20}; //' \code{"brute"} is used otherwise //' @param max_leaf_size maximal number of points in the K-d tree leaves; //' smaller leaves use more memory, yet are not necessarily faster; //' use \code{0} to select the default value, currently set to 32 for the //' single-tree and sesqui-tree and 8 for the dual-tree Borůvka algorithm //' @param first_pass_max_brute_size minimal number of points in a node to //' treat it as a leaf (unless it actually is a leaf) in the first //' iteration of the algorithm; use \code{0} to select the default value, //' currently set to 32 //' @param mutreach_ties adjustment for mutual reachability distance ambiguity //' (for \eqn{M>1}); one of \code{"dcore_min"}, \code{"dist_max"}, //' \code{"dist_min"} (default), or \code{"dcore_max"} //' @param mutreach_leaves a way to postprocess the leaves of the computed tree; //' one of \code{"keep"} (default: do nothing), //' or \code{"reconnect_dcore_min"} (try reconnecting leaves to //' inner vertices which have them amongst their M nearest neighbours; //' prefer vertices of the smallest core distance) //' @param verbose whether to print diagnostic messages //' //' //' @return //' A list with two (\eqn{M=0}) or four (\eqn{M>0}) elements, \code{mst.index} and //' \code{mst.dist}, and additionally \code{nn.index} and \code{nn.dist}.
//' //' \code{mst.index} is a matrix with \eqn{n-1} rows and \eqn{2} columns, //' whose rows define the tree edges. //' //' \code{mst.dist} is a vector of length //' \eqn{n-1} giving the weights of the corresponding edges. //' //' The tree edges are ordered with respect to weights nondecreasingly, and then by //' the indexes (lexicographic ordering of the \code{(weight, index1, index2)} //' triples). For each \code{i}, it holds \code{mst_ind[i,1]= n) stop("incorrect M"); if (algorithm == "auto") { if (2 <= d && d <= 20) { //if (d <= 3) algorithm = "sesqui_kd_tree"; //else // algorithm = "single_kd_tree"; } else algorithm = "brute"; } if (algorithm == "single_kd_tree" || algorithm == "sesqui_kd_tree" || algorithm == "dual_kd_tree") { if (d < 2 || d > 20) stop("K-d trees can only be used for 2 <= d <= 20"); use_kdtree = true; if (algorithm == "single_kd_tree") { if (max_leaf_size == 0) max_leaf_size = 32; // the current default if (first_pass_max_brute_size == 0) first_pass_max_brute_size = 32; // the current default boruvka_variant = 1.0; } else if (algorithm == "sesqui_kd_tree") { if (max_leaf_size == 0) max_leaf_size = 32; // the current default if (first_pass_max_brute_size == 0) first_pass_max_brute_size = 32; // the current default boruvka_variant = 1.5; } else { if (max_leaf_size == 0) max_leaf_size = 8; // the current default if (first_pass_max_brute_size == 0) first_pass_max_brute_size = 32; // the current default boruvka_variant = 2.0; } if (max_leaf_size <= 0) stop("max_leaf_size must be positive"); if (first_pass_max_brute_size <= 0) stop("first_pass_max_brute_size must be positive"); } else if (algorithm == "brute") use_kdtree = false; else stop("invalid 'algorithm'"); if (mutreach_ties == "dcore_min") mutreach_ties_val = -2; else if (mutreach_ties == "dist_max") mutreach_ties_val = -1; else if (mutreach_ties == "dist_min") mutreach_ties_val = 1; else if (mutreach_ties == "dcore_max") mutreach_ties_val = 2; else stop("invalid 'mutreach_ties'"); 
std::vector XC(n*d); Py_ssize_t j = 0; for (Py_ssize_t i=0; i mst_ind((n-1)*2); // C-order std::vector mst_dist(n-1); // TODO: use out_dist std::vector nn_ind((M==0)?0:(n*M)); std::vector nn_dist((M==0)?0:(n*M)); if (use_kdtree) Cmst_euclid_kdtree( XC.data(), n, d, M, mst_dist.data(), mst_ind.data(), (M==0)?nullptr:nn_dist.data(), (M==0)?nullptr:nn_ind.data(), max_leaf_size, first_pass_max_brute_size, boruvka_variant, mutreach_ties_val, verbose ); else Cmst_euclid_brute( XC.data(), n, d, M, mst_dist.data(), mst_ind.data(), (M==0)?nullptr:nn_dist.data(), (M==0)?nullptr:nn_ind.data(), mutreach_ties_val, verbose ); if (mutreach_leaves=="keep") ; else if (mutreach_leaves=="reconnect_dcore_min") { if (M>0) { Py_ssize_t mutreach_leaves_maxiter = 10; // TODO: param while (mutreach_leaves_maxiter > 0) { Py_ssize_t num_changes = Cleaves_reconnect_dcore_min( n-1, n, M, mst_dist.data(), mst_ind.data(), nn_dist.data(), nn_ind.data() ); if (num_changes <= 0) break; mutreach_leaves_maxiter--; } } } else stop("invalid 'mutreach_leaves'"); // generate outputs Rcpp::IntegerMatrix out_mst_ind(n-1, 2); Rcpp::NumericVector out_mst_dist(n-1); for (Py_ssize_t i=0; i * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . 
*/ #ifndef __c_kdtree_boruvka_h #define __c_kdtree_boruvka_h #include "c_common.h" #include "c_kdtree.h" #include "c_disjoint_sets.h" namespace quitefastkdtree { template struct kdtree_node_clusterable : public kdtree_node_base { kdtree_node_clusterable* left; kdtree_node_clusterable* right; Py_ssize_t cluster_repr; //< representative point index if all descendants are in the same cluster, -1 otherwise struct t_dtb_data { FLOAT cluster_max_dist; FLOAT min_dcore /* M>1 */; }; struct t_qtb_data { FLOAT lastbest_dist; Py_ssize_t lastbest_ind; Py_ssize_t lastbest_from; }; union { t_dtb_data dtb_data; t_qtb_data qtb_data; }; kdtree_node_clusterable() { left = nullptr; // right = nullptr; } inline bool is_leaf() const { return left == nullptr /*&& right == nullptr*/; // either both null or none } }; template < typename FLOAT, Py_ssize_t D, typename DISTANCE=kdtree_distance_sqeuclid, typename NODE=kdtree_node_clusterable > struct kdtree_node_orderer { NODE* nearer_node; NODE* farther_node; FLOAT nearer_dist; FLOAT farther_dist; kdtree_node_orderer(const FLOAT* x, NODE* to1, NODE* to2) // QTB, STB { nearer_dist = DISTANCE::point_node( x, to1->bbox_min.data(), to1->bbox_max.data() ); farther_dist = DISTANCE::point_node( x, to2->bbox_min.data(), to2->bbox_max.data() ); if (nearer_dist <= farther_dist) { nearer_node = to1; farther_node = to2; } else { std::swap(nearer_dist, farther_dist); nearer_node = to2; farther_node = to1; } } kdtree_node_orderer(NODE* from, NODE* to1, NODE* to2, bool use_min_dcore=false) // DTB { nearer_dist = DISTANCE::node_node( from->bbox_min.data(), from->bbox_max.data(), to1->bbox_min.data(), to1->bbox_max.data() ); farther_dist = DISTANCE::node_node( from->bbox_min.data(), from->bbox_max.data(), to2->bbox_min.data(), to2->bbox_max.data() ); if (use_min_dcore) { nearer_dist = max3(nearer_dist, from->dtb_data.min_dcore, to1->dtb_data.min_dcore); farther_dist = max3(farther_dist, from->dtb_data.min_dcore, to2->dtb_data.min_dcore); } if (nearer_dist <= 
farther_dist) { nearer_node = to1; farther_node = to2; } else { std::swap(nearer_dist, farther_dist); nearer_node = to2; farther_node = to1; } } }; /** A class enabling searching for the nearest neighbour * outside of the current point's cluster; * (for the "sesqui-tree" and "single-tree" Borůvka algo); it is thread-safe */ template < typename FLOAT, Py_ssize_t D, typename DISTANCE=kdtree_distance_sqeuclid, typename NODE=kdtree_node_clusterable > class kdtree_nearest_outsider { private: const FLOAT* data; ///< the dataset const FLOAT* dcore; ///< the "core" distances Py_ssize_t M; ///< the "smoothing factor const Py_ssize_t* ds_par; ///< points' cluster IDs (par[i]==ds.find(i)!) FLOAT nn_dist; ///< shortest distance Py_ssize_t nn_ind; ///< index of the nn Py_ssize_t nn_from; const FLOAT* x; ///< the point itself (shortcut) / first point NODE* curleaf; ///< nullptr or a whole leaf Py_ssize_t which; ///< for which point are we getting the nns / first point index Py_ssize_t cluster; ///< the point's / points' cluster template inline void point_vs_points(Py_ssize_t idx_from, Py_ssize_t idx_to) { const FLOAT* y = data+D*idx_from; for (Py_ssize_t j=idx_from; j= nn_dist) continue; FLOAT dd = DISTANCE::point_point(x, y); if (USE_DCORE) dd = max3(dd, dcore[which], dcore[j]); if (dd < nn_dist) { nn_dist = dd; nn_ind = j; } } } template void find_nn_single(const NODE* root) { if (root->cluster_repr == cluster) { // nothing to do - all are members of the x's cluster return; } if (root->is_leaf()/* || root->idx_to-root->idx_from <= max_brute_size*/) { if (which < root->idx_from || which >= root->idx_to) point_vs_points(root->idx_from, root->idx_to); else { point_vs_points(root->idx_from, which); point_vs_points(which+1, root->idx_to); } return; } kdtree_node_orderer sel( x, root->left, root->right ); if (sel.nearer_dist < nn_dist) { find_nn_single(sel.nearer_node); if (sel.farther_dist < nn_dist) find_nn_single(sel.farther_node); } } template void find_nn_multi(const NODE* 
root) { if (root->cluster_repr == curleaf->cluster_repr) { // nothing to do - all are members of the x's cluster return; } if (root->is_leaf()) { const FLOAT* _y = data+D*root->idx_from; for (Py_ssize_t j=root->idx_from; jidx_to; ++j, _y+=D) { if (curleaf->cluster_repr == ds_par[j]) continue; if (USE_DCORE && dcore[j] >= nn_dist) continue; const FLOAT* _x = x; for (Py_ssize_t i=curleaf->idx_from; iidx_to; ++i, _x+=D) { if (USE_DCORE && dcore[i] >= nn_dist) continue; FLOAT dd = DISTANCE::point_point(_x, _y); if (USE_DCORE) dd = max3(dd, dcore[i], dcore[j]); if (dd < nn_dist) { nn_dist = dd; nn_ind = j; nn_from = i; } } } return; } kdtree_node_orderer sel( curleaf, root->left, root->right ); if (sel.nearer_dist < nn_dist) { find_nn_multi(sel.nearer_node); if (sel.farther_dist < nn_dist) find_nn_multi(sel.farther_node); } } public: kdtree_nearest_outsider( const FLOAT* data, FLOAT* dcore, Py_ssize_t M, const Py_ssize_t* ds_par ) : data(data), dcore(dcore), M(M), ds_par(ds_par) { ; } /** * @param curleaf * @param root * @param nn_dist best nn_dist found so far for the current cluster */ void find_multi(NODE* curleaf, const NODE* root, FLOAT nn_dist=INFINITY) { this->nn_dist = nn_dist; this->nn_ind = -1; this->nn_from = -1; this->curleaf = curleaf; this->which = curleaf->idx_from; this->x = data+D*this->which; this->cluster = curleaf->cluster_repr; if (M>1) find_nn_multi(root); else find_nn_multi(root); } /** * @param which * @param root * @param nn_dist best nn_dist found so far for the current cluster */ void find_single(Py_ssize_t which, const NODE* root, FLOAT nn_dist=INFINITY) { this->nn_dist = nn_dist; this->nn_ind = -1; this->nn_from = which; this->curleaf = nullptr; this->which = which; this->x = data+D*this->which; this->cluster = ds_par[this->which]; if (M>1) find_nn_single(root); else find_nn_single(root); } inline FLOAT get_nn_dist() { return nn_dist; } inline Py_ssize_t get_nn_ind() { return nn_ind; } inline Py_ssize_t get_nn_from() { return nn_from; } }; 
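/* Illustrative note (not part of the package): the pruned search performed by
 * kdtree_nearest_outsider can be cross-checked against a plain linear scan.
 * The sketch below is a minimal brute-force counterpart under stated
 * assumptions: `Outsider`, `sqeuclid`, and `nearest_outsider` are hypothetical
 * names, `cluster` plays the role of ds_par, and both the distances and the
 * core distances are assumed to be squared, mirroring the squared Euclidean
 * DISTANCE used by default. An empty `dcore` stands for the plain Euclidean
 * case.

```cpp
// Hypothetical helper code for illustration only; not the library's API.
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

struct Outsider {
    std::ptrdiff_t ind;  // index of the nearest point from another cluster, or -1
    double dist;         // the corresponding (squared) distance
};

// Squared Euclidean distance between two d-dimensional points.
static double sqeuclid(const double* x, const double* y, std::size_t d)
{
    double s = 0.0;
    for (std::size_t u = 0; u < d; ++u) {
        double t = x[u] - y[u];
        s += t * t;
    }
    return s;
}

// Linear scan over all points j that are not in the cluster of `which`.
// If dcore is non-empty, the mutual reachability distance
// max(d(which, j), dcore[which], dcore[j]) is used instead of d(which, j).
static Outsider nearest_outsider(
    const std::vector<double>& data,             // row-major, n rows, d columns
    std::size_t d,
    const std::vector<double>& dcore,            // empty or of size n
    const std::vector<std::ptrdiff_t>& cluster,  // cluster ID of each point
    std::size_t which)
{
    const std::size_t n = cluster.size();
    const bool use_dcore = !dcore.empty();
    Outsider best{-1, std::numeric_limits<double>::infinity()};

    for (std::size_t j = 0; j < n; ++j) {
        if (cluster[j] == cluster[which]) continue;        // same cluster (covers j == which)
        if (use_dcore && dcore[j] >= best.dist) continue;  // j cannot improve the result

        double dd = sqeuclid(&data[which * d], &data[j * d], d);
        if (use_dcore) dd = std::max({dd, dcore[which], dcore[j]});

        if (dd < best.dist) {
            best.dist = dd;
            best.ind = static_cast<std::ptrdiff_t>(j);
        }
    }
    return best;
}
```

 * With the 1-D points 0, 1, 3, 10 and cluster labels 0, 0, 1, 2, the nearest
 * outsider of point 1 is point 2 (squared distance 4). Assigning point 2
 * a large core distance makes point 3 the preferred match instead, which is
 * the effect the USE_DCORE branches above account for.
 */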
template < typename FLOAT, Py_ssize_t D, typename DISTANCE=kdtree_distance_sqeuclid, typename NODE=kdtree_node_clusterable > class kdtree_boruvka : public kdtree { protected: FLOAT* tree_dist; ///< size n-1 Py_ssize_t* tree_ind; ///< size 2*(n-1) Py_ssize_t tree_edges; /// number of MST edges already found Py_ssize_t tree_iter; CDisjointSets ds; std::vector ncl_dist; // ncl_dist[find(i)] - distance to i's nn std::vector ncl_ind; // ncl_ind[find(i)] - index of i's nn std::vector ncl_from; // ncl_from[find(i)] - the relevant member of i const Py_ssize_t first_pass_max_brute_size; // used in the first iter (finding 1-nns) enum BORUVKA_TYPE { BORUVKA_STB, BORUVKA_QTB, BORUVKA_DTB }; BORUVKA_TYPE boruvka_variant; bool reset_nns; const Py_ssize_t mutreach_ties; // M>1 only std::vector lastbest_dist; // !use_dtb only std::vector lastbest_ind; // !use_dtb only const Py_ssize_t M; // mutual reachability distance - "smoothing factor" std::vector dcore; // distances to the M-th nns of each point if M>0 or 1-NN for M==0 std::vector Mnn_dist; // M nearest neighbours of each point if M>0 std::vector Mnn_ind; #if OPENMP_IS_ENABLED omp_lock_t omp_lock; #endif int omp_nthreads; std::vector leaves; // sesquitree only inline void tree_add(Py_ssize_t i, Py_ssize_t j, FLOAT d) { tree_ind[tree_edges*2+0] = i; tree_ind[tree_edges*2+1] = j; tree_dist[tree_edges] = d; ds.merge(i, j); tree_edges++; } void setup_leaves() { QUITEFASTMST_ASSERT(boruvka_variant == BORUVKA_QTB); leaves.resize(this->nleaves); Py_ssize_t _leafnum = 0; for (auto curnode = this->nodes.begin(); curnode != this->nodes.end(); ++curnode) { if (curnode->is_leaf()) { leaves[_leafnum++] = &(*curnode); curnode->qtb_data.lastbest_dist = 0.0; curnode->qtb_data.lastbest_ind = -1; curnode->qtb_data.lastbest_from = -1; } } QUITEFASTMST_ASSERT(_leafnum == this->nleaves); } void setup_min_dcore() { QUITEFASTMST_ASSERT(M>=1); QUITEFASTMST_ASSERT(boruvka_variant == BORUVKA_DTB); for (auto curnode = this->nodes.rbegin(); curnode != 
this->nodes.rend(); ++curnode) { if (curnode->is_leaf()) { curnode->dtb_data.min_dcore = dcore[curnode->idx_from]; for (Py_ssize_t i=curnode->idx_from+1; iidx_to; ++i) { if (dcore[i] < curnode->dtb_data.min_dcore) curnode->dtb_data.min_dcore = dcore[i]; } } else { // all descendants have already been processed as children in `nodes` occur after their parents curnode->dtb_data.min_dcore = std::min( curnode->left->dtb_data.min_dcore, curnode->right->dtb_data.min_dcore ); } } } void update_node_data() { // Performed in each iteration // ds.find(i) == ds.get_parent(i) for all i // nodes is a deque... for (auto curnode = this->nodes.rbegin(); curnode != this->nodes.rend(); ++curnode) { if (boruvka_variant == BORUVKA_DTB) curnode->dtb_data.cluster_max_dist = INFINITY; // for DTB if (curnode->cluster_repr >= 0) { curnode->cluster_repr = ds.get_parent(curnode->cluster_repr); continue; } if (curnode->is_leaf()) { curnode->cluster_repr = ds.get_parent(curnode->idx_from); for (Py_ssize_t j=curnode->idx_from+1; jidx_to; ++j) { if (curnode->cluster_repr != ds.get_parent(j)) { curnode->cluster_repr = -1; // not all are members of the same cluster break; } } if (curnode->cluster_repr >= 0 && boruvka_variant == BORUVKA_QTB) { Py_ssize_t i=curnode->idx_from; curnode->qtb_data.lastbest_dist = lastbest_dist[i]; curnode->qtb_data.lastbest_ind = lastbest_ind[i]; curnode->qtb_data.lastbest_from = i; for (++i; iidx_to; ++i) { if (curnode->qtb_data.lastbest_dist > lastbest_dist[i]) { curnode->qtb_data.lastbest_dist = lastbest_dist[i]; curnode->qtb_data.lastbest_ind = lastbest_ind[i]; curnode->qtb_data.lastbest_from = i; } } } } else { // all descendants have already been processed as children in `nodes` occur after their parents if (curnode->left->cluster_repr >= 0) { // if both children only feature members of the same cluster, update the cluster repr for the current node; if (curnode->left->cluster_repr == curnode->right->cluster_repr) curnode->cluster_repr = 
curnode->left->cluster_repr; } // else curnode->cluster_repr = -1; // it already is } } } void update_nn_data() { if (boruvka_variant != BORUVKA_DTB && tree_iter > 1) { // if tree_iter == 1, then all lastbest_ind[i] == -1; // we don't get access to individual NNs in DTB, except in the 1st iter for (Py_ssize_t i=0; in; ++i) { if (lastbest_ind[i] < 0) continue; Py_ssize_t ds_find_i = ds.get_parent(i); Py_ssize_t ds_find_j = ds.get_parent(lastbest_ind[i]); if (ds_find_i == ds_find_j) { lastbest_ind[i] = -1; continue; } if (ncl_dist[ds_find_i] > lastbest_dist[i]) { ncl_dist[ds_find_i] = lastbest_dist[i]; ncl_ind[ds_find_i] = lastbest_ind[i]; ncl_from[ds_find_i] = i; } // ok even if nthreads>1 if (ncl_dist[ds_find_j] > lastbest_dist[i]) { ncl_dist[ds_find_j] = lastbest_dist[i]; ncl_ind[ds_find_j] = i; ncl_from[ds_find_j] = lastbest_ind[i]; } } } if (M > 1) { // reuse M NNs if d==dcore[i] as an initialiser to ncl_ind/dist/from; // good speed-up sometimes (we'll be happy with any match; leaves // are formed in the 1st iteration of the algorithm) for (Py_ssize_t i=0; in; ++i) { Py_ssize_t ds_find_i = ds.get_parent(i); if (ncl_dist[ds_find_i] <= lastbest_dist[i] || lastbest_dist[i] > dcore[i]) continue; Py_ssize_t bestj = -1; if (mutreach_ties <= -2 || mutreach_ties >= 2) { FLOAT bestdcorej = (mutreach_ties <= -2)?INFINITY:(-INFINITY); for (Py_ssize_t v=0; v= dcore[j]) { if ( (mutreach_ties <= -2 && dcore[j] <= bestdcorej) || // choose lowest dcore, but farthest (mutreach_ties >= 2 && dcore[j] > bestdcorej) // choose highest dcore, but closest ) { bestj = j; bestdcorej = dcore[j]; } } } } else { for (Py_ssize_t v=0; v= dcore[j]) { bestj = j; break; // other candidates have d_M >= dcore[i] anyway } } } if (bestj >= 0) { ncl_dist[ds_find_i] = dcore[i]; ncl_ind[ds_find_i] = bestj; ncl_from[ds_find_i] = i; lastbest_dist[i] = dcore[i]; // actually unchanged lastbest_ind[i] = bestj; if (boruvka_variant == BORUVKA_DTB || omp_nthreads == 1) { // TODO: describe why omp_nthreads == 1 
only Py_ssize_t ds_find_j = ds.get_parent(bestj); if (ncl_dist[ds_find_j] > dcore[i]) { ncl_dist[ds_find_j] = dcore[i]; ncl_ind[ds_find_j] = i; ncl_from[ds_find_j] = bestj; } } } } } } void find_mst_first_1() { QUITEFASTMST_ASSERT(M <= 1); const Py_ssize_t k = 1; for (Py_ssize_t i=0; in; ++i) ncl_dist[i] = INFINITY; for (Py_ssize_t i=0; in; ++i) ncl_ind[i] = -1; // find 1-nns of each point using max_brute_size, // preferably with max_brute_size>max_leaf_size #if OPENMP_IS_ENABLED #pragma omp parallel for schedule(static) #endif for (Py_ssize_t i=0; in; ++i) { kdtree_kneighbours nn( this->data, nullptr, i, &ncl_dist[i], &ncl_ind[i], k, first_pass_max_brute_size ); nn.find(&this->nodes[0], /*reset=*/false); if (omp_nthreads == 1 && ncl_dist[i] < ncl_dist[ncl_ind[i]]) { // the speed up is rather small... ncl_dist[ncl_ind[i]] = ncl_dist[i]; ncl_ind[ncl_ind[i]] = i; } lastbest_ind[i] = -1; // inactive lastbest_dist[i] = ncl_dist[i]; if (M > 0) { dcore[i] = ncl_dist[i]; Mnn_dist[i] = ncl_dist[i]; Mnn_ind[i] = ncl_ind[i]; } } // connect nearest neighbours with each other for (Py_ssize_t i=0; in; ++i) { if (ds.find(i) != ds.find(ncl_ind[i])) { tree_add(i, ncl_ind[i], ncl_dist[i]); } } } void find_mst_first_M() { QUITEFASTMST_ASSERT(M>0); // find the M NNs of each point for (size_t i=0; in; ++i) { kdtree_kneighbours nn( this->data, nullptr, i, Mnn_dist.data()+M*i, Mnn_ind.data()+M*i, M, first_pass_max_brute_size ); nn.find(&this->nodes[0], /*reset=*/false); dcore[i] = Mnn_dist[i*M+(M-1)]; lastbest_dist[i] = dcore[i]; // merely a lower bound lastbest_ind[i] = -(M+1); } // k-nns w.r.t. Euclidean distances are not necessarily // k-nns w.r.t. M-mutreach; k-nns have d_M >= d_core // dcore[i] is definitely the smallest possible d_M(i, *); i!=* // we can only be sure that j is a NN if d_M(i, j) == dcore[i] // but NNs w.r.t. d_M might be ambiguous - we might want to pick, // e.g., the farthest or the closest one w.r.t. 
the original dist if (mutreach_ties <= -2 || mutreach_ties >= 2) { for (Py_ssize_t i=0; in; ++i) { // mutreach_ties <= -2 - connect with j whose dcore[j] is the smallest // mutreach_ties >= 2 - connect with j whose dcore[j] is the largest Py_ssize_t ds_find_i = ds.find(i); Py_ssize_t bestj = -1; FLOAT bestdcorej = (mutreach_ties <= -2)?INFINITY:(-INFINITY); for (Py_ssize_t v=0; v= dcore[j] && ds_find_i != ds.find(j)) { if ( (mutreach_ties <= -2 && dcore[j] <= bestdcorej) || // choose lowest dcore, but farthest (mutreach_ties >= 2 && dcore[j] > bestdcorej) // choose highest dcore, but closest ) { bestj = j; bestdcorej = dcore[j]; } } } if (bestj >= 0) tree_add(i, bestj, dcore[i]); } } else { for (Py_ssize_t i=0; in; ++i) { // connect with j whose d(i,j) is the smallest (1>=mutreach_ties>0) or largest (-1<=mutreach_ties<0) // stops searching early, because the original distances are sorted Py_ssize_t ds_find_i = ds.find(i); for (Py_ssize_t v=0; v= dcore[j] && ds_find_i != ds.find(j)) { // j is the nearest neighbour of i w.r.t. mutreach dist. tree_add(i, j, dcore[i]); break; // other candidates have d_M >= dcore[i] anyway } } } } } void find_mst_first() { // the 1st iteration: connect nearest neighbours with each other if (M <= 1) find_mst_first_1(); else find_mst_first_M(); } template inline void leaf_vs_leaf_dtb(NODE* roota, NODE* rootb) { // assumes ds.find(i) == ds.get_parent(i) for all i! 
        const FLOAT* _x = this->data + roota->idx_from*D;
        for (Py_ssize_t i=roota->idx_from; i<roota->idx_to; ++i, _x += D) {
            Py_ssize_t ds_find_i = ds.get_parent(i);
            if (USE_DCORE && dcore[i] >= ncl_dist[ds_find_i]) continue;

            for (Py_ssize_t j=rootb->idx_from; j<rootb->idx_to; ++j) {
                Py_ssize_t ds_find_j = ds.get_parent(j);
                if (ds_find_i == ds_find_j) continue;
                if (USE_DCORE && dcore[j] >= ncl_dist[ds_find_i]) continue;

                FLOAT dij = DISTANCE::point_point(_x, this->data+j*D);
                if (USE_DCORE) dij = max3(dij, dcore[i], dcore[j]);

                if (dij < ncl_dist[ds_find_i]) {
                    ncl_dist[ds_find_i] = dij;
                    ncl_ind[ds_find_i]  = j;
                    ncl_from[ds_find_i] = i;
                }
            }
        }
    }


    void find_mst_next_dtb(NODE* roota, NODE* rootb)
    {
        // we have ds.find(i) == ds.get_parent(i) for all i!

        if (roota->cluster_repr >= 0 && roota->cluster_repr == rootb->cluster_repr) {
            // both consist of members of the same cluster - nothing to do
            return;
        }

        if (roota->is_leaf()) {
            if (rootb->is_leaf()) {
                if (M>1) leaf_vs_leaf_dtb<true>(roota, rootb);
                else     leaf_vs_leaf_dtb<false>(roota, rootb);

                if (roota->cluster_repr >= 0) {
                    // all points are in the same cluster
                    roota->dtb_data.cluster_max_dist = ncl_dist[roota->cluster_repr];
                }
                else {
                    roota->dtb_data.cluster_max_dist = ncl_dist[ds.get_parent(roota->idx_from)];
                    for (Py_ssize_t i=roota->idx_from+1; i<roota->idx_to; ++i) {
                        FLOAT dist_cur = ncl_dist[ds.get_parent(i)];
                        if (dist_cur > roota->dtb_data.cluster_max_dist)
                            roota->dtb_data.cluster_max_dist = dist_cur;
                    }
                }
            }
            else {
                // nearer node first -> faster!
                kdtree_node_orderer<FLOAT, D, DISTANCE, NODE> sel(roota, rootb->left, rootb->right, (M>1));

                // prune nodes too far away if we have better candidates
                if (roota->dtb_data.cluster_max_dist > sel.nearer_dist) {
                    find_mst_next_dtb(roota, sel.nearer_node);
                    if (roota->dtb_data.cluster_max_dist > sel.farther_dist)
                        find_mst_next_dtb(roota, sel.farther_node);
                }

                // roota->dtb_data.cluster_max_dist updated above
            }
        }
        else {
            // roota is not a leaf
            if (rootb->is_leaf()) {
                kdtree_node_orderer<FLOAT, D, DISTANCE, NODE> sel(rootb, roota->left, roota->right, (M>1));
                if (sel.nearer_node->dtb_data.cluster_max_dist > sel.nearer_dist)
                    find_mst_next_dtb(sel.nearer_node, rootb);
                if (sel.farther_node->dtb_data.cluster_max_dist > sel.farther_dist)  // separate if!
                    find_mst_next_dtb(sel.farther_node, rootb);
            }
            else {
                kdtree_node_orderer<FLOAT, D, DISTANCE, NODE> sel(roota->left, rootb->left, rootb->right, (M>1));
                if (roota->left->dtb_data.cluster_max_dist > sel.nearer_dist) {
                    find_mst_next_dtb(roota->left, sel.nearer_node);
                    if (roota->left->dtb_data.cluster_max_dist > sel.farther_dist)
                        find_mst_next_dtb(roota->left, sel.farther_node);
                }

                sel = kdtree_node_orderer<FLOAT, D, DISTANCE, NODE>(roota->right, rootb->left, rootb->right, (M>1));
                if (roota->right->dtb_data.cluster_max_dist > sel.nearer_dist) {
                    find_mst_next_dtb(roota->right, sel.nearer_node);
                    if (roota->right->dtb_data.cluster_max_dist > sel.farther_dist)
                        find_mst_next_dtb(roota->right, sel.farther_node);
                }
            }

            roota->dtb_data.cluster_max_dist = std::max(
                roota->left->dtb_data.cluster_max_dist,
                roota->right->dtb_data.cluster_max_dist
            );
        }
    }


    void find_mst_next_dtb()
    {
        find_mst_next_dtb(&this->nodes[0], &this->nodes[0]);
    }


    void find_nn_next_multi(NODE* curleaf)  // QTB
    {
        QUITEFASTMST_ASSERT(curleaf->cluster_repr == ds.get_parent(curleaf->idx_from));
        Py_ssize_t ds_find_i = curleaf->cluster_repr;

        // NOTE: assumption: no race condition/atomic read...
FLOAT ncl_dist_cur = ncl_dist[ds_find_i]; if (ncl_dist_cur <= curleaf->qtb_data.lastbest_dist) return; if (curleaf->qtb_data.lastbest_ind >= 0) { Py_ssize_t ds_find_j = ds.get_parent(curleaf->qtb_data.lastbest_ind); if (ds_find_i == ds_find_j) curleaf->qtb_data.lastbest_ind = -1; } if (curleaf->qtb_data.lastbest_ind < 0) { kdtree_nearest_outsider nn( this->data, (M>1)?(this->dcore.data()):NULL, M, ds.get_parents() ); nn.find_multi(curleaf, &this->nodes[0], reset_nns?INFINITY:ncl_dist_cur); if (nn.get_nn_ind() >= 0) { curleaf->qtb_data.lastbest_ind = nn.get_nn_ind(); curleaf->qtb_data.lastbest_dist = nn.get_nn_dist(); curleaf->qtb_data.lastbest_from = nn.get_nn_from(); } } if (curleaf->qtb_data.lastbest_ind < 0) return; #if OPENMP_IS_ENABLED if (omp_nthreads > 1) omp_set_lock(&omp_lock); #endif if (curleaf->qtb_data.lastbest_dist < ncl_dist[ds_find_i]) { ncl_dist[ds_find_i] = curleaf->qtb_data.lastbest_dist; ncl_ind[ds_find_i] = curleaf->qtb_data.lastbest_ind; ncl_from[ds_find_i] = curleaf->qtb_data.lastbest_from; } if (omp_nthreads == 1) { // otherwise slightly worse performance... Py_ssize_t ds_find_j = ds.get_parent(curleaf->qtb_data.lastbest_ind); QUITEFASTMST_ASSERT(ds_find_i != ds_find_j); if (curleaf->qtb_data.lastbest_dist < ncl_dist[ds_find_j]) { ncl_dist[ds_find_j] = curleaf->qtb_data.lastbest_dist; ncl_ind[ds_find_j] = curleaf->qtb_data.lastbest_from; ncl_from[ds_find_j] = curleaf->qtb_data.lastbest_ind; } } #if OPENMP_IS_ENABLED if (omp_nthreads > 1) omp_unset_lock(&omp_lock); #endif } void find_nn_next_single(Py_ssize_t i) // STB, QTB { // Py_ssize_t i = (M<1)?u:ptperm[u]; Py_ssize_t ds_find_i = ds.get_parent(i); // NOTE: assumption: no race condition/atomic read... 
FLOAT ncl_dist_cur = ncl_dist[ds_find_i]; if (ncl_dist_cur <= lastbest_dist[i]) return; // speeds up even for M==0 if (lastbest_ind[i] < 0) { kdtree_nearest_outsider nn( this->data, (M>1)?(this->dcore.data()):NULL, M, ds.get_parents() ); nn.find_single(i, &this->nodes[0], reset_nns?INFINITY:ncl_dist_cur); lastbest_ind[i] = nn.get_nn_ind(); // can be negative if best found >= ncl_dist_cur if (lastbest_ind[i] >= 0) lastbest_dist[i] = nn.get_nn_dist(); } if (lastbest_ind[i] < 0) return; #if OPENMP_IS_ENABLED if (omp_nthreads > 1) omp_set_lock(&omp_lock); #endif if (lastbest_dist[i] < ncl_dist[ds_find_i]) { ncl_dist[ds_find_i] = lastbest_dist[i]; ncl_ind[ds_find_i] = lastbest_ind[i]; ncl_from[ds_find_i] = i; } if (omp_nthreads == 1) { // otherwise slightly worse performance... Py_ssize_t ds_find_j = ds.get_parent(lastbest_ind[i]); QUITEFASTMST_ASSERT(ds_find_i != ds_find_j); if (lastbest_dist[i] < ncl_dist[ds_find_j]) { ncl_dist[ds_find_j] = lastbest_dist[i]; ncl_ind[ds_find_j] = i; ncl_from[ds_find_j] = lastbest_ind[i]; } } #if OPENMP_IS_ENABLED if (omp_nthreads > 1) omp_unset_lock(&omp_lock); #endif } void find_mst_next_qtb() { // find the point from another cluster that is closest to the i-th point // i.e., the nearest "alien" // go leaf-by-leaf #if OPENMP_IS_ENABLED #pragma omp parallel for schedule(static) #endif for (Py_ssize_t l=0; lnleaves; ++l) { NODE* curleaf = leaves[l]; if (curleaf->cluster_repr >= 0 && curleaf->idx_to - curleaf->idx_from > 1) // all elems in the same cluster { find_nn_next_multi(curleaf); } else { for (Py_ssize_t i=curleaf->idx_from; iidx_to; ++i) { find_nn_next_single(i); // updates lastbest_dist[i] and ncl_dist[ds_find_i] if necessary } } } } void find_mst_next_stb() { // find the point from another cluster that is closest to the i-th point // i.e., the nearest "alien" #if OPENMP_IS_ENABLED #pragma omp parallel for schedule(static) #endif for (Py_ssize_t i=0; in; ++i) { find_nn_next_single(i); // updates lastbest_dist[i] and 
ncl_dist[ds_find_i] if necessary } } void find_mst() { QUITEFASTMST_PROFILER_USE QUITEFASTMST_PROFILER_START // the 1st iteration: connect nearest neighbours with each other find_mst_first(); QUITEFASTMST_PROFILER_STOP("find_mst_first") if (boruvka_variant == BORUVKA_DTB && M>1) { QUITEFASTMST_PROFILER_START setup_min_dcore(); QUITEFASTMST_PROFILER_STOP("setup_min_dcore") } if (boruvka_variant == BORUVKA_QTB) { QUITEFASTMST_PROFILER_START setup_leaves(); QUITEFASTMST_PROFILER_STOP("setup_leaves") } std::vector ds_parents(this->n); Py_ssize_t ds_k; while (tree_edges < this->n-1) { #if QUITEFASTMST_R Rcpp::checkUserInterrupt(); // throws an exception, not a longjmp #elif QUITEFASTMST_PYTHON if (PyErr_CheckSignals() != 0) throw std::runtime_error("signal caught"); #endif tree_iter++; QUITEFASTMST_PROFILER_START ds_k = 0; for (Py_ssize_t i=0; in; ++i) { if (i == this->ds.find(i)) { ncl_dist[i] = INFINITY; ncl_ind[i] = -1; ncl_from[i] = -1; ds_parents[ds_k++] = i; } } // now ds.find(i) == ds.get_parent(i) for all i update_nn_data(); // update lastbest_dist etc. update_node_data(); // reset cluster_max_dist and set up cluster_repr if (boruvka_variant == BORUVKA_DTB) find_mst_next_dtb(); else if (boruvka_variant == BORUVKA_QTB) // TODO find_mst_next_qtb(); else find_mst_next_stb(); for (Py_ssize_t j=0; j= 0 && ncl_ind[i] < this->n); if (ds.find(i) != ds.find(ncl_ind[i])) { QUITEFASTMST_ASSERT(ncl_from[i] >= 0 && ncl_from[i] < this->n); QUITEFASTMST_ASSERT(ds.find(i) == ds.find(ncl_from[i])); tree_add(ncl_from[i], ncl_ind[i], ncl_dist[i]); } } QUITEFASTMST_PROFILER_STOP("find_mst iter #%d (tree_edges=%d)", (int)tree_iter, tree_edges) } } public: kdtree_boruvka() : kdtree() { omp_nthreads = -1; } /**! 
     * see c_fastmst.h for the description of the parameters,
     * no need to repeat that here
     */
    kdtree_boruvka(
        FLOAT* data, const Py_ssize_t n, const Py_ssize_t M=0,
        const Py_ssize_t max_leaf_size=16,
        const Py_ssize_t first_pass_max_brute_size=16,
        const FLOAT boruvka_variant=1.5,
        const Py_ssize_t mutreach_ties=-2
    ) :
        kdtree<FLOAT, D, DISTANCE, NODE>(data, n, max_leaf_size),
        tree_edges(0), tree_iter(0), ds(n),
        ncl_dist(n), ncl_ind(n), ncl_from(n),
        first_pass_max_brute_size(first_pass_max_brute_size),
        mutreach_ties(mutreach_ties), M(M)
    {
        QUITEFASTMST_ASSERT(M>=0);

        if (M > 0) {
            dcore.resize(n);
            Mnn_dist.resize(n*M);
            Mnn_ind.resize(n*M);
        }

        lastbest_dist.resize(n);
        lastbest_ind.resize(n);

        if (boruvka_variant == 2.0)
            this->boruvka_variant = BORUVKA_DTB;
        else if (boruvka_variant == 1.0)
            this->boruvka_variant = BORUVKA_STB;
        else
            this->boruvka_variant = BORUVKA_QTB;  // 1.5 ;)

        reset_nns = (M<=1);  // plain Euclidean MST benefits from this

        #if OPENMP_IS_ENABLED
        omp_nthreads = Comp_get_max_threads();
        if (omp_nthreads > 1) omp_init_lock(&omp_lock);
        #else
        omp_nthreads = 1;
        #endif
    }


    ~kdtree_boruvka()
    {
        #if OPENMP_IS_ENABLED
        if (omp_nthreads > 1) omp_destroy_lock(&omp_lock);
        #endif
    }


    void mst(FLOAT* tree_dist, Py_ssize_t* tree_ind)
    {
        this->tree_dist = tree_dist;
        this->tree_ind = tree_ind;

        if (ds.get_k() != (Py_ssize_t)this->n) ds.reset();

        tree_edges = 0;
        tree_iter = 0;
        for (Py_ssize_t i=0; i<this->n-1; ++i) tree_dist[i] = INFINITY;
        for (Py_ssize_t i=0; i<2*(this->n-1); ++i) tree_ind[i] = -1;

        // nodes is a deque...
        for (auto curnode = this->nodes.rbegin(); curnode != this->nodes.rend(); ++curnode)
            curnode->cluster_repr = -1;

        find_mst();
    }


    inline const FLOAT* get_Mnn_dist() const {
        QUITEFASTMST_ASSERT(M>0);
        return this->Mnn_dist.data();
    }

    inline const Py_ssize_t* get_Mnn_ind() const {
        QUITEFASTMST_ASSERT(M>0);
        return this->Mnn_ind.data();
    }

    inline const FLOAT* get_dcore() const {
        QUITEFASTMST_ASSERT(M>0);
        return this->dcore.data();
    }

    inline Py_ssize_t get_M() const { return this->M; }
};



/*!
* Find a minimum spanning tree of X (in the tree) * * see _mst_euclid_kdtree * * @param tree a pre-built K-d tree containing n points * @param tree_dist [out] size n*k * @param tree_ind [out] size n*k * @param nn_dist [out] distances to M nns of each point * @param nn_ind [out] indexes of M nns of each point */ template void mst( TREE& tree, FLOAT* tree_dist, // size n-1 Py_ssize_t* tree_ind, // size 2*(n-1), FLOAT* nn_dist=nullptr, // size n*M Py_ssize_t* nn_ind=nullptr // size n*M ) { tree.mst(tree_dist, tree_ind); Py_ssize_t n = tree.get_n(); Py_ssize_t M = tree.get_M(); const Py_ssize_t* perm = tree.get_perm(); if (M > 0) { QUITEFASTMST_ASSERT(nn_dist); QUITEFASTMST_ASSERT(nn_ind); const FLOAT* _nn_dist = tree.get_Mnn_dist(); const Py_ssize_t* _nn_ind = tree.get_Mnn_ind(); for (Py_ssize_t i=0; i= 0 && i1 < n); QUITEFASTMST_ASSERT(i2 >= 0 && i2 < n); tree_ind[2*i+0] = perm[i1]; tree_ind[2*i+1] = perm[i2]; } // the edges are not ordered, use Cmst_order } }; // namespace #endif quitefastmst/src/Makevars0000644000176200001440000000046215036407432015274 0ustar liggesusersPKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS) -DQUITEFASTMST_R -Isrc/ -I../src/ PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS) SOURCES = \ knn_euclid_brute.cpp \ mst_euclid_brute.cpp \ knn_euclid_kdtree.cpp \ mst_euclid_kdtree.cpp \ RcppFastmst.cpp \ RcppExports.cpp OBJECTS = $(SOURCES:.cpp=.o) quitefastmst/src/mst_euclid_brute.cpp0000644000176200001440000003705115133657460017647 0ustar liggesusers/* This file is part of the 'quitefastmst' package. * * Copyleft (C) 2025-2026, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #include "c_fastmst.h" #include "c_common.h" #include "c_mst_triple.h" #include #include #include #include #define MST_OMP_CHUNK_SIZE 1024 /*! Order the n-1 edges of a spanning tree of n points in place, * w.r.t. the weights increasingly, resolving ties if needed based on * the points' IDs. * * @param n * @param mst_dist [in/out] size m * @param mst_ind [in/out] size m*2 */ template void Ctree_order(Py_ssize_t m, FLOAT* tree_dist, Py_ssize_t* tree_ind) { QUITEFASTMST_PROFILER_USE std::vector< CMstTriple > mst(m); for (Py_ssize_t i=0; i(tree_ind[2*i+0], tree_ind[2*i+1], tree_dist[i]); } QUITEFASTMST_PROFILER_START std::sort(mst.begin(), mst.end()); QUITEFASTMST_PROFILER_STOP("mst sort"); for (Py_ssize_t i=0; i 0 * @param tree_dist [in/out] size m - edge weights * @param tree_ind [in/out] size m*2 - edges of the tree * @param nn_dist [out] n*M Euclidean distances * to the n points' M nearest neighbours * @param nn_ind [out] n*M indexes of the n points' M nearest neighbours * * @return the number of leaves reconnected */ template Py_ssize_t Cleaves_reconnect_dcore_min( Py_ssize_t m, Py_ssize_t n, Py_ssize_t M, FLOAT* tree_dist, Py_ssize_t* tree_ind, FLOAT* nn_dist, Py_ssize_t* nn_ind ) { std::vector degrees(n, 0); for (Py_ssize_t i=0; i<2*m; ++i) { QUITEFASTMST_ASSERT(tree_ind[i] >= 0 && tree_ind[i] < n); degrees[tree_ind[i]]++; } std::vector closest_inlier(n, -1); for (Py_ssize_t v=0; v 0); if (degrees[v] == 1) continue; // a leaf FLOAT dcore_v = nn_dist[v*M+(M-1)]; for (Py_ssize_t j=0; j dcore_u) continue; // v cannot become adjacent to u (minimality condition!) 
            if (closest_inlier[u] < 0 || dcore_v < nn_dist[closest_inlier[u]*M+(M-1)])
                closest_inlier[u] = v;  // choose v if u is amongst M NNs of v and v itself has "small" core distance
        }
    }

    Py_ssize_t num_changes = 0;
    for (Py_ssize_t i=0; i<m; ++i) {
        for (Py_ssize_t j=0; j<2; ++j) {
            Py_ssize_t u = tree_ind[i*2+j];
            if (degrees[u] > 1) continue;  // we want u to be a leaf

            Py_ssize_t v = tree_ind[i*2+(1-j)];
            QUITEFASTMST_ASSERT(degrees[v] > 1);  // v is a non-leaf

            Py_ssize_t w = closest_inlier[u];
            if (w >= 0 && w != v) {
                // w will now be the vertex adjacent to u
                num_changes++;
                degrees[v]--;
                degrees[w]++;
                tree_ind[i*2+(1-j)] = w;
            }
        }
    }

    return num_changes;
}


/*! A Jarník (Prim/Dijkstra)-like algorithm for determining
 * a(*) Euclidean minimum spanning tree (MST) or
 * one w.r.t. an M-mutual reachability distance.
 *
 * If `M>1`, the spanning tree is the smallest w.r.t. the degree-`M`
 * mutual reachability distance [9]_ given by
 * :math:`d_M(i, j)=\\max\\{ c_M(i), c_M(j), d(i, j)\\}`, where :math:`d(i,j)`
 * is the Euclidean distance between the `i`-th and the `j`-th point,
 * and :math:`c_M(i)` is the `i`-th `M`-core distance defined as the distance
 * between the `i`-th point and its `M`-th nearest neighbour
 * (not including the query points themselves).
 *
 * Note that [9]_ defines the core distance as the distance to the (M-1)-th NN.
 *
 * (*) We note that if there are many pairs of equidistant points,
 * there can be many minimum spanning trees. In particular, it is likely
 * that there are point pairs with the same mutual reachability distances.
 * To make the definition less ambiguous (albeit with no guarantees),
 * internally, we resolve ties as follows.
 * The `mutreach_ties` argument indicates the preference towards
 * connecting to farther (-1) or closer (1) points with respect to the original
 * metric, or to points having smaller (-2) or larger (2) core distances.
 *
 * Time complexity: O(n^2). It is assumed that M is rather small
 * (say, M <= 20). If M>1, all the pairwise distances are computed twice
 * (first for the neighbours/core distance, then to determine the tree).
* * * References: * ---------- * * [1] V. Jarník, O jistém problému minimálním, * Práce Moravské Přírodovědecké Společnosti 6, 1930, 57–63 * * [2] C.F. Olson, Parallel algorithms for hierarchical clustering, * Parallel Computing 21(8), 1995, 1313–1325 * * [3] R. Prim, Shortest connection networks and some generalizations, * The Bell System Technical Journal 36(6), 1957, 1389–1401 * * [9] R.J.G.B. Campello, D. Moulavi, J. Sander, Density-based clustering based * on hierarchical density estimates, Lecture Notes in Computer Science 7819, * 2013, 160–172, https://doi.org/10.1007/978-3-642-37456-2_14 * * * @param X [destroyable] a C-contiguous data matrix, shape n*d * @param n number of rows * @param d number of columns * @param M the degree of the "core" distance if M > 0 * @param mst_dist [out] vector of length n-1, gives weights of the * resulting MST edges in nondecreasing order * @param mst_ind [out] vector of length 2*(n-1), representing * a c_contiguous array of shape (n-1,2), defining the edges * corresponding to mst_d, with mst_i[j,0] < mst_i[j,1] for all j * @param nn_dist [out] NULL for M==0 or the n*M Euclidean distances * to the n points' M nearest neighbours * @param nn_ind [out] NULL for M==0 or the n*M indexes of the n points' * M nearest neighbours * @param mutreach_ties adjustment for mutual reachability distance ambiguity * (for M>1): -2 and 2 prefer connecting to points with, * respectively, smaller and larger core distance; -1 and 1 prefer, * respectively, farther and closer nearest neighbours * @param verbose should we output diagnostic/progress messages? 
*/ template void Cmst_euclid_brute( FLOAT* X, Py_ssize_t n, Py_ssize_t d, Py_ssize_t M, FLOAT* mst_dist, Py_ssize_t* mst_ind, FLOAT* nn_dist, Py_ssize_t* nn_ind, Py_ssize_t mutreach_ties, bool verbose ) { if (n <= 0) throw std::domain_error("n <= 0"); if (d <= 0) throw std::domain_error("d <= 0"); if (M < 0) throw std::domain_error("M < 0"); if (M >= n) throw std::domain_error("M >= n"); QUITEFASTMST_ASSERT(mst_dist); QUITEFASTMST_ASSERT(mst_ind); bool mutreach_adj_via_dcore = (std::abs(mutreach_ties) >= 2); FLOAT mutreach_adj = ((mutreach_ties<0)?-1:1); FLOAT mutreach_adj_factr = 0.00000011920928955078125; // 2**-23 std::vector d_core; if (M > 1) { d_core.resize(n); QUITEFASTMST_ASSERT(nn_dist); QUITEFASTMST_ASSERT(nn_ind); Cknn1_euclid_brute(X, n, d, M, nn_dist, nn_ind, /*squared=*/true, verbose); for (Py_ssize_t i=0; i ncl_ind(n); std::vector ncl_dist(n, INFINITY); // ncl_dist[j] = d_M(j, ncl_ind[j]) std::vector ncl_dist_adj; // ncl_dist[j] = adjustment for d_M(j, ncl_ind[j])'s ambiguity if (M > 1) ncl_dist_adj.resize(n, INFINITY); std::vector remaining_ind(n); // a.k.a. perm for (Py_ssize_t i=0; i > mst(n-1); //QUITEFASTMST_PRINT("here1!\n"); for (Py_ssize_t i=1; i removed #else if (M <= 1) { #if OPENMP_IS_ENABLED #pragma omp parallel for schedule(static,MST_OMP_CHUNK_SIZE) /* chunks get smaller and smaller... 
*/ #endif for (Py_ssize_t j=i; j d_core_max) { dd_adj = 0.0; } else { dd = d_core_max; if (mutreach_adj_via_dcore) dd_adj = mutreach_adj*(-d_core_min+mutreach_adj_factr*dd_orig); else dd_adj = mutreach_adj*dd_orig; } if (dd < ncl_dist[j] || (dd == ncl_dist[j] && dd_adj < ncl_dist_adj[j])) { ncl_dist[j] = dd; ncl_dist_adj[j] = dd_adj; ncl_ind[j] = i-1; } } } #endif // we want to include the vertex that is closest to // the vertices of the tree constructed so far Py_ssize_t best_j = i; for (Py_ssize_t j=i+1; j 1 && ncl_dist[j] == ncl_dist[best_j] && ncl_dist_adj[j] < ncl_dist_adj[best_j])) best_j = j; } if (best_j != i) { // with swapping we get better locality of reference std::swap(remaining_ind[best_j], remaining_ind[i]); std::swap(ncl_dist[best_j], ncl_dist[i]); std::swap(ncl_ind[best_j], ncl_ind[i]); for (Py_ssize_t u=0; u 1) { std::swap(d_core[best_j], d_core[i]); std::swap(ncl_dist_adj[best_j], ncl_dist_adj[i]); } } // don't visit i again - it's being added to the tree // connect best_remaining_ind with the tree: add a new edge {best_remaining_ind, ncl_ind[best_remaining_ind]} QUITEFASTMST_ASSERT(ncl_ind[i] < i); //QUITEFASTMST_PRINT("%d %d %f\n", remaining_ind[ncl_ind[i]], remaining_ind[i], ncl_dist[i]); mst[i-1] = CMstTriple(remaining_ind[ncl_ind[i]], remaining_ind[i], ncl_dist[i], /*order=*/true); if (verbose) QUITEFASTMST_PRINT("\b\b\b\b%3d%%", (int)((n-1+n-i-1)*(i+1)*100/n/(n-1))); if (i % MST_OMP_CHUNK_SIZE == MST_OMP_CHUNK_SIZE-1) { #if QUITEFASTMST_R Rcpp::checkUserInterrupt(); // throws an exception, not a longjmp #elif QUITEFASTMST_PYTHON if (PyErr_CheckSignals() != 0) throw std::runtime_error("signal caught"); #endif } } //QUITEFASTMST_PRINT("here2!\n"); // sort the resulting MST edges in increasing order w.r.t. 
d std::sort(mst.begin(), mst.end()); for (Py_ssize_t i=0; i=1 if (M > 1) { for (Py_ssize_t i=0; i mst_dist[i]) { nn_dist[mst_ind[2*i+0]] = mst_dist[i]; nn_ind[mst_ind[2*i+0]] = mst_ind[2*i+1]; } if (nn_dist[mst_ind[2*i+1]] > mst_dist[i]) { nn_dist[mst_ind[2*i+1]] = mst_dist[i]; nn_ind[mst_ind[2*i+1]] = mst_ind[2*i+0]; } } } if (verbose) QUITEFASTMST_PRINT("\b\b\b\bdone.\n"); } // instantiate: template void Ctree_order(Py_ssize_t m, float* tree_dist, Py_ssize_t* tree_ind); template void Ctree_order(Py_ssize_t m, double* tree_dist, Py_ssize_t* tree_ind); template Py_ssize_t Cleaves_reconnect_dcore_min( Py_ssize_t m, Py_ssize_t n, Py_ssize_t M, float* tree_dist, Py_ssize_t* tree_ind, float* nn_dist, Py_ssize_t* nn_ind ); template Py_ssize_t Cleaves_reconnect_dcore_min( Py_ssize_t m, Py_ssize_t n, Py_ssize_t M, double* tree_dist, Py_ssize_t* tree_ind, double* nn_dist, Py_ssize_t* nn_ind ); template void Cmst_euclid_brute( float* X, Py_ssize_t n, Py_ssize_t d, Py_ssize_t M, float* mst_dist, Py_ssize_t* mst_ind, float* nn_dist, Py_ssize_t* nn_ind, Py_ssize_t mutreach_ties, bool verbose ); template void Cmst_euclid_brute( double* X, Py_ssize_t n, Py_ssize_t d, Py_ssize_t M, double* mst_dist, Py_ssize_t* mst_ind, double* nn_dist, Py_ssize_t* nn_ind, Py_ssize_t mutreach_ties, bool verbose ); quitefastmst/src/c_mst_triple.h0000644000176200001440000000345015132156655016442 0ustar liggesusers/* * Copyleft (C) 2018-2026, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. 
* If this is not the case, refer to . */ #ifndef __c_mst_triple_h #define __c_mst_triple_h /*! Represents an edge in a weighted graph. * Features a comparer used to sort the edges w.r.t. increasing weights; * more precisely, lexicographically w.r.t. (d, i1, i2). */ template <class T> struct CMstTriple { Py_ssize_t i1; //!< first vertex defining an edge Py_ssize_t i2; //!< second vertex defining an edge T d; //!< edge weight CMstTriple() {} CMstTriple(Py_ssize_t i1, Py_ssize_t i2, T d, bool order=true) { QUITEFASTMST_ASSERT(i1 != i2); QUITEFASTMST_ASSERT(i1 >= 0); QUITEFASTMST_ASSERT(i2 >= 0); this->d = d; if (!order || (i1 < i2)) { this->i1 = i1; this->i2 = i2; } else { this->i1 = i2; this->i2 = i1; } } bool operator<(const CMstTriple& other) const { if (d == other.d) { if (i1 == other.i1) return i2 < other.i2; else return i1 < other.i1; } else return d < other.d; } }; #endif quitefastmst/src/c_fastmst.h0000644000176200001440000001134615132160206015727 0ustar liggesusers/* Minimum spanning tree and k-nearest neighbour algorithms * (quite fast in low-dimensional spaces, currently Euclidean distance only) * * * [1] V. Jarník, O jistém problému minimálním, * Práce Moravské Přírodovědecké Společnosti 6, 1930, 57–63 * * [2] C.F. Olson, Parallel algorithms for hierarchical clustering, * Parallel Computing 21(8), 1995, 1313–1325 * * [3] R. Prim, Shortest connection networks and some generalizations, * The Bell System Technical Journal 36(6), 1957, 1389–1401 * * [4] O. Borůvka, O jistém problému minimálním, * Práce Moravské Přírodovědecké Společnosti 3, 1926, 37–58 * * [5] W.B. March, R. Parikshit, A.G. Gray, Fast Euclidean minimum spanning * tree: Algorithm, analysis, and applications, Proc. 16th ACM SIGKDD Intl. * Conf. Knowledge Discovery and Data Mining (KDD '10), 2010, 603–612 * * [6] J.L. Bentley, Multidimensional binary search trees used for associative * searching, Communications of the ACM 18(9), 509–517, 1975, * https://doi.org/10.1145/361002.361007 * * [7] S.
Maneewongvatana, D.M. Mount, It's okay to be skinny, if your friends * are fat, The 4th CGC Workshop on Computational Geometry, 1999 * * [8] N. Sample, M. Haines, M. Arnold, T. Purcell, Optimizing search * strategies in K-d Trees, 5th WSES/IEEE Conf. on Circuits, Systems, * Communications & Computers (CSCC'01), 2001 * * [9] R.J.G.B. Campello, D. Moulavi, J. Sander, Density-based clustering based * on hierarchical density estimates, Lecture Notes in Computer Science 7819, * 2013, 160–172, https://doi.org/10.1007/978-3-642-37456-2_14 * * [10] R.J.G.B. Campello, D. Moulavi, A. Zimek, J. Sander, Hierarchical * density estimates for data clustering, visualization, and outlier detection, * ACM Transactions on Knowledge Discovery from Data (TKDD) 10(1), * 2015, 1–51, https://doi.org/10.1145/2733381 * * [11] L. McInnes, J. Healy, Accelerated hierarchical density-based * clustering, IEEE Intl. Conf. Data Mining Workshops (ICMDW), 2017, 33–42, * https://doi.org/10.1109/ICDMW.2017.12 * * [12] M. Gagolewski, *quitefastmst*, in preparation, 2026, TODO * * * Copyleft (C) 2025-2026, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . 
*/ #ifndef __c_fastmst_h #define __c_fastmst_h #include "c_common.h" template void Cknn1_euclid_brute( const FLOAT* X, Py_ssize_t n, Py_ssize_t d, Py_ssize_t k, FLOAT* nn_dist, Py_ssize_t* nn_ind, bool squared=false, bool verbose=false ); template void Cknn2_euclid_brute( const FLOAT* X, Py_ssize_t n, const FLOAT* Y, Py_ssize_t m, Py_ssize_t d, Py_ssize_t k, FLOAT* nn_dist, Py_ssize_t* nn_ind, bool squared=false, bool verbose=false ); template void Ctree_order(Py_ssize_t m, FLOAT* tree_dist, Py_ssize_t* tree_ind); template Py_ssize_t Cleaves_reconnect_dcore_min( Py_ssize_t m, Py_ssize_t n, Py_ssize_t M, FLOAT* tree_dist, Py_ssize_t* tree_ind, FLOAT* nn_dist, Py_ssize_t* nn_ind ); template void Cmst_euclid_brute( FLOAT* X, Py_ssize_t n, Py_ssize_t d, Py_ssize_t M, FLOAT* mst_dist, Py_ssize_t* mst_ind, FLOAT* nn_dist, Py_ssize_t* nn_ind, Py_ssize_t mutreach_ties=-2, bool verbose=false ); template void Cknn2_euclid_kdtree( FLOAT* X, const Py_ssize_t n, const FLOAT* Y, const Py_ssize_t m, const Py_ssize_t d, const Py_ssize_t k, FLOAT* nn_dist, Py_ssize_t* nn_ind, Py_ssize_t max_leaf_size=32, bool squared=false, bool verbose=false ); template void Cknn1_euclid_kdtree( FLOAT* X, const Py_ssize_t n, const Py_ssize_t d, const Py_ssize_t k, FLOAT* nn_dist, Py_ssize_t* nn_ind, Py_ssize_t max_leaf_size=32, bool squared=false, bool verbose=false ); template void Cmst_euclid_kdtree( FLOAT* X, Py_ssize_t n, Py_ssize_t d, Py_ssize_t M, FLOAT* mst_dist, Py_ssize_t* mst_ind, FLOAT* nn_dist=nullptr, Py_ssize_t* nn_ind=nullptr, Py_ssize_t max_leaf_size=32, Py_ssize_t first_pass_max_brute_size=32, FLOAT boruvka_variant=1.5, Py_ssize_t mutreach_ties=-2, bool verbose=false ); #endif quitefastmst/src/mst_euclid_kdtree.cpp0000644000176200001440000002527315132160053017771 0ustar liggesusers/* This file is part of the 'quitefastmst' package. 
* * Copyleft (C) 2025-2026, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #include "c_fastmst.h" #include "c_common.h" #include #include "c_kdtree_boruvka.h" /** * helper function called by Cmst_euclid_kdtree below */ template void _mst_euclid_kdtree( FLOAT* X, Py_ssize_t n, Py_ssize_t M, FLOAT* mst_dist, Py_ssize_t* mst_ind, FLOAT* nn_dist, Py_ssize_t* nn_ind, Py_ssize_t max_leaf_size, Py_ssize_t first_pass_max_brute_size, FLOAT boruvka_variant, Py_ssize_t mutreach_ties, bool /*verbose*/ ) { using DISTANCE=quitefastkdtree::kdtree_distance_sqeuclid; QUITEFASTMST_PROFILER_USE QUITEFASTMST_PROFILER_START quitefastkdtree::kdtree_boruvka tree(X, n, M, max_leaf_size, first_pass_max_brute_size, boruvka_variant, mutreach_ties); QUITEFASTMST_PROFILER_STOP("tree init") QUITEFASTMST_PROFILER_START quitefastkdtree::mst( tree, mst_dist, mst_ind, nn_dist, nn_ind ); QUITEFASTMST_PROFILER_STOP("mst call") QUITEFASTMST_PROFILER_START for (Py_ssize_t i=0; i= 1) { for (Py_ssize_t i=0; i1`, the spanning tree is the smallest w.r.t. the degree-`M` * mutual reachability distance [9]_ given by * :math:`d_M(i, j)=\\max\\{ c_M(i), c_M(j), d(i, j)\\}`, where :math:`d(i,j)` * is the Euclidean distance between the `i`-th and the `j`-th point, * and :math:`c_M(i)` is the `i`-th `M`-core distance defined as the distance * between the `i`-th point and its `M`-th nearest neighbour * (not including the query points themselves). 
* In clustering and density estimation, `M` plays the role of a smoothing * factor; see [10]_ and the references therein for discussion. * * Note that [9]_ defines the core distance as the distance to the (M-1)-th NN. * * (\*) We note that if there are many pairs of equidistant points, * there can be many minimum spanning trees. In particular, it is likely * that there are point pairs with the same mutual reachability distances. * The ``mutreach_ties`` argument serves as an adjustment to address this * (partially). * * The implemented algorithm assumes that `M` is rather small; say, `M <= 20`. * * Our implementation of K-d trees [6]_ has been quite optimised; amongst * others, it has good locality of reference (at the cost of making a * copy of the input dataset), features the sliding midpoint (midrange) rule * suggested in [7]_, node pruning strategies inspired by some ideas * from [8]_, and a couple of further tuneups proposed by the current author. * * The "single-tree" version of the Borůvka algorithm is naively * parallelisable: in every iteration, it seeks each point's nearest "alien", * i.e., the nearest point thereto from another cluster. * The "dual-tree" Borůvka version of the algorithm is, in principle, based * on [5]_. As far as our implementation is concerned, the dual-tree approach * is only faster in 2- and 3-dimensional spaces, for `M<=1`, and in * a single-threaded setting. For another (approximate) adaptation * of the dual-tree algorithm to the mutual reachability distance, see [11]_. * * The "sesqui-tree" variant (by the current author) is a mixture of the two * approaches: it compares leaves against the full tree. It is usually * faster than the single- and dual-tree methods in very low dimensional * spaces and usually not much slower than the single-tree variant otherwise. * * Nevertheless, it is generally known that K-d trees perform well only in * spaces of rather low intrinsic dimensionality (a.k.a. the "curse"). 
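The nearest-"alien" idea behind the single-tree variant can be illustrated with the following rough sketch. It is not the implementation used here: the K-d tree query is replaced by a brute-force scan (so each round costs O(n^2) distance evaluations), only the plain Euclidean case (M=0) is covered, no parallelism is attempted, and the names `DisjointSetsSketch` and `boruvka_weight_sketch` are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <numeric>
#include <tuple>
#include <vector>

// A tiny union-find structure (hypothetical; path halving, no ranks).
struct DisjointSetsSketch {
    std::vector<std::size_t> parent;
    explicit DisjointSetsSketch(std::size_t n) : parent(n) {
        std::iota(parent.begin(), parent.end(), std::size_t(0));
    }
    std::size_t find(std::size_t x) {
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    }
    bool merge(std::size_t a, std::size_t b) {  // false if already together
        a = find(a); b = find(b);
        if (a == b) return false;
        parent[b] = a;
        return true;
    }
};

// Total weight of the Euclidean MST of the n rows of the row-major
// n*d matrix X, via Borůvka rounds with a brute-force nearest-alien search.
double boruvka_weight_sketch(const std::vector<double>& X,
    std::size_t n, std::size_t d)
{
    // (squared distance, smaller index, larger index)
    using Edge = std::tuple<double, std::size_t, std::size_t>;
    const double inf = std::numeric_limits<double>::infinity();
    DisjointSetsSketch ds(n);
    double total = 0.0;
    std::size_t num_edges = 0;
    while (num_edges + 1 < n) {
        // best[c] = lightest edge leaving the cluster whose root is c
        std::vector<Edge> best(n, Edge(inf, n, n));
        for (std::size_t i = 0; i < n; ++i) {
            std::size_t ci = ds.find(i);
            for (std::size_t j = 0; j < n; ++j) {
                if (ds.find(j) == ci) continue;  // same cluster: not an alien
                double dd = 0.0;
                for (std::size_t u = 0; u < d; ++u) {
                    double t = X[i*d+u] - X[j*d+u];
                    dd += t*t;
                }
                // Ties are broken by the indexes, so the edges selected
                // in one round cannot form a cycle.
                Edge e(dd, std::min(i, j), std::max(i, j));
                if (e < best[ci]) best[ci] = e;
            }
        }
        for (std::size_t c = 0; c < n; ++c) {
            if (std::get<1>(best[c]) == n) continue;  // c was not a root
            if (ds.merge(std::get<1>(best[c]), std::get<2>(best[c]))) {
                total += std::sqrt(std::get<0>(best[c]));
                ++num_edges;
            }
        }
    }
    return total;
}
```

Since every round at least halves the number of clusters, the outer loop runs O(log n) times; the point of the tree-based variants described above is to make the per-point alien search much cheaper than the linear scan used in this sketch.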
* * * References: * ---------- * * [4] O. Borůvka, O jistém problému minimálním. Práce Mor. Přírodověd. Spol. * V Brně III 3, 1926, 37–58 * * [5] W.B. March, R. Parikshit, A.G. Gray, Fast Euclidean minimum spanning * tree: algorithm, analysis, and applications, Proc. 16th ACM SIGKDD Intl. * Conf. Knowledge Discovery and Data Mining (KDD '10), 2010, 603–612 * * [6] J.L. Bentley, Multidimensional binary search trees used for associative * searching, Communications of the ACM 18(9), 509–517, 1975, * https://doi.org/10.1145/361002.361007 * * [7] S. Maneewongvatana, D.M. Mount, It's okay to be skinny, if your friends * are fat, The 4th CGC Workshop on Computational Geometry, 1999 * * [8] N. Sample, M. Haines, M. Arnold, T. Purcell, Optimizing search * strategies in K-d Trees, 5th WSES/IEEE Conf. on Circuits, Systems, * Communications & Computers (CSCC'01), 2001 * * [9] R.J.G.B. Campello, D. Moulavi, J. Sander, Density-based clustering based * on hierarchical density estimates, Lecture Notes in Computer Science 7819, * 2013, 160–172, https://doi.org/10.1007/978-3-642-37456-2_14 * * [10] R.J.G.B. Campello, D. Moulavi, A. Zimek, J. Sander, Hierarchical * density estimates for data clustering, visualization, and outlier detection, * ACM Transactions on Knowledge Discovery from Data (TKDD) 10(1), * 2015, 1–51, https://doi.org/10.1145/2733381 * * [11] L. McInnes, J. Healy, Accelerated hierarchical density-based * clustering, IEEE Intl. Conf.
Data Mining Workshops (ICDMW), 2017, 33–42, * https://doi.org/10.1109/ICDMW.2017.12 * * * @param X [destroyable] a C-contiguous data matrix, shape n*d * @param n number of rows * @param d number of columns, 2<=d<=20 * @param M the degree of the "core" distance if M > 0 * @param mst_dist [out] a vector of length n-1, gives weights of the * resulting MST edges in nondecreasing order * @param mst_ind [out] a vector of length 2*(n-1), representing * a c_contiguous array of shape (n-1,2), defining the edges * corresponding to mst_dist, with mst_ind[j,0] < mst_ind[j,1] for all j * @param nn_dist [out] NULL for M==0 or the n*M distances to the n points' * M nearest neighbours * @param nn_ind [out] NULL for M==0 or the n*M indexes of the n points' * M nearest neighbours * @param max_leaf_size maximal number of points in the K-d tree's leaves * @param first_pass_max_brute_size minimal number of points in a node to treat * it as a leaf (unless it's actually a leaf) in the first iteration * of the algorithm * @param boruvka_variant whether a dual- (2.0), a single- (1.0) or * a sesqui-tree (otherwise) Borůvka algorithm should be used * @param mutreach_ties (M>1 only) adjustment for mutual reachability distance * ambiguity: -2 and 2 prefer connecting to points with, * respectively, smaller and larger core distance; -1 and 1 prefer, * respectively, farther and closer nearest neighbours * @param verbose should we output diagnostic/progress messages?
*/ template void Cmst_euclid_kdtree( FLOAT* X, Py_ssize_t n, Py_ssize_t d, Py_ssize_t M, FLOAT* mst_dist, Py_ssize_t* mst_ind, FLOAT* nn_dist, Py_ssize_t* nn_ind, Py_ssize_t max_leaf_size, Py_ssize_t first_pass_max_brute_size, FLOAT boruvka_variant, Py_ssize_t mutreach_ties, bool verbose ) { QUITEFASTMST_PROFILER_USE if (n <= 0) throw std::domain_error("n <= 0"); if (d <= 0) throw std::domain_error("d <= 0"); if (M < 0) throw std::domain_error("M < 0"); if (M >= n) throw std::domain_error("M >= n"); QUITEFASTMST_ASSERT(mst_dist); QUITEFASTMST_ASSERT(mst_ind); if (max_leaf_size <= 0) throw std::domain_error("max_leaf_size <= 0"); //if (first_pass_max_brute_size <= 0) // does no harm - will have no effect if (verbose) QUITEFASTMST_PRINT("[quitefastmst] Computing the MST... "); #define IF_d_CALL_MST_EUCLID_KDTREE(D_) \ if (d == D_) \ _mst_euclid_kdtree(\ X, n, M, mst_dist, mst_ind, \ nn_dist, nn_ind, max_leaf_size, first_pass_max_brute_size, \ boruvka_variant, mutreach_ties, verbose \ ) /* LMAO; templates... */ QUITEFASTMST_PROFILER_START /**/ IF_d_CALL_MST_EUCLID_KDTREE(2); else IF_d_CALL_MST_EUCLID_KDTREE(3); else IF_d_CALL_MST_EUCLID_KDTREE(4); else IF_d_CALL_MST_EUCLID_KDTREE(5); else IF_d_CALL_MST_EUCLID_KDTREE(6); else IF_d_CALL_MST_EUCLID_KDTREE(7); else IF_d_CALL_MST_EUCLID_KDTREE(8); else IF_d_CALL_MST_EUCLID_KDTREE(9); else IF_d_CALL_MST_EUCLID_KDTREE(10); else IF_d_CALL_MST_EUCLID_KDTREE(11); else IF_d_CALL_MST_EUCLID_KDTREE(12); else IF_d_CALL_MST_EUCLID_KDTREE(13); else IF_d_CALL_MST_EUCLID_KDTREE(14); else IF_d_CALL_MST_EUCLID_KDTREE(15); else IF_d_CALL_MST_EUCLID_KDTREE(16); else IF_d_CALL_MST_EUCLID_KDTREE(17); else IF_d_CALL_MST_EUCLID_KDTREE(18); else IF_d_CALL_MST_EUCLID_KDTREE(19); else IF_d_CALL_MST_EUCLID_KDTREE(20); else { // TODO: does it work for d==1? // although then a trivial, faster algorithm exists... 
throw std::runtime_error("d should be between 2 and 20"); } QUITEFASTMST_PROFILER_STOP("Cmst_euclid_kdtree"); if (verbose) QUITEFASTMST_PRINT("done.\n"); } // instantiate: template void Cmst_euclid_kdtree( float* X, Py_ssize_t n, Py_ssize_t d, Py_ssize_t M, float* mst_dist, Py_ssize_t* mst_ind, float* nn_dist, Py_ssize_t* nn_ind, Py_ssize_t max_leaf_size, Py_ssize_t first_pass_max_brute_size, float boruvka_variant, Py_ssize_t mutreach_ties, bool verbose ); template void Cmst_euclid_kdtree( double* X, Py_ssize_t n, Py_ssize_t d, Py_ssize_t M, double* mst_dist, Py_ssize_t* mst_ind, double* nn_dist, Py_ssize_t* nn_ind, Py_ssize_t max_leaf_size, Py_ssize_t first_pass_max_brute_size, double boruvka_variant, Py_ssize_t mutreach_ties, bool verbose ); quitefastmst/src/knn_euclid_brute.cpp0000644000176200001440000001674315132156655017636 0ustar liggesusers/* This file is part of the 'quitefastmst' package. * * Copyleft (C) 2025-2026, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #include "c_fastmst.h" #include "c_common.h" #include #include #define MST_OMP_CHUNK_SIZE 1024 /*! Determine the k nearest neighbours of each point * w.r.t. the Euclidean distance * * Exactly n*(n-1)/2 distance computations are performed. * * It is assumed that each query point is not its own neighbour. * * Worst-case time complexity: O(n*(n-1)/2*d*k). * So, use for small k, say, k <= 20. 
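The pairwise scheme described above can be sketched as follows. This is a simplified, single-threaded illustration under assumed names (`knn_brute_sketch` is hypothetical), not the function documented here. Each of the n*(n-1)/2 squared distances is computed once and offered to both endpoints, whose sorted neighbour lists are updated by insertion, which is why a small k matters.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical sketch: for each of the n rows of the row-major n*d matrix X,
// find its k nearest neighbours (excluding the row itself).
// Assumes 1 <= k < n. nn_dist receives squared Euclidean distances;
// both outputs are row-major arrays of shape (n,k).
void knn_brute_sketch(const std::vector<double>& X,
    std::size_t n, std::size_t d, std::size_t k,
    std::vector<double>& nn_dist, std::vector<std::size_t>& nn_ind)
{
    nn_dist.assign(n*k, std::numeric_limits<double>::infinity());
    nn_ind.assign(n*k, n);  // n marks "no neighbour found yet"

    // Insert candidate j into the sorted neighbour list of point i.
    auto offer = [&](std::size_t i, std::size_t j, double dd) {
        if (dd >= nn_dist[i*k+k-1]) return;  // not better than the k-th one
        std::size_t l = k-1;
        while (l > 0 && dd < nn_dist[i*k+l-1]) {  // shift worse entries right
            nn_dist[i*k+l] = nn_dist[i*k+l-1];
            nn_ind[i*k+l]  = nn_ind[i*k+l-1];
            --l;
        }
        nn_dist[i*k+l] = dd;
        nn_ind[i*k+l]  = j;
    };

    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i+1; j < n; ++j) {
            double dd = 0.0;
            for (std::size_t u = 0; u < d; ++u) {
                double t = X[i*d+u] - X[j*d+u];
                dd += t*t;
            }
            offer(i, j, dd);  // one distance serves both endpoints
            offer(j, i, dd);
        }
    }
}
```

Taking square roots of `nn_dist` afterwards would give the ordinary (non-squared) Euclidean distances.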
* * * @param X the n input points in R^d; a c_contiguous array, shape (n,d) * @param n number of points * @param d number of features * @param k number of nearest neighbours requested * @param nn_dist [out] a c_contiguous array, shape (n,k), * dist[i,j] gives the weight of the (undirected) edge {i, ind[i,j]} * @param nn_ind [out] a c_contiguous array, shape (n,k), * (undirected) edge definition, interpreted as {i, ind[i,j]} * @param squared return the squared Euclidean distance? * @param verbose output diagnostic/progress messages? */ template void Cknn1_euclid_brute( const FLOAT* X, Py_ssize_t n, Py_ssize_t d, Py_ssize_t k, FLOAT* nn_dist, Py_ssize_t* nn_ind, bool squared, bool verbose ) { if (n <= 0) throw std::domain_error("n <= 0"); if (d <= 0) throw std::domain_error("d <= 0"); if (k <= 0) throw std::domain_error("k <= 0"); if (k >= n) throw std::domain_error("k >= n"); if (verbose) QUITEFASTMST_PRINT("[quitefastmst] Determining the nearest neighbours... "); for (Py_ssize_t i=0; i dij(n); for (Py_ssize_t i=0; i 0 && dd < nn_dist[j*k+l-1]) { nn_dist[j*k+l] = nn_dist[j*k+l-1]; nn_ind[j*k+l] = nn_ind[j*k+l-1]; l -= 1; } nn_dist[j*k+l] = dd; nn_ind[j*k+l] = i; } } // This part can't be (naively) parallelised for (Py_ssize_t j=i+1; j 0 && dij[j] < nn_dist[i*k+l-1]) { nn_dist[i*k+l] = nn_dist[i*k+l-1]; nn_ind[i*k+l] = nn_ind[i*k+l-1]; l -= 1; } nn_dist[i*k+l] = dij[j]; nn_ind[i*k+l] = j; } } // if (verbose) QUITEFASTMST_PRINT("\b\b\b\b%3d%%", (n-1+n-i-1)*(i+1)*100/n/(n-1)); if (i % MST_OMP_CHUNK_SIZE == MST_OMP_CHUNK_SIZE-1) { #if QUITEFASTMST_R Rcpp::checkUserInterrupt(); // throws an exception, not a longjmp #elif QUITEFASTMST_PYTHON if (PyErr_CheckSignals() != 0) throw std::runtime_error("signal caught"); #endif } } if (!squared) { for (Py_ssize_t i=0; i void Cknn2_euclid_brute( const FLOAT* X, Py_ssize_t n, const FLOAT* Y, Py_ssize_t m, Py_ssize_t d, Py_ssize_t k, FLOAT* nn_dist, Py_ssize_t* nn_ind, bool squared, bool verbose ) { if (n <= 0) throw 
std::domain_error("n <= 0"); if (m <= 0) throw std::domain_error("m <= 0"); if (d <= 0) throw std::domain_error("d <= 0"); if (k <= 0) throw std::domain_error("k <= 0"); if (k > n) throw std::domain_error("k > n"); if (verbose) QUITEFASTMST_PRINT("[quitefastmst] Determining the nearest neighbours... "); for (Py_ssize_t i=0; i 0 && dd < nn_dist[i*k+l-1]) { nn_dist[i*k+l] = nn_dist[i*k+l-1]; nn_ind[i*k+l] = nn_ind[i*k+l-1]; l -= 1; } nn_dist[i*k+l] = dd; nn_ind[i*k+l] = j; } } } if (!squared) { for (Py_ssize_t i=0; i( const float* X, Py_ssize_t n, Py_ssize_t d, Py_ssize_t k, float* nn_dist, Py_ssize_t* nn_ind, bool squared, bool verbose ); template void Cknn2_euclid_brute( const float* X, Py_ssize_t n, const float* Y, Py_ssize_t m, Py_ssize_t d, Py_ssize_t k, float* nn_dist, Py_ssize_t* nn_ind, bool squared, bool verbose ); template void Cknn1_euclid_brute( const double* X, Py_ssize_t n, Py_ssize_t d, Py_ssize_t k, double* nn_dist, Py_ssize_t* nn_ind, bool squared, bool verbose ); template void Cknn2_euclid_brute( const double* X, Py_ssize_t n, const double* Y, Py_ssize_t m, Py_ssize_t d, Py_ssize_t k, double* nn_dist, Py_ssize_t* nn_ind, bool squared, bool verbose ); quitefastmst/NAMESPACE0000644000176200001440000000032115036406753014227 0ustar liggesusers# Generated by roxygen2: do not edit by hand export(knn_euclid) export(mst_euclid) export(omp_get_max_threads) export(omp_set_num_threads) importFrom(Rcpp,evalCpp) useDynLib(quitefastmst, .registration=TRUE) quitefastmst/man/0000755000176200001440000000000015037457433013571 5ustar liggesusersquitefastmst/man/mst_euclid.Rd0000644000176200001440000002246715143122521016204 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/RcppExports.R \encoding{UTF-8} \name{mst_euclid} \alias{mst_euclid} \title{Euclidean and Mutual Reachability Minimum Spanning Trees} \usage{ mst_euclid( X, M = 0L, algorithm = "auto", max_leaf_size = 0L, first_pass_max_brute_size = 0L, mutreach_ties = 
"dist_min", mutreach_leaves = "keep", verbose = FALSE ) } \arguments{ \item{X}{the "database"; a matrix of shape \eqn{n\times d}} \item{M}{the smoothing factor a.k.a. the degree of the mutual reachability distance; \eqn{M\leq 1} gives the ordinary Euclidean distance} \item{algorithm}{\code{"auto"}, \code{"single_kd_tree"}, \code{"sesqui_kd_tree"}, \code{"dual_kd_tree"}, or \code{"brute"}; K-d trees can only be used for \eqn{d} between 2 and 20 only; \code{"auto"} selects \code{"sesqui_kd_tree"} for \eqn{d\leq 20}. \code{"brute"} is used otherwise} \item{max_leaf_size}{maximal number of points in the K-d tree leaves; smaller leaves use more memory, yet are not necessarily faster; use \code{0} to select the default value, currently set to 32 for the single-tree and sesqui-tree and 8 for the dual-tree Borůvka algorithm} \item{first_pass_max_brute_size}{minimal number of points in a node to treat it as a leaf (unless it actually is a leaf) in the first iteration of the algorithm; use \code{0} to select the default value, currently set to 32} \item{mutreach_ties}{adjustment for mutual reachability distance ambiguity (for \eqn{M>1}); one of \code{"dcore_min"}, \code{"dist_max"}, \code{"dist_min"} (default), or \code{"dcore_max"}} \item{mutreach_leaves}{a way to postprocess the leaves of the computed tree; one of \code{"keep"} (default: do nothing), or \code{"reconnect_dcore_min"} (try reconnecting leaves to inner vertices which have them amongst their M nearest neighbours; prefer vertices of the smallest core distance)} \item{verbose}{whether to print diagnostic messages} } \value{ A list with two $(M=0)$ or four $(M>0)$ elements, \code{mst.index} and \code{mst.dist}, and additionally \code{nn.index} and \code{nn.dist}. \code{mst.index} is a matrix with \eqn{n-1} rows and \eqn{2} columns, whose rows define the tree edges. \code{mst.dist} is a vector of length \eqn{n-1} giving the weights of the corresponding edges. 
The tree edges are ordered with respect to weights nondecreasingly, and then by the indexes (lexicographic ordering of the \code{(weight, index1, index2)} triples). For each \code{i}, it holds \code{mst_ind[i,1]1}, the spanning tree is the smallest with respect to the degree-\eqn{M} mutual reachability distance (Campello et al., 2013) given by \eqn{d_M(i, j)=\max\{ c_M(i), c_M(j), d(i, j)\}}, where \eqn{d(i,j)} is the standard Euclidean distance between the \eqn{i}-th and the \eqn{j}-th point, and \eqn{c_M(i)} is the \eqn{i}-th \eqn{M}-core distance defined as the distance between the \eqn{i}-th point and its \eqn{M}-th nearest neighbour (not including the query point itself). Note that (Campello et al., 2013) defines the core distance as the distance to the \eqn{(M-1)}-th nearest neighbour (or the \eqn{M}-th one, but including self). } \details{ (*) Note that if there are many pairs of equidistant points, there can be many minimum spanning trees. In particular, it is likely that there are point pairs with the same mutual reachability distances. To make the definition unambiguous, the \code{mutreach_ties} argument indicates the preference towards connecting to farther/closer points with respect to the original metric, or having smaller/larger core distances in cases of tied distances; see (Gagolewski, 2026). Empirically, \code{mutreach_ties="dcore_min"} and \code{mutreach_leaves="reconnect_dcore_min"} leads to MSTs with more leaves and hubs. The brute force method always resolves all ties, whilst, for efficiency, the K-d tree-based algorithms use this adjustment only for the first \eqn{M} nearest neighbours, so the resulting trees might be slightly different. The implemented algorithms, see the \code{algorithm} parameter, assume that \eqn{M} is rather small. 
Our implementation of K-d trees (Bentley, 1975) has been quite optimised; amongst others, it has good locality of reference (at the cost of making a copy of the input dataset), features the sliding midpoint (midrange) rule suggested by Maneewongvatana and Mount (1999), node pruning strategies inspired by some ideas from (Sample et al., 2001), and a couple of further tuneups proposed by the current author. The "single-tree" version of the Borůvka algorithm is parallelised: in every iteration, it seeks each point's nearest "alien", i.e., the nearest point thereto from another cluster. The "dual-tree" Borůvka version of the algorithm is, in principle, based on (March et al., 2010). As far as our implementation is concerned, the dual-tree approach is often only faster in 2- and 3-dimensional spaces, for \eqn{M\leq 1}, and in a single-threaded setting. For another (approximate) adaptation of the dual-tree algorithm to mutual reachability distances, see (McInnes and Healy, 2017). The "sesqui-tree" variant (by the current author) is a mixture of the two approaches: it compares leaves against the full tree and can be run in parallel. It is usually faster than the single- and dual-tree methods in very low dimensional spaces and usually not much slower than the single-tree variant otherwise. Nevertheless, it is well-known that K-d trees perform well only in spaces of low intrinsic dimensionality (the "curse"). For high \eqn{d}, the "brute-force" algorithm is recommended. Here, we provide a parallelised (see Olson, 1995) version of the Jarník (1930) (a.k.a. Prim, 1957) algorithm, where the distances are computed on the fly (only once for \eqn{M\leq 1}). The number of threads used is controlled via the \code{OMP_NUM_THREADS} environment variable or via the \code{\link{omp_set_num_threads}} function at runtime. For best speed, consider building the package from sources using, e.g., \code{-O3 -march=native} compiler flags.
}
\examples{
library("datasets")
data("iris")
X <- jitter(as.matrix(iris[1:2]))  # some data
tree <- mst_euclid(X)  # Euclidean MST of X
plot(X, asp=1, las=1)
segments(X[tree$mst.index[, 1], 1], X[tree$mst.index[, 1], 2],
         X[tree$mst.index[, 2], 1], X[tree$mst.index[, 2], 2])
}
\references{ V. Jarník, O jistém problému minimálním, \emph{Práce Moravské Přírodovědecké Společnosti} 6, 1930, 57–63. C.F. Olson, Parallel algorithms for hierarchical clustering, \emph{Parallel Computing} 21(8), 1995, 1313–1325. R. Prim, Shortest connection networks and some generalizations, \emph{The Bell System Technical Journal} 36(6), 1957, 1389–1401. O. Borůvka, O jistém problému minimálním, \emph{Práce Moravské Přírodovědecké Společnosti} 3, 1926, 37–58. W.B. March, P. Ram, A.G. Gray, Fast Euclidean minimum spanning tree: Algorithm, analysis, and applications, \emph{Proc. 16th ACM SIGKDD Intl. Conf. Knowledge Discovery and Data Mining (KDD '10)}, 2010, 603–612. J.L. Bentley, Multidimensional binary search trees used for associative searching, \emph{Communications of the ACM} 18(9), 509–517, 1975, \doi{10.1145/361002.361007} S. Maneewongvatana, D.M. Mount, It's okay to be skinny, if your friends are fat, \emph{4th CGC Workshop on Computational Geometry}, 1999 N. Sample, M. Haines, M. Arnold, T. Purcell, Optimizing search strategies in K-d Trees, \emph{5th WSES/IEEE Conf. on Circuits, Systems, Communications & Computers} (CSCC'01), 2001 R.J.G.B. Campello, D. Moulavi, J. Sander, Density-based clustering based on hierarchical density estimates, \emph{Lecture Notes in Computer Science} 7819, 2013, 160–172. \doi{10.1007/978-3-642-37456-2_14} R.J.G.B. Campello, D. Moulavi, A. Zimek, J. Sander, Hierarchical density estimates for data clustering, visualization, and outlier detection, \emph{ACM Transactions on Knowledge Discovery from Data (TKDD)} 10(1), 2015, 1–51, \doi{10.1145/2733381} L. McInnes, J. Healy, Accelerated hierarchical density-based clustering, \emph{IEEE Intl. Conf.
Data Mining Workshops (ICDMW)}, 2017, 33–42, \doi{10.1109/ICDMW.2017.12} M. Gagolewski, quitefastmst, in preparation, 2026, TODO } \seealso{ The official online manual of \pkg{quitefastmst} at \url{https://quitefastmst.gagolewski.com/} \code{\link{knn_euclid}} } \author{ \href{https://www.gagolewski.com/}{Marek Gagolewski} } quitefastmst/man/knn_euclid.Rd0000644000176200001440000001004615143122521016155 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/RcppExports.R \encoding{UTF-8} \name{knn_euclid} \alias{knn_euclid} \title{Euclidean Nearest Neighbours} \usage{ knn_euclid( X, k = 1L, Y = NULL, algorithm = "auto", max_leaf_size = 0L, squared = FALSE, verbose = FALSE ) } \arguments{ \item{X}{the "database"; a matrix of shape \eqn{n\times d}} \item{k}{requested number of nearest neighbours} \item{Y}{the "query points"; \code{NULL} or a matrix of shape \eqn{m\times d}; note that setting \code{Y=X}, contrary to \code{NULL}, will include the query points themselves amongst their own neighbours} \item{algorithm}{\code{"auto"}, \code{"kd_tree"} or \code{"brute"}; K-d trees can be used for \code{d} between 2 and 20 only; \code{"auto"} selects \code{"kd_tree"} in low-dimensional spaces} \item{max_leaf_size}{maximal number of points in the K-d tree leaves; smaller leaves use more memory, yet are not necessarily faster; use \code{0} to select the default value, currently set to 32} \item{squared}{whether the output \code{nn.dist} should be based on the squared Euclidean distance} \item{verbose}{whether to print diagnostic messages} } \value{ A list with two elements, \code{nn.index} and \code{nn.dist}, is returned. \code{nn.dist} and \code{nn.index} have shape \eqn{n\times k} or \eqn{m\times k}, depending on whether \code{Y} is given. \code{nn.index[i,j]} is the index (between \eqn{1} and \eqn{n}) of the \eqn{j}-th nearest neighbour of \eqn{i}.
\code{nn.dist[i,j]} gives the weight of the edge \code{{i, nn.index[i,j]}}, i.e., the distance between the \eqn{i}-th point and its \eqn{j}-th nearest neighbour, \eqn{j=1,\dots,k}. \code{nn.dist[i,]} is sorted nondecreasingly for all \eqn{i}. } \description{ If \code{Y} is \code{NULL}, then the function determines the first \code{k} nearest neighbours of each point in \code{X} with respect to the Euclidean distance. It is assumed that each query point is not its own neighbour. Otherwise, for each point in \code{Y}, this function determines the \code{k} nearest points thereto from \code{X}. } \details{ The implemented algorithms, see the \code{algorithm} parameter, assume that \eqn{k} is rather small. Our implementation of K-d trees (Bentley, 1975) has been quite optimised; amongst others, it has good locality of reference (at the cost of making a copy of the input dataset), features the sliding midpoint (midrange) rule suggested by Maneewongvatana and Mount (1999), node pruning strategies inspired by some ideas from (Sample et al., 2001), and a couple of further tuneups proposed by the current author. Still, it is well known that K-d trees perform well only in spaces of low intrinsic dimensionality. Thus, due to the so-called curse of dimensionality, for high \code{d}, the brute-force algorithm is recommended. The number of threads is controlled via the \code{OMP_NUM_THREADS} environment variable or via the \code{\link{omp_set_num_threads}} function at runtime. For best speed, consider building the package from sources using, e.g., \code{-O3 -march=native} compiler flags.
}
\examples{
library("datasets")
data("iris")
X <- jitter(as.matrix(iris[1:2]))  # some data
neighbours <- knn_euclid(X, 1)  # 1-NNs of each point
plot(X, asp=1, las=1)
segments(X[,1], X[,2], X[neighbours$nn.index,1], X[neighbours$nn.index,2])
knn_euclid(X, 5, matrix(c(6, 4), nrow=1))  # five closest points to (6, 4)
}
\references{ J.L.
Bentley, Multidimensional binary search trees used for associative searching, \emph{Communications of the ACM} 18(9), 509–517, 1975, \doi{10.1145/361002.361007} S. Maneewongvatana, D.M. Mount, It's okay to be skinny, if your friends are fat, \emph{4th CGC Workshop on Computational Geometry}, 1999 N. Sample, M. Haines, M. Arnold, T. Purcell, Optimizing search strategies in K-d Trees, \emph{5th WSES/IEEE Conf. on Circuits, Systems, Communications & Computers} (CSCC'01), 2001 } \seealso{ The official online manual of \pkg{quitefastmst} at \url{https://quitefastmst.gagolewski.com/} \code{\link{mst_euclid}} } \author{ \href{https://www.gagolewski.com/}{Marek Gagolewski} } quitefastmst/man/omp.Rd0000644000176200001440000000203615073706737014660 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/RcppExports.R \encoding{UTF-8} \name{omp_set_num_threads} \alias{omp_set_num_threads} \alias{omp_get_max_threads} \title{Get or Set the Number of Threads} \usage{ omp_set_num_threads(n_threads) omp_get_max_threads() } \arguments{ \item{n_threads}{maximal number of threads to use} } \value{ \code{omp_get_max_threads} returns the maximal number of threads that will be used during the next call to a parallelised function, not the maximal number of threads possibly available. If there is no built-in support for OpenMP, 1 is always returned. For \code{omp_set_num_threads}, the previous value of \code{max_threads} is returned. } \description{ These functions get or set the maximal number of OpenMP threads that can be used by \code{\link{knn_euclid}} and \code{\link{mst_euclid}}, amongst others.
} \author{ \href{https://www.gagolewski.com/}{Marek Gagolewski} } \seealso{ The official online manual of \pkg{quitefastmst} at \url{https://quitefastmst.gagolewski.com/} } quitefastmst/man/quitefastmst-package.Rd0000644000176200001440000000173115037667126020206 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/quitefastmst-package.R \docType{package} \encoding{UTF-8} \name{quitefastmst-package} \alias{quitefastmst} \alias{quitefastmst-package} \title{Euclidean and Mutual Reachability Minimum Spanning Trees} \description{ See \code{\link{mst_euclid}()} for more details. } \details{ For best speed, consider building the package from sources using, e.g., \code{-O3 -march=native} compiler flags and with OpenMP support on. } \seealso{ The official online manual of \pkg{quitefastmst} at \url{https://quitefastmst.gagolewski.com/} Useful links: \itemize{ \item \url{https://quitefastmst.gagolewski.com/} \item \url{https://github.com/gagolews/quitefastmst} \item Report bugs at \url{https://github.com/gagolews/quitefastmst/issues} } } \author{ \strong{Maintainer}: Marek Gagolewski \email{marek@gagolewski.com} (\href{https://orcid.org/0000-0003-0637-6028}{ORCID}) [copyright holder] } \keyword{internal} quitefastmst/DESCRIPTION0000644000176200001440000000257115143132623014516 0ustar liggesusersPackage: quitefastmst Type: Package Title: Euclidean and Mutual Reachability Minimum Spanning Trees Version: 0.9.1 Date: 2026-02-11 Authors@R: person("Marek", "Gagolewski", role=c("aut", "cre", "cph"), email="marek@gagolewski.com", comment=c(ORCID="0000-0003-0637-6028")) Description: Functions to compute Euclidean minimum spanning trees using single-, sesqui-, and dual-tree Boruvka algorithms. Thanks to K-d trees, they are fast in spaces of low intrinsic dimensionality. Mutual reachability distances (used in the definition of the 'HDBSCAN*' algorithm) are supported too. 
The package also includes relatively fast fallback minimum spanning tree and nearest-neighbours algorithms for spaces of higher dimensionality. The 'Python' version of 'quitefastmst' is available via 'PyPI'. BugReports: https://github.com/gagolews/quitefastmst/issues URL: https://quitefastmst.gagolewski.com/, https://github.com/gagolews/quitefastmst License: AGPL-3 Imports: Rcpp Suggests: datasets LinkingTo: Rcpp Encoding: UTF-8 SystemRequirements: OpenMP RoxygenNote: 7.3.3 NeedsCompilation: yes Packaged: 2026-02-11 16:24:15 UTC; gagolews Author: Marek Gagolewski [aut, cre, cph] (ORCID: <https://orcid.org/0000-0003-0637-6028>) Maintainer: Marek Gagolewski <marek@gagolewski.com> Repository: CRAN Date/Publication: 2026-02-11 17:00:03 UTC