API reference

Example datasets

Making regions

headss2.regions.get_step_and_limits(df: DataFrame, cluster_columns: List[str], n: int) → tuple[ndarray, ndarray]

Returns:

step: np.ndarray of step sizes for each dimension.
limits: np.ndarray of region start coordinates (shape: [num_regions, num_dims]).

Return type:

tuple[ndarray, ndarray]
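
To make the two return values concrete, the sketch below splits each dimension into n equal-width bins in plain Python. The function name step_and_limits_sketch, the equal-width assumption, and the use of lists instead of NumPy arrays are all assumptions made for illustration; the actual computation in headss2 may differ (for example, in how overlapping regions are laid out).

```python
from itertools import product


def step_and_limits_sketch(rows, n):
    """Hypothetical stand-in, not the headss2 implementation.

    rows: list of equal-length coordinate tuples.
    n: number of equal-width bins per dimension (an assumption).
    """
    dims = list(zip(*rows))  # one tuple of values per dimension
    mins = [min(d) for d in dims]
    steps = [(max(d) - lo) / n for d, lo in zip(dims, mins)]
    # Start coordinates of the bins along each axis.
    axis_starts = [
        [lo + i * step for i in range(n)] for lo, step in zip(mins, steps)
    ]
    # One entry per region, i.e. n ** num_dims rows of num_dims values.
    limits = [list(corner) for corner in product(*axis_starts)]
    return steps, limits


steps, limits = step_and_limits_sketch(
    [(0.0, 0.0), (4.0, 8.0), (2.0, 2.0)], n=2
)
```

With two dimensions and n=2 this yields two step sizes and four region start coordinates, matching the [num_regions, num_dims] shape described above.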

headss2.regions.make_regions(spark_session: SparkSession, df: DataFrame | DataFrame, n: int, cluster_columns: List[str]) → Regions

Computes regions for a DataFrame by assigning each row a region ID based on spatial bins, and returns metadata (split + stitch regions) for those spatial divisions.

Parameters:

spark_session – Active SparkSession.
df – Input Spark or Pandas DataFrame.
n – Number of bins per dimension.
cluster_columns – List of column names to split on.

Returns:

Contains region-annotated DataFrame and region metadata as pandas DataFrames.

Return type:

Regions
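
As a rough picture of what "assigning each row a region ID based on spatial bins" can mean, the plain-Python sketch below maps each row to the index of the equal-width bin it falls in. The helper name assign_region_ids, the row-major numbering of regions, and the single (non-overlapping) region per row are assumptions for illustration only; they are not taken from the headss2 source.

```python
def assign_region_ids(rows, n):
    """Hypothetical sketch: label each row with the index of its spatial bin."""
    dims = list(zip(*rows))
    mins = [min(d) for d in dims]
    steps = [(max(d) - lo) / n for d, lo in zip(dims, mins)]
    region_ids = []
    for row in rows:
        region = 0
        for value, lo, step in zip(row, mins, steps):
            # Bin index along this axis; the maximum value is clamped
            # into the last bin, and a zero-width axis maps to bin 0.
            idx = min(int((value - lo) / step), n - 1) if step > 0 else 0
            region = region * n + idx  # row-major flattening (assumption)
        region_ids.append(region)
    return region_ids


ids = assign_region_ids(
    [(0.0, 0.0), (4.0, 8.0), (2.0, 2.0), (1.0, 7.0)], n=2
)
```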

Clustering

headss2.clustering.cluster(split_data: DataFrame, min_cluster_size: int, min_samples: int | None, allow_single_cluster: bool, clustering_method: str, cluster_columns: List[str], drop_unclustered: bool = True) → DataFrame

Perform HDBSCAN clustering on a Spark DataFrame, by region, using applyInArrow.

headss2.clustering.run_hdbscan(arrow_table: Table, region: int, min_cluster_size: int, min_samples: int | None, allow_single_cluster: bool, clustering_method: str, cluster_columns: List[str], drop_unclustered: bool = True, random_seed: int = 11) → Table

Cluster objects using HDBSCAN. Given a pyarrow.Table, return a pyarrow.Table with a string ‘cluster’ column (formatted region_cluster, or ‘-1’).
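
The ‘cluster’ label format can be illustrated with a small helper. This is a sketch only: format_cluster_label is a hypothetical name, and the underscore separator is inferred from the "region_cluster" wording above rather than confirmed against the code. HDBSCAN itself reports noise points with the integer label -1.

```python
def format_cluster_label(region: int, label: int) -> str:
    """Hypothetical helper: build a string label from a region and a raw label."""
    # HDBSCAN marks noise points with -1; keep them as the string "-1".
    if label == -1:
        return "-1"
    # Prefix with the region so labels stay unique across regions.
    return f"{region}_{label}"


labels = [format_cluster_label(3, raw) for raw in (0, 1, -1)]
```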

Stitching regions

headss2.stitching.calculate_centers(clustered_data: DataFrame, cluster_columns: List[str]) → DataFrame

Calculate the median center and size for each cluster.

Handles empty partitions, missing clusters, and metadata inference without failing.

Parameters:

clustered_data – A pandas DataFrame with a ‘cluster’ column and coordinate columns.
cluster_columns – Names of columns to include in median calculation.

Returns:

DataFrame with median center coordinates, cluster size (N), and cluster ID.
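
The median-center calculation can be sketched without pandas. In the example below, rows are plain dicts, and the function name median_centers is hypothetical; only the output fields (median coordinates, N, cluster) follow the description above.

```python
from collections import defaultdict
from statistics import median


def median_centers(rows, cluster_columns):
    """Hypothetical sketch: rows are dicts with a 'cluster' key plus coordinates."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["cluster"]].append(row)
    centers = []
    for cluster_id, members in groups.items():
        # Median of each coordinate column over the cluster's members.
        center = {
            col: median(m[col] for m in members) for col in cluster_columns
        }
        center["N"] = len(members)  # cluster size
        center["cluster"] = cluster_id
        centers.append(center)
    return centers


rows = [
    {"cluster": "0_1", "x": 1.0, "y": 2.0},
    {"cluster": "0_1", "x": 3.0, "y": 4.0},
    {"cluster": "0_1", "x": 10.0, "y": 6.0},
    {"cluster": "0_2", "x": 5.0, "y": 5.0},
]
centers = median_centers(rows, ["x", "y"])
```

Using the median rather than the mean keeps the center stable when a cluster contains a few outlying points, such as the x = 10.0 member here.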

headss2.stitching.cut_misplaced_clusters(centers: list[DataFrame], stitch_regions: DataFrame, cluster_columns: List[str]) → DataFrame

Drop clusters whose centers lie in the wrong region, as defined by stitch_regions.
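
The idea of dropping misplaced clusters can be sketched as a bounds check on each center. The layout of the bounds used here (a dict mapping each column to a (low, high) pair for one region), the half-open interval, and the name keep_well_placed are all assumptions; the real stitch_regions DataFrame may be organized differently.

```python
def keep_well_placed(centers, bounds, cluster_columns):
    """Hypothetical sketch: keep centers that fall inside one region's bounds.

    centers: list of dicts holding a 'cluster' ID and center coordinates.
    bounds: {column: (low, high)} for the region (assumed layout).
    """
    return [
        c
        for c in centers
        # Half-open interval [low, high) on every clustered column.
        if all(bounds[col][0] <= c[col] < bounds[col][1] for col in cluster_columns)
    ]


region_centers = [
    {"cluster": "0_1", "x": 1.0, "y": 1.0},
    {"cluster": "0_2", "x": 3.0, "y": 1.0},  # outside the x bounds below
]
kept = keep_well_placed(
    region_centers, {"x": (0.0, 2.0), "y": (0.0, 2.0)}, ["x", "y"]
)
```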

headss2.stitching.get_centers(clustered: DataFrame, cluster_columns: List[str]) → List[DataFrame]

Compute the median center and size of each cluster, per region. Returns a list of Spark DataFrames (one per region). Uses the Pandas-based calculate_centers() function internally.

headss2.stitching.stitch(clustered: DataFrame, cluster_columns: List[str], stitch_regions: DataFrame) → DataFrame

Stitch regions by removing misplaced clusters using PySpark.

headss2.stitching.stitch_clusters(regions: DataFrame, centers: List[DataFrame], stitch_regions: DataFrame, cluster_columns: List[str]) → DataFrame

Filter regions to include only valid clusters based on their center positions.

Merging clusters

class headss2.merging.OverlapStats(cluster1: str, cluster2: str, n_overlap: int, n1: int, n2: int, bound_region_point_overlap: float, total_point_overlap: float)

headss2.merging.cluster_merge(clustered: DataFrame, cluster_columns: list[str], split_regions: DataFrame, bound_region_point_overlap_threshold: float = 0.1, total_point_overlap_threshold: float = 0.5, min_n_overlap: int = 10, min_members=10) → DataFrame

Merge clusters based on overlaps.

Parameters:

clustered (sql.DataFrame) – Clustered data.
cluster_columns (list[str]) – Columns that we clustered on.
split_regions (pd.DataFrame) – Split regions data.
bound_region_point_overlap_threshold (float | None, optional) – Minimum threshold for merging: fraction of joint data points lying within the bound overlap region, divided by the size of the smaller of the two clusters. Previously known as ‘total threshold’. Defaults to 0.1.
total_point_overlap_threshold (float | None, optional) – Minimum threshold for merging: fraction of all joint data points, divided by the size of the smaller of the two clusters. Previously known as ‘overlap threshold’. Defaults to 0.5.
min_n_overlap (int | None, optional) – Minimum number of overlapping points to allow merging. Defaults to 10.
min_members (int, optional) – Minimum number of members per cluster. Defaults to 10.

Returns:

Clustered data with merged clusters.

Return type:

sql.DataFrame
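
How the thresholds might combine into a merge decision can be sketched as follows. The dataclass mirrors the field names of OverlapStats listed above, but OverlapStatsSketch and should_merge are hypothetical names. Requiring all three conditions at once, and using inclusive comparisons, are assumptions rather than documented behavior. The default values follow the cluster_merge signature.

```python
from dataclasses import dataclass


@dataclass
class OverlapStatsSketch:
    """Hypothetical stand-in with the same field names as OverlapStats."""

    cluster1: str
    cluster2: str
    n_overlap: int
    n1: int
    n2: int
    bound_region_point_overlap: float
    total_point_overlap: float


def should_merge(
    stats,
    bound_region_point_overlap_threshold=0.1,
    total_point_overlap_threshold=0.5,
    min_n_overlap=10,
):
    # Assumption: a pair is merged only if every criterion is met.
    return (
        stats.n_overlap >= min_n_overlap
        and stats.bound_region_point_overlap >= bound_region_point_overlap_threshold
        and stats.total_point_overlap >= total_point_overlap_threshold
    )


strong = OverlapStatsSketch("0_1", "1_4", 12, 20, 40, 0.2, 0.6)
weak = OverlapStatsSketch("0_2", "1_5", 5, 20, 40, 0.2, 0.6)
decisions = [should_merge(strong), should_merge(weak)]
```

The second pair is rejected only because it has fewer than min_n_overlap shared points, even though both overlap fractions pass their thresholds.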