API reference
Example datasets
Making regions
- headss2.regions.get_step_and_limits(df: DataFrame, cluster_columns: List[str], n: int) tuple[ndarray, ndarray]
- Returns:
step – np.ndarray of step sizes for each dimension.
limits – np.ndarray of region start coordinates (shape: [num_regions, num_dims]).
- Return type:
tuple[ndarray, ndarray]
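As a rough illustration of the two return values, the sketch below splits each dimension into n equal-width, non-overlapping bins and lists the start coordinates of every region. The helper name step_and_limits, its mins/maxs arguments, and the non-overlapping layout are assumptions made for this example; the actual headss2 implementation may lay out regions differently.

```python
from itertools import product


def step_and_limits(mins, maxs, n):
    """Hypothetical helper: split each dimension into n equal-width bins."""
    # Step size per dimension: the span of the data divided by the bin count.
    step = [(hi - lo) / n for lo, hi in zip(mins, maxs)]
    # Start coordinates of the bins along each dimension.
    starts = [[lo + i * s for i in range(n)] for lo, s in zip(mins, step)]
    # One row per region: every combination of per-dimension start coordinates,
    # giving a [num_regions, num_dims] layout.
    limits = [list(combo) for combo in product(*starts)]
    return step, limits


step, limits = step_and_limits(mins=[0.0, 0.0], maxs=[10.0, 4.0], n=2)
```

With two dimensions and n = 2 this yields four regions, one row of start coordinates per region.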
- headss2.regions.make_regions(spark_session: SparkSession, df: DataFrame | DataFrame, n: int, cluster_columns: List[str]) Regions
Computes regions for a DataFrame by assigning each row a region ID based on spatial bins, and returns metadata (split + stitch regions) for those spatial divisions.
- Parameters:
spark_session – Active SparkSession.
df – Input Spark or pandas DataFrame (the two DataFrame types in the signature's union).
n – Number of bins per dimension.
cluster_columns – List of column names to split on.
- Returns:
Regions object containing the region-annotated DataFrame and the region metadata as pandas DataFrames.
- Return type:
Regions
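The idea of assigning each row a region ID from spatial bins can be sketched as follows. This is a minimal illustration, not the headss2 code: the function assign_region, the row-major ID scheme, and the clamping of edge values are all assumptions, and the split and stitch metadata that make_regions returns are omitted.

```python
def assign_region(point, mins, step, n):
    """Hypothetical helper: map a point to a single integer region ID."""
    region = 0
    for x, lo, s in zip(point, mins, step):
        # Bin index along this dimension (assumes a positive step size),
        # clamped so that values on the edges still fall in a valid bin.
        idx = max(0, min(int((x - lo) // s), n - 1))
        # Combine per-dimension indices into one ID (row-major order).
        region = region * n + idx
    return region


mins = [0.0, 0.0]
step = [5.0, 2.0]
region_a = assign_region([7.0, 1.0], mins, step, n=2)
region_b = assign_region([10.0, 4.0], mins, step, n=2)
```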
Clustering
- headss2.clustering.cluster(split_data: DataFrame, min_cluster_size: int, min_samples: int | None, allow_single_cluster: bool, clustering_method: str, cluster_columns: List[str], drop_unclustered: bool = True) DataFrame
Perform HDBSCAN clustering on a Spark DataFrame, by region, using applyInArrow.
- headss2.clustering.run_hdbscan(arrow_table: Table, region: int, min_cluster_size: int, min_samples: int | None, allow_single_cluster: bool, clustering_method: str, cluster_columns: List[str], drop_unclustered: bool = True, random_seed: int = 11) Table
Cluster objects using HDBSCAN. Given a pyarrow.Table, returns a pyarrow.Table with a string ‘cluster’ column (formatted region_cluster, or ‘-1’).
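The string ‘cluster’ column can be pictured with the sketch below. It assumes that region_cluster means the region ID and the per-region cluster label joined by an underscore, and that drop_unclustered removes the ‘-1’ entries; format_labels is a hypothetical helper, not part of headss2.

```python
def format_labels(region, labels, drop_unclustered=True):
    """Hypothetical helper: turn integer cluster labels into string IDs."""
    formatted = []
    for label in labels:
        if label == -1:
            # -1 marks points that were not assigned to any cluster.
            if not drop_unclustered:
                formatted.append("-1")
            continue
        # Prefix with the region so IDs stay unique across regions.
        formatted.append(f"{region}_{label}")
    return formatted


kept = format_labels(3, [0, -1, 1])
with_noise = format_labels(3, [0, -1, 1], drop_unclustered=False)
```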
Stitching regions
- headss2.stitching.calculate_centers(clustered_data: DataFrame, cluster_columns: List[str]) DataFrame
Calculate the median center and size for each cluster.
Robust against empty partitions, missing clusters, or metadata inference.
- Parameters:
clustered_data – A pandas DataFrame with a ‘cluster’ column and coordinate columns.
cluster_columns – Names of columns to include in median calculation.
- Returns:
DataFrame with median center coordinates, cluster size (N), and cluster ID.
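A plain-Python sketch of the same calculation may help clarify the output layout. It assumes rows are dictionaries with a ‘cluster’ key and uses the output column names N and cluster from the description above; median_centers is a hypothetical stand-in, not the library function.

```python
from collections import defaultdict
from statistics import median


def median_centers(rows, cluster_columns):
    """Hypothetical helper: median center and size for each cluster."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["cluster"]].append(row)
    centers = []
    for cluster_id, members in groups.items():
        # Median of each coordinate column over the cluster's members.
        center = {col: median(m[col] for m in members) for col in cluster_columns}
        center["N"] = len(members)
        center["cluster"] = cluster_id
        centers.append(center)
    # An empty input simply yields an empty list.
    return centers


rows = [
    {"cluster": "0_1", "x": 1, "y": 0},
    {"cluster": "0_1", "x": 2, "y": 4},
    {"cluster": "0_1", "x": 9, "y": 2},
]
centers = median_centers(rows, ["x", "y"])
```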
- headss2.stitching.cut_misplaced_clusters(centers: list[DataFrame], stitch_regions: DataFrame, cluster_columns: List[str]) DataFrame
Drop clusters whose centers occupy the incorrect region defined by stitch_regions.
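The filtering step can be sketched as below. The layout of the stitch bounds (a dictionary of half-open [low, high) intervals per region and column) and the helper name cut_misplaced are assumptions for illustration only; the real stitch_regions DataFrame may be structured differently.

```python
def cut_misplaced(centers, stitch_bounds, cluster_columns):
    """Hypothetical helper: keep clusters whose center lies in their stitch region."""
    kept = []
    for center in centers:
        bounds = stitch_bounds[center["region"]]
        # A cluster is kept only if its center is inside the bounds
        # of its own region along every clustered dimension.
        inside = all(
            bounds[col][0] <= center[col] < bounds[col][1]
            for col in cluster_columns
        )
        if inside:
            kept.append(center["cluster"])
    return kept


stitch_bounds = {0: {"x": (0.0, 5.0)}, 1: {"x": (5.0, 10.0)}}
centers = [
    {"cluster": "0_0", "region": 0, "x": 2.0},
    {"cluster": "0_1", "region": 0, "x": 6.5},  # center falls outside region 0
    {"cluster": "1_0", "region": 1, "x": 7.0},
]
kept = cut_misplaced(centers, stitch_bounds, ["x"])
```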
- headss2.stitching.get_centers(clustered: DataFrame, cluster_columns: List[str]) List[DataFrame]
Compute the median center and size of each cluster, per region. Returns a list of Spark DataFrames (one per region). Uses the Pandas-based calculate_centers() function internally.
- headss2.stitching.stitch(clustered: DataFrame, cluster_columns: List[str], stitch_regions: DataFrame) DataFrame
Stitch regions by removing misplaced clusters using PySpark.
- headss2.stitching.stitch_clusters(regions: DataFrame, centers: List[DataFrame], stitch_regions: DataFrame, cluster_columns: List[str]) DataFrame
Filter regions to include only valid clusters based on their center positions.
Merging clusters
- class headss2.merging.OverlapStats(cluster1: str, cluster2: str, n_overlap: int, n1: int, n2: int, bound_region_point_overlap: float, total_point_overlap: float)
- headss2.merging.cluster_merge(clustered: DataFrame, cluster_columns: list[str], split_regions: DataFrame, bound_region_point_overlap_threshold: float = 0.1, total_point_overlap_threshold: float = 0.5, min_n_overlap: int = 10, min_members=10) DataFrame
Merge clusters based on overlaps.
- Parameters:
clustered (sql.DataFrame) – Clustered data.
cluster_columns (list[str]) – Columns that we clustered on.
split_regions (pd.DataFrame) – Split regions data.
bound_region_point_overlap_threshold (float | None, optional) – Minimum threshold for merging: number of joint data points lying within the bound overlap region, divided by the size of the smaller of the two clusters. Previously known as ‘total threshold’. Defaults to 0.1.
total_point_overlap_threshold (float | None, optional) – Minimum threshold for merging: number of all joint data points, divided by the size of the smaller of the two clusters. Previously known as ‘overlap threshold’. Defaults to 0.5.
min_n_overlap (int | None, optional) – Minimum number of overlapping points to allow merging. Defaults to 10.
min_members (int, optional) – Minimum number of members per cluster. Defaults to 10.
- Returns:
Clustered data with merged clusters.
- Return type:
sql.DataFrame
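How the thresholds might combine into a merge decision can be sketched as follows. The record mirrors the OverlapStats fields listed above, and the defaults come from the cluster_merge signature; requiring all three conditions at once is an assumption of this sketch, as are the names OverlapStatsSketch and should_merge.

```python
from dataclasses import dataclass


@dataclass
class OverlapStatsSketch:
    """Stand-in record with the same fields as OverlapStats."""
    cluster1: str
    cluster2: str
    n_overlap: int
    n1: int
    n2: int
    bound_region_point_overlap: float
    total_point_overlap: float


def should_merge(stats, bound_threshold=0.1, total_threshold=0.5, min_n_overlap=10):
    """Hypothetical rule: merge only if every minimum is met."""
    return (
        stats.n_overlap >= min_n_overlap
        and stats.bound_region_point_overlap >= bound_threshold
        and stats.total_point_overlap >= total_threshold
    )


strong = OverlapStatsSketch("0_1", "1_0", 40, 100, 60, 0.8, 40 / 60)
weak = OverlapStatsSketch("0_2", "1_3", 5, 100, 60, 0.8, 5 / 60)
merge_strong = should_merge(strong)
merge_weak = should_merge(weak)
```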