
This helper reproduces the preprocessing pipeline that previously lived in temp.R. It reads the baseline and follow-up Excel files, harmonises their columns, derives follow-up times and laboratory summaries, and returns a cleaned dataset together with its training and validation subsets.

Usage

prepare_adpkd_dataset(
  baseline_path,
  followup_path,
  followup_reference = as.Date("2025-08-01"),
  train_size = 300,
  seed = 123456L
)

Arguments

baseline_path

Path to the baseline Excel file (sheet 1, skip = 1).

followup_path

Path to the follow-up Excel file (sheet 1).

followup_reference

Reference date used when the RRT start date is missing. Either a Date or a value coercible via as.Date(). Defaults to as.Date("2025-08-01").
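
The fallback can be pictured with a minimal base R sketch. The vector rrt_start and the name end_date are hypothetical and used only for illustration; the column names and internals of prepare_adpkd_dataset() are not documented here and may differ.

```r
# Hypothetical RRT start dates; NA means no RRT start was recorded.
rrt_start <- as.Date(c("2021-03-15", NA, "2023-11-02"))

# Accept either a Date or something coercible via as.Date().
followup_reference <- as.Date("2025-08-01")

# Substitute the reference date wherever the RRT start date is missing.
end_date <- rrt_start
end_date[is.na(end_date)] <- followup_reference
```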

train_size

Number of subjects to sample into the training set. If it exceeds the number of rows in the cleaned data, nrow(data) is used instead. Defaults to 300.

seed

Integer seed for the random train/validation split. Defaults to 123456L.
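
A seeded split with a capped training size might look like the following sketch, which assumes simple random sampling of rows without replacement. The toy data frame and the helper variables n_train and train_idx are hypothetical; the function's actual sampling code may differ.

```r
# Toy stand-in for the cleaned dataset (hypothetical values).
data <- data.frame(id = 1:10, egfr = seq(90, 45, by = -5))

train_size <- 300
seed <- 123456L

# Cap the training size at the number of available rows.
n_train <- min(train_size, nrow(data))

# Fix the seed so the same rows are drawn on every run.
set.seed(seed)
train_idx <- sample(seq_len(nrow(data)), n_train)

# setdiff() keeps the complement correct even when train_idx is empty.
validation_idx <- setdiff(seq_len(nrow(data)), train_idx)

train <- data[train_idx, , drop = FALSE]
validation <- data[validation_idx, , drop = FALSE]
```

With only 10 rows and train_size = 300, every row lands in train and validation is empty in this sketch.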

Value

A list with elements data (cleaned dataset), train, validation, varlist, and labels (if jstable is available). All datasets are returned as data.table objects.