The make_val_subset.py script creates a smaller validation set from a larger one. It reads a JSON file of validation labels (e.g., COCO-style annotations), randomly picks a specified number of images (250 by default), and keeps only the annotations that belong to those images. The resulting subset is useful for quick testing or debugging runs.


How It Works

1. Command-Line Arguments

import argparse

# --labels is required; --output-name and --num-images fall back to defaults.
parser = argparse.ArgumentParser()
parser.add_argument('--labels', type=str, required=True, help='path to json with keypoints val labels')
parser.add_argument('--output-name', type=str, default='val_subset.json',
                    help='name of output file with subset of val labels')
parser.add_argument('--num-images', type=int, default=250, help='number of images in subset')
args = parser.parse_args()
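
To see how these options resolve without running the script from a shell, the parser can be given an explicit argument list. The sketch below rebuilds the same three options inside a hypothetical helper, build_parser (not part of the script), and val_labels.json is a placeholder file name.

```python
import argparse


def build_parser():
    # Hypothetical helper: wraps the same three options shown above.
    parser = argparse.ArgumentParser()
    parser.add_argument('--labels', type=str, required=True,
                        help='path to json with keypoints val labels')
    parser.add_argument('--output-name', type=str, default='val_subset.json',
                        help='name of output file with subset of val labels')
    parser.add_argument('--num-images', type=int, default=250,
                        help='number of images in subset')
    return parser


# Pass a list instead of reading sys.argv; the file name is a placeholder.
args = build_parser().parse_args(['--labels', 'val_labels.json', '--num-images', '100'])

# argparse turns the dashes in option names into underscores on the namespace.
print(args.labels, args.output_name, args.num_images)
```

Since --output-name is not given here, it keeps its default, while --num-images is overridden and converted to an int.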

2. Load the Input JSON

import json

with open(args.labels, 'r') as f:
    data = json.load(f)
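
The later steps only read data['images'] (each entry's 'id') and data['annotations'] (each entry's 'image_id'). A minimal sketch of a sanity check for that shape follows; check_labels and the sample dict are illustrative assumptions, not part of the script.

```python
def check_labels(data):
    """Raise ValueError if `data` lacks the fields the later steps read."""
    for key in ('images', 'annotations'):
        if key not in data:
            raise ValueError(f"missing top-level key: {key!r}")
    for image in data['images']:
        if 'id' not in image:
            raise ValueError("every image entry needs an 'id'")
    for annotation in data['annotations']:
        if 'image_id' not in annotation:
            raise ValueError("every annotation needs an 'image_id'")


# Tiny made-up example with the same shape as the script's input.
sample = {
    'images': [{'id': 1}, {'id': 2}],
    'annotations': [{'image_id': 1}, {'image_id': 1}, {'image_id': 2}],
}
check_labels(sample)  # returns None when the structure is as expected
```

Running such a check right after json.load would surface a malformed label file before the selection step fails with a less descriptive KeyError.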

3. Shuffle and Select Images

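import random

random.seed(0)  # fixed seed, so the same subset is picked on every run
# Hard-coded dataset size; a label file with fewer images can raise IndexError below.
total_val_images = 5000
idxs = list(range(total_val_images))
random.shuffle(idxs)

# Keep the images at the first num_images shuffled indices, keyed by image id.
images_by_id = {}
for idx in idxs[:args.num_images]:
    images_by_id[data['images'][idx]['id']] = data['images'][idx]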
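
Because total_val_images is fixed at 5000, this step assumes the label file holds at least that many images. A possible variant that adapts to the actual file size is sketched below. It is not the script's code: select_images is a hypothetical helper, and since it uses random.sample rather than random.shuffle, it will generally pick a different subset than the script does for the same seed.

```python
import random


def select_images(images, num_images, seed=0):
    """Return {image_id: image} for a reproducible random subset."""
    rng = random.Random(seed)              # local generator; global state untouched
    count = min(num_images, len(images))   # never ask for more images than exist
    chosen = rng.sample(images, count)
    return {image['id']: image for image in chosen}


# Made-up image entries; real entries carry more fields than 'id'.
images = [{'id': 100 + i} for i in range(10)]
subset = select_images(images, 4)
print(sorted(subset))
```

Using a local random.Random instance keeps the selection reproducible without reseeding the module-level generator that other code may rely on.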

4. Collect Annotations for Selected Images

annotations_by_image_id = {}
for annotation in data['annotations']:
    if annotation['image_id'] in images_by_id:
        if annotation['image_id'] not in annotations_by_image_id:
            annotations_by_image_id[annotation['image_id']] = []
        annotations_by_image_id[annotation['image_id']].append(annotation)
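
The grouping above can also be written with collections.defaultdict, and the grouped annotations can then be combined with the selected images into one smaller label dict. The sketch below shows one way to do that. The build_subset helper and the choice to copy all other top-level keys unchanged are assumptions, since this section does not show how the script assembles or writes its output.

```python
from collections import defaultdict


def build_subset(data, images_by_id):
    """Assemble a smaller label dict for the chosen images (assumed layout)."""
    annotations_by_image_id = defaultdict(list)
    for annotation in data['annotations']:
        if annotation['image_id'] in images_by_id:
            annotations_by_image_id[annotation['image_id']].append(annotation)

    # Assumption: every other top-level key is carried over unchanged.
    subset = {k: v for k, v in data.items() if k not in ('images', 'annotations')}
    subset['images'] = list(images_by_id.values())
    # Images without annotations simply contribute nothing here.
    subset['annotations'] = [
        annotation
        for image_id in images_by_id
        for annotation in annotations_by_image_id.get(image_id, [])
    ]
    return subset


# Made-up data: image 2 has no annotations, image 3 is not selected.
data = {
    'categories': [{'id': 1, 'name': 'person'}],
    'images': [{'id': 1}, {'id': 2}, {'id': 3}],
    'annotations': [
        {'id': 10, 'image_id': 1},
        {'id': 11, 'image_id': 3},
        {'id': 12, 'image_id': 1},
    ],
}
images_by_id = {1: {'id': 1}, 2: {'id': 2}}
subset = build_subset(data, images_by_id)
print([a['id'] for a in subset['annotations']])
```

A dict built this way could then be passed to json.dump and written to the path given by --output-name.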