Make Val Subset

The make_val_subset.py script creates a subset of validation data from a larger dataset. It takes a JSON file containing validation labels (e.g., COCO-style annotations) and extracts a specified number of images together with their corresponding annotations. The resulting smaller dataset is useful for quick testing or debugging.

How It Works

1. Command-Line Arguments

parser = argparse.ArgumentParser()
parser.add_argument('--labels', type=str, required=True, help='path to json with keypoints val labels')
parser.add_argument('--output-name', type=str, default='val_subset.json',
                    help='name of output file with subset of val labels')
parser.add_argument('--num-images', type=int, default=250, help='number of images in subset')
args = parser.parse_args()

The script accepts three command-line arguments:
1. --labels: Path to the input JSON file containing validation labels (required).
2. --output-name: Name of the output JSON file for the subset (default: val_subset.json).
3. --num-images: Number of images to include in the subset (default: 250).
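
To illustrate how these flags turn into attributes, the sketch below rebuilds the same parser and feeds it an explicit argument list instead of reading sys.argv. The build_parser helper and the file name val_labels.json are hypothetical and used only for the example.

```python
import argparse


def build_parser():
    """Recreate the parser described above (helper name is hypothetical)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--labels', type=str, required=True,
                        help='path to json with keypoints val labels')
    parser.add_argument('--output-name', type=str, default='val_subset.json',
                        help='name of output file with subset of val labels')
    parser.add_argument('--num-images', type=int, default=250,
                        help='number of images in subset')
    return parser


# 'val_labels.json' is a placeholder path, not one used by the script.
args = build_parser().parse_args(['--labels', 'val_labels.json',
                                  '--num-images', '100'])

# argparse converts dashes inside option names to underscores.
print(args.labels, args.output_name, args.num_images)
```

Note that --output-name and --num-images are read back as args.output_name and args.num_images, and that omitting --labels makes argparse exit with an error because the option is marked as required.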

2. Load the Input JSON

with open(args.labels, 'r') as f:
    data = json.load(f)

The script reads the input JSON file specified by --labels and loads its content into the data dictionary.
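
The later steps only rely on two top-level keys of data: 'images' (each entry has an 'id') and 'annotations' (each entry has an 'image_id'). The sketch below shows a heavily trimmed, made-up label file with that shape. The concrete ids and file names are invented for illustration, and real label files carry many more fields.

```python
import json

# Hypothetical, minimal label file content with the two keys the script uses.
raw = '''
{
  "images": [
    {"id": 7, "file_name": "a.jpg"},
    {"id": 9, "file_name": "b.jpg"}
  ],
  "annotations": [
    {"id": 1, "image_id": 7},
    {"id": 2, "image_id": 7},
    {"id": 3, "image_id": 9}
  ]
}
'''

data = json.loads(raw)

# Each image has a unique id; each annotation points back to one image.
image_ids = [image['id'] for image in data['images']]
annotation_owners = [annotation['image_id'] for annotation in data['annotations']]
print(image_ids, annotation_owners)
```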

3. Shuffle and Select Images

random.seed(0)
total_val_images = 5000
idxs = list(range(total_val_images))
random.shuffle(idxs)

images_by_id = {}
for idx in idxs[:args.num_images]:
    images_by_id[data['images'][idx]['id']] = data['images'][idx]

Shuffling:
- A fixed random seed (0) is set to ensure reproducibility.
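- The script hard-codes total_val_images = 5000, so it assumes the input file contains exactly 5000 validation images. With fewer images, indexing data['images'] can raise an IndexError; with more, only the first 5000 entries can ever be selected.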
- It creates a shuffled list of indices (idxs) and selects the first num_images indices.
Selecting Images:
- The script creates a dictionary images_by_id to store the selected images, indexed by their unique id.
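
If the input may not hold exactly 5000 images, one possible variation is to derive the count from the data itself. The sketch below is not the script's code: select_images is a hypothetical helper, and because it uses random.sample on a local generator it will generally pick a different subset than the script does for the same seed.

```python
import random


def select_images(images, num_images, seed=0):
    """Return {image id: image record} for a reproducible random subset.

    Hypothetical helper: sizes the selection from len(images) instead of
    relying on a hard-coded total.
    """
    rng = random.Random(seed)               # local generator, global state untouched
    count = min(num_images, len(images))    # never ask for more than exist
    chosen = rng.sample(images, count)      # sampling without replacement
    return {image['id']: image for image in chosen}


# Made-up image records for illustration.
images = [{'id': 100 + i} for i in range(10)]
subset = select_images(images, 3)
print(sorted(subset))
```

Calling select_images again with the same seed returns the same ids, which preserves the reproducibility that the fixed seed is meant to provide.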

4. Collect Annotations for Selected Images

annotations_by_image_id = {}
for annotation in data['annotations']:
    if annotation['image_id'] in images_by_id:
        if annotation['image_id'] not in annotations_by_image_id:
            annotations_by_image_id[annotation['image_id']] = []
        annotations_by_image_id[annotation['image_id']].append(annotation)

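The script iterates through all annotations in the dataset.
For each annotation, it checks whether the image_id matches one of the selected images. If it does, the annotation is appended to that image's list in annotations_by_image_id, and the list is created the first time an image_id is seen.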
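
The same grouping can be written more compactly with collections.defaultdict, which creates the empty list on first access. This is an equivalent sketch rather than the script's own code; group_annotations and the sample records are hypothetical.

```python
from collections import defaultdict


def group_annotations(annotations, images_by_id):
    """Group annotations of selected images by image_id (hypothetical helper)."""
    grouped = defaultdict(list)
    for annotation in annotations:
        image_id = annotation['image_id']
        if image_id in images_by_id:        # keep only selected images
            grouped[image_id].append(annotation)
    return dict(grouped)


# Made-up data: images 7 and 9 are selected, image 8 is not.
images_by_id = {7: {'id': 7}, 9: {'id': 9}}
annotations = [
    {'id': 1, 'image_id': 7},
    {'id': 2, 'image_id': 8},
    {'id': 3, 'image_id': 7},
]

annotations_by_image_id = group_annotations(annotations, images_by_id)
print(annotations_by_image_id)
```

As in the original loop, a selected image that has no annotations (image 9 here) gets no entry in the result.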