synthesisai package

Submodules

synthesisai.data_types module

class synthesisai.data_types.OutOfFrameLandmarkStrategy(value)[source]

Bases: Enum

Strategy for handling landmarks that fall outside the image frame.

CLIP = 'clip'
IGNORE = 'ignore'
static clip_landmarks_(landmarks: Dict[int, Tuple[float, float]], height: int, width: int) Dict[int, Tuple[float, float]][source]
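
The implementation is internal; as a rough sketch, CLIP-style handling plausibly clamps each (x, y) landmark into the valid pixel range (the exact boundary convention used by the library is an assumption here):

from typing import Dict, Tuple

def clip_landmarks(
    landmarks: Dict[int, Tuple[float, float]], height: int, width: int
) -> Dict[int, Tuple[float, float]]:
    # Clamp each (x, y) landmark into [0, width - 1] x [0, height - 1].
    return {
        idx: (min(max(x, 0.0), width - 1.0), min(max(y, 0.0), height - 1.0))
        for idx, (x, y) in landmarks.items()
    }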

synthesisai.dataset module

class synthesisai.dataset.Grouping(value)[source]

Bases: Enum

Different ways of grouping items in a Synthesis AI dataset.

CAMERA = 'CAMERA'

Items with the same camera are grouped into the list.

The size of the dataset is #cameras. Each element is a List[Item] with the same CAMERA_NAME.

NONE = 'NONE'

Each image is treated independently.

The size of the dataset is #scenes * #cameras * #frames (assuming the same number of cameras/frames per scene).

SCENE = 'SCENE'

Items with the same scene are grouped into the list.

The size of the dataset is #scenes. Each element is a List[Item] with the same SCENE_ID.

SCENE_CAMERA = 'SCENE_CAMERA'

Items are grouped first by camera and then by scene. The list of frames for a particular scene is indexed by scene_id.

The size of the dataset is #scenes * #cameras; each element is a List[Item] of consecutive frames for a given scene and camera.
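
To illustrate how the grouping mode changes what indexing returns, here is a hedged sketch using SaiDataset from this package (the dataset path is hypothetical):

from synthesisai.dataset import Grouping, SaiDataset
from synthesisai.modality import Modality

ROOT = "/data/sai_export"  # hypothetical root; replace with a real export directory

# With Grouping.SCENE, each element is a List[Item] sharing one SCENE_ID.
by_scene = SaiDataset(ROOT, modalities=[Modality.RGB], grouping=Grouping.SCENE)
print(len(by_scene), "scenes;", len(by_scene[0]), "items in the first scene")

# With Grouping.NONE, each element is a single Dict[Modality, Any].
flat = SaiDataset(ROOT, modalities=[Modality.RGB], grouping=Grouping.NONE)
rgb = flat[0][Modality.RGB]  # ndarray[uint8], shape (H, W, 3)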

class synthesisai.dataset.SaiDataset(root: Union[str, PathLike], modalities: Optional[List[Modality]] = None, body_segmentation_mapping: Optional[Dict[str, int]] = None, clothing_segmentation_mapping: Optional[Dict[str, int]] = None, face_segmentation_classes: Optional[List[str]] = None, face_bbox_pad: int = 0, grouping: Grouping = Grouping.NONE, out_of_frame_landmark_strategy: OutOfFrameLandmarkStrategy = OutOfFrameLandmarkStrategy.IGNORE, transform: Optional[Callable[[Dict[Modality, Any]], Dict[Modality, Any]]] = None)[source]

Bases: Sequence

Synthesis AI dataset.

This class provides access to all the modalities available in Synthesis AI generated datasets.
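
A minimal usage sketch, assuming a dataset exported to a local directory (the path is hypothetical, and the interpretation of face_bbox_pad as a per-side pixel padding is an assumption based on its name):

from synthesisai.data_types import OutOfFrameLandmarkStrategy
from synthesisai.dataset import SaiDataset
from synthesisai.modality import Modality

ds = SaiDataset(
    "/data/sai_export",  # hypothetical root directory
    modalities=[Modality.RGB, Modality.LANDMARKS_IBUG68, Modality.FACE_BBOX],
    face_bbox_pad=8,  # assumed: pad each face box by 8 pixels
    out_of_frame_landmark_strategy=OutOfFrameLandmarkStrategy.CLIP,
)

item = ds[0]  # with the default Grouping.NONE, one Dict[Modality, Any] per index
image = item[Modality.RGB]                   # ndarray[uint8]
landmarks = item[Modality.LANDMARKS_IBUG68]  # Dict[InstanceId, Dict[LandmarkId, Landmark2d]]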

BODY_SEGMENTATION_MAPPING = {'arm_lower_left': 54, 'arm_lower_right': 55, 'arm_upper_left': 56, 'arm_upper_right': 57, 'background': 0, 'beard': 1, 'body': 104, 'brow': 2, 'caruncle_left': 29, 'caruncle_right': 30, 'cheek_left': 3, 'cheek_right': 4, 'chin': 5, 'clothing': 105, 'cornea_left': 106, 'cornea_right': 107, 'default': 0, 'ear_left': 6, 'ear_right': 7, 'eye_fluid': 114, 'eye_fluid_left': 35, 'eye_fluid_right': 36, 'eye_left': 115, 'eye_right': 116, 'eyebrow_left': 31, 'eyebrow_right': 32, 'eyebrows': 108, 'eyelashes': 109, 'eyelashes_left': 33, 'eyelashes_right': 34, 'eyelid': 110, 'eyelid_left': 112, 'eyelid_right': 113, 'eyes': 111, 'finger1_mid_bottom_left': 58, 'finger1_mid_bottom_right': 59, 'finger1_mid_left': 60, 'finger1_mid_right': 61, 'finger1_mid_top_left': 62, 'finger1_mid_top_right': 63, 'finger2_mid_bottom_left': 64, 'finger2_mid_bottom_right': 65, 'finger2_mid_left': 66, 'finger2_mid_right': 67, 'finger2_mid_top_left': 68, 'finger2_mid_top_right': 69, 'finger3_mid_bottom_left': 70, 'finger3_mid_bottom_right': 71, 'finger3_mid_left': 72, 'finger3_mid_right': 73, 'finger3_mid_top_left': 74, 'finger3_mid_top_right': 75, 'finger4_mid_bottom_left': 76, 'finger4_mid_bottom_right': 77, 'finger4_mid_left': 78, 'finger4_mid_right': 79, 'finger4_mid_top_left': 80, 'finger4_mid_top_right': 81, 'finger5_mid_bottom_left': 82, 'finger5_mid_bottom_right': 83, 'finger5_mid_left': 84, 'finger5_mid_right': 85, 'finger5_mid_top_left': 86, 'finger5_mid_top_right': 87, 'foot_left': 92, 'foot_right': 93, 'forehead': 8, 'glasses': 117, 'glasses_frame': 98, 'glasses_lens_left': 99, 'glasses_lens_right': 100, 'hair': 9, 'hand_left': 88, 'hand_right': 89, 'head': 10, 'headphones': 101, 'headwear': 102, 'iris_left': 37, 'iris_right': 38, 'jaw': 11, 'jowl': 12, 'leg_lower_left': 94, 'leg_lower_right': 95, 'leg_upper_left': 96, 'leg_upper_right': 97, 'lip_lower': 13, 'lip_upper': 14, 'lower_eyelid_left': 39, 'lower_eyelid_right': 40, 'mask': 103, 'mouth': 15, 'mouthbag': 16, 'mustache': 17, 'nails_left': 90, 'nails_right': 91, 'nape': 18, 'neck': 19, 'nose': 20, 'nose_outer': 21, 'nostrils': 22, 'orbit_left': 23, 'orbit_right': 24, 'pupil_left': 41, 'pupil_right': 42, 'sclera_left': 43, 'sclera_right': 44, 'shoulders': 47, 'smile_line': 25, 'teeth': 26, 'temples': 27, 'tongue': 28, 'torso_lower_left': 48, 'torso_lower_right': 49, 'torso_mid_left': 50, 'torso_mid_right': 51, 'torso_upper_left': 52, 'torso_upper_right': 53, 'undereye': 120, 'undereye_left': 118, 'undereye_right': 119, 'upper_eyelid_left': 45, 'upper_eyelid_right': 46}

Default body segmentation mapping.

CLOTHING_SEGMENTATION_MAPPING = {'background': 0, 'default': 0, 'long sleeve dress': 2, 'long sleeve outerwear': 3, 'long sleeve shirt': 8, 'scarf': 12, 'shoe': 14, 'short sleeve dress': 6, 'short sleeve outerwear': 10, 'short sleeve shirt': 5, 'shorts': 9, 'skirt': 11, 'sling dress': 1, 'trousers': 4, 'vest': 13, 'vest dress': 7}

Default clothing segmentation mapping

FACE_SEGMENTATION_CLASSES = ['brow', 'cheek_left', 'cheek_right', 'chin', 'eye_left', 'eye_right', 'eyelid_left', 'eyelid_right', 'eyes', 'forehead', 'jaw', 'jowl', 'lip_lower', 'lip_upper', 'mouth', 'mouthbag', 'nose', 'nose_outer', 'nostrils', 'smile_line', 'teeth', 'undereye', 'eyelashes_left', 'eyelashes_right', 'eyebrow_left', 'eyebrow_right', 'undereye_left', 'undereye_right']

Segmentation classes included in the face bounding box.
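
The default mappings above can be overridden at construction time. A sketch of a coarse custom body mapping (the class names come from the default mapping; how unmapped names are handled is an assumption):

from synthesisai.dataset import SaiDataset
from synthesisai.modality import Modality

# Coarse two-class mapping: everything defaults to 0, hair and head become class 1.
custom_body = {"default": 0, "background": 0, "hair": 1, "head": 1}

ds = SaiDataset(
    "/data/sai_export",  # hypothetical root directory
    modalities=[Modality.BODY_SEGMENTATION],
    body_segmentation_mapping=custom_body,
)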

property body_segmentation_mapping: Dict[str, int]

Body segmentation mapping for the dataset.

Type

Dict[str, int]

property clothing_segmentation_mapping: Dict[str, int]

Clothing segmentation mapping for the dataset.

Type

Dict[str, int]

get_group_index() DataFrame[source]

property modalities: List[Modality]

List of the loaded modalities.

Type

List[Modality]

synthesisai.item_loader module

synthesisai.item_loader_factory module

synthesisai.item_loader_v1 module

synthesisai.item_loader_v2 module

synthesisai.modality module

class synthesisai.modality.Modality(value)[source]

Bases: Enum

Different modalities of a Synthesis AI dataset. All image modalities are in [y][x][channel] format, with the axes oriented as follows:

┌-----> x
|
|
v
y

BODY_SEGMENTATION = 5

Semantic segmentation map of various body parts.

Type: ndarray[uint16]. Channels: 1.

CAMERA_NAME = 39

Camera name consisting of lowercase alphanumeric characters. Usually used when more than one camera is defined in a scene. Default is "cam_default".

Type: str.

CAM_INTRINSICS = 38

Camera intrinsics matrix in OpenCV format: https://docs.opencv.org/3.4.15/dc/dbb/tutorial_py_calibration.html.

Type: ndarray[float32]. Shape: (4, 4).
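
A worked sketch of projecting a camera-space 3D point to pixel coordinates with this matrix, assuming the documented (4, 4) array carries the standard OpenCV pinhole parameters (fx, fy, cx, cy) in its upper-left 3x3 block (only the shape is documented; the layout is an assumption):

import numpy as np

def project_to_pixels(K: np.ndarray, point_cam: np.ndarray) -> np.ndarray:
    # Standard pinhole projection: u = fx * x / z + cx, v = fy * y / z + cy.
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x, y, z = point_cam
    return np.array([fx * x / z + cx, fy * y / z + cy])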

CAM_TO_HEAD = 33

Transformation matrix from the camera to the head coordinate system.

Type: Dict[InstanceId, ndarray[float32]]. Shape: (4, 4).

CAM_TO_WORLD = 36

Transformation matrix from the camera to the world coordinate system.

Type: ndarray[float32]. Shape: (4, 4).
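
All the *_TO_* modalities here are (4, 4) homogeneous transforms, so applying one is a matrix product with a homogeneous point; a minimal sketch:

import numpy as np

def apply_transform(T: np.ndarray, point: np.ndarray) -> np.ndarray:
    # (x, y, z) -> (x, y, z, 1), multiply, drop the homogeneous coordinate.
    return (T @ np.append(point, 1.0))[:3]

# e.g. moving a camera-space point into world space:
# world_point = apply_transform(item[Modality.CAM_TO_WORLD], cam_point)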

CLOTHING_SEGMENTATION = 6

Semantic segmentation map of different types of clothing.

Type: ndarray[uint16]. Channels: 1.

DEPTH = 4

Depth image. All values are positive floats. Background has depth=0.

Type: ndarray[float16]. Channels: 1.
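
Because the background is encoded as depth=0, a foreground mask and a visualization-friendly normalization follow directly; a sketch:

import numpy as np

def normalize_depth(depth: np.ndarray) -> np.ndarray:
    # Scale foreground depth to [0, 1]; background (depth == 0) stays 0.
    foreground = depth > 0
    out = np.zeros_like(depth, dtype=np.float32)
    if foreground.any():
        d = depth[foreground].astype(np.float32)
        out[foreground] = (d - d.min()) / max(float(d.max() - d.min()), 1e-6)
    return out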

EXPRESSION = 26

Expression and its intensity.

Format:

{
    instance_id: {
        'intensity': float64, 
        'name': str
    },
    ...
}
FACE_BBOX = 31

Face bounding box in the format (left, top, right, bottom) in pixels.

Type: Dict[InstanceId, Tuple[int, int, int, int]].
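
A sketch of cropping a face from the RGB image with this box; the (left, top, right, bottom) order is as documented, while treating right/bottom as exclusive edges is an assumption:

import numpy as np

def crop_face(rgb: np.ndarray, bbox) -> np.ndarray:
    # Image modalities are indexed [y][x][channel], so rows come first.
    left, top, right, bottom = bbox
    return rgb[top:bottom, left:right]

# Usage, assuming a single face with instance id 0:
# face = crop_face(item[Modality.RGB], item[Modality.FACE_BBOX][0])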

FACIAL_HAIR = 25

Facial hair metadata. If no facial hair is present for a human, None is provided.

Format:

{
    instance_id: {
        'relative_length': float64,
        'relative_density': float64,
        'style': str,
        'color_seed': float64,
        'color': str
    },
    ...
}
FRAME_NUM = 40

Frame number, used to order consecutive animation frames.

Type: int.

GAZE = 28

Gaze direction in camera space.

Format:

{
    instance_id: {
        'horizontal_angle': ndarray[float64], shape (3,),
        'vertical_angle': ndarray[float64], shape (3,)
    },
    ...
}
GAZE_TARGET = 29

The target that the gaze direction points to.

Format:

{
    instance_id: str
}
GESTURE = 27

Name of the gesture.

Type: str

HAIR = 24

Hair metadata. If no hair is present, None is returned.

Format:

{
    instance_id: {
        'relative_length': float64,
        'relative_density': float64,
        'style': str,
        'color_seed': float64,
        'color': str
    },
    ...
}
HEAD_TO_CAM = 32

Transformation matrix from the head to the camera coordinate system.

Type: Dict[InstanceId, ndarray[float32]]. Shape: (4, 4).

HEAD_TO_WORLD = 34

Transformation matrix from the head to the world coordinate system.

Type: Dict[InstanceId, ndarray[float32]]. Shape: (4, 4).

HEAD_TURN = 30

The direction that the head is turned, relative to the body, provided in degrees.

Format:

{
    "roll": int,
    "pitch": int,
    "yaw": int
}
IDENTITY = 22

Mapping from instance ID (ranging from 0 to the number of humans in the image) to human ID (corresponding to the ID provided in the job config).

Type: Dict[InstanceId, int].

IDENTITY_METADATA = 23

Additional metadata about the people in the image.

Format:

{
    instance_id: {
        'gender': 'female'|'male',
        'age': int,
        'weight_kg': int,
        'height_cm': int,
        'id': int,
        'ethnicity': 'arab'|'asian'|'black'|'hisp'|'white'
    },
    ...
}
INSTANCE_SEGMENTATION = 7

Semantic segmentation map defining the various characters in the image.

Type: ndarray[uint16]. Channels: 1.
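
Combined with IDENTITY, this map lets you isolate one character; a sketch, assuming pixel values in the instance segmentation map equal the instance IDs used by the other modalities (that correspondence is an assumption):

import numpy as np

def person_mask(instance_seg: np.ndarray, instance_id: int) -> np.ndarray:
    # Boolean mask of all pixels belonging to one character.
    return instance_seg == instance_id

# human_id = item[Modality.IDENTITY][0]   # map instance 0 to its human ID
# mask = person_mask(item[Modality.INSTANCE_SEGMENTATION], instance_id=0)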

LANDMARKS_3D_COCO = 18

COCO whole body landmarks in 3D. Each landmark is given by name and three coordinates (x,y,z) in camera space.

Type: Dict[InstanceId, Dict[LandmarkId, Landmark3d]]. Should have no more than 133 points.

LANDMARKS_3D_IBUG68 = 15

iBUG-68 landmarks in 3D. Each landmark is given by name and three coordinates (x,y,z) in camera space.

Type: Dict[InstanceId, Dict[LandmarkId, Landmark3d]]. Should have no more than 68 points.

LANDMARKS_3D_KINECT_V2 = 16

Kinect v2 landmarks in 3D. Each landmark is given by name and three coordinates (x,y,z) in camera space.

Type: Dict[InstanceId, Dict[LandmarkId, Landmark3d]]. Should have no more than 32 points.

LANDMARKS_3D_MEDIAPIPE = 17

MediaPipe pose landmarks in 3D. Each landmark is given by name and three coordinates (x,y,z) in camera space.

Type: Dict[InstanceId, Dict[LandmarkId, Landmark3d]]. Should have no more than 33 points.

LANDMARKS_3D_MEDIAPIPE_FACE = 41

MediaPipe dense face landmarks in 3D. Each landmark is given by three coordinates (x,y,z) in camera space.

Type: Dict[InstanceId, ndarray[float32]]. Shape: (468, 3).

LANDMARKS_3D_MPEG4 = 19

MPEG4 landmarks in 3D. Each landmark is given by name and three coordinates (x,y,z) in camera space.

Type: Dict[InstanceId, Dict[LandmarkId, Landmark3d]].

LANDMARKS_3D_SAI = 42

SAI dense face landmarks in 3D. Each landmark is given by three coordinates (x,y,z) in camera space.

Type: Dict[InstanceId, ndarray[float32]]. Shape: (4840, 3).

LANDMARKS_COCO = 13

COCO whole body landmarks. Each landmark is given by name and two coordinates (x,y) in pixels.

Type: Dict[InstanceId, Dict[LandmarkId, Landmark2d]]. Should have no more than 133 points.

LANDMARKS_CONTOUR_IBUG68 = 10

iBUG-68 contour landmarks. Each landmark is given by name and two coordinates (x,y) in pixels. Each keypoint is defined in a similar manner to how human labelers mark 2D face keypoints.

Type: Dict[InstanceId, Dict[LandmarkId, Landmark2d]]. Should have no more than 68 points.

LANDMARKS_IBUG68 = 9

iBUG-68 landmarks. Each landmark is given by name and two coordinates (x,y) in pixels. Each keypoint is a 2D projection of a 3D landmark.

Type: Dict[InstanceId, Dict[LandmarkId, Landmark2d]]. Should have no more than 68 points.
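
A sketch of overlaying these landmarks on the RGB image, assuming each Landmark2d unpacks as an (x, y) pair:

import numpy as np

def draw_landmarks(rgb: np.ndarray, landmarks: dict, radius: int = 1) -> np.ndarray:
    # Paint each 2D landmark as a small white square on a copy of the image.
    out = rgb.copy()
    h, w = out.shape[:2]
    for x, y in landmarks.values():
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < w and 0 <= yi < h:
            out[max(yi - radius, 0):yi + radius + 1,
                max(xi - radius, 0):xi + radius + 1] = 255
    return out

# vis = draw_landmarks(item[Modality.RGB], item[Modality.LANDMARKS_IBUG68][0])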

LANDMARKS_KINECT_V2 = 11

Kinect v2 landmarks. Each landmark is given by name and two coordinates (x,y) in pixels.

Type: Dict[InstanceId, Dict[LandmarkId, Landmark2d]]. Should have no more than 32 points.

LANDMARKS_MEDIAPIPE = 12

MediaPipe pose landmarks. Each landmark is given by name and two coordinates (x,y) in pixels.

Type: Dict[InstanceId, Dict[LandmarkId, Landmark2d]]. Should have no more than 33 points.

LANDMARKS_MEDIAPIPE_FACE = 43

MediaPipe dense face landmarks. Each landmark is given by two coordinates (x,y) in pixels.

Type: Dict[InstanceId, ndarray[float32]]. Shape: (468, 2).

LANDMARKS_MPEG4 = 14

MPEG4 landmarks. Each landmark is given by name and two coordinates (x,y) in pixels.

Type: Dict[InstanceId, Dict[LandmarkId, Landmark2d]].

LANDMARKS_SAI = 44

SAI dense face landmarks. Each landmark is given by two coordinates (x,y) in pixels.

Type: Dict[InstanceId, ndarray[float32]]. Shape: (4840, 2).

NORMALS = 3

Normals image. All values are in [-1,1] range.

Type: ndarray[float16]. Channels: 3.

PUPILS = 20

Coordinates of pupils. Each pupil is given by name and two coordinates (x,y) in pixels.

Type: Dict[InstanceId, Dict[str, Landmark2d]].

PUPILS_3D = 21

Coordinates of pupils in 3D. Each pupil is given by name and three coordinates (x,y,z) in camera space.

Type: Dict[InstanceId, Dict[str, Landmark3d]].

RGB = 2

RGB image modality.

Type: ndarray[uint8]. Channels: 3.

SCENE_ID = 1

Scene ID (rendered scene number).

Type: int.

UV = 8

UV image. This is a 2-channel image containing UV coordinates, where the first channel corresponds to the U coordinate and the second to the V coordinate.

Type: ndarray[uint16]. Channels: 2.
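
A sketch of converting the stored uint16 channels to floating-point UV coordinates in [0, 1], assuming the full uint16 range maps linearly onto the unit square (the scaling convention is an assumption):

import numpy as np

def uv_to_unit(uv: np.ndarray) -> np.ndarray:
    # 2-channel uint16 -> float32 in [0, 1].
    return uv.astype(np.float32) / np.float32(np.iinfo(np.uint16).max)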

WORLD_TO_CAM = 37

Transformation matrix from the world to the camera coordinate system.

Type: ndarray[float32]. Shape: (4, 4).

WORLD_TO_HEAD = 35

Transformation matrix from the world to the head coordinate system.

Type: Dict[InstanceId, ndarray[float32]]. Shape: (4, 4).

Module contents

Top-level package for synthesisai.