PoseScript


Open-source Projects in SenseTime Research

2024 MultiModal Frame Retrieval in Video and Editing

  • VideoLLM Retrieves Video Frames
    [Image: frame localization]

    BestMoment annotation example, produced with our pipeline and the GPT-4o API, which performs best at action retrieval.

  • Large-scale (18.6M instances) Synthetic Pose Text Annotation

    This dataset addresses two problems in pose description: the high cost of manual annotation (¥0.03 per character) and the low accuracy of GPT-4o annotation (accuracy: manual 95%, GPT-4o 70%, ours 95%). It comes in two versions: (a) single-frame pose description and (b) dual-frame pose-change description, which are used to train text-to-frame models and image editing models.

    (a) Single-frame pose description with tracking visualization, an image version of PoseScript. The text on each image describes the pose of the person in the current bounding box. The detailed annotations include the text, the MPJPE (mean per-joint position error), and the Y-axis orientation (±180 degrees for front view).

    (b) Dual-frame pose-change description with a visualization of the second-frame pose description (non-tracking version), an image version of PoseFix.
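
The MPJPE reported in the annotations of (a) is conventionally the average Euclidean distance between corresponding joints of two poses. The page does not say which two poses are compared or in what units, so the following is only a minimal sketch of the standard definition; the function name `mpjpe` and the toy joint coordinates are illustrative assumptions, not part of the released pipeline.

```python
import math
from typing import Sequence

Joint = Sequence[float]  # one joint position, e.g. (x, y, z)


def mpjpe(pred: Sequence[Joint], target: Sequence[Joint]) -> float:
    """Mean per-joint position error: the mean Euclidean distance
    between corresponding joints of two poses.

    Hypothetical helper for illustration; both poses must list the
    same joints in the same order and use the same units.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("poses must be non-empty and have the same number of joints")
    # Sum the per-joint distances, then average over the joints.
    total = sum(math.dist(p, t) for p, t in zip(pred, target))
    return total / len(pred)


# Toy two-joint poses (made-up coordinates): the first joint matches
# exactly, the second is off by a 3-4-5 triangle, i.e. a distance of 5.
pose_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pose_b = [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]
error = mpjpe(pose_a, pose_b)  # (0 + 5) / 2 = 2.5
```

A lower value means the two poses agree more closely. In practice the poses are often aligned (for example at the root joint) before the error is computed, which this sketch does not do.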

  • Fine-grained (Action/Pose) Text Description to Locate Video Frames
