Innovations in Video and Image Content Understanding for Social Media Platforms

In 2019, we built advanced content moderation technology, using YOLOv3 on Darknet for real-time object detection and tackling video action detection, speech transcription, and automatic text summarization to improve user experience and ad targeting on social platforms.

During a transformative journey in late summer 2019, a chance encounter led to a groundbreaking project aimed at redefining content understanding on social media platforms akin to Instagram and TikTok. Our mission was clear: to filter content for sensitive elements like nudity and violence effectively and to refine ad targeting through cutting-edge machine learning techniques.

Advanced Object Detection

Leveraging pjreddie’s Darknet framework and the YOLOv3 model, we crafted a high-speed image and video classification pipeline. The system identified the predominant objects in each image or video frame and emitted the results as structured JSON, giving us a better speed/accuracy trade-off than traditional detection approaches.
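A minimal sketch of that step, assuming the darknet.py bindings shipped in pjreddie’s darknet are importable and the standard YOLOv3 config, weights, and COCO metadata sit at their usual paths (all paths and thresholds here are illustrative, not the project’s exact setup):

```python
import json
import darknet as dn  # python/darknet.py from pjreddie/darknet

# Standard YOLOv3 files; adjust paths to your checkout (illustrative).
net = dn.load_net(b"cfg/yolov3.cfg", b"yolov3.weights", 0)
meta = dn.load_meta(b"cfg/coco.data")

def detect_objects(image_path, thresh=0.5):
    """Run YOLOv3 on one image and return detections as JSON-ready dicts."""
    results = dn.detect(net, meta, image_path.encode("utf-8"), thresh=thresh)
    detections = []
    for label, confidence, (x, y, w, h) in results:
        # Class names come back as bytes under Python 3.
        name = label.decode("utf-8") if isinstance(label, bytes) else label
        detections.append({
            "label": name,
            "confidence": round(float(confidence), 4),
            "box": {"cx": x, "cy": y, "w": w, "h": h},  # center-based box
        })
    return detections

if __name__ == "__main__":
    print(json.dumps({"image": "frame_0001.jpg",
                      "objects": detect_objects("frame_0001.jpg")}, indent=2))
```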

Pioneering Video-Action Detection

[Figure: Video-action detection visualization]

Action detection in videos posed a unique set of challenges, from vast parameter spaces to scarce training data. Our solution integrated NVIDIA’s STEP (spatio-temporal progressive learning), a model designed to capture the complex temporal dynamics of video content and, at the time, among the strongest published approaches to action detection.
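The STEP inference itself lives in NVIDIA’s released code; the sketch below only illustrates the clip-preparation side, slicing a video into fixed-length, overlapping frame clips with OpenCV before handing them to a spatio-temporal detector. The `action_model.predict` mentioned in the closing comment is a hypothetical stand-in, not an actual STEP API.

```python
import cv2  # OpenCV for frame extraction

def video_to_clips(video_path, clip_len=16, stride=8, size=(224, 224)):
    """Slice a video into overlapping fixed-length clips of resized RGB frames."""
    cap = cv2.VideoCapture(video_path)
    frames, clips = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(cv2.resize(frame, size), cv2.COLOR_BGR2RGB)
        frames.append(frame)
    cap.release()
    # Overlapping windows give the temporal model context across clip boundaries.
    for start in range(0, max(len(frames) - clip_len + 1, 1), stride):
        clip = frames[start:start + clip_len]
        if len(clip) == clip_len:
            clips.append(clip)
    return clips

# Each clip would then be fed to the spatio-temporal action detector,
# e.g. action_model.predict(clip) -- a hypothetical call standing in for
# the actual STEP inference code.
```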

Revolutionizing Speech Transcription

[Figure: Speech transcription process]

We used Google’s Speech-to-Text API to turn the audio tracks extracted from videos into coherent transcriptions. The transcripts not only enriched content understanding but also paved the way for more contextual ad placements and user interactions.
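A rough sketch of that flow, assuming ffmpeg is available for audio extraction and the google-cloud-speech client library (2.x style) is installed; file names and language settings are illustrative. Synchronous `recognize` suits short clips, while longer audio would go through `long_running_recognize`.

```python
import subprocess
from google.cloud import speech

def extract_audio(video_path, wav_path="audio.wav"):
    """Pull a 16 kHz mono WAV track out of the video with ffmpeg."""
    subprocess.run(["ffmpeg", "-y", "-i", video_path,
                    "-ac", "1", "-ar", "16000", wav_path], check=True)
    return wav_path

def transcribe(wav_path):
    """Send the audio to Google Cloud Speech-to-Text and join the results."""
    client = speech.SpeechClient()
    with open(wav_path, "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )
    response = client.recognize(config=config, audio=audio)
    return " ".join(r.alternatives[0].transcript for r in response.results)

transcript = transcribe(extract_audio("upload.mp4"))  # illustrative file name
```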

Innovating with Automatic Text Summarization

Exploring both abstractive and extractive summarization techniques allowed us to condense the transcribed text into digestible summaries: the extractive pass pulled out the most salient sentences verbatim, while the abstractive pass generated new, concise sentences, mirroring a human-like grasp of the content.
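As a rough illustration of the two styles (the specific libraries here are stand-ins, not necessarily what the project used): a simple frequency-based extractive scorer alongside a pretrained Hugging Face summarization pipeline for the abstractive pass.

```python
import re
from collections import Counter
from transformers import pipeline  # stand-in for the abstractive model

def extractive_summary(text, n_sentences=3):
    """Score sentences by word frequency and keep the top few verbatim."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    freq = Counter(re.findall(r"\w+", text.lower()))
    scored = sorted(
        sentences,
        key=lambda s: sum(freq[w] for w in re.findall(r"\w+", s.lower())),
        reverse=True,
    )
    return " ".join(scored[:n_sentences])

def abstractive_summary(text):
    """Generate new, shorter sentences with a pretrained seq2seq summarizer."""
    summarizer = pipeline("summarization")
    return summarizer(text, max_length=60, min_length=20,
                      do_sample=False)[0]["summary_text"]
```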

Overcoming Challenges

The venture into video-action detection was the project’s most formidable challenge, forcing us to digest very recent research and to satisfy heavy parallel-processing demands. By employing multiprocessing and careful pipeline design, we ran the audio and visual analyses concurrently and merged their results, keeping the end-to-end content-understanding pipeline both fast and accurate.
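A stripped-down sketch of that orchestration, with the per-modality pipelines reduced to stubs that would wrap the earlier steps (the path and return shapes are illustrative):

```python
from multiprocessing import Pool

def visual_pipeline(video_path):
    # Would run the object- and action-detection sketches above over sampled frames.
    return {"objects": [], "actions": []}

def audio_pipeline(video_path):
    # Would extract the audio track and run the Speech-to-Text sketch above.
    return {"transcript": ""}

if __name__ == "__main__":
    video = "upload.mp4"  # illustrative path
    with Pool(processes=2) as pool:
        visual = pool.apply_async(visual_pipeline, (video,))
        audio = pool.apply_async(audio_pipeline, (video,))
        # Block until both modalities finish, then merge into one report.
        report = {**visual.get(), **audio.get()}
        print(report)
```

Running the two modalities in separate processes lets frame decoding and the network-bound transcription call overlap instead of executing serially.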

[Figure: Project overview]

For an in-depth look at our journey and technological breakthroughs, visit our project page.