Automated Self-Improving and User-Responsive Video Chapter Generation Using Generative Artificial Intelligence
Traditional video segmentation methods offer limited granularity and static chapter structures. Such solutions lack deep semantic understanding and do not incorporate user interaction data for video segmentation, leading to inefficient information discovery and a poor user experience. This disclosure describes the use of generative artificial intelligence (genAI) to automatically create and dynamically refine smart chapters for long-form videos. These chapters are rich in detail, including titles, summaries, key concepts, and preview clips. With user permission, user interaction data can be used to dynamically adapt the chapter structure for individual viewers or the overall population. The techniques transform static videos into dynamic, easily navigable content with a chapter structure that improves over time and eases navigation within a video
LLM-based Automatic Generation of Contextually Appropriate User Interface
Currently, the user interface (UI) for digital photo management applications is a static, reverse chronological grid of thumbnails. From a user perspective, this UI has several drawbacks such as high interaction cost for retrieval, lack of contextual adaptation, and higher task friction. This disclosure describes a generative situational UI engine for media management applications. The engine is powered by an integrated large language model that is capable of automatic user interface generation. This engine replaces the static, one-size-fits-all grid with a dynamic interface that, with the user’s permission, is generated based on the user’s context and the content of their media to determine a likely user intent and automatically generate a bespoke layout and toolset that matches the intent. The techniques transform the user interface from a passive digital shoebox into a proactive, intelligent partner that is responsive to the user’s context, shaped to be helpful and resonant
Multimodal LLM for automated validation and correction of geospatial data.
Manual verification of potential map inaccuracies can be resource and time intensive. Some systems may utilize a multimodal large language model (LLM) as a reasoning engine for map data validation. An LLM can be configured to ingest and analyze disparate data sources, such as user-generated reports, street-level and satellite imagery, and anonymized aggregate location data. By performing a cross-validation analysis on these inputs, the system may calculate a confidence score for a potential map discrepancy. In some examples, when the score exceeds a predetermined threshold, the system can automatically execute a correction in a geospatial database. This automated correction process may be used to facilitate more efficient and scalable maintenance of map data
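The threshold step described above can be sketched as follows. This is a minimal illustration only: the disclosure does not specify how the confidence score is computed, so the weighted vote, the `Evidence` fields, the weights, and the 0.8 threshold are all assumptions standing in for the LLM's cross-validation analysis.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    source: str     # hypothetical label, e.g. "user_report" or "street_imagery"
    supports: bool  # True if this source agrees that the map is wrong
    weight: float   # assumed reliability weight for the source


def confidence_score(evidence):
    """Weighted fraction of the evidence that supports the discrepancy."""
    total = sum(e.weight for e in evidence)
    if total == 0:
        return 0.0
    return sum(e.weight for e in evidence if e.supports) / total


def decide(evidence, threshold=0.8):
    """Return the action and the score; auto-correct only above the threshold."""
    score = confidence_score(evidence)
    action = "auto_correct" if score > threshold else "manual_review"
    return action, score


mixed = [
    Evidence("user_report", True, 0.5),
    Evidence("street_imagery", True, 1.0),
    Evidence("satellite_imagery", False, 0.5),
]
print(decide(mixed))
```

In this sketch, conflicting sources pull the score below the threshold, so the discrepancy is routed to manual review instead of being written to the geospatial database.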
On-Device Multi-Modal Hazard Detection Using a Language Model
Existing hazard detection methods often depend on cloud processing, which can introduce latency and privacy concerns. These methods may also be limited to single data modalities, lacking comprehensive contextual understanding. This disclosure describes techniques for on-device, multi-modal hazard detection using a language model or a large language model (LLM). The method involves the continuous, real-time processing of raw data streams from various sensors, such as cameras and microphones. An on-device LLM analyzes this fused sensor data to reason about the user's environment and calculate a probability of danger. The purpose is to provide timely, context-aware safety notifications to the user while preserving data privacy by performing all processing locally on the device
Artificial Intelligence Powered Distillation and Visualization of Audience Chat for Live Game Streaming
During live game streaming, the high volume and speed of audience chat can present challenges for a streamer attempting to identify and act upon crowd-sourced information. This disclosure describes a system that may use an artificial intelligence agent to ingest and analyze multimodal data streams, which can include live chat, game video, and game audio. The agent can classify chat messages, score their contextual relevance, and use techniques such as semantic clustering to synthesize consensus-based advice from multiple viewers. This distilled guidance may then be rendered as a non-disruptive visual hint, for example, an on-screen object highlight or path indicator, which can be configured for visibility to the streamer. Such a system may assist streamers in leveraging collective audience intelligence in real-time, potentially reducing information overload and enhancing the collaborative gameplay experience, and can be designed to operate without direct game integration
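The consensus step can be illustrated with a small sketch. It is an assumption-laden stand-in: real semantic clustering would likely compare embeddings, whereas this example groups messages by word overlap (Jaccard similarity). The function names, the 0.5 similarity cutoff, and the two-vote minimum are hypothetical.

```python
def tokens(message):
    """Lowercased word set; a crude stand-in for a semantic embedding."""
    return set(message.lower().split())


def jaccard(a, b):
    """Overlap between two word sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(messages, min_sim=0.5):
    """Greedy clustering: join the first cluster whose first message is similar."""
    clusters = []
    for message in messages:
        for group in clusters:
            if jaccard(tokens(message), tokens(group[0])) >= min_sim:
                group.append(message)
                break
        else:
            clusters.append([message])
    return clusters


def consensus(messages, min_votes=2):
    """Return a representative of the largest cluster, or None if no consensus."""
    best = max(cluster(messages), key=len, default=[])
    return best[0] if len(best) >= min_votes else None


chat = ["go left at the door", "go left at door", "lol nice", "check the chest"]
print(consensus(chat))
```

Only advice that several viewers repeat is surfaced, which is one simple way to reduce the volume of chat a streamer has to read.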
AUTOMATING TASKS IN VIDEO GAMES VIA INTELLIGENT AGENT
This document describes techniques that enable a computing device (e.g., a personal computer, laptop, smartphone, smart watch, server system, gaming console, tablet, wearable device, virtual reality (VR) headset, augmented reality (AR) glasses, etc.) to automate repetitive tasks in video games using an intelligent software agent (e.g., artificial intelligence (AI), machine learning model, etc.). Video games often necessitate repetitive, time-consuming tasks for player progression, leading to user fatigue and diminished enjoyment (e.g., loot farming, resource gathering, experience point (XP) grinding, repetitive combat encounters). In some instances, the player may have to complete the same tedious tasks multiple times to reach engaging content, for example, if the player fails a boss fight in a video game and must replay an entire dungeon to try again. The computing device may configure the intelligent agent to identify and automate such recurring in-game task sequences based on user authorization and input. The intelligent agent may incorporate input monitoring and/or pattern recognition to detect repetitive user actions. The intelligent agent may also include a context-aware automation functionality so the device may execute automated tasks while monitoring for changes in the video game environment. The techniques may incorporate user control and oversight features to help ensure explicit consent and provide an override mechanism. Accordingly, if the player wishes to regain control in the video game, the device may quickly disable the intelligent agent. In this way, the techniques of this disclosure may streamline repetitive gameplay, thereby reducing user fatigue and/or repetitive strain injury, improving the user experience, and allowing the player to focus on more engaging aspects of the video game
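One simple way to picture the pattern recognition step is to count repeated runs of actions in an input log. The sketch below is illustrative only: the disclosure does not name a detection algorithm, and the function names, the window length, and the repeat threshold are assumptions.

```python
from collections import Counter


def find_repeated_sequence(actions, length=3, min_repeats=3):
    """Return the most frequent run of `length` actions if it repeats enough."""
    windows = [
        tuple(actions[i:i + length])
        for i in range(len(actions) - length + 1)
    ]
    if not windows:
        return None
    sequence, count = Counter(windows).most_common(1)[0]
    return sequence if count >= min_repeats else None


def sequence_to_automate(actions, user_authorized, **kwargs):
    """Offer a sequence for automation only with explicit user authorization."""
    if not user_authorized:
        return None
    return find_repeated_sequence(actions, **kwargs)


log = ["attack", "loot", "move"] * 3
print(sequence_to_automate(log, user_authorized=True))
```

The authorization check mirrors the consent requirement in the description: without it, no sequence is proposed regardless of how repetitive the input log is.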
Generating Interactive, LLM-driven Annotations for Inclusion in Digital Image Files
While photographs capture visual information, they lack depth and interactivity. Contextual information, memories, stories, and related data associated with a photograph are external to the image file itself, and are scattered across applications, cloud services, or within the user’s mind. This disclosure describes techniques to enhance a digital image by automatically generating interactive annotations using a large language model (LLM) and incorporating the annotations into the image. Entities within a photo are identified and an LLM is instructed to generate an annotation object for the recognized entities. The annotation object is a self-contained package of data and instructions that specifies an interactive experience associated with an entity identified within the image. The annotation object is incorporated into the image file. When a user views the image, the viewer application can read and render the annotation objects. For example, if the user taps on an entity within the image, the corresponding annotation object is made available for interactive engagement with the user
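The tap interaction can be sketched as a lookup from screen coordinates to an annotation object. The `Annotation` fields, the bounding-box representation, and the payload contents are hypothetical; the disclosure does not define a schema or say how annotations are stored in the image file.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    entity: str    # name of the recognized entity
    box: tuple     # assumed (left, top, right, bottom) in pixels
    payload: dict  # data and instructions for the interactive experience


def annotation_at(annotations, x, y):
    """Return the first annotation whose box contains the tapped point."""
    for annotation in annotations:
        left, top, right, bottom = annotation.box
        if left <= x <= right and top <= y <= bottom:
            return annotation
    return None


annotations = [
    Annotation("dog", (10, 20, 110, 120), {"prompt": "Tell the story of this dog"}),
    Annotation("bridge", (200, 0, 400, 150), {"prompt": "Describe this bridge"}),
]
tapped = annotation_at(annotations, 50, 60)
print(tapped.entity if tapped else "no annotation")
```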
Integrating Generative Animations of Photographs into a Coherent Video Journey
Users enjoy creating slideshows, animated videos, etc. that capture their trips. Current techniques to prepare such creations are limited by the content of the captured photographs and user knowledge of such features. This disclosure describes the use of artificial intelligence to automatically generate a dynamic video journey from a collection of user photographs. With user permission, an artificial intelligence model performs semantic analysis of user photos to infer the user’s geographic path during the photo capture journey. A generative video model accesses street-level images along the path and synthesizes a base video that simulates first-person movement along the path. The user’s photos are synchronized in time and perspective with the base video and are transformed into short animations using a multimodal generative model. Context-aware outpainting from the boundaries of the generated animation is performed to blend the animation with the base video to generate a visual summary that enables the user to re-live and share their travel memories
Reconstructing Missed Photographic Moments Within a Spatiotemporal Event Cluster
Group photos captured at events such as birthdays, weddings, holidays, etc. may sometimes leave out important individuals, rendering the memory of the event incomplete. This disclosure describes techniques that utilize generative artificial intelligence to reconstruct missed moments within a set of photos taken during a particular event by leveraging the collective context of the photos within the photo set. A multi-stage artificial intelligence (AI) pipeline is implemented that includes missed moment identification using event analysis; asset and scene reconstruction; context-aware reconstruction; and generative rendering. The techniques shift the paradigm of digital photo albums from organizing photos to helping users obtain a photo library that represents their experience of an event. The artificially-enhanced photograph is clearly marked as being synthetic and supports identification of the synthetic nature via image steganography, watermarking, or other techniques
Automatic Generation of a Labeled Corpus of Conversational Failures During Interaction Between a User and a Conversational AI Agent
Current techniques to detect conversational failures during interaction between users and conversational artificial intelligence (AI) have many limitations: they capture only a tiny fraction of failures, lack context about the user's internal state during the conversation, and are not scalable. This disclosure describes techniques to automatically detect, classify, and log conversational failures in interactions between users and conversational AI agents to generate a rich, structured dataset. With user permission, conversations are analyzed to detect prosodic features and acoustic cues within context to determine likely occurrences of conversational failures. The conversational state is tracked using a state vector that encodes the nature of failure. Upon failure detection, a structured data entry with information such as conversational context, agent action, user feedback, inferred state vector, and type of failure is automatically generated and added to a corpus. This automatically generated corpus can be used to improve conversational AI agents by creating challenging evaluation sets, fine-tuning components, and providing training data
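The structured data entry might look something like the sketch below. The field names follow the items listed in the description, but the types, the example values, the failure label, and the JSON-lines storage are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FailureEntry:
    context: list        # preceding conversation turns
    agent_action: str    # what the agent did when the failure occurred
    user_feedback: str   # observed reaction, e.g. derived from prosodic cues
    state_vector: list   # inferred conversational state (hypothetical encoding)
    failure_type: str    # hypothetical label such as "misrecognition"


def append_entry(corpus, entry):
    """Serialize the entry as one JSON line and add it to the corpus."""
    corpus.append(json.dumps(asdict(entry)))
    return corpus


corpus = []
append_entry(corpus, FailureEntry(
    context=["User: play jazz", "Agent: Playing chess tutorial"],
    agent_action="started_video",
    user_feedback="raised voice, repeated request",
    state_vector=[0.9, 0.1, 0.0],
    failure_type="misrecognition",
))
print(corpus[0])
```

Storing one self-describing record per failure keeps the corpus easy to filter by failure type when building evaluation sets or training data.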
- …
