
    AIM 2020 challenge on video extreme super-resolution: methods and results

    This paper reviews the video extreme super-resolution challenge associated with the AIM 2020 workshop at ECCV 2020. Common scaling factors for learned video super-resolution (VSR) do not go beyond factor 4. Missing information can be restored well in this region, especially in HR videos, where the high-frequency content mostly consists of texture details. The task in this challenge is to upscale videos with an extreme factor of 16, which results in more serious degradations that also affect the structural integrity of the videos. A single pixel in the low-resolution (LR) domain corresponds to 256 pixels in the high-resolution (HR) domain. Due to this massive information loss, it is hard to accurately restore the missing information. Track 1 is set up to gauge the state-of-the-art for such a demanding task, where fidelity to the ground truth is measured by PSNR and SSIM. Perceptually higher quality can be achieved in trade-off for fidelity by generating plausible high-frequency content. Track 2 therefore aims at generating visually pleasing results, which are ranked according to human perception, evaluated by a user study. In contrast to single image super-resolution (SISR), VSR can benefit from additional information in the temporal domain. However, this also imposes an additional requirement, as the generated frames need to be consistent over time.
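The two quantities this abstract relies on, the LR-to-HR pixel ratio and the PSNR fidelity measure, can be illustrated with a small sketch. It is not the challenge's evaluation code: the function names `hr_pixels_per_lr_pixel` and `psnr`, the flat-list pixel representation, and the 8-bit peak value of 255 are assumptions made for illustration.

```python
import math


def hr_pixels_per_lr_pixel(scale):
    """Number of HR pixels covered by one LR pixel for a given scale factor."""
    # Upscaling by `scale` in both height and width multiplies the area by scale^2.
    return scale * scale


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists.

    `peak` is the maximum possible pixel value (255 assumes 8-bit data).
    """
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    print(hr_pixels_per_lr_pixel(16))  # factor 16 in each dimension
    print(psnr([10, 20, 30, 40], [12, 18, 33, 41]))
```

A factor of 16 gives 16 × 16 = 256 HR pixels per LR pixel, which matches the figure quoted in the abstract. SSIM and the user study of Track 2 are not covered by this sketch.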

    NTIRE 2020 challenge on video quality mapping: methods and results

    This paper reviews the NTIRE 2020 challenge on video quality mapping (VQM), which addresses the issues of quality mapping from source video domain to target video domain. The challenge includes both a supervised track (track 1) and a weakly-supervised track (track 2) for two benchmark datasets. In particular, track 1 offers a new Internet video benchmark, requiring algorithms to learn the map from more compressed videos to less compressed videos in a supervised training manner. In track 2, algorithms are required to learn the quality mapping from one device to another when their quality varies substantially and weakly aligned video pairs are available. For track 1, in total 7 teams competed in the final test phase, demonstrating novel and effective solutions to the problem. For track 2, some existing methods are evaluated, showing promising solutions to the weakly-supervised video quality mapping problem.

    LocalViT: Analyzing Locality in Vision Transformers

    The aim of this paper is to study the influence of locality mechanisms in vision transformers. Transformers originated from machine translation and are particularly good at modelling long-range dependencies within a long sequence. Although the global interaction between the token embeddings could be well modelled by the self-attention mechanism of transformers, what is lacking is a locality mechanism for information exchange within a local region. In this paper, the locality mechanism is systematically investigated by carefully designed controlled experiments. We introduce locality into the feed-forward network of vision transformers. This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks. The importance of locality mechanisms is validated in two ways: 1) A wide range of design choices (activation function, layer placement, expansion ratio) are available for incorporating locality mechanisms and proper choices can lead to a performance gain over the baseline, and 2) The same locality mechanism is successfully applied to vision transformers with different architecture designs, which shows the generalization of the locality concept. For ImageNet2012 classification, the locality-enhanced transformers outperform the baselines Swin-T [1], DeiT-T [2] and PVT-T [3] by 1.0%, 2.6% and 3.1% with a negligible increase in the number of parameters and computational effort. Code is available at https://github.com/ofsoundof/LocalViT
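One way to picture a locality mechanism inside a feed-forward network is the sketch below. It assumes, by analogy with inverted residual blocks, that locality comes from a depthwise 3x3 convolution applied to the tokens after they are rearranged into their 2D image grid; the abstract does not spell this out, so treat the operation and the helper names (`tokens_to_grid`, `depthwise_conv3x3`, `grid_to_tokens`) as illustrative assumptions rather than the authors' implementation. The pointwise layers and activation of the feed-forward network are omitted.

```python
def tokens_to_grid(tokens, height, width):
    """Rearrange a flat token sequence (length height*width) into an H x W grid."""
    if len(tokens) != height * width:
        raise ValueError("token count must equal height * width")
    return [tokens[r * width:(r + 1) * width] for r in range(height)]


def grid_to_tokens(grid):
    """Flatten an H x W grid of token vectors back into a sequence."""
    return [token for row in grid for token in row]


def depthwise_conv3x3(grid, kernels):
    """Depthwise 3x3 convolution with zero padding.

    grid:    H x W x C nested lists (one C-dimensional vector per token).
    kernels: C kernels, each a 3x3 nested list; channel c only sees channel c.
    """
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for ch in range(c):
                acc = 0.0
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        # Positions outside the grid contribute zero (padding).
                        if 0 <= yy < h and 0 <= xx < w:
                            acc += kernels[ch][dy + 1][dx + 1] * grid[yy][xx][ch]
                out[y][x][ch] = acc
    return out


def local_mixing(tokens, height, width, kernels):
    """Sequence -> grid -> depthwise conv -> sequence: neighbours exchange information."""
    grid = tokens_to_grid(tokens, height, width)
    return grid_to_tokens(depthwise_conv3x3(grid, kernels))
```

With a kernel that is 1 at the centre and 0 elsewhere, `local_mixing` returns the tokens unchanged, which corresponds to a plain feed-forward network with no spatial interaction; any other kernel lets each token draw on its immediate spatial neighbours.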

    An Efficient Recurrent Adversarial Framework for Unsupervised Real-Time Video Enhancement

    Video enhancement is a challenging problem, more than that of stills, mainly due to high computational cost, larger data volumes and the difficulty of achieving consistency in the spatio-temporal domain. In practice, these challenges are often coupled with the lack of example pairs, which inhibits the application of supervised learning strategies. To address these challenges, we propose an efficient adversarial video enhancement framework that learns directly from unpaired video examples. In particular, our framework introduces new recurrent cells that consist of interleaved local and global modules for implicit integration of spatial and temporal information. The proposed design allows our recurrent cells to efficiently propagate spatio-temporal information across frames and reduces the need for high complexity networks. Our setting enables learning from unpaired videos in a cyclic adversarial manner, where the proposed recurrent units are employed in all architectures. Efficient training is accomplished by introducing one single discriminator that learns the joint distribution of source and target domain simultaneously. The enhancement results demonstrate clear superiority of the proposed video enhancer over the state-of-the-art methods in terms of visual quality, quantitative metrics, and inference speed. Notably, our video enhancer is capable of enhancing over 35 frames per second of FullHD video (1080x1920).

    Generative Flows with Invertible Attentions

    Flow-based generative models have shown an excellent ability to explicitly learn the probability density function of data via a sequence of invertible transformations. Yet, learning attention in generative flows remains understudied, although attention has led to breakthroughs in other domains. To fill the gap, this paper introduces two types of invertible attention mechanisms, i.e., map-based and transformer-based attentions, for both unconditional and conditional generative flows. The key idea is to exploit a masked scheme of these two attentions to learn long-range data dependencies in the context of generative flows. The masked scheme allows for invertible attention modules with tractable Jacobian determinants, enabling their seamless integration at any position of the flow-based models. The proposed attention mechanisms lead to more efficient generative flows, due to their capability of modeling the long-term data dependencies. Evaluation on multiple image synthesis tasks shows that the proposed attention flows result in efficient models and compare favorably against the state-of-the-art unconditional and conditional generative flows.

    The Vid3oC and IntVID Datasets for Video Super Resolution and Quality Mapping

    The current rapid advancements of computational hardware have opened the door for deep networks to be applied to real-time video processing, even on consumer devices. Appealing tasks include video super-resolution, compression artifact removal, and quality enhancement. These problems require high-quality datasets that can be used for training and benchmarking. In this work, we therefore introduce two video datasets, aimed at a variety of tasks. First, we propose the Vid3oC dataset, containing 82 simultaneous recordings of 3 camera sensors. It is recorded with a multi-camera rig, including a high-quality DSLR camera, a high-end smartphone, and a stereo camera sensor. Second, we introduce the IntVID dataset, containing over 150 high-quality videos crawled from the internet. The datasets were employed for the AIM 2019 challenges for video super-resolution and quality mapping.

    Learned Image Signal Processing Pipeline for Mobile Cameras

    The image signal processing (ISP) pipeline is a crucial part of the image creation process. This pipeline consists of a handcrafted and complex sequence of image-processing tasks that are used to process the raw image from the camera sensor and produce the final RGB image. Because of the hardware limitations that the compact size of mobile cameras imposes, the ISP of mobile phones has become more advanced and complex to overcome these limitations. In previous years, a new research direction proposed to replace this complex handcrafted pipeline with an end-to-end learning-based ISP using deep learning. This was achieved by training a deep learning network to process the raw image of a phone camera by imitating the output of a DSLR camera. This approach showed promising results without the need for the long and complex process of a handcrafted conventional ISP. However, it is still a research direction with many limitations and problems compared to the conventional ISP used in mobile cameras nowadays. A lot of work is needed to address these issues before the approach reaches production-level accuracy and robustness. In this work, we tried to improve the current state of learning-based ISP by addressing some of its main problems. We worked on night image rendering using a learning-based ISP network. We proposed an efficient network that was trained without the need for annotated data. Our proposed approach was one of the top 10 solutions in the NTIRE 2023 Challenge on Night Photography Rendering. We also worked on the problems of ISP datasets, such as alignment and availability. We proposed a novel idea to create a fully aligned high-quality synthetic ISP dataset from a weakly aligned ISP dataset. Our experiments show that we get better performance by training on our synthetic dataset than by training directly on the weakly aligned dataset, which shows the effectiveness of our pipeline. We also showed the ability of our pipeline to generate a new synthetic dataset from just DSLR RGB images. Lastly, we addressed the problem of missing global information in learned ISP networks. We proposed a novel color module that utilizes the global information from the full raw image in addition to local information from the input raw patch. Our module is a general module that can be integrated with any ISP network to improve its color reproduction accuracy. We achieved state-of-the-art performance by utilizing our simple and efficient color module with a simple ISP network. We showed that by just utilizing the global information from the full image we can immensely improve the performance of ISP networks.