
    Measuring the loss of duplicated genes in plant genomes assembled by means of short reads.

    Introduction: The assembly of a genome is a complex task whose hardest step is the resolution of repeats. Because these regions are usually considered of minor importance for describing the features of a genome, they are often poorly characterized in genome analyses. By definition, any region present at least twice in the genome is a repeat; duplicated genes can therefore fall into this category, leading to an underestimation of duplication events in genomes. This effect could be exacerbated by k-mer-based short-read assembly algorithms. Methods: During gap-closure experiments on the tomato genome, some of the unplaced contigs turned out to be duplicated genes. To our knowledge, this loss of duplicated genes has never been measured for plant genomes. For this reason, the Arabidopsis thaliana genome was used as a reference sequence to generate simulated paired-end Illumina reads, which were assembled with De Bruijn graph-based algorithms. Moreover, short-read data from other publicly available Arabidopsis thaliana ecotypes were assembled in the same way and compared to the corresponding reference-guided assemblies. Results: The comparison between the published genome assemblies and the De Bruijn graph-based assemblies allowed us to investigate duplicated genes in terms of: 1) how many genes are missing from the assembled genomes; 2) how k-mer length may affect the loss or retention of duplicated genes; 3) how the structure of duplicated genes can be affected by differing degrees of nucleotide conservation. Discussion: All eukaryotic genome projects are now performed through short-read production and assembly. The impact of the sequencing strategy on the representation of duplicated genes should yield new insights to be considered when studying plant genomes and their evolution.
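
    The reason k-mer-based assemblers tend to collapse duplicated genes can be illustrated with a minimal Python sketch. It is not taken from the study: the sequences are hypothetical toy data, and the helper names `kmer_counts` and `collapsed_kmers` are invented for illustration. A De Bruijn graph stores each distinct k-mer once, so every k-mer lying entirely inside a perfectly duplicated gene is shared by both copies.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every k-mer (substring of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def collapsed_kmers(seq, k):
    """k-mers seen more than once; a De Bruijn graph stores each of them only once."""
    return {kmer for kmer, n in kmer_counts(seq, k).items() if n > 1}

# Hypothetical toy sequences, not real data: one gene inserted twice.
gene = "ATGGCGTACGTTAGCCTGAAGTCCGATTAA"
genome = "CCTAGGTCAATC" + gene + "GGATCCTTAGAC" + gene + "TTGACCGTAAGG"

k = 11
gene_kmers = set(kmer_counts(gene, k))
shared = collapsed_kmers(genome, k)
print(len(gene_kmers), "k-mers in the gene;",
      len(gene_kmers & shared), "of them occur more than once in the genome")
```

    With identical copies, all gene k-mers are ambiguous for any k up to the gene length. Diverged copies share only the k-mers that avoid every substitution, which is why both k-mer length and nucleotide conservation matter.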

    The hydrobiological station of Chioggia: a platform to study the coasts and lagoons of the north-western Adriatic Sea

    The Hydrobiological Station “Umberto D’Ancona” (chioggia.biologia.unipd.it/en/), founded in 1941 in Chioggia, inside the Venetian Lagoon (northern Adriatic Sea), is a historic field station for research on marine and lagoon environments. Strongly linked to the local context, the Station has collected data and built databases over time. Two continuous datasets are freely available and updated yearly. Since the 1970s, data on air and water temperature, salinity, pH and dissolved oxygen have been collected inside the Venetian Lagoon. In addition, a database of monthly landings of several marine species, as reported by the local fishery, has been maintained since 1945. Ongoing research on coasts includes: the relationship between environmental characteristics and fish communities of hard substrates; the spatial and temporal variation in microbial communities; the effects of climate change (acidification and temperature) and pollution on marine invertebrates, fish, microbes and algae; the genetic structure of marine organisms; and the relationship between environment and behavior in marine fish. The Station also serves as a reference point for local fishermen for the early warning of invasive species. Here we propose to expand this research to address the challenges faced by coastal areas, including the monitoring of key species (defined as study species together with other research institutions) over space and time, as sentinels for earlier management of ecosystem changes.

    A web-based platform to retrieve user-ranked data from human exome/genome sequencing projects.

    Genome and exome sequencing projects produce huge amounts of data, which in turn can yield extensive catalogues of human genetic variation. However, identifying which genetic variants are implicated in the onset and progression of human diseases remains a difficult task. New bioinformatic tools are required to efficiently extract a small number of candidate variants from the large amounts of DNA sequencing data produced. Here we present the development of a platform designed to manage and retrieve data from human exome/genome sequencing projects. The platform integrates heterogeneous information to help associate variants with the pathology or phenotype under study. The information can relate to gene features (Gene Ontology, Disease Ontology, OMIM, InterPro annotations) or to genomic context, or it can describe the effects of variants on the CDS (dbSNP, degree of deleteriousness) and their confidence in terms of depth of sequence coverage and calling score. The platform is accessible through a web interface where the user can upload one or more files containing variants in VCF format. SNPs and microindels are automatically mapped onto the genome and stored in a relational database together with their possible effects on the corresponding transcripts and proteins. A powerful and flexible query system then allows the user to explore the data by applying different criteria related to the heterogeneous information stored in the database. The results of a query are displayed as a ranked list ordered according to how many of the imposed criteria are satisfied. The query and ranking systems therefore allow the user to filter the information at different levels and to directly assess the significance of the results. The web platform and the query system are based on a scalable and easily configurable XML-based language. This makes it easy to cope with the continuous increase in data volume and heterogeneity, and with the resulting updates to the database structure, without any modification of the software code.
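
    The ranking idea described above (ordering results by how many of the imposed criteria each variant satisfies) can be sketched in a few lines of Python. This is an illustrative sketch only: the field names (`depth`, `in_dbsnp`, `effect`), the thresholds and the function `rank_by_criteria` are assumptions, not the platform's actual schema or query language.

```python
def rank_by_criteria(records, criteria):
    """Return (score, record) pairs, best first.

    The score is the number of criteria a record satisfies. Python's sort is
    stable, so records with equal scores keep their input order.
    """
    scored = [(sum(1 for test in criteria if test(rec)), rec) for rec in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

# Hypothetical variant records; the keys are invented for this example.
variants = [
    {"id": "v1", "depth": 40, "in_dbsnp": True, "effect": "synonymous"},
    {"id": "v2", "depth": 55, "in_dbsnp": False, "effect": "missense"},
    {"id": "v3", "depth": 8, "in_dbsnp": False, "effect": "missense"},
]

# Each criterion is a predicate over one record.
criteria = [
    lambda v: v["depth"] >= 20,          # well-covered call
    lambda v: not v["in_dbsnp"],         # not a known polymorphism
    lambda v: v["effect"] == "missense", # changes the protein
]

ranked = rank_by_criteria(variants, criteria)
for score, variant in ranked:
    print(score, variant["id"])
```

    Counting satisfied criteria rather than requiring all of them keeps partially matching variants visible, which is what lets a user filter at different levels.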

    A global gene evolution analysis on Vibrionaceae family using phylogenetic profile

    Background: Vibrionaceae represent a significant portion of the cultivable heterotrophic marine bacteria; they strongly affect nutrient cycling, and some species are devastating pathogens. In this work we propose an improved phylogenetic profile analysis of 14 Vibrionaceae genomes to study the evolution of this family on the basis of gene content. Phylogenetic profiling is based on the observation that genes involved in the same process (e.g. a metabolic pathway or structural complex) tend to be concurrently present or absent across genomes. This allows hypothetical functions to be predicted on the basis of shared phylogenetic profiles. Moreover, the approach is useful for identifying putative laterally transferred elements on the basis of their presence in phylogenetically distant bacteria. Results: Vibrionaceae ORFs were aligned against all available bacterial proteomes. A phylogenetic profile is defined as an array of distances, based on amino acid substitution matrices, from a single gene to all its orthologues. The final phylogenetic profiles, derived from a non-redundant list of all ORFs, were defined as the median of all the profiles belonging to each cluster. The resulting phylogenetic profile matrix contains gene clusters in the rows and organisms in the columns. Cluster analysis identified groups of "core genes" with widespread high similarity across all the organisms, and several clusters containing genes homologous only to a limited set of organisms. COG class enrichment was calculated for each of these clusters. The analysis reveals that clusters of core genes have the highest number of enriched classes, while the others are enriched for only a few of them, such as DNA replication, recombination and repair. Conclusion: We found that mobile elements have heterogeneous profiles not only across the entire set of organisms but also within Vibrionaceae; this confirms their great influence on bacterial evolution even within the same family. Furthermore, several hypothetical proteins correlate highly with mobile element profiles, suggesting a possible horizontal transfer mechanism for the evolution of these genes. Finally, we suggest a putative role for some ORFs of unknown function on the basis of the similarity of their phylogenetic profiles to those of well-characterized genes.
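
    Two operations mentioned in the abstract, taking the median of the profiles in a cluster and comparing profiles by correlation, can be sketched as follows. This is a minimal illustration under stated assumptions: the numbers are made up, Pearson correlation is an assumed choice (the abstract does not name the correlation measure), and `median_profile` and `pearson` are hypothetical helper names.

```python
from math import sqrt
from statistics import median

def median_profile(profiles):
    """Column-wise median of equally long profiles (one value per organism)."""
    return [median(column) for column in zip(*profiles)]

def pearson(x, y):
    """Pearson correlation between two profiles of equal length."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical cluster: three genes, each with a distance to three organisms.
cluster = [
    [0.1, 0.9, 0.5],
    [0.3, 0.7, 0.5],
    [0.2, 0.8, 0.9],
]
cluster_profile = median_profile(cluster)
print(cluster_profile)

# Profiles that rise and fall together across organisms correlate strongly.
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

    The median makes the cluster profile robust to a single outlying gene, such as the 0.9 in the last column above.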

    PABS: an online Platform to Assist BAC-by-BAC Sequencing projects.

    Genome sequencing projects are based either on whole genome shotgun (WGS) or on a BAC-by-BAC strategy. Although WGS is in most cases the preferred choice, the BAC-by-BAC approach may sometimes be better because it requires a much simpler assembly process. Furthermore, when the study is limited to specific regions of the genome, WGS would require an unjustified effort, making BAC-by-BAC the only feasible strategy. In this paper we describe an informatics pipeline called PABS (Platform Assisted BAC-by-BAC Sequencing) that we developed to optimize the BAC-by-BAC sequencing strategy. PABS has two main functions: (i) PABS-Select, to choose suitable overlapping clones; and (ii) PABS-Validate, to verify whether a BAC under analysis actually overlaps the neighboring BAC.

    Genome Physical Mapping with Next Generation Sequencing

    Next generation sequencing technology has improved considerably over the past few years, making the shotgun sequencing approach easier and more affordable. Short reads are particularly popular because they are very easy and cheap to produce. On the other hand, their assembly generates a vast number of relatively short contigs that require suitable physical maps and scaffolding procedures to be further assembled into a draft genomic sequence. Unfortunately, physical maps are still very difficult to produce and take little advantage of next generation sequencing technology. The aim of our project is to investigate the possibility of overcoming this problem. The organism we chose as a test case is Nannochloropsis gaditana, a unicellular alga that could be very useful in biofuel production because of its capacity to accumulate large amounts of lipids under particular growth conditions, and because its genome is relatively small (32 Mb). We obtained a BAC library of more than 11,000 clones with an average insert size of 120 kb. This BAC library is the starting point of our method: by selecting random clones from the library we produced 32 pools, each representing about 40% of the genome. Each pool was fragmented by sonication and sequenced with a SOLiD 5500XL. High coverage of the genome was also produced by an independent shotgun project. Non-repeated sequences (tags) can be identified from their coverage, and each of them can be considered a genetic marker. The presence or absence of a tag in each pool can then be analyzed: the closer two tags are in the genome, the more often they are expected to be present together in the same pools. By analyzing the presence/absence profile of each tag across the pools, it is possible to sort the tags according to their relative position, producing a high-density, high-quality physical map. We are currently analyzing 32 pools of 96 randomly fragmented BAC clones. A computer simulation indicated that 32 pools should be sufficient to produce a complete physical map of a genome equivalent to that of N. gaditana. If necessary, we could easily produce more pools. Larger genomes could also benefit from this approach, although the number of BACs per pool and the number of pools must be adjusted accordingly. The prospect of producing physical maps at low cost is very important for the improvement of de novo assembly, and in particular for the scaffolding procedures that are now the limiting step of the entire process.
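
    The principle that tags lying close together share presence/absence profiles across pools can be sketched in Python. This is a simplified illustration, not the authors' procedure: the profiles are invented (six pools instead of 32), Jaccard similarity is an assumed measure, and the greedy chaining in `greedy_order` is a deliberately naive heuristic that assumes the starting tag sits at one end of the map.

```python
def jaccard(a, b):
    """Pools containing both tags divided by pools containing either tag."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

def greedy_order(profiles, start):
    """Chain tags by repeatedly appending the unused tag most similar to the last one."""
    order = [start]
    remaining = set(profiles) - {start}
    while remaining:
        last = order[-1]
        # sorted() makes tie-breaking deterministic (alphabetical).
        nxt = max(sorted(remaining),
                  key=lambda tag: jaccard(profiles[last], profiles[tag]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Hypothetical presence (1) / absence (0) of four tags in six pools.
profiles = {
    "A": [1, 1, 0, 0, 1, 0],
    "B": [1, 1, 0, 1, 1, 0],
    "C": [0, 1, 0, 1, 1, 1],
    "D": [0, 0, 1, 1, 0, 1],
}

print(greedy_order(profiles, "A"))
```

    In this toy data the similarity to tag A falls from 0.75 (B) to 0.4 (C) to 0 (D), mirroring increasing distance along the genome. A real map would need many more pools and an ordering method that tolerates noise.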