
    Dynamic integration of distributed, Cloud-based HPC and HTC resources using JSON Web Tokens and the INDIGO IAM Service

    In the last couple of years, we have been actively developing the Dynamic On-Demand Analysis Service (DODAS) as an enabling technology to deploy container-based clusters over any Cloud infrastructure with almost zero effort. The DODAS engine is driven by high-level templates written in the TOSCA language, which abstract away the complexity of many configuration details. DODAS is particularly suitable for harvesting opportunistic computing resources; this is why several scientific communities have already integrated their computing use cases into DODAS-instantiated clusters, automating the instantiation, management and federation of HTCondor batch systems. The increasing demand for, availability of and utilization of HPC by multidisciplinary user communities often mandates the possibility to transparently integrate, manage and mix HTC and HPC resources. In this paper, we discuss our experience extending and using DODAS to connect HPC and HTC resources in the context of a distributed Italian regional infrastructure involving multiple sites and communities. In this use case, DODAS automatically generates HTCondor-based clusters on demand, dynamically and transparently federating sites that may also include HPC resources managed by SLURM; DODAS allows user workloads to make opportunistic and automated use of both HPC and HTC resources, thus maximizing and optimizing resource utilization. We also report on our experience of using and federating HTCondor batch systems exploiting the JSON Web Token capabilities introduced in recent HTCondor versions, replacing the traditional X509 certificates in the whole chain of workload authorization. In this respect we also report on how we integrated HTCondor using OAuth with the INDIGO IAM service.
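
The abstract above mentions JSON Web Tokens replacing X509 certificates for workload authorization. As a minimal sketch of what such a token carries, the following Python snippet splits a JWT into its three dot-separated segments and decodes the header and claims. It does not verify the signature, and it is not the HTCondor or INDIGO IAM integration itself. The helper names (`jwt_claims`, `make_demo_token`) and the issuer URL used in the usage note are hypothetical.

```python
import base64
import json


def b64url_decode(segment):
    """Decode a base64url segment, restoring the padding that JWTs strip."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def b64url_encode(data):
    """Encode bytes as unpadded base64url, the encoding used by JWT segments."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwt_claims(token):
    """Return (header, claims) of a JWT WITHOUT verifying its signature.

    Inspection only: a real service must validate the signature against
    the issuer's public keys before trusting any claim.
    """
    header_b64, payload_b64, _signature = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    claims = json.loads(b64url_decode(payload_b64))
    return header, claims


def make_demo_token(header, claims):
    """Build an unsigned demo token (hypothetical helper, for illustration)."""
    parts = [b64url_encode(json.dumps(p).encode("utf-8")) for p in (header, claims)]
    return ".".join(parts + ["demo-signature"])
```

For instance, `jwt_claims(make_demo_token({"alg": "none", "typ": "JWT"}, {"iss": "https://iam.example.org/", "sub": "demo-user"}))` returns the two dictionaries unchanged. In a real deployment the issuer, subject and scope claims would be set by the token issuer, not by the client.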

    Open Access repositories for scientific literature and research data

    The Open Access Repository (OAR) project was started to implement Open Access policies and to preserve and share the scientific research results, including research data, of INFN authors. With the approval of the “Disciplinare per l’accesso aperto ai prodotti della ricerca dell’INFN” document in July 2023, OAR officially became INFN's institutional repository. In the past two years we studied how to optimize our institutional repository through the bulk upload of both digital and scanned documents, such as INFN Technical Notes and documents related to the ADONE project (1969-1993). Moreover, since 2023, a collaboration with INFN-CNAF has been established to migrate the current instance of the repository (v3) to an updated InvenioRDM (CERN) release (v11.0, the latest stable release at the time of writing). To better support the OAR migration, recent activities have mainly focused on the record upload process, addressing the following topics: authentication, metadata customization, disambiguation of author and entity names, and management of the product approval flow. In addition to studying the repository structure, we worked on communication as well, introducing the tool to users through a dedicated website and through user training activities on the use of OAR.

    Experimenting with Machine Learning and Deep Learning techniques for predicting zombie jobs in HTC systems

    CNAF (Centro Nazionale delle Tecnologie Informatiche e Telematiche) of INFN (Istituto Nazionale di Fisica Nucleare) runs one of the most important computing centres in Italy, used by research groups in particle physics, astrophysics and other fields. The centre has more than 46000 cores distributed over 960 physical hosts. Jobs are queued and scheduled by the batch system (HTCondor) using "fairshare" algorithms. During execution, a number of quantities are monitored; they are sampled every three minutes and collected in a database together with the accounting data of finished jobs. This study explores the use of Machine Learning and Deep Learning techniques to predict the success or failure of jobs, based on the evolution of their state over time. In particular, a subset of failing jobs, called zombies, was identified. Although they stop computing, these jobs do not release the physical host, unproductively occupying resources until their timeout. The goal of the thesis was to detect these jobs as early as possible, since identifying them in their initial phases is particularly advantageous in terms of the resources saved by removing them. Two models were proposed and validated that can identify the jobs that are likely to become zombies (1 in 2). The predictions provided by the model can be used to set up a filter or an alert, making it possible to check suspicious jobs manually or to establish a rule for their removal.

    INFN-CNAF

    "Flash talk challenge" meeting at bi-rex, with a round table moderated by Francesca Masini, Delegate for open science and research data at the University of Bologna; each member of the jury will have a couple of minutes to introduce themselves and their affiliation, followed by 10 minutes of presentation on the topic of Open Science. This will be followed by a round of agreed questions and questions from the audience.

    JOB PACKING: OPTIMIZED CONFIGURATION FOR JOB SCHEDULING

    The default behaviour of a batch system is to dispatch jobs to the nodes with the lowest value of some load index. Whilst this causes jobs to be equally distributed among all the nodes in the farm, there are cases when different types of behaviour may be desirable, such as completely filling a node before dispatching jobs to another one, or dispatching similar jobs to nodes already running jobs of the same kind. This work defines the packing concept, different packing policies and useful metrics to evaluate how good a policy is. A simple farm simulator has been written to evaluate the expected impact of different packing policies on a farm. The simulator is run against a sequence of real jobs, whose parameters have been taken from the accounting database of INFN-Tier1. The effectiveness of two packing policies of interest, namely relaxed and exclusive, is compared. The exclusive policy proves to be better, at the cost of unused cores in the farm, whose number is estimated. The possibility of implementing the exclusive policy on a specific batch system, LSF 7.06, is explored. Relevant configurations are shown and an overall description of the mechanism is presented.
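
The difference between load-balancing dispatch and packing can be illustrated with a toy simulator. This is only a sketch under stated assumptions: the abstract does not define the relaxed and exclusive policies, so the two policies below (`"spread"` and `"pack"`), the `Node` class and the `busy_nodes` metric are hypothetical stand-ins, not the authors' simulator or metrics.

```python
from dataclasses import dataclass


@dataclass
class Node:
    cores: int
    used: int = 0

    @property
    def free(self):
        return self.cores - self.used


def dispatch(nodes, job_cores, policy):
    """Place one job on a node; return the node index, or None if it fits nowhere."""
    candidates = [i for i, n in enumerate(nodes) if n.free >= job_cores]
    if not candidates:
        return None
    if policy == "spread":
        # Default batch-system behaviour: pick the least loaded node.
        chosen = min(candidates, key=lambda i: nodes[i].used)
    elif policy == "pack":
        # Packing: pick the most loaded node that still has room.
        chosen = max(candidates, key=lambda i: nodes[i].used)
    else:
        raise ValueError(f"unknown policy: {policy}")
    nodes[chosen].used += job_cores
    return chosen


def simulate(job_sizes, n_nodes, cores_per_node, policy):
    """Dispatch a sequence of jobs (given as core counts) onto an empty farm."""
    nodes = [Node(cores_per_node) for _ in range(n_nodes)]
    for size in job_sizes:
        dispatch(nodes, size, policy)
    return nodes


def busy_nodes(nodes):
    """Toy metric: number of nodes running at least one job."""
    return sum(1 for n in nodes if n.used > 0)
```

With eight single-core jobs on four 8-core nodes, `"spread"` leaves two cores used on every node, while `"pack"` fills one node and leaves the other three idle. This toy model ignores job durations and arrival times, which a simulator driven by real accounting records would need.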

    ACCOUNTING DATA RECOVERY. A CASE REPORT FROM INFN-T1

    Starting from summer 2013, the amount of computational activity of the INFN-T1 centre reported by accounting.egi.eu, the official accounting web portal of the EGI community, was found to be much lower than the real value. A deep investigation of the accounting system pointed out a number of subtle concurrent causes, whose effects dated back to May and were responsible for a loss of collected data records over a period of about 130 days. The ordinary recovery method would have required about one hundred days, so a different solution had to be designed and implemented. Applying it to the involved set of raw log files (records, for an average production rate of jobs/day) required less than 4 hours to reconstruct the Grid accounting records. The propagation of these records through the usual dataflow up to the EGI portal was then a matter of a few days. This report describes the work done at INFN-T1 to achieve this result. The procedure was then adopted to solve a similar problem affecting another site, INFN-PISA. This solution suggests a possible alternative accounting model and provided us with a deep insight into the most subtle aspects of this delicate subject.

    Enabling HPC systems for HEP: the INFN-CINECA Experience

    Slides from related work DOI: 10.22323/1.378.0003

    Going Beyond Counting First Authors in Author Co-citation Analysis

    The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based. We compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first of a cited work's authors, and a non-traditional type, which takes into account the first 5 authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
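
The two counting schemes can be made concrete with a short Python sketch. It assumes a hypothetical input format in which each citing paper is a list of cited works and each cited work is an ordered list of author names. The function name `cocitation_counts` and the choice to count co-authors of the same cited work as co-cited are assumptions made for illustration, not details taken from the study.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count author co-citations over a collection of citing papers.

    max_authors=1 gives first-author counting; max_authors=5 considers the
    first five authors of every cited work. Each author pair is counted at
    most once per citing paper.
    """
    counts = Counter()
    for references in citing_papers:
        cited_authors = set()
        for author_list in references:
            # Keep only the leading authors of each cited work.
            cited_authors.update(author_list[:max_authors])
        # Every unordered pair of authors cited by this paper is one co-citation.
        for pair in combinations(sorted(cited_authors), 2):
            counts[pair] += 1
    return counts
```

For two citing papers with reference lists `[["Smith", "Jones"], ["Lee"]]` and `[["Smith", "Jones"], ["Jones", "Lee"]]`, first-author counting yields only the pairs (Lee, Smith) and (Jones, Smith), each counted once. Counting the first five authors links all three authors, with each pair counted twice.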

    Sanctorum: a solution for safe management of digital certificates and passwords in a farm

    Sanctorum is a Python tool created to help site managers safely manage digital certificates and passwords for hosts in a farm. Motivation for the tool: when releasing a new host certificate, the Italian Certification Authority recommends that the site keep two backup copies of the private key in a safe place that is not reachable over the network. This makes it quite difficult both to respect the CA rule and to manage the site efficiently, especially for large sites. The proposed solution makes it possible to manage digital certificates in a comfortable and error-free manner while still respecting CA recommendations.

    Impact: Maintaining host certificates for a large number of hosts is a time-consuming activity. If not strictly planned and performed, it may allow sporadic errors, which in turn may lead to possibly severe unscheduled downtime of a service. Common practice at site level is to "hide" all private keys in some network-reachable local filesystem, together with some scripts used to perform common management tasks. Should an attacker gain access to that hidden place, all host identities should be considered compromised, and all certificates should be revoked and requested again from scratch from the CA. This is why keeping keys unreachable from the network makes sense. Sanctorum offers a complete set of tools for certificate and password management (check, deploy, get, update etc.) and keeps private keys out of the operator's hands: a key can only be copied to its host or matched against its certificate. Any operation is strictly interactive. This avoids the need to build shell loops and to gain or save knowledge on many hosts at once. Apart from the setup of new hosts, renewal operations are almost fully automated (the only human intervention is to sign a request mail and bounce it to the local RA).

    Conclusions and Future Work: The Sanctorum tool has been designed and developed to allow for the easy management of digital certificates and host passwords in large-scale computing grid farms. It complies with the security policies and CPS of IGTF-approved CAs. It is currently adopted by the Italian Tier-1 facility and by several Tier-2 centres of the IGI infrastructure.

    Detailed analysis: The reliability of common authentication methods depends on how confidential a piece of sensitive information can be kept. A host's private key remains trustworthy only as long as it is inaccessible to all but its owner. Sanctorum aims to provide a comfortable way to manage host passwords, certificates and private keys while ensuring an adequate security level and reducing the risk of human error. The basic idea is to keep key/cert pairs in an encrypted database on RAID 1 disks in a machine with a disabled network. The machine should only be accessed through the console or Avocent-like solutions. When a certificate or other data has to be transferred to or from a host, the network card is enabled and a firewall rule is added to let the transfer happen; then the network card is deactivated and the firewall rule removed. Apart from adding new hosts to the database, there is no frequent need to log on to the Sanctorum host: one tool is provided to create new requests for expiring certificates and mail them to an operator; a second one downloads a newly generated certificate from the CA website, stores it in the database after checking that it matches its key, and then copies the cert/key pair to the target host.