1 INTRODUCTION
With the broad implementation of NGS technologies in the life sciences, genomics and transcriptomics sequencing data are being generated at an unprecedented rate (Breese & Liu, 2013; Jung et al., 2019; Jung et al., 2020). Rapid progress in NGS technologies has delivered massive volumes of high-throughput sequencing data to support research questions across many fields, enabling a new era of genomic research (Jung et al., 2019; Jung et al., 2020). At the same time, this advancement has brought enormous challenges in data analysis, where efficient, standardized and consistent analysis is fundamental to maintaining reproducibility, especially for biologists (Breese & Liu, 2013; Jung et al., 2019; Jung et al., 2020). However, many of the available tools for NGS data analysis require advanced computational experience (e.g. various programming/scripting languages) and expensive infrastructure (adequate HPC facilities and Cloud computing), and lack GUIs, making them inaccessible to many researchers and cumbersome even for experienced biologists. Thus, the development of user-friendly standalone software for NGS data will accelerate the pace of research for scientists who have limited computer and bioinformatics experience.
NGS data processing often involves consecutive steps of trimming (including quality checks), assembling, mapping, manipulating, converting and processing large files. The FASTA (Pearson & Lipman, 1988) and FASTQ (Cock et al., 2010) file formats are generated by most NGS platforms, and further formats such as SAM/BAM (Li et al., 2009), BED (Kent et al., 2002), GFF/GTF (Pertea & Pertea, 2020) and VCF (Danecek et al., 2011) can be derived from FASTA and FASTQ files depending on the required analysis. The FASTA file, based on simple text, is the most basic format for reporting a sequence and is accepted by almost all sequence analysis programs. Each record starts with a header line beginning with “>” followed by the sequence name and a description of the sequence; the sequence itself (nucleic acids or amino acids) follows on the subsequent lines. The FASTQ file, a text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores, is the most widely used format in sequence analysis and NGS sequencers. Each record requires at least four lines: a header line starting with “@” followed by the sequence identifier, the sequence, a separator line starting with “+” (optionally followed by the same identifier), and the quality scores. Conveniently, FASTQ files can also be converted to FASTA files, the most commonly used file format for NGS data that enables direct sequencing of target genes. Not surprisingly, many available tools, including easySEARCH (Kim et al., 2012), BlasterJS (Blanco-Míguez et al., 2018), BlastGUI (Du et al., 2020), Sequenceserver (Priyam et al., 2019), orfipy (Singh & Wurtele, 2021), Samtools and BCFtools (Danecek et al., 2021) and easyfm, have focused on manipulating the FASTA file format (analysing, collecting, organising, interpreting and presenting data in meaningful ways) to generate biologically relevant insights.
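To make the two record layouts concrete, the following is a minimal Python sketch of a FASTQ-to-FASTA conversion. It is an illustration only and not the easyfm implementation: the function names `read_fastq` and `fastq_to_fasta` are hypothetical, and the sketch assumes the common case in which every FASTQ record occupies exactly four lines (header, sequence, separator, quality).

```python
import io
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) tuples.

    Assumption: each record is exactly four lines (no wrapped sequences).
    """
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, None)
        plus = next(it, None)
        qual = next(it, None)
        if qual is None:
            raise ValueError("truncated FASTQ record: " + header)
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield header[1:], seq, qual  # drop the leading "@"


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines: a '>' header line, then the sequence line."""
    for header, seq, _qual in read_fastq(lines):
        yield ">" + header
        yield seq


# Two hypothetical reads, used only for demonstration.
example = "@read1 sample\nACGT\n+\nIIII\n@read2\nGGCA\n+read2\n!!##\n"
print("\n".join(fastq_to_fasta(io.StringIO(example))))
# >read1 sample
# ACGT
# >read2
# GGCA
```

The conversion discards the quality line and replaces the leading “@” with “>”, which is why it is lossy in one direction: a FASTA file cannot be converted back to FASTQ without inventing quality scores.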
Over the last decade, many HPC and Cloud-based NGS command-line programs or web-based platforms have wrapped popular high-level analysis and visualisation tools in an intuitive and appealing interface (Baker et al., 2020). Galaxy (homepage: https://galaxyproject.org, main public server: https://usegalaxy.org, Australia: https://usegalaxy.org.au/) in particular has been successful in establishing itself as an analytics hub and an e-learning platform for scientists worldwide, with the aim of producing accessible, reproducible and collaborative biological analyses (Afgan et al., 2018; Serano-Solano et al., 2021). Despite the huge achievements made in many analytical software packages and pipelines, further improvements in user-friendly standalone software are still required to help novice users rapidly discover meaningful sequences in very large data sets. To augment the functionality of existing tools and make NGS file manipulation more user-friendly and convenient, easyfm enables end-to-end file filtering, extracting and converting (FASTQ to FASTA) with a simple mouse click on desktops.
easyfm, implemented in Python 3.7+, was developed with four work modules (Basic Local Alignment Search Tool [BLAST], BLAST-Like Alignment Tool [BLAT], Open Reading Frames [ORF], and File Manipulation) and a secondary window (Project Folder, Help and Log). Together, these modules and the secondary window cover different aspects of NGS data analysis (mainly focusing on FASTA files), including post-processing, filtering, format conversion, and generating results. The functionality of each module is described in the Results and Discussion section to allow an easy-to-follow parallel comparison. easyfm is a GUI-based, lightweight but powerful, free and open-source desktop application for querying/manipulating NGS data sources and generating various outcomes. Because it can be used anywhere to analyse data and find target sequences easily, without any coding, HPC or internet/web-server connection, we hope that easyfm will prove useful in a wide range of bioinformatics applications in the life sciences, including as teaching/learning material in the classroom.