
Research Data Toolkit

This guide is intended to help researchers, data creators, and others who manage digital data as part of a research project to plan, organize, describe, share, and preserve their research data for the long term.

Principles for Selecting File Formats

Non-Proprietary vs. Proprietary Formats

When saving your files, it is recommended that you select non-proprietary or open source software formats that are royalty-free and without intellectual property restrictions, or formats that conform to standards in the public domain. Open, non-proprietary formats are more likely to remain usable even if the software that created them is no longer available or functional.

Unencrypted Formats

It is also recommended that you use unencrypted formats because, unlike encrypted formats, they don't require passwords or passphrases. This ensures that if a password is lost or forgotten, you will still be able to retrieve the data from the file later.

Compressed Files

Compressing files can often result in partial or permanent data loss; however, using "lossless" formats prevents this. Lossless compression is best for situations where it is important to maintain the integrity of the original dataset and where changes to the original data would reduce data quality.
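As a minimal sketch of what "lossless" means in practice, the Python snippet below compresses some sample bytes with gzip from the standard library, decompresses them again, and compares SHA-256 checksums of the original and restored data. The sample data and the function names `sha256_hex` and `roundtrip_is_lossless` are illustrative assumptions, not part of this guide.

```python
import gzip
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 checksum of a byte string as a hex string."""
    return hashlib.sha256(data).hexdigest()


def roundtrip_is_lossless(original: bytes) -> bool:
    """Compress and decompress with gzip, then compare checksums."""
    compressed = gzip.compress(original)
    restored = gzip.decompress(compressed)
    return sha256_hex(original) == sha256_hex(restored)


# Hypothetical sample data standing in for a real dataset.
sample = b"site,temperature_c\nA,21.5\nB,19.8\n" * 100
print(roundtrip_is_lossless(sample))
```

Matching checksums indicate that every byte survived the round trip. A lossy format such as JPEG or MP3 would not pass an equivalent byte-for-byte check.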

Examples of Open File Formats

 Type

 Description

 Container / Archive

 File type used for compressing and storing a collection of files and folders to a single file.

 GZIP/TAR- Two of the most common utilities for archiving and compressing files.

 ZIP (7-Zip, WinZip, ZipRAR)- Good for archiving many files; supports lossless data compression.

 Database

 Consists of collections of data organized so it can be easily accessed and managed.

 XML- A general-purpose markup language, standardized by W3C.

 CSV- Comma-separated values, commonly used for spreadsheets or simple databases.

 Geospatial

 File type commonly used for encoding geographical information.

 SHP- Shapefile format for storing geometric location and associated attribute information.
 GeoTIFF- Allows geo-referencing information to be embedded within a TIFF file.
 KML- XML notation for expressing geographic annotation and visualization.

 Tabular Data / Spreadsheets

 File type for storing data elements arranged in tables.

 CSV- Comma-separated values, commonly used for spreadsheets or simple databases.

 Still Images

 File format for storing a single static image (e.g., photographs, graphs, scans, autoradiograms).

 JPEG (JPG)- Most widely used image file format; “lossy,” meaning quality can easily be compromised in editing and saving.

 JPEG 2000- Successor to JPEG that supports both lossy and lossless compression.

 PDF/A- Differs from PDF by prohibiting features unsuitable for long-term archiving.
 TIFF- Widely used by photographers and designers; can store images without lossy compression.

 Audio / Sound

 File format for storing recorded digital audio data (e.g., music, sound effects, speech).

 MP3- “Lossy” format with moderate audio quality; may not be suitable for high-fidelity audio.
 AIFF, WAV, FLAC- Lossless audio recording formats, best for maintaining audio quality.

 Moving Images

 File type used for saving motion pictures, film, movies, video, etc.

 AVI- Audio Video Interleave, a widely supported container format for video and audio.
 M-JPEG2000- Motion JPEG 2000 stores each video frame as a JPEG 2000 image; its container is related to the MP4/QuickTime format.

Text

 File type for data viewed and edited on text terminals or in simple text editors.

 Plain text (ASCII, UTF-8)- Most portable format; supported by most machines and applications.

 JSON- Good for structured data (e.g., numbers, dates, groups of words).

 XML- Good for semi-structured, non-tabular data stored as plain text (e.g., formats used for nucleotide/protein sequences, alignments and phylogenies).

 Note: We recommend that a README be a plain text file; however, if text formatting is important, PDF is also acceptable.

 More information on file formats:

Sustainability of Digital Formats (Library of Congress)

Examples of open formats (Wikipedia)
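To illustrate the open formats listed above, here is a minimal Python sketch that saves a small table as a UTF-8 CSV file and writes a plain-text README describing it. The file names, column names, and the helper `save_open_formats` are hypothetical and chosen only for illustration.

```python
import csv
import tempfile
from pathlib import Path


def save_open_formats(rows, fieldnames, out_dir):
    """Write rows to a UTF-8 CSV file plus a plain-text README."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    csv_path = out_dir / "measurements.csv"  # hypothetical file name
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    # The README is plain text so it stays readable without special software.
    readme_path = out_dir / "README.txt"
    readme_path.write_text(
        "measurements.csv: one row per sample; columns: "
        + ", ".join(fieldnames) + "\n",
        encoding="utf-8",
    )
    return csv_path, readme_path


# Hypothetical sample rows standing in for real research data.
rows = [
    {"sample_id": "S1", "mass_g": 12.4},
    {"sample_id": "S2", "mass_g": 9.7},
]
with tempfile.TemporaryDirectory() as tmp:
    csv_path, readme_path = save_open_formats(rows, ["sample_id", "mass_g"], tmp)
    print(csv_path.read_text(encoding="utf-8"))
```

The demonstration writes to a temporary folder; in real use, `out_dir` would point at the project's data folder.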

Tips for Backing Up Files

Having duplicate copies of data files keeps them safe in case anything goes wrong with your local workstation. Original data files can be lost through hardware and software failures, virus infection, malicious hacking, power failure, and human error. Developing a backup strategy ensures that data files can be restored and remain accessible for the long term should the originals be damaged or go missing.

Backup storage tips:

  • Make at least 3 copies: 1 "local" copy (e.g., saved on an NCA&T workstation or laptop, or on an external storage device), 1 "remote" copy using a cloud backup and data protection service (e.g., Carbonite, Backblaze), and 1 "offsite" copy (e.g., a tape copy stored in a safety deposit box at a bank).
  • Here at NCA&T, researchers have the option of backing up their data files using OneDrive for Business, which stores your files in Microsoft's secure cloud. It can be accessed through the Microsoft 365 download offered by the university. Click here to learn more about configuring OneDrive on your computer.
  • Consider an online backup solution that continuously scans your computer for updates.
  • Use several different types of storage media, including external storage devices and networked storage.
  • Never use a thumb drive or flash drive as a permanent storage device; most reports estimate the lifespan of a USB flash drive at 3-5 years.
  • Develop a regular backup routine for your data and synchronize among your backup copies. 
  • Create digital surrogates of print materials. 

For more information on backup storage options offered to researchers at NCA&T, contact Information Technology Services.
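The backup tips above can be sketched in Python as a small routine that copies a file to several destinations and confirms that each copy matches the original by comparing SHA-256 checksums. This is only an illustration: the helper names `file_sha256` and `backup_and_verify` are hypothetical, and the temporary folders stand in for real local, remote, and offsite storage.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large files need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source, destinations):
    """Copy source into each destination folder and verify each copy."""
    source = Path(source)
    expected = file_sha256(source)
    results = {}
    for dest_dir in destinations:
        dest_dir = Path(dest_dir)
        dest_dir.mkdir(parents=True, exist_ok=True)
        copy_path = Path(shutil.copy2(source, dest_dir / source.name))
        # True only if the copy is byte-for-byte identical to the source.
        results[str(copy_path)] = file_sha256(copy_path) == expected
    return results


# Temporary folders stand in for local, remote, and offsite storage.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    original = root / "data.csv"
    original.write_text("id,value\n1,3.14\n", encoding="utf-8")
    status = backup_and_verify(
        original, [root / "local", root / "remote", root / "offsite"]
    )
    print(all(status.values()))
```

Re-running the checksum comparison as part of a regular backup routine is one way to notice a damaged or out-of-date copy before it is needed.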

How and Where to Store Data Securely? Pros and Cons

Data Storage & Security 

Data storage refers to where and how you keep your data, including selecting appropriate media for the physical storage of data. Data security, on the other hand, refers to keeping your data safe, protecting it from malicious activity, and preventing breaches of sensitive data.

It is important that you carefully consider storage options for your data as well as how you will control access. It is recommended that you save your data on several different media or devices, ensure that those devices are password-protected, restrict access to only those people who need it, and anonymize identifiable human subject information.
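One way to begin removing identifiable information from a record is sketched below in Python: direct identifiers such as names and email addresses are dropped, and the participant ID is replaced with a keyed hash (HMAC-SHA-256). The field names, the placeholder key, and the helpers `pseudonymize` and `strip_identifiers` are assumptions made for illustration. Strictly speaking this is pseudonymization rather than full anonymization, since indirect identifiers may remain in the data.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash, shortened for readability."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def strip_identifiers(record, id_field, drop_fields, secret_key):
    """Drop direct identifiers and pseudonymize the ID field."""
    cleaned = {k: v for k, v in record.items() if k not in drop_fields}
    cleaned[id_field] = pseudonymize(str(record[id_field]), secret_key)
    return cleaned


# Placeholder key only; a real key would be kept secret and stored separately.
key = b"replace-with-a-secret-key-stored-separately"

# Hypothetical record with made-up values.
record = {
    "participant_id": "P-0042",
    "name": "Jane Doe",
    "email": "jane@example.org",
    "score": 87,
}
safe = strip_identifiers(record, "participant_id", ["name", "email"], key)
print(sorted(safe))
```

Because the same key always produces the same pseudonym, records for one participant can still be linked across files without exposing the original ID.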

Here at NCA&T, Information Security Services works with individuals across the campus to ensure the security of technology and data, and it manages the campus cybersecurity awareness program.

The pros and cons of the recommended storage options for your data are listed below:

Physical Hardware 

  • Pros- Physical hardware such as laptops is affordable, convenient, and portable, and can be password protected.
  • Cons- Because of their portability, laptops are at risk of being lost, stolen, or damaged, so a secure backup of your laptop data is recommended.

External Storage

  • Pros- External storage devices such as flash drives and external hard drives are affordable, easily portable, and easy to use, and they allow files to be restored quickly.
  • Cons- These devices are easily lost or damaged and have a short lifespan; they should not be used for master copies of data.

Institution-Provided Network Storage

  • Pros- Easily accessible through your local workstation; networks are monitored by the university; the best option for securely collaborating with others at NCA&T.
  • Cons- You cannot access your data without a university network connection; more portable options should be considered alongside this option.

Cloud Storage

  • Pros- This option is affordable, and all you need to access your data is an Internet connection.
  • Cons- Inexpensive only for small amounts of storage. Because of threats from online hackers and privacy issues, consider other options for highly sensitive data.

 Note: If you suspect any unauthorized access to or acquisition of your research data, contact NCA&T's IT Security and Audit Department or follow the university's Data Security Breach Procedures.

The videos below, created by IBM Security, explain the importance of data security and privacy.