Research Data Toolkit

This guide is intended to help researchers, data creators, and others who manage digital data as part of a research project to plan, organize, describe, share, and preserve their research data for the long term.

What Are Data?

Research data refers to any factual information that is collected, observed, or generated through systematic investigation and analysis for the purpose of producing original research findings. This data can take various forms, including numerical data, textual data, images, audio recordings, or any other format relevant to the research inquiry. Research data serves as the foundation for scientific inquiry, enabling researchers to test hypotheses, validate theories, and draw conclusions. It may be gathered through primary data collection methods (such as surveys, experiments, or observations) or obtained from secondary sources (such as existing datasets, literature reviews, or archival records). Proper management, documentation, and sharing of research data are essential for transparency, reproducibility, and the advancement of knowledge within the scientific community.

Primary vs. Secondary Data

Primary data refers to information collected directly from its original source through methods such as surveys, interviews, experiments, or observations. This data is newly gathered for a specific research purpose and has not been previously collected or analyzed by others. It offers researchers the advantage of obtaining data tailored to their specific research questions but may require time and resources to collect.

Primary data can take many forms. The examples below are all things you create as the researcher:

  • Text documents and spreadsheets
  • Lab notebooks, field notebooks, diaries
  • Questionnaires, transcripts, codebooks
  • Audiotapes, videotapes
  • Photographs
  • Films and other moving images
  • Protein or genetic sequences
  • Survey responses
  • Slides, artifacts, specimens, samples
  • Digital objects
  • Database contents (comprising video, audio, text, images)
  • Models, algorithms, scripts
  • Contents of applications (encompassing input, output, logfiles for analysis software, simulation software, schemas)
  • Methodologies and workflows

Secondary data, on the other hand, refers to information that has already been collected, processed, and analyzed by others for other purposes but may still be useful for your research project, depending on the topic. This data can include sources such as government statistics, academic research papers, industry reports, or data shared through public repositories.

Below are links to websites where you can access secondary data for your project.

Social Sciences, Education, and Humanities:

  • ICPSR (Inter-university Consortium for Political and Social Research): Offers a vast collection of social science datasets covering topics such as sociology, political science, economics, and more.
  • Pew Research Center: Provides freely usable datasets on social and demographic trends, including public opinion polls, US politics, journalism, internet and social media usage, science, and religion, beneficial for social science and humanities research.
  • UNICEF Data: Offers datasets related to child well-being, education, health, and demographics, valuable for fields like sociology and social work.
  • National Archive of Criminal Justice Data (NACJD): Collects, analyzes, and publishes data on crime, criminal offenders, victims of crime, and the operation of justice systems.
  • Data is Plural: Weekly newsletter of useful/curious datasets, covering various topics including history, literature, and cultural studies.
  • BuzzFeed: Open-source data, analysis, libraries, tools, and guides from BuzzFeed's newsroom, potentially useful for media studies and journalism research.
  • Digital Public Library of America (DPLA): Provides access to digital collections from libraries, archives, and museums across the United States, covering various humanities disciplines.
  • Data.gov - Education: Provides access to datasets on various education-related topics from U.S. federal agencies, including school performance, demographics, and funding.

Natural, Environmental, and Biological Sciences:

  • Earth Data: Provides full and open access to NASA’s Earth science data, valuable for environmental science and geography research.
  • NASA Planetary Data Systems: Long-term archive of digital data products from NASA's planetary missions, beneficial for planetary science and astronomy research.
  • National Centers for Environmental Information (NCEI): Offers environmental datasets including weather and climate data, oceanic data, and geological information.
  • FAO Datasets: Offers agricultural and food-related datasets, including livestock production, fisheries, and land use statistics from the Food and Agriculture Organization of the United Nations.
  • Animal Genome Size Database: Provides genome size data for various animal species, useful for genetics and evolutionary research.
  • NCBI Datasets: Provides access to biological datasets, genomic data, and biomedical literature curated by the National Center for Biotechnology Information.
  • Global Biodiversity Information Facility (GBIF): Offers access to biodiversity data, species occurrence records, and environmental information from around the world.

Computer Science, Engineering, and Information Technology:

  • Kaggle: Site where users can explore, analyze, and share quality datasets, beneficial for machine learning, data analysis, and computer science projects.
  • OpenML: An open platform for sharing datasets, algorithms, and experiments, suitable for machine learning research and data science projects.
  • GitHub: Offers access to open-source datasets, libraries, and code repositories useful for research in computer science and data science.
  • IEEE DataPort: Provides datasets and data repositories curated by the Institute of Electrical and Electronics Engineers (IEEE) for research and development in engineering and technology.
  • NIST Mechanical Engineering Data: Provides datasets and resources related to mechanical engineering research and standards from the National Institute of Standards and Technology.

Business and Economics:

  • Data.gov: US government-run site providing data, tools, and resources for research, applications, visualizations, etc., useful for economic research and policy analysis.
  • Nasdaq Data Link: Premier source for financial, economic, and alternative datasets, valuable for financial analysis and economics research.
  • World Bank Data: Offers economic, financial, and social datasets from around the world, beneficial for economic research and business analytics.
  • Federal Reserve Economic Data (FRED): Provides access to economic data, including GDP, inflation rates, employment statistics, and more.
  • Statista: Offers statistics, market research, and business intelligence data on various industries and markets.

Health and Human Sciences:

  • World Health Organization (WHO): The WHO provides comprehensive health data and statistics on global health issues, including disease outbreaks, healthcare access, and health systems performance.
  • Centers for Disease Control and Prevention (CDC): The CDC offers a wide range of health-related data and statistics, covering topics such as infectious diseases, chronic conditions, and public health surveillance.
  • National Institutes of Health (NIH): The NIH hosts various data sharing repositories containing research data from NIH-funded projects across different health disciplines, facilitating data access and reuse.
  • HealthData.gov: Central repository for health-related datasets from various U.S. government agencies, providing access to public health data, research findings, and health-related applications.
  • National Center for Health Statistics (NCHS): The NCHS is the principal U.S. government agency responsible for collecting, analyzing, and disseminating health statistics, providing vital health data for research and policy development.

Research Data Lifecycle

Understanding the Research Data Lifecycle:

The research data lifecycle delineates the stages encompassing the collection, recording, processing, publication, sharing, and preservation of research data.

Using the research data lifecycle as a guide to data management supports careful planning and helps ensure that every stage of the research data journey is covered, including creation, processing, analysis, preservation, sharing, and potential reuse. Each phase is described below:

  • Planning Phase:

    • Define research objectives and questions.
    • Determine data needs and sources.
    • Develop data collection methods.
  • Data Collection:

    • Gather primary or secondary data.
    • Ensure data quality and integrity.
    • Organize and store collected data securely.
  • Data Processing and Analysis:

    • Clean and preprocess raw data.
    • Analyze data using statistical or computational methods.
    • Interpret results and derive insights.
  • Data Sharing and Preservation:

    • Prepare data for sharing (anonymization, metadata creation).
    • Publish data through repositories or platforms.
    • Preserve data for long-term access and reuse.
  • Publication and Dissemination:

    • Write research findings and conclusions.
    • Submit papers to journals or conferences.
    • Present results at seminars or conferences.
    • Share findings through various channels (websites, social media).
  • Reuse and Repurposing:

    • Promote data reuse by other researchers.
    • Collaborate on further analyses or extensions.
    • Incorporate findings into policymaking or industry practices.
  • Evaluation and Feedback:

    • Reflect on the research process and outcomes.
    • Seek peer review and feedback.
    • Iterate and refine methodologies based on feedback received.
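
The "prepare data for sharing" step above (anonymization, metadata creation) can be illustrated with a minimal Python sketch. It assumes a small in-memory table of survey records; the field names, the secret key, and the helper functions pseudonymize and describe are hypothetical and do not come from any particular tool. Replacing identifiers with a keyed hash is pseudonymization rather than full anonymization, so the remaining fields may still need review before release.

```python
import hashlib
import hmac


def pseudonymize(rows, id_field, secret_key, drop_fields=()):
    """Return copies of rows with id_field replaced by a keyed hash
    and any fields named in drop_fields removed."""
    cleaned = []
    for row in rows:
        # A keyed hash (HMAC) gives the same pseudonym for the same ID,
        # but cannot be recomputed without the project's secret key.
        digest = hmac.new(
            secret_key, str(row[id_field]).encode("utf-8"), hashlib.sha256
        ).hexdigest()
        new_row = {
            key: value
            for key, value in row.items()
            if key not in drop_fields and key != id_field
        }
        new_row[id_field] = digest[:12]  # shortened for readability
        cleaned.append(new_row)
    return cleaned


def describe(rows, title, creator):
    """Build a minimal descriptive metadata record for the dataset."""
    fields = sorted({key for row in rows for key in row})
    return {
        "title": title,
        "creator": creator,
        "record_count": len(rows),
        "fields": fields,
    }


# Hypothetical example data, invented for illustration only.
survey = [
    {"participant_id": "P001", "email": "a@example.org", "age": 34, "score": 7},
    {"participant_id": "P002", "email": "b@example.org", "age": 51, "score": 4},
]

shared = pseudonymize(
    survey, "participant_id", b"project-secret", drop_fields=("email",)
)
metadata = describe(shared, "Example survey", "Example Lab")
```

In practice the secret key would be stored separately from the shared files, and the metadata record would follow whatever schema the chosen repository requires.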

Tips for Writing a Data Management Plan

The Research Data Management Plan (DMP) is regarded as a formal "living document" because it should be updated throughout the research process so that it remains relevant and useful. The DMP should be accessible to the entire research team both during and after the project, serving as a convenient reference guide. Keeping it up to date matters because the plan typically remains in use beyond the lifespan of the research project itself.

Data Management Plan (DMP) Guidelines

The requirements for the information to be included in your Data Management Plan (DMP) can vary by funding agency and institution. The required length and structure of your plan will also differ. The best way to start is to check your funder's specific expectations, which can be found on the agency's website. Following a comprehensive approach will help ensure your DMP is thorough, compliant, and useful throughout and beyond the duration of your research project.

Online Resources

DMPTool: This free, interactive online tool assists researchers in creating data management plans. It provides step-by-step guidance, funder-specific templates, and sample data management plans.

DMPonline: DMPonline helps you to create, review, and share data management plans that meet institutional and funder requirements. It is provided by the Digital Curation Centre (DCC).

Public DMPs: Public DMPs are sample plans created using the DMPTool and shared publicly by their authors. Note that these plans are not vetted for quality, completeness, or adherence to funder guidelines.

Key Questions to Consider Before Writing Your Data Management Plan

  1. Data Production and Collection:

    • What type of data will my project produce or collect? (Description, data size & formats)
  2. Data Organization and Description:

    • How will I organize and describe my data files? (File naming, versioning, metadata)
  3. Data Backup and Storage:

    • How will I back up or store my data? (Hardware or software needed, sufficient storage, data recovery)
  4. Access and Security Management:

    • How will I manage access and security? (Access control, data vulnerability, confidentiality level)
  5. Communication with Collaborators:

    • How will I communicate data with collaborators? (Email, cloud-based communication)
  6. Data Sharing Plans:

    • What are my plans for data sharing? (Institutional repository, open access journals, domain repository, website)
  7. Sensitive or Restricted Data:

    • Will I have data that are sensitive or need to be restricted? (Anonymizing data, endangered species geo-data, embargoes)
  8. Copyright and Intellectual Property Rights:

    • What are the copyright and intellectual property rights of the data? (Who owns the data, author's addendum completed)
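
Questions 2 and 3 above (file naming, versioning, and backup) can be made concrete with a short Python sketch. The naming pattern shown here (project, description, date, and version joined by underscores) is only one possible convention, and the functions build_filename, sha256_of, and verify_copy are hypothetical helpers written for this illustration. The checksum comparison is one common way to confirm that a backup copy matches the original.

```python
import hashlib
import re
from datetime import date


def build_filename(project, description, when, version, extension):
    """Compose a file name as project_description_YYYYMMDD_vNN.ext."""

    def slug(text):
        # Lowercase and replace runs of non-alphanumeric characters with hyphens.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    return (
        f"{slug(project)}_{slug(description)}_"
        f"{when.strftime('%Y%m%d')}_v{version:02d}.{extension}"
    )


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 checksum of some file content as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_copy(original: bytes, copy: bytes) -> bool:
    """Report whether a backup copy has the same checksum as the original."""
    return sha256_of(original) == sha256_of(copy)


# Hypothetical project and file details, invented for illustration only.
name = build_filename("Soil Study", "Field Measurements", date(2024, 3, 15), 2, "csv")
print(name)  # soil-study_field-measurements_20240315_v02.csv

content = b"site,ph\nA,6.8\n"
print(verify_copy(content, content))  # True
```

Recording the checksum alongside each file name in a simple manifest makes it possible to check stored copies again later, whichever storage option the plan settles on.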