2  Sondr

2.1 Overview

This R package provides a comprehensive suite of utilities for data analysis, focusing on survey data processing, statistical analysis, and data visualization. The package is organized into several key modules:

  • Analysis (analysis.R)
  • Data Cleaning (cleaning.R)
  • Data Merging (merging.R)
  • Weighting (weighting.R)
  • Useful Data (useful_data.R)

2.2 Installation

# Install from GitHub (requires devtools)
devtools::install_github("your-username/package-name")

2.3 Core Functionalities

2.3.1 Analysis Module

2.3.1.1 glimpse_with_table()

A powerful data frame inspection function that extends dplyr::glimpse() functionality.

Usage:

glimpse_with_table(df, n_values = 5)

Arguments: - df: Data frame to be summarized - n_values: Number of unique values to display (default: 5)

Features: - Displays row and column counts - Shows column types - Highlights missing data with color-coded percentages - Presents frequency tables for categorical variables

Example:

df <- data.frame(
  x = sample(c("A", "B", "C"), size = 100, replace = TRUE),
  y = sample(c("D", "E", "F"), size = 100, replace = TRUE)
)
glimpse_with_table(df)

2.3.1.2 topdown_fa()

Performs top-down factor analysis with visualization.

Usage:

topdown_fa(df, nfactors = 1)

Arguments: - df: Data frame for factor analysis - nfactors: Number of factors to extract (default: 1)

Returns: - ggplot visualization - List containing: - Cronbach’s alpha - First eigenvalue - Factor loadings - Factor analysis plot

2.3.1.3 qualtrics_na_counter()

Analyzes missing data patterns in Qualtrics survey data.

Usage:

qualtrics_na_counter(data)

Features: - Handles text columns appropriately - Groups variables by common prefixes - Generates visualizations of missing data patterns - Returns detailed missing data statistics

2.3.2 Cleaning Module

2.3.2.1 parse_money_range()

Parses string monetary ranges into numeric vectors.

Usage:

parse_money_range(value, sep = NULL, limit = NULL, ceiling_increment = 10000)

Arguments: - value: String containing monetary range - sep: Separator between range values - limit: How to handle unbounded ranges (“floor”/“ceiling”) - ceiling_increment: Value to add for ceiling limits

Examples:

parse_money_range("$1,000 to $1,999", sep = "to")
parse_money_range("under $1,000", limit = "floor")
parse_money_range("$15,000 or over", limit = "ceiling")

2.3.2.2 clean_likert_numeric_vector()

Normalizes Likert scale responses.

Usage:

clean_likert_numeric_vector(raw_vector)

Features: - Scales data to 0-1 range - Preserves original response spacing - Handles missing values appropriately

2.3.2.3 finverser()

Inverts the order of unique values while preserving spacing.

Usage:

finverser(vec_col)

2.3.2.4 min_max_normalization()

Performs min-max normalization on numeric vectors.

Usage:

min_max_normalization(x)

2.3.3 Merging Module

2.3.3.1 read_any_csv()

Robust CSV file reader with multiple fallback methods.

Usage:

read_any_csv(file_path, ...)

Features: - Multiple reading attempts with different methods - Handles problematic headers - Supports various delimiters - Error handling and informative messages

2.3.3.2 read_survey()

Universal survey data reader supporting multiple formats.

Usage:

read_survey(file)

Supported formats: - CSV - XLSX - SPSS (.sav) - RDS

2.3.3.3 Utility Functions

  • load_variable(): Extracts specific variables from files
  • generate_survey_ids(): Creates unique survey identifiers
  • match_and_update(): Updates vectors based on matching names
  • extract_elements_with_prefix(): Extracts elements with matching prefixes

2.3.4 Weighting Module

2.3.4.1 stratification_table()

Creates weighted stratified tables for demographic analysis.

Usage:

stratification_table(data, strata_level = "ses_state", strata_vars, weight_var = "weight")

2.4 Included Datasets

2.4.1 US Counties Data

Contains county-level information including: - FIPS codes - Population density - Electoral projections

2.4.2 US Counties Map Data

Geographic boundary data for US counties: - Longitude/latitude coordinates - FIPS codes - Boundary definitions

2.5 Best Practices and Tips

  1. Data Loading:
    • Use read_survey() as the primary data loading function
    • For problematic CSVs, try read_any_csv() with different parameters
  2. Survey Data Processing:
    • Clean Likert scales with clean_likert_numeric_vector()
    • Handle missing data using qualtrics_na_counter()
    • Use parse_money_range() for income/monetary data
  3. Analysis Workflow:
    • Start with glimpse_with_table() for data exploration
    • Use topdown_fa() for factor analysis
    • Apply appropriate normalization functions
  4. Performance Considerations:
    • Large datasets (>1M rows) may require chunked processing
    • Consider using data.table for very large datasets
    • Use appropriate encoding settings when reading files

2.6 Potential Improvements

  1. Function Enhancements:
    • Add parallel processing support for large datasets
    • Implement more robust error handling
    • Add support for more file formats
    • Enhance visualization options
  2. Documentation:
    • Add more examples
    • Create vignettes for common use cases
    • Improve function cross-referencing
  3. Testing:
    • Increase test coverage
    • Add stress tests for large datasets
    • Include more edge cases
  4. New Features:
    • Add support for longitudinal data analysis
    • Implement more statistical methods
    • Add export functions for various formats
    • Create interactive visualizations

2.7 Dependencies

  • dplyr
  • ggplot2
  • readr
  • haven
  • readxl
  • psych
  • stringr
  • tidyr

2.8 Contributing

Contributions are welcome! Please consider:

  1. Following the existing code style
  2. Adding tests for new features
  3. Updating documentation
  4. Creating detailed pull requests

2.9 License

[Add your license information here]

2.10 Contact

[Add your contact information here]

2.11 Citation

[Add citation information here]