2 Sondr
2.1 Overview
This R package provides a comprehensive suite of utilities for data analysis, focusing on survey data processing, statistical analysis, and data visualization. The package is organized into several key modules:
- Analysis (analysis.R)
- Data Cleaning (cleaning.R)
- Data Merging (merging.R)
- Weighting (weighting.R)
- Useful Data (useful_data.R)
2.2 Installation
# Install from GitHub (requires devtools)
devtools::install_github("your-username/package-name")
2.3 Core Functionalities
2.3.1 Analysis Module
2.3.1.1 glimpse_with_table()
A powerful data frame inspection function that extends dplyr::glimpse()
functionality.
Usage:
glimpse_with_table(df, n_values = 5)
Arguments:
- df: Data frame to be summarized
- n_values: Number of unique values to display (default: 5)
Features:
- Displays row and column counts
- Shows column types
- Highlights missing data with color-coded percentages
- Presents frequency tables for categorical variables
Example:
df <- data.frame(
  x = sample(c("A", "B", "C"), size = 100, replace = TRUE),
  y = sample(c("D", "E", "F"), size = 100, replace = TRUE)
)
glimpse_with_table(df)
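The frequency-table part of this inspection can be approximated in base R. The sketch below is illustrative only (the helper `top_values()` is not part of the package); it shows the general idea of reporting each column's most common values.

```r
# Sketch: list the n most frequent values of each column, roughly the
# kind of summary glimpse_with_table() prints for categorical variables.
# top_values() is a hypothetical helper, not the package's implementation.
top_values <- function(df, n_values = 5) {
  lapply(df, function(col) {
    tab <- sort(table(col, useNA = "ifany"), decreasing = TRUE)
    head(tab, n_values)
  })
}

df <- data.frame(
  x = sample(c("A", "B", "C"), size = 100, replace = TRUE),
  y = sample(c("D", "E", "F"), size = 100, replace = TRUE)
)
top_values(df, n_values = 2)
```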
2.3.1.2 topdown_fa()
Performs top-down factor analysis with visualization.
Usage:
topdown_fa(df, nfactors = 1)
Arguments:
- df: Data frame for factor analysis
- nfactors: Number of factors to extract (default: 1)
Returns:
- ggplot visualization
- List containing:
  - Cronbach’s alpha
  - First eigenvalue
  - Factor loadings
  - Factor analysis plot
2.3.1.3 qualtrics_na_counter()
Analyzes missing data patterns in Qualtrics survey data.
Usage:
qualtrics_na_counter(data)
Features:
- Handles text columns appropriately
- Groups variables by common prefixes
- Generates visualizations of missing data patterns
- Returns detailed missing data statistics
2.3.2 Cleaning Module
2.3.2.1 parse_money_range()
Parses string monetary ranges into numeric vectors.
Usage:
parse_money_range(value, sep = NULL, limit = NULL, ceiling_increment = 10000)
Arguments:
- value: String containing monetary range
- sep: Separator between range values
- limit: How to handle unbounded ranges (“floor”/“ceiling”)
- ceiling_increment: Value to add for ceiling limits
Examples:
parse_money_range("$1,000 to $1,999", sep = "to")
parse_money_range("under $1,000", limit = "floor")
parse_money_range("$15,000 or over", limit = "ceiling")
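The core technique behind this kind of parsing is extracting the numeric endpoints from the string after stripping currency formatting. The sketch below illustrates that idea only; `extract_amounts()` is a hypothetical helper and does not reproduce the `sep`/`limit`/`ceiling_increment` behavior of `parse_money_range()`.

```r
# Sketch: pull numeric amounts out of a monetary-range string by
# matching digit groups and stripping "$" and "," before conversion.
extract_amounts <- function(value) {
  matches <- regmatches(value, gregexpr("\\$?[0-9][0-9,]*", value))[[1]]
  as.numeric(gsub("[$,]", "", matches))
}

extract_amounts("$1,000 to $1,999")   # returns c(1000, 1999)
```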
2.3.2.2 clean_likert_numeric_vector()
Normalizes Likert scale responses.
Usage:
clean_likert_numeric_vector(raw_vector)
Features:
- Scales data to 0-1 range
- Preserves original response spacing
- Handles missing values appropriately
2.3.2.3 finverser()
Inverts the order of unique values while preserving spacing.
Usage:
finverser(vec_col)
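Inverting order while preserving spacing amounts to reflecting the values around the scale's midpoint. A base-R sketch of that operation (illustrative; not the package's `finverser()` source):

```r
# Sketch: reflect a numeric vector around its midpoint, so the order of
# values is inverted but the spacing between them is preserved.
invert_scale <- function(vec_col) {
  max(vec_col, na.rm = TRUE) + min(vec_col, na.rm = TRUE) - vec_col
}

invert_scale(c(1, 2, 4))   # returns c(4, 3, 1)
```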
2.3.2.4 min_max_normalization()
Performs min-max normalization on numeric vectors.
Usage:
min_max_normalization(x)
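Min-max normalization rescales a vector linearly onto [0, 1]. A minimal sketch of the standard formula (the package function may differ in its NA handling):

```r
# Standard min-max normalization: (x - min) / (max - min),
# ignoring NAs when computing the range.
min_max <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

min_max(c(2, 4, 6))   # returns c(0, 0.5, 1)
```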
2.3.3 Merging Module
2.3.3.1 read_any_csv()
Robust CSV file reader with multiple fallback methods.
Usage:
read_any_csv(file_path, ...)
Features:
- Multiple reading attempts with different methods
- Handles problematic headers
- Supports various delimiters
- Error handling and informative messages
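The fallback pattern described above can be sketched with `tryCatch()`: attempt one reader and fall back to another on failure. The function name here is illustrative, not the package's internals.

```r
# Sketch: try read.csv() first; on error, report it and fall back to
# read.csv2() (semicolon-delimited, comma decimal separator).
read_csv_fallback <- function(file_path, ...) {
  tryCatch(
    utils::read.csv(file_path, ...),
    error = function(e) {
      message("read.csv failed (", conditionMessage(e), "); trying read.csv2")
      utils::read.csv2(file_path, ...)
    }
  )
}

tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,2"), tmp)
read_csv_fallback(tmp)
```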
2.3.3.2 read_survey()
Universal survey data reader supporting multiple formats.
Usage:
read_survey(file)
Supported formats:
- CSV
- XLSX
- SPSS (.sav)
- RDS
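A universal reader like this typically dispatches on the file extension. The sketch below shows that dispatch for the four formats listed, using the readers from the package's stated dependencies (haven, readxl); the helper name and exact logic are illustrative, not `read_survey()`'s source.

```r
# Sketch: choose a reader based on the file extension.
read_by_extension <- function(file) {
  switch(tolower(tools::file_ext(file)),
    csv  = utils::read.csv(file),
    xlsx = readxl::read_excel(file),
    sav  = haven::read_sav(file),
    rds  = readRDS(file),
    stop("Unsupported file format: ", file)
  )
}

tmp <- tempfile(fileext = ".rds")
saveRDS(data.frame(q1 = 1:3), tmp)
read_by_extension(tmp)
```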
2.3.3.3 Utility Functions
- load_variable(): Extracts specific variables from files
- generate_survey_ids(): Creates unique survey identifiers
- match_and_update(): Updates vectors based on matching names
- extract_elements_with_prefix(): Extracts elements with matching prefixes
2.3.4 Weighting Module
2.3.4.1 stratification_table()
Creates weighted stratified tables for demographic analysis.
Usage:
stratification_table(data, strata_level = "ses_state", strata_vars, weight_var = "weight")
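The idea of a weighted stratified table can be sketched in base R with `xtabs()`, which sums a weight variable per cell when it appears on the left-hand side of the formula. The data frame and column names below are illustrative; `stratification_table()` takes its own `strata_level`, `strata_vars`, and `weight_var` arguments.

```r
# Sketch: weighted counts and within-stratum shares of a demographic
# variable. Column names (state, gender, weight) are hypothetical.
survey <- data.frame(
  state  = c("NY", "NY", "TX", "TX"),
  gender = c("F", "M", "F", "F"),
  weight = c(1.2, 0.8, 1.0, 1.0)
)

# xtabs() with a weight on the left-hand side sums weights per cell
wtab <- xtabs(weight ~ state + gender, data = survey)
prop.table(wtab, margin = 1)   # weighted shares within each state
```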
2.4 Included Datasets
2.4.1 US Counties Data
Contains county-level information including:
- FIPS codes
- Population density
- Electoral projections
2.4.2 US Counties Map Data
Geographic boundary data for US counties:
- Longitude/latitude coordinates
- FIPS codes
- Boundary definitions
2.5 Best Practices and Tips
- Data Loading:
  - Use read_survey() as the primary data loading function
  - For problematic CSVs, try read_any_csv() with different parameters
- Survey Data Processing:
  - Clean Likert scales with clean_likert_numeric_vector()
  - Handle missing data using qualtrics_na_counter()
  - Use parse_money_range() for income/monetary data
- Analysis Workflow:
  - Start with glimpse_with_table() for data exploration
  - Use topdown_fa() for factor analysis
  - Apply appropriate normalization functions
- Performance Considerations:
  - Large datasets (>1M rows) may require chunked processing
  - Consider using data.table for very large datasets
  - Use appropriate encoding settings when reading files
2.6 Potential Improvements
- Function Enhancements:
- Add parallel processing support for large datasets
- Implement more robust error handling
- Add support for more file formats
- Enhance visualization options
- Documentation:
- Add more examples
- Create vignettes for common use cases
- Improve function cross-referencing
- Testing:
- Increase test coverage
- Add stress tests for large datasets
- Include more edge cases
- New Features:
- Add support for longitudinal data analysis
- Implement more statistical methods
- Add export functions for various formats
- Create interactive visualizations
2.7 Dependencies
- dplyr
- ggplot2
- readr
- haven
- readxl
- psych
- stringr
- tidyr
2.8 Contributing
Contributions are welcome! Please consider:
- Following the existing code style
- Adding tests for new features
- Updating documentation
- Creating detailed pull requests
2.9 License
[Add your license information here]
2.10 Contact
[Add your contact information here]
2.11 Citation
[Add citation information here]