Methodology & Psychometric Standards

How we build, validate, and maintain our assessments — and what we cannot do.

Last reviewed: May 2026

Evidence-Informed Assessment on My Path

My Path is built on the premise that psychological measurement is most useful when it is transparent, technically defensible, and interpreted with appropriate humility. We combine classical test theory metrics with modern item response theory (IRT) methods, normative reference information where available, and carefully constrained artificial intelligence to help people turn scores into narratives they can act on—without pretending that a screen-based questionnaire can replace clinical judgment, occupational licensure, or individualized professional advice.

This methodology overview explains how we approach scoring, calibration, validity evidence, model-assisted reporting, data ethics, cross-cultural adaptation, and ongoing research. It is written for readers who want more than marketing language: educators, researchers, HR partners, and curious test-takers who wish to understand what is—and is not—being claimed when a profile is generated from their responses.

Throughout, we distinguish between traits (relatively stable patterns), states (transient shifts in mood, fatigue, or context), and behaviors (observable actions that may or may not align with self-reported tendencies). Our instruments are primarily self-report measures; they access verbalized identity and phenomenology, not neurology, genetics, or immutable destiny. The sections that follow map each technical choice to a psychometric rationale and to the limits of inference that responsible users should keep in mind.

Dimensional Scoring and Likert Formats vs. Forced-Choice Designs

Many personality and interest inventories use multi-point rating scales (often called Likert-type items) because they efficiently capture gradations of agreement, frequency, or preference. Dimensional scoring treats each construct as a continuum: people differ in degree, not only in category membership. That continuity aligns with how most contemporary trait models are theorized (for example, broad personality domains in the Big Five tradition) and with how scores are used in practice—ranking relative standing, tracking change over time, or comparing profiles across contexts.

Forced-choice and ipsative formats (e.g., “pick the statement that is most like you” among equally desirable options) can reduce certain response biases such as acquiescence or extreme responding, but they introduce different challenges. Ipsative scores are often expressed as within-person allocations: elevating one scale can mathematically depress another even when the underlying traits do not change. That property complicates normative interpretation—knowing how high someone is in absolute terms relative to a population—and can distort correlations between scales in ways that are unintuitive to end users.

My Path emphasizes normative and dimensional interpretation where appropriate: estimated trait levels or construct scores are referenced to population or sample distributions when norms exist, and uncertainty is communicated rather than hidden behind single-point labels. When we report percentiles, standard scores, or continuous estimates, we intend them as guides to relative standing, not as clinical cutoffs unless a specific instrument has been validated for that purpose.
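
To make the relationship between percentiles, standard scores, and uncertainty concrete, here is a minimal Python sketch. It assumes normally distributed norms and a known standard error of measurement; the function names `standard_scores` and `score_band` and all example values are hypothetical and do not describe My Path's production scoring.

```python
from statistics import NormalDist


def standard_scores(raw, norm_mean, norm_sd):
    """Convert a raw scale score to z, T-score, and a normal-theory percentile.

    Assumes the reference distribution is approximately normal; skewed norms
    would need an empirical percentile table instead.
    """
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    z = (raw - norm_mean) / norm_sd
    return {
        "z": z,
        "t": 50 + 10 * z,                        # T-score metric: mean 50, SD 10
        "percentile": 100 * NormalDist().cdf(z),  # relative standing, not a cutoff
    }


def score_band(score, sem, confidence=0.95):
    """Symmetric uncertainty band: score +/- z* x SEM."""
    if sem < 0 or not 0 < confidence < 1:
        raise ValueError("sem must be >= 0 and confidence in (0, 1)")
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    return (score - z_star * sem, score + z_star * sem)
```

For example, `standard_scores(60, 50, 10)` gives a z of 1.0 and a T-score of 60, and `score_band(60, 3)` shows how wide the plausible range around that score is before a label is attached to it.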

Likert-type items are not bias-free. Social desirability, mood, and comprehension still matter. We mitigate these issues through item design (balanced keying where applicable, clear behavioral referents), quality filters (speeding checks, attention flags when supported by the instrument), and by pairing scores with narrative caveats. The goal is not to claim perfect objectivity, but to make the measurement model explicit: we measure reported tendencies on defined dimensions, under stated instructions, in a given language and context.

Item Response Theory (IRT) Calibration and Rasch-Style Models

Classical test theory summarizes an instrument mainly at the level of the total score. Item difficulties and discriminations can be computed, but they depend on the calibration sample, and precision is usually summarized by a single standard error of measurement applied across the whole score range. Item response theory (IRT) instead models the probability of a response as a function of latent ability or trait level and item parameters. For ordered response categories (typical of Likert scales), polytomous IRT models such as the graded response model or the generalized partial credit model specify category thresholds and slope parameters that describe how sharply an item differentiates along the latent continuum.
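
As an illustration of how thresholds and slopes enter a polytomous model, the following sketch computes category probabilities under the graded response model. The function name `grm_category_probs` and the parameter values in the usage note are illustrative assumptions, not calibrated item parameters.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under the graded response model.

    theta      -- latent trait level
    a          -- slope (discrimination) parameter
    thresholds -- strictly increasing category thresholds b_1 < ... < b_m

    P(X >= k) = logistic(a * (theta - b_k)); each category probability is the
    difference between adjacent cumulative probabilities.
    """
    if any(b2 <= b1 for b1, b2 in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")
    cumulative = (
        [1.0]
        + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
        + [0.0]
    )
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]
```

Calling `grm_category_probs(0.0, 1.5, [-2.0, -1.0, 1.0, 2.0])` returns five probabilities that sum to one, with the middle category most likely for a respondent at the center of the trait continuum. Raising `a` concentrates probability more sharply around the category nearest to `theta`.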

The Rasch family of models can be viewed as a simplified IRT framework with item discrimination fixed (often to a common value), yielding conjoint measurement properties that are attractive for scale construction: item locations on a common linear logit metric, person parameters on the same metric, and separability of item and person estimates under suitable designs. In practice, we use Rasch and related models when their assumptions are reasonable for the item set and when parsimony aids interpretability; we use more flexible IRT parameterizations when items vary meaningfully in discrimination or when category functioning is asymmetric.
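
The separability property can be seen in a few lines. This is a sketch of the dichotomous Rasch model only (the helper names are hypothetical): the log-odds gap between two respondents is the same whichever item is used to compare them.

```python
import math


def rasch_prob(theta, b):
    """Probability of endorsing a dichotomous item under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def log_odds(p):
    """Logit transform of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def person_gap(theta_1, theta_2, b):
    """Log-odds difference between two persons on an item with location b."""
    return log_odds(rasch_prob(theta_1, b)) - log_odds(rasch_prob(theta_2, b))
```

With discrimination fixed, `person_gap(1.0, 0.0, b)` equals 1.0 logit for any item location `b`. Under a model with item-specific slopes, the same gap would vary from item to item.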

Calibration begins with empirical data collected under standardized administration. We estimate item parameters, evaluate item fit (infit/outfit statistics in Rasch traditions; chi-square-based or residual-based checks in broader IRT), and review preliminary indicators of differential item functioning (DIF) when samples permit. Poorly behaving items may be revised, administratively re-weighted (for example, down-weighted or removed from live scoring pools), or flagged for replication studies before they inform high-stakes decisions.
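
For readers unfamiliar with infit and outfit, the sketch below computes both mean-square statistics for one dichotomous Rasch item. It assumes person estimates and the item location are already available; the function name `item_fit` is hypothetical, and polytomous items would need category-level expectations and variances instead.

```python
import math


def item_fit(responses, thetas, b):
    """Infit and outfit mean squares for one dichotomous Rasch item.

    responses -- 0/1 responses, one per person
    thetas    -- person trait estimates, same order as responses
    b         -- item location (difficulty) in logits
    """
    if len(responses) != len(thetas) or not responses:
        raise ValueError("responses and thetas must be non-empty and equal length")
    expected = [1.0 / (1.0 + math.exp(-(t - b))) for t in thetas]
    variances = [p * (1.0 - p) for p in expected]
    sq_resid = [(x - p) ** 2 for x, p in zip(responses, expected)]
    # Outfit: unweighted mean of squared standardized residuals (outlier-sensitive).
    outfit = sum(r / v for r, v in zip(sq_resid, variances)) / len(responses)
    # Infit: information-weighted, so well-targeted persons count more.
    infit = sum(sq_resid) / sum(variances)
    return {"infit": infit, "outfit": outfit}
```

Values near 1.0 indicate responses about as noisy as the model expects. Which departures from 1.0 justify revision is a judgment call that depends on sample size and the stakes of the decision.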

For operational scoring, calibrated item parameters contribute to trait estimates via maximum likelihood, maximum a posteriori, or weighted likelihood estimators depending on instrument length and prior information. Shorter scales may borrow information across items through priors on the latent distribution; longer scales may approach nearly unique estimates with narrow standard errors. Throughout, we treat IRT outputs as statistical estimates with uncertainty—not as oracle numbers.
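
The role of the prior in short scales can be illustrated with a grid-based maximum a posteriori (MAP) estimate. This is a sketch under simplifying assumptions: dichotomous two-parameter logistic items, a normal prior centered at zero, and a grid restricted to -4 to 4 logits. The function name `map_estimate` and its arguments are hypothetical.

```python
import math


def map_estimate(responses, items, prior_sd=1.0, n_points=801):
    """Grid-search MAP trait estimate with an approximate standard error.

    responses -- 0/1 responses
    items     -- (a, b) tuples: slope and location for each administered item
    """
    if len(responses) != len(items):
        raise ValueError("responses and items must have equal length")

    def log_posterior(theta):
        value = -0.5 * (theta / prior_sd) ** 2  # normal prior, up to a constant
        for x, (a, b) in zip(responses, items):
            p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
            value += math.log(p if x == 1 else 1.0 - p)
        return value

    grid = [-4.0 + 8.0 * i / (n_points - 1) for i in range(n_points)]
    theta_hat = max(grid, key=log_posterior)

    # Posterior information: prior precision plus item information at theta_hat.
    info = 1.0 / prior_sd ** 2
    for a, b in items:
        p = 1.0 / (1.0 + math.exp(-a * (theta_hat - b)))
        info += a * a * p * (1.0 - p)
    return theta_hat, 1.0 / math.sqrt(info)
```

With no responses the estimate is the prior mean with a standard error equal to `prior_sd`. Each additional item adds information and narrows the standard error, which is why longer scales lean less on the prior.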

We also monitor drift: if wording, cultural usage, or platform interaction patterns shift response probabilities, periodic re-calibration updates the operational item bank so that longitudinal comparisons remain meaningful. Transparency means we acknowledge when revisions change metric continuity and how we bridge old and new scales when necessary.

Construct Validity, Reliability, and Alignment with External Criteria

A score is only as trustworthy as the evidence linking it to the construct it claims to represent. Construct validity is a program of inquiry, not a single coefficient. Our internal standards prioritize multiple strands of evidence: content validity (coverage of the domain through principled item generation and expert review), structural validity (factor structure congruent with theory), convergent and discriminant validity (expected correlations with related and unrelated measures), and criterion-related validity where outcomes are ethically and practically available.

Internal consistency reliability—often summarized with Cronbach’s coefficient alpha or coefficient omega for multidimensional scales—addresses whether items covary as if sampling a common latent variable. We report these metrics conservatively and avoid treating alpha as sufficient for validity. Very high alpha can indicate redundancy rather than fidelity; low alpha invites revision or abandonment of composite reporting. Where scales are multidimensional, we examine subscale alphas and factor models to guard against artificially inflated composites.
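
For reference, coefficient alpha can be computed directly from a respondents-by-items score matrix. The sketch below uses the standard formula with sample variances; the function name `cronbach_alpha` and the input layout are assumptions for illustration.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    if len(rows) < 2:
        raise ValueError("need at least two respondents")
    k = len(rows[0])
    if k < 2 or any(len(r) != k for r in rows):
        raise ValueError("need at least two items and equal-length rows")
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    total_var = variance([sum(r) for r in rows])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

Perfectly redundant items, such as `[[1, 1, 1], [2, 2, 2], [3, 3, 3]]`, yield an alpha of 1.0, which illustrates why a very high value can signal redundancy rather than breadth of coverage.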

Test–retest reliability quantifies temporal stability across a suitable interval. For traits, modest stability over weeks to months supports the interpretation of enduring patterns; for states or situational judgments, instability may be intrinsic rather than “error.” We therefore align retest intervals and expectations with construct definitions. Stability coefficients are interpreted alongside mean-level change: two identical rank orders can still mask systematic shifts if the construct is state-like.
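
The point about rank-order stability and mean-level change can be shown with a small sketch that reports both quantities side by side. The function name `retest_summary` and the example scores are hypothetical.

```python
import math
from statistics import mean


def retest_summary(wave1, wave2):
    """Pearson test-retest correlation plus mean-level change between waves."""
    if len(wave1) != len(wave2) or len(wave1) < 2:
        raise ValueError("need two equal-length waves with at least two people")
    m1, m2 = mean(wave1), mean(wave2)
    sxy = sum((x - m1) * (y - m2) for x, y in zip(wave1, wave2))
    sxx = sum((x - m1) ** 2 for x in wave1)
    syy = sum((y - m2) ** 2 for y in wave2)
    if sxx == 0 or syy == 0:
        raise ValueError("scores in a wave have zero variance")
    return {"r": sxy / math.sqrt(sxx * syy), "mean_change": m2 - m1}
```

For `retest_summary([10, 20, 30], [15, 25, 35])` the stability coefficient is 1.0 even though every respondent moved up by 5 points, so the correlation alone would hide a systematic shift.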

Convergent validity with external “gold standard” instruments is pursued when licenses, access, and study design allow. We compare our estimates to well-established measures in representative subsamples, documenting effect sizes rather than overfitting narrative claims. Discriminant checks ensure that, for example, a purported measure of interest does not merely recapitulate general cognitive ability or mood unless theory predicts overlap.

Finally, we attend to incremental validity: does the instrument add predictive or explanatory value beyond simpler predictors? When evidence is still emerging, we say so plainly. Our platform prefers calibrated modesty to marketing superlatives.

AI-Assisted Report Generation: From Numerical Vectors to Narrative

Large language models (LLMs) can translate quantitative profiles into readable summaries, examples, and integration across multiple tests. On My Path, AI report generation is architected as a constrained pipeline: structured numeric inputs (scale scores, confidence intervals or standard errors where available, within-person contrasts, and permitted interpretive frames) are serialized into a schema the model must respect. The system prompt and tool-level contracts specify prohibited claims—no fabricated citations, no diagnostic language outside licensed contexts, no invented biographical details—and require uncertainty language when evidence is weak.
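
One way to picture a constrained input schema is a typed record that is validated before it is serialized for the model. The sketch below is purely illustrative: the `ScaleScore` fields, the `build_report_payload` function, and the JSON layout are hypothetical and are not My Path's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ScaleScore:
    """One scale result as produced by a deterministic scoring step."""
    scale: str
    t_score: float
    standard_error: float


def build_report_payload(scores, allowed_frames):
    """Validate scores and serialize them with the permitted interpretive frames."""
    if not scores:
        raise ValueError("refusing to build a payload with no scores")
    for s in scores:
        if s.standard_error <= 0:
            raise ValueError(f"invalid standard error for scale {s.scale!r}")
    payload = {
        "scores": [asdict(s) for s in scores],
        "allowed_frames": sorted(allowed_frames),
    }
    # sort_keys gives a stable serialization that is easier to log and diff.
    return json.dumps(payload, sort_keys=True)
```

Rejecting incomplete inputs before generation mirrors the refusal behavior described above: the model only ever sees numbers that passed validation, and it receives them as data rather than as free text.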

Temperature and related sampling parameters are set conservatively for factual synthesis tasks. For narrative elaboration that must remain closely tethered to the provided profile, we favor lower temperature and bounded decoding to reduce drift. For optional brainstorming modules clearly labeled as speculative, slightly higher creativity settings may be used with explicit user framing. In all cases, post-generation checks can flag disallowed patterns (e.g., medical advice, certainty about future behavior) for regeneration or human review workflows when available.
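
A post-generation check of this kind can be as simple as a set of named patterns applied to the draft text. The patterns below are deliberately small, hypothetical examples; a real rule set would be far broader and would likely combine pattern matching with other classifiers.

```python
import re

# Hypothetical rule set: each name maps to a pattern that should never
# appear in a generated report.
DISALLOWED_PATTERNS = {
    "diagnostic_language": re.compile(
        r"\byou (?:have|suffer from) (?:depression|adhd|an anxiety disorder)\b",
        re.IGNORECASE,
    ),
    "future_certainty": re.compile(
        r"\byou will (?:definitely|certainly|always|never)\b", re.IGNORECASE
    ),
    "medical_advice": re.compile(r"\b(?:stop|start) taking\b", re.IGNORECASE),
}


def flag_report(text):
    """Return the sorted names of every disallowed pattern found in text."""
    return sorted(
        name for name, pattern in DISALLOWED_PATTERNS.items() if pattern.search(text)
    )
```

A non-empty result from `flag_report` would route the draft to regeneration or human review. An empty result only means no known pattern matched, not that the text is safe.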

Hallucination mitigation is treated as an engineering and psychometric problem, not a prompt footnote. We combine retrieval of vetted interpretive content where appropriate, template-backed sentence scaffolds for high-risk clauses, refusal behaviors when inputs are incomplete, and logging that separates model outputs from authoritative score computations. Numerical results reported to users originate from deterministic scoring pathways; LLMs do not recompute latent traits.

Cross-test profiles integrate vectors from distinct instruments under explicit compatibility assumptions. Where constructs overlap across batteries, we state the theoretical mapping; where they diverge, we avoid false unification. The AI’s role is to communicate trade-offs—for example, when high Openness aligns with Artistic interests in one framework but clashes with conscientious regimentation in situational judgments—rather than to collapse multidimensional evidence into a single identity slogan.

User-facing transparency includes indicating when text is model-generated, how inputs were derived, and how to request human support for concerns about interpretation.

Known Limitations of Self-Report Personality and Aptitude Measures

Personality questionnaires measure self-reported tendencies, values, preferences, and self-concept fragments. They do not directly measure neurons, hormonal states, parental attachment history, occupational success, morality, criminal propensity, or immutable potential. Inferring those entities from scales without independent evidence is extrapolation—not measurement.

States fluctuate: sleep deprivation, caffeine, acute stress, euphoria, illness, grief, seasonal patterns, organizational culture, economic pressure—all can reshape item endorsement even when latent traits remain similar. Repeated testing without careful spacing can induce practice effects or reflect real change; both complicate simplistic “trait” narratives. Instructions ask respondents to summarize typical patterns precisely to reduce—but not erase—state contamination.

Self-report introduces social desirability, blind spots (lack of introspective access), intentional distortion (impression management), and linguistic or educational barriers affecting item comprehension. Procedural remedies include reversals, realism instructions, randomized item order within constraints, latency analytics where ethically collected and disclosed, and cross-informant designs when feasible in research contexts—not always in consumer-facing flows.

Our instruments are not substitutes for clinical interviews, psychoeducational assessments, forensic evaluations, licensing exams, neuropsychological batteries, workplace certification, or ADA-related determinations. Any similarity to diagnoses or job-fit labels is illustrative and non-authoritative unless a specific validated use case explicitly supports it; even then, caution applies whenever results are used outside official channels.

Finally, correlational architectures have boundary conditions: group differences must be contextualized ethically; stereotype amplification is an active risk when scores are generalized across cultures without local norms or when stereotypes are mistakenly treated as causal mechanisms. Responsible communication foregrounds individuality and measurement error rather than deterministic typing.

Data Privacy, Ethics, and Stewardship

Trust is prerequisite to voluntary psychological measurement. My Path maintains a fiduciary posture toward respondent data: we do not sell personal data to advertisers or brokers. Operational funding comes from subscriptions and ethically scoped services—not from monetizing private answers as standalone commodities.

Raw item responses undergo cryptographic hashing and strict access controls consistent with the principle of least privilege. Identifiers useful for longitudinal service delivery are segregated where feasible from analytic replicas; aggregates for model improvement exclude direct identifiers unless users provide informed, specific consent aligned with jurisdictional norms.

Anonymized or de-identified aggregate statistics may support calibration, fairness monitoring, linguistic adaptation, security anomaly detection, and scientific communication. Aggregation is implemented with safeguards against trivial re-identification in small cells; suppressed counts and noise infusion may be employed when reporting edge distributions.
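
Small-cell suppression is the simplest of these safeguards. The sketch below assumes a hypothetical minimum cell size of 10 and a function named `suppress_small_cells`; it does not handle complementary suppression (recovering a hidden cell from published totals) or noise infusion.

```python
def suppress_small_cells(counts, min_cell=10):
    """Replace counts below min_cell with None before publishing aggregates.

    counts   -- mapping from cell label to respondent count
    min_cell -- smallest count that may be reported (an assumed policy value)
    """
    if min_cell < 1:
        raise ValueError("min_cell must be at least 1")
    published = {}
    for cell, n in counts.items():
        published[cell] = n if n >= min_cell else None
    return published
```

For instance, `suppress_small_cells({"region_a": 120, "region_b": 4})` keeps the first count and returns `None` for the second, so that a handful of respondents cannot be singled out from a published table.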

Users retain substantive rights aligned with GDPR-like expectations where applicable, including access, portability of derived summaries where technically feasible, correction of account metadata, objection to certain processing bases, restriction, and deletion. Deletion cascades through operational data, subject to lawful retention carve-outs documented in formal policies.

We refuse uses that amplify coercion or undue influence: covert testing, deceptive framing, covert surveillance via tests, discriminatory slicing without safeguards, or model training schemes that incentivize deceptive answer patterns. Transparency documents describe retention windows, subprocessors under contract, jurisdictions of processing, and how to escalate privacy concerns.

Ethics review for special research deployments, especially those involving minors, workplaces, or educational institutions, expects proportionality and a clear benefit-to-risk calculus. Institutional partners are expected to uphold parallel duties under applicable IRB-equivalent norms.

Cross-Cultural Adaptation: Translation, Localization, and Norms

Psychological constructs travel imperfectly across languages and cultures. Direct translation seldom suffices; linguistic adaptation must preserve psychological distance, idiom neutrality, grammatical symmetry, reading level, and appropriateness of behavioral exemplars across regions. Our localization pipeline draws on established practices such as iterative forward–back translation, adjudication committees, bilingual cognitive interviewing, small-sample pilot fielding, DIF scrutiny, and metric invariance testing, rather than a single glossary pass.

Localization expands beyond lexical substitution: norms, illustrative occupations, lawful concepts, etiquette around self-disclosure, and platform UI metaphors interact with endorsement patterns. A scale may function structurally equivalently yet exhibit uniform shifts in thresholds (scalar non-invariance) or item-level bias that requires rewriting rather than renaming.

Norming stratifies distributions by geography, age, education, occupation, gender identity and related covariates when appropriate and legally permissible—with explicit acknowledgement that simplistic demographic buckets conceal heterogeneity inside cells. Adaptive reporting may switch between local norms and global composites when justified by statistical evidence and fairness checks.

We avoid cultural essentialism in narrative outputs: stereotypes about nations or ethnolinguistic groups are neither inputs nor sanctioned inferences from scores. Comparative statements reference normative anchors transparently disclosed to the respondent.

When local validation data are sparse, we communicate wider uncertainty intervals and withhold fine-grained comparisons that would overfit noise. Conversely, accumulating evidence tightens thresholds and strengthens claims over time—a commitment articulated in longitudinal research agendas.

Longitudinal Research, Repeated Testing, and Separating Trait from State Drift

Repeat administration is scientifically valuable yet interpretively delicate. With appropriate spacing, longitudinal designs estimate stability coefficients, quantify practice effects and attrition biases, probe intervention sensitivity, and test whether state-like fluctuations fade while trait cores persist. Poorly spaced retests can artificially inflate stability or, conversely, capture acute mood shocks that are then misread as enduring change.

Our platform architecture distinguishes within-person deltas on each construct from cohort-level secular trends attributable to linguistic drift, societal events, instructional tone changes across app versions, or norm updates. Transparent change scores include standard errors to discourage over-interpretation of tiny movements near measurement noise floors.
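
A change score with its standard error can be sketched as follows. The calculation assumes the two measurement errors are independent, so their variances add; the function name `change_score` and the 1.96 threshold are illustrative choices rather than platform settings.

```python
import math


def change_score(score_1, se_1, score_2, se_2, z_crit=1.96):
    """Within-person change between two occasions, with its standard error.

    The standard error of a difference between two independent estimates is
    sqrt(se_1^2 + se_2^2). The change is flagged only when it exceeds
    z_crit standard errors in absolute value.
    """
    if se_1 < 0 or se_2 < 0 or (se_1 == 0 and se_2 == 0):
        raise ValueError("standard errors must be non-negative and not both zero")
    delta = score_2 - score_1
    se_delta = math.sqrt(se_1 ** 2 + se_2 ** 2)
    z = delta / se_delta
    return {
        "delta": delta,
        "se_delta": se_delta,
        "z": z,
        "exceeds_noise": abs(z) > z_crit,
    }
```

A move from 50 to 53 with standard errors of 3 and 4 gives a change of 3 points against a standard error of 5, well inside the noise floor. A move from 50 to 65 with the same standard errors would be flagged.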

State–trait partitioning leverages multi-wave models when sample sizes suffice: latent trait factor models with occasion-specific residuals, latent curve models capturing growth trajectories, and mixture models probing heterogeneous subgroups (for example, plateaued vs. abruptly shifting profiles) to avoid averaging away meaningful divergence.

Research ethics govern notifications about data reuse beyond core service analytics. Distinct consents cover personalization, aggregated science, benchmark publications, external collaboration, synthetic data generation, red-team exercises, localization experiments, and fairness audits. Users can withhold participation without losing the foundational access guarantees enumerated in the privacy documentation.

Public-facing science communications summarize findings with effect sizes and confidence—not p-hacked peaks—while guarding against undue hype linking personality metrics to deterministic life outcomes.

In sum, My Path treats longitudinal intelligence as iterative evidence accumulation: each wave refines norms, interrogates fairness, strengthens or weakens theoretical bridges, sharpens constrained AI narratives—and deepens humility about how much any single questionnaire can ever say about a richly situated human life.