Package picard.vcf
Class GenotypeConcordance
java.lang.Object
picard.cmdline.CommandLineProgram
picard.vcf.GenotypeConcordance
Summary
Calculates the concordance between genotype data of one sample in each of two VCFs - one being considered the truth (or reference) the other being the call. The concordance is broken into separate results sections for SNPs and indels. Satistics are reported in three different files.Details
This tool evaluates the concordance between genotype calls for a sample in different callsets where one is being considered as the \"truth\" (aka standard, or reference) and the other as the \"call\" that is being evaluated for accuracy. The Comparison can be restricted to a confidence interval which is typically used in order to enable proper assessment of False Positives and the False-Positive Rate (FPR).Usage example
Compare two VCFs within a confidence region
java -jar picard.jar GenotypeConcordance \\ CALL_VCF=input.vcf \\ CALL_SAMPLE=sample_name \\ O=gc_concordance.vcf \\ TRUTH_VCF=truth_set.vcf \\ TRUTH_SAMPLE=sample_in_truth \\ INTERVALS=confident.interval_list \\ MISSING_SITES_HOM_REF = true
Output Metrics:
Output metrics consists of GenotypeConcordanceContingencyMetrics, GenotypeConcordanceSummaryMetrics, and GenotypeConcordanceDetailMetrics. For each set of metrics, the data is broken into separate sections for SNPs and INDELs. Note that only SNP and INDEL variants are considered, MNP, Symbolic, and Mixed classes of variants are not included.GenotypeConcordanceContingencyMetrics
enumerate the constituents of each contingent in a callset including true-positive (TP), true-negative (TN), false-positive (FP), and false-negative (FN) calls.GenotypeConcordanceDetailMetrics
include the numbers of SNPs and INDELs for each contingent genotype as well as the number of validated genotypes.GenotypeConcordanceSummaryMetrics
provide specific details for the variant caller performance on a callset including values for sensitivity, specificity, and positive predictive values.
Useful definitions applicable to alleles and genotypes:
- Truthset - A callset (typically in VCF format) containing variant calls and genotypes that have been cross-validated with multiple technologies e.g. Genome In A Bottle Consortium (GIAB) (https://sites.stanford.edu/abms/giab)
- TP - True-positives are variant sites that match against the truth-set
- FP - False-positives are reference sites miscalled as variant
- FN - False-negatives are variant sites miscalled as reference
- TN - True-negatives are correctly called as reference
- Validated genotypes - are TP sites where the exact genotype (HET or HOM-VAR) appears in the truth-set
VCF Output:
- The concordance state will be stored in the CONC_ST tag in the INFO field
- The truth sample name will be \"truth\" and call sample name will be \"call\"
-
Nested Class Summary
Nested ClassesModifier and TypeClassDescriptionprotected static class
A simple structure to return the results of getAlleles. -
Field Summary
FieldsModifier and TypeFieldDescriptionstatic final String
static final htsjdk.variant.vcf.VCFHeaderLine
static final String
static final String
boolean
protected GenotypeConcordanceCounts
boolean
int
int
boolean
boolean
boolean
static final String
static final String
static final String
protected GenotypeConcordanceCounts
static final String
boolean
Fields inherited from class picard.cmdline.CommandLineProgram
COMPRESSION_LEVEL, CREATE_INDEX, CREATE_MD5_FILE, MAX_ALLOWABLE_ONE_LINE_SUMMARY_LENGTH, MAX_RECORDS_IN_RAM, QUIET, REFERENCE_SEQUENCE, referenceSequence, specialArgumentsCollection, SYNTAX_TRANSITION_URL, TMP_DIR, USE_JDK_DEFLATER, USE_JDK_INFLATER, VALIDATION_STRINGENCY, VERBOSITY
-
Constructor Summary
Constructors -
Method Summary
Modifier and TypeMethodDescriptionstatic void
addMissingTruthAndMissingCallStates
(double numVariants, long intervalBaseCount, GenotypeConcordanceCounts counter) Method to add missing sites that are KNOWN to be HOM_REF in the case of the NIST truth data set.static boolean
classifyVariants
(Optional<htsjdk.variant.variantcontext.VariantContext> truthContext, String truthSample, Optional<htsjdk.variant.variantcontext.VariantContext> callContext, String callSample, int minGq, int minDp, boolean ignoreFilterStatus) static boolean
classifyVariants
(Optional<htsjdk.variant.variantcontext.VariantContext> truthContext, String truthSample, Optional<htsjdk.variant.variantcontext.VariantContext> callContext, String callSample, Optional<GenotypeConcordanceCounts> snpCounter, Optional<GenotypeConcordanceCounts> indelCounter, int minGq, int minDp, boolean ignoreFilteredStatus) Attempts to determine the concordance state given the truth and all variant context and optionally increments the genotype concordance count for the given variant type (SNP or INDEL).protected String[]
Put any custom command-line validation in an override of this method.static final GenotypeConcordanceStates.TruthAndCallStates
determineState
(htsjdk.variant.variantcontext.VariantContext truthContext, String truthSample, htsjdk.variant.variantcontext.VariantContext callContext, String callSample, int minGq, int minDp, Boolean ignoreFilteredStatus) A method to determine the truth and call states for a pair of variant contexts representing truth and call.protected int
doWork()
Do the work after command line has been parsed.protected static final GenotypeConcordance.Alleles
normalizeAlleles
(htsjdk.variant.variantcontext.VariantContext truthContext, String truthSample, htsjdk.variant.variantcontext.VariantContext callContext, String callSample, Boolean ignoreFilteredStatus) Gets the alleles for the truth and call genotypes.static void
outputDetailMetricsFile
(htsjdk.variant.variantcontext.VariantContext.Type variantType, htsjdk.samtools.metrics.MetricsFile<GenotypeConcordanceDetailMetrics, ?> genotypeConcordanceDetailMetricsFile, GenotypeConcordanceCounts counter, String truthSampleName, String callSampleName, boolean missingSitesHomRef, boolean outputAllRows) Outputs the detailed statistics tables for SNP and Indel match categories.Methods inherited from class picard.cmdline.CommandLineProgram
checkRInstallation, getCommandLine, getCommandLineParser, getCommandLineParserForArgs, getDefaultHeaders, getFaqLink, getMetricsFile, getPGRecord, getStandardUsagePreamble, getStandardUsagePreamble, getVersion, hasWebDocumentation, instanceMain, instanceMainWithExit, makeReferenceArgumentCollection, parseArgs, requiresReference, setDefaultHeaders, useLegacyParser
-
Field Details
-
TRUTH_VCF
-
CALL_VCF
-
OUTPUT
@Argument(shortName="O", doc="Basename for the three metrics files that are to be written. Resulting files will be <OUTPUT>.genotype_concordance_summary_metrics, <OUTPUT>.genotype_concordance_detail_metrics, and <OUTPUT>.genotype_concordance_contingency_metrics.") public File OUTPUT -
OUTPUT_VCF
@Argument(doc="Output a VCF annotated with concordance information.") public boolean OUTPUT_VCF -
TRUTH_SAMPLE
@Argument(shortName="TS", doc="The name of the truth sample within the truth VCF. Not required if only one sample exists.", optional=true) public String TRUTH_SAMPLE -
CALL_SAMPLE
@Argument(shortName="CS", doc="The name of the call sample within the call VCF. Not required if only one sample exists.", optional=true) public String CALL_SAMPLE -
INTERVALS
@Argument(doc="One or more interval list files that will be used to limit the genotype concordance. Note - if intervals are specified, the VCF files must be indexed.", optional=true) public List<PicardHtsPath> INTERVALS -
INTERSECT_INTERVALS
@Argument(doc="If true, multiple interval lists will be intersected. If false multiple lists will be unioned.") public boolean INTERSECT_INTERVALS -
MIN_GQ
@Argument(doc="Genotypes below this genotype quality will have genotypes classified as LowGq.") public int MIN_GQ -
MIN_DP
@Argument(doc="Genotypes below this depth will have genotypes classified as LowDp.") public int MIN_DP -
OUTPUT_ALL_ROWS
@Argument(doc="If true, output all rows in detailed statistics even when count == 0. When false only output rows with non-zero counts.") public boolean OUTPUT_ALL_ROWS -
USE_VCF_INDEX
@Argument(doc="If true, use the VCF index, else iterate over the entire VCF.", optional=true) public boolean USE_VCF_INDEX -
MISSING_SITES_HOM_REF
@Argument(shortName="MISSING_HOM", doc="Default is false, which follows the GA4GH Scheme. If true, missing sites in the truth set will be treated as HOM_REF sites and sites missing in both the truth and call sets will be true negatives. Useful when hom ref sites are left out of the truth set. This flag can only be used with a high confidence interval list.") public boolean MISSING_SITES_HOM_REF -
IGNORE_FILTER_STATUS
@Argument(doc="Default is false. If true, filter status of sites will be ignored so that we include filtered sites when calculating genotype concordance. ", optional=true) public boolean IGNORE_FILTER_STATUS -
SUMMARY_METRICS_FILE_EXTENSION
- See Also:
-
DETAILED_METRICS_FILE_EXTENSION
- See Also:
-
CONTINGENCY_METRICS_FILE_EXTENSION
- See Also:
-
OUTPUT_VCF_FILE_EXTENSION
- See Also:
-
snpCounter
-
indelCounter
-
CONTINGENCY_STATE_TAG
- See Also:
-
CONTINGENCY_STATE_HEADER_LINE
public static final htsjdk.variant.vcf.VCFHeaderLine CONTINGENCY_STATE_HEADER_LINE -
OUTPUT_VCF_TRUTH_SAMPLE_NAME
- See Also:
-
OUTPUT_VCF_CALL_SAMPLE_NAME
- See Also:
-
-
Constructor Details
-
GenotypeConcordance
public GenotypeConcordance()
-
-
Method Details
-
getSnpCounter
-
getIndelCounter
-
customCommandLineValidation
Description copied from class:CommandLineProgram
Put any custom command-line validation in an override of this method. clp is initialized at this point and can be used to print usage and access argv. Any options set by command-line parser can be validated.- Overrides:
customCommandLineValidation
in classCommandLineProgram
- Returns:
- null if command line is valid. If command line is invalid, returns an array of error message to be written to the appropriate place.
-
doWork
protected int doWork()Description copied from class:CommandLineProgram
Do the work after command line has been parsed. RuntimeException may be thrown by this method, and are reported appropriately.- Specified by:
doWork
in classCommandLineProgram
- Returns:
- program exit status.
-
classifyVariants
-
classifyVariants
public static boolean classifyVariants(Optional<htsjdk.variant.variantcontext.VariantContext> truthContext, String truthSample, Optional<htsjdk.variant.variantcontext.VariantContext> callContext, String callSample, Optional<GenotypeConcordanceCounts> snpCounter, Optional<GenotypeConcordanceCounts> indelCounter, int minGq, int minDp, boolean ignoreFilteredStatus) Attempts to determine the concordance state given the truth and all variant context and optionally increments the genotype concordance count for the given variant type (SNP or INDEL). This will ignore cases where an indel was found in the truth and a SNP was found in the call, and vice versa. We typically fail to classify Mixed, Symbolic variants, or MNPs.- Parameters:
truthContext
- A variant context representing truthtruthSample
- The name of the truth samplecallContext
- A variant context representing the callcallSample
- The name of the call samplesnpCounter
- optionally a place to increment the counts for SNP truth/call statesindelCounter
- optionally a place to increment the counts for INDEL truth/call statesminGq
- Threshold for filtering by genotype attribute GQminDp
- Threshold for filtering by genotype attribute DP- Returns:
- true if the concordance state could be classified.
-
addMissingTruthAndMissingCallStates
public static void addMissingTruthAndMissingCallStates(double numVariants, long intervalBaseCount, GenotypeConcordanceCounts counter) Method to add missing sites that are KNOWN to be HOM_REF in the case of the NIST truth data set. -
outputDetailMetricsFile
public static void outputDetailMetricsFile(htsjdk.variant.variantcontext.VariantContext.Type variantType, htsjdk.samtools.metrics.MetricsFile<GenotypeConcordanceDetailMetrics, ?> genotypeConcordanceDetailMetricsFile, GenotypeConcordanceCounts counter, String truthSampleName, String callSampleName, boolean missingSitesHomRef, boolean outputAllRows) Outputs the detailed statistics tables for SNP and Indel match categories. -
normalizeAlleles
protected static final GenotypeConcordance.Alleles normalizeAlleles(htsjdk.variant.variantcontext.VariantContext truthContext, String truthSample, htsjdk.variant.variantcontext.VariantContext callContext, String callSample, Boolean ignoreFilteredStatus) Gets the alleles for the truth and call genotypes. In particular, this handles the case where indels can have different reference alleles. -
determineState
public static final GenotypeConcordanceStates.TruthAndCallStates determineState(htsjdk.variant.variantcontext.VariantContext truthContext, String truthSample, htsjdk.variant.variantcontext.VariantContext callContext, String callSample, int minGq, int minDp, Boolean ignoreFilteredStatus) A method to determine the truth and call states for a pair of variant contexts representing truth and call. A variety of variant and genotype-level checks are first used to determine if either of the the variant contexts are filtered and after that a comparison of the called genotype alleles to determine appropriate truth and call state Note that this method does NOT check for SNP versus Indel. It is assumed that that check is done by the caller and the results of this method are stored by SNP/Indel. Note that if a variant context has BOTH GQ and DP less than the specified threshold, then it will be of Truth/Call State LOW_GQ- Parameters:
truthContext
- A variant context representing truthtruthSample
- The name of the truth samplecallContext
- A variant context representing the callcallSample
- The name of the call sampleminGq
- Threshold for filtering by genotype attribute GQminDp
- Threshold for filtering by genotype attribute DP- Returns:
- TruthAndCallStates object containing the TruthState and CallState determined here.
-