Class RowMatcher


  • public class RowMatcher
    extends java.lang.Object
    Performs matching on the rows of one or more tables. The specifics of what constitutes a matched row, and some additional intelligence about how to determine this, are supplied by an associated MatchEngine object, but the generic parts of the matching algorithms are done here.

    Note that since the LinkSets and other objects handled by this class may be very large when large tables are being matched, the algorithms in this class are coded carefully to use as little memory as possible. Techniques include removing items from one collection as they are added to another. This means that in many cases input values may be modified by the methods.
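The remove-as-you-add technique mentioned above can be illustrated with a minimal sketch using only standard collections. The class and method names here (`DrainSketch`, `drainInto`) are hypothetical and are not part of this library; the sketch only shows why a caller's input collection may end up modified.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class DrainSketch {

    /**
     * Moves every element from source to dest, removing each element
     * from source as soon as it has been read. On return, source is empty,
     * so the two collections never both hold the full content at once.
     * Hypothetical helper for illustration only.
     */
    public static <T> void drainInto(Collection<T> source,
                                     Collection<? super T> dest) {
        for (Iterator<T> it = source.iterator(); it.hasNext();) {
            T item = it.next();
            it.remove();      // drop the entry from the input collection
            dest.add(item);   // and hand it over to the output collection
        }
    }

    public static void main(String[] args) {
        Set<String> input = new HashSet<>(Arrays.asList("a", "b", "c"));
        List<String> output = new ArrayList<>();
        drainInto(input, output);
        // The input has been consumed: callers must not rely on it afterwards.
        System.out.println("input empty: " + input.isEmpty());
        System.out.println("output size: " + output.size());
    }
}
```

A caller that still needs the original collection after such a call would have to copy it first, which is the trade-off the paragraph above describes.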

    Some of the computationally intensive work done by this abstract class is defined as abstract methods to be implemented by concrete subclasses.

    Since:
    13 Jan 2004
    Author:
    Mark Taylor (Starlink)
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static int DFLT_PARALLELISM
      Actual value for default parallelism (also limited by machine).
      static int DFLT_PARALLELISM_LIMIT
      Maximum suggested value for parallelism.
    • Method Summary

      Modifier and Type Method Description
      LinkSet createLinkSet()
      Constructs a new empty LinkSet for use by this matcher.
      static RowMatcher createMatcher(MatchEngine engine, uk.ac.starlink.table.StarTable[] tables, uk.ac.starlink.table.RowRunner runner)
      Creates a RowMatcher instance.
      LinkSet findGroupMatches(MultiJoinType[] joinTypes)
      Returns a set of RowLink objects corresponding to a match performed with this matcher's tables using its match engine.
      LinkSet findInternalMatches(boolean includeSingles)
      Returns a set of RowLink objects corresponding to all the internal matches in this matcher's sole table using its match engine.
      LinkSet findMultiPairMatches(int index0, boolean bestOnly, MultiJoinType[] joinTypes)
      Returns a set of RowLink objects each of which represents matches between one of the rows of a reference table and any of the other tables which can provide matches.
      LinkSet findPairMatches(PairMode pairMode)
      Returns a set of RowLink objects corresponding to a pairwise match between this matcher's two tables performed with its match engine.
      ProgressIndicator getIndicator()
      Returns the current progress indicator for this matcher.
      void setIndicator(ProgressIndicator indicator)
      Sets the progress indicator for this matcher.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • DFLT_PARALLELISM_LIMIT

        public static final int DFLT_PARALLELISM_LIMIT
        Maximum suggested value for parallelism. This value is limited mainly because not all the steps of matching have been parallelised (and it is not clear that all of them can be), so in accordance with Amdahl's Law there are diminishing returns as the number of processors increases. Some steps in fact slow down as more threads are added, because of the additional work required to combine data structures. The current value of 6 is somewhat, though not completely, arbitrary, having been set following some experimentation.
        See Also:
        Constant Field Values
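The diminishing returns predicted by Amdahl's Law can be made concrete with a short calculation. The sketch below is illustrative only: the class name `AmdahlSketch` and the parallel fraction of 0.8 are assumptions chosen for the example, not measurements of this matcher.

```java
public class AmdahlSketch {

    /**
     * Theoretical speedup according to Amdahl's Law:
     * 1 / ((1 - p) + p / n), where p is the fraction of the work
     * that can run in parallel and n is the number of threads.
     */
    public static double speedup(double parallelFraction, int nThreads) {
        double serialFraction = 1.0 - parallelFraction;
        return 1.0 / (serialFraction + parallelFraction / nThreads);
    }

    public static void main(String[] args) {
        // Hypothetical parallel fraction; the real figure for matching
        // is not stated in this documentation.
        double p = 0.8;
        for (int n = 1; n <= 12; n++) {
            System.out.printf("threads=%2d  speedup=%.2f%n", n, speedup(p, n));
        }
    }
}
```

With any fixed serial fraction, each additional thread contributes less than the one before it, and the speedup can never exceed 1 / (1 - p). The formula does not model the extra combination work that makes some steps slower with more threads, so real behaviour may be worse than this bound.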
      • DFLT_PARALLELISM

        public static final int DFLT_PARALLELISM
        Actual value for default parallelism (also limited by machine).
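One plausible reading of "also limited by machine" is that the default is the suggested limit capped by the number of processors available to the JVM. The sketch below shows that calculation under this assumption; `DefaultParallelismSketch`, `LIMIT` and `defaultParallelism` are hypothetical names, and the actual initialisation of this field is not shown in this documentation.

```java
public class DefaultParallelismSketch {

    /** Stand-in for the suggested limit, whose current value is given above as 6. */
    public static final int LIMIT = 6;

    /**
     * Returns the smaller of the suggested limit and the processor count,
     * but never less than 1. This is an assumed rule, not the library's code.
     */
    public static int defaultParallelism(int limit, int availableProcessors) {
        return Math.max(1, Math.min(limit, availableProcessors));
    }

    public static void main(String[] args) {
        int nproc = Runtime.getRuntime().availableProcessors();
        System.out.println("processors:          " + nproc);
        System.out.println("default parallelism: "
                + defaultParallelism(LIMIT, nproc));
    }
}
```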
    • Method Detail

      • setIndicator

        public void setIndicator(ProgressIndicator indicator)
        Sets the progress indicator for this matcher.
        Parameters:
        indicator - new indicator
      • getIndicator

        public ProgressIndicator getIndicator()
        Returns the current progress indicator for this matcher.
        Returns:
        indicator
      • createLinkSet

        public LinkSet createLinkSet()
        Constructs a new empty LinkSet for use by this matcher. The current implementation returns one based on a HashSet, but future implementations may provide the option of LinkSet implementations backed by disk.
        Returns:
        new LinkSet
      • findPairMatches

        public LinkSet findPairMatches(PairMode pairMode)
                                throws java.io.IOException,
                                       java.lang.InterruptedException
        Returns a set of RowLink objects corresponding to a pairwise match between this matcher's two tables performed with its match engine. Each element in the returned set corresponds to a matched pair with one entry from each of the input tables.
        Parameters:
        pairMode - matching mode to determine which rows appear in the result
        Returns:
        links representing matched rows
        Throws:
        java.io.IOException
        java.lang.InterruptedException
      • findMultiPairMatches

        public LinkSet findMultiPairMatches(int index0,
                                            boolean bestOnly,
                                            MultiJoinType[] joinTypes)
                                     throws java.io.IOException,
                                            java.lang.InterruptedException
        Returns a set of RowLink objects each of which represents matches between one of the rows of a reference table and any of the other tables which can provide matches. Elements of the result set will be instances of PairsRowLink.
        Parameters:
        index0 - index of the reference table in the list of tables owned by this row matcher
        bestOnly - true if only the best match between the reference table and any other table should be retained
        joinTypes - inclusion criteria for output table rows
        Returns:
        set of PairsRowLink objects representing multi-pair matches
        Throws:
        java.io.IOException
        java.lang.InterruptedException
      • findGroupMatches

        public LinkSet findGroupMatches(MultiJoinType[] joinTypes)
                                 throws java.io.IOException,
                                        java.lang.InterruptedException
        Returns a set of RowLink objects corresponding to a match performed with this matcher's tables using its match engine. Each element in the returned set corresponds to a matched group of input rows, with no more than one entry from each table. Each input table row appears in no more than one RowLink in the returned set. Any number of tables can be matched.
        Parameters:
        joinTypes - inclusion criteria for output table rows
        Returns:
        set of RowLinks corresponding to the selected rows
        Throws:
        java.io.IOException
        java.lang.InterruptedException
      • findInternalMatches

        public LinkSet findInternalMatches(boolean includeSingles)
                                    throws java.io.IOException,
                                           java.lang.InterruptedException
        Returns a set of RowLink objects corresponding to all the internal matches in this matcher's sole table using its match engine.
        Parameters:
        includeSingles - whether to include unmatched (singleton) row links in the returned link set
        Returns:
        a set of RowLink objects giving all the groups of matched objects in this matcher's sole table
        Throws:
        java.io.IOException
        java.lang.InterruptedException
      • createMatcher

        public static RowMatcher createMatcher(MatchEngine engine,
                                               uk.ac.starlink.table.StarTable[] tables,
                                               uk.ac.starlink.table.RowRunner runner)
        Creates a RowMatcher instance.
        Parameters:
        engine - matching engine
        tables - the array of tables on which matches are to be done
        runner - RowRunner to control multithreading, or null to fall back to sequential implementation
        Returns:
        new RowMatcher