org.apache.pig.data
Class InternalDistinctBag
java.lang.Object
org.apache.pig.data.DefaultAbstractBag
org.apache.pig.data.SelfSpillBag
org.apache.pig.data.SortedSpillBag
org.apache.pig.data.InternalDistinctBag
- All Implemented Interfaces:
- Serializable, Comparable, Iterable<Tuple>, org.apache.hadoop.io.Writable, org.apache.hadoop.io.WritableComparable, DataBag, Spillable
@InterfaceAudience.Private
@InterfaceStability.Evolving
public class InternalDistinctBag
- extends SortedSpillBag
An unordered collection of Tuples with no duplicates. Duplicates are
eliminated as tuples are added. In memory, the data is
stored in a HashSet. When it is time to spill, the data is copied into an
ArrayList, sorted, and written to disk. Despite all these machinations, this was
found to be faster than storing it in a TreeSet.
This bag spills proactively when the number of tuples in memory
reaches a limit.
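The storage strategy described above can be sketched with plain JDK collections. This is a minimal illustration, not the Pig implementation: the class name DistinctSpillSketch, the limit parameter, and the spilledRuns list (which stands in for spill files on disk) are all hypothetical, and Strings stand in for Tuples.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch only; names and structure are assumptions, not Pig code.
final class DistinctSpillSketch {
    private final int limit;                                // assumed in-memory cap
    private final Set<String> memory = new HashSet<>();     // rejects duplicates on add
    private final List<List<String>> spilledRuns = new ArrayList<>(); // stands in for spill files

    DistinctSpillSketch(int limit) {
        this.limit = limit;
    }

    // Adds a value; spills proactively once the in-memory set reaches the limit.
    void add(String value) {
        if (memory.add(value) && memory.size() >= limit) {
            spill();
        }
    }

    // Copies the set into a list, sorts it, records it as one run,
    // and releases the in-memory data. Returns the number of items spilled.
    long spill() {
        if (memory.isEmpty()) {
            return 0;
        }
        List<String> run = new ArrayList<>(memory);
        Collections.sort(run);
        spilledRuns.add(run);
        memory.clear();
        return run.size();
    }

    List<List<String>> runs() {
        return spilledRuns;
    }

    int inMemory() {
        return memory.size();
    }
}
```

In this sketch the HashSet only removes duplicates among the values currently in memory. A value added after a spill can reappear in a later run, so a reader of the runs would still have to drop duplicates when merging them; keeping each run sorted is what makes that merge step straightforward.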
- See Also:
- Serialized Form
Method Summary
void | add(Tuple t) | Add a tuple to the bag.
boolean | isDistinct() | Find out if the bag is distinct.
boolean | isSorted() | Find out if the bag is sorted.
Iterator<Tuple> | iterator() | Get an iterator to the bag.
long | size() | Get the number of elements in the bag, both in memory and on disk.
long | spill() | Instructs an object to spill whatever it can to disk and release references to any data structures it spills.
Methods inherited from class org.apache.pig.data.DefaultAbstractBag
addAll, addAll, addAll, clear, compareTo, equals, getMemorySize, getSpillFile, hashCode, incSpillCount, incSpillCount, markSpillableIfNecessary, markStale, readFields, reportProgress, sampleContents, toString, warn, write
InternalDistinctBag
public InternalDistinctBag()
InternalDistinctBag
public InternalDistinctBag(int bagCount)
InternalDistinctBag
public InternalDistinctBag(int bagCount, float percent)
isSorted
public boolean isSorted()
- Description copied from interface:
DataBag
- Find out if the bag is sorted.
- Returns:
- true if this is a sorted data bag, false otherwise.
isDistinct
public boolean isDistinct()
- Description copied from interface:
DataBag
- Find out if the bag is distinct.
- Returns:
- true if the bag is a distinct bag, false otherwise.
size
public long size()
- Description copied from class:
DefaultAbstractBag
- Get the number of elements in the bag, both in memory and on disk.
- Specified by:
size
in interface DataBag
- Overrides:
size
in class DefaultAbstractBag
- Returns:
- number of elements in the bag
iterator
public Iterator<Tuple> iterator()
- Description copied from interface:
DataBag
- Get an iterator to the bag. For default and distinct bags,
no particular order is guaranteed. For sorted bags the order
is guaranteed to be sorted according
to the provided comparator.
- Returns:
- tuple iterator
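Because no iteration order is guaranteed for a distinct bag, a caller that needs a stable order has to impose one. The following is a minimal sketch of that idea using only JDK types; OrderAgnosticRead and sortedCopy are hypothetical names, and a HashSet of Comparable values stands in for a bag of Tuples.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; not part of the Pig API.
final class OrderAgnosticRead {
    private OrderAgnosticRead() {
    }

    // Drains any Iterable into a new list and sorts it, so the result
    // does not depend on the order in which the source iterates.
    static <T extends Comparable<T>> List<T> sortedCopy(Iterable<T> source) {
        List<T> out = new ArrayList<>();
        for (T item : source) {
            out.add(item);
        }
        Collections.sort(out);
        return out;
    }
}
```

The copy costs extra memory proportional to the number of elements, so for large inputs a sorted bag type may be a better fit than sorting after the fact.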
add
public void add(Tuple t)
- Description copied from class:
DefaultAbstractBag
- Add a tuple to the bag.
- Specified by:
add
in interface DataBag
- Overrides:
add
in class DefaultAbstractBag
- Parameters:
t
- tuple to add.
spill
public long spill()
- Description copied from interface:
Spillable
- Instructs an object to spill whatever it can to disk and release
references to any data structures it spills.
- Returns:
- number of objects spilled.
Copyright © 2007-2012 The Apache Software Foundation