org.apache.pig.builtin
Class Bloom
java.lang.Object
org.apache.pig.EvalFunc<Boolean>
org.apache.pig.FilterFunc
org.apache.pig.builtin.Bloom
public class Bloom
- extends FilterFunc
Use a Bloom filter built previously by BuildBloom. You would first
build a Bloom filter in a group all job. For example:
define bb BuildBloom('jenkins', '100', '0.1');
A = load 'foo' as (x, y);
B = group A all;
C = foreach B generate bb(A.x);
store C into 'mybloom';
The Bloom filter can be built on multiple keys by passing more than one field
(or the entire bag) to BuildBloom.
The resulting file can then be used in a Bloom filter as:
define bloom Bloom('mybloom');
A = load 'foo' as (x, y);
B = load 'bar' as (z);
C = filter B by bloom(z);
D = join C by z, A by x;
It uses BloomFilter.
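The filter-then-join pattern above relies on the usual Bloom filter guarantee: a key that was added is always reported as possibly present, while a key that was never added is usually, but not always, rejected. The following is a minimal sketch of that behavior, not Hadoop's BloomFilter and not the code Pig runs. The class name SketchBloomFilter, its hash scheme, and the prefilter helper are all hypothetical and exist only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Illustrative only: a hypothetical stand-in, NOT org.apache.hadoop.util.bloom.BloomFilter.
class SketchBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SketchBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // FNV-1a over the UTF-8 bytes of the key; used as the first base hash.
    private static int fnv1a(byte[] data) {
        int hash = 0x811C9DC5;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int position(String key, int i) {
        int h1 = fnv1a(key.getBytes(StandardCharsets.UTF_8));
        int h2 = key.hashCode() | 1; // force an odd, non-zero stride
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Set every bit position derived from the key.
    void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    // True if every derived bit is set: the key MAY have been added.
    // False means the key was definitely never added.
    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Mirrors "C = filter B by bloom(z)": keep only candidates that may match.
    static List<String> prefilter(List<String> candidates, SketchBloomFilter filter) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (filter.mightContain(candidate)) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        SketchBloomFilter filter = new SketchBloomFilter(1024, 3);
        for (String x : Arrays.asList("alice", "bob")) {
            filter.add(x); // plays the role of the BuildBloom step over A.x
        }
        // Keys that were added always survive; other keys are usually dropped.
        System.out.println(prefilter(Arrays.asList("alice", "carol", "bob"), filter));
    }
}
```

Because false positives are possible, the filtered relation can still contain rows with no match, which is why the example still performs the join after filtering.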
Field Summary |
org.apache.hadoop.util.bloom.BloomFilter | filter |
Methods inherited from class org.apache.pig.EvalFunc |
getArgToFuncMapping, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, getSchemaType, isAsynchronous, outputSchema, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
filter
public org.apache.hadoop.util.bloom.BloomFilter filter
Bloom
public Bloom(String filename)
- Parameters:
filename
- file containing the serialized Bloom filter
exec
public Boolean exec(Tuple input)
throws IOException
- Description copied from class:
EvalFunc
- This callback method must be implemented by all subclasses. This
is the method that will be invoked on every Tuple of a given dataset.
Since the dataset may be divided up in a variety of ways the programmer
should not make assumptions about state that is maintained between
invocations of this method.
- Specified by:
exec
in class EvalFunc<Boolean>
- Parameters:
input
- the Tuple to be processed.
- Returns:
- result, of type T (here Boolean).
- Throws:
IOException
getCacheFiles
public List<String> getCacheFiles()
- Description copied from class:
EvalFunc
- Allow a UDF to specify a list of files it would like placed in the distributed
cache. These files will be put in the cache for every job the UDF is used in.
The default implementation returns null.
- Overrides:
getCacheFiles
in class EvalFunc<Boolean>
- Returns:
- A list of files
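As a rough illustration of what such a list might contain, the sketch below builds a single entry from the filename passed to the constructor. It assumes the common convention of writing a distributed-cache entry as the HDFS path, a `#`, and a local symlink name. The class CacheFileSketch and its helper methods are hypothetical, and this is not presented as the actual body of Bloom.getCacheFiles().

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical helpers, not part of the Pig API.
class CacheFileSketch {
    // Return the last component of a slash-separated path.
    static String baseName(String path) {
        int slash = path.lastIndexOf('/');
        return slash < 0 ? path : path.substring(slash + 1);
    }

    // Build a one-element list in the assumed "hdfsPath#localName" form.
    static List<String> cacheFilesFor(String hdfsPath) {
        List<String> files = new ArrayList<>(1);
        files.add(hdfsPath + "#" + baseName(hdfsPath));
        return files;
    }

    public static void main(String[] args) {
        // prints [/user/me/mybloom#mybloom]
        System.out.println(cacheFilesFor("/user/me/mybloom"));
    }
}
```

With this convention, a task would open the file by its local name rather than by the full HDFS path.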
setFilter
public void setFilter(DataByteArray dba)
throws IOException
- For testing only, do not use directly.
- Throws:
IOException
Copyright © 2007-2012 The Apache Software Foundation